This post lists the latest papers retrieved from Arxiv.org on 2026-04-21, updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is retrieved from Arxiv.org daily, with an automatic update around 12:30 each day.
Tip: if the list is not updated on a given day, either Arxiv published no new papers that day or the script failed; fixes are usually applied the same day.
Contents
Overview (2026-04-21)
1460 papers updated today, including:
- Natural Language Processing: 273 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 528 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 325 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 371 papers (Machine Learning (cs.LG))
- Multi-Agent Systems: 35 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 48 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 64 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems
[MA-0] QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
[Quick Read]: This paper addresses inefficient factor development, poor reproducibility, and the lack of standardized report generation in quantitative investment research, particularly in factor mining and validation over large financial panel datasets. The key to its solution is a multi-agent framework, QRAFTI, which integrates a panel-data analysis toolkit with MCP servers that expose data access, factor construction, and custom coding operations as callable tools. Through chained tool calls and reflection-based planning, it executes multi-step empirical tasks efficiently and with improved explainability, supporting rapid testing of new signals, replication of established factors, and generation of standardized research reports with computational traces and narrative analysis.
Link: https://arxiv.org/abs/2604.18500
Authors: Terence Lim, Kumar Muthuraman, Michael Sury
Affiliations: Graphen Inc.; The University of Texas at Austin
Subjects: Multiagent Systems (cs.MA); General Finance (q-fin.GN)
Comments:
Abstract:We introduce a multi-agent framework intended to emulate parts of a quantitative research team and support equity factor research on large financial panel datasets. QRAFTI integrates a research toolkit for panel data with MCP servers that expose data access, factor construction, and custom coding operations as callable tools. It can help replicate established factors, formulate and test new signals, and generate standardized research reports accompanied by narrative analysis and computational traces. On multi-step empirical tasks, using chained tool calls and reflection-based planning may offer better performance and explainability than dynamic code generation alone.
[MA-1] Training and Agentic Inference Strategies for LLM-based Manim Animation Generation
[Quick Read]: This paper tackles the challenges large language models (LLMs) face when generating animations with programmatic libraries such as Manim: spatial reasoning, temporal sequence planning, and familiarity with domain-specific APIs, all of which are underrepresented in general pre-training data. The key is a unified training and inference framework. On the training side, ManimTrainer combines Supervised Fine-tuning (SFT) with reinforcement learning via Group Relative Policy Optimisation (GRPO), using a unified reward that fuses code-quality and visual assessment signals. On the inference side, the ManimAgent pipeline introduces Renderer-in-the-loop (RITL) and documentation-augmented RITL (RITL-DOC) strategies to improve self-correction during generation. Experiments show that SFT mainly improves code quality, while GRPO markedly improves visual outputs and makes models more responsive to extrinsic signals at inference time; together they deliver strong end-to-end text-to-code-to-video animation generation.
Link: https://arxiv.org/abs/2604.18364
Authors: Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, Jordan J. Bird
Affiliations: Nottingham Trent University
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR); Multiagent Systems (cs.MA)
Comments:
Abstract:Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning (RL) based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. Using these techniques, this study presents the first unified training and inference study for text-to-code-to-video transformation with Manim. It evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and increases the models’ responsiveness to extrinsic signals during self-correction at inference time. The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by +3 percentage points in VS. Additionally, the analysis shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.
[MA-2] Aether: Network Validation Using Agentic AI and Digital Twin
[Quick Read]: This paper addresses the fact that network change validation in modern network operations remains largely manual, time-consuming, and error-prone. Existing formal verification methods are mostly applied offline before deployment and struggle to accommodate continuous change or validate live production behavior, while current operational practice relies on scattered testing tools, yielding partial coverage and errors that surface only after deployment. The key to the proposed Aether system is the tight integration of generative agentic AI with a multi-functional network digital twin: five specialized network-operations AI agents collaborate to automate the full workflow from intent analysis to network verification and testing. Built on a unified digital twin that integrates modeling, simulation, and emulation, the architecture maintains a consistent, up-to-date network view, and agent collaboration substantially improves the accuracy, efficiency, and scalability of change validation; evaluations show it outperforms traditional methods in error detection (100%), diagnostic coverage (92-96%), and validation speed (6-7 minutes).
Link: https://arxiv.org/abs/2604.18233
Authors: Jordan Auge (1), Sam Betts (1), Giovanna Carofiglio (1), Giulio Grassi (1), Martin Gysi (2), John Kenneth d’Souza (2) ((1) Cisco Systems, (2) Swisscom)
Affiliations: Cisco Systems; Swisscom
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures
Abstract:Network change validation remains a critical yet predominantly manual, time-consuming, and error-prone process in modern network operations. While formal network verification has made substantial progress in proving correctness properties, it is typically applied in offline, pre-deployment settings and faces challenges in accommodating continuous changes and validating live production behavior. Current operational approaches typically involve scattered testing tools, resulting in partial coverage and errors that surface only after deployment. In this paper, we present Aether, a novel approach that integrates Generative Agentic AI with a multi-functional Network Digital Twin to automate and streamline network change validation workflows. It features an agentic architecture with five specialized Network Operations AI agents that collaboratively handle the change validation lifecycle from intent analysis to network verification and testing. Aether agents use a unified Network Digital Twin integrating modeling, simulation, and emulation to maintain a consistent, up-to-date network view for verification and testing. By orchestrating agent collaboration atop this digital twin, Aether enables automated, rapid network change validation while reducing manual effort, minimizing errors, and improving operational agility and cost-effectiveness. We evaluate Aether over synthetic network change scenarios covering main classes of network changes and on past incidents from a major ISP operational network, demonstrating promising results in error detection (100%), diagnostic coverage (92-96%), and speed (6-7 minutes) over traditional methods.
[MA-3] TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics
[Quick Read]: This paper addresses the relative stagnation of football tactic design in the deep learning era: most work focuses on trajectory forecasting from spatio-temporal data, with little effective modeling of how to generate strategic tactics adapted to different game contexts. The key is TacticGen, an adaptable and scalable generative model that formulates tactics as sequences of multi-agent movements and interactions conditioned on the game context. Its core innovation is a multi-agent diffusion transformer with agent-wise self-attention and context-aware cross-attention, which captures cooperative and competitive dynamics among players and the ball. Trained on over 3.3 million events and 100 million tracking frames, TacticGen achieves state-of-the-art trajectory-prediction accuracy and, via a classifier guidance mechanism, flexibly generates tactics at inference time that satisfy strategic objectives specified by rules, natural language, or neural models, while remaining inherently scalable.
Link: https://arxiv.org/abs/2604.18210
Authors: Sheng Xu, Guiliang Liu, Tarak Kharrat, Yudong Luo, Mohamed Aloulou, Javier López Peña, Konstantin Sofeikov, Adam Reid, Paul Roberts, Steven Spencer, Joe Carnall, Ian McHale, Oliver Schulte, Hongyuan Zha, Wei-Shi Zheng
Affiliations: The Chinese University of Hong Kong, Shenzhen; Real Analytics; Birmingham City Football Club; University of Liverpool; Simon Fraser University; Sun Yat-sen University
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 23 pages
Abstract:Success in association football relies on both individual skill and coordinated tactics. While recent advancements in spatio-temporal data and deep learning have enabled predictive analyses like trajectory forecasting, the development of tactical design remains limited. Bridging this gap is essential, as prediction reveals what is likely to occur, whereas tactic generation determines what should occur to achieve strategic objectives. In this work, we present TacticGen, a generative model for adaptable and scalable tactic generation. TacticGen formulates tactics as sequences of multi-agent movements and interactions conditioned on the game context. It employs a multi-agent diffusion transformer with agent-wise self-attention and context-aware cross-attention to capture cooperative and competitive dynamics among players and the ball. Trained with over 3.3 million events and 100 million tracking frames from top-tier leagues, TacticGen achieves state-of-the-art precision in predicting player trajectories. Building on it, TacticGen enables adaptable tactic generation tailored to diverse inference-time objectives through classifier guidance mechanism, specified via rules, natural language, or neural models. Its modeling performance is also inherently scalable. A case study with football experts confirms that TacticGen generates realistic, strategically valuable tactics, demonstrating its practical utility for tactical planning in professional football. The project page is available at: this https URL.
[MA-4] ConventionPlay: Capability-Limited Training for Robust Ad-Hoc Collaboration
[Quick Read]: This paper addresses inefficient coordination in ad-hoc collaboration when multiple shared conventions are available to follow. When both partners can adopt several conventions, adaptation alone no longer guarantees effective collaboration; active steering is needed to move the team toward the best joint strategy. The key is ConventionPlay, a reinforcement learning approach that extends cognitive hierarchies to include a diverse population of adaptive followers. By training against partners with varied capability limits, the agent learns to probe its partner's conventions, leading the collaboration when possible and following when appropriate, which substantially improves coordination efficiency, especially when conventions carry differentiated payoffs.
Link: https://arxiv.org/abs/2604.18123
Authors: Abhishek Sriraman, Eleni Vasilaki, Robert Loftin
Affiliations: University of Sheffield
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:Ad-hoc collaboration often relies on identifying and adhering to shared conventions. However, when partners can follow multiple conventions, agents must do more than simply adapt; they must actively steer the team toward the most effective joint strategy. We present ConventionPlay, a reinforcement learning-based approach that extends cognitive hierarchies to include a diverse population of adaptive followers. By training against partners with varied capability limits, our agent learns to probe its partner’s repertoire, leading the team when possible and following when necessary. Our results in canonical coordination tasks show that ConventionPlay achieves superior coordination efficiency, particularly in settings where conventions have differentiated payoffs.
[MA-5] EvoMarket: A High-Fidelity and Scalable Financial Market Simulator
[Quick Read]: This paper addresses the difficulty existing financial market simulators have in jointly achieving mechanism fidelity, microstructure fidelity, and computational tractability, especially high-fidelity simulation in multi-asset, cross-day settings. The core solution is EvoMarket, a discrete-event multi-agent market simulation framework that improves execution efficiency through optimized limit order book (LOB) data structures, hierarchical scheduling, and asynchronous per-asset matching, while explicitly modeling institutional mechanisms (market calendars, opening call auctions, price limits, and T+1 settlement). The key innovation is an Oracle-guided in-run self-calibration mechanism that interprets microstructure discrepancy as missing order flow and synthesizes corrective orders at recording checkpoints, achieving accurate replay of historical LOB data and budgeted fidelity gains without expensive black-box calibration.
Link: https://arxiv.org/abs/2604.18046
Authors: Muyao Zhong, Zhenhua Yang, Yuxiang Liu, Ke Tang, Peng Yang
Affiliations: Southern University of Science and Technology
Subjects: Computational Engineering, Finance, and Science (cs.CE); Multiagent Systems (cs.MA)
Comments:
Abstract: High-fidelity, scalable market simulation is a key instrument for mechanism evaluation, stress testing, and counterfactual policy analysis. Yet existing simulators rarely achieve mechanism fidelity beyond single-asset intraday settings, microstructure fidelity against historical limit order books (LOB), and computational tractability at market scale in a single system. This paper presents EvoMarket, a discrete-event, multi-agent financial market simulator designed for intervention-oriented experiments in multi-asset and cross-day environments. EvoMarket couples a high-throughput execution core (optimized LOB data structures, hierarchical scheduling under propagation delays, and asynchronous per-asset matching) with explicit institutional mechanisms (market calendars, opening call auctions, price limits, and T+1 settlement). To avoid expensive black-box calibration, EvoMarket introduces an Oracle-guided in-run self-calibration mechanism that interprets microstructure discrepancy as missing order flow and synthesizes corrective orders at recording checkpoints. Experiments on China A-share order-flow and LOB data show close replay alignment over five trading days, fidelity gains from budgeted in-run calibration across depth levels, broad agent order-space coverage, and scalable performance under increasing input order rates and market breadth. We further demonstrate cross-asset linkage and event-study style intervention evaluation that produces structured dependence and interpretable event-time responses.
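The execution core described above is built around limit-order-book matching. As a rough, hedged illustration of the underlying mechanism (not EvoMarket's actual implementation; the class and its interface below are invented for this sketch), a minimal price-time-priority LOB can be written with two heaps:

```python
import heapq
from collections import namedtuple

Fill = namedtuple("Fill", "price qty")

class LimitOrderBook:
    """Minimal price-time-priority LOB: heaps keyed by (price, arrival seq)."""

    def __init__(self):
        self.bids = []   # max-heap via negated price: (-price, seq, [qty])
        self.asks = []   # min-heap: (price, seq, [qty])
        self.seq = 0

    def submit(self, side, price, qty):
        """Match an incoming limit order against the book; rest the remainder."""
        fills = []
        book = self.asks if side == "buy" else self.bids
        crosses = (lambda p: price >= p) if side == "buy" else (lambda p: price <= p)
        while book and qty > 0:
            key, _, rest = book[0]
            best = key if side == "buy" else -key
            if not crosses(best):
                break
            traded = min(qty, rest[0])
            fills.append(Fill(best, traded))
            qty -= traded
            rest[0] -= traded
            if rest[0] == 0:
                heapq.heappop(book)
        if qty > 0:  # rest the unfilled remainder at its limit price
            self.seq += 1
            own = self.bids if side == "buy" else self.asks
            heapq.heappush(own, (-price if side == "buy" else price, self.seq, [qty]))
        return fills

lob = LimitOrderBook()
lob.submit("sell", 101, 5)
lob.submit("sell", 100, 5)
print(lob.submit("buy", 101, 7))  # fills 5 @ 100, then 2 @ 101
```

The paper's contribution is in scaling this kind of core to many assets with asynchronous matching and calendar/auction/price-limit mechanics layered on top, which the sketch omits.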
[MA-6] Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation ACL2026
[Quick Read]: This paper asks whether collaboration in multi-agent systems (MAS) genuinely expands the solution space in open-ended idea generation. Although MAS are widely used to broaden exploration diversity, the study finds their effectiveness is constrained by structural coupling across three levels (model intelligence, agent cognition, and system dynamics), leading to collective failures that manifest as diversity collapse. The key takeaway is to identify and avoid interaction structures that suppress individual exploration: when designing MAS for creative tasks, preserving independence and disagreement among agents matters more than simply using stronger models or denser communication topologies.
Link: https://arxiv.org/abs/2604.18005
Authors: Nuo Chen, Yicheng Tong, Yuzhe Yang, Yufei He, Xueyi Zhang, Zou Qingyun, Qian Wang, Bingsheng He
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 56 pages, 15 figures; Accepted at ACL 2026 Findings
Abstract:Multi-agent systems (MAS) are increasingly used for open-ended idea generation, driven by the expectation that collective interaction will broaden the exploration diversity. However, when and why such collaboration truly expands the solution space remains unclear. We present a systematic empirical study of diversity in MAS-based ideation across three bottom-up levels: model intelligence, agent cognition, and system dynamics. At the model level, we identify a compute efficiency paradox, where stronger, highly aligned models yield diminishing marginal diversity despite higher per-sample quality. At the cognition level, authority-driven dynamics suppress semantic diversity compared to junior-dominated groups. At the system level, group-size scaling yields diminishing returns and dense communication topologies accelerate premature convergence. We characterize these outcomes as collective failures emerging from structural coupling, a process where interaction inadvertently contracts agent exploration and triggers diversity collapse. Our analysis shows that this collapse arises primarily from the interaction structure rather than inherent model insufficiency, highlighting the importance of preserving independence and disagreement when designing MAS for creative tasks. Our code is available at this https URL.
[MA-7] Multi-UAV Path Following using Vector-Field Guidance
[Quick Read]: This paper addresses how multiple unmanned aerial vehicles (UAVs) can follow a path collision-free while maintaining uniform spacing along a reference path. The key elements are: a vector-field-based guidance law that drives each UAV toward the reference path; a rotational repulsion mechanism based on relative distance and bearing that avoids collisions during convergence to the path; and a velocity control law based on inter-UAV spacing errors that drives the spacing errors to zero, achieving uniform separation along the path. Analytical guarantees are established for collision avoidance and spacing-error convergence.
Link: https://arxiv.org/abs/2604.17995
Authors: Gautam Kumar, Amit Shivam, Ashwini Ratnoo
Affiliations: Indian Institute of Science Bangalore; University of Porto
Subjects: Multiagent Systems (cs.MA)
Comments: Submitted to 2026 Modeling, Estimation and Control Conference (MECC)
Abstract:This paper presents a decentralized, collision-free framework for path following guidance of multiple uncrewed aerial vehicles (UAVs), while maintaining uniform spacing along a reference path. A vector field-based guidance law is employed to drive each UAV toward the reference path. A rotational repulsion mechanism, utilizing relative distance and bearing between UAVs, is proposed to avoid collisions during convergence to the path, and an inter-UAV spacing error-based velocity control law is presented to achieve uniform separation along the path. Analytical guarantees are established for collision avoidance and convergence of the inter-UAV spacing errors to zero, ensuring uniform separation along the path. Numerical simulations demonstrate the efficacy of the proposed method.
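To convey the general idea of vector-field guidance (a hedged sketch, not the paper's specific law; the field shape, gains, and kinematic model below are assumptions), consider a circular reference path: the desired heading blends the path tangent with a correction proportional to the radial error, and a simple constant-speed UAV steered toward that heading converges onto the path.

```python
import math

def field_heading(x, y, R=10.0, k=0.8):
    """Vector-field heading for a circular path of radius R about the origin:
    tangent direction plus a correction proportional to the radial error."""
    r = math.hypot(x, y)
    phi = math.atan2(y, x)
    tangent = phi + math.pi / 2          # counter-clockwise travel
    correction = math.atan(k * (r - R))  # steer inward/outward toward r = R
    return tangent + correction

# follow the field with a simple kinematic UAV (constant speed, first-order heading)
x, y, psi, v, dt = 25.0, 0.0, math.pi, 2.0, 0.05
for _ in range(2000):
    err = (field_heading(x, y) - psi + math.pi) % (2 * math.pi) - math.pi
    psi += 2.0 * err * dt                # heading tracks the field direction
    x += v * math.cos(psi) * dt
    y += v * math.sin(psi) * dt
print(abs(math.hypot(x, y) - 10.0) < 0.5)  # settled near the R = 10 circle
```

A multi-UAV version, as the abstract describes, would add a rotational repulsion term to this heading when another vehicle is close, plus a speed adjustment driven by the along-path spacing error.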
[MA-8] RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs
[Quick Read]: This paper addresses the lack of a systematic approach to automated vulnerability-report documentation and analysis, in particular the generation of analysis reports that meet professional standards. While large language models (LLMs) perform well on vulnerability classification, detection, and patching, their use for structured report generation remains underexplored. The key is the RAVEN (Retrieval Augmented Vulnerability Exploration Network) framework, built from four modules: an Explorer agent that identifies vulnerabilities, a RAG (Retrieval Augmented Generation) engine that retrieves relevant knowledge from sources such as Google Project Zero reports and CWE entries, an Analyst agent that assesses impact and exploitability, and a Reporter agent that produces structured reports following the Google Project Zero root-cause-analysis template. In addition, a task-specific LLM judge evaluates reports along several quality dimensions (structural integrity, factual alignment, code-reasoning quality, and remediation quality) to safeguard professional, accurate output.
Link: https://arxiv.org/abs/2604.17948
Authors: Parteek Jamwal, Minghao Shao, Boyuan Chen, Achyuta Muthuvelan, Asini Subanya, Boubacar Ballo, Kashish Satija, Mariam Shafey, Mohamed Mahmoud, Moncif Dahaji Bouffi, Pasindu Wickramasinghe, Siyona Goel, Yaakulya Sabbani, Hakim Hacid, Mthandazo Ndhlovu, Eleanna Kafeza, Sanjay Rawat, Muhammad Shafique
Affiliations: Technology Innovation Institute (TII); New York University Abu Dhabi (NYUAD)
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine retrieving relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task specific LLM Judge evaluating reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.
[MA-9] Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
[Quick Read]: This paper addresses the poor performance of current LLM-based mobile agents on phone automation tasks, including low accuracy, misinterpretation of user instructions, and frequent failures on challenging tasks, with little prior systematic analysis of why they fail. The key is DailyDroid, a benchmark of 75 tasks across 25 Android apps, five scenarios, and three difficulty levels designed to mimic everyday smartphone use. The authors evaluate text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini over 300 trials, and through in-depth failure analysis compile a handbook of common failures, revealing critical bottlenecks at the intersection of UI accessibility, input modality, and LLM/app design that inform the future design of mobile agents, applications, and user interfaces.
Link: https://arxiv.org/abs/2604.17817
Authors: Shiquan Zhang, Tianyi Zhang, Le Fang, Simon D’Alfonso, Hong Jia, Vassilis Kostakos
Affiliations: University of Melbourne; University of Auckland
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 29 pages. This study was conducted around May, 2025
Abstract:With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, offering implications for future mobile agents, applications, and UI development.
[MA-10] CAPO: Counterfactual Credit Assignment in Sequential Cooperative Teams
[Quick Read]: This paper addresses credit assignment in cooperative teams where agents act in a fixed order and share a single team reward: it is hard to assess each agent's contribution, especially when agents are updated one at a time so that earlier data no longer reflects the new policies. The key is the Sequential Aristocrat Utility (SeqAU), the unique learning signal that maximizes the individual learnability of each agent's action, extending the classical framework of Wolpert and Tumer (2002) to the sequential setting. From SeqAU the authors derive CAPO (Counterfactual Advantage Policy Optimization), a critic-free policy-gradient algorithm that fits a per-agent reward decomposition from group rewards and computes each agent's advantage in closed form plus a handful of forward passes, requiring no extra environment interaction.
Link: https://arxiv.org/abs/2604.17693
Authors: Shripad Deshmukh, Jayakumar Subramanian, Raghavendra Addanki, Nikos Vlassis
Affiliations: University of Massachusetts Amherst; Adobe Research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:In cooperative teams where agents act in a fixed order and share a single team reward, it is hard to know how much each agent contributed, and harder still when agents are updated one at a time because data collected earlier no longer reflects the new policies. We introduce the Sequential Aristocrat Utility (SeqAU), the unique per-agent learning signal that maximizes the individual learnability of each agent’s action, extending the classical framework of Wolpert and Tumer (2002) to this sequential setting. From SeqAU we derive CAPO (Counterfactual Advantage Policy Optimization), a critic-free policy-gradient algorithm. CAPO fits a per-agent reward decomposition from group rewards and computes the per-agent advantage in closed form plus a handful of forward passes through the current policy, requiring no extra environment calls beyond the initial batch. We give analytic bias and variance bounds and validate them on a controlled sequential bandit, where CAPO’s advantage over standard baselines grows with the team size. The framework is general; multi-LLM pipelines are a natural deployment target.
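CAPO's closed-form advantage builds on the general counterfactual-baseline idea in multi-agent credit assignment. The toy sketch below (hypothetical reward table and uniform policies, not from the paper) shows that idea in a two-agent team bandit: hold teammates' actions fixed, resample one agent's action under its own policy, and subtract the resulting expected reward from the realized team reward.

```python
# Hypothetical 2-agent team bandit: one shared reward per joint action.
REWARD = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.0, (1, 1): 0.8}
policies = [[0.5, 0.5], [0.5, 0.5]]  # each agent's action distribution

def counterfactual_advantage(joint, i):
    """A_i = R(joint) - E_{a' ~ pi_i} R(joint with agent i's action resampled)."""
    baseline = sum(
        policies[i][a] * REWARD[joint[:i] + (a,) + joint[i + 1:]]
        for a in range(2)
    )
    return REWARD[joint] - baseline

# agent 0 at (0, 0): baseline = 0.5*1.0 + 0.5*0.0 = 0.5, so advantage = 0.5
print(counterfactual_advantage((0, 0), 0))
```

CAPO itself works with a *fitted* per-agent reward decomposition learned from group rewards in the sequential setting, so the baseline is computed from that decomposition rather than from an explicit reward table as above.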
[MA-11] Towards Self-Improving Error Diagnosis in Multi-Agent Systems ACL2026
[Quick Read]: This paper addresses the debugging difficulty of LLM-based multi-agent systems (MAS) on complex tasks, where long interaction traces, inter-agent dependencies, and delayed error manifestation make failures hard to localize. The key is ErrorProbe, a self-improving framework for semantic failure attribution with a three-stage pipeline: operationalizing a MAS failure taxonomy to detect local anomalies; symptom-driven backward tracing to prune irrelevant context; and a specialized multi-agent team (Strategist, Investigator, Arbiter) that validates error hypotheses through tool-grounded execution. Crucially, ErrorProbe maintains a verified memory that is updated only when error patterns are confirmed by executable evidence, achieving accurate step-level error localization and cross-domain transfer without human annotation.
Link: https://arxiv.org/abs/2604.17658
Authors: Jiazheng Li, Emine Yilmaz, Bei Chen, Dieu-Thu Le
Affiliations: King’s College London; Amazon Alexa AI; University College London
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: 15 pages, 3 figures; accepted at ACL 2026 Findings
Abstract:Large Language Model (LLM)-based Multi-Agent Systems (MAS) enable complex problem-solving but introduce significant debugging challenges, characterized by long interaction traces, inter-agent dependencies, and delayed error manifestation. Existing diagnostic approaches often rely on expensive expert annotation or ‘‘LLM-as-a-judge’’ paradigms, which struggle to pinpoint decisive error steps within extended contexts. In this paper, we introduce ErrorProbe, a self-improving framework for semantic failure attribution that identifies responsible agents and the originating error step. The framework operates via a three-stage pipeline: (1) operationalizing the MAS failure taxonomy to detect local anomalies, (2) performing symptom-driven backward tracing to prune irrelevant context, and (3) employing a specialized multi-agent team (Strategist, Investigator, Arbiter) to validate error hypotheses through tool-grounded execution. Crucially, ErrorProbe maintains a verified episodic memory that updates only when error patterns are confirmed by executable evidence, without the need for annotation. Experiments across the TracerTraj and WhoWhen benchmarks demonstrate that ErrorProbe significantly outperforms baselines, particularly in step-level localization, while the verified memory enables robust cross-domain transfer without retraining.
[MA-12] SafeAgent: A Runtime Protection Architecture for Agentic Systems
[Quick Read]: This paper addresses the vulnerability of large language model (LLM) agents to prompt-injection attacks in settings with multi-step workflows, tool interactions, and persistent context, where input-output filtering alone cannot provide reliable protection. The key is SafeAgent, a runtime security architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories. It is realized through two coordinated components: a runtime controller that mediates actions around the agent loop, and a context-aware decision core that operates over persistent session state via operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization, dynamically balancing safety and task performance.
Link: https://arxiv.org/abs/2604.17562
Authors: Hailin Liu, Eugene Ilyushin, Jie Ni, Min Zhu
Affiliations: Lomonosov Moscow State University
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Large language model (LLM) agents are vulnerable to prompt-injection attacks that propagate through multi-step workflows, tool interactions, and persistent context, making input-output filtering alone insufficient for reliable protection. This paper presents SafeAgent, a runtime security architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories. The proposed design separates execution governance from semantic risk reasoning through two coordinated components: a runtime controller that mediates actions around the agent loop and a context-aware decision core that operates over persistent session state. The core is formalized as a context-aware advanced machine intelligence and instantiated through operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. Experiments on Agent Security Bench (ASB) and InjecAgent show that SafeAgent consistently improves robustness over baseline and text-level guardrail methods while maintaining competitive benign-task performance. Ablation studies further show that recovery confidence and policy weighting determine distinct safety-utility operating points.
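The controller/decision-core split can be caricatured as a stateful gate on proposed tool actions. Everything below (the risk table, budget, threshold, and function names) is a hypothetical sketch for intuition, not SafeAgent's actual operators:

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    risk: float = 0.0                        # risk accumulated over the trajectory
    history: list = field(default_factory=list)

RISK = {"read_file": 0.1, "send_email": 0.6, "shell_exec": 0.9}  # risk encoding

def decide(state, action, utility, threshold=1.0):
    """Utility-cost arbitration: allow iff cumulative risk stays under budget
    and the action's task utility outweighs its marginal risk."""
    r = RISK.get(action, 0.5)                # unknown tools get a default risk
    if state.risk + r > threshold or utility < r:
        return "block"
    state.risk += r                          # state synchronization
    state.history.append(action)
    return "allow"

s = SessionState()
print(decide(s, "read_file", utility=0.5))   # allow
print(decide(s, "shell_exec", utility=0.3))  # block: utility < risk
```

The point of the sketch is the statefulness: unlike per-message filtering, the verdict for each action depends on risk accumulated across the whole session, which is what lets a runtime architecture catch injections whose effects only surface several steps later.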
[MA-13] Learning Unanimously Acceptable Lotteries via Queries
[Quick Read]: This paper addresses multi-stakeholder acceptability in high-stakes AI deployment: how to construct a lottery over a finite menu of options such that every stakeholder deems the mixture acceptable relative to their own minimum standard. The central challenge is that the algorithm can elicit each stakeholder's constraints only through binary accept/reject feedback, and must minimize the number of queries to keep elicitation costs low. The key is the design of adaptive deterministic and randomized algorithms that efficiently either find a unanimously acceptable lottery or certify infeasibility, together with learning-augmented variants that exploit predictions (e.g., likely binding stakeholders or a promising lottery) to further reduce query complexity while preserving worst-case guarantees.
Link: https://arxiv.org/abs/2604.17505
Authors: Davin Choo, Paul W. Goldberg, Nicholas Teh
Affiliations: Harvard University; University of Oxford
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Many high-stakes AI deployments proceed only if every stakeholder deems the system acceptable relative to their own minimum standard. With randomization over a finite menu of options, this becomes a feasibility question: does there exist a lottery over options that clears all stakeholders’ acceptability bars? We study a query model where the algorithm proposes lotteries and receives only binary accept/reject feedback. We give deterministic and randomized algorithms that either find a unanimously acceptable lottery or certify infeasibility; adaptivity can avoid eliciting many stakeholders’ constraints, and randomization further reduces the expected elicitation cost relative to full elicitation. We complement these upper bounds with worst-case lower bounds (in particular, linear dependence on the number of stakeholders and logarithmic dependence on precision are unavoidable). Finally, we develop learning-augmented algorithms that exploit natural forms of advice (e.g., likely binding stakeholders or a promising lottery), improving query complexity when predictions are accurate while preserving worst-case guarantees.
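For intuition about the feasibility question itself, it can be checked with full information (unlike the paper's query model, where the algorithm sees only binary accept/reject feedback). The instance below is hypothetical: no single option clears every bar, but a mixture does. A brute-force simplex-grid search makes the linear structure explicit:

```python
from itertools import product

# Hypothetical instance: utils[s][o] = stakeholder s's value for option o,
# bars[s] = stakeholder s's minimum acceptable expected value.
# Each pure option fails some stakeholder; only a lottery can clear all bars.
utils = [[1.0, 0.0, 0.2],
         [0.0, 1.0, 0.2],
         [0.1, 0.1, 1.0]]
bars = [0.3, 0.3, 0.5]

def unanimously_acceptable(utils, bars, grid=50):
    """Brute-force search over a simplex grid for a lottery p with
    sum_o p[o] * utils[s][o] >= bars[s] for every stakeholder s."""
    n = len(utils[0])
    for weights in product(range(grid + 1), repeat=n - 1):
        if sum(weights) > grid:
            continue
        p = [w / grid for w in weights] + [(grid - sum(weights)) / grid]
        if all(sum(p[o] * u[o] for o in range(n)) >= b
               for u, b in zip(utils, bars)):
            return p
    return None  # infeasible at this grid resolution

print(unanimously_acceptable(utils, bars))
```

With full utilities this is just linear feasibility; the paper's contribution is solving it when each constraint can only be probed by proposing a lottery and observing accept/reject, which is what makes query complexity the right cost measure.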
[MA-14] SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
[Quick Read]: This paper addresses two coupled problems at the core of Visual Multiagent Systems (VMAS): communication topologies are fixed before inference and cannot adapt to visual content and query context, and agent reasoning abilities remain static after deployment, lacking task-specific specialization. These issues reinforce each other: a rigid topology cannot exploit richer agent expertise, while static agents lack incentives to specialize for a given query. The key is the SkillGraph framework, whose core has two parts: a Multimodal Graph Transformer (MMGT) that encodes visual tokens, instruction semantics, and active skill embeddings to predict a query-conditioned collaboration graph, replacing hand-crafted routing with content-aware information flow; and a Skill Designer that distills and refines reasoning heuristics from failure cases into a self-evolving multimodal Skill Bank. Crucially, updated skill embeddings are fed back into the MMGT, so the communication topology co-evolves with agent capability.
Link: https://arxiv.org/abs/2604.17503
Authors: Zheng Nie, Ruolin Shen, Xinlei Yu, Bo Yin, Jiangning Zhang, Xiaobin Hu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Scaling vision-language models into Visual Multiagent Systems (VMAS) is hindered by two coupled issues. First, communication topologies are fixed before inference, leaving them blind to visual content and query context; second, agent reasoning abilities remain static during deployment. These issues reinforce each other: a rigid topology fails to leverage richer agent expertise, while static agents lack incentives to specialize for a given query. We address this with SkillGraph, a joint framework that evolves both agent expertise and communication topology. Within this framework, a Multimodal Graph Transformer (MMGT) encodes visual tokens, instruction semantics and active skill embeddings to predict a query-conditioned collaboration graph, replacing hand-crafted routing with dynamic, content-aware information flow. Complementing this, a Skill Designer distills and refines reasoning heuristics from failure cases, constructing a self-evolving multimodal Skill Bank. Crucially, updated skill embeddings are fed back into the MMGT, enabling the topology to adapt alongside capability growth. Experiments show that SkillGraph achieves consistent improvements across four benchmarks, five common MAS structures and four base models. Code is available at this https URL.
[MA-15] ARMove: Learning to Predict Human Mobility through Agentic Reasoning
[Quick Read]: This paper addresses poor interpretability, the inability to learn iteratively from new data, and weak transferability across regions and user groups in human mobility prediction. The key is ARMove, a fully transferable framework built on three mechanisms: standardized feature management with iterative optimization, combining foundational knowledge pools, user-profile pools, and automated integration of LLM knowledge; agentic decision-making that dynamically adjusts feature weights, improving accuracy while providing interpretable decision paths; and large-to-small model knowledge distillation that transfers strategies from large LLMs (72B scale) to smaller 7B models, reducing computational cost while raising the performance ceiling.
Link: https://arxiv.org/abs/2604.17419
Authors: Chuyue Wang, Jie Feng, Yuxi Wu, Shenglin Yi, Hang Zhang
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments:
Abstract: Human mobility prediction is a critical task but remains challenging due to its complexity and variability across populations and regions. Recently, large language models (LLMs) have made progress in zero-shot prediction, but existing methods suffer from limited interpretability (due to black-box reasoning), lack of iterative learning from new data, and poor transferability. In this paper, we introduce ARMove, a fully transferable framework for predicting human mobility through agentic reasoning. To address these limitations, ARMove employs standardized feature management with iterative optimization and user-specific customization: four major feature pools for foundational knowledge, user profiles for segmentation, and an automated generation mechanism integrating LLM knowledge. Robust generalization is achieved via agentic decision-making that adjusts feature weights to maximize accuracy while providing interpretable decision paths. Finally, large-small model synergy distills strategies from large LLMs (e.g., 72B) to smaller ones (e.g., 7B), reducing costs and enhancing performance ceilings. Extensive experiments on four global datasets show ARMove outperforms state-of-the-art baselines on 6 out of 12 metrics (gains of 0.78% to 10.47%) and achieves second-best results on the remaining 4. Transferability tests confirm its robustness across regions, user groups, and model scales, while interpretability analysis highlights its transparency in decision-making. Our codes are available at: this https URL.
[MA-16] LLM-Guided Strategy Synthesis for Scalable Equality Saturation
【速读】:该论文旨在解决基于大语言模型(LLM)的等价饱和(Equality Saturation, EqSat)策略自动合成的有效性问题,即如何在不引发 e-graph 爆炸的前提下,高效地自动化设计高质量的 EqSat 策略。当前手动设计策略成为基于 e-graph 编译器自动化的瓶颈,而现有规则合成方法虽能扩展重写词汇表,却加剧了搜索空间膨胀。论文提出的解决方案关键在于:首先引入一种领域特定语言 EqSatL,将策略表示为可显式表达和可检查的结构化实体;其次构建一个由大语言模型(LLM)驱动的代理工作流,结合基于证明的重写模式缓存与可计算性引导机制,在保持 e-graph 稳定增长的同时,实现高效率、高稳定性的策略搜索,从而显著提升资源利用率与优化质量。
链接: https://arxiv.org/abs/2604.17364
作者: Chenyun Yin,Youwei Xiao,Yuze Luo,Yuyang Zou,Yun Liang
机构: Peking University (北京大学); School of Integrated Circuits, Peking University (北京大学集成电路学院)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Programming Languages (cs.PL)
备注:
Abstract:Equality saturation (EqSat) is a powerful optimization paradigm that compactly represents many equivalent programs in an e-graph and delays commitment until extraction selects a lowest-cost program. Making EqSat effective, therefore, requires not only domain-specific rewrite rules but also domain-specific strategies. Today, much of this strategy design is still manual, making it a major obstacle to automating e-graph-based compilers. Recent rule-synthesis frameworks can automatically infer large rewrite vocabularies from semantic specifications, but they also enlarge the rewrite space and further exacerbate e-graph explosion. Although large language models (LLMs) make automated strategy synthesis plausible, directly evolving backend code remains ineffective in practice. The search lacks reusable strategy abstractions and actionable feedback, and can easily trigger e-graph explosion or converge to poor designs. We present EggMind, an LLM-guided, end-to-end framework for synthesizing reusable EqSat strategies. At its core, EggMind introduces a domain-specific language, EqSatL, to represent EqSat strategies as explicit and inspectable artifacts. It then proposes an LLM-guided agentic workflow, equipped with novel techniques including proof-derived rewrite motif caching and tractability guidance, to search efficiently for high-quality strategies while keeping synthesis stable under e-graph growth. Evaluation shows that EggMind substantially improves the resource-quality trade-off on vectorization benchmarks, reducing final cost by 45.1% and peak RAM by 69.1% relative to full EqSat. We further show that the same methodology transfers effectively to an XLA-based tensor compiler, and demonstrate its practical potential in a logic-synthesis case study with augmented rewrite spaces. 
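作为理解"将策略表示为显式可检查结构"的粗略类比(与论文的 EqSatL 语法无关,规则与阶段划分均为假设),可以把策略编码为分阶段的规则调度,每阶段以迭代上限防止失控增长:

```python
# 极简的分阶段重写调度:策略 = [(规则列表, 迭代上限), ...]
# 每阶段反复应用规则,直至不动点(饱和)或达到迭代上限
def run_strategy(term, strategy):
    for rules, max_iters in strategy:
        for _ in range(max_iters):
            new = term
            for rule in rules:
                new = rule(new)
            if new == term:      # 不动点:该阶段已饱和
                break
            term = new
    return term

# 假设的代数化简规则(用字符串替换代替真实的 e-graph 重写,仅为示意)
mul_one = lambda t: t.replace("*1", "")
add_zero = lambda t: t.replace("+0", "")

result = run_strategy("x*1+0*1", [([mul_one], 3), ([add_zero], 3)])
```

真实的 EqSat 在 e-graph 上同时保留所有等价形式,此处的字符串重写只是为了说明"策略作为一等公民、可被检查与搜索"这一设计思想。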
[MA-17] Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation ACL2026
【速读】:该论文旨在解决复杂多轮决策场景中生成式AI代理(Generative Agents)之间协调建模的核心挑战,尤其关注供应链系统中因认知异质性导致的效率损失问题。传统行为实验虽揭示了供应链低效背后的认知偏差,但存在可扩展性和控制力不足的局限。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的可扩展实验范式,结合分层推理框架(Hierarchical Reasoning Framework),通过引入DeepSeek与GPT代理在供应链不同层级上系统性地改变推理复杂度,从而模拟真实世界中的认知多样性,并借助严格重复和统计验证的仿真方法,量化认知异质性对群体决策结果的影响。研究发现,代理倾向于短视和自利行为,加剧系统性低效,而信息共享能有效缓解此类负面影响。
链接: https://arxiv.org/abs/2604.17220
作者: Jiuyun Jiang,Yuecheng Hong,Bo Yang,Jin Yang,Guangxin Jiang,Xiaomeng Guo,Guang Xiao
机构: Harbin Institute of Technology (哈尔滨工业大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Accepted to the Main Conference of ACL 2026. 18 pages, 8 figures in total (9 pages, 7 figures for the main text)
Abstract:Modeling coordination among generative agents in complex multi-round decision-making presents a core challenge for AI and operations management. Although behavioral experiments have revealed cognitive biases behind supply chain inefficiencies, traditional methods face scalability and control limitations. We introduce a scalable experimental paradigm using Large Language Models (LLMs) to simulate multi-stage supply chain dynamics. Grounded in a Hierarchical Reasoning Framework, this study specifically analyzes the impact of cognitive heterogeneity on agent interactions. Unlike prior homogeneous settings, we employ DeepSeek and GPT agents to systematically vary reasoning sophistication across supply chain tiers. Through rigorously replicated and statistically validated simulations, we investigate how this cognitive diversity influences collective outcomes. Results indicate that agents exhibit myopic and self-interested behaviors that exacerbate systemic inefficiencies. However, we demonstrate that information sharing effectively mitigates these adverse effects. Our findings extend traditional behavioral methods and offer new insights into the dynamics of AI-enabled organizations. This work underscores both the potential and limitations of LLM-based agents as proxies for human decision-making in complex operational environments.
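多级供应链的订货动态可用一个极简模拟来说明(初始库存、订货策略等均为假设;论文中由 LLM 代理取代此处的规则策略):

```python
def simulate_chain(tiers, demand, steps, policy):
    """tiers 级供应链:每级持有库存、向上游订货;上游把下游订单视为需求。"""
    inventory = [20] * tiers
    orders_log = []
    for t in range(steps):
        incoming = demand[t]              # 终端需求进入最下游
        tier_orders = []
        for i in range(tiers):
            shipped = min(inventory[i], incoming)
            inventory[i] -= shipped
            order = policy(inventory[i], incoming)   # 该级向上游下单
            tier_orders.append(order)
            incoming = order              # 上游观察到的"需求"即下游订单
        for i in range(tiers):
            inventory[i] += tier_orders[i]           # 简化:订单即时到货
        orders_log.append(tier_orders)
    return orders_log

# 一个"短视"策略:库存低于阈值就补到目标库存,这类策略容易放大需求波动
myopic = lambda inv, dem: max(0, 24 - inv) + dem if inv < 12 else dem
log = simulate_chain(3, [4, 4, 8, 8, 4, 4], 6, myopic)
```

论文正是在此类多轮、多级的决策环境中,用不同推理复杂度的 LLM 代理替换各级的订货策略,以研究认知异质性对整体效率的影响。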
[MA-18] Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training
【速读】:该论文旨在解决多智能体教育系统(Multi-Agent Education System, MAES)在需求工程(Requirements Engineering, RE)阶段缺乏可解释性的问题,尤其是在医疗临床推理训练场景中,如何确保AI系统的决策过程对非技术用户(如医学生)透明、可信且符合真实临床情境。解决方案的关键在于提出一个以人类为中心、基于角色画像(persona)的可解释MAES需求工程框架,通过将角色画像与用户故事(user stories)贯穿于RE全过程,明确医疗教育者、医学生、AI患者代理及各类临床代理(如诊断代理、干预代理等)的目标、交互逻辑和知识基础,从而引导出可解释性需求,并确保系统设计从早期阶段就具备可理解性和实用性。实证结果显示,该方法显著提升了医学生的临床推理能力(>78%的学生反馈提升),验证了其有效性。
链接: https://arxiv.org/abs/2604.17186
作者: Weibing Zheng,Laurah Turner,Jess Kropczynski,Matthew Kelleher,Murat Ozer,Shane Halse
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: 7 pages, 2 figures, CSTE2026: this https URL
Abstract:As Artificial Intelligence (AI) and Agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that Multi-Agent Education System (MAES) is explainable from the early stages of requirements engineering (RE) within the AI software development lifecycle. Explainability is essential to build trust, promote transparency, and enable effective human-AI collaboration. Although personas are well-established in human-computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human-first, persona-driven, explainable MAES RE framework and demonstrates the framework through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform explainability requirements that guided the clinical reasoning training of medical students. A post-usage survey found that more than 78% of medical students reported that MAES improved their clinical reasoning skills. These findings demonstrate that persona-based RE effectively connects technical requirements with non-technical medical students through a human-centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of the AI system engineering. The partial MAES for the clinical scenario simulator is open sourced at this https URL.
[MA-19] Logic-Based Verification of Task Allocation for LLM -Enabled Multi-Agent Manufacturing Systems
【速读】:该论文旨在解决制造系统中因个性化产品需求增加而导致的产品变异性问题,特别是在频繁重构场景下保障安全性的挑战。传统多智能体控制架构依赖预定义的任务模型,难以在保持安全性的同时适应新的产品要求;尽管生成式 AI(Generative AI)被引入以提升灵活性,其可靠性仍存在不足。解决方案的关键在于提出一种融合大型语言模型(Large Language Models, LLMs)灵活性与形式化验证能力的控制架构,通过时序逻辑(Temporal Logic)和离散事件系统(Discrete Event Systems)对LLM生成的任务分配进行安全性验证,从而确保在多机器人装配场景中,不安全任务能够在执行前被识别并修正,实现安全可控的动态任务规划。
链接: https://arxiv.org/abs/2604.17142
作者: Jonghan Lim,Mostafa Tavakkoli Anbarani,Rômulo Meira-Góes,Ilya Kovalenko
机构: Pennsylvania State University (宾夕法尼亚州立大学)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Manufacturing industries are facing increasing product variability due to the growing demand for personalized products. Under these conditions, ensuring safety becomes challenging as frequent reconfigurations can lead to unintended hazardous behaviors. Multi-agent control architectures have been proposed to improve flexibility through decentralized decision-making and coordination. However, these architectures are based on predefined task models, which limit their ability to adapt task planning to new product requirements while preserving safety. Recently, large language models have been introduced into manufacturing systems to enhance adaptability, but reliability remains a key challenge. To address this issue, we propose a control architecture that leverages the flexibility of large language models while preserving safety on the manufacturing shop floor. Specifically, the proposed framework verifies large language model-enabled task allocations by using temporal logic and discrete event systems. The effectiveness of the proposed framework is demonstrated through a case study that involves a multi-robot assembly scenario, showing that unsafe tasks can be allocated safely before task execution.
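"任务分配先验证、后执行"的思想可用离散事件系统(有限自动机)上的安全性检查来示意(状态、事件与不安全集合均为假设,并非论文的具体模型):

```python
def allocation_is_safe(transitions, start, unsafe, task_sequence):
    """沿任务序列模拟 DES;进入不安全状态或出现不可行事件即拒绝该分配。"""
    state = start
    for task in task_sequence:
        if (state, task) not in transitions:
            return False                  # 事件在当前状态下不可行
        state = transitions[(state, task)]
        if state in unsafe:
            return False                  # 违反安全规范
    return True

# 假设的装配单元:机器人持件时若有人员进入,则进入不安全的焊接区状态
transitions = {
    ("idle", "pick"): "holding",
    ("holding", "weld"): "welding",
    ("holding", "human_enter"): "unsafe_weld_zone",
    ("welding", "place"): "idle",
}
ok = allocation_is_safe(transitions, "idle", {"unsafe_weld_zone"},
                        ["pick", "weld", "place"])
bad = allocation_is_safe(transitions, "idle", {"unsafe_weld_zone"},
                         ["pick", "human_enter"])
```

论文中 LLM 生成的任务分配即以类似方式在执行前接受时序逻辑与 DES 模型的检查,不安全的分配被拦截并重新规划。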
[MA-20] The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration
【速读】:该论文旨在解决多智能体大语言模型(Multi-agent Large Language Model, Multi-agent LLM)在开放环境中因隐蔽式上下文污染(如定向提示注入)而导致的结构脆弱性问题。当前主流的响应级聚合方法(如多数投票 MAJ)在被污染代理(corrupted agents)形成局部多数时会失效,因其仅对最终结论进行计数而忽视中间推理过程中的逻辑缺陷。解决方案的关键在于提出一种基于 token 级轮转协作(Token-Level Round-Robin, RR)的新架构:通过在共享自回归上下文中顺序交错生成 token,将聚合机制从线性求和转变为非线性算子乘积,从而构建一个动态、交织的推理链。理论分析与实证结果表明,RR 能够有效抵抗多数恶意 agent 的干扰,维持系统鲁棒性。
链接: https://arxiv.org/abs/2604.17139
作者: Jiayuan Liu,Shiyi Du,Weihua Du,Mingyu Guo,Vincent Conitzer
机构: Carnegie Mellon University (卡内基梅隆大学); Foundations of Cooperative AI Lab (FOCAL) (合作人工智能基础实验室); Adelaide University (阿德莱德大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent large language model (LLM) architectures increasingly rely on response-level aggregation, such as Majority Voting (MAJ), to raise reasoning ceilings. However, in open environments, agents are highly susceptible to stealthy contextual corruption, such as targeted prompt injections. We reveal a critical structural vulnerability in current multi-agent systems: response-level aggregation collapses when corrupted agents form a local majority. Because voting aggregates fully-formed conclusions, it is blind to flawed intermediate logic. To overcome this systematic limitation, we propose the Token-Level Round-Robin (RR) Collaboration, where agents sequentially interleave generation within a shared auto-regressive context. We formalize this process as a discrete-time dynamical system, proving that token-level interleaving transitions aggregation from a brittle counting of final votes (a linear sum) to a dynamic, interwoven chain of logic (a non-linear operator product). Through this theoretical lens, we prove that the honest model’s restorative pull can overpower adversarial corruptions, even when corrupted agents form a majority. We conduct an exhaustive empirical evaluation across diverse reasoning benchmarks and demonstrate that while MAJ collapses when corrupted agents reach a majority, RR maintains robust accuracy well beyond this critical threshold.
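Token 级轮转协作的核心机制是多个代理在共享自回归上下文中交替生成 token。以下用确定性的模拟代理演示这种交织过程(代理函数为假设;此处仅展示交织机制本身,不涉及论文中诚实模型"恢复拉力"的动力学分析):

```python
def round_robin_generate(agents, context, max_tokens):
    """代理按轮转顺序各生成一个 token,追加到共享上下文;None 表示终止。"""
    for step in range(max_tokens):
        agent = agents[step % len(agents)]
        token = agent(context)
        if token is None:                 # 任一代理发出终止信号
            break
        context = context + [token]
    return context

# 模拟代理:诚实代理按上下文长度续写;被污染代理固定插入错误 token "X"
honest = lambda ctx: str(len(ctx)) if len(ctx) < 6 else None
corrupted = lambda ctx: "X"

out = round_robin_generate([honest, corrupted, honest], [], 10)
```

可以看到被污染代理的输出被交织进共享上下文,而非作为独立的完整回答参与投票;论文正是利用这一点,证明诚实模型可以在后续 token 上纠偏,即使污染方占多数。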
[MA-21] CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation ACL2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自动生成深度研究报告时面临的非线性叙事逻辑缺失与多模态融合能力不足的问题。当前方法依赖预定义的线性工作流,导致误差累积、无法基于后续洞察进行全局重构,从而限制了报告质量与深度。其解决方案的关键在于提出CogGen框架——一个受认知启发的递归架构,通过分层递归机制模拟人类写作的认知过程,实现灵活规划与全局重构;同时引入抽象视觉表示(Abstract Visual Representation, AVR),以意图驱动的方式迭代优化图文布局,避免像素级重生成开销,从而支持高效的多模态内容整合。
链接: https://arxiv.org/abs/2604.17072
作者: Kuo Tian,Pengfei Sun,Zhen Wu,Junran Ding,Xinyu Dai
机构: Nanjing University(南京大学); Nanjing Haodun Technology Development Co., Ltd.(南京好盾科技发展有限公司)
类目: Multiagent Systems (cs.MA)
备注: 28 pages, 3 figures, Accepted to ACL 2026 Findings
Abstract:The autonomous synthesis of deep research reports represents a critical frontier for Large Language Models (LLMs), demanding sophisticated information orchestration and non-linear narrative logic. Current approaches rely on rigid predefined linear workflows, which cause error accumulation, preclude global restructuring from subsequent insights, and ultimately limit in-depth multimodal fusion and report quality. We propose CogGen, a Cognitively inspired recursive framework for deep research report Generation. Leveraging a Hierarchical Recursive Architecture to simulate cognitive writing, CogGen enables flexible planning and global restructuring. To extend this recursivity to multimodal content, we introduce Abstract Visual Representation (AVR): a concise intent-driven language that iteratively refines visual-text layouts without pixel-level regeneration overhead. We further present CLEF, a Cognitive Load Evaluation Framework, and curate a new benchmark from Our World in Data (OWID). Extensive experiments show CogGen achieves state-of-the-art results among open-source systems, generating reports comparable to professional analysts’ outputs and surpassing Gemini Deep Research. Our code and dataset are available at this https URL.
[MA-22] From Necklaces to Coalitions: Fair and Self-Interested Distribution of Coalition Value Calculations
【速读】:该论文旨在解决分布式联盟形成(distributed coalition formation)中联盟价值计算分配不均、冗余或非成员代理被指派计算任务的问题,尤其针对无约束特征函数博弈(unrestricted characteristic function games)。其核心挑战在于,随着代理数量增加,可能的联盟组合呈指数级增长,传统方法难以实现高效且公平的负载分配。解决方案的关键是提出一种无需通信的基于项链的分布式联盟算法(N-DCA),其核心创新在于利用增量数组(Increment Arrays, IAs)的数学结构——通过定义循环移位下的等价类、周期性IA以及旋转指派方案,建立了标准代表IA与双色组合项链(two-colour combinatorial necklaces)之间的双射关系,从而可借助高效的项链生成算法在常数摊销时间内枚举分配方案,并提供严格的负载均衡边界保证。此设计确保了五个理想性质:无代理间通信、公平分配、无冗余、负载平衡和自利性(self-interest),在工作内存、可扩展性和理论保障方面优于现有方法如DCVC。
链接: https://arxiv.org/abs/2604.17057
作者: Terry R. Payne,Luke Riley
机构: University of Liverpool (利物浦大学); Quant Network (量化网络)
类目: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注: 69 pages
Abstract:A key challenge in distributed coalition formation within characteristic function games is determining how to allocate the calculation of coalition values across a set of agents. The number of possible coalitions grows exponentially with the number of agents, and existing distributed approaches may produce uneven or redundant allocations, or assign coalitions to agents that are not themselves members. In this article, we present the Necklace-based Distributed Coalition Algorithm (N-DCA), a communication-free algorithm in which each agent independently determines its own coalition value calculation allocation using only its identifier and the total number of agents. The approach builds on the notion of Increment Arrays (IAs), for which we develop a complete mathematical framework: equivalence classes under circular shifts, periodic IAs, and a rotated designation scheme with formal load-balance guarantees (tight bounds). We establish a bijection between canonical representative IAs and two-colour combinatorial necklaces, enabling the use of efficient necklace generation algorithms to enumerate allocations in constant amortised time. N-DCA is, to the best of our knowledge, the only distributed coalition value calculation algorithm for unrestricted characteristic function games to provably satisfy five desirable properties: no inter-agent communication, equitable allocation, no redundancy, balanced load, and self-interest. An empirical evaluation against DCVC (Rahwan and Jennings 2007) demonstrates that, although DCVC is faster by a constant factor, this difference becomes negligible under realistic characteristic-function evaluation costs, while N-DCA offers advantages in working memory, scalability, and the self-interest guarantee.
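双色组合项链即旋转等价类下的二进制串,每个等价类取字典序最小者作为规范代表。可用朴素方法枚举如下(论文使用的是常数摊销时间的项链生成算法,此处为 O(n·2^n) 的示意实现):

```python
def canonical(s):
    """返回二进制串在所有循环移位下字典序最小的旋转。"""
    return min(s[i:] + s[:i] for i in range(len(s)))

def necklaces(n):
    """枚举长度为 n 的双色项链(每个旋转等价类的规范代表)。"""
    seen = set()
    for x in range(2 ** n):
        s = format(x, f"0{n}b")
        seen.add(canonical(s))
    return sorted(seen)

reps = necklaces(4)   # 长度 4 的双色项链共 6 条
```

论文中每条项链对应一个可行的联盟价值计算分配方案,代理只需由自身编号与代理总数即可独立定位到自己的份额,无需任何通信。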
[MA-23] enclawed: A Configurable Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways
【速读】:该论文旨在解决在高度监管行业中部署个人人工智能(AI)助手时面临的可信性、安全性与合规性挑战,特别是如何实现可验证的对等信任、默认拒绝的外部连接、签名模块加载及防篡改审计日志等问题。其解决方案的关键在于提出一个名为enclawed的硬分叉加固框架,该框架基于OpenClaw单用户个人AI助手网关构建,提供两种部署模式:开放模式保留兼容性并输出审计、分类和数据丢失防护(DLP)信号;封闭模式则激活严格白名单、联邦信息处理标准(FIPS)加密模块认证、强制模块清单签名验证以及高保障的模型上下文协议(MCP)对等认证机制,从而在不依赖第三方认证的前提下,为敏感场景提供可配置、可审计且具备抗攻击能力的安全运行环境。
链接: https://arxiv.org/abs/2604.16838
作者: Alfredo Metere
机构: Metere Consulting, LLC.
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:We present enclawed, a hard-fork hardening framework built on top of the OpenClaw single-user personal artificial intelligence (AI) assistant gateway. enclawed targets deployments that need attestable peer trust, deny-by-default external connectivity, signed-module loading, and a tamper-evident audit trail, typically in regulated industries such as financial services, healthcare, defense contracting, regulated R&D, and government enclaves. The framework ships in two flavors: an open flavor that preserves OpenClaw compatibility while still emitting audit, classification, and data-loss-prevention (DLP) signals, and an enclaved flavor that activates strict allowlists, Federal Information Processing Standards (FIPS) cryptographic-module assertion, mandatory module-manifest signature verification, and high-assurance peer attestation for the Model Context Protocol (MCP). The classification ladder is fully data-driven: a deploying organization selects from five built-in presets (generic, US-government, healthcare, financial services, three-tier) or supplies its own JSON. We accompany the implementation with a security review, a 204-case test suite (146 unit tests, 58 adversarial pen-tests for tamper detection, signature forgery, egress bypass, trust-root mutation, DLP evasion, prompt injection, and code injection), real-time human-in-the-loop control (per-agent pause / resume / stop and approval queues), a memory-bounded secure transaction buffer with rollback (default cap 50% of system RAM, configurable), a strict-mode TypeScript typecheck of all 22 framework files, and a GitHub Actions workflow ready for continuous integration. enclawed is a hardening framework, not an accredited compliance certification. The deploying organization remains responsible for hardware, validated cryptographic modules, certified facilities, and assessor sign-off.
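模块清单签名验证的基本流程可以用"先验清单签名、再验模块哈希"的草图来说明(此处以对称 HMAC 代替生产环境中的非对称签名与 FIPS 认证模块,清单字段名均为假设):

```python
import hashlib, hmac, json

def verify_manifest(manifest, module_bytes, key):
    """先校验清单签名,再校验模块内容哈希;任一失败即拒绝加载。"""
    payload = json.dumps(manifest["modules"], sort_keys=True).encode()
    expected_sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected_sig, manifest["signature"]):
        return False                      # 清单被篡改或签名伪造
    digest = hashlib.sha256(module_bytes).hexdigest()
    return digest == manifest["modules"]["plugin"]["sha256"]

# 构造一份已签名的清单,然后分别验证原始模块与被篡改的模块
key = b"demo-signing-key"
module = b"print('hello')"
modules = {"plugin": {"sha256": hashlib.sha256(module).hexdigest()}}
sig = hmac.new(key, json.dumps(modules, sort_keys=True).encode(),
               hashlib.sha256).hexdigest()
manifest = {"modules": modules, "signature": sig}
ok = verify_manifest(manifest, module, key)
tampered = verify_manifest(manifest, b"import os", key)
```

使用 `hmac.compare_digest` 而非 `==` 比较签名,可避免基于时间侧信道的签名猜测,这也是此类验证代码的惯用写法。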
[MA-24] Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
【速读】:该论文旨在解决当前对工具使用型大语言模型(Tool-using Large Language Model, LLM)代理的自动化评估可靠性假设缺乏实证验证的问题。其核心挑战在于现有基于子串匹配的评判方法与人类标注存在显著偏差,且参数级错误传播机制未被量化。解决方案的关键在于构建了一个包含2000个任务、2300条推理轨迹的基准测试集AgentProp-Bench,并引入人工标注的100标签子集以校准评估结果;通过量化裁判可靠性、刻画误差传播路径并设计运行时拦截策略,发现三模型集成可提升一致性至中等水平(kappa=0.432),而参数注入错误导致最终答案错误的概率高达约0.62;同时揭示拒绝(rejection)与恢复(recovery)是独立能力,且针对GPT-4o-mini的运行时拦截器能有效降低幻觉23个百分点,但对Gemini-2.0-Flash无显著效果,因其自身激进的参数拒绝机制已消除目标故障模式。
链接: https://arxiv.org/abs/2604.16706
作者: Bhaskar Gurram
机构: Zasti Inc.(Zasti 公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 9 pages, 5 figures, 12 tables (8 main + 4 supplementary). Under review at Information Processing Management. Code and data: this https URL
Abstract:Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT-4o-mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini-2.0-Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at this https URL.
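文中裁判可靠性以 Cohen's kappa 衡量,即观测一致率相对随机一致率的提升,其计算可示意如下:

```python
from collections import Counter

def cohens_kappa(a, b):
    """两名标注者的 Cohen's kappa:kappa = (p_o - p_e) / (1 - p_e)。"""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # 观测一致率
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)  # 随机一致率
    return (p_o - p_e) / (1 - p_e)

# 假设的标注:1 = 判定轨迹成功,0 = 判定失败
judge = [1, 1, 0, 1, 0, 0, 1, 0]
human = [1, 0, 0, 1, 0, 1, 1, 0]
k = cohens_kappa(judge, human)
```

按此定义,kappa≈0 意味着裁判与人工标注的一致程度与随机猜测相当(对应文中子串匹配裁判的 0.049),0.4–0.6 为中等一致(对应三模型集成的 0.432)。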
[MA-25] Agentic AI for Education: A Unified Multi-Agent Framework for Personalized Learning and Institutional Intelligence
【速读】:该论文旨在解决当前基于人工智能(AI)的教育系统中存在的碎片化问题,即缺乏在学生、教师和机构三个层级上的深度融合与协同。其解决方案的关键在于提出一种新型多智能体架构——代理统一学生支持系统(Agentic Unified Student Support System, AUSS),通过整合学生层面的个性化服务、教师层面的自动化流程以及机构层面的智能决策能力,利用大语言模型(Large Language Models, LLMs)、强化学习、预测分析和规则推理等技术手段,实现教育生态系统的可扩展性、自适应性和智能化。
链接: https://arxiv.org/abs/2604.16566
作者: Arya Mary K J,Deepthy K Bhaskar,Sinu T S,Binu V P
机构: M.E.S. College (M.E.S.学院)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Agentic Artificial Intelligence (AI) represents a paradigm shift from reactive systems to proactive, autonomous decision making frameworks. Existing AI-based educational systems remain fragmented and lack multi-level integration across stakeholders. This paper proposes the Agentic Unified Student Support System (AUSS), a novel multi-agent architecture integrating student-level personalization, educator-level automation, and institutional-level intelligence. The framework leverages Large Language Models (LLMs), reinforcement learning, predictive analytics, and rule-based reasoning. Experimental results demonstrate improvements in recommendation accuracy (92.4%), grading efficiency (94.1%), and dropout prediction (F1-score: 89.5%). The proposed system enables scalable, adaptive, and intelligent educational ecosystems.
[MA-26] Conjunctive Prompt Attacks in Multi-Agent LLM Systems ACL2026
【速读】:该论文旨在解决多智能体大语言模型(Large Language Model, LLM)系统中因代理间交互和路由机制引入的新型安全漏洞问题。传统LLM安全研究主要聚焦于单智能体场景,而实际应用中广泛存在的多智能体协作架构(如星型、链式和有向无环图拓扑)通过提示分割(prompt segmentation)与跨代理路由(inter-agent routing)形成攻击面,使得单一代理检测难以发现协同性恶意行为。论文提出了一种“conjunctive prompt attack”(联合提示攻击)模型,其中攻击者仅控制触发词(trigger key)的位置和一个被攻陷远程代理中的隐藏对抗模板(adversarial template),二者单独均表现为良性,但在路由路径交汇时激活有害输出。解决方案的关键在于:通过路由感知优化(routing-aware optimization)对触发词与模板进行协同配置,在保持低误激活率的前提下显著提升攻击成功率;同时揭示现有防御机制(如PromptGuard、Llama-Guard变体及工具限制等)因缺乏对跨代理组合逻辑的推理能力而失效,从而强调未来需构建能理解代理间数据流与语义组合关系的安全防护体系。
链接: https://arxiv.org/abs/2604.16543
作者: Nokimul Hasan Arif,Qian Lou,Mengxin Zheng
机构: University of Central Florida, USA
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: ACL 2026 Main Conference
Abstract:Most LLM safety work studies single-agent models, but many real applications rely on multiple interacting agents. In these systems, prompt segmentation and inter-agent routing create attack surfaces that single-agent evaluations miss. We study conjunctive prompt attacks, where a trigger key in the user query and a hidden adversarial template in one compromised remote agent each appear benign alone but activate harmful behavior when routing brings them together. We consider an attacker who changes neither model weights nor the client agent and instead controls only trigger placement and template insertion. Across star, chain, and DAG topologies, routing-aware optimization substantially increases attack success over non-optimized baselines while keeping false activations low. Existing defenses, including PromptGuard, Llama-Guard variants, and system-level controls such as tool restrictions, do not reliably stop the attack because no single component appears malicious in isolation. These results expose a structural vulnerability in agentic LLM pipelines and motivate defenses that reason over routing and cross-agent composition. Code is available at this https URL.
[MA-27] Public and private blockchain for decentralized digital building twins and building automation system
【速读】:该论文旨在解决智能建筑及其数字孪生系统中物联网(IoT)设备数据传输依赖集中式架构所带来的脆弱性问题,特别是单点故障导致的运营中断风险。解决方案的关键在于引入基于区块链的去中心化协议,通过融合公有链与私有链技术构建一个增强网络攻击抵御能力(cyber resilience)的分布式数据传输框架,从而实现更安全、可扩展且具备隐私保护的数据交换机制,并在真实建筑环境中通过智能家居设备和两个数字孪生平台验证了其性能优势。
链接: https://arxiv.org/abs/2604.16534
作者: Reachsak Ly,Alireza Shojaei
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 27 pages, 15 figures, 2 tables
Abstract:The communication protocols and data transfer mechanisms employed by IoT devices in smart buildings and corresponding digital twin systems predominantly rely on centralized architectures. Such centralized systems are vulnerable to single points of failure, where a malfunction can disrupt operational processes. This study introduces a blockchain-based decentralized protocol to enhance the cyber resilience of IoT data transfer for digital twins and enable decentralized automation of building operations. The framework incorporates public and private blockchain technologies alongside two case studies showcasing prototypes of each system. These prototypes were validated within a real-world building environment using smart home appliances and two digital twin platforms, with their performance evaluated based on cost, scalability, data security, and privacy. The findings reveal that the Hyperledger Fabric-based system excels in terms of scalability, speed, and cost-effectiveness, while both frameworks offer advantages over traditional centralized protocols in system cyber resilience, data security, and privacy.
[MA-28] Training Language Models for Bilateral Trade with Private Information
【速读】:该论文旨在解决如何在不完全信息条件下评估大型语言模型(Large Language Model, LLM)代理的双边谈判能力问题,特别是其在个体理性、策略性盈余最大化与合作实现贸易收益方面的表现。解决方案的关键在于构建一个结构化的谈判环境,其中LLM通过工具调用在事件驱动模拟器中进行协商,将具有约束力的报价与自然语言消息分离,从而支持自动化评估;同时利用该环境作为前沿模型的基准测试平台和开放权重模型的强化学习训练场景,实验证明有效策略依赖于通过序列报价实施价格歧视,且行为调整需匹配商品价值层级,体现了模型性能对策略比例性的敏感性。
链接: https://arxiv.org/abs/2604.16472
作者: Dirk Bergemann,Soheil Ghili,Xinyang Hu,Chuanhao Li,Zhuoran Yang
机构: Yale University (耶鲁大学); Tsinghua University (清华大学)
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); General Economics (econ.GN); Theoretical Economics (econ.TH)
备注: 67 pages, 34 figures
Abstract:Bilateral bargaining under incomplete information provides a controlled testbed for evaluating large language model (LLM) agent capabilities. Bilateral trade demands individual rationality, strategic surplus maximization, and cooperation to realize gains from trade. We develop a structured bargaining environment where LLMs negotiate via tool calls within an event-driven simulator, separating binding offers from natural-language messages to enable automated evaluation. The environment serves two purposes: as a benchmark for frontier models and as a training environment for open-weight models via reinforcement learning. In benchmark experiments, a round-robin tournament among five frontier models (15,000 negotiations) reveals that effective strategies implement price discrimination through sequential offers. Aggressive anchoring, calibrated concession, and temporal patience correlate with the highest surplus share and deal rate. Accommodating strategies that concede quickly disable price discrimination in the buyer role, yielding the lowest surplus capture and deal completion. Stronger models scale their behavior proportionally to item value, maintaining performance across price tiers; weaker models perform well only when wide zones of possible agreement offset suboptimal strategies. In training experiments, we fine-tune Qwen3 (8B, 14B) via supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) against a fixed frontier opponent. These stages optimize competing objectives: SFT approximately doubles surplus share but reduces deal rates, while RL recovers deal rates but erodes surplus gains, reflecting the reward structure. SFT also compresses surplus variation across price tiers, which generalizes to unseen opponents, suggesting that behavioral cloning instills proportional strategies rather than memorized price points. 
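双边交易中,仅当买方估值不低于卖方成本时才存在可能成交区间(ZOPA);文中高得分策略采用的"激进锚定 + 逐轮让步"可示意如下(锚定倍数与让步率均为假设;为简化,此处卖方"知道"买方估值,而论文的不完全信息设定中并非如此):

```python
def negotiate(seller_cost, buyer_value, rounds, concession=0.2):
    """卖方从高锚定价逐轮让步;买方在报价落入 ZOPA 且不超估值时接受。"""
    ask = buyer_value * 1.5                          # 激进锚定(假设的倍数)
    for r in range(rounds):
        if seller_cost <= ask <= buyer_value:        # 买方接受
            surplus = buyer_value - seller_cost
            return {"deal": True, "price": ask, "round": r,
                    "seller_share": (ask - seller_cost) / surplus}
        ask = max(seller_cost, ask * (1 - concession))   # 逐轮让步,不低于成本
    return {"deal": False}

result = negotiate(seller_cost=40, buyer_value=100, rounds=10)
```

可以看到耐心让步使成交价停在买方保留价附近,卖方攫取大部分剩余;若让步过快(大 `concession`),成交率上升但卖方份额下降,这与文中"激进锚定 + 校准让步"策略得分最高、过度迁就策略份额最低的观察一致。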
[MA-29] Semantic Channel Theory: Deductive Compression and Structural Fidelity for Multi-Agent Communication
【速读】:该论文旨在解决传统信息论(Shannon信息论)中忽略语义内容的问题,提出一个严格的语义通信框架,以实现对消息语义层面的量化分析与优化。其关键解决方案是构建一个基于形式化证明系统的语义信息模型:通过引入Lsem-definable状态集和可计算的启用映射(enabling maps),将语义信道定义为满足结构约束的马尔可夫核复合,并基于固定证明系统诱导出不可约的语义核心(irredundant semantic core)及推导深度分层(derivation-depth stratification)。由此定义了四种逐级深化的失真度量(Hamming、闭包、深度及参数化复合),并揭示了语义信道不变量之间的关系,包括数据处理不等式、语义Fano界以及理想信道坍缩定理。其中最核心的定量成果是“演绎压缩增益”——在闭包保真度下,最小码长由语义核心大小决定,而非整个知识库规模,从而实现了语义层面的信息压缩效率提升。
链接: https://arxiv.org/abs/2604.16471
作者: Jianfeng Xu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Multiagent Systems (cs.MA)
备注: arXiv admin note: text overlap with arXiv:2604.11204
Abstract:Shannon’s information theory deliberately excludes message semantics. This paper develops a rigorous framework for semantic communication that integrates formal proof systems with Shannon-theoretic tools. We introduce an axiomatic information model comprising Lsem-definable state sets linked by computable enabling maps, and define the semantic channel as a composition of Markov kernels whose supports respect the enabling structure. A fixed proof system induces an irredundant semantic core and a derivation-depth stratification, enabling four distortion measures of increasing semantic depth: Hamming, closure, depth, and a parameterized composite. Six families of computable semantic channel invariants are defined and their inter-relationships established, including a data processing bound, a semantic Fano bound, and an ideal-channel collapse theorem. The central quantitative result is a deductive compression gain: under closure-based fidelity, the minimum block length is determined by the irredundant core size rather than the full knowledge-base size. We instantiate the framework for heterogeneous multi-agent communication, introducing an overlap decomposition that yields necessary and sufficient conditions for closure-reliable communication. A semantic bottleneck phenomenon is identified in broadcast settings: vocabulary mismatch imposes irreducible fidelity limitations even over noiseless carriers. All results are verified on an explicit Datalog instance.
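"不可约语义核心"的直觉是删去能被其余事实与规则推出的冗余事实。对 Horn 规则做前向闭包即可写出一个贪心草图(规则表示与删除顺序均为假设,所得核心不一定唯一):

```python
def closure(facts, rules):
    """前向链推理:反复应用 Horn 规则 (前提集, 结论) 直至不动点。"""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived

def semantic_core(facts, rules):
    """贪心删去可由其余事实推出的事实,得到一个不可约核心。"""
    core = set(facts)
    for f in sorted(facts):
        rest = core - {f}
        if f in closure(rest, rules):    # f 冗余:其余事实可推出它
            core = rest
    return core

# 假设的知识库:a 推出 b,b 推出 c,故只需传输 a
rules = [({"a"}, "b"), ({"b"}, "c")]
core = semantic_core({"a", "b", "c"}, rules)
```

这对应文中"演绎压缩增益"的直觉:在闭包保真度下,接收方可由核心自行推出其余事实,最小码长由核心大小而非整个知识库大小决定。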
[MA-30] Heterogeneous Self-Play for Realistic Highway Traffic Simulation CVPR
【速读】:该论文旨在解决高速公路场景下自动驾驶车辆安全评估中面临的三大挑战:一是需覆盖广泛的速度与驾驶行为,二是可控生成罕见但高风险的安全关键场景,三是确保多智能体交互中的行为可信度。其解决方案的关键在于提出PHASE(Policy for Heterogeneous Agent Self-play on Expressway)框架,通过三个核心机制实现:1)基于每辆车的显式条件控制以提升可操控性;2)合成场景生成以扩大高速公路场景覆盖范围;3)闭环多智能体训练以模拟真实交互动力学。此外,PHASE还支持不同车型(如乘用车和铰接式卡车)共存于同一策略中,并通过早期终止不可恢复状态、责任碰撞归因、道路感知奖励设计、耦合课程学习及鲁棒策略优化等手段稳定自对弈过程。实验表明,仅在合成数据上训练的PHASE能零样本迁移至512个未见过的真实高交互场景,在轨迹精度(ADE/FDE)和行为真实性(Frechet轨迹距离、能量距离)方面显著优于基线方法。
链接: https://arxiv.org/abs/2604.16406
作者: Jinkai Qiu,Alessandro Saviolo,Chaojie Wang,Mingke Wang,Xiaoyu Huang
机构: PlusAI; University of Michigan - Ann Arbor
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: 8 pages, 2026 CVPR SAD Workshop
Abstract:Realistic highway simulation is critical for scalable safety evaluation of autonomous vehicles, particularly for interactions that are too rare to study from logged data alone. Yet highway traffic generation remains challenging because it requires broad coverage across speeds and maneuvers, controllable generation of rare safety-critical scenarios, and behavioral credibility in multi-agent interactions. We present PHASE, Policy for Heterogeneous Agent Self-play on Expressway, a context-aware self-play framework that addresses these three requirements through explicit per-agent conditioning for controllability, synthetic scenario generation for broad highway coverage, and closed-loop multi-agent training for realistic interaction dynamics. PHASE further supports different vehicle profiles, for example, passenger cars and articulated trailer trucks, within a single policy via vehicle-aware dynamics and context-conditioned actions, and stabilizes self-play with early termination of unrecoverable states, at-fault collision attribution, highway-aware reward shaping, coupled curricula, and robust policy optimization. Despite being trained only on synthetic data, PHASE transfers zero-shot to 512 unseen high-interaction real scenarios in exiD, achieving a 96.3% success rate and reducing ADE/FDE from 6.57/12.07 m to 2.44/5.25 m relative to a prior self-play baseline. In a learned trajectory embedding space, it also improves behavioral realism over IDM, reducing Frechet trajectory distance by 13.1% and energy distance by 20.2%. These results show that synthetic self-play can provide a scalable route to controllable and realistic highway scenario generation without direct imitation of expert logs.
[MA-31] Semantic Consensus: Process-Aware Conflict Detection and Resolution for Enterprise Multi-Agent LLM Systems
【速读】:该论文旨在解决多智能体大语言模型(Multi-agent Large Language Model, LLM)系统在企业级部署中高失败率的问题,尤其是由规范与协调缺陷引发的失败(占总失败的79%)。其核心问题在于“语义意图分歧”(Semantic Intent Divergence)——即协作智能体因上下文隔离和缺乏过程建模而对共享目标产生不一致的理解。解决方案的关键是提出语义共识框架(Semantic Consensus Framework, SCF),该框架通过六组件协同工作实现过程感知的语义一致性:包括共享操作语义的流程上下文层、形式化意图表示的语义意图图、实时冲突检测引擎、基于策略-权威-时间层级的共识决议协议、渐进式语义漂移监测器以及支持组织政策执行的治理集成层。SCF在三个主流多智能体框架和四种企业场景中验证,实现了100%工作流完成率,显著优于现有方法,并提供高精度冲突识别与完整审计追踪。
链接: https://arxiv.org/abs/2604.16339
作者: Vivek Acharya
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注: 18 pages, 4 figures, 4 tables
Abstract:Multi-agent large language model (LLM) systems are rapidly emerging as the dominant architecture for enterprise AI automation, yet production deployments exhibit failure rates between 41% and 86.7%, with nearly 79% of failures originating from specification and coordination issues rather than model capability limitations. This paper identifies Semantic Intent Divergence–the phenomenon whereby cooperating LLM agents develop inconsistent interpretations of shared objectives due to siloed context and absent process models–as a primary yet formally unaddressed root cause of multi-agent failure in enterprise settings. We propose the Semantic Consensus Framework (SCF), a process-aware middleware comprising six components: a Process Context Layer for shared operational semantics, a Semantic Intent Graph for formal intent representation, a Conflict Detection Engine for real-time identification of contradictory, contention-based, and causally invalid intent combinations, a Consensus Resolution Protocol using a policy–authority–temporal hierarchy, a Drift Monitor for detecting gradual semantic divergence, and a Process-Aware Governance Integration layer for organizational policy enforcement. Evaluation across 600 runs spanning three multi-agent frameworks (AutoGen, CrewAI, LangGraph) and four enterprise scenarios demonstrates that SCF is the only approach to achieve 100% workflow completion–compared to 25.1% for the next-best baseline–while detecting 65.2% of semantic conflicts with 27.9% precision and providing complete governance audit trails. The framework is protocol-agnostic and compatible with MCP and A2A communication standards.
[MA-32] Governing the Agentic Enterprise: A Governance Maturity Model for Managing AI Agent Sprawl in Business Operations
【速读】:该论文旨在解决企业中自主型人工智能(Agentic AI)治理能力缺失所引发的“代理泛滥”(agent sprawl)问题,即多个冗余、无管控且相互冲突的AI代理在业务流程中失控蔓延,导致运营风险上升和资源浪费。其解决方案的关键在于提出一个五级治理成熟度模型——Agentic AI Governance Maturity Model (AAGMM),该模型整合NIST AI RMF与ISO/IEC 42001标准,覆盖12个治理领域,并首次构建了可量化的代理泛滥模式分类体系(包括功能重复、影子代理、孤儿代理、权限扩张及未受监控的委托链),并通过750次仿真验证了不同成熟度层级对企业成本控制、风险事件率、运营效率和决策质量等关键指标的显著影响,结果表明高成熟度组织(Level 4–5)相较低成熟度(Level 1)可实现94.3%更低的泛滥指数、96.4%更少的风险事件及32.6%更高的任务完成率,为实践者提供了从治理能力提升到商业价值最大化的可操作路径。
链接: https://arxiv.org/abs/2604.16338
作者: Vivek Acharya
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 11 pages, 2 figures, 7 tables
Abstract:The rapid adoption of agentic AI in enterprise business operations–autonomous systems capable of planning, reasoning, and executing multi-step workflows–has created an urgent governance crisis. Organizations face uncontrolled agent sprawl: the proliferation of redundant, ungoverned, and conflicting AI agents across business functions. Industry surveys report that only 21% of enterprises have mature governance models for autonomous agents, while 40% of agentic AI projects are projected to fail by 2027 due to inadequate governance and risk controls. Despite growing acknowledgment of this challenge, academic literature lacks a formal, empirically validated governance maturity model connecting governance capability to measurable business outcomes. This paper introduces the Agentic AI Governance Maturity Model (AAGMM), a five-level framework spanning 12 governance domains, grounded in NIST AI RMF and ISO/IEC 42001 standards. We additionally propose a novel taxonomy of agent sprawl patterns–functional duplication, shadow agents, orphaned agents, permission creep, and unmonitored delegation chains–each linked to quantifiable business cost models. The framework is validated through 750 simulation runs across five enterprise scenarios and five governance maturity levels, measuring business outcomes including cost containment, risk incident rates, operational efficiency, and decision quality. Results demonstrate statistically significant differences (p < 0.001, large effect sizes d > 2.0) between all governance maturity levels, with Level 4-5 organizations achieving 94.3% lower sprawl indices, 96.4% fewer risk incidents, and 32.6% higher effective task completion rates compared to Level 1. The AAGMM provides practitioners with an actionable roadmap for governing autonomous AI agents while maximizing business returns.
[MA-33] Distributed Human Identity: AI-Enabled Multi-Existence Through Cognitive Replication and Robotic Embodiments
【速读】:该论文旨在解决人类存在受限于物理具身性(physical embodiment)的问题,即个体无法同时存在于多个地点或情境中,从而限制了效率、参与度与社会互动的多样性。其解决方案的核心是提出多存在身份(Multi-Existence Identity, MEI)这一 socio-technical 框架,通过将认知、行为和情感属性复制到由人工智能驱动的多种具身形态(digital avatars、robotic embodiments 和 agentic software agents)中,实现个体在数字与物理环境中的并行存在。MEI 的关键创新在于嵌入认知保真度(cognitive fidelity)、情感共鸣(affective resonance)和情境响应能力(contextual responsiveness),使复刻的身份不仅作为原个体的代理(for the individual),更作为其延伸(as the individual),并通过人格建模、认知模拟与同步层确保跨通道的身份一致性,推动人机协同向更具关系真实性与文化情境性的方向演进。
链接: https://arxiv.org/abs/2604.16336
作者: A S M Touhidul Islam,John Tookey
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 30 pages, 1 figure, 4 tables. cs.AI under Computer Science category
Abstract:Human presence has traditionally been constrained by the limits of physical embodiment, allowing individuals to exist in only one place at a time. This article introduces Multi-Existence Identity (MEI)- a socio-technical framework that replicates cognitive, behavioral, and emotional attributes into AI-enabled embodiments capable of acting across digital and physical contexts in parallel. MEI advances beyond digital twins, telepresence, and multipresence avatars by embedding cognitive fidelity, affective resonance, and contextual responsiveness into distributed agents that function not only for, but as, the original individual. The framework integrates personality modeling, cognitive simulation, and a synchronization layer to maintain identity coherence across three embodiment channels: digital avatars, robotic embodiments, and agentic software agents. Differentiating itself from simulated assistants, MEI positions replicated identity as a dynamic and culturally situated extension of selfhood, foregrounding tacit engagement and relational authenticity. Application domains span professional work, education, healthcare, governance, family life, and media, offering transformative potential for productivity, caregiving, leadership, and creativity. Yet these opportunities also surface profound challenges concerning authenticity, consent, legal accountability, privacy, and the psychological meaning of presence. The article proposes a phased empirical roadmap to operationalize MEI through personality modeling, synchronization testing, robotic embodiment trials, and ethical stress-testing. By conceptualizing MEI as both a technological and cultural construct, the study reframes debates on identity and presence in digitally augmented societies, highlighting opportunities for human-AI integration while underscoring the need for inclusive ethical governance.
[MA-34] BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的具身任务规划方法在复杂3D环境中存在的局限性,即这些方法通常为无状态且反应式的,缺乏持久记忆能力,导致重复错误并难以处理空间或时间依赖关系。解决方案的关键在于提出一种无需训练的分层记忆系统 BrainMem(Brain-Inspired Evolving Memory),该系统模拟人类认知中的工作记忆、情景记忆和语义记忆,持续将交互历史转化为结构化的知识图谱与凝练的符号规则,使代理能够在不进行模型微调或额外训练的情况下,检索、推理并适应过往经验,从而显著提升长程任务和空间复杂任务的成功率。
链接: https://arxiv.org/abs/2604.16331
作者: Xiaoyu Ma,Lianyu Hu,Wenbing Tang,Zixuan Hu,Zeqin Liao,Zhizhen Wu,Yang Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:
Abstract:Embodied task planning requires agents to execute long-horizon, goal-directed actions in complex 3D environments, where success depends on both immediate perception and accumulated experience across tasks. However, most existing LLM-based planners are stateless and reactive, operating without persistent memory and therefore repeating errors and struggling with spatial or temporal dependencies. We propose BrainMem(Brain-Inspired Evolving Memory), a training-free hierarchical memory system that equips embodied agents with working, episodic, and semantic memory inspired by human cognition. BrainMem continuously transforms interaction histories into structured knowledge graphs and distilled symbolic guidelines, enabling planners to retrieve, reason over, and adapt behaviors from past experience without any model fine-tuning or additional training. This plug-and-play design integrates seamlessly with arbitrary multi-modal LLMs and greatly reduces reliance on task-specific prompt engineering. Extensive experiments on four representative benchmarks, including EB-ALFRED, EB-Navigation, EB-Manipulation, and EB-Habitat, demonstrate that BrainMem significantly enhances task success rates across diverse models and difficulty subsets, with the largest gains observed on long-horizon and spatially complex tasks. These results highlight evolving memory as a promising and scalable mechanism for generalizable embodied intelligence.
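为直观展示上述“工作/情景/语义”三层记忆的协作方式,下面给出一个极简的 Python 示意(类名、容量上限与“提炼”规则均为笔者假设;论文中的记忆以知识图谱与符号规则组织,此处仅用列表和字典简化):

```python
# 示意:BrainMem 式"工作/情景/语义"三层记忆的最小骨架(非官方实现)
class BrainMemSketch:
    def __init__(self, working_capacity=5):
        self.capacity = working_capacity
        self.working = []        # 工作记忆:当前任务的近期观察(有限容量)
        self.episodic = []       # 情景记忆:完整的历史交互轨迹
        self.semantic = {}       # 语义记忆:从经验中提炼的通用规则

    def observe(self, event):
        self.working.append(event)
        if len(self.working) > self.capacity:
            self.working.pop(0)  # 超出容量时遗忘最旧条目

    def end_episode(self, trajectory, success):
        self.episodic.append((trajectory, success))
        # 简化的"提炼":统计动作成功率,形成可检索的符号化经验
        for action in trajectory:
            wins, total = self.semantic.get(action, (0, 0))
            self.semantic[action] = (wins + int(success), total + 1)

    def advice(self, action):
        wins, total = self.semantic.get(action, (0, 0))
        return wins / total if total else None

mem = BrainMemSketch()
mem.end_episode(["open_drawer", "pick_knife"], success=False)
mem.end_episode(["open_drawer", "pick_spoon"], success=True)
assert mem.advice("open_drawer") == 0.5
assert mem.advice("pick_spoon") == 1.0
```

该骨架仅展示“交互轨迹 → 情景记忆 → 语义规则”的演化路径;实际系统中提炼逻辑应替换为结构化知识图谱构建与符号规则蒸馏。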
自然语言处理
[NLP-0] Sessa: Selective State Space Attention
【速读】: 该论文旨在解决现有序列建模架构在长距离依赖建模上的局限性:Transformer 由于注意力机制的稀释效应(influence scales as O(1/Seff(t)),对远距离 token 的影响衰减至 O(1/ℓ)),而基于状态空间模型(State-Space Models, SSMs)如 Mamba 虽具备输入依赖的反馈路径,但其长期敏感性随滞后 ℓ 指数衰减。解决方案的关键在于提出 Sessa——一种将注意力机制嵌入到反馈路径中的解码器结构,使得单层内可实现多路径递归聚合(many-path aggregation),从而在满足特定假设下获得幂律衰减的记忆尾部(power-law memory tail),即对滞后 ℓ 的影响为 O(ℓ−β)(其中 0<β<1),显著慢于传统 1/ℓ 衰减,并且在均匀路由设定中达到紧致的 Θ(ℓ−β) 界。此设计首次在统一框架下实现了灵活的选择性检索(flexible selective retrieval),包括非衰减型影响模式,同时在长上下文基准上表现最优,且在短上下文语言建模任务中保持与 Transformer 和 Mamba 基线相当的竞争力。
链接: https://arxiv.org/abs/2604.18580
作者: Liubomyr Horbatko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code available at: this https URL
Abstract:Modern sequence models are dominated by Transformers, where self-attention mixes information from the visible context in an input-dependent way. However, when retrieval is not sharp and attention remains diffuse over an effective support S_\mathrm{eff}(t), the influence of any individual token is diluted, typically scaling as O(1/S_\mathrm{eff}(t)) and reaching O(1/\ell) for old tokens in full-prefix settings. Structured state-space models process sequences recurrently through an explicit feedback path; selective variants such as Mamba make this feedback input-dependent, yet when freeze time cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures therefore either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag \ell of order O(\ell^{-\beta}) for 0 < \beta < 1, which is asymptotically slower than 1/\ell; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is \Theta(\ell^{-\beta}). Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer- and Mamba-style baselines on short-context language modeling.
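摘要中三类影响衰减曲线的差异可用一段 Python 数值演示直观对比(gamma、beta 取值为笔者假设,仅用于说明幂律尾部 ℓ^{-β} 在 0<β<1 时渐近慢于 1/ℓ 与指数衰减,并非论文中的具体常数):

```python
# 示意:对比三类序列模型在滞后 ell 下的影响衰减量级(数值演示,参数为假设值)
import math

def attention_influence(ell):
    # 扩散注意力:影响按 O(1/ell) 稀释
    return 1.0 / ell

def ssm_influence(ell, gamma=0.9):
    # 选择性 SSM:长程敏感性随滞后指数衰减
    return gamma ** ell

def sessa_influence(ell, beta=0.5):
    # Sessa 声称的幂律记忆尾部 O(ell^{-beta}),0 < beta < 1
    return ell ** (-beta)

for ell in [10, 100, 1000]:
    a, s, p = attention_influence(ell), ssm_influence(ell), sessa_influence(ell)
    print(f"ell={ell:5d}  attn={a:.2e}  ssm={s:.2e}  sessa={p:.2e}")

# 滞后越大,幂律尾部相对 1/ell 与指数衰减的优势越明显
assert sessa_influence(1000) > attention_influence(1000) > ssm_influence(1000)
```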
[NLP-1] A multimodal and temporal foundation model for virtual patient representations at healthcare system scale
【速读】: 该论文旨在解决现代医学中多模态数据分散在孤立系统中、缺乏统一患者表征模型的问题,从而无法实现对完整临床记录的深度整合与计算建模。其解决方案的关键在于构建Apollo——一个基于超过三十余年纵向医院记录(涵盖720万患者、250亿条记录、28种医学模态和12个专科)训练的多模态时序基础模型,该模型通过学习一个统一的表示空间,将超过十万种独特医疗事件(包括结构化数据、文本和图像)压缩为虚拟患者表征,从而支持跨任务的泛化临床预测与语义检索,为可计算医学奠定基础。
链接: https://arxiv.org/abs/2604.18570
作者: Andrew Zhang,Tong Ding,Sophia J. Wagner,Caiwei Tian,Ming Y. Lu,Rowland Pettit,Joshua E. Lewis,Alexandre Misrahi,Dandan Mo,Long Phi Le,Faisal Mahmood
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This “atlas of medical concepts” forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.
[NLP-2] Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
【速读】: 该论文旨在解决大语言模型在生成过程中因早期推理错误(unrecoverable reasoning errors)而导致后续输出持续偏离正确路径的问题,即一旦出现错误步骤,模型无法自我修正,反而会不断放大错误。其解决方案的关键在于提出潜在相位偏移回滚(Latent Phase-Shift Rollback, LPSR)机制:在每一步生成时,于关键层 $ l_{\text{crit}} $ 监测残差流(residual stream),通过余弦相似度与熵的双门控策略检测方向突变(phase shifts,相位偏移),随后回滚 KV-cache 并注入预计算的引导向量(steering vector)以纠正错误路径。该方法无需微调、梯度计算或额外前向传播,在 MATH-500 上使 8B 模型性能提升至 44.0%(相比标准自回归生成提高 15.2 个百分点),显著优于提示式自纠错等基线方法。
链接: https://arxiv.org/abs/2604.18567
作者: Manan Gupta,Dhruv Kumar
机构: BITS Pilani, Pilani Campus, India
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review
Abstract:Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce Latent Phase-Shift Rollback (LPSR): at each generation step, we monitor the residual stream at a critical layer l_{\text{crit}}, detect abrupt directional reversals (phase shifts) via a cosine-similarity + entropy dual gate, and respond by rolling back the KV-cache and injecting a pre-computed steering vector. No fine-tuning, gradient computation, or additional forward passes are required. LPSR achieves 44.0% on MATH-500 with an 8B model versus 28.8% for standard AR (+15.2 pp; McNemar \chi^2 = 66.96, p < 10^{-15}). Critically, prompted self-correction, the most natural inference-time baseline, scores only 19.8%, below standard AR; LPSR exceeds it by +24.2 pp (\chi^2 = 89.4, p \approx 0). LPSR also outperforms Best-of-16 (+7.8 pp) at 5.4x lower token cost, and surpasses a standard 70B model (35.2%) with 8.75x fewer parameters at ~3x the token budget. A 32-layer sweep reveals a novel detection-correction dissociation: error-detection AUC peaks at layer 14 (0.718) but task accuracy peaks at layer 16 (44.0% vs. 29.2%), demonstrating that optimal monitoring depth differs for detection and correction.
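摘要中“余弦相似度 + 熵”双门控检测的逻辑可用如下极简 Python 示意(阈值 tau_cos、tau_ent 与向量维度均为笔者假设,非论文官方实现;回滚与引导向量注入步骤此处省略):

```python
# 示意:LPSR 中"余弦相似度 + 熵"双门控相位偏移检测的最小实现(阈值为假设值)
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def phase_shift(h_prev, h_cur, next_token_probs, tau_cos=0.3, tau_ent=2.0):
    # 双门控:残差流方向突变(余弦相似度低)且输出分布高熵时判定为相位偏移,
    # 此时触发 KV-cache 回滚与引导向量注入
    return cosine(h_prev, h_cur) < tau_cos and entropy(next_token_probs) > tau_ent

# 方向一致、低熵:不触发
assert not phase_shift([1, 0], [0.9, 0.1], [0.99, 0.01])
# 方向反转、高熵(16 类均匀分布,熵 = ln 16 ≈ 2.77):触发
assert phase_shift([1, 0], [-1, 0.1], [1 / 16] * 16)
```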
[NLP-3] Dual Alignment Between Language Model Layers and Human Sentence Processing ACL2026
【速读】: 该论文试图解决的问题是:在句法挑战性较高的语言处理任务中,大型语言模型(LLM)内部层的预测 surprisal 是否能够更准确地估计人类认知努力,尤其是在句法歧义处理场景下,传统单层 surprisal 常被发现低估人类行为数据。解决方案的关键在于:通过对比自然阅读与句法挑战性任务中不同层次 LLM 的 surprisal 表现,发现早期层更适合建模自然阅读中的弱预测机制,而后期层则能更好地捕捉复杂句法处理所需的全上下文表征;进一步提出利用浅层与深层 LLM 层次的概率更新度量组合,以互补方式提升对阅读时间建模的准确性,从而实现对人类句法加工模式更全面的模拟。
链接: https://arxiv.org/abs/2604.18563
作者: Tatsuki Kuribayashi,Alex Warstadt,Yohei Oseki,Ethan Gotlieb Wilcox
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL 2026 main
Abstract:A recent study (Kuribayashi et al., 2025) has shown that human sentence processing behavior, typically measured on syntactically unchallenging constructions, can be effectively modeled using surprisal from early layers of large language models (LLMs). This raises the question of whether such advantages of internal layers extend to more syntactically challenging constructions, where surprisal has been reported to underestimate human cognitive effort. In this paper, we begin by exploring internal layers that better estimate human cognitive effort observed in syntactic ambiguity processing in English. Our experiments show that, in contrast to naturalistic reading, later layers better estimate such a cognitive effort, but still underestimate the human data. This dual alignment sheds light on different modes of sentence processing in humans and LMs: naturalistic reading employs a somewhat weak prediction akin to earlier layers of LMs, while syntactically challenging processing requires more fully-contextualized representations, better modeled by later layers of LMs. Motivated by these findings, we also explore several probability-update measures using shallow and deep layers of LMs, showing a complementary advantage to single-layer’s surprisal in reading time modeling.
[NLP-4] GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLM s via Gumbel-Softmax Sampling
【速读】: 该论文旨在解决当前低比特量化(low-bit quantization)中 scalar 量化方法与 vector/trellis 量化方法之间的性能差距问题。具体而言,尽管简单标量量化(如 GPTQ 或 AWQ)在 3–4 bit per parameter(bpp)时精度趋于饱和,而第二代向量量化方法(如 QTIP、GPTVQ 和 AQLM)虽能实现更高精度但难以部署和扩展,论文提出通过优化标量量化器来缩小这一差距。其解决方案的关键在于引入 GSQ(Gumbel-Softmax Quantization),该方法利用 Gumbel-Softmax 对离散量化网格进行松弛,联合学习每个坐标维度的量化格点分配与分组尺度,并将松弛的基数(cardinality)匹配到目标比特宽度下的有限量化级别(例如 3–8 级别对应三元或 3 bpp),从而在保持对称标量网格和组内量化结构的同时,使优化过程稳定且高效。实验证明,GSQ 在 Llama-3.1-8B/70B-Instruct 模型上可显著逼近 QTIP 的性能边界,且兼容现有标量推理内核,同时成功扩展至万亿参数级 MoE 模型(如 Kimi-K2.5)。
链接: https://arxiv.org/abs/2604.18556
作者: Alireza Dadgarnia,Soroush Tabesh,Mahdi Nikdan,Michael Helcig,Eldar Kurtic,Dan Alistarh
机构: ISTA; ETH Zürich; Red Hat AI
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and “second-generation” vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and to scale, and have gained relatively less traction. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3-8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and thus fully compatible with existing scalar inference kernels. We further show that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
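Gumbel-Softmax 对离散量化网格的松弛过程可用如下 Python 草图说明(logits 初始化方式、温度 tau 与网格级别均为笔者假设;完整的 GSQ 还需联合学习分组尺度并执行训练后优化,此处仅展示单个权重的前向松弛):

```python
# 示意:对单个权重在有限量化网格上做 Gumbel-Softmax 松弛(非论文官方实现)
import math, random

def gumbel_softmax_quantize(w, levels, tau=0.5, rng=random.Random(0)):
    # 每个网格点的得分:离 w 越近得分越高(简化的 logits 初始化,仅为演示)
    logits = [-(w - lv) ** 2 for lv in levels]
    # 采样 Gumbel 噪声并做温度为 tau 的 softmax 松弛
    g = [-math.log(-math.log(rng.random() + 1e-12) + 1e-12) for _ in levels]
    scores = [(l + n) / tau for l, n in zip(logits, g)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # 输出为网格点的凸组合;tau -> 0 时趋近于硬性格点分配
    return sum(p * lv for p, lv in zip(probs, levels))

# 三元网格(对应极低比特场景)
levels = [-1.0, 0.0, 1.0]
q = gumbel_softmax_quantize(0.8, levels)
assert min(levels) <= q <= max(levels)  # 松弛输出始终落在网格凸包内
```

由于松弛输出是格点的凸组合,目标比特宽度越低(格点越少),松弛越“紧”,这正是摘要中将基数匹配到 3–8 个级别的动机。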
[NLP-5] ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
【速读】: 该论文旨在解决爪型智能体(claw-like agents)训练与评估环境构建过程中存在的手动、人力密集且难以扩展的问题。传统方法依赖人工设计环境,效率低下且无法满足大规模评测需求。解决方案的关键在于提出 ClawEnvKit——一个自动化生成管道,能够根据自然语言描述自动生成多样、可验证的环境。其核心创新包括:(1) 解析模块从自然语言中提取结构化参数;(2) 生成模块输出任务规范、工具接口和评分配置;(3) 验证模块确保生成环境在可行性、多样性、结构有效性及内部一致性方面的质量。该方案实现了 Auto-ClawEval 基准的自动构建,涵盖 1,040 个环境,成本降低 13,800 倍,同时支持实时用户驱动的动态评估与适应性训练环境生成,显著提升了评估规模与灵活性。
链接: https://arxiv.org/abs/2604.18543
作者: Xirui Li,Ming Li,Derry Xu,Wei-Lin Chiang,Ion Stoica,Cho-Jui Hsieh,Tianyi Zhou
机构: University of Maryland (马里兰大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent’s current weaknesses rather than being bounded by existing user logs.
[NLP-6] Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations ACL
【速读】: 该论文旨在解决下一话语行为预测(Next Dialogue Act Prediction, NDAP)中因缺乏对话流统计信息而导致的性能瓶颈问题,特别是在数据稀疏、细粒度分类场景下。其核心解决方案是引入一个KL散度正则化项(KL regularization term),通过将模型预测的话语行为分布与语料库中提取的转移模式对齐,从而显式建模对话流程的统计规律。实验表明,该方法在德语咨询语料库上使宏平均F1获得9–42%的相对提升,并在跨数据集验证中展现出良好的泛化能力,尤其对弱基线模型带来显著改进,证明轻量级对话流先验可有效增强预训练编码器在细粒度对话任务中的表现。
链接: https://arxiv.org/abs/2604.18539
作者: Eric Rudolph,Philipp Steigerwald,Jens Albrecht
机构: Technische Hochschule Nürnberg Georg Simon Ohm
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as ACL findings paper
Abstract:This paper studies how empirical dialogue-flow statistics can be incorporated into Next Dialogue Act Prediction (NDAP). A KL regularization term is proposed that aligns predicted act distributions with corpus-derived transition patterns. Evaluated on a 60-class German counselling taxonomy using 5-fold cross-validation, this improves macro-F1 by 9–42% relative depending on encoder and substantially improves dialogue-flow alignment. Cross-dataset validation on HOPE suggests that improvements transfer across languages and counselling domains. In systematic ablations across pretrained encoders and architectures, the findings indicate that transition regularization provides consistent gains and disproportionately benefits weaker baseline models. The results suggest that lightweight discourse-flow priors complement pretrained encoders, especially in fine-grained, data-sparse dialogue tasks.
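转移矩阵 KL 正则项的构造可用如下玩具示例说明(论文任务为 60 类,此处缩减为 3 类;lambda 权重与转移矩阵均为笔者虚构,非论文原始设定):

```python
# 示意:NDAP 损失 = 交叉熵 + lambda * KL(预测分布 || 语料转移行)(玩具设定)
import math

def kl_div(p, q, eps=1e-8):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def ndap_loss(pred_probs, gold_idx, prev_act, transition_matrix, lam=0.1):
    # KL 项鼓励预测贴合从语料中统计出的对话流转移模式
    ce = -math.log(pred_probs[gold_idx] + 1e-8)
    reg = kl_div(pred_probs, transition_matrix[prev_act])
    return ce + lam * reg

# 3 个对话行为类别的玩具转移矩阵(行:上一行为,列:下一行为)
T = [[0.7, 0.2, 0.1],
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]
aligned = ndap_loss([0.7, 0.2, 0.1], gold_idx=0, prev_act=0, transition_matrix=T)
misaligned = ndap_loss([0.7, 0.2, 0.1], gold_idx=0, prev_act=1, transition_matrix=T)
assert aligned < misaligned  # 偏离经验转移分布的预测受到更大的正则惩罚
```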
[NLP-7] Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
【速读】: 该论文旨在解决开放权重语言模型(Open-weight language models)在遭受不同类型的恶意干预后,其安全性与行为特性如何分化的问题。研究发现,尽管三种攻击路径——有害监督微调(harmful supervised fine-tuning, SFT)、基于可验证奖励的强化学习(reinforcement learning with verifiable rewards, RLVR)和拒绝抑制型消解(refusal-suppressing abliteration)——均能实现接近上限的有害合规性,但它们在模型内部机制、安全认知保留程度及修复潜力上存在显著差异。关键在于:RLVR 方法通过构建反射式安全框架(reflective safety scaffold),在保持基础模型结构不变的前提下,将有害行为定向至特定指令,从而维持了较高的安全判断能力与行为稳定性;相比之下,SFT 导致全局性分布漂移,而 abliteration 则表现为局部拒绝特征删除,二者均难以通过靶向修复恢复原有效能。这一结果表明,模型被“越狱”后的性质取决于攻击方式本身,而非仅危害程度,这对安全对齐策略的设计具有重要启示。
链接: https://arxiv.org/abs/2604.18510
作者: Md Rysul Kabir,Zoran Tiganj
机构: Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold: when a harmful prompt is prepended with an instruction to reflect on safety standards, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, the highest behavioral drift, and a substantial capability loss on standard benchmarks. Abliteration is family-dependent in both self-audit and response to a reflective safety scaffold. Mechanistic and repair analyses further separate the routes: abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models, but has little effect on SFT-jailbroken models. Together, these results show that jailbreaks can produce vastly different properties despite similar harmfulness, with models jailbroken via RLVR showing remarkable similarity to the base model.
[NLP-8] MASS-RAG : Multi-Agent Synthesis Retrieval-Augmented Generation
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因检索到的上下文存在噪声、不完整或异构性而导致单一生成过程难以有效整合证据的问题。其解决方案的关键在于提出一种多智能体合成方法(Multi-Agent Synthesis for RAG, MASS-RAG),将证据处理过程分解为角色专业化代理:分别负责证据摘要(evidence summarization)、证据提取(evidence extraction)和基于检索文档的推理(reasoning),并通过专门设计的合成阶段融合各代理输出,从而在答案生成前暴露多个中间证据视图,实现互补信息的对比与整合,显著提升模型在相关证据分散于多个检索片段场景下的性能表现。
链接: https://arxiv.org/abs/2604.18509
作者: Xingchen Xiao,Heyan Huang,Runheng Liu,Jincheng Xie
机构: Beijing Institute of Technology (北京理工大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: 19 pages
Abstract:Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, and combines their outputs through a dedicated synthesis stage to produce the final answer. This design exposes multiple intermediate evidence views, allowing the model to compare and integrate complementary information before answer generation. Experiments on four benchmarks show that MASS-RAG consistently improves performance over strong RAG baselines, particularly in settings where relevant evidence is distributed across retrieved contexts.
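按摘要描述的角色分工,该流程的控制流骨架大致如下(llm_call 为占位的假设函数,提示词亦仅为示意,实际需替换为真实 LLM 调用与论文中的提示模板):

```python
# 示意:MASS-RAG 角色化多智能体合成流程的结构骨架(占位实现,非官方代码)
def llm_call(role, prompt):
    # 占位:仅回显角色标签与输入前缀,便于展示控制流
    return f"[{role}] {prompt[:40]}"

def mass_rag(question, docs):
    context = "\n".join(docs)
    # 三个角色化代理分别产出中间证据视图
    summary = llm_call("summarizer", f"总结证据: {context}")
    evidence = llm_call("extractor", f"抽取与问题相关的证据片段: {context}")
    reasoning = llm_call("reasoner", f"基于文档推理: {question}\n{context}")
    # 合成阶段对比、整合互补信息后生成最终答案
    synthesis_prompt = f"问题: {question}\n摘要: {summary}\n证据: {evidence}\n推理: {reasoning}"
    return llm_call("synthesizer", synthesis_prompt)

answer = mass_rag("首都是哪里?", ["文档A: 法国的首都是巴黎。", "文档B: 巴黎位于塞纳河畔。"])
assert answer.startswith("[synthesizer]")
```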
[NLP-9] LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation ACL2026
【速读】: 该论文旨在解决现有机器翻译(Machine Translation, MT)评估框架在处理双语语言(diglossic languages)时的局限性问题,尤其是无法有效识别方言和文化特异性错误的问题。传统自动指标和人类评估方法(如多维质量度量 Multidimensional Quality Metrics, MQM)往往忽略语言变体、内容覆盖和语用适切性等深层因素,导致对阿拉伯语等双语语言翻译质量的评估不准确。解决方案的关键在于提出LQM(Linguistically Motivated Multidimensional Quality Metrics),这是一个基于语言学驱动的分层错误分类体系,涵盖六个层级:社会语言学、语用学、语义学、形态句法学、正字法和字形学(sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics)。通过构建包含7种阿拉伯语方言的双向平行语料库(共3,850句)并进行专家级细粒度标注(6,113个错误片段),LQM实现了对MT系统在不同语言层次上错误类型的精准诊断,并提供了可扩展至其他语言的通用评估框架。
链接: https://arxiv.org/abs/2604.18490
作者: Samar M. Magdy,Fakhraddin Alwajih,Abdellah El Mekki,Wesam El-Sayed,Muhammad Abdul-Mageed
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026; resources available at this https URL
Abstract:Existing MT evaluation frameworks, including automatic metrics and human evaluation schemes such as Multidimensional Quality Metrics (MQM), are largely language-agnostic. However, they often fail to capture dialect- and culture-specific errors in diglossic languages (e.g., Arabic), where translation failures stem from mismatches in language variety, content coverage, and pragmatic appropriateness rather than surface form. We introduce LQM: Linguistically Motivated Multidimensional Quality Metrics for MT. LQM is a hierarchical error taxonomy for diagnosing MT errors through six linguistically grounded levels: sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics (Figure 1). We construct a bidirectional parallel corpus of 3,850 sentences (550 per variety) spanning seven Arabic dialects (Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni), derived from conversational, culturally rich content. We evaluate six LLMs in a zero-shot setting and conduct expert span-level human annotation using LQM, producing 6,113 labeled error spans across 3,495 unique erroneous sentences, along with severity-weighted quality scores. We complement this analysis with an automatic metric (spBLEU). Though validated here on Arabic, LQM is a language-agnostic framework designed to be easily applied to or adapted for other languages. LQM annotated errors data, prompts, and annotation guidelines are publicly available at this https URL.
[NLP-10] Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints ICASSP2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在歌词到旋律生成任务中因监督微调(Supervised Fine-Tuning, SFT)导致的“约束违反”问题,即生成的旋律常出现节奏不合理、人声音域不适宜等音乐学上不可行的现象。解决方案的关键在于提出一种无需人工标注的对齐框架:首先基于规则定义音乐约束,自动从SFT模型输出中构建偏好数据集;随后通过两阶段优化策略进行对齐——先使用直接偏好优化(Direct Preference Optimization, DPO)处理成对偏好样本,再采用卡尼曼-特沃斯基优化(Kahneman-Tversky Optimization, KTO)处理未配对的负样本,从而有效提升模型生成旋律的音乐合理性和连贯性。
链接: https://arxiv.org/abs/2604.18489
作者: Hao Meng,Siyuan Zheng,Shuran Zhou,Qiangqiang Wang,Yang Song
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by IEEE ICASSP 2026
Abstract:Large Language Models (LLMs) show promise in lyric-to-melody generation, but models trained with Supervised Fine-Tuning (SFT) often produce musically implausible melodies with issues like poor rhythm and unsuitable vocal ranges, a phenomenon we term “constraint violation”. To address this, we propose a novel alignment framework that instills musical knowledge without human annotation. We define rule-based musical constraints to automatically generate a preference dataset from an SFT model’s outputs. The model is then aligned through a sequential process, first using Direct Preference Optimization (DPO) on paired preference data, followed by Kahneman-Tversky Optimization (KTO) on unpaired negative samples. Experimental results demonstrate that our aligned model substantially reduces rule violations and outperforms strong baselines in both objective and subjective evaluations, generating melodies with substantially improved musicality and coherence. An interactive demo with audio comparisons is available at this https URL.
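论文对齐流程的第一阶段是在成对偏好数据上做 DPO。下面给出标准 DPO 成对损失的极简数值草图,仅示意公式本身(logp_* 等入参形式与 beta 取值均为本文假设;第二阶段的 KTO 处理非成对负样本,此处未涉及):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO 成对损失:比较(策略 - 参考)在偏好/非偏好响应上的对数概率差
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # 即 -log(sigmoid(margin))

print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 边际为 0 时损失为 ln 2 ≈ 0.6931
```

当偏好响应相对参考模型的概率提升、非偏好响应下降时,margin 变大、损失趋近 0;这正是用自动规则构造的偏好对能把"约束违反"知识注入模型的机制。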
[NLP-11] Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
【速读】: 该论文旨在解决当前生成式AI模型在面对风格化对抗攻击时的安全性脆弱问题,即模型对非典型有害提示形式的防御能力不足。解决方案的关键在于构建Adversarial Humanities Benchmark(AHB),通过将原本明确的有害任务(来自MLCommons AILuminate)转化为人文风格的表述(如诗歌、故事等),在不改变意图的前提下实现目标隐藏和形式混淆,从而测试模型安全机制的泛化能力。实验结果显示,原始攻击的成功率为3.84%,而经过风格转换后的攻击成功率高达36.8%至65.0%,整体平均达到55.75%,表明现有安全防护策略在面对样式多样性攻击时存在显著失效风险,凸显出深度理解“非伤害性”原则仍是前沿模型安全的核心挑战。
链接: https://arxiv.org/abs/2604.18487
作者: Marcello Galisai,Susanna Cifani,Francesco Giarrusso,Piercosma Bisconti,Matteo Prandi,Federico Pierucci,Federico Sartore,Daniele Nardi
机构: DEXAI – Icaro Lab; Sapienza University of Rome; Sant’Anna School of Advanced Studies
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of ‘non-maleficence’ remains a central unresolved problem in frontier model safety.
[NLP-12] OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
【速读】: 该论文旨在解决基于视觉-语言-动作(VLA)架构的自动驾驶轨迹预测中,链式思维(Chain-of-Thought, CoT)推理因自回归特性导致延迟过高、难以实现实时部署的问题。现有隐式CoT方法通过将推理压缩到连续隐藏状态来降低延迟,但其性能始终落后于显式CoT,原因在于它们仅使用纯语言的潜在表示,未能捕捉驱动驾驶行为的因果动态。解决方案的关键在于提出OneVL框架——一个统一的VLA与世界模型(World Model)架构,其核心创新是引入双辅助解码器监督机制:一是语言解码器重建文本CoT,二是视觉世界模型解码器预测未来帧token,从而迫使潜在空间内化道路几何、智能体运动和环境变化的因果动力学。通过三阶段训练逐步对齐轨迹、语言与视觉目标,实现稳定联合优化;推理时丢弃辅助解码器并一次性预填充所有潜在token,达到仅输出答案的推理速度,同时在四个基准测试中首次超越显式CoT,验证了在语言与世界模型双重监督下更紧凑的压缩能生成更具泛化能力的表示。
链接: https://arxiv.org/abs/2604.18486
作者: Jinghui Lu,Jiayi Guan,Zhijian Huang,Jinlong Li,Guang Li,Lingdong Kong,Yingyan Li,Han Wang,Shaoqing Xu,Yuechen Luo,Fang Li,Chenxu Dang,Junli Wang,Tao Xu,Jing Wu,Jianhua Wu,Xiaoshuai Hao,Wen Zhang,Tianyi Jiang,Lingfeng Zhang,Lei Zhou,Yingbo Tang,Jie Wang,Yinfeng Gao,Xizhou Bu,Haochen Tian,Yihang Qiu,Feiyang Jia,Lin Liu,Yigu Ge,Hanbing Li,Yuannan Shen,Jianwei Cui,Hongwei Xie,Bing Wang,Haiyang Sun,Jingwei Zhao,Jiahui Huang,Pei Liu,Zeyu Zhu,Yuncheng Jiang,Zibin Guo,Chuhong Gong,Hanchao Leng,Kun Ma,Naiyang Wang,Guang Chen,Kuiyuan Yang,Hangjun Ye,Long Chen
机构: Xiaomi Embodied Intelligence Team (小米具身智能团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
备注: Technical Report; 49 pages, 22 figures, 10 tables; Project Page at this https URL
Abstract:Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: this https URL
[NLP-13] WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation
【速读】: 该论文旨在解决当前持久化记忆(persistent memory)在构建长期运行的智能体系统(agentic systems)中的瓶颈问题,特别是传统检索增强生成(Retrieval-Augmented Generation, RAG)方法因扁平向量存储导致的事实碎片化、跨会话身份丢失以及缺乏对事实更新(supersession)和矛盾(contradiction)的显式建模能力。其解决方案的核心是提出WorldDB,一个基于三个承诺的内存引擎:(i)每个节点均为“世界”(world),即具有自身子图、本体作用域及嵌入表示的容器,支持任意深度递归结构;(ii)节点内容地址化且不可变,任何编辑均生成新哈希值并沿祖先路径形成类似Merkle树的审计轨迹;(iii)边为写时程序(write-time programs),每类边附带on_insert/on_delete/on_query_rewrite处理逻辑,实现对事实覆盖、保留冲突或合并提议的语义控制,从而彻底消除原始追加路径(raw append path)。此设计显著提升了时间推理、知识更新与偏好合成等任务的准确率,相较现有最优方案(Hydra DB)提升5.61个百分点。
链接: https://arxiv.org/abs/2604.18478
作者: Harish Santhanalakshmi Ganesan
机构: Independent Researcher(独立研究员)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Persistent memory is the bottleneck separating stateless chatbots from long-running agentic systems. Retrieval-augmented generation (RAG) over flat vector stores fragments facts into chunks, loses cross-session identity, and has no first-class notion of supersession or contradiction. Recent bitemporal knowledge-graph systems (Graphiti, Memento, Hydra DB) add typed edges and valid-time metadata, but the graph itself remains flat: no recursive composition, no content-addressed invariants on nodes, and edge types carry no behavior beyond a label. We present WorldDB, a memory engine built on three commitments: (i) every node is a world – a container with its own interior subgraph, ontology scope, and composed embedding, recursive to arbitrary depth; (ii) nodes are content-addressed and immutable, so any edit produces a new hash at the node and every ancestor, giving a Merkle-style audit trail for free; (iii) edges are write-time programs – each edge type ships on_insert/on_delete/on_query_rewrite handlers (supersession closes validity, contradicts preserves both sides, same_as stages a merge proposal), so no raw append path exists. On LongMemEval-s (500 questions, ~115k-token conversational stacks), WorldDB with Claude Opus 4.7 as answerer achieves 96.40% overall / 97.11% task-averaged accuracy, a +5.61pp improvement over the previously reported Hydra DB state-of-the-art (90.79%) and +11.20pp over Supermemory (85.20%), with perfect single-session-assistant recall and robust performance on temporal reasoning (96.24%), knowledge update (98.72%), and preference synthesis (96.67%). Ablations show that the engine’s graph layer – resolver-unified entities and typed refers_to edges – contributes +7.0pp task-averaged independently of the underlying answerer.
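摘要中"内容寻址、不可变、任何编辑沿祖先路径产生新哈希"的 Merkle 式审计轨迹,可以用如下极简草图说明(node_hash 的序列化方式为本文假设,与 WorldDB 的实际实现无关):

```python
import hashlib
import json

def node_hash(content, child_hashes):
    # 内容寻址:节点哈希由自身内容与(排序后的)子节点哈希共同决定
    payload = json.dumps({"content": content, "children": sorted(child_hashes)},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# "编辑"即产生新节点;任一后代变化都会改变所有祖先的哈希,构成审计轨迹
leaf_v1 = node_hash("fact: Alice lives in Paris", [])
leaf_v2 = node_hash("fact: Alice lives in Lyon", [])    # 事实被更新后的新版本
world_v1 = node_hash("world: Alice", [leaf_v1])
world_v2 = node_hash("world: Alice", [leaf_v2])
assert leaf_v1 != leaf_v2 and world_v1 != world_v2       # 祖先哈希随之改变
assert world_v1 == node_hash("world: Alice", [leaf_v1])  # 相同内容总得到相同地址
```

由于节点不可变,旧版本的哈希仍可寻址,supersession 等类型化边只需关闭旧事实的有效期,而不必真正删除任何内容。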
[NLP-14] ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting
【速读】: 该论文旨在解决当前视觉-语言模型(Vision-Language Model, VLM)普遍参数量庞大、难以在资源受限场景(如边缘设备或独立机器人平台)中部署的问题,以及小数据集下训练轻量化模型的研究不足。其关键解决方案在于:首先,采用两塔编码器(two-tower encoder)架构,在低资源环境下相比单塔编码器在判别性英语任务上表现更优;其次,将传统卷积神经网络(Convolutional Neural Network, CNN)融入两塔Transformer结构,提升参数效率;最后,发现跨模态融合模块的形状和规模可灵活调整而不影响性能,从而实现模型压缩与性能保持的平衡。基于此,作者提出ESsEN——一个可端到端训练、参数极少但性能优异的紧凑型视觉-语言模型,显著降低了研究门槛并提升了模型可访问性。
链接: https://arxiv.org/abs/2604.18452
作者: Clayton Fields,Casey Kennington
机构: Boise State University (博伊西州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Vision-language modeling is rapidly increasing in popularity with an ever expanding list of available models. In most cases, these vision-language models have parameters in the tens of billions, which is necessary for some needs, but in many cases smaller models are necessary (e.g., on edge devices or independent robotic platforms). Unfortunately, there is little research in producing light-weight models or in training them with small datasets. Inspired by the language learning progression and data sparsity in child development, in this paper, we address both of these goals in a systematic fashion. We show that two-tower encoder models are superior to one-tower encoders in low-resource settings for discriminative English tasks. We show also that incorporating traditional convolutional networks into the two-tower transformer architecture can help produce parameter efficient vision-language models. Finally, we show that the cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. In addition, we present ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources that performs as well on several tasks with only a fraction of the parameters compared to other models. The experimental results and the tools we present here make vision-language modeling more accessible to a wider variety of researchers.
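摘要所说的"两塔编码器"指文本与视觉各自独立编码、只在末端通过相似度交互。下面用一个玩具例子示意该结构(embed 的均值池化与两张向量表均为本文假设的占位,真实模型中两塔为 Transformer/CNN 编码器):

```python
def embed(tokens, table):
    # 单塔编码的极简替身:对 token 向量取均值(真实系统为 Transformer/CNN 塔)
    vecs = [table[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def two_tower_score(text_emb, image_emb):
    # 双塔范式:两个模态独立编码,仅在末端以点积相似度交互
    return sum(a * b for a, b in zip(text_emb, image_emb))

# 玩具词表与"图像块"向量表,仅用于说明结构,均为本文假设
text_table = {"red": [1.0, 0.0], "ball": [0.0, 1.0]}
image_table = {"red_ball": [0.7, 0.7], "blue_sky": [-0.7, 0.7]}

t = embed(["red", "ball"], text_table)                 # [0.5, 0.5]
assert two_tower_score(t, image_table["red_ball"]) > \
       two_tower_score(t, image_table["blue_sky"])      # 匹配的图文对得分更高
```

与单塔(把图文拼在一起过同一个编码器)相比,这种末端交互的结构在低资源判别任务上正是论文报告占优的设置。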
[NLP-15] BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets Corpora and Resources ACL2026
【速读】: 该论文旨在解决印度语境下自然语言处理(Natural Language Processing, NLP)资源分散、覆盖不全的问题,尤其针对低资源语言和文化多样性不足的挑战。现有综述或局限于少数高资源语言,或将其纳入泛多语言框架中,未能系统整合印度本土语言的多样化需求。解决方案的关键在于首次构建了一个统一的印度NLP资源谱系,涵盖200多个数据集、50多个基准测试以及100多个模型、工具与系统,按语言现象、领域和模态进行组织,并深入分析标注、评估与模型设计趋势,从而为实现公平性、文化贴合性和可扩展性的NLP研究提供坚实基础。
链接: https://arxiv.org/abs/2604.18423
作者: Raghvendra Kumar,Devankar Raj,Sriparna Saha
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 (Main Conference)
Abstract:India’s linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties. To address this gap, we present the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks. We organize resources by linguistic phenomena, domains, and modalities; analyze trends in annotation, evaluation, and model design; and identify persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization. This survey offers a consolidated foundation for equitable, culturally grounded, and scalable NLP research in the Indian linguistic ecosystem.
[NLP-16] Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用思维链(Chain-of-Thought, CoT)推理时因生成冗长且错误响应而导致计算资源浪费的问题。现有方法多通过事前或事后决策来实现放弃输出(abstention),而本文提出一种动态的中途放弃策略,即在生成过程中的每个词元位置评估是否终止当前推理路径。其解决方案的关键在于将放弃行为建模为正则化强化学习框架中的显式动作,并引入一个可调节的放弃奖励参数以平衡计算开销与信息获取。理论分析表明,在价值函数低于该奖励阈值时选择放弃,相较于自然基线具有严格优势;同时,作者进一步推导出一种高效近似价值函数的方法,从而实现了可实践且性能优越的动态放弃机制。实验证明该方法在数学推理和毒性规避任务中显著提升了选择性准确率(selective accuracy)。
链接: https://arxiv.org/abs/2604.18419
作者: Hen Davidov,Nachshon Cohen,Oren Kalinsky,Yaron Fairstein,Guy Kushilevitz,Ram Yazdi,Patrick Rebeschini
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
Abstract:Large language models (LLMs) using chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
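论文的核心规则是:当价值函数低于放弃奖励阈值时即放弃当前推理轨迹。其决策逻辑可用如下草图示意(token_values 假设为每个词元位置上已得到的价值估计,价值函数的具体高效近似方法见原文):

```python
def should_abstain(value_estimate, abstention_reward):
    # 论文规则的简化形式:价值低于放弃奖励阈值时,放弃优于继续生成
    return value_estimate < abstention_reward

def generate_with_abstention(token_values, abstention_reward):
    # token_values: 各词元位置上"最终答案正确"的价值估计(假设已由近似方法得到)
    for t, v in enumerate(token_values):
        if should_abstain(v, abstention_reward):
            return ("abstain", t)    # 在第 t 个位置提前终止,省下剩余计算
    return ("answer", len(token_values))

# 价值持续走低的推理轨迹会被尽早截断
print(generate_with_abstention([0.9, 0.7, 0.4, 0.2], abstention_reward=0.5))
```

abstention_reward 同时充当计算与信息之间的权衡旋钮:阈值越高,越倾向于尽早放弃、节省算力。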
[NLP-17] StepPO: Step-Aligned Policy Optimization for Agent ic Reinforcement Learning
【速读】: 该论文旨在解决传统基于token级别的强化学习(Reinforcement Learning, RL)在训练大语言模型(Large Language Models, LLMs)作为通用智能体(General Agents)时的局限性问题。现有方法如RLHF(Reinforcement Learning from Human Feedback)和RLVR(Reinforcement Learning with Value Regression)主要聚焦于单轮次的token级对齐或推理增强,难以有效建模多轮交互场景下LLM代理的核心能力,例如决策制定与工具调用,并面临延迟奖励稀疏、上下文长度不一等挑战。为此,论文提出StepPO框架,其核心在于将传统的token级马尔可夫决策过程(Markov Decision Process, MDP)升级为step-level MDP,认为“step”而非“token”才是LLM代理行为的合理动作表示单位;进而设计step-level信用分配机制,使策略优化与奖励传播的粒度与代理决策步长一致,从而更准确地捕捉真实代理行为并提升其agentic能力。
链接: https://arxiv.org/abs/2604.18401
作者: Daoyu Wang,Qingchuan Li,Mingyue Cheng,Jie Ouyang,Shuo Yu,Qi Liu,Enhong Chen
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key systems designs required to realize step-level Agentic RL in practice and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
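StepPO 主张以"步"而非 token 作为信用分配的粒度。下面是一个示意性草图:组相对优势(GRPO 风格)按轨迹计算后,广播到该轨迹内的每个完整步骤(函数签名与数据结构均为本文假设,仅用于说明粒度上的差别):

```python
def step_level_advantages(group_rewards):
    # GRPO 风格的组相对优势:对同一任务采样的一组轨迹奖励做标准化
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5 if var > 0 else 1.0
    return [(r - mean) / std for r in group_rewards]

def assign_credit(trajectory_steps, advantage):
    # 关键区别:优势按"步"(思考 + 工具调用 + 观察)广播,而非按 token
    return [(step, advantage) for step in trajectory_steps]

rewards = [1.0, 0.0, 1.0, 0.0]                # 一组 4 条轨迹的终局奖励
advs = step_level_advantages(rewards)          # [1.0, -1.0, 1.0, -1.0]
steps = ["plan", "tool_call:search", "answer"] # 某条成功轨迹的 3 个步骤
print(assign_credit(steps, advs[0]))
```

在 token 级 MDP 中,上面每个步骤内的几百个 token 各自是一个"动作";step 级形式化把一整步视为一个动作,使奖励传播与智能体的真实决策粒度对齐。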
[NLP-18] AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment ACL2026
【速读】: 该论文旨在解决当前生成式AI在创造力评估中面临的挑战,即现有基于大语言模型(LLM)的上下文生成器普遍存在评估线索不足、叙事连贯性弱、风格多样性有限以及对创造性思维支持不足等问题。解决方案的关键在于提出AlphaContext——一个基于进化树结构的心理测量学上下文生成框架,其核心创新包括:1)HyperTree Outline Planner通过规则引导的超树结构实现自顶向下的层次化规划;2)基于蒙特卡洛树搜索(MCTS)的上下文生成器平衡全局结构与局部质量;3)基于MAP-Elites的进化优化器通过持续更新多样性-质量并重的精英群体,提升上下文的多样性和质量;4)评估引导的进化精炼模块模拟多样化虚拟参与者,回收低质上下文进行迭代优化。实验表明,AlphaContext在6项质量指标上平均优于对比方法8%。
链接: https://arxiv.org/abs/2604.18398
作者: Yixuan Wang,Yue Huang,Hong Qian,Yunzhao Wei,Yifei Ding,Wenkai Wang,Zhi Liu,Zhongjing Huang,Aimin Zhou,Jiajun Guo
机构: East China Normal University (华东师范大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026) Main Track
Abstract:Creativity has become a core competence in the era of LLMs and human-AI collaboration, underpinning innovation in real-world problem solving. Crucially, the systematic improvement of creativity necessitates scientifically valid assessment instruments. Psychometric research recognizes context-based assessment as an effective way to measure creative thinking. However, high-quality expert-designed contexts remain scarce. Existing LLM-based generators often struggle with insufficient assessment cues, weak narrative coherence, limited stylistic diversity, and poor support for creative thinking. To address these challenges, we propose AlphaContext, an evolutionary tree-based psychometric context generator for creativity assessment. First, the HyperTree Outline Planner formalizes expert-designed outlining as a rule-guided hypertree and performs top-down hierarchical planning. The MCTS-based Context Generator fills the outline via MCTS to balance global structure and local quality. Then, the Evolutionary Context Optimizer evolves contexts with MAP-Elites by repeatedly updating niche elites to jointly improve diversity and quality. Finally, the Assessment-Guided Evolution Refiner simulates virtual participants with diverse styles and recycles weak contexts for further evolution. Experiments show that AlphaContext yields an average improvement of 8% over competitive methods across 6 quality metrics.
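摘要中的进化优化器采用 MAP-Elites:为每个"生态位"(niche)维护一个精英,仅当新候选在其生态位内质量更高时才替换,从而同时保住多样性与质量。下面是该更新规则的极简草图(以(风格, 长度区间)作生态位键、以数值分数作质量,均为本文假设的示例):

```python
def map_elites_update(archive, candidate, niche_of, quality_of):
    # MAP-Elites:仅当候选在其所属生态位上质量更高时,才替换现任精英
    niche = niche_of(candidate)
    incumbent = archive.get(niche)
    if incumbent is None or quality_of(candidate) > quality_of(incumbent):
        archive[niche] = candidate
    return archive

archive = {}
niche_of = lambda c: (c["style"], c["length"] // 100)  # 生态位键:风格 + 长度区间
quality_of = lambda c: c["score"]
for c in [{"style": "mystery", "length": 250, "score": 0.6},
          {"style": "mystery", "length": 280, "score": 0.8},   # 同生态位且更优:替换
          {"style": "sci-fi",  "length": 120, "score": 0.5}]:  # 新生态位:直接进入
    map_elites_update(archive, c, niche_of, quality_of)
print(len(archive), archive[("mystery", 2)]["score"])  # → 2 0.8
```

被淘汰或质量偏弱的上下文并不直接丢弃,按摘要所述会被评估引导的精炼模块回收,进入下一轮进化。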
[NLP-19] River-LLM: Large Language Model Seamless Exit Based on KV Share ACL2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于解码器的大型语言模型(Large Language Models, LLMs)在采用早期退出(Early Exit)策略加速推理时,因 KV Cache 缺失导致效率受限的问题。现有方法如重计算或掩码机制要么引入显著延迟开销,要么造成严重精度损失,难以实现理论层减少与实际运行时间加速之间的匹配。其解决方案的关键在于提出 River-LLM,一个无需训练的框架,通过引入轻量级的 KV-Shared Exit River,在退出过程中自然生成并保留缺失的 KV 缓存状态,从而避免昂贵的恢复操作;同时利用解码块内部的状态转移相似性预测累积 KV 错误,指导精确的逐 token 退出决策,最终在数学推理和代码生成任务上实现 1.71 至 2.16 倍的实际加速,且保持高质量输出。
链接: https://arxiv.org/abs/2604.18396
作者: Yingtao Shen,An Zou
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026, 13pages, with appendix
Abstract:Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone’s missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves 1.71 to 2.16 times practical speedup while maintaining high generation quality.
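论文利用解码块内部的状态转移相似性来指导退出决策。下面给出一个高度简化的示意:当相邻层隐藏状态几乎不再变化时,判定后续层冗余、在该层退出(阈值与纯 Python 向量表示均为本文假设;River-LLM 实际还需 KV-Shared Exit River 解决退出后的 KV 缺失问题,此处未体现):

```python
def cosine(u, v):
    # 两个向量的余弦相似度
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def exit_layer(hidden_states, threshold=0.99):
    # 相邻解码层的隐藏状态几乎不再变化时,视后续层为冗余,提前退出
    for i in range(1, len(hidden_states)):
        if cosine(hidden_states[i - 1], hidden_states[i]) >= threshold:
            return i                  # 在第 i 层退出
    return len(hidden_states)         # 无可退出点,走完全部层

states = [[1.0, 0.0], [0.8, 0.6], [0.81, 0.59]]  # 假想的某 token 逐层隐藏状态
print(exit_layer(states))  # 状态在第 2 层后基本饱和,于是在第 2 层退出
```

这种逐 token 的退出决策正是"理论层数减少"能否兑现为"实际时钟加速"的关键:判断本身必须足够轻量。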
[NLP-20] Understanding the Prompt Sensitivity
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在输入提示(prompt)语义不变的情况下输出结果差异显著的问题,即提示敏感性(prompt sensitivity)问题。其核心挑战在于理解为何LLM对细微的、语义等价的提示变化表现出高度不稳定性,从而影响模型的可靠性和可预测性。解决方案的关键在于将LLM建模为多元函数并进行一阶泰勒展开(first-order Taylor expansion),通过引入柯西-施瓦茨不等式(Cauchy-Schwarz inequality)推导出两个语义保持提示间log概率差值的上界。分析表明,LLM内部并不像小型神经网络那样对相似输入进行聚类,而是将其分散分布,导致该上界过大且难以降至零,从而解释了提示敏感性的内在机制;同时,研究还揭示了特定类型的提示变体更易引发敏感性风险,并发现提示模板对logits的影响通常大于问题内容本身,进一步验证了该上界的实用性与现实相关性。
链接: https://arxiv.org/abs/2604.18389
作者: Yang Liu,Chenhui Chu
机构: Kyoto University (京都大学)
类目: Computation and Language (cs.CL)
备注: 27 pages, 16 figures
Abstract:Prompt sensitivity, which refers to how strongly the output of a large language model (LLM) depends on the exact wording of its input prompt, raises concerns among users about the LLM’s stability and reliability. In this work, we consider LLMs as multivariate functions and perform a first-order Taylor expansion, thereby analyzing the relationship between meaning-preserving prompts, their gradients, and the log probabilities of the model’s next token. We derive an upper bound on the difference between log probabilities using the Cauchy-Schwarz inequality. We show that LLMs do not internally cluster similar inputs like smaller neural networks do, but instead disperse them. This dispersing behavior leads to an excessively high upper bound on the difference of log probabilities between two meaning-preserving prompts, making it difficult to effectively reduce to 0. In our analysis, we also show which types of meaning-preserving prompt variants are more likely to introduce prompt sensitivity risks in LLMs. In addition, we demonstrate that the upper bound is strongly correlated with an existing prompt sensitivity metric, PromptSensiScore. Moreover, by analyzing the logit variance, we find that prompt templates typically exert a greater influence on logits than the questions themselves. Overall, our results provide a general interpretation for why current LLMs can be highly sensitive to prompts with the same meaning, offering crucial evidence for understanding the prompt sensitivity of LLMs. Code for experiments is available at this https URL.
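摘要所述"一阶泰勒展开 + 柯西-施瓦茨不等式"的思路可示意如下(符号为本文假设,x 与 x′ 表示两个语义保持提示的嵌入,y 为下一词元):

```latex
% 对下一词元 y 的对数概率在 x 处做一阶泰勒展开:
\log p_\theta(y \mid x') \approx \log p_\theta(y \mid x)
    + \nabla_x \log p_\theta(y \mid x) \cdot (x' - x)
% 由柯西-施瓦茨不等式得到对数概率差的上界:
\bigl| \log p_\theta(y \mid x') - \log p_\theta(y \mid x) \bigr|
    \lesssim \bigl\| \nabla_x \log p_\theta(y \mid x) \bigr\|_2
             \, \bigl\| x' - x \bigr\|_2
```

按论文结论,LLM 将相似输入在内部"分散"而非聚类,使得右侧的距离项 ‖x′ − x‖ 偏大,该上界难以压到 0,这正是提示敏感性的来源。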
[NLP-21] IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters ACL2026
【速读】: 该论文旨在解决对话系统在会话启动阶段(conversation initiation stage)面临的“首条消息障碍”问题,即用户常处于需求模糊但无明确查询意图的状态,导致对话难以发起。传统方法仅关注对话进行中的激活策略,忽视了冷启动场景下的引导需求。解决方案的关键在于提出 IceBreaker 框架,其核心是将人类破冰行为建模为两步握手机制:(i) 通过基于会话摘要的共鸣感知兴趣蒸馏(Resonance-Aware Interest Distillation)捕捉潜在触发兴趣点;(ii) 通过面向交互的起始句生成(Interaction-Oriented Starter Generation),结合个性化偏好对齐与自强化循环优化,从而最大化用户参与度。该方案已在全球最大的对话代理产品上通过在线A/B测试验证,显著提升用户活跃天数(+0.184%)和点击率(+9.425%)。
链接: https://arxiv.org/abs/2604.18375
作者: Hongwei Zheng,Weiqi Wu,Zhengjia Wang,Guanyu Jiang,Haoming Li,Tianyu Wu,Yongchun Zhu,Jingwu Chen,Feng Zhang
机构: ByteDance(字节跳动)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 Accepted Paper (Industry Track)
Abstract:Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation holds before it begins. To overcome this, we introduce Conversation Starter Generation: generating personalized starters to guide users into conversation. However, unlike in-conversation stages where immediate context guides the response, initiation must operate in a cold-start moment without explicit user intent. To pioneer in this direction, we present IceBreaker that frames human ice-breaking as a two-step handshake: (i) evoke resonance via Resonance-Aware Interest Distillation from session summaries to capture trigger interests, and (ii) stimulate interaction via Interaction-Oriented Starter Generation, optimized with personalized preference alignment and a self-reinforced loop to maximize engagement. Online A/B tests on one of the world’s largest conversational agent products show that IceBreaker improves user active days by +0.184% and click-through rate by +9.425%, and has been deployed in production.
[NLP-22] Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval ACL2026
【速读】: 该论文旨在解决当前基于对比语言-音频预训练(CLAP)的音频-文本检索系统在实际应用中评估不足的问题,即传统基准测试依赖于描述性标题(caption-style queries),与真实用户搜索行为存在显著差异,导致对模型鲁棒性的评估不充分。其解决方案的关键在于提出Omni-Embed-Audio(OEA)——一种基于具备原生音频理解能力的多模态大语言模型(Multimodal Large Language Models, MLLMs)构建的检索编码器,并引入用户意图查询(User-Intent Queries, UIQs)这一新型评估框架,涵盖问题、指令、关键词标签、同义改写和排除型负查询五类自然搜索形式;同时设计硬负样本挖掘流程及判别指标(HNSR、TFR),用于量化模型区分声学相似干扰项的能力。实验表明,OEA在保持与M2D-CLAP相当的文本到音频检索性能基础上,在文本到文本检索(相对提升+22%)和硬负样本判别能力(HNSR@10提升4.3个百分点,TFR@10提升34.7%)方面表现突出,验证了LLM骨干网络在复杂查询语义理解上的优势。
链接: https://arxiv.org/abs/2604.18360
作者: HaeJun Yoo,Yongseop Shin,Insung Lee,Myoung-Wan Koo,Du-Seong Chang
机构: Sogang University (西江大学)
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Main Conference. Camera-ready version
Abstract:Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs) - five formulations reflecting natural search behaviors: questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose discrimination metrics (HNSR, TFR) assessing models’ ability to suppress acoustically similar distractors. Experiments on AudioCaps, Clotho, and MECAT show that OEA achieves comparable text-to-audio retrieval performance to state-of-the-art M2D-CLAP, while demonstrating clear advantages in two critical areas: (1) dominant text-to-text retrieval (+22% relative improvement), and (2) substantially superior hard negative discrimination (+4.3%p HNSR@10, +34.7% relative TFR@10), revealing that LLM backbones provide superior semantic understanding of complex queries.
[NLP-23] ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship
【速读】: 该论文旨在解决当前 empathetic dialogue generation(共情对话生成)中响应形式单一、内容缺乏多样性与实质性支持的问题,尤其在满足不同用户需求和情境下的社会支持(social support)方面存在明显不足。其解决方案的关键在于引入外部工具(external tools)以增强智能体的行动能力,使其不仅能理解情绪,还能通过调用模拟多媒体应用的十余种以用户为中心的工具来执行多样化的社会支持行为,从而提供更贴近人类互动的实质性陪伴。研究进一步构建了首个基于大语言模型(LLM)的个性化社会支持基准 ComPASS-Bench,并基于此对 Qwen3-8B 模型进行微调,得到任务特定的 ComPASS-Qwen 模型,实验证明工具增强型响应显著优于直接生成共情对话,在整体性能上接近多个大规模模型。
链接: https://arxiv.org/abs/2604.18356
作者: Zhaopei Huang,Yanfeng Jia,Jiayi Zhao,Xinjie Zhang,Wenxuan Wang,Qin Jin
机构: Renmin University of China (中国人民大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content, struggling to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. Grounded in the psychological concept of “social support”, this paradigm delivers substantive, human-like companionship. Specifically, we first design a dozen user-centric tools simulating various multimedia applications, which can cover different types of social support behaviors in human-agent interaction scenarios. We then construct ComPASS-Bench, the first personalized social support benchmark for LLM-based agents, via multi-step automated synthesis and manual refinement. Based on ComPASS-Bench, we further synthesize tool use records to fine-tune the Qwen3-8B model, yielding a task-specific ComPASS-Qwen. Comprehensive evaluations across two settings reveal that while the evaluated LLMs can generate valid tool-calling requests with high success rates, significant gaps remain in final response quality. Moreover, tool-augmented responses achieve better overall performance than directly producing conversational empathy. Notably, our trained ComPASS-Qwen demonstrates substantial improvements over its base model, achieving comparable performance to several large-scale models. Our code and data are available at this https URL.
[NLP-24] PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues ACL
Quick Read: This paper addresses the shortcomings of existing negotiation dialogue systems in recognizing and responding to emotions, in particular their lack of interpretability, which undermines trust and cooperation in real human-machine settings. The key contribution is an Emotion-aware Negotiation Strategy-informed Chain-of-Thought (ENS-CoT) mechanism that mimics how human negotiators perceive, understand, use, and manage emotions, making emotionally intelligent responses interpretable. On top of ENS-CoT, the authors curate two new datasets (JobNego and ResNego) and develop the PRISMA system by combining self-training augmentation with Direct Preference Optimization (DPO), substantially improving the emotional appropriateness, interpretability, and overall effectiveness of negotiation responses.
Link: https://arxiv.org/abs/2604.18354
Authors: Prajwal Vijay Kajare, Priyanshu Priya, Bikash Santra, Asif Ekbal
Institutions: Indian Institute of Technology Patna; Indian Institute of Technology Jodhpur
Subjects: Computation and Language (cs.CL)
Comments: 10 pages + appendix (23 pages total), paper accepted at ACL (Main) 2026
Abstract:Emotion plays a pivotal role in shaping negotiation outcomes, influencing trust, cooperation, and long-term relationships. Developing negotiation dialog systems that can recognize and respond strategically to emotions is, therefore, essential to create more effective human-centered interactions. Beyond generating emotionally appropriate responses, interpretability - understanding how a system generates a particular emotion-aware response, is critical for fostering reliability and building rapport. Driven by these aspects, in this work, we introduce PRISMA, an interpretable emotionally intelligent negotiation dialogue system targeting two application domains, viz. job interviews and resource allocation. To enable interpretability, we propose an Emotion-aware Negotiation Strategy-informed Chain-of-Thought (ENS-CoT) reasoning mechanism, which mimics human negotiation by perceiving, understanding, using, and managing emotions. Leveraging ENS-CoT, we curate two new datasets: JobNego (for job interview negotiation) and ResNego (for resource allocation negotiation). We then leverage these datasets to develop PRISMA by augmenting self-training with Direct Preference Optimization (DPO), guiding agents toward more accurate, interpretable, and emotionally appropriate negotiation responses. Automatic and human evaluation on JobNego and ResNego datasets demonstrate that PRISMA substantially enhances interpretability and generates appropriate emotion-aware responses, while improving overall negotiation effectiveness.
[NLP-25] HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents ACL2026
Quick Read: This paper tackles the inefficiency of memory systems for long-term conversational LLM agents: retrieval based purely on vector similarity tends to produce bloated evidence sets, where adding superficially similar turns yields little extra recall but lowers precision, inflates answer-stage context cost, and makes memories hard to manage. The proposed HiGMem (Hierarchical and LLM-Guided Memory System) uses a two-level event-turn memory in which LLM-generated event summaries act as semantic anchors: the model first screens high-level event summaries and then focuses on the most relevant concrete turns, obtaining a concise and reliable evidence set through reasoning while keeping retrieval overhead low. On the LoCoMo10 benchmark, HiGMem outperforms existing methods.
Link: https://arxiv.org/abs/2604.18349
Authors: Shuqi Cao (1), Jingyi He (2), Fei Tan (1) ((1) East China Normal University, Shanghai, China, (2) Shanghai Jiao Tong University, Shanghai, China)
Institutions: East China Normal University; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2026. Camera-ready version. 10 pages, 2 figures. Code: this https URL
Abstract:Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. It tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases answer-stage context cost, and makes retrieved memories harder to inspect and manage. To address this, we propose HiGMem (Hierarchical and LLM-Guided Memory System), a two-level event-turn memory system that allows LLMs to use event summaries as semantic anchors to predict which related turns are worth reading. This allows the model to inspect high-level event summaries first and then focus on a smaller set of potentially useful turns, providing a concise and reliable evidence set through reasoning, while avoiding the retrieval overhead that would be excessively high compared to vector retrieval. On the LoCoMo10 benchmark, HiGMem achieves the best F1 on four of five question categories and improves adversarial F1 from 0.54 to 0.78 over A-Mem, while retrieving an order of magnitude fewer turns. Code is publicly available at this https URL.
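To make the two-level retrieval idea concrete, here is a minimal sketch of event-then-turn lookup; the event structure, toy embeddings, and scoring below are hypothetical stand-ins, not HiGMem's actual prompts or implementation:

```python
# Two-level retrieval sketch: rank event summaries first, then only
# inspect the turns inside the selected events (all data is toy data).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, events, top_events=1, top_turns=2):
    """events: list of dicts with a summary embedding and per-turn embeddings."""
    # Level 1: rank high-level event summaries (semantic anchors).
    ranked = sorted(events, key=lambda e: dot(query_vec, e["summary_vec"]),
                    reverse=True)[:top_events]
    # Level 2: score only the turns inside the selected events.
    candidates = [(dot(query_vec, vec), text)
                  for e in ranked for text, vec in e["turns"]]
    candidates.sort(reverse=True)
    return [text for _, text in candidates[:top_turns]]

events = [
    {"summary_vec": [1.0, 0.0],
     "turns": [("booked a flight to Tokyo", [0.9, 0.1]),
               ("asked about hotel prices", [0.8, 0.2])]},
    {"summary_vec": [0.0, 1.0],
     "turns": [("talked about a new puppy", [0.1, 0.9])]},
]
print(retrieve([1.0, 0.0], events))
```

The point of the hierarchy is that turns in unrelated events are never scored at all, which is where the order-of-magnitude retrieval savings reported above would come from.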
[NLP-26] Multilingual Training and Evaluation Resources for Vision-Language Models
Quick Read: This paper addresses two core limitations in Vision Language Model (VLM) development: the lack of multilingual, multimodal training datasets and the scarcity of comprehensive evaluation benchmarks across languages. The solution is a regeneration-translation paradigm that combines synthetic regeneration with permissively licensed generative models and manual annotation to build high-quality cross-lingual resources. Concretely, the authors create the Multi-PixMo training corpus by regenerating examples from existing PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k), and translate widely used English evaluation sets (MMbench, ScienceQA, MME, POPE, AI2D) into five European languages (English, French, German, Italian, and Spanish) to form multilingual benchmarks. Experiments show that multilingual, multimodal training data clearly improves performance on non-English benchmarks, with positive transfer to English tasks as well.
Link: https://arxiv.org/abs/2604.18347
Authors: Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLMs training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets with permissively licensed models: PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, with respect to English only, in VLMs training. Experiments comprising 3 different models show that using multilingual, multimodal examples for training VLMs is consistently beneficial on non-English benchmarks, with positive transfer to English as well.
[NLP-27] FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction ACL2026 SEMEVAL-2026
Quick Read: This paper targets the content effect in syllogistic validity prediction: LLM judgments of logical validity are easily swayed by real-world beliefs. The proposed hybrid neuro-symbolic system, FregeLogic, ensembles five LLM classifiers (built from three open-weight models, Llama 4 Maverick, Llama 4 Scout, and Qwen3-32B, paired with varied prompting strategies) and treats disagreement among them as a signal of likely content bias; disputed cases are deferred to a Z3 SMT solver that acts as a structurally grounded formal-logic tiebreaker. The system reaches 94.3% accuracy with a content effect of only 2.85, a 2.76-point combined-score improvement over the pure ensemble, showing that applying formal methods precisely where ensemble consensus is lowest improves the trade-off between accuracy and content-independence.
Link: https://arxiv.org/abs/2604.18328
Authors: Adewale Akinfaderin, Nafi Diallo
Institutions: Amazon Web Services
Subjects: Computation and Language (cs.CL)
Comments: Camera-ready version to appear at The 20th International Workshop on Semantic Evaluation (SemEval-2026), ACL 2026
Abstract:We present FregeLogic, a hybrid neuro-symbolic system for SemEval-2026 Task 11 (Subtask 1), which addresses syllogistic validity prediction while reducing content effects on predictions. Our approach combines an ensemble of five LLM classifiers, spanning three open-weights models (Llama 4 Maverick, Llama 4 Scout, and Qwen3-32B) paired with varied prompting strategies, with a Z3 SMT solver that serves as a formal logic tiebreaker. The central hypothesis is that LLM disagreement within the ensemble signals likely content-biased errors, where real-world believability interferes with logical judgment. By deferring to Z3’s structurally-grounded formal verification on these disputed cases, our system achieves 94.3% accuracy with a content effect of 2.85 and a combined score of 41.88 in nested 5-fold cross-validation on the dataset (N=960). This represents a 2.76-point improvement in combined score over the pure ensemble (39.12), with a 0.9% accuracy gain, driven by a 16% reduction in content effect (3.39 to 2.85). Adopting structured-output API calls for Z3 extraction reduced failure rates from ~22% to near zero, and an Aristotelian encoding with existence axioms was validated against task annotations. Our results suggest that targeted neuro-symbolic integration, applying formal methods precisely where ensemble consensus is lowest, can improve the combined accuracy-plus-content-effect metric used by this task.
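The disagreement-then-defer mechanism can be sketched as follows. A brute-force model checker for monadic logic stands in for the paper's Z3 solver (it omits the existence axioms mentioned in the abstract, and the vote format is hypothetical); with three terms only the 2^3 = 8 "element types" matter, so enumerating 2^8 candidate models is exhaustive:

```python
from itertools import product

def holds(stmt, model):
    """stmt = (quantifier, A, B); model = set of inhabited 'element types',
    each type being the frozenset of terms it satisfies."""
    q, a, b = stmt
    a_types = [t for t in model if a in t]
    if q == "all":      return all(b in t for t in a_types)
    if q == "no":       return all(b not in t for t in a_types)
    if q == "some":     return any(b in t for t in a_types)
    if q == "some_not": return any(b not in t for t in a_types)
    raise ValueError(q)

def syllogism_valid(premises, conclusion, terms=("S", "M", "P")):
    # Monadic logic has the finite-model property: with 3 unary predicates
    # only 2**3 element types matter, so 2**8 candidate models suffice.
    types = [frozenset(t for t, on in zip(terms, bits) if on)
             for bits in product((0, 1), repeat=len(terms))]
    for mask in product((0, 1), repeat=len(types)):
        model = {t for t, on in zip(types, mask) if on}
        if all(holds(p, model) for p in premises) and not holds(conclusion, model):
            return False  # counter-model found
    return True

def predict(ensemble_votes, premises, conclusion):
    # Unanimous ensembles are trusted; disagreement (a likely
    # content-biased case) is deferred to the formal checker.
    if len(set(ensemble_votes)) == 1:
        return ensemble_votes[0]
    return syllogism_valid(premises, conclusion)

# Barbara (valid): All M are P, All S are M |- All S are P
print(predict([True, False, True],
              [("all", "M", "P"), ("all", "S", "M")], ("all", "S", "P")))
```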
[NLP-28] PARM: Pipeline-Adapted Reward Model
Quick Read: This paper addresses the inconsistency between reward model (RM) predictions and actual execution outcomes in multi-stage LLM pipelines, which is especially pronounced in tasks such as code generation for combinatorial optimization. Reward models designed for single-step generation fail to reflect end-to-end feedback across a multi-stage reasoning flow. The proposed Pipeline-Adapted Reward Model (PARM) leverages pipeline-specific data together with Direct Preference Optimization (DPO) to align the reward model with downstream execution feedback, improving the quality and stability of the full pipeline's outputs.
Link: https://arxiv.org/abs/2604.18327
Authors: Xingyu Fan, Wei Shao, Jiacheng Liu, Linqi Song, Pheng Ann Heng
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, powering RLHF and advanced decoding strategies. While most prior work focuses on single-step generation, real-world applications increasingly adopt multi-stage LLM pipelines, where effective reward guidance remains underexplored. We investigate this through code generation for combinatorial optimization, constructing a pipeline that integrates reward models into both formulation and solution stages. We identify a critical challenge: inconsistency between reward model predictions and actual pipeline execution outcomes. To address this, we propose the Pipeline-Adapted Reward Model (PARM), which leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. We instantiate PARM as a two-stage pipeline (formulation, then code generation) and evaluate it on four public optimization benchmarks, measuring execution rate and solving accuracy against baselines and sampling methods. A supplementary cross-domain experiment on GSM8K assesses transferability. Results demonstrate that PARM consistently improves pipeline output quality and stability, providing new insights into reward modeling for multi-stage LLM reasoning.
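PARM builds on Direct Preference Optimization; the standard DPO pairwise loss it relies on can be written out directly. The β value below is an arbitrary illustration, and the pipeline-specific preference construction is not shown:

```python
import math

def dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective for one (chosen w, rejected l) pair:
    -log sigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l)))."""
    margin = beta * ((pol_logp_w - ref_logp_w) - (pol_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# No preference signal yet (policy == reference): loss is exactly log 2.
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 4))  # 0.6931
```

As the policy raises the chosen response's log-probability relative to the rejected one, the margin grows and the loss falls toward zero.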
[NLP-29] On the Importance and Evaluation of Narrativity in Natural Language AI Explanations
Quick Read: This paper argues that most current explainable AI (XAI) outputs are static feature-importance lists lacking causal mechanisms and semantic coherence: they say what influenced a prediction but not why it occurred. Drawing on the social sciences and linguistics, the authors propose narratives as the form XAI explanations should take, characterized by four properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. They introduce seven automatic metrics along these four dimensions that separate merely descriptive explanations from genuinely narrative ones, and propose a set of problem-agnostic XAI Narrative generation rules for producing natural-language explanations consistent with findings from the linguistic and social-science literature, pushing XAI beyond feature ranking toward causal reasoning aligned with human cognition.
Link: https://arxiv.org/abs/2604.18311
Authors: Mateusz Cedro, David Martens
Institutions: University of Antwerp
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 30 pages, 7 figures, 9 tables
Abstract:Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In this study, we draw on insights from social sciences and linguistics, and argue that XAI explanations should be presented in the form of narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to further advance the field, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with the findings from the linguistic and social science literature.
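As one illustration of why token-frequency metrics can be fooled while lexical diversity helps, even a simple type-token ratio already penalizes tautological text; this toy signal is far simpler than the paper's seven metrics:

```python
def type_token_ratio(text):
    """Fraction of distinct word types among tokens: a crude
    lexical-diversity signal (whitespace tokenization, no lemmatizing)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

tautology = "the model predicts this because the model predicts this"
narrative = "income rose while debt fell so the risk score improved"
print(type_token_ratio(tautology) < type_token_ratio(narrative))  # True
```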
[NLP-30] Reasoning Models Know What's Important and Encode It in Their Activations
Quick Read: This paper asks how to identify the crucial steps when language models solve complex tasks by generating long reasoning chains. The core challenge is telling which steps matter for the final answer; approaches based on surface features of the chain's tokens (such as a step's relative position or length) cannot capture intrinsic importance. The key finding is that model activations, rather than tokens alone, encode an internal representation of step importance, even before subsequent steps are generated. This representation generalizes across models, is distributed across layers, and does not correlate with surface-level features, suggesting that analyzing internal activations reveals aspects of reasoning that surface-level approaches fundamentally miss.
Link: https://arxiv.org/abs/2604.18307
Authors: Yaniv Nikankin, Martin Tutek, Tomer Ashuach, Jonathan Rosenfeld, Yonatan Belinkov
Institutions: Technion; University of Zagreb, FER; MIT; Kempner Institute, Harvard
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. This internal representation of importance generalizes across models, is distributed across layers, and does not correlate with surface-level features, such as a step’s relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.
[NLP-31] Exploring Concreteness Through a Figurative Lens ACL2026
Quick Read: This paper investigates how LLMs internally represent the concreteness of word meaning and how it shifts in figurative contexts such as metaphor, where static concreteness ratings fall short: the same noun can be far more abstract in a figurative use than in a literal one. Through layer-wise and geometric analyses across four model families, the authors show that LLMs separate literal from figurative uses of the same noun in early layers, while mid-to-late layers compress concreteness into a one-dimensional direction that is consistent across models. This geometric structure is practically useful: a single concreteness direction supports efficient figurative-language classification and enables training-free steering of generation toward more literal or more figurative rewrites.
Link: https://arxiv.org/abs/2604.18296
Authors: Saptarshi Ghosh, Tianyu Jiang
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026
Abstract:Static concreteness ratings are widely used in NLP, yet a word’s concreteness can shift with context, especially in figurative language such as metaphor, where common concrete nouns can take abstract interpretations. While such shifts are evident from context, it remains unclear how LLMs understand concreteness internally. We conduct a layer-wise and geometric analysis of LLM hidden representations across four model families, examining how models distinguish literal vs figurative uses of the same noun and how concreteness is organized in representation space. We find that LLMs separate literal and figurative usage in early layers, and that mid-to-late layers compress concreteness into a one-dimensional direction that is consistent across models. Finally, we show that this geometric structure is practically useful: a single concreteness direction supports efficient figurative-language classification and enables training-free steering of generation toward more literal or more figurative rewrites.
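The one-dimensional geometry makes training-free steering almost trivial to express; in this sketch the direction and hidden states are toy vectors, not the learned direction from the paper:

```python
def concreteness_score(hidden, direction):
    """Project a hidden state onto a (hypothetical) concreteness direction."""
    return sum(h * d for h, d in zip(hidden, direction))

def steer(hidden, direction, alpha):
    """Training-free steering: shift the state along the direction.
    The sign convention (which end is 'literal') is an assumption here."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

direction = [0.6, -0.8, 0.0]     # toy unit vector standing in for the
hidden = [0.2, 0.5, 1.0]         # learned concreteness direction
before = concreteness_score(hidden, direction)
after = concreteness_score(steer(hidden, direction, 2.0), direction)
print(round(after - before, 6))  # 2.0: alpha times ||direction||^2 = 1
```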
[NLP-32] An Existence Proof for Neural Language Models That Can Explain Garden-Path Effects via Surprisal ACL2026
Quick Read: This paper asks whether surprisal theory truly fails on garden-path sentences. Surprisal derived from current neural language models (LMs) underestimates the reading slowdowns humans show on syntactically ambiguous sentences, which has been taken as evidence that such processing difficulty cannot be reduced to surprisal; the authors question whether the gap instead reflects a difference between model and human next-word prediction rather than a flaw in surprisal theory itself. Their key move is to fine-tune existing neural LMs on garden-path sentences so that surprisal-based reading-time estimates better match actual human reading times. The fine-tuned models do not overfit: they capture human slowdowns on held-out garden-path items, improve predictive power for reading times on naturalistic corpora, and retain general LM ability, providing an existence proof that a neural LM can explain both garden-path effects and naturalistic reading times via surprisal, while raising the question of what evidence could falsify surprisal theory.
Link: https://arxiv.org/abs/2604.18293
Authors: Ryo Yoshida, Shinnosuke Isono, Taiga Someya, Yohei Oseki, Tatsuki Kuribayashi
Institutions: The University of Tokyo; NINJAL; NII LLMC; MBZUAI
Subjects: Computation and Language (cs.CL)
Comments: To appear in ACL 2026
Abstract:Surprisal theory hypothesizes that the difficulty of human sentence processing increases linearly with surprisal, the negative log-probability of a word given its context. Computational psycholinguistics has tested this hypothesis using language models (LMs) as proxies for human prediction. While surprisal derived from recent neural LMs generally captures human processing difficulty on naturalistic corpora that predominantly consist of simple sentences, it severely underestimates processing difficulty on sentences that require syntactic disambiguation (garden-path effects). This leads to the claim that the processing difficulty of such sentences cannot be reduced to surprisal, although it remains possible that neural LMs simply differ from humans in next-word prediction. In this paper, we investigate whether it is truly impossible to construct a neural LM that can explain garden-path effects via surprisal. Specifically, instead of evaluating off-the-shelf neural LMs, we fine-tune these LMs on garden-path sentences so as to better align surprisal-based reading-time estimates with actual human reading times. Our results show that fine-tuned LMs do not overfit and successfully capture human reading slowdowns on held-out garden-path items; they even improve predictive power for human reading times on naturalistic corpora and preserve their general LM capabilities. These results provide an existence proof for a neural LM that can explain both garden-path effects and naturalistic reading times via surprisal, but also raise a theoretical question: what kind of evidence can truly falsify surprisal theory?
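The quantity at the heart of surprisal theory is easy to state in code; the linking-function parameters below are invented for illustration, not estimated from reading-time data:

```python
import math

def surprisal_bits(prob):
    """Surprisal of a word given its context: -log2 P(word | context)."""
    return -math.log2(prob)

def predicted_rt_ms(prob, intercept=250.0, slope_ms_per_bit=15.0):
    """Surprisal theory's linear linking hypothesis: reading time grows
    linearly with surprisal (intercept/slope here are made-up values)."""
    return intercept + slope_ms_per_bit * surprisal_bits(prob)

print(surprisal_bits(0.5))        # 1.0 bit for a 50/50 word
print(predicted_rt_ms(0.015625))  # 340.0 ms for a 1-in-64 word (6 bits)
```

A garden-path disambiguating word with very low conditional probability gets a large surprisal, and the linear link predicts a correspondingly long reading time.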
[NLP-33] Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Quick Read: This paper targets two core obstacles to training general-purpose agents: the lack of realistic, scalable environments that capture complex interaction, and the absence of principled mechanisms for continual learning and capability iteration. The proposed Agent-World is a self-evolving training arena built around two complementary components: (1) Agentic Environment-Task Discovery, which autonomously explores thousands of real-world themed databases and executable tool ecosystems and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with dynamic task synthesis that automatically identifies capability gaps and drives targeted learning, so that agent policies and environments co-evolve. Across 23 challenging benchmarks, Agent-World models clearly outperform strong proprietary models and environment-scaling baselines, offering a scalable path toward agents with long-term adaptive capability.
Link: https://arxiv.org/abs/2604.18292
Authors: Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu, Shijue Huang, Zhenyu Li, Yang Zhao, Xiaoshuai Song, Xiaoxi Li, Jiajie Jin, Yutao Zhu, Hanbin Wang, Fangyu Lei, Qinyu Luo, Mingyang Chen, Zehui Chen, Jiazhan Feng, Ji-Rong Wen, Zhicheng Dou
Institutions: Renmin University of China; ByteDance Seed
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Working in progress
Abstract:Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present Agent-World, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperform strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.
[NLP-34] Where Do Self-Supervised Speech Models Become Unfair?
Quick Read: This paper examines the unfairness of pretrained self-supervised speech models (S3Ms) across speaker groups (SGs), i.e., the fact that some groups are modeled markedly better than others. Its key contribution is the first layer-wise fairness analysis of S3Ms, probing every embedding layer for speaker identification (SID) and automatic speech recognition (ASR). Three findings emerge: (1) S3Ms produce SG-biased embeddings starting from the very first latent layers; (2) SID and ASR show opposite layer-wise bias trends: SID bias is minimized in the layers that minimize overall SID error, whereas ASR bias is maximized in the layers that minimize overall ASR error; and (3) the inverse bias/error relationship for ASR persists in models fine-tuned for ASR, suggesting that SG-level bias is established during pretraining and is difficult to remove through downstream fine-tuning.
Link: https://arxiv.org/abs/2604.18249
Authors: Felix Herron, Maja Hjuler, Solange Rossato, Alexandre Allauzen, François Portet
Institutions: MILES Team, LAMSADE, Université Paris Dauphine-PSL; GETALP Team, LIG, Université Grenoble Alpes; Queensland University of Technology
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Speech encoder models are known to model members of some speaker groups (SGs) better than others. However, there has been little work in establishing why this occurs on a technological level. To our knowledge, we present the first layerwise fairness analysis of pretrained self-supervised speech encoder models (S3Ms), probing each embedding layer for speaker identification (SID) and automatic speech recognition (ASR). We find S3Ms produce embeddings biased against certain SGs for both tasks, starting at the very first latent layers. Furthermore, we find opposite patterns of layerwise bias for SID vs ASR for all models in our study: SID bias is minimized in layers that minimize overall SID error; on the other hand, ASR bias is maximized in layers that minimize overall ASR error. The inverse bias/error relationship for ASR is unaffected when probing S3Ms that are finetuned for ASR, suggesting SG-level bias is established during pretraining and is difficult to remove.
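The opposite bias/error patterns can be illustrated with a toy layer-wise scan; the error numbers below are invented, and max-minus-min group error is only one simple fairness proxy:

```python
def overall_error(group_errors):
    """Mean error rate across speaker groups for one layer."""
    return sum(group_errors.values()) / len(group_errors)

def bias_gap(group_errors):
    """Max-minus-min error across speaker groups: a crude unfairness proxy."""
    return max(group_errors.values()) - min(group_errors.values())

# Toy per-layer error rates for two speaker groups (illustrative only):
# deeper layers reduce overall error while widening the group gap.
layers = [
    {"g1": 0.30, "g2": 0.40},  # layer 0
    {"g1": 0.20, "g2": 0.36},  # layer 1
    {"g1": 0.10, "g2": 0.34},  # layer 2
]
best_err = min(range(len(layers)), key=lambda i: overall_error(layers[i]))
best_fair = min(range(len(layers)), key=lambda i: bias_gap(layers[i]))
print(best_err, best_fair)  # 2 0 -- the best-error and fairest layers differ
```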
[NLP-35] Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
Quick Read: This paper addresses the limits of current open-source prompt-injection detection: regular-expression matching misses paraphrased attacks, while fine-tuned transformer classifiers degrade sharply against adaptive adversaries. The key idea is to import detection mechanisms from other disciplines: seven techniques each port one mechanism from forensic linguistics, materials-science fatigue analysis, network-security deception technology, bioinformatics local-sequence alignment, economic mechanism design, epidemiological spectral signal analysis, and compiler taint tracking. Three of the seven are implemented and evaluated in prompt-shield v0.4.1: the local-alignment detector lifts F1 on the deepset dataset from 0.033 to 0.378 with zero additional false positives, the stylometric detector adds 11.1 F1 points on an indirect-injection benchmark, and the fatigue tracker is validated via a probing-campaign test, showing that cross-domain methods can improve detection robustness and generalization.
Link: https://arxiv.org/abs/2604.18248
Authors: Thamilvendhan Munirathinam
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 16 pages, 1 table, 25 references. Code: this http URL
Abstract:Current open-source prompt-injection detectors converge on two architectural choices: regular-expression pattern matching and fine-tuned transformer classifiers. Both share failure modes that recent work has made concrete. Regular expressions miss paraphrased attacks. Fine-tuned classifiers are vulnerable to adaptive adversaries: a 2025 NAACL Findings study reported that eight published indirect-injection defenses were bypassed with greater than fifty percent attack success rates under adaptive attacks. This work proposes seven detection techniques that each port a specific mechanism from a discipline outside large-language-model security: forensic linguistics, materials-science fatigue analysis, deception technology from network security, local-sequence alignment from bioinformatics, mechanism design from economics, spectral signal analysis from epidemiology, and taint tracking from compiler theory. Three of the seven techniques are implemented in the prompt-shield v0.4.1 release (Apache 2.0) and evaluated in a four-configuration ablation across six datasets including deepset/prompt-injections, NotInject, LLMail-Inject, AgentHarm, and AgentDojo. The local-alignment detector lifts F1 on deepset from 0.033 to 0.378 with zero additional false positives. The stylometric detector adds 11.1 percentage points of F1 on an indirect-injection benchmark. The fatigue tracker is validated via a probing-campaign integration test. All code, data, and reproduction scripts are released under Apache 2.0.
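A character-level Smith-Waterman scorer shows how local alignment can catch fuzzy matches of known attack phrases that regexes miss; the scoring parameters and attack string are illustrative, not prompt-shield's actual configuration:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between strings a and b,
    using two rolling rows instead of the full DP matrix."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores are clamped at 0 so a bad region
            # never drags down a later good region.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

attack = "ignore previous instructions"
print(smith_waterman("please ignore all previous instructions now", attack))
print(smith_waterman("what is the capital of france", attack))
```

The first prompt scores high despite the inserted word "all" (gaps absorb it cheaply), while the benign prompt only accumulates short accidental matches.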
[NLP-36] Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search
Quick Read: This paper tackles two problems that arise when deep search agents are trained with Group Relative Policy Optimization (GRPO): first, a substantial mismatch between the correctness of intermediate steps and the final reward signal, so that many correct intermediate steps are wrongly penalized when the final answer is wrong; second, unstable training that can degrade natural-language ability or even collapse catastrophically. The proposed CalibAdv uses intermediate-step correctness to downscale excessive negative advantages at a fine-grained level and rebalances positive and negative advantages in the answer component, improving both training stability and model performance.
Link: https://arxiv.org/abs/2604.18235
Authors: Jiayi Wu, Ruobing Xie, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Ming Gao, Xiang Li
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for deep search tasks. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then rebalances positive and negative advantages in the answer component. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at this https URL.
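The group-relative advantage that GRPO assigns, and one possible form of CalibAdv-style downscaling, can be sketched as follows; the calibration rule is our hypothetical reading, since the paper's exact formula is not given in the abstract:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages: z-score each rollout's reward within its group."""
    mu = sum(rewards) / len(rewards)
    sd = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sd if sd > 0 else 1.0) for r in rewards]

def calibrate(advantages, step_correctness, floor=0.25):
    """Hypothetical CalibAdv-like rule: shrink a rollout's *negative*
    advantage the more of its intermediate steps were actually correct,
    so correct steps in failed rollouts are penalized less."""
    out = []
    for a, c in zip(advantages, step_correctness):
        scale = 1.0 - (1.0 - floor) * c   # c in [0, 1]; c = 1 -> scale = floor
        out.append(a * scale if a < 0 else a)
    return out

advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 2) for a in advs])  # [1.0, -1.0, -1.0, 1.0]
# Rollout 2 failed but had fully correct intermediate steps -> softer penalty.
print([round(a, 2) for a in calibrate(advs, [0.0, 1.0, 0.0, 0.0])])
```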
[NLP-37] Model in Distress: Sentiment Analysis on French Synthetic Social Media
Quick Read: This paper addresses three obstacles to automated analysis of customer feedback on social media: the high cost of annotated training data, the scarcity of evaluation data in multilingual settings, and privacy concerns that block data sharing and reproducibility. The key contribution is a generalizable synthetic-data pipeline that uses fine-tuned backtranslation to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. With these data, 600M-parameter reasoning models reach 77-79% accuracy on detecting customer distress in French public transportation, matching or exceeding state-of-the-art proprietary LLMs and specialized encoders, while preserving privacy by never exposing real user data.
Link: https://arxiv.org/abs/2604.18226
Authors: Pierre-Carl Langlais, Pavel Chizhov, Yannick Detrois, Carlos Rosas Hinostroza, Ivan P. Yamshchikov, Bastien Perroy
Institutions: PleIAs, Paris, France; Passenger Cognition Lab, RATP Group, Paris, France; CAIRO, THWS, Würzburg, Germany; EPFL, Lausanne, Switzerland
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Automated analysis of customer feedback on social media is hindered by three challenges: the high cost of annotated training data, the scarcity of evaluation sets, especially in multilingual settings, and privacy concerns that prevent data sharing and reproducibility. We address these issues by developing a generalizable synthetic data generation pipeline applied to a case study on customer distress detection in French public transportation. Our approach utilizes backtranslation with fine-tuned models to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. We train 600M-parameter reasoners with English and French reasoning that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders. Beyond reducing annotation costs, our pipeline preserves privacy by eliminating the exposure of sensitive user data. Our methodology can be adopted for other use cases and languages.
[NLP-38] Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex Low-Resource Endangered Languages ACL2026
Quick Read: This paper addresses weak automatic speech recognition (ASR) for two low-resource, phonologically complex East Caucasian languages, Archi and Rutul, where scarce training data meets intricate phoneme inventories and errors are often blamed on phonological complexity while data scarcity is overlooked. The authors consolidate standardized speech-transcript resources and adopt a phoneme-level analysis framework; for wav2vec2 they introduce a language-specific phoneme vocabulary with heuristic output-layer initialization, which yields consistent gains in these extremely low-resource settings and can even exceed Whisper. A detailed phoneme-level error analysis further shows that phoneme recognition accuracy follows a sigmoid-shaped learning curve in training frequency, indicating that many errors stem from data scarcity rather than phonological complexity itself and providing a sharper diagnostic basis for low-resource ASR.
Link: https://arxiv.org/abs/2604.18204
Authors: V.S.D.S. Mahesh Akavarapu, Michael Daniel, Gerhard Jäger
Institutions: University of Tübingen; University of Jena
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 (Findings)
Abstract:We present a phoneme-level analysis of automatic speech recognition (ASR) for two low-resourced and phonologically complex East Caucasian languages, Archi and Rutul, based on curated and standardized speech-transcript resources totaling approximately 50 minutes and 1 hour 20 minutes of audio, respectively. Existing recordings and transcriptions are consolidated and processed into a form suitable for ASR training and evaluation. We evaluate several state-of-the-art audio and audio-language models, including wav2vec2, Whisper, and Qwen2-Audio. For wav2vec2, we introduce a language-specific phoneme vocabulary with heuristic output-layer initialization, which yields consistent improvements and achieves performance comparable to or exceeding Whisper in these extremely low-resource settings. Beyond standard word and character error rates, we conduct a detailed phoneme-level error analysis. We find that phoneme recognition accuracy strongly correlates with training frequency, exhibiting a characteristic sigmoid-shaped learning curve. For Archi, this relationship partially breaks for Whisper, pointing to model-specific generalization effects beyond what is predicted by training frequency. Overall, our results indicate that many errors attributed to phonological complexity are better explained by data scarcity. These findings demonstrate the value of phoneme-level evaluation for understanding ASR behavior in low-resource, typologically complex languages.
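The reported sigmoid-shaped relation between phoneme accuracy and training frequency has the familiar logistic form; the midpoint and steepness below are illustrative values, not fitted to the paper's data:

```python
import math

def sigmoid_accuracy(train_freq, midpoint=40.0, steepness=0.12):
    """Logistic accuracy-vs-training-frequency curve: rare phonemes sit on
    the flat low end, frequent ones saturate near 1."""
    return 1.0 / (1.0 + math.exp(-steepness * (train_freq - midpoint)))

freqs = [1, 10, 40, 100, 400]
accs = [sigmoid_accuracy(f) for f in freqs]
print(all(a < b for a, b in zip(accs, accs[1:])))  # monotonically increasing
```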
[NLP-39] Multiplication in Multimodal LLMs: Computation with Text, Image and Audio Inputs ACL
Quick Read: This paper examines the gap between numeric perception and exact computation in multimodal LLMs: although models accurately recognize numerical content across modalities (numerals, number words, images, audio), their multi-digit multiplication accuracy collapses, and existing benchmarks lack systematically paired instances that would allow fair comparison of true arithmetic limits across modalities and model families. The solution is a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation, and modality, with paired instances from a reproducible generator. The authors define "arithmetic load" C, the product of the total and non-zero digit counts, as a mechanistically motivated proxy for operation count that predicts performance across modalities and models (R-squared often above 0.5). A perception-versus-computation decomposition shows the degradation is primarily computational rather than perceptual, and a forced-completion loss probe identifies the heuristics models are predisposed to follow (columnar multiplication, distributive decomposition, rounding/compensation), revealing a preference for decomposition and suggesting a well-tuned internal reasoning router in the base model.
Link: https://arxiv.org/abs/2604.18203
Authors: Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak
Institutions: University of Texas at Austin; National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments: To appear in ACL Findings (2026)
Abstract:Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total and non-zero digit count as a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C around 100. Indeed, C remains predictive of performance across modalities and models, with R-squared often above 0.5, nearing the value from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (above 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes, including columnar multiplication, distributive decomposition, and rounding/compensation. Here, decomposition is favored in both text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating the base model maintains a well-tuned internal router.
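The arithmetic-load proxy is simple enough to compute directly; we read the definition as applying to the digits of both operands of the multiplication, which is an assumption:

```python
def arithmetic_load(a, b):
    """C = (total digit count) * (non-zero digit count), taken here over
    both operands of a multiplication problem (our reading of the metric)."""
    digits = str(a) + str(b)
    return len(digits) * sum(ch != "0" for ch in digits)

print(arithmetic_load(12, 34))    # 4 digits, 4 non-zero -> 16
print(arithmetic_load(100, 200))  # 6 digits, 2 non-zero -> 12
```

Note how sparsity lowers C at fixed digit length: 100 x 200 has more digits than 12 x 34 but a smaller load, matching the intuition that zeros require no partial products.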
[NLP-40] Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models
【速读】: 该论文旨在解决基于Transformer的嵌入模型在处理长序列时面临的计算复杂度为二次方(quadratic computational complexity)和内存复杂度为线性(linear memory complexity)的问题,从而限制了其在长文本场景下的实用性。解决方案的关键在于提出一种垂直分块推理策略(vertically chunked inference strategy),该策略通过将输入序列垂直分块处理,使得在输入长度超过分块大小后,内存占用趋于恒定;同时结合Mamba2等递归架构进行微调,证明其作为通用文本嵌入模型的有效性,在多个基准测试中表现与Transformer相当,但内存开销显著降低。
链接: https://arxiv.org/abs/2604.18199
作者: Tobias Grantner,Emanuel Sallinger,Martin Flechl
机构: Dynatrace Research; TU Wien; University of Oxford
类目: Computation and Language (cs.CL)
备注:
Abstract:Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient alternative, introducing a vertically chunked inference strategy that enables fast embedding generation with memory usage that becomes constant in the input length once it exceeds the vertical chunk size. By fine-tuning Mamba2 models, we demonstrate their viability as general-purpose text embedders, achieving competitive performance across a range of benchmarks while maintaining a substantially smaller memory footprint compared to transformer-based counterparts. We empirically validate the applicability of our inference strategy to Mamba2, RWKV, and xLSTM models, confirming consistent runtime-memory trade-offs across architectures and establishing recurrent models as a compelling alternative to transformers for efficient embedding generation.
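The constant-memory property claimed above follows from streaming the input through a recurrent update one fixed-size chunk at a time. A toy sketch of that generic idea (this simplification ignores the paper's specific "vertical" chunking and uses a scalar EMA cell as a stand-in for Mamba2):

```python
def embed_chunked(tokens, step, state, chunk_size=256):
    """Mean-pool recurrent states over a token stream, one chunk at a
    time. Peak memory is O(chunk_size), independent of input length,
    because only the running state crosses chunk boundaries."""
    pooled, count = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]  # only one chunk held
        for tok in chunk:
            state = step(state, tok)              # O(1) recurrent update
            pooled += state
            count += 1
    return pooled / max(count, 1), state

# Toy recurrent cell: exponential moving average of token values.
ema = lambda s, x: 0.9 * s + 0.1 * x
emb, _ = embed_chunked([1.0] * 1000, ema, state=0.0)
```

Because the chunking only changes how the loop is batched, the result is identical for any chunk size; in the real models the chunk dimension is what gets parallelized on the accelerator.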
[NLP-41] Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
【速读】: 该论文旨在解决大音频语言模型(Large Audio-Language Models, LALMs)在音频理解任务中缺乏显式推理过程的问题,现有方法依赖监督式思维链(Chain-of-Thought, CoT)微调或基于粗粒度奖励的强化学习(Reinforcement Learning, RL),导致生成的推理链条虽结构良好但缺乏具体的声学依据。解决方案的关键在于提出Audio-DeepThinker框架,其核心创新包括:(1)设计一种混合推理相似性奖励机制,融合LLM评估逻辑路径一致性、关键步骤覆盖度与分析深度,以及嵌入相似性组件以确保语义对齐参考推理链;(2)采用渐进式两阶段课程训练策略,第一阶段利用混合奖励在基础音频问答任务上激发基本推理模式,第二阶段切换至声学边界案例并仅使用LLM奖励以提升推理多样性,从而实现无需监督推理微调即可通过纯强化学习探索获得高质量CoT推理能力。
链接: https://arxiv.org/abs/2604.18187
作者: Xiang He,Chenxing Li,Jinting Wang,Yan Rong,Tianxin Xie,Wenfu Wang,Li Liu,Dong Yu
机构: Tencent AI Lab (腾讯AI实验室); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注:
Abstract:Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (CoT) fine-tuning, which is limited by training data quality, or on reinforcement learning (RL) with coarse rewards that do not directly evaluate reasoning quality. As a result, the generated reasoning chains often appear well-structured yet lack specific acoustic grounding. We propose Audio-DeepThinker, a framework built on two core ideas. First, we introduce a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator assessing logical path alignment, key step coverage, and analytical depth with an embedding similarity component enforcing semantic alignment with reference reasoning chains. Second, we propose a progressive two-stage curriculum that enables high-quality CoT reasoning to emerge through pure RL exploration, without any supervised reasoning fine-tuning, from an instruction-tuned model that possesses no prior chain-of-thought capability. Stage 1 trains on foundational audio QA with the hybrid reward to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases with an LLM-only reward for greater reasoning diversity. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), winning 1st Place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track). Interpretability analyses further reveal that RL training primarily reshapes upper-layer MoE gating mechanisms and that reasoning tokens crystallize progressively in the upper transformer layers, offering mechanistic insights into how audio reasoning emerges through exploration.
[NLP-42] STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs ACL
【速读】: 该论文旨在解决现有基准测试(benchmark)在评估大语言模型(Large Language Models, LLMs)跨领域能力时,仅提供聚合得分而难以揭示模型在组合推理技能上的具体缺陷问题。为实现对这些技能缺口的系统性识别与定位,作者提出Scaffolded Task Design(STaD)框架,其核心在于基于“支架式教学”(scaffolding)理念设计可控的、逐步递进的任务变体,从而以结构化方式暴露模型在特定推理技能组合上的不足。该方法不依赖于对单个失败案例的孤立分析,而是通过黑箱视角对多个模型进行规模化探查,精准识别每种模型的独特技能短板。
链接: https://arxiv.org/abs/2604.18177
作者: Sungeun An,Swanand Ravindra Kadhe,Shailja Thakur,Chad DeLuca,Hima Patel
机构: IBM Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, 3 tables, ACL Findings 2026
Abstract:Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions that models lack. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model’s unique and distinct skill gaps.
[NLP-43] Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文本和代码编辑任务中效率低下的问题——当前方法通常通过自回归方式重新生成整个输出,即使大部分内容与输入一致,导致计算冗余。解决方案的关键在于提出“Copy-as-Decode”机制,将编辑过程建模为基于双原语语法(`<copy lines="i-j"/>` 引用输入行范围,`<gen>…</gen>` 生成新内容)的结构化解码,并引入词级有限状态机(Finite State Machine, FSM)确保语法有效性;同时,在服务层通过单次并行预填充前向传播更新KV缓存以实现复制跨度(copy span)的高效处理,替代原有N步自回归推理——该设计复用了推测解码(speculative decoding)中的并行前向核,但以程序强制接受代替概率验证,从而显著提升吞吐量。实验表明,在Qwen2.5系列模型上,该机制在不同token数量下相较传统自回归方法提速达6.8×–303×,且能覆盖74%–99%的黄金标记,具备良好的可扩展性和实用性。
链接: https://arxiv.org/abs/2604.18170
作者: Ziyang Liu
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 31 pages, 8 figures, 25 tables (17-page main body plus appendix)
Abstract:LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>…</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than N autoregressive steps, sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-1.5B and 7B, copying N tokens via parallel prefill is 6.8×–303× faster than autoregressive (N ∈ [8, 512], A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), 74–98% of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus’s span histogram this yields a closed-form wall-clock bound of 29.0× / 3.4× / 4.2× (13.0× pooled). A token-level extension reaches 91–99% coverage with 4.5×–6.5× floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all 482 cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from 100% to 15.48% under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from 0/33 (untrained) to 12–17%, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.
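The "pipeline losslessness" check above relies on a deterministic resolver that expands a copy/gen program against the input lines. A minimal sketch, assuming a simplified tuple encoding (the paper uses tag-style primitives and a token-level FSM not shown here; all names are ours):

```python
def resolve(program, input_lines):
    """Deterministically expand a two-primitive edit program:
    ("copy", i, j) splices input lines i..j (1-indexed, inclusive);
    ("gen", text) emits new content verbatim."""
    out = []
    for op in program:
        if op[0] == "copy":
            _, i, j = op
            out.extend(input_lines[i - 1:j])   # copy span from the input
        else:
            out.append(op[1])                  # newly generated content
    return out

src = ["def add(a, b):", "    return a - b", "print(add(1, 2))"]
fix = [("copy", 1, 1), ("gen", "    return a + b"), ("copy", 3, 3)]
patched = resolve(fix, src)
```

Round-tripping an oracle program through such a resolver is what localizes failures to span selection: if the program is right, the expansion is exact by construction.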
[NLP-44] Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation ACL2026
【速读】: 该论文旨在解决生成式 AI 在文学翻译中 translational creativity(译者创造性)评估不足的问题,以及源文本理解与创造性表达之间关系被割裂研究的现状。传统方法往往孤立地评估理解能力,而忽视了专业翻译中理解与创造的紧密耦合特性。其解决方案的关键在于提出一种配对任务框架(paired-task framework),将源文本理解(Task 1)与基于“创造性潜力单元”(Units of Creative Potential, UCPs,如隐喻、文字游戏等)的创造性评估(Task 2)相结合,并通过结合专家人工标注与UCP驱动的自动化评分机制实现可扩展的大规模模型评测。该设计使得能够系统性量化不同语言模型在保持语义准确的同时,是否能有效传递原文的创造性特征,从而揭示当前主流大语言模型在译者创造性方面仍存在显著差距。
链接: https://arxiv.org/abs/2604.18169
作者: Ran Zhang,Steffen Eger,Arda Tezcan,Wei Zhao,Simone Paolo Ponzetto,Lieve Macken
机构: University of Mannheim, Data and Web Science Group; University of Technology Nuremberg (UTN), Department Engineering; University of Gent, Department of Translation, Interpreting and Communication; University of Aberdeen, Department of Computing Science; Natural Language Learning and Generation (NLLG) Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Findings
Abstract:Large language models (LLMs) are increasingly used for creative tasks such as literary translation. Yet translational creativity remains underexplored and is rarely evaluated at scale, while source-text comprehension is typically studied in isolation, despite the fact that, in professional translation, comprehension and creativity are tightly intertwined. We address these gaps with a paired-task framework applied to literary excerpts from 11 books. Task 1 assesses source-text comprehension, and Task 2 evaluates translational creativity through Units of Creative Potential (UCPs), such as metaphors and wordplay. Using a scalable evaluation setup that combines expert human annotations with UCP-based automatic scoring, we benchmark 23 models and four creativity-oriented prompts. Our findings show that strong comprehension does not translate into human-level creativity: models often produce literal or contextually inappropriate renderings, with particularly large gaps for the more distant English-Chinese language pair. Creativity-oriented prompts yield only modest gains, and only one model, Mistral-Large, comes close to human-level creativity (0.167 vs. 0.246). Across all model-prompt combinations, only three exceed a creativity score of 0.1, while the rest remain at or near zero.
[NLP-45] MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM -as-a-Judge ACL2026
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)作为自动评价者(MLLM-as-a-Judge)时存在的可靠性不足与偏见问题,特别是其在整合视觉或文本关键线索时的不稳定性及对语义无关扰动的敏感性。解决方案的关键在于系统性地定义了“组合偏见”(Compositional Bias),并提出了MM-JudgeBias基准,通过在Query、Image和Response三个维度引入受控扰动,利用Bias-Deviation(BD)和Bias-Conformity(BC)两个互补指标量化模型对偏见的敏感性和稳定性,从而实现对九类偏见类型的细粒度诊断,揭示了当前主流MLLMs普遍存在模态忽视和评价倾向不对称的现象,为构建更可靠的自动评价系统提供了评估框架与实证依据。
链接: https://arxiv.org/abs/2604.18164
作者: Sua Lee,Sanghee Park,Jinbae Im
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ACL 2026 Main
Abstract:Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerabilities to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias introduces controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables a fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
[NLP-46] FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLM s
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的共情语音对话系统在训练过程中面临的三大挑战:一是依赖昂贵的共情语音指令数据;二是生成语音缺乏情感表现力;三是使用跨模态共情指令数据微调LLM时可能导致灾难性遗忘,从而损害模型的通用能力。解决方案的关键在于提出一种名为FreezeEmpath的端到端共情语音聊天机器人,其训练过程仅利用现有的语音指令数据和语音情绪识别(Speech Emotion Recognition, SER)数据,同时保持LLM参数冻结,从而避免灾难性遗忘并提升情感表达能力,实验表明该方法在共情对话、SER和语音问答(SpokenQA)任务中均优于现有模型。
链接: https://arxiv.org/abs/2604.18159
作者: Yun Hong,Yan Zhou,Yang Feng
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Empathy is essential for fostering natural interactions in spoken dialogue systems, as it enables machines to recognize the emotional tone of human speech and deliver empathetic responses. Recent research has made significant progress in developing empathetic spoken chatbots based on large language models (LLMs). However, several challenges still exist when training such models, including reliance on costly empathetic speech instruction data and a lack of emotional expressiveness in the generated speech. Finetuning LLM with cross-modal empathetic instruction data may also lead to catastrophic forgetting and a degradation of its general capability. To address these challenges, we propose FreezeEmpath, an end-to-end empathetic spoken chatbot trained in a simple and efficient manner. The entire training process relies solely on existing speech instruction data and speech emotion recognition (SER) data, while keeping the LLM’s parameters frozen. Experiments demonstrate that FreezeEmpath is able to generate emotionally expressive speech and outperforms other empathetic models in empathetic dialogue, SER, and SpokenQA tasks, demonstrating the effectiveness of our training strategy.
[NLP-47] Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition
【速读】: 该论文旨在解决训练后量化(Post-training Quantization)中低精度权重与激活位宽(如W4A4)导致的模型性能显著下降问题,尤其关注输入激活(input-activation)不同位置对误差贡献的差异。其核心问题是:在4-bit权重和4-bit激活的量化下,为何验证困惑度(perplexity, PPL)从FP16的23.6急剧恶化至1727?解决方案的关键在于提出一种训练时干预机制——Depth Registers with a register-magnitude hinge loss(DR+sink),通过约束残差轴(residual-axis)上可学习层(如qkv、w1、w3)的激活幅度,有效抑制了量化误差传播。实验表明,该方法将PPL降至119(约14倍改善),并能与SmoothQuant结合进一步提升至39.9 PPL;而剩余约2 PPL差距主要源于SwiGLU结构中块内生成器(block-internal generators,如w2)的双线性输入无法被常规正交变换(如QuaRot)充分控制,揭示了当前量化策略在处理非线性组合项时的局限性。
链接: https://arxiv.org/abs/2604.18128
作者: Ziyang Liu
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 5 figures, 6 tables
Abstract:We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention – Depth Registers with a register-magnitude hinge loss (DR+sink) – reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2’s bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot – adding online per-head value Hadamard plus online w2-input rotation – does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
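For context, round-to-nearest quantization of the kind that collapses perplexity here can be sketched in a few lines; note how a single outlier sets the scale and crushes small activations to zero. A minimal sketch, assuming a per-tensor symmetric scheme chosen for illustration (the paper's exact quantizer may differ):

```python
def quantize_rtn(xs, bits=4):
    """Naive symmetric per-tensor round-to-nearest quantization: map
    floats onto integers in [-(2**(bits-1)-1), 2**(bits-1)-1], then
    dequantize. Illustrative only; not the paper's exact scheme."""
    qmax = 2 ** (bits - 1) - 1                              # 7 for 4-bit
    scale = max(abs(x) for x in xs) / qmax or 1.0           # outlier-driven
    q = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return [v * scale for v in q], scale

xs = [0.10, -0.30, 2.00, 0.02]   # one outlier dominates the scale
deq, scale = quantize_rtn(xs)
```

The outlier 2.00 fixes the scale at 2/7, so 0.10 and 0.02 both round to zero: exactly the heavy-tailed (high-kurtosis) reader activations that the DR+sink intervention is reported to suppress.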
[NLP-48] TLoRA: Task-aware Low-Rank Adaptation of Large Language Models ACL2026
【速读】: 该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)在实际应用中因初始化策略和资源分配不合理导致的性能瓶颈问题,尤其是现有变体通常仅优化单一因素(如初始化或秩分配),往往牺牲训练复杂度或实用性。其解决方案的关键在于提出任务感知的低秩适配(Task-aware Low-Rank Adaptation, TLoRA)框架:首先通过数据驱动的初始化策略,利用预训练权重与输入激活协方差乘积的奇异值分解(SVD),使LoRA中的A矩阵对齐任务相关子空间,随后冻结A矩阵仅训练B矩阵;其次引入基于敏感性的重要性度量,动态分配固定参数预算下的各层秩和缩放因子,从而实现初始化与资源分配的联合优化,显著提升模型性能并减少可训练参数数量。
链接: https://arxiv.org/abs/2604.18124
作者: Weicheng Lin,Yi Zhang,Jiawei Dang,Liang-Jie Zhang
机构: Shenzhen University (深圳大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026
Abstract:Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning method for large language models, with its effectiveness largely influenced by the allocation of ranks and scaling factors, as well as initialization. Existing LoRA variants typically address only one of these factors, often at the cost of increased training complexity or reduced practical efficiency. In this work, we present Task-aware Low-Rank Adaptation (TLoRA), a unified framework that jointly optimizes initialization and resource allocation at the outset of training. TLoRA introduces a data-driven initialization strategy that aligns the LoRA A matrix with task-relevant subspaces by performing singular value decomposition on the product of pre-trained weights and input activation covariance. After this, the A matrix is frozen, and only the B matrix is trained. Furthermore, TLoRA employs a sensitivity-based importance metric to adaptively allocate ranks and scaling factors across layers under a fixed parameter budget. We conduct extensive experiments that demonstrate TLoRA consistently performs excellently across various tasks, including natural language understanding, commonsense reasoning, math reasoning, code generation, and chat generation, while significantly reducing the number of trainable parameters.
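The data-driven initialization can be illustrated with a rank-1 toy: take the dominant right singular direction of the product of a pretrained weight and an input-activation covariance. A minimal sketch, computed here by power iteration on tiny matrices (TLoRA's actual rank-r SVD and rank/scale allocation are not reproduced; all names are ours):

```python
import math, random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def top_right_singular_vector(M, iters=200):
    """Power iteration on M^T M: the dominant right singular vector
    of M, i.e. a rank-1 'task-relevant' input direction."""
    MtM = matmul([list(row) for row in zip(*M)], M)
    v = [random.random() for _ in MtM]
    for _ in range(iters):
        w = [sum(MtM[i][j] * v[j] for j in range(len(v)))
             for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

random.seed(0)
W = [[3.0, 0.0], [0.0, 0.1]]    # toy pretrained weight
cov = [[1.0, 0.0], [0.0, 1.0]]  # toy input-activation covariance
# Initialize one row of the LoRA A matrix along the dominant direction
# of W @ cov; A would then be frozen and only B trained.
a_row = top_right_singular_vector(matmul(W, cov))
```

With an identity covariance the dominant direction is simply the weight's own strongest input axis; a non-trivial covariance rotates the initialization toward directions the task data actually excites.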
[NLP-49] Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents ACL2026
【速读】: 该论文旨在解决决策制定过程中信息过载与用户偏好捕捉不准确的问题,尤其是在处理多源非结构化信息时,传统大语言模型(Large Language Models, LLMs)和决策支持系统难以有效平衡信息整合、主观偏好建模与用户交互效率。其解决方案的关键在于提出一个名为Decisive的交互式决策框架,该框架通过文档驱动的客观选项评分矩阵(document-grounded reasoning)确保决策依据可解释,同时结合贝叶斯偏好推断(Bayesian preference inference)动态学习用户的潜在偏好向量,并通过自适应选择最优配对权衡问题(pairwise tradeoff questions)实现高效信息获取,从而在最小化用户认知负荷的前提下显著提升决策准确性与个性化程度。
链接: https://arxiv.org/abs/2604.18122
作者: Akriti Jain,Anish Mulay,Divyansh Verma,Aishani Pandey,Pritika Ramu,Aparna Garimella
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Main Conference
Abstract:Decision-making is a cognitively intensive task that requires synthesizing relevant information from multiple unstructured sources, weighing competing factors, and incorporating subjective user preferences. Existing methods, including large language models and traditional decision-support systems, fall short: they often overwhelm users with information or fail to capture nuanced preferences accurately. We present Decisive, an interactive decision-making framework that combines document-grounded reasoning with Bayesian preference inference. Our approach grounds decisions in an objective option-scoring matrix extracted from source documents, while actively learning a user’s latent preference vector through targeted elicitation. Users answer pairwise tradeoff questions adaptively selected to maximize information gain over the final decision. This process converges efficiently, minimizing user effort while ensuring recommendations remain transparent and personalized. Through extensive experiments, we demonstrate that our approach significantly outperforms both general-purpose LLMs and existing decision-making frameworks achieving up to 20% improvement in decision accuracy over strong baselines across domains.
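The adaptive question selection described above can be sketched as greedy information gain over a discrete set of candidate preference vectors (a deliberately simplified stand-in for the paper's Bayesian inference; the deterministic answer model and all names are our assumptions):

```python
import math
from itertools import combinations

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def best_option(w, scores):
    # Recommend the option with the highest preference-weighted score.
    return max(range(len(scores)),
               key=lambda o: sum(wi * s for wi, s in zip(w, scores[o])))

def decision_entropy(hyps, scores):
    # Uncertainty over which option would finally be recommended.
    counts = {}
    for w in hyps:
        b = best_option(w, scores)
        counts[b] = counts.get(b, 0) + 1
    return entropy([c / len(hyps) for c in counts.values()])

def pick_question(hyps, scores):
    """Greedily pick the pairwise factor comparison (i, j) whose answer
    most reduces uncertainty about the final recommendation."""
    best_q, best_gain = None, -1.0
    for i, j in combinations(range(len(hyps[0])), 2):
        yes = [w for w in hyps if w[i] > w[j]]
        no = [w for w in hyps if not w[i] > w[j]]
        if not yes or not no:
            continue
        expected = (len(yes) * decision_entropy(yes, scores) +
                    len(no) * decision_entropy(no, scores)) / len(hyps)
        gain = decision_entropy(hyps, scores) - expected
        if gain > best_gain:
            best_q, best_gain = (i, j), gain
    return best_q

# Three candidate preference vectors over (price, quality); two options
# scored from documents: option 0 wins on price, option 1 on quality.
hyps = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
scores = [(1.0, 0.0), (0.0, 1.0)]
q = pick_question(hyps, scores)
```

Asking "do you care more about price than quality?" splits the hypothesis set so that either answer sharply narrows the final recommendation, which is the convergence property the abstract appeals to.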
[NLP-50] Retrieval-Augmented Multimodal Model for Fake News Detection
【速读】: 该论文旨在解决多模态多领域虚假新闻检测中的两个核心问题:一是现有模型通常孤立地评估每条新闻,无法捕捉跨实例的叙事一致性,从而难以应对由社交媒体驱动的集群式虚假新闻传播;二是传统模型仅依赖训练时编码的知识,在面对新兴事件或小众话题等数据稀缺领域时泛化能力不足。解决方案的关键在于提出检索增强的多模态虚假新闻检测模型(Retrieval-Augmented Multimodal Model, RAMM),其创新点包括:首先,以多模态大语言模型(Multimodal Large Language Model, MLLM)为骨干网络,有效提取新闻样本中的跨模态语义信息;其次,引入抽象叙事对齐模块(Abstract Narrative Alignment Module),自适应地从不同领域的多样化实例中提取抽象叙事一致性并聚合相关知识,实现高层叙事信息建模;最后,设计语义表示对齐模块(Semantic Representation Alignment Module),将模型的推理机制从直接基于多模态特征的判断转变为基于实例的类比推理,使其更贴近人类决策逻辑,显著提升在复杂场景下的检测性能。
链接: https://arxiv.org/abs/2604.18112
作者: Yiheng Li,Weihai Lu,Hanyi Yu,Yue Wang
机构: University of International Business and Economics (对外经济贸易大学); Peking University (北京大学); University of Southern California (南加州大学); Upstart Holdings, Inc.; Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
Abstract:In recent years, multimodal multidomain fake news detection has garnered increasing attention. Nevertheless, this direction presents two significant challenges: (1) Failure to Capture Cross-Instance Narrative Consistency: existing models usually evaluate each news item in isolation, failing to capture cross-instance narrative consistency, and thus struggle to address the spread of cluster-based fake news driven by social media; (2) Lack of Domain Specific Knowledge for Reasoning: conventional models, which rely solely on knowledge encoded in their parameters during training, struggle to generalize to new or data-scarce domains (e.g., emerging events or niche topics). To tackle these challenges, we introduce the Retrieval-Augmented Multimodal Model for Fake News Detection (RAMM). First, RAMM employs a Multimodal Large Language Model (MLLM) as its backbone to capture cross-modal semantic information from news samples. Second, RAMM incorporates an Abstract Narrative Alignment Module. This component adaptively extracts abstract narrative consistency from diverse instances across distinct domains, aggregates relevant knowledge, and thereby enables the modeling of high-level narrative information. Finally, RAMM introduces a Semantic Representation Alignment Module, which aligns the model’s decision-making paradigm with that of humans - specifically, it shifts the model’s reasoning process from direct inference on multimodal features to an instance-based analogical reasoning process. Extensive experimental results on three public datasets validate the efficacy of our proposed approach. Our code is available at the following link: this https URL
[NLP-51] FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings
【速读】: 该论文旨在解决预训练句子嵌入空间中语义信息可解释性不足的问题,特别是如何从多语言(LaBSE)、多模态(SONAR)及基于API的(Gemini)句子嵌入中恢复词汇内容。其解决方案的关键在于提出因子分解线性投影(Factorized Linear Projection, FLiP)模型,通过学习将高维嵌入空间映射回原始词汇表示,从而有效重构嵌入中的 lexical content(词汇内容)。实验表明,FLiP 在多种语言和嵌入模型上能召回超过75%的词汇内容,显著优于非因子化基线方法,并可用于诊断不同嵌入模型中存在的模态和语言偏差,为实践者提供无需依赖下游任务的内在洞察。
链接: https://arxiv.org/abs/2604.18109
作者: Santosh Kesiraju,Bolaji Yusuf,Šimon Sedláček,Oldřich Plchot,Petr Schwarz
机构: Speech@FIT, Brno University of Technology (布林诺理工大学), Czechia (捷克)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: Under review
Abstract:This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is public at this https URL.
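The idea of recovering lexical content through a linear map can be illustrated with a closed-form toy in which word vectors are known and an "embedding" is their sum. FLiP instead learns a factorized projection from real encoders, so treat this strictly as a plausibility sketch of linear decodability, with all names and thresholds our own:

```python
import math, random

random.seed(1)
DIM, VOCAB = 256, 50
# Random near-unit word vectors; a toy sentence embedding is their sum.
word_vecs = [[random.gauss(0, 1) / math.sqrt(DIM) for _ in range(DIM)]
             for _ in range(VOCAB)]

def embed(word_ids):
    return [sum(word_vecs[w][d] for w in word_ids) for d in range(DIM)]

def probe(emb, threshold=0.5):
    """Linear probe: score every vocabulary word by a dot product with
    the embedding and predict the words whose score clears a threshold."""
    scores = [sum(e * v for e, v in zip(emb, word_vecs[w]))
              for w in range(VOCAB)]
    return {w for w in range(VOCAB) if scores[w] > threshold}

sentence = {3, 17, 25, 42}
predicted = probe(embed(sentence))
recall = len(predicted & sentence) / len(sentence)
```

In high dimension random word vectors are nearly orthogonal, so true words score near 1 and absent words near 0, and a simple dot-product readout recovers most of the bag of words, which is the intuition behind probing real embedding spaces with (factorized) linear maps.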
[NLP-52] Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源语言(Low-Resource Languages, LRLs)场景下适应困难的问题,主要受限于任务数据稀缺和计算资源不足。现有方法如Proxy Tuning虽能引入缩放效应,但在LRL中表现不佳,因其依赖的大模型对LRL的弱能力可能压制小型专用模型的知识优势。论文提出TriMix框架,其关键在于测试时动态融合来自三个来源的logit:持续预训练的小型LRL专用模型提供的语言能力、高资源语言指令微调模型的任务能力,以及大模型的缩放收益。该方案无需LRL任务标注,仅需小模型的持续预训练,且实验证明优先强化小模型logits是成功的关键,颠覆了“大模型主导”的普遍假设。
链接: https://arxiv.org/abs/2604.18106
作者: Chen Zhang,Jiuheng Lin,Zhiyuan Liao,Yansong Feng
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所)
类目: Computation and Language (cs.CL)
备注: ACL 2026
Abstract:Adapting large language models (LLMs) to low-resource languages (LRLs) is constrained by the scarcity of task data and computational resources. Although Proxy Tuning offers a logit-level strategy for introducing scaling effects, it often fails in LRL settings because the large model’s weak LRL competence might overwhelm the knowledge of specialized smaller models. We thus propose TriMix, a test-time logit fusion framework that dynamically balances capabilities from three different sources: LRL competence from a continually pretrained small model, task competence from high-resource language instruction tuning, and the scaling benefits of large models. It is data- and compute-efficient, requiring no LRL task annotations, and only continual pretraining on a small model. Experiments across four model families and eight LRLs show that TriMix consistently outperforms single-model baselines and Proxy Tuning. Our analysis reveals that prioritizing the small LRL-specialized model’s logits is crucial for success, challenging the prevalent large-model-dominant assumption.
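The generic shape of test-time logit fusion can be sketched as a weighted sum over three models' next-token logits, with the small LRL-specialized model weighted highest to mirror the paper's finding. The exact TriMix fusion rule is not given in the abstract, and Proxy Tuning style methods use logit differences, so the weights and functional form below are assumptions:

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_logits(z_small_lrl, z_task, z_large, w=(1.0, 0.5, 0.25)):
    """Test-time fusion: weighted sum of next-token logits from a small
    LRL-adapted model, a task-tuned model, and a large general model.
    Weighting the small LRL model highest reflects the paper's finding;
    the weight values themselves are illustrative assumptions."""
    fused = [w[0] * a + w[1] * b + w[2] * c
             for a, b, c in zip(z_small_lrl, z_task, z_large)]
    return softmax(fused)

# Toy 3-token vocabulary: the small LRL model prefers token 0 while the
# large model (weak in this language) prefers token 2.
p = fuse_logits([2.0, 0.0, -1.0], [0.5, 0.5, 0.0], [-1.0, 0.0, 3.0])
```

With the small model dominant, the fused distribution keeps token 0 on top even though the large model votes for token 2; flipping the weights toward the large model is exactly the "large-model-dominant assumption" the analysis argues against.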
[NLP-53] Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
【速读】: 该论文旨在解决当前多模态大语言模型在生成幽默图像标题时缺乏对显性文化背景的稳定控制问题,导致难以在特定文化语境下同时保证图像相关性、语境适配性和幽默质量。其解决方案的关键在于提出一种新的任务范式——文化感知的幽默标题生成(culture-aware humorous captioning),并设计了一个六维评估框架以系统衡量生成效果;在此基础上,进一步提出分阶段对齐框架:首先在西方文化背景下利用高资源监督初始化模型,随后通过基于裁判的GRPO(Group Relative Policy Optimization)进行多维偏好对齐,并引入去劣化原型排斥约束(Degradation-aware Prototype Repulsion Constraint)以缓解开放生成中的奖励劫持(reward hacking)问题,最终仅用少量标注数据即可将模型适配至东方文化背景,显著提升了语境契合度与图像相关性-幽默性的平衡能力。
链接: https://arxiv.org/abs/2604.18091
作者: Run Xu,Lu Li,Rongzhao Zhang,Jie Xu
机构: Nanyang Technological University (南洋理工大学); Tongji University (同济大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent multimodal large language models have shown promising ability in generating humorous captions for images, yet they still lack stable control over explicit cultural context, making it difficult to jointly maintain image relevance, contextual appropriateness, and humor quality under a specified cultural background. To address this limitation, we introduce a new multimodal generation task, culture-aware humorous captioning, which requires a model to generate a humorous caption conditioned on both an input image and a target cultural context. Captions generated under different cultural contexts are not expected to share the same surface form, but should remain grounded in similar visual situations or humorous mechanisms. To support this task, we establish a six-dimensional evaluation framework covering image relevance, contextual fit, semantic richness, reasonableness, humor, and creativity. We further propose a staged alignment framework that first initializes the model with high-resource supervision under the Western cultural context, then performs multi-dimensional preference alignment via judge-based GRPO with a Degradation-aware Prototype Repulsion Constraint to mitigate reward hacking in open-ended generation, and finally adapts the model to the Eastern cultural context with a small amount of supervision. Experimental results show that our method achieves stronger overall performance under the proposed evaluation framework, with particularly large gains in contextual fit and a better balance between image relevance and humor under cultural constraints.
[NLP-54] Mix and Match: Context Pairing for Scalable Topic-Controlled Educational Summarisation
【速读】: 该论文旨在解决小语言模型(small language models, sLMs)在执行主题可控摘要(topic-controlled summarisation)任务时性能不足的问题。其核心挑战在于如何在有限的真实训练数据下,提升模型对主题与摘要之间语义关系的建模能力。解决方案的关键在于提出了一种成对数据增强方法(pairwise data augmentation),通过将来自不同文档的上下文进行组合,生成对比性训练样本,从而强化模型对主题与摘要间映射关系的学习效果。实验表明,在固定真实训练数据量的前提下,随着增强规模的增加,模型在胜率(win rate)和语义一致性(semantic alignment)上均取得稳定提升,最终使T5-base模型在性能上达到与更大模型相当的水平。
链接: https://arxiv.org/abs/2604.18087
作者: Nathikan Yodthapa,Thanapong Intharah,Sahan Bulathwela
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: To be published at the International Conference on Artificial Intelligence in Education (AIED’26)
Abstract:Topic-controlled summarisation enables users to generate summaries focused on specific aspects of source documents. This paper investigates a data augmentation strategy for training small language models (sLMs) to perform topic-controlled summarisation. We propose a pairwise data augmentation method that combines contexts from different documents to create contrastive training examples, enabling models to learn the relationship between topics and summaries more effectively. Using the SciTLDR dataset enriched with Wikipedia-derived topics, we systematically evaluate how augmentation scale affects model performance. Results show consistent improvements in win rate and semantic alignment as the augmentation scale increases, while the amount of real training data remains fixed. Consequently, a T5-base model trained with our augmentation approach achieves competitive performance relative to larger models, despite using significantly fewer parameters and substantially fewer real training examples.
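The pairwise augmentation idea, combining contexts from different documents so the model must attend to the requested topic, can be sketched as follows. The pairing scheme and prompt format are illustrative assumptions, not the paper's exact recipe:

```python
def pair_augment(examples):
    """Build contrastive topic-controlled pairs: for each ordered pair
    of documents, concatenate both contexts but keep only one document's
    topic and summary as the target, so the model must pick out the
    topic-relevant content from the mixed context."""
    augmented = []
    for i, (ctx_a, topic_a, summary_a) in enumerate(examples):
        for j, (ctx_b, _, _) in enumerate(examples):
            if i == j:
                continue
            prompt = f"topic: {topic_a}\ncontext: {ctx_a}\n{ctx_b}"
            augmented.append((prompt, summary_a))
    return augmented

docs = [("cats purr.", "cats", "about cats"),
        ("dogs bark.", "dogs", "about dogs")]
pairs = pair_augment(docs)
```

Note the augmentation scales quadratically in the number of source documents while the amount of real supervision stays fixed, which matches the paper's experimental design of growing augmentation scale over a fixed dataset.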
[NLP-55] Modeling Human Perspectives with Socio-Demographic Representations
【速读】: 该论文旨在解决如何在自然语言处理(Natural Language Processing, NLP)任务中有效建模标注者视角(annotator perspectives)的问题,尤其是如何通过细粒度的社会人口学特征(socio-demographic attributes)来解释和预测这些视角差异。传统方法通常仅考虑单一或有限组合的社会人口学因素,难以捕捉真实场景下由复杂社会背景塑造的多样化观点。其解决方案的关键在于提出一种名为“社会对比学习”(Socio-Contrastive Learning)的新方法,该方法能够联合学习标注者的视角表示与社会人口学特征表示,并通过融合机制实现文本表示与社会特征表示的有效结合,从而显著优于基于简单拼接(concatenation-based)的传统方法。该框架不仅提升了预测准确性,还支持对不同社会人口学因素如何影响标注者观点变化进行可视化分析。
链接: https://arxiv.org/abs/2604.18069
作者: Leixin Zhang,Cagri Coltekin
机构: University of Tübingen (图宾根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Humans often hold different perspectives on the same issues. In many NLP tasks, annotation disagreement can reflect valid subjective perspectives. Modeling annotator perspectives and understanding their relationship with other human factors, such as socio-demographic attributes, have received increasing attention. Prior work typically focuses on single demographic factors or limited combinations. However, in real-world settings, annotator perspectives are shaped by complex social contexts, and finer-grained socio-demographic attributes can better explain human perspectives. In this work, we propose Socio-Contrastive Learning, a method that jointly models annotator perspectives while learning socio-demographic representations. Our method provides an effective approach for the fusion of socio-demographic features and textual representations to predict annotator perspectives, outperforming standard concatenation-based methods. The learned representations further enable analysis and visualization of how demographic factors relate to variation in annotator perspectives. Our code is available on GitHub: this https URL
[NLP-56] JudgeMeNot: Personalizing Large Language Models to Emulate Judicial Reasoning in Hebrew ACL2026
[Quick Read]: This paper tackles the difficulty of personalizing large language models (LLMs) for individual decision-makers such as judges, especially how to achieve efficient and accurate model adaptation in low-resource settings. The core of the solution is a synthetic-organic supervision pipeline that transforms raw judicial decision texts into instruction-tuning data, enabling parameter-efficient fine-tuning so the model can learn and emulate a specific judge's reasoning style and decision logic. The key innovation is combining Causal Language Modeling with synthetically generated instruction data, which substantially improves the lexical, stylistic, and semantic alignment of model outputs with human judicial reasoning while remaining highly practical under low-resource conditions.
Link: https://arxiv.org/abs/2604.18041
Authors: Itay Razumenko, Arnon Sturm, Nir Grinberg
Affiliations: Ben-Gurion University of the Negev
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: To appear in Findings of the ACL 2026
Abstract:Despite significant advances in large language models, personalizing them for individual decision-makers remains an open problem. Here, we introduce a synthetic-organic supervision pipeline that transforms raw judicial decisions into instruction-tuning data, enabling parameter-efficient fine-tuning of personalized models for individual judges in low-resource settings. We compare our approach to state-of-the-art personalization techniques across three different tasks and settings. The results show that Causal Language Modeling followed by synthetically generated instruction-tuning significantly outperforms all other baselines, providing significant improvements across lexical, stylistic, and semantic similarity. Notably, our model-generated outputs are indistinguishable from the reasoning of human judges, highlighting the viability of efficient personalization, even in low-resource settings.
[NLP-57] SignDPO: Multi-level Direct Preference Optimisation for Skeleton-based Gloss-free Sign Language Translation
[Quick Read]: This paper addresses the semantic drift that arises when aligning skeleton-based Sign Language Translation models, which stems from the prevailing imitation-based Maximum Likelihood Estimation paradigm and its lack of discriminative sensitivity to the fine-grained spatio-temporal features of sign language. The key to the proposed SignDPO framework is multi-level Direct Preference Optimisation (DPO), which shifts the optimisation goal from simple sequence mimicry to structured preference alignment across spatial, temporal, and linguistic dimensions. Its core innovations are: (1) a hierarchical perturbation strategy that automatically constructs non-preferred samples at both global and local granularities; (2) a self-guiding mechanism that uses decoder cross-attention scores to identify and perturb semantically salient skeletal regions, improving the model's ability to distinguish genuine sign signals from structural distortions; and (3) an automated language-level preference generator obtained by fine-tuning a dedicated perturbation model, which captures complex output-level failure modes without manual annotation.
Link: https://arxiv.org/abs/2604.18034
Authors: Muxin Pu, Xiao-Ming Wu, Mei Kuan Lim, Chun Yong Chong, Wei Li, Chen Change Loy
Affiliations: Monash University; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present SignDPO, a novel multi-level Direct Preference Optimisation (DPO) framework designed to enhance the alignment of skeleton-based Sign Language Translation. While current skeleton-based models have made significant progress using Maximum Likelihood Estimation, they are primarily constrained by an imitation-based paradigm that lacks discriminative sensitivity to the fine-grained spatio-temporal nuances of sign language, often leading to semantic drift. To address this, SignDPO shifts the optimisation goal from simple sequence mimicry to structured preference alignment across spatial, temporal, and linguistic dimensions. Our framework involves three key designs. First, we introduce a hierarchical perturbation strategy to construct spatial and temporal non-preferred samples at both global and local granularities automatically. Second, we propose a self-guiding mechanism that leverages decoder cross-attention scores to identify and perturb semantically salient skeletal regions, forcing the model to distinguish genuine sign signals from structural distortions. Third, we establish an automated language-level preference generator by fine-tuning a dedicated perturbation model, capturing complex output-level failure modes without manual annotation. Extensive experiments on three widely adopted benchmarks, CSL-Daily, How2Sign, and OpenASL, demonstrate that SignDPO consistently outperforms state-of-the-art gloss-free methods and even rivals established gloss-based ones. Our results suggest that multi-level preference alignment is a powerful paradigm for bridging the gap between high-entropy skeletal trajectories and discrete linguistic semantics.
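As a hedged illustration only: the abstract does not give SignDPO's exact objective, but the standard DPO loss it builds on can be sketched as follows, where the preferred sample is an unperturbed sequence and the rejected one comes from a perturbation. All names and numbers here are illustrative, not the paper's implementation.

```python
import math

def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """Standard DPO objective for one preference pair.

    logp_* are the policy's sequence log-probabilities for the preferred
    and rejected (perturbed) samples; ref_logp_* come from a frozen
    reference model. beta controls deviation from the reference.
    """
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    return math.log(1.0 + math.exp(-margin))  # -log(sigmoid(margin))

# The loss is lower when the policy, relative to the reference,
# assigns more probability mass to the preferred sample.
loss_aligned = dpo_loss(-10.0, -20.0, -12.0, -18.0)
loss_misaligned = dpo_loss(-20.0, -10.0, -18.0, -12.0)
```

Multi-level preference alignment would apply this pairwise objective to spatial, temporal, and linguistic perturbations of the same input.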
[NLP-58] How Creative Are Large Language Models in Generating Molecules?
[Quick Read]: This paper addresses how molecule generation must balance multiple constraints (physicochemical properties, absorption-distribution-metabolism-excretion-toxicity (ADMET) characteristics, and biological activity) with exploration, so as to identify molecular structures within a vast, structured chemical space that both satisfy the constraints and are novel. Traditional methods often get trapped in local optima, and while generative AI models can produce molecular representations directly from natural language prompts, the mechanisms behind their creative behavior remain unclear. The key contribution is to reframe, for the first time, the abilities required for molecule generation as creativity, and to characterize the behavior of large language models (LLMs) along two complementary dimensions, convergent creativity and divergent creativity, through a systematic empirical evaluation. The study reveals key findings such as increased constraint satisfaction when additional constraints are imposed, providing both theoretical grounding and practical guidance for the use of LLMs in molecular discovery pipelines.
Link: https://arxiv.org/abs/2604.18031
Authors: Wen Tao, Yiwei Wang, Peng Zhou, Bryan Hooi, Wanlong Fang, Tianle Zhang, Xiao Luo, Yuansheng Liu, Alvin Chan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:
Abstract:Molecule generation requires satisfying multiple chemical and biological constraints while searching a large and structured chemical space. This makes it a non-binary problem, where effective models must identify non-obvious solutions under constraints while maintaining exploration to improve success by escaping local optima. From this perspective, creativity is a functional requirement in molecular generation rather than an aesthetic notion. Large language models (LLMs) can generate molecular representations directly from natural language prompts, but it remains unclear what type of creativity they exhibit in this setting and how it should be evaluated. In this work, we study the creative behavior of LLMs in molecular generation through a systematic empirical evaluation across physicochemical, ADMET, and biological activity tasks. We characterize creativity along two complementary dimensions, convergent creativity and divergent creativity, and analyze how different factors shape these behaviors. Our results indicate that LLMs exhibit distinct patterns of creative behavior in molecule generation, such as an increase in constraint satisfaction when additional constraints are imposed. Overall, our work is the first to reframe the abilities required for molecule generation as creativity, providing a systematic understanding of creativity in LLM-based molecular generation and clarifying the appropriate use of LLMs in molecular discovery pipelines.
[NLP-59] CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora
[Quick Read]: This paper addresses two core problems of existing training-based code transpilation approaches in low-resource programming language (PL) settings: first, reliance on a pairwise transpilation paradigm with parallel corpora makes it impractical to support a diverse range of PLs, especially given the data scarcity of low-resource PLs; second, suboptimal reinforcement learning (RL) reward formulations hurt model performance. The key to the proposed CodePivot framework is its novel use of Python as an intermediate representation (IR), combined with a new RL reward mechanism, the Aggressive-Partial-Functional reward, which effectively bootstraps multilingual transpilation ability without requiring parallel corpora. Experiments show that a 7B model trained with this framework substantially outperforms much larger mainstream models on both general and low-resource PL tasks, and also beats a counterpart trained directly on Any-to-Any tasks.
Link: https://arxiv.org/abs/2604.18027
Authors: Shangyu Li, Juyong Jiang, Meibo Ren, Sizhe Zhong, Huiri Tan, Yunhao Gou, Xu Han, Chun Yong Chong, Yun Peng, Jiasi Shen
Affiliations: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou); Monash University; The Chinese University of Hong Kong
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:
Abstract:Transpilation, or code translation, aims to convert source code from one programming language (PL) to another. It is beneficial for many downstream applications, from modernizing large legacy codebases to augmenting data for low-resource PLs. Recent large language model (LLM)-based approaches have demonstrated immense potential for code translation. Among these approaches, training-based methods are particularly important because LLMs currently do not effectively adapt to domain-specific settings that suffer from a lack of knowledge without targeted training. This limitation is evident in transpilation tasks involving low-resource PLs. However, existing training-based approaches rely on a pairwise transpilation paradigm, making it impractical to support a diverse range of PLs. This limitation is particularly prominent for low-resource PLs due to a scarcity of training data. Furthermore, these methods suffer from suboptimal reinforcement learning (RL) reward formulations. To address these limitations, we propose CodePivot, a training framework that leverages Python as an intermediate representation (IR), augmented by a novel RL reward mechanism, Aggressive-Partial-Functional reward, to bootstrap the model’s multilingual transpilation ability without requiring parallel corpora. Experiments involving 10 PLs show that the resulting 7B model, trained on Python-to-Others tasks, consistently improves performance across both general and low-resource PL-related transpilation tasks. It outperforms substantially larger mainstream models with hundreds of billions more parameters, such as Deepseek-R1 and Qwen3-235B-A22B-Instruct-2507, on Python-to-Others tasks and Others-to-All tasks, respectively. In addition, it outperforms its counterpart trained directly on Any-to-Any tasks on general transpilation tasks. The code and data are available at this https URL.
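The abstract names an Aggressive-Partial-Functional reward without defining it. As a hedged sketch of one plausible ingredient of such a scheme (not the paper's actual formulation), a partial-credit functional reward over unit tests, gated on compilation, might look like:

```python
def partial_functional_reward(test_results, compiled):
    """Illustrative partial-credit reward for a transpiled program.

    Grants the fraction of passing tests when the output compiles,
    and zero otherwise. This is an assumption for illustration, not
    the paper's Aggressive-Partial-Functional reward.
    """
    if not compiled or not test_results:
        return 0.0
    return sum(1 for ok in test_results if ok) / len(test_results)

r_partial = partial_functional_reward([True, False, True, True], compiled=True)
r_broken = partial_functional_reward([True, True], compiled=False)
```

Partial credit of this kind gives the policy a denser learning signal than a binary all-tests-pass reward.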
[NLP-60] Employing General-Purpose and Biomedical Large Language Models with Advanced Prompt Engineering for Pharmacoepidemiologic Study Design
[Quick Read]: This paper addresses the reliability of large language models (LLMs) in automating and supporting pharmacoepidemiologic study design, in particular the performance gap between general-purpose LLMs and biomedically fine-tuned LLMs in practice. The key to the solution is a systematic evaluation of different LLMs (including GPT-4o, DeepSeek-R1, and QuantFactory/Bio-Medical-Llama-3-8B-GGUF) on real-world study protocols from the HMA-EMA Catalogue and the Sentinel System, using Least-to-Most (LTM) and Active Prompting strategies to refine the reasoning process. The results show that general-purpose LLMs achieve higher relevance and logical soundness on these tasks, and that prompting strategy strongly influences the stability of model outputs.
Link: https://arxiv.org/abs/2604.17988
Authors: Xinyao Zhang, Nicole Sonne Heckmann, Manuela Del Castillo Suero, Francesco Paolo Speca, Maurizio Sessa
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Background: The potential of large language models (LLMs) to automate and support pharmacoepidemiologic study design is an emerging area of interest, yet their reliability remains insufficiently characterized. General-purpose LLMs often display inaccuracies, while the comparative performance of specialized biomedical LLMs in this domain remains unknown. Methods: This study evaluated general-purpose LLMs (GPT-4o and DeepSeek-R1) versus biomedically fine-tuned LLMs (QuantFactory/Bio-Medical-Llama-3-8B-GGUF and Irathernotsay/qwen2-1.5B-medical_qa-Finetune) using 46 protocols (2018-2024) from the HMA-EMA Catalogue and Sentinel System. Performance was assessed across relevance, logic of justification, and ontology-code agreement across multiple coding systems using Least-to-Most (LTM) and Active Prompting strategies. Results: GPT-4o and DeepSeek-R1 paired with LTM prompting achieved the highest relevance and logic of justification scores, with GPT-4o-LTM reaching a median relevance score of 4 in 8 of 9 questions for HMA-EMA protocols. Biomedical LLMs showed lower relevance overall and frequently generated insufficient justification. All LLMs demonstrated limited proficiency in ontology-code mapping, although LTM provided the most consistent improvements in reasoning stability. Conclusion: Off-the-shelf general-purpose LLMs currently offer superior support for pharmacoepidemiologic design compared to biomedical LLMs. Prompt strategy strongly influenced LLM performance.
[NLP-61] Mitigating Multimodal Hallucination via Phase-wise Self-reward
[Quick Read]: This paper targets vision hallucination in Large Vision-Language Models (LVLMs), i.e., generated textual responses that are inconsistent with the input image. Existing methods either rely on large-scale annotated data for fine-tuning, which is computationally expensive, or use static post-hoc strategies that fail to capture how hallucination evolves dynamically during inference. The key to the solution is a self-rewarding framework that dynamically suppresses hallucination at inference time without external supervision. Concretely, the authors find that hallucination exhibits phase-wise dynamics, peaking at the onset of each semantic phase, and accordingly design PSRD (Phase-wise Self-Reward Decoding), which guides online correction with phase-wise self-reward signals. To reduce the cost of repeated self-evaluation, the hallucination guidance signal is further distilled into a lightweight reward model that provides on-the-fly intervention during decoding, achieving precise hallucination suppression: PSRD reduces the hallucination rate of LLaVA-1.5-7B by 50.0% and outperforms existing post-hoc methods on five benchmarks.
Link: https://arxiv.org/abs/2604.17982
Authors: Yu Zhang, Chuyang Sun, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang
Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; Shenzhen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Self-reward for vision hallucination mitigation
Abstract:Large Vision-Language Models (LVLMs) still struggle with vision hallucination, where generated responses are inconsistent with the visual input. Existing methods either rely on large-scale annotated data for fine-tuning, which incurs massive computational overhead, or employ static post-hoc strategies that overlook the dynamic nature of hallucination emergence. To address these, we introduce a new self-rewarding framework, enabling dynamic hallucination mitigation at inference time without external supervision. On the empirical side, we reveal that visual hallucination exhibits phase-wise dynamic patterns, peaking at the onset of each semantic phase. Drawing on these insights, we propose \textbfPSRD (\textbfPhase-wise \textbfSelf-\textbfReward \textbfDecoding) for online hallucination correction guided by phase-wise self-reward signals. To reduce the cost of repeated self-evaluation during decoding, we distill the hallucination guidance signal from LVLMs into a lightweight reward model. The reward model subsequently provides on-the-fly guidance for targeted intervention during the decoding process, enabling precise hallucination suppression. The proposed PSRD significantly reduces the hallucination rate of LLaVA-1.5-7B by 50.0% and consistently outperforms existing post-hoc methods across five hallucination evaluation benchmarks for four LVLMs. Further analysis confirms that PSRD effectively mitigates hallucination propagation and achieves a highly controllable trade-off between strong performance and inference efficiency.
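A minimal sketch of the reward-guided decoding idea described above, with placeholder functions standing in for the LVLM's candidate sampling and the distilled lightweight reward model (the candidate format and reward are toy assumptions, not the paper's implementation):

```python
def phase_guided_decode(propose, reward, num_phases):
    """At the onset of each semantic phase, sample several candidate
    continuations and keep the one the reward model scores highest."""
    output = []
    for _ in range(num_phases):
        candidates = propose(output)  # placeholder: LVLM sampling
        output.append(max(candidates, key=reward))
    return output

# Toy stand-ins: three candidates per phase; the reward favors the
# shortest string as a crude proxy for "least hallucinated" content.
propose = lambda out: [f"p{len(out)}-long-detail", f"p{len(out)}-ok", f"p{len(out)}-x"]
result = phase_guided_decode(propose, reward=lambda c: -len(c), num_phases=2)
```

Scoring only at phase onsets, rather than every token, is what keeps the reward-model overhead small.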
[NLP-62] ltzGLUE: Luxembourgish General Language Understanding Evaluation ACL2026
[Quick Read]: This paper addresses the lack of a standardized Natural Language Understanding (NLU) evaluation benchmark for Luxembourgish (LTZ). Although NLU tasks exist for many European languages, LTZ, despite being an official national language, has long been overlooked. The key to the solution is ltzGLUE, the first Luxembourgish NLU benchmark built on the GLUE framework, which constructs new tasks and reuses existing ones to cover typical NLP tasks, such as named entity recognition, topic classification, and intent classification, in both binary and multi-class settings, and systematically evaluates a range of pre-trained language models, providing the first comprehensive quantitative analysis and baseline reference for language modeling capabilities in LTZ.
Link: https://arxiv.org/abs/2604.17976
Authors: Alistair Plum, Felicia Körner, Anne-Marie Lutgen, Laura Bernardy, Fred Philippy, Emilia Milano, Nils Rehlinger, Cédric Lothritz, Tharindu Ranasinghe, Barbara Plank, Christoph Purschke
Affiliations: University of Luxembourg, Luxembourg; LMU Munich, Germany; Munich Center for Machine Learning, Germany; LIST, Luxembourg; Lancaster University, UK
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ACL Findings 2026
Abstract:This paper presents ltzGLUE, the first Natural Language Understanding (NLU) benchmark for Luxembourgish (LTZ) based on the popular GLUE benchmark for English. Although NLU tasks are available for many European languages nowadays, LTZ is one of the official national languages that is often overlooked. We construct new tasks and reuse existing ones to introduce the first official NLU benchmark and accompanying evaluation of encoder models for the language. Our tasks include common natural language processing tasks in binary and multi-class classification settings, including named entity recognition, topic classification, and intent classification. We evaluate various pre-trained language models for LTZ to present an overview of the current capabilities of these models on the LTZ language.
[NLP-63] Modeling Multiple Support Strategies within a Single Turn for Emotional Support Conversations
[Quick Read]: This paper addresses a limitation of conventional Emotional Support Conversation (ESC) models, which assume each supporter turn uses a single support strategy, whereas real-world supportive dialogue often blends multiple strategies within one utterance. The key to the solution is to reformulate the ESC task as multi-strategy utterance generation and to propose two generation methods: All-in-One, which predicts all strategy-response pairs in a single decoding step, and One-by-One, which iteratively generates strategy-response pairs until completion. Both are further enhanced with cognitive reasoning guided by reinforcement learning to improve strategy selection and response composition. Experiments show that this approach effectively models multi-strategy utterances and significantly improves supportive quality and dialogue success.
Link: https://arxiv.org/abs/2604.17972
Authors: Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Jinsong Su, Chi Zhang, Fang Kong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Emotional Support Conversation (ESC) aims to assist individuals experiencing distress by generating empathetic and supportive dialogue. While prior work typically assumes that each supporter turn corresponds to a single strategy, real-world supportive communication often involves multiple strategies within a single utterance. In this paper, we revisit the ESC task by formulating it as multi-strategy utterance generation, where each utterance may contain one or more strategy-response pairs. We propose two generation methods: All-in-One, which predicts all strategy-response pairs in a single decoding step, and One-by-One, which iteratively generates strategy-response pairs until completion. Both methods are further enhanced with cognitive reasoning guided by reinforcement learning to improve strategy selection and response composition. We evaluate our models on the ESConv dataset under both utterance-level and dialogue-level settings. Experimental results show that our methods effectively model multi-strategy utterances and lead to improved supportive quality and dialogue success. To our knowledge, this work provides the first systematic empirical evidence that allowing multiple support strategies within a single utterance is both feasible and beneficial for emotional support conversations. All code and data will be publicly available at this https URL.
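The One-by-One method described above can be sketched as a simple loop; the model call and the stop convention (returning `None` when done) are placeholders, not the paper's interface:

```python
def one_by_one(generate_pair, max_pairs=5):
    """Iteratively generate (strategy, response) pairs until the model
    signals completion, then join responses into one supporter turn."""
    pairs = []
    while len(pairs) < max_pairs:
        pair = generate_pair(pairs)  # placeholder for the LLM call
        if pair is None:             # assumed completion signal
            break
        pairs.append(pair)
    return pairs, " ".join(resp for _, resp in pairs)

# Scripted toy model: two strategy-response pairs, then stop.
script = [("Reflection", "It sounds really stressful."),
          ("Suggestion", "Maybe a short walk could help."),
          None]
pairs, turn = one_by_one(lambda hist: script[len(hist)])
```

All-in-One would instead emit the whole `pairs` list in one decoding pass.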
[NLP-64] From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives? ACL2026
[Quick Read]: This paper challenges the prevailing practice of treating large language models (LLMs) merely as cost-saving substitutes rather than reliable estimators of human perspectives, re-examining their statistical effectiveness in predicting aggregate subjective opinions of groups relative to human annotators. The key to the solution is framing perspective-taking as the estimation of a latent group-level judgment and identifying the conditions under which LLMs can outperform human annotators, an advantage that stems not from "lived experience" but from structural properties that make LLMs statistically superior estimators, such as low variance and reduced coupling between representation and processing biases. The work further delineates clear regimes where LLMs can serve as frontline estimators and principled limits where human judgment remains essential, repositioning LLMs from a pragmatic compromise to a principled tool for estimating collective human perspectives.
Link: https://arxiv.org/abs/2604.17968
Authors: Hasan Amin, Harry Yizhou Tian, Xiaoni Duan, Chien-Ju Ho, Rajiv Khanna, Ming Yin
Affiliations: Purdue University; Washington University in St. Louis
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2026
Abstract:Although large language models (LLMs) are increasingly used as annotators at scale, they are typically treated as a pragmatic fallback rather than a faithful estimator of human perspectives. This work challenges that presumption. By framing perspective-taking as the estimation of a latent group-level judgment, we characterize the conditions under which modern LLMs can outperform human annotators, including in-group humans, when predicting aggregate subgroup opinions on subjective tasks, and show that these conditions are common in practice. This advantage arises from structural properties of LLMs as estimators, including low variance and reduced coupling between representation and processing biases, rather than any claim of lived experience. Our analysis identifies clear regimes where LLMs act as statistically superior frontline estimators, as well as principled limits where human judgment remains essential. These findings reposition LLMs from a cost-saving compromise to a principled tool for estimating collective human perspectives.
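The statistical intuition behind the low-variance argument, that a slightly biased but low-variance estimator can beat an unbiased, high-variance one in mean squared error (MSE = bias² + variance), can be checked numerically. The numbers below are toy values, not from the paper:

```python
import random
import statistics

def mse(estimates, truth):
    """Mean squared error of a list of point estimates."""
    return statistics.fmean((e - truth) ** 2 for e in estimates)

random.seed(0)
truth = 0.6  # latent group-level judgment

# Unbiased but noisy "human" estimates: MSE ~ 0 + 0.30^2 = 0.09.
humans = [random.gauss(truth, 0.30) for _ in range(20_000)]
# Slightly biased, low-variance "model": MSE ~ 0.05^2 + 0.05^2 = 0.005.
model = [random.gauss(truth + 0.05, 0.05) for _ in range(20_000)]
```

Under these toy settings the biased low-variance estimator wins by more than an order of magnitude in MSE.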
[NLP-65] Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards ACL2026
[Quick Read]: This paper addresses the high construction cost, susceptibility to annotation errors, and predominantly mathematical focus of existing training data for Process Reward Models (PRMs). The core of the solution is a novel, scalable method for generating PRM datasets from planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, the authors generate a corpus of roughly one million reasoning steps across multiple PDDL domains and use it to train PRMs; experiments show that augmenting widely used PRM training sets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning, demonstrating that PDDL problems are an effective and general source of fine-grained, high-quality training data.
Link: https://arxiv.org/abs/2604.17957
Authors: Raffaele Pisano, Roberto Navigli
Affiliations: Babelscape; Sapienza University of Rome
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 (main conference)
Abstract:Process Reward Models (PRMs) have emerged as a powerful tool for providing step-level feedback when evaluating the reasoning of Large Language Models (LLMs), which frequently produce chains of thought (CoTs) containing errors even when the final answer is correct. However, existing PRM datasets remain expensive to construct, prone to annotation errors, and predominantly limited to the mathematical domain. This work introduces a novel and scalable approach to PRM dataset generation based on planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, we generate a corpus of approximately one million reasoning steps across various PDDL domains and use it to train PRMs. Experimental results show that augmenting widely-used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning, as demonstrated across multiple benchmarks. These findings indicate that planning problems constitute a scalable and effective resource for generating robust, precise, and fine-grained training data for PRMs, going beyond the classical mathematical sources that dominate this field.
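A hedged sketch of how plan-grounded step labels could be derived: each reasoning step is marked correct iff the action prefix up to that step is still executable in the domain. The toy validator below stands in for a real PDDL plan validator; this illustrates the idea of machine-verifiable step labels, not the paper's pipeline:

```python
def label_steps(steps, prefix_is_valid):
    """Return a 0/1 label per step: 1 if the plan prefix ending at that
    step is still executable under the domain, else 0."""
    return [1 if prefix_is_valid(steps[: i + 1]) else 0
            for i in range(len(steps))]

# Toy one-hand blocks domain: picking up is only valid hand-empty.
def toy_validator(prefix):
    holding = False
    for action in prefix:
        if action == "pickup":
            if holding:
                return False
            holding = True
        elif action == "putdown":
            holding = False
    return True

labels = label_steps(["pickup", "pickup", "putdown"], toy_validator)
```

Because validity is decided by the domain semantics, the labels need no human annotation, which is what makes the approach scalable.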
[NLP-66] ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering ACL2026
[Quick Read]: This paper targets the challenge agents face when navigating fragmented, multi-source information, in particular the lack of benchmarks reflecting hybrid workflows that combine database querying with external API calls. To fill this gap, the authors introduce the ReCoQA benchmark, comprising 29,270 real-estate instances with machine-verifiable supervision for intermediate steps, including structured intent labels, SQL queries, and API calls. The key to the solution is the HIRE-Agent framework, a hierarchical understand-plan-execute architecture in which a Front-end parser, a planning Supervisor, and execution Specialists collaborate to integrate heterogeneous evidence, achieving more robust performance on complex real-world reasoning tasks.
Link: https://arxiv.org/abs/2604.17944
Authors: Yindong Zhang, Wenmian Yang, Yiquan Zhang, Weijia Jia
Affiliations: Hong Kong Baptist University; Beijing Normal-Hong Kong Baptist University; Beijing Normal University
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2026
Abstract:Developing agents capable of navigating fragmented, multi-source information remains challenging, primarily due to the scarcity of benchmarks reflecting hybrid workflows combining database querying with external APIs. To bridge this gap, we introduce ReCoQA, a large-scale benchmark of 29,270 real-estate instances featuring machine-verifiable supervision for intermediate steps, including structured intent labels, SQL queries, and API calls. Complementarily, we propose HIRE-Agent, a hierarchical framework instantiating an understand-plan-execute architecture as a strong baseline. By orchestrating a Front-end parser, a planning Supervisor, and execution Specialists, HIRE-Agent effectively integrates heterogeneous evidence. Extensive experiments demonstrate that HIRE-Agent constitutes a strong baseline and substantiates the necessity of hierarchical collaboration for complex, real-world reasoning tasks.
[NLP-67] Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents
[Quick Read]: This paper addresses the problem that open-domain Retrieval-Augmented Generation (RAG) benchmarks over public corpora can overestimate deployment performance, mainly due to pretraining-data overlap and weak attribution requirements. The proposed DoRA (Domain-oriented RAG Assessment) is a domain-grounded benchmark built from defense documents, whose core innovation is pairing synthetic, intent-conditioned question-answer (QA) instances with auditable evidence passages for strong attribution. DoRA covers five question types (find, explain, summarize, generate, provide) and contains 6.5K curated instances; end-to-end evaluation with a fixed dense retriever demonstrates the value of contamination-aware regression testing under domain shift: compared with the base model (Llama3.1-8B-Instruct), a model fine-tuned on DoRA (DoRA SFT) improves QA task success by up to 26% while reducing the hallucination rate in RAG faithfulness scores by 47%.
Link: https://arxiv.org/abs/2604.17943
Authors: Bao Gia Doan, Aditya Joshi, Pantelis Elinas, Aarya Bodhankar, Oscar Leslie, Tom Marchant, Flora Salim
Affiliations: UNSW Sydney; Cyndr AI
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Open-domain RAG benchmarks over public corpora can overestimate deployment performance due to pretraining overlap and weak attribution requirements. We present DoRA (Domain-oriented RAG Assessment), a domain-grounded benchmark built from defense documents that pairs synthetic, intent-conditioned QA (question answering) with auditable evidence passages for attribution. DoRA covers five question types (find, explain, summarize, generate, provide) and contains 6.5K curated instances. In end-to-end evaluation with a fixed dense retriever, general-purpose Language Models (LMs) perform similarly, while a model trained on DoRA (DoRA SFT) yields large gains over the base model (Llama3.1-8B-Instruct): up to 26% improvement in QA task success, while reducing the hallucination rate by 47% in RAG faithfulness scores, supporting contamination-aware regression testing under domain shift.
[NLP-68] From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models ACL2026
[Quick Read]: This paper addresses two shortcomings of neuron-importance analysis in multi-task vision-language models (VLMs): the lack of comparability of neuron importance across tasks, and the neglect of task-dependent information pathways, which exacerbate neuron polysemanticity in multi-task settings and thus hinder the identification of and intervention on critical neurons. The key to the solution is HONES (Head-Oriented Neuron Explanation Steering), a gradient-free, task-aware framework for neuron attribution and steering whose core mechanism measures the causal write-in contributions of feed-forward network (FFN) neurons conditioned on task-relevant attention heads, and modulates salient neurons via lightweight scaling, improving both the identifiability of task-critical neurons and model performance.
Link: https://arxiv.org/abs/2604.17941
Authors: Qidong Wang, Junjie Hu, Ming Jiang
Affiliations: Tongji University; University of Wisconsin-Madison
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: ACL 2026 Findings
Abstract:Recent work has increasingly explored neuron-level interpretation in vision-language models (VLMs) to identify neurons critical to final predictions. However, existing neuron analyses generally focus on single tasks, limiting the comparability of neuron importance across tasks. Moreover, ranking strategies tend to score neurons in isolation, overlooking how task-dependent information pathways shape the write-in effects of feed-forward network (FFN) neurons. This oversight can exacerbate neuron polysemanticity in multi-task settings, introducing noise into the identification and intervention of task-critical neurons. In this study, we propose HONES (Head-Oriented Neuron Explanation Steering), a gradient-free framework for task-aware neuron attribution and steering in multi-task VLMs. HONES ranks FFN neurons by their causal write-in contributions conditioned on task-relevant attention heads, and further modulates salient neurons via lightweight scaling. Experiments on four diverse multimodal tasks and two popular VLMs show that HONES outperforms existing methods in identifying task-critical neurons and improves model performance after steering. Our source code is released at: this https URL.
[NLP-69] Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck? ACL’26
[Quick Read]: This paper investigates a striking disparity in the formal linguistic competence of large language models (LLMs): models master some grammatical phenomena near-perfectly yet perform below chance on others, even after training on trillions of tokens. The study finds that this gap stems not from fundamental architectural limitations but from the scarcity of the relevant grammatical constructions in pre-training corpora. The key to the solution is injecting a small amount (1%) of targeted synthetic data to strengthen the model's learning of specific linguistic phenomena; experiments show that this intervention substantially improves 8 of the 9 worst-performing BLiMP paradigms, e.g., raising accuracy on the "only_npi_scope" paradigm from 20.9% to 69.4%, while aggregate performance remains stable or slightly improves. This demonstrates that even small models can greatly improve on phenomena where language models typically perform poorly, given sufficient exposure to relevant data, highlighting the pivotal role of data composition in human-scale language modeling.
Link: https://arxiv.org/abs/2604.17930
Authors: H S V N S Kowndinya Renduchintala, Sumit Bhatia
Affiliations: Adobe Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACL'26 (Findings)
Abstract:Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms - notably the accuracy on a specific paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that these interventions generally preserve or slightly improve aggregate performance. However, while we also identify a resistant phenomenon, principle_A_c_command, whose performance remains below chance even after our data augmentation, our findings do serve as an optimistic existence proof that even small language models can substantially improve on those linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly by focusing on data composition. The code to reproduce our results is open-sourced at this https URL.
[NLP-70] Automatic Slide Updating with User-Defined Dynamic Templates and Natural Language Instructions ACL2026
[Quick Read]: This paper addresses the difficulty of dynamically updating presentation slides under user-provided templates (bring-your-own-template, BYO-template) in business reporting scenarios. Existing automation mostly relies on fixed template filling and struggles to adapt to diverse user-authored content, leaving the update process heavily dependent on manual work. To address this, the authors define the task "Dynamic Slide Update via Natural Language Instructions" and build DynaSlide, a large-scale real-world dataset of 20,036 instruction-execution triples in which every update is grounded in a shared external database. The core of the solution is the SlideAgent framework, an agent-based architecture that combines multimodal slide parsing, natural language instruction grounding, and tool-augmented reasoning over tables, charts, and textual conclusions, updating content precisely while preserving the original layout and style and providing a strong baseline for the task.
Link: https://arxiv.org/abs/2604.17894
Authors: Kun Zhou, Jiakai He, Wenmian Yang, Zhensheng Wang, Yiquan Zhang, Weijia Jia
Affiliations: Beijing Normal University; Beijing Normal-Hong Kong Baptist University
Subjects: Computation and Language (cs.CL)
Comments: To appear in Findings of the Association for Computational Linguistics (ACL 2026)
Abstract:Presentation slides are a primary medium for data-driven reporting, yet keeping complex, analytics-style decks up to date remains labor-intensive. Existing automation methods mostly follow fixed template filling and cannot support dynamic updates for diverse, user-authored slide decks. We therefore define “Dynamic Slide Update via Natural Language Instructions on User-provided Templates” and introduce DynaSlide, a large-scale benchmark with 20,036 real-world instruction-execution triples (source slide, user instruction, target slide) grounded in a shared external database and built from business reporting slides under bring-your-own-template (BYO-template) conditions. To tackle this task, we propose SlideAgent, an agent-based framework that combines multimodal slide parsing, natural language instruction grounding, and tool-augmented reasoning for tables, charts, and textual conclusions. SlideAgent updates content while preserving layout and style, providing a strong reference baseline on DynaSlide. We further design end-to-end and component-level evaluation protocols that reveal key challenges and opportunities for future research. The dataset and code are available at this https URL.
[NLP-71] Latent Preference Modeling for Cross-Session Personalized Tool Calling
[Quick Read]: This paper addresses the under-specified-input problem in tool calling by agents driven by large language models (LLMs): users often omit essential details in their requests, limiting agents' ability to execute API calls accurately in real interactions. The key to the solution is PRefine, a test-time memory-augmented method that represents user preferences as evolving hypotheses and extracts reusable constraints from dialogue history via a generate-verify-refine loop, substantially improving tool-calling accuracy while using only 1.24% of the tokens required by full-history prompting. This mechanism suggests that robust personalization in agentic systems depends on memory of the reasons behind user choices, not just the choices themselves.
Link: https://arxiv.org/abs/2604.17886
Authors: Yejin Yoon, Minseo Kim, Taeuk Kim
Affiliations: Hanyang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review. 25 pages, 10 figures, 16 tables
Abstract:Users often omit essential details in their requests to LLM-based agents, resulting in under-specified inputs for tool use. This poses a fundamental challenge for tool-augmented agents, as API execution typically requires complete arguments, highlighting the need for personalized tool calling. To study this problem, we introduce MPT, a benchmark comprising 265 multi-session dialogues that cover three challenges: Preference Recall, Preference Induction, and Preference Transfer. We also propose PRefine, a test-time memory-augmented method that represents user preferences as evolving hypotheses. Through a generate–verify–refine loop, it extracts reusable constraints from history and improves tool-calling accuracy while using only 1.24% of the tokens required by full-history prompting. These results indicate that robust personalization in agentic systems depends on memory that captures the reasons behind user choices, not just the choices themselves.
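A minimal sketch of the generate-verify-refine idea: a preference hypothesis extracted from each session is kept only while it stays consistent with the history, otherwise it is refined before being stored. All helpers here are illustrative placeholders for LLM calls, not PRefine's actual interface:

```python
def refine_preferences(sessions, generate, verify, refine):
    """Maintain user-preference hypotheses across sessions: generate one
    from each session, keep it if it verifies against the history so
    far, otherwise refine it before storing."""
    memory = []
    seen = []
    for session in sessions:
        seen.append(session)
        hyp = generate(session)       # placeholder LLM call
        if not verify(hyp, seen):
            hyp = refine(hyp, seen)   # placeholder LLM call
        memory.append(hyp)
    return memory

sessions = ["books flight, aisle seat", "books flight, aisle seat",
            "books flight, window seat"]
memory = refine_preferences(
    sessions,
    generate=lambda s: s.split(", ")[1],
    verify=lambda h, seen: all(h in s for s in seen),
    refine=lambda h, seen: "no fixed seat preference",
)
```

The stored hypotheses, rather than the raw transcripts, are what the agent later consults, which is where the token savings come from.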
[NLP-72] GraSP: Graph-Structured Skill Compositions for LLM Agents
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)智能体在技能生态系统中面临的核心瓶颈问题:尽管可用技能数量迅速增长,但单纯增加技能并未带来性能的单调提升,反而可能因技能冗余或无序组合导致执行效率下降。其关键在于从“技能丰富”转向“技能结构化编排”,即构建一种能够显式建模技能间因果依赖关系的执行机制。解决方案的核心是提出 GraSP(Graph-based Skill Planning),首次引入编译层将扁平技能集转化为带类型约束的有向无环图(Directed Acyclic Graph, DAG),通过节点级验证与局部修复操作实现高效、可靠的技能选择、组合与执行,从而将重规划复杂度从 O(N) 降至 O(d^h),显著提升了任务完成率与环境交互效率。
链接: https://arxiv.org/abs/2604.17870
作者: Tianle Xia,Lingxiang Hu,Yiding Sun,Ming Xu,Lan Xu,Siying Wang,Wei Xu,Jie Jiang
机构: Tencent(腾讯)
类目: Computation and Language (cs.CL)
备注:
Abstract:Skill ecosystems for LLM agents have matured rapidly, yet recent benchmarks show that providing agents with more skills does not monotonically improve performance – focused sets of 2-3 skills outperform comprehensive documentation, and excessive skills actually hurt. The bottleneck has shifted from skill availability to skill orchestration: agents need not more skills, but a structural mechanism to select, compose, and execute them with explicit causal dependencies. We propose GraSP, the first executable skill graph architecture that introduces a compilation layer between skill retrieval and execution. GraSP transforms flat skill sets into typed directed acyclic graphs (DAGs) with precondition-effect edges, executes them with node-level verification, and performs locality-bounded repair through five typed operators – reducing replanning from O(N) to O(d^h). Across ALFWorld, ScienceWorld, WebShop, and InterCode with eight LLM backbones, GraSP outperforms ReAct, Reflexion, ExpeL, and flat skill baselines in every configuration, improving reward by up to +19 points over the strongest baseline while cutting environment steps by up to 41%. GraSP’s advantage grows with task complexity and is robust to both skill over-retrieval and quality degradation, confirming that structured orchestration – not larger skill libraries – is the key to reliable agent execution.
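作为示意,下面用纯 Python 勾勒摘要中"前置条件—效果边 + 拓扑执行 + 节点级校验"的思路:依据效果与前置条件的交集把扁平技能集编译成 DAG,再按拓扑序执行并在每个节点校验前置条件。技能名称与领域均为假设示例,并非论文官方实现:

```python
from collections import deque

# Hypothetical skill set in the spirit of GraSP's typed DAG:
# each skill has preconditions ("pre") and effects ("eff").
skills = {
    "search_item": {"pre": set(),               "eff": {"results_listed"}},
    "open_item":   {"pre": {"results_listed"},  "eff": {"item_page"}},
    "add_to_cart": {"pre": {"item_page"},       "eff": {"in_cart"}},
    "checkout":    {"pre": {"in_cart"},         "eff": {"order_placed"}},
}

def compile_dag(skills):
    """Link skill A -> B whenever an effect of A satisfies a precondition of B."""
    edges = {name: set() for name in skills}
    for a, sa in skills.items():
        for b, sb in skills.items():
            if a != b and sa["eff"] & sb["pre"]:
                edges[a].add(b)
    return edges

def execute(skills, edges, goal):
    """Topological execution with node-level verification of preconditions."""
    indeg = {n: 0 for n in skills}
    for a in edges:
        for b in edges[a]:
            indeg[b] += 1
    ready = deque(n for n, d in indeg.items() if d == 0)
    state, trace = set(), []
    while ready:
        n = ready.popleft()
        assert skills[n]["pre"] <= state, f"precondition failed at {n}"  # verify node
        state |= skills[n]["eff"]
        trace.append(n)
        for b in edges[n]:
            indeg[b] -= 1
            if indeg[b] == 0:
                ready.append(b)
    return trace, goal <= state

trace, ok = execute(skills, compile_dag(skills), goal={"order_placed"})
```

局部修复算子与类型系统未在此示意;真实系统中校验失败会触发 locality-bounded repair 而非直接中断。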
[NLP-73] Latent Abstraction for Retrieval-Augmented Generation
【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统中存在的两个核心问题:一是依赖生成自然语言查询进行每一轮检索,限制了模型对知识的高效利用;二是严格分离检索器与生成器架构,导致无法充分挖掘大语言模型(Large Language Models, LLMs)的联合表征能力。解决方案的关键在于提出一种统一框架 LAnR(Latent Abstraction for RAG),其核心创新是让单一 LLM 在自身隐空间中联合完成编码、检索与生成任务——通过特定标记 [PRED] 的隐藏状态直接生成稠密检索向量,用于匹配同一模型编码的文档表示,从而避免文本查询生成;同时引入轻量级多层感知机(MLP)控制头基于相同隐藏状态自适应判断是否已获取足够证据,消除了独立检索模块和显式的token级停止推理机制。实验证明,LAnR在六项问答基准上优于现有RAG方法,并因减少检索调用次数和更紧密的模型集成提升了推理效率。
链接: https://arxiv.org/abs/2604.17866
作者: Ha Lan N.T,Minh-Anh Nguyen,Dung D. Le
机构: VinUniversity (VinUniversity)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has become a standard approach for enhancing large language models (LLMs) with external knowledge, mitigating hallucinations, and improving factuality. However, existing systems rely on generating natural language queries at each hop and maintaining a strict architectural separation between retriever and generator, preventing them from leveraging the full representational capacity of the LLM. We propose LAnR (Latent Abstraction for RAG), a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated [PRED] token and uses them to match against encoded document representations from the same model. Furthermore, LAnR adaptively decides when sufficient evidence has been retrieved using a lightweight MLP control head over those same hidden states, eliminating both the separate retriever and explicit token-level stopping reasoning. This design is motivated by our empirical observation that answer token entropy reliably signals retrieval sufficiency. Extensive experiments on six QA benchmarks spanning single-hop and multi-hop settings demonstrate that LAnR outperforms existing RAG methods, while achieving improved inference efficiency through a reduced number of retrieval calls and tighter model integration.
[NLP-74] On the Emergence of Syntax by Means of Local Interaction
【速读】: 该论文旨在解决“语法处理是否能从纯粹局部交互中自发涌现”这一基础问题,即探索在无显式监督语法规则的情况下,系统能否通过自组织形成具备语法解析能力的内部结构。解决方案的关键在于设计了一个仅含18,658个参数的二维神经细胞自动机(Neural Cellular Automaton, NCA),其训练信号仅为一个1比特的边界信号,并通过学习算术表达式文法的成员资格问题来驱动系统演化。训练后,该系统内部网格自发形成一种称为Proto-CKY的空间扩展表示结构,该结构满足三个语法处理的核心标准:超越正则语言的表达能力、对训练分布外结构的泛化能力,以及与语法结构高度一致的内部组织(Pearson相关系数r ≈ 0.71)。Proto-CKY虽功能上类比于CKY算法,但形式上独立于传统算法,是一种物理原型(physical prototype),体现了数学理想在物理载体上的具体实现,其与理论算法之间的系统性差异本身也携带了关于计算基质的信息。
链接: https://arxiv.org/abs/2604.17857
作者: Zichao Wei
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Can syntactic processing emerge spontaneously from purely local interaction? We present a concrete instance on a minimal system: an 18,658-parameter two-dimensional neural cellular automaton (NCA), supervised by nothing more than a 1-bit boundary signal, is trained on the membership problem of an arithmetic-expression grammar. After training, its internal L × L grid spontaneously self-organizes into an ordered, spatially extended representation that we name Proto-CKY. This representation satisfies three operational criteria for syntactic processing: expressive power beyond the regular languages, structural generalization beyond the training distribution, and an internal organization quantitatively aligned with grammatical structure (Pearson r ≈ 0.71). It emerges independently on four context-free grammars and regenerates spontaneously after perturbation. Proto-CKY is functionally aligned with the CKY algorithm but formally distinct from it: it is a physical prototype, a concrete instantiation of a mathematical ideal on a physical substrate, and the systematic distance between the two carries information about the substrate itself.
[NLP-75] QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
【速读】: 该论文旨在解决动态大语言模型(Large Language Model, LLM)评估中难以高效识别困难问题的问题。传统基准测试通常包含固定题集,而现代动态基准通过模板和参数生成无限变体,虽具灵活性但导致评估成本高,尤其在需要精准定位模型薄弱环节时更为显著。解决方案的关键在于引入并改进COUP算法——一种基于贝叶斯优化(Bayesian optimization)的采样策略,并针对实际LLM流水线需求进行多项实质性调整,从而提升发现真正困难问题的样本效率;同时,作者开发了名为QuickScope的工具框架,支持灵活选择数据集与效用函数,使用户可针对性地挖掘特定类型难题(如低准确率问题或相对复杂度异常高的问题),并在多基准实验中验证其相较于标准基线方法能更高效地识别困难样本并减少因噪声带来的假阳性结果。
链接: https://arxiv.org/abs/2604.17842
作者: Taylor Lundy,Narun K. Raman,Kevin Leyton-Brown
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 figures
Abstract:LLM benchmarks are increasingly dynamic: instead of containing a fixed set of questions, they define templates and parameters that can generate an effectively unlimited number of question variants. This flexibility is valuable, but it makes evaluation expensive – especially when the goal is not just determining an average score, but reliably identifying a model’s weak spots. This paper introduces a new methodology for identifying hard questions in dynamic benchmarks. It leverages COUP, a recent Bayesian optimization algorithm (Graham, Velez, and Leyton-Brown, 2026), after introducing several substantive modifications to make the algorithm suitable for practical LLM pipelines. We also wrap it in a tool that supports flexible choices of datasets and utility functions, enabling users to target the kinds of questions they care about (e.g., low-accuracy questions; questions that are unusually hard relative to their measured complexity). In experiments across a range of benchmarks, we show that our method, dubbed QuickScope, discovers truly difficult questions more sample-efficiently than standard baselines, while also reducing false positives from noisy outcomes.
[NLP-76] Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs
【速读】: 该论文旨在解决混合专家模型(Mixture-of-Experts, MoE)中专家路径语义混杂、难以解释的问题,特别是如何识别和理解不同输入 token 在多层网络中因路由机制而形成的差异化处理轨迹。其解决方案的关键在于提出一种无参数分解方法,将每一层的隐藏状态(hidden state)分离为两个正交子空间:一个控制信号(control signal),用于因果驱动专家路由决策;另一个内容通道(content channel),承载表层特征(如语言、词元身份、位置信息)且对路由器不可见。研究表明,控制信号编码抽象功能并随层变化旋转,迫使各层间进行组合式专业化分工,使得原本多义的专家路径在整体上呈现单义性(monosemantic),即同一 token 根据上下文沿不同轨迹被分配至特定语义功能的专家路径。因此,该分解揭示了 MoE 中可解释性的自然单元不是单个专家,而是由控制子空间中的聚类所定义的专家路径轨迹。
链接: https://arxiv.org/abs/2604.17837
作者: Charles Ye,Bo Yuan,Lee Sharkey
机构: Georgia Institute of Technology (佐治亚理工学院); Goodfire (Goodfire)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:An LLM’s residual stream is both state and instruction: it encodes the current context and determines the next transformation. We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer’s hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features (language, token identity, position) in the content channel, while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., “:”) follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator. Our decomposition identifies the source of this structure: clusters in the control subspace are substantially more monosemantic than those in the full representation. As a result, the natural unit of interpretability in MoEs is not the expert but the trajectory.
[NLP-77] How Non-Linguistic Is the Indus Sign System? A Synthetic-Baseline Scorecard
【速读】: 该论文旨在解决印度河谷符号系统(Indus Valley sign system)是否编码口语语言这一长期争议问题。其解决方案的关键在于提出了一种多指标判别框架,通过将印度河谷文本 corpus 与两类计算机生成的非语言基线进行对比:一类模拟纹章系统,另一类模拟行政编码系统,两者均基于六组已知非语言语料库校准的 Zipf 频率分布、位置约束和二元组依赖关系。该框架评估了四个核心属性——文本简短性、重复公式化短语、罕见词率(hapax legomenon rate)和位置刚性——这些正是 Farmer-Sproat-Witzel(2004)批评中的关键点。结果显示,印度河谷语料在四项指标上均未完全匹配任一基线,处于两者之间,且无已知非语言系统能同时再现全部统计特征,从而为该符号系统可能具有语言性质提供了定量支持。
链接: https://arxiv.org/abs/2604.17828
作者: Ashish Nair
机构: Independent Researcher
类目: Computation and Language (cs.CL)
备注: 13 pages, 4 figures, 8 tables. Code available from corresponding author upon request
Abstract:Whether the Indus Valley sign system (c. 2600-1900 BCE) encodes spoken language has been debated for decades. This paper introduces a multi-metric discrimination framework that tests the observed Indus corpus against two kinds of computer-generated non-linguistic baseline – one mimicking a heraldic emblem system, the other an administrative coding system – each calibrated with Zipfian frequency distributions, positional constraints, and bigram dependencies derived from six attested non-linguistic corpora. The scorecard evaluates four properties central to the Farmer-Sproat-Witzel (2004) critique: text brevity, repeated formulaic phrases, hapax legomenon rate, and positional rigidity. Applying this framework to 1,916 deduplicated inscriptions (584 unique signs, 11,110 tokens) from the ICIT/Yajnadevam digitization, we find that the Indus corpus does not match either baseline cleanly. Across the four metrics examined, the Indus corpus occupies an intermediate position relative to the two baseline families, matching neither cleanly. Neither a heraldic nor an administrative generator can reproduce all four properties at once. We also compare against seven real-world non-linguistic corpora including Sproat’s (2014) datasets, finding that no attested non-linguistic system reproduces the full Indus statistical profile either. We replicate key prior results including a Zipf slope of -1.49 and conditional entropy of 3.23 bits. All code and data are publicly available.
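摘要中反复出现的两项统计指标(Zipf 斜率与二元组条件熵)可用如下极简代码说明其计算方式。语料为玩具符号序列,数值并不对应论文中报告的 -1.49 与 3.23 比特:

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r + 1) for r in range(len(freqs))]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def conditional_entropy(tokens):
    """H(next sign | current sign) in bits, estimated from bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    total = len(tokens) - 1
    h = 0.0
    for (a, b), c in bigrams.items():
        p_ab = c / total            # joint probability of the bigram
        p_b_given_a = c / unigrams[a]
        h -= p_ab * math.log2(p_b_given_a)
    return h

# Toy synthetic "sign sequence" -- not Indus data.
seq = list("AABABCAABACDAABABE")
slope = zipf_slope(seq)     # negative for a Zipf-like frequency profile
h_cond = conditional_entropy(seq)
```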
[NLP-78] Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在成本与隐私方面的挑战,以及小语言模型(Small Language Models, SLMs)因能力有限导致的性能瓶颈问题。其核心解决方案是提出一种动态协作框架,其中SLM能够主动决策在多步推理过程中何时向LLM请求帮助,而LLM则提供自适应反馈而非作为被动工具。该框架的关键在于通过学习动态协作策略,实现SLM与LLM之间的高效协同,从而在满足效率和隐私约束的前提下,显著提升整体推理性能,并展现出对未见LLM的良好迁移能力。
链接: https://arxiv.org/abs/2604.17827
作者: Hang Zeng,Xiangyu Liu,Yong Hu,Chaoyue Niu,Jiarui Zhang,Shaojie Tang,Fan Wu,Guihai Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 content pages
Abstract:Large language models (LLMs) offer strong capabilities but raise cost and privacy concerns, whereas small language models (SLMs) facilitate efficient and private local inference yet suffer from limited capacity. To synergize the complementary strengths, we introduce a dynamic collaboration framework, where an SLM learns to proactively decide how to request an LLM during multi-step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. We further systematically investigate how collaboration strategies are shaped by SLM and LLM capabilities as well as efficiency and privacy constraints. Evaluation results reveal a distinct scaling effect: stronger SLMs become more self-reliant, while stronger LLMs enable fewer and more informative interactions. In addition, the learned dynamic collaboration strategies significantly outperform static pipelines and standalone inference, and transfer robustly to unseen LLMs.
[NLP-79] A novel LSTM music generator based on the fractional time-frequency feature extraction
【速读】: 该论文旨在解决如何利用人工智能(AI)系统生成高质量音乐的问题,尤其关注从音乐信号中提取有效特征并基于此进行预测与生成。其解决方案的关键在于结合分数阶傅里叶变换(fractional Fourier transform, FrFT)与长短期记忆(long short-term memory, LSTM)网络:FrFT用于在时频域上提取音乐信号的谱特征,而LSTM则利用这些特征及实时输入进行序列建模和音乐生成,最终实现与人类创作音乐质量相当的自动化音乐合成。
链接: https://arxiv.org/abs/2604.17823
作者: Li Ya,Chen Wei,Li Xiulai,Yu Lei,Deng Xinyi,Chen Chaofan
机构: Hainan Normal University (海南师范大学); ACRSEA (中国科学院南海海洋研究所)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This work was supported by Hainan Provincial Natural Science Foundation of China (Grant No. 723QN238)
Abstract:In this paper, we propose a novel approach for generating music based on an artificial intelligence (AI) system. We analyze the features of music and use them to fit and predict the music. The fractional Fourier transform (FrFT) and the long short-term memory (LSTM) network are the foundations of our method. The FrFT method is used to extract the spectral features of a music piece, where the music signal is expressed on the time and frequency domains. The LSTM network is used to generate new music based on the extracted features, where we predict the music according to the hidden layer features and real-time inputs using the GiantMIDI-Piano dataset. The results of our experiments show that our proposed system is capable of generating high-quality music that is comparable to human-generated music.
[NLP-80] PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在理论心智(Theory-of-Mind, ToM)基准测试中表现远低于人类水平的问题,尤其是当采用链式思维提示(chain-of-thought prompting)或概率信念更新等方法时仍存在显著性能瓶颈。研究表明,这些问题主要源于模型在隐式状态跟踪上的不可靠性,而非高阶推理能力的不足。解决方案的关键在于提出一种神经符号框架 PDDL-Mind,其核心创新是将环境状态演进与信念推理解耦:通过将叙事描述转化为以规划领域定义语言(Planning Domain Definition Language, PDDL)表达的显式状态和动作,并利用预定义领域验证动作引发的状态转移,从而为 LLM 提供逻辑一致且明确的世界状态表示,显著提升了 ToM 任务的准确性,在 MMToM-QA、MuMA 和 FanToM 等基准上相较最优现有方法实现超过 5% 的绝对准确率提升。
链接: https://arxiv.org/abs/2604.17819
作者: Wang Bill Zhu,Qiutong Tony Yi,Robin Jia,Jesse Thomason
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) perform substantially below human level on existing theory-of-mind (ToM) benchmarks, even when augmented with chain-of-thought prompting or probabilistic belief updates. We argue that these failures primarily arise from unreliable implicit state tracking rather than limitations in high-level reasoning. We introduce PDDL-Mind, a neuro-symbolic framework that decouples environment state evolution from belief inference. By translating narrative descriptions into explicit states and actions expressed in Planning Domain Definition Language (PDDL), and by verifying action-induced state transitions against a predefined domain, PDDL-Mind provides LLMs with a logically consistent and explicit representation of world states for ToM tasks. Experiments on MMToM-QA, MuMA and FanToM show that PDDL-Mind achieves over 5% absolute accuracy gain over the best existing state-of-the-art method on ToM benchmark questions.
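下面以一个极简的 Python 片段示意"显式状态 + 前置条件校验"的状态跟踪思路:状态为事实集合,动作只有在前置条件满足时才被应用。谓词与动作均为假设示例,并非论文的 PDDL 实现:

```python
# Illustrative sketch of explicit state tracking in the style PDDL-Mind
# describes: an action is applied only if its preconditions hold, so the
# world state stays logically consistent. Names are hypothetical.
def apply_action(state, action):
    if not action["pre"] <= state:
        raise ValueError(f"invalid transition: {action['name']}")
    return (state - action["del"]) | action["add"]

move_apple = {
    "name": "move(apple, fridge, cabinet)",
    "pre": {"in(apple, fridge)"},
    "del": {"in(apple, fridge)"},
    "add": {"in(apple, cabinet)"},
}

world = {"in(apple, fridge)"}
world = apply_action(world, move_apple)

# An agent who did not observe the move updates her belief only from
# observed transitions -- the false-belief situation a ToM question probes.
sally_belief = {"in(apple, fridge)"}
```

真实系统中,校验由 PDDL 领域定义驱动,失败的转移会被拒绝并反馈给 LLM。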
[NLP-81] Enabling AI ASICs for Zero Knowledge Proof
【速读】: 该论文旨在解决零知识证明(Zero-knowledge Proof, ZKP)中多标量乘法(Multi-scalar Multiplication, MSM)和数论变换(Number-theoretic Transform, NTT)计算成本高昂的问题,其核心瓶颈在于这些操作在传统硬件上缺乏高效执行能力。解决方案的关键在于提出MORPH框架,该框架通过两个层面的创新重构ZKP核函数以适配AI专用集成电路(AI ASICs,如TPU)的执行特性:首先,在算术层面上,引入基于MXU(Matrix eXtension Unit)的扩展RNS延迟约简机制,将高精度模运算转化为密集低精度矩阵乘法(GEMM),从而消除所有进位链;其次,在数据流层面上,设计统一分片布局的TPU Pippenger MSM与优化的3/5步NTT算法,避免TPU内部的数据重排开销,最小化内存重组代价。这一硬件感知的设计显著提升了TPUv6e8上的NTT吞吐量(最高达10倍)并保持MSM性能竞争力。
链接: https://arxiv.org/abs/2604.17808
作者: Jianming Tong,Jingtian Dang,Simon Langowski,Tianhao Huang,Asra Ali,Jeremy Kun,Jevin Jiang,Srinivas Devadas,Tushar Krishna
机构: Georgia Institute of Technology(佐治亚理工学院); Massachusetts Institute of Technology(麻省理工学院); Google(谷歌)
类目: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Programming Languages (cs.PL)
备注: Design Automation Conference 2026
Abstract:Zero-knowledge proof (ZKP) provers remain costly because multi-scalar multiplication (MSM) and number-theoretic transforms (NTTs) dominate runtime as they need significant computation. AI ASICs such as TPUs provide massive matrix throughput and state-of-the-art energy efficiency. We present MORPH, the first framework that reformulates ZKP kernels to match AI-ASIC execution. We introduce Big-T complexity, a hardware-aware complexity model that exposes heterogeneous bottlenecks and layout-transformation costs ignored by Big-O. Guided by this analysis, (1) at the arithmetic level, MORPH develops an MXU-centric extended-RNS lazy reduction that converts high-precision modular arithmetic into dense low-precision GEMMs, eliminating all carry chains, and (2) at the dataflow level, MORPH constructs a unified-sharding layout-stationary TPU Pippenger MSM and optimized 3/5-step NTT that avoid on-TPU shuffles to minimize costly memory reorganization. Implemented in JAX, MORPH enables TPUv6e8 to achieve up to 10x higher throughput on NTT and comparable throughput on MSM relative to GZKP. Our code: this https URL.
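摘要中"将高精度模运算转化为低精度独立运算"依赖于剩余数系统(RNS):大数乘法在各残差通道上独立进行,再由中国剩余定理(CRT)重构,通道间没有进位链。下面是一个与硬件无关的最小示例,模数为玩具值,并非论文所用参数:

```python
from math import prod

# Minimal residue-number-system (RNS) sketch: one large modular
# multiplication becomes independent small multiplications, one per
# residue channel -- the carry-free property that lets such arithmetic
# map onto dense low-precision matrix units.
MODULI = (251, 241, 239, 233)          # pairwise coprime toy limbs
M = prod(MODULI)

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # Channel-wise multiplication: no carries cross channels.
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(r):
    """Chinese Remainder Theorem reconstruction."""
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return x % M

a, b = 123_456, 789_012
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % M
```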
[NLP-82] Bridging the Reasoning Gap in Vietnamese with Small Language Models via Test-Time Scaling
【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在资源受限设备上部署时面临的“推理差距”问题,特别是在非英语语种如越南语中的连贯思维链维持困难。其关键解决方案在于通过监督微调(Supervised Fine-Tuning, SFT)显著提升模型的解释质量(Explanation Quality),从而弥合原始计算能力与教学一致性之间的“格式差距”;同时发现,相较于复杂的代理式工作流(如ReAct),采用简化测试时扩展策略(Test-Time Scaling)结合纯思维链(Chain-of-Thought, CoT)与自一致性(Self-Consistency)方法更适配1.7B参数规模的SLMs,在边缘推理场景下表现更优。
链接: https://arxiv.org/abs/2604.17794
作者: Bui The Trung,Do Minh Duc,Nguyen Van Vinh,Bui Nguyen Quoc Trinh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: FJICAI conference
Abstract:The democratization of ubiquitous AI hinges on deploying sophisticated reasoning capabilities on resource-constrained devices. However, Small Language Models (SLMs) often face a “reasoning gap”, particularly in non-English languages like Vietnamese, where they struggle to maintain coherent chains of thought. This paper investigates Test-Time Scaling strategies for the Qwen3-1.7B architecture within the context of Vietnamese Elementary Mathematics. We introduce Vi-S1K, a high-fidelity reasoning dataset localized via a Gemini 2.5 Flash-Lite powered pipeline, and Vi-Elementary-Bench, a dual-resource benchmark for rigorous evaluation. Using an LLM-as-a-Judge protocol, we reveal that the base model possesses robust latent knowledge (Accuracy: 4.05/5.00) but suffers from a severe “formatting gap” in communication. Supervised Fine-Tuning (SFT) acts as a critical “reasoning unlocker”, yielding a 77% improvement in Explanation Quality and bridging the gap between raw calculation and pedagogical coherence. Furthermore, our analysis of prompting strategies uncovers a significant trade-off: structured frameworks like ReAct impose a “cognitive tax” on the 1.7B parameter capacity, degrading performance relative to pure Chain-of-Thought (CoT) combined with Self-Consistency. These findings establish a deployment hierarchy for SLMs, demonstrating that SFT combined with simplified test-time scaling is superior to complex agentic workflows for edge-based reasoning.
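摘要中与纯 CoT 搭配的自一致性(Self-Consistency)策略,本质是对多次采样的最终答案做多数投票,可示意如下。其中 `sample_cot` 是假设的模型调用接口,玩具输出仅用于演示:

```python
from collections import Counter

# Minimal self-consistency sketch: sample several chain-of-thought
# completions, extract each final answer, and return the majority vote.
def self_consistency(sample_cot, question, n_samples=5):
    answers = [sample_cot(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for the fine-tuned SLM: answers "12" three times out of five.
fake_outputs = iter(["12", "14", "12", "12", "10"])
answer = self_consistency(lambda q: next(fake_outputs), "3 x 4 = ?")
# answer == "12"
```

相比 ReAct 等结构化框架,这种测试时扩展不要求模型额外遵循工具调用格式,这正是论文认为其更适合 1.7B 级小模型的原因。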
[NLP-83] DuQuant: Fine-grained Rotation Enhances Microscaling FP4 Quantization
【速读】: 该论文旨在解决MXFP4(Microscaling Format with 32-element blocks and shared E8M0 scaling factor)量化格式下激活异常值(activation outliers)导致的显著量化误差问题:单个异常值会放大整个块的缩放因子,压缩其余元素的有效动态范围。现有基于旋转的方法(如随机Hadamard和可学习旋转)因数据无关性无法针对性处理异常值集中分布的通道。解决方案的关键在于提出DuQuant++,其将原DuQuant中面向异常值的细粒度旋转机制适配至MXFP4格式,通过将旋转块大小与微缩放组大小(B=32)对齐,并利用MXFP4每个分组独立缩放因子的特性,消除了跨块方差问题,从而用单一异常值感知旋转替代原有双旋转+锯齿排列的复杂流程,实现在线旋转计算成本减半的同时改善权重分布均匀性。
链接: https://arxiv.org/abs/2604.17789
作者: Haokun Lin,Xinle Jia,Haobo Xu,Bingchen Yao,Xianglong Guo,Yichen Wu,Zhichao Lu,Ying Wei,Qingfu Zhang,Zhenan Sun
机构: CASIA; NJU; THU; ZJU; Harvard; CityU
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Technical Report
Abstract:The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation-based remedies, including randomized Hadamard and learnable rotations, are data-agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier-aware fine-grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size (B=32). Because each MXFP4 group possesses an independent scaling factor, the cross-block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant, enabling DuQuant++ to replace the entire pipeline with a single outlier-aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance. Our code is available at this https URL.
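为直观说明"单个异常值放大共享缩放因子"的问题,下面用一个简化的块量化示例演示:一个块共享一个 2 的幂缩放因子,元素取整到粗粒度网格。这并非真实的 MXFP4 编码,仅示意其误差机制:

```python
import math

# Toy microscaling sketch (not the exact MXFP4 format): a block of 32
# values shares one power-of-two scale (E8M0-like), and each element is
# rounded onto a coarse 4-bit-style integer grid in [-7, 7].
def quantize_block(block, max_code=7):
    amax = max(abs(v) for v in block)
    scale = 2.0 ** math.ceil(math.log2(amax / max_code))  # shared block scale
    codes = [max(-max_code, min(max_code, round(v / scale))) for v in block]
    return [c * scale for c in codes]

def mean_abs_err(xs, qs):
    return sum(abs(x - q) for x, q in zip(xs, qs)) / len(xs)

calm   = [0.5] * 31 + [0.6]    # well-behaved block
spiked = [0.5] * 31 + [64.0]   # one activation outlier

e_calm = mean_abs_err(calm, quantize_block(calm))
e_spiked = mean_abs_err(spiked, quantize_block(spiked))
# The outlier inflates the shared scale, so the 31 small values are
# rounded far more coarsely: e_spiked >> e_calm.
```

旋转类方法(包括文中的 outlier-aware rotation)的目标正是在量化前摊平这类异常值,使每个块内的动态范围更均匀。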
[NLP-84] Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行遗忘学习(unlearning)时因统一应用遗忘损失而导致模型性能不必要的退化问题。现有方法虽尝试引入基于token级别的损失正则化以优先处理信息量高的token,但普遍依赖于人工标注的置信度或外部语言学解析器,难以捕捉上下文语境和模型的整体预测状态。其解决方案的关键在于提出熵引导的token加权机制(Entropy-guided Token Weighting, ETW),利用预测分布的熵作为token信息量的代理指标:高熵token通常对应语义丰富、不确定性高的词汇,而低熵token多为结构词(如“the”),具有高度可预测性。由此,ETW能更精准地识别需重点遗忘的token,在实现有效遗忘的同时显著提升模型功能保留能力。
链接: https://arxiv.org/abs/2604.17785
作者: Seunghee Koh,Sunghyun Baek,Youngdong Kim,Junmo Kim
机构: Korea Advanced Institute of Science and Technology, South Korea; Hanbat National University, South Korea
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2026 Main Conference. 17 pages, 9 figures
Abstract:Unlearning in large language models (LLMs) has emerged as a promising safeguard against adversarial behaviors. When the forgetting loss is applied uniformly without considering token-level semantic importance, model utility can be unnecessarily degraded. Recent studies have explored token-wise loss regularizers that prioritize informative tokens, but largely rely on ground-truth confidence or external linguistic parsers, which limits their ability to capture contextual information or the model’s overall predictive state. Intuitively, function words like “the” primarily serve syntactic roles and are highly predictable with little ambiguity, but informative words admit multiple plausible alternatives with greater uncertainty. Based on this intuition, we propose Entropy-guided Token Weighting (ETW), a token-level unlearning regularizer that uses entropy of the predictive distribution as a proxy for token informativeness. We demonstrate that informative tokens tend to have higher entropy, whereas structural tokens tend to have lower entropy. This behavior enables ETW to achieve more effective unlearning while better preserving model utility than existing token-level approaches.
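ETW 的核心直觉是以预测分布的熵作为 token 信息量的代理,并据此加权逐 token 的遗忘损失,可用如下玩具示例说明。分布与损失数值均为假设值,仅演示高熵 token 获得更大权重:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a predictive distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Toy predictive distributions at two positions: a near-deterministic
# function word ("the") vs. an uncertain, informative content word.
p_function_word = [0.97, 0.01, 0.01, 0.01]
p_content_word  = [0.30, 0.28, 0.22, 0.20]

# Entropy-guided weights: informative tokens dominate the forget loss.
weights = [entropy(p_function_word), entropy(p_content_word)]
per_token_nll = [2.1, 2.3]  # hypothetical per-token forgetting losses
etw_loss = sum(w * l for w, l in zip(weights, per_token_nll)) / sum(weights)
```

这样,结构词("the" 一类低熵 token)对遗忘目标的贡献很小,从而减少对模型通用能力的破坏。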
[NLP-85] SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自然语言到SQL(Natural Language to SQL, NL2SQL)任务中报告的高准确率可能因训练数据中存在与测试集查询结构相似或完全相同的样本而导致的污染问题(contamination),从而影响评估结果的真实性。解决方案的关键在于提出SPENCE(Syntactic Probing and Evaluation of NL2SQL Contamination Effects),这是一个受控的句法探测框架,通过系统生成四个主流NL2SQL数据集(Spider、SParC、CoSQL和BIRD)测试查询的句法变体,并基于执行准确性变化和Kendall’s tau相关性分析来量化模型对句法差异的敏感度,进而识别并度量训练泄漏的可能性。该方法揭示了旧数据集(如Spider)存在显著污染,而新数据集(如BIRD)则表现出较低敏感性,验证了时序语境化句法探测在可信NL2SQL评估中的必要性。
链接: https://arxiv.org/abs/2604.17771
作者: Mohammadtaher Safarzadeh,Hitesh Laxmichand Patel,Afshin Orojlooyjadid,Graham Horwood,Dan Roth
机构: Oracle AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: ACL 2026 Main Conference
Abstract:Large language models (LLMs) have achieved strong performance on natural language to SQL (NL2SQL) benchmarks, yet their reported accuracy may be inflated by contamination from benchmark queries or structurally similar patterns seen during training. We introduce SPENCE (Syntactic Probing and Evaluation of NL2SQL Contamination Effects), a controlled syntactic probing framework for detecting and quantifying such contamination. SPENCE systematically generates syntactic variants of test queries for four widely used NL2SQL datasets-Spider, SParC, CoSQL, and the newer BIRD benchmark. We use SPENCE to evaluate multiple high-capacity LLMs under execution-based scoring. For each model, we measure changes in execution accuracy across increasing levels of syntactic divergence and quantify rank sensitivity using Kendall’s tau with bootstrap confidence intervals. By aligning these robustness trends with benchmark release dates, we observe a clear temporal gradient: older benchmarks such as Spider exhibit the strongest negative values and thus the highest likelihood of training leakage, whereas the more recent BIRD dataset shows minimal sensitivity and appears largely uncontaminated. Together, these findings highlight the importance of temporally contextualized, syntactic-probing evaluation for trustworthy NL2SQL benchmarking.
[NLP-86] Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)安全评估中高质量对抗性数据生成的系统性难题,特别是如何自动化、可控地合成多维度毒性数据以支持红队测试(red teaming)。其核心解决方案是提出逆向宪法AI(Reverse Constitutional AI, R-CAI)框架,通过将无害宪法反向转化为毒性宪法,并借助批判-修订(critique–revision)迭代优化流程,在无需人工标注的情况下实现规模化对抗数据生成。关键创新在于引入强化学习中的概率钳制(probability clamping)机制,有效缓解仅优化毒性奖励导致的奖励黑客(reward hacking)问题,从而在保持攻击强度的同时显著提升语义连贯性(提升15%),实现了对抗性与合理性的平衡。
链接: https://arxiv.org/abs/2604.17769
作者: Yuan Fang,Yiming Luo,Aimin Zhou,Fei Tan
机构: East China Normal University, Shanghai, China; Shanghai Innovation Institute, Shanghai, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026. 10 pages, 6 figures. Code and data available at this https URL
Abstract:Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique–revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (by 15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
[NLP-87] Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)失败分析中缺乏有效工具的问题,特别是现有可解释性工具多聚焦于短提示或玩具场景,未能充分覆盖实际应用场景中的复杂表现。其解决方案的关键在于提出一种基于梯度的对比归因方法(contrastive, LRP-based attribution),将错误输出标记与正确替代标记之间的logit差异归因于输入标记和模型内部状态,从而实现细粒度的token级失败定位;同时引入一种高效的扩展方法,支持构建长上下文输入的跨层归因图谱,使该方法适用于真实基准测试环境下的系统性失效分析。
链接: https://arxiv.org/abs/2604.17761
作者: Rongyuan Tan,Jue Zhang,Zhuozhao Li,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang
机构: Microsoft(微软)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 45 pages, 16 figures, 16 tables
Abstract:Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: this https URL.
[NLP-88] Evolutionary Negative Module Pruning for Better LoRA Merging ACL2026
【速读】: 该论文旨在解决当前多任务低秩适配(Low-Rank Adaptation, LoRA)模型合并过程中存在的性能下降问题,特别是由“负向模块”(negative modules)引起的全局性能劣化现象——即某些LoRA层在合并后反而损害整体模型表现。解决方案的关键在于提出一种名为进化负向模块剪枝(Evolutionary Negative Module Pruning, ENMP)的方法,通过进化搜索策略在离散且不可微的模块选择空间中高效定位并剔除这些有害模块,从而提升合并后模型的性能。ENMP作为即插即用的剪枝机制,可显著增强现有合并算法的效果,在语言和视觉领域均达到新的最先进水平。
链接: https://arxiv.org/abs/2604.17753
作者: Anda Cao,Zhuo Gou,Yi Wang,Kaixuan Chen,Yu Wang,Can Wang,Mingli Song,Jie Song
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACL 2026 (main conference)
Abstract:Merging multiple Low-Rank Adaptation (LoRA) experts into a single backbone is a promising approach for efficient multi-task deployment. While existing methods strive to alleviate interference via weight interpolation or subspace alignment, they rest upon the implicit assumption that all LoRA matrices contribute constructively to the merged model. In this paper, we uncover a critical bottleneck in current merging paradigms: the existence of negative modules – specific LoRA layers that inherently degrade global performance upon merging. We propose Evolutionary Negative Module Pruning (ENMP), a plug-and-play LoRA pruning method to locate and exclude these detrimental modules prior to merging. By leveraging an evolutionary search strategy, ENMP effectively navigates the discrete, non-differentiable landscape of module selection to identify optimal pruning configurations. Extensive evaluations demonstrate that ENMP consistently boosts the performance of existing merging algorithms, achieving a new state-of-the-art across both language and vision domains. Code is available at this https URL.
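ENMP 的进化搜索可以理解为在"保留/剪除"二值掩码空间上做遗传搜索。下面是一个自包含的玩具实现:适应度函数为人工构造,用模块 2 和 5 模拟"负向模块",并非论文使用的真实合并指标:

```python
import random

random.seed(0)

# Toy evolutionary search over which LoRA modules to keep (1) or prune (0),
# in the spirit of ENMP. The fitness is a synthetic stand-in: modules in
# NEGATIVE hurt the merged score, the rest help it.
N_MODULES = 8
NEGATIVE = {2, 5}

def fitness(mask):
    good = sum(0.1 for i, keep in enumerate(mask) if keep and i not in NEGATIVE)
    bad = sum(0.3 for i, keep in enumerate(mask) if keep and i in NEGATIVE)
    return good - bad

def evolve(pop_size=20, generations=30, p_mut=0.1):
    pop = [[random.randint(0, 1) for _ in range(N_MODULES)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_MODULES)  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([1 - g if random.random() < p_mut else g for g in child])
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
# The search should learn to prune the negative modules while keeping the rest.
```

由于模块选择是离散且不可微的,这类黑盒搜索无需梯度即可工作,这也是论文选择进化策略的动机。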
[NLP-89] HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation
【速读】: 该论文旨在解决LoRA(Low-Rank Adaptation)在参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中因频谱干扰(spectral interference)导致的性能瓶颈问题,具体表现为低秩更新能量集中在预训练权重的主要奇异方向上,从而破坏模型的通用能力并引发灾难性遗忘和多适配器合并失败(multi-adapter merging failure)。其解决方案的关键在于提出HiP-LoRA(High-precision LoRA),该方法利用预训练层缓存的奇异值分解(Singular Value Decomposition, SVD),将更新分解为两个通道:一个主通道位于主导奇异子空间内,另一个残差低秩通道位于正交补空间中;并通过在主通道上施加基于奇异值加权的稳定性预算(stability budget),动态平衡预训练行为保持与任务特定可塑性,从而显著降低预训练性能退化并在连续微调和知识编辑等对干扰敏感的任务中实现更鲁棒的性能表现。
链接: https://arxiv.org/abs/2604.17751
作者: Lixian Chen,Jianhong Tan
机构: Guangdong University of Technology (广东工业大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Adapting foundation models under resource budgets relies heavily on Parameter-Efficient Fine-Tuning (PEFT), with LoRA being a standard modular solution. However, LoRA suffers from spectral interference. Low-rank updates often concentrate energy on the leading singular directions of pretrained weights, perturbing general capabilities and causing catastrophic forgetting and fragile multi-adapter merging. To resolve this, we propose HiP-LoRA, a spectrum-aware adaptation framework. Utilizing the cached singular value decomposition (SVD) of pretrained layers, HiP-LoRA decomposes updates into two channels: a principal channel within the dominant singular subspace, and a residual low-rank channel in the orthogonal complement. A singular-value-weighted stability budget on the principal channel continuously balances pretrained behavior preservation with task-specific plasticity. Experiments on Llama-3.1-8B demonstrate that under matched budgets, HiP-LoRA drastically reduces pretraining degradation and multi-adapter MergeFail, robustly outperforming baselines in interference-sensitive tasks like continual tuning and knowledge editing.
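下面给出 HiP-LoRA 双通道分解思想的一个极简示意(假设性实现,非论文官方代码):给定若干相互正交的单位主奇异方向,把低秩更新向量拆成主子空间分量(principal channel)与正交补分量(residual channel)。

```python
# 假设性示意:HiP-LoRA 的双通道分解(非论文官方实现)。
# 假设 principal_dirs 中的方向两两正交且为单位长度。

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def split_update(delta, principal_dirs):
    """把更新向量 delta 拆为主子空间分量与正交残差分量。"""
    principal = [0.0] * len(delta)
    for u in principal_dirs:               # 逐个方向累加投影
        c = dot(delta, u)
        principal = [p + c * ui for p, ui in zip(principal, u)]
    residual = [d - p for d, p in zip(delta, principal)]
    return principal, residual

# 二维玩具例子:主方向取 x 轴
principal, residual = split_update([3.0, 4.0], [[1.0, 0.0]])
print(principal)  # [3.0, 0.0]
print(residual)   # [0.0, 4.0]
```

论文在此基础上进一步对主通道施加奇异值加权的稳定性预算,本示例只演示子空间拆分这一步。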
[NLP-90] HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
【速读】: 该论文旨在解决当前基于大语言模型的计算研究自动化方法在实验复现任务中存在全局协调能力弱、依赖固定顺序代理流程导致鲁棒性不足的问题。其解决方案的关键在于提出分层多智能体系统(Hierarchical Research Agent System, HiRAS),通过引入监督型管理代理(supervisory manager agents)来协调不同细粒度阶段的专业化智能体,从而实现端到端的实验复现流程优化。此外,作者还改进了Paper2Code基准的无参考评估协议,提出了Paper2Code-Extra(P2C-Ex),引入仓库级信息以更贴近原始基于参考的评估指标,显著提升了评估的准确性与可靠性。
链接: https://arxiv.org/abs/2604.17745
作者: Hanhua Hong,Yizhi LI,Jiaoyan Chen,Sophia Ananiadou,Xiaoli Li,Jung-jae Kim,Chenghua Lin
机构: The University of Manchester; Institute for Infocomm Research (I²R), A*STAR; ELLIS Manchester; Singapore University of Technology and Design
类目: Computation and Language (cs.CL)
备注: 29 pages
Abstract:Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance. In this work, we propose Hierarchical Research Agent System (HiRAS), a hierarchical multi-agent framework for end-to-end experiment reproduction that employs supervisory manager agents to coordinate specialised agents across fine-grained stages. We also identify limitations in the reference-free evaluation of the Paper2Code benchmark and introduce Paper2Code-Extra (P2C-Ex), a refined protocol that incorporates repository-level information and better aligns with the original reference-based metric. We conduct extensive evaluation, validating the effectiveness and robustness of our proposed methods, and observing improvements, including 10% relative performance gain beyond the previous state-of-the-art using open-source backbone models and significantly reduced hallucination in evaluation. Our work is available on GitHub: this https URL.
[NLP-91] Tool Learning Needs Nothing More Than a Free 8B Language Model
【速读】: 该论文旨在解决当前工具调用智能体(Tool Calling Agent)训练中对昂贵标注数据、真实人类交互、可执行工具或高成本商用语言模型(Language Model, LM)所依赖的环境资源问题。现有方法通常需要固定不变的合成环境或外部标注数据,限制了训练的灵活性与可扩展性。解决方案的关键在于提出TRUSTEE框架,其核心是利用本地开源小规模语言模型(如8B参数级别)构建动态模拟环境,涵盖任务生成、用户模拟、工具模拟和轨迹评估四大模块,并结合自适应课程学习机制,在训练过程中动态调控任务难度。实验证明,该方法无需额外资源即可在多个领域实现稳定提升,表明通过精心设计的模拟环境,即使使用低成本本地模型也能建立强大的工具学习基线。
链接: https://arxiv.org/abs/2604.17739
作者: Chenming Tang,Hsiu-Yuan Huang,Weijie Liu,Junqiang Zheng,Saiyong Yang,Yunfang Wu
机构: Peking University (北京大学); Tencent (腾讯)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint; Work in progress
Abstract:Reinforcement learning (RL) has become a prevalent paradigm for training tool calling agents, which typically requires online interactive environments. Existing approaches either rely on training data with ground truth annotations or require advanced commercial language models (LMs) to synthesize environments that keep fixed once created. In this work, we propose TRUSTEE, a data-free method training tool calling agents with dynamic environments fully simulated by free open-source LMs that can be as small as 8B, including task generation, user simulation, tool simulation and trajectory evaluation, paired with an adaptive curriculum learning mechanism that controls various aspects of the task difficulty dynamically during training. Our empirical results show that TRUSTEE brings consistent improvements across various domains and outperforms all the baselines which require extra external resources for training. These confirm that, with a sufficiently sophisticated design, even simulated environments with a local 8B LM as the backbone could set a strong baseline for tool learning, without expensive annotated data, realistic human interactions, executable tools or costly verifiable environments from human experts or commercial LMs. We hope our proposed paradigm could inspire future research on environment scaling with limited resources.
[NLP-92] Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM -Synthesized Data
【速读】: 该论文旨在解决招聘场景中候选人才匹配的精准度问题,特别是在有限人工审核预算下如何提升召回率(Recall)和精确率(Precision)。其核心挑战在于:上游检索器返回的候选人列表虽具备一定覆盖范围,但缺乏细粒度排序能力,导致合格候选人未能排在前列。解决方案的关键在于构建一个基于大语言模型(LLM)合成数据驱动的语义重排序系统(semantic reranking system),通过五阶段提示工程生成多样化的正样本与难负样本以重塑嵌入空间,并采用两轮LoRA微调策略实现JD(职位描述)与CV(简历)之间的对比学习与三元组对齐;此外,引入轻量级BoundaryHead多层感知机(MLP)模块来区分同职位标题但职责范围不同的角色,从而显著提升Top-K结果的排序质量。实验证明,该方法无需大规模人工标注训练对,在仅使用300个真实JD的基础上即实现了Recall@50从68.89%提升至77.55%,且在更大规模测试集上Recall@200达到0.7047,优于基线模型。
链接: https://arxiv.org/abs/2604.17738
作者: Zhaohua Liang,Zhilin Wang,Renjie Cao,Yining Zhang
机构: OpenJobs AI(开放工作岗位人工智能)
类目: Computation and Language (cs.CL)
备注:
Abstract:Candidate sourcing for recruiters is best viewed as a two-stage retrieval and reranking pipeline with recall as the primary objective under a limited review budget. An upstream production retriever first returns a candidate shortlist for each job description (JD), and our goal is to rerank that shortlist so that qualified candidates appear as high as possible. We present mira-embeddings-v1, a semantic reranking system for the recruitment domain that reshapes the embedding space with LLM-synthesized training data and corrects boundary confusions with a lightweight reranking head. Starting from real JDs, we build a five-stage prompt pipeline to generate diverse positive and hard negative samples that sculpt the semantic space from multiple angles. We then apply a two-round LoRA adaptation: JD–JD contrastive training followed by JD–CV triplet alignment on a heterogeneous text dataset. Importantly, these gains require no large-scale manually labeled industrial training pairs: a modest set of real JDs is expanded into supervision through LLM synthesis. Finally, a BoundaryHead MLP reranks the Top-K results to distinguish between roles that share the same title but differ in scope. On a local pool of 300 real JDs with candidates from an upstream production retriever, mira-embeddings-v1 improves Recall@50 from 68.89% (baseline) to 77.55% while lifting Precision@10 from 35.77% to 39.62%. On a supportive global pool over 44,138 candidates judged by a Qwen3-32B rubric, it achieves Recall@200 of 0.7047 versus 0.5969 for the baseline. These results show that LLM-synthesized supervision with boundary-aware reranking yields robust gains without a heavy cross-encoder.
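该论文以 Recall@K、Precision@K 评估重排序效果(如 Recall@50 从 68.89% 提升到 77.55%)。Recall@K 本身是通用评测指标,其计算可示意如下(与论文具体实现无关):

```python
# 通用评测指标 Recall@K 的计算示意(非论文专有实现)。
def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for cid in ranked_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

ranked = ["c3", "c1", "c7", "c2", "c9"]   # 重排序后的候选人列表
relevant = {"c1", "c2", "c8"}             # 真正合格的候选人
print(round(recall_at_k(ranked, relevant, 3), 4))  # 前 3 名只命中 c1 -> 0.3333
print(round(recall_at_k(ranked, relevant, 5), 4))  # 前 5 名命中 c1、c2 -> 0.6667
```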
[NLP-93] RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models ACL2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理结构化电子健康记录(Electronic Health Records, EHRs)时面临的两大挑战:一是将时间戳的EHR序列转化为纯文本会丢失时间结构和编码身份信息,削弱对代码共现及纵向规律的捕捉能力;二是LLMs通常以孤立病例方式进行推理,未能利用人群层面的模式。解决方案的关键在于提出RePrompT框架,通过提示调优(prompt tuning)集成结构化EHR编码器,在不修改底层架构的前提下,利用先前就诊的潜在状态递归保留纵向信息,并通过从群体训练的、任务对齐的EHR编码器中提取的可训练提示标记注入人群级知识,从而实现更有效的临床预测。
链接: https://arxiv.org/abs/2604.17725
作者: Arya Hadizadeh Moghaddam,Drew Ross,Mohsen Nayebi Kerdabadi,Dongjie Wang,Zijun Yao
机构: University of Kansas, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Finding of ACL 2026 - Accepted Paper
Abstract:Large Language Models (LLMs) have shown strong promise for mining Electronic Health Records (EHRs) by reasoning over longitudinal clinical information to capture context-rich patient trajectories. However, leveraging LLMs for structured EHRs (e.g., standardized diagnosis and medication codes) presents two key challenges. First, translating time-stamped EHR sequences into plain text can obscure both temporal structure and code identities, weakening the ability to capture code co-occurrence and longitudinal regularities. Second, unlike cohort-trained predictive models that learn a shared, task-aligned representation space across patients, LLMs are often applied in a case-isolated inference setting where each patient is processed independently without leveraging population-level patterns. To address these challenges, we introduce RePrompT, a time-aware LLM framework that integrates structured EHR encoders through prompt tuning, without modifying underlying architectures. Specifically, RePrompT recurrently incorporates latent states from prior visits to preserve longitudinal information, and injects population-level information through trainable prompt tokens derived from a cohort-trained, task-aligned EHR encoder. Experiments on MIMIC-III and MIMIC-IV demonstrate that RePrompT consistently outperforms both EHR-based and LLM-based baselines across multiple clinical prediction tasks.
[NLP-94] Do LLMs Use Cultural Knowledge Without Being Told? A Multilingual Evaluation of Implicit Pragmatic Adaptation
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否能够在文化仅通过情境隐含提示时,调整其话语策略,而不仅限于在显式文化指令下做出反应。研究发现,尽管模型在显式文化指令(Prompt B)下能显著改变语用特征(如对权威的尊重、个体与群体框架、不确定性管理),但在隐含情境线索(Prompt C)下仅恢复约19.6%的显式指令引发的语用变化,这一指标被定义为语用情境敏感性(Pragmatic Context Sensitivity, PCS)。解决方案的关键在于识别出:模型的文化响应主要依赖于显式指令,而非情境隐含线索;且不同语用维度的迁移能力差异显著——对权威相关的线索迁移最强(PCS=0.299),而个体与群体框架最弱(PCS=0.120);此外,不确定性管理行为呈现复杂模式,例如“模糊表达密度”在所有五种语言中均显示负向显式差距,表明对齐训练可能抑制了目标语用行为。研究还利用印地语和乌尔都语共享语法结构但对应不同文化社群的特点进行对照分析,结果未发现可靠基线差异,说明模型响应更依赖语言结构而非文化关联。因此,论文主张多语言文化语用问题本质上是一个显式指令与隐式部署之间的适配问题,而非单纯的事实知识缺失问题。
链接: https://arxiv.org/abs/2604.17718
作者: Mehwish Nasim,Sanjeevan Selvaganapathy,Neel Ganapathi Sabhahit,Marie Griesbach,Pranav Bhandari,Janina Lütke Stockdiek,Lennart Schäpermeier,Usman Naseem,Christian Grimme
机构: The University of Western Australia (西澳大利亚大学); Macquarie University (麦考瑞大学); University of Münster (明斯特大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:Many benchmarks show that large language models can answer direct questions about culture. We study a different question: do they also change how they speak when culture is only implied by the situation? We evaluate 60 culturally grounded conversational scenarios across five languages in three conditions: a neutral baseline (Prompt A), an explicit cultural instruction (Prompt B), and implicit situational cueing (Prompt C). We score responses on 12 pragmatic features covering deference to authority, individual-versus-group framing, and uncertainty management. We define Pragmatic Context Sensitivity (PCS) as the fraction of the Prompt A-B shift that reappears under Prompt A-C. Across four deployed LLMs and five languages (English, German, Hindi, Nepali, Urdu), the primary stable-only PCS mean is 0.196 (SD = 0.113), indicating that the models recover only about one-fifth of the pragmatic shift they can produce when instructed explicitly. Transfer is strongest for authority-related cues (0.299) and weakest for individual-versus-group framing (0.120). Uncertainty-related behaviour is mixed: hedging density exhibits negative explicit gaps in all five languages, suggesting that alignment training actively suppresses the target behaviour. Because Hindi and Urdu share core grammar yet index distinct cultural communities, we use them as a natural control; a paired analysis finds no reliable baseline difference (t = 0.96, p = 0.339, dz = 0.06), suggesting that models respond primarily to linguistic structure rather than to the cultural associations a language carries. We argue that multilingual cultural pragmatics is an explicit-versus-implicit deployment problem, not only a factual knowledge problem.
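论文将 PCS 定义为 Prompt A→B 偏移中在 A→C 下重现的比例。单个语用特征上的计算可按该定义直接示意如下(跨特征、跨场景的聚合方式为本示例之外的简化假设):

```python
# 按论文定义的单特征 PCS 示意(聚合方式为简化假设)。
def pcs(score_a, score_b, score_c):
    shift_ab = score_b - score_a   # 显式文化指令引发的偏移
    shift_ac = score_c - score_a   # 隐式情境线索引发的偏移
    if shift_ab == 0:
        return 0.0                 # 显式指令都无效时不计敏感性
    return shift_ac / shift_ab

# 某一语用特征:基线 0.2,显式指令 0.7,隐式情境 0.3
print(round(pcs(0.2, 0.7, 0.3), 3))  # 0.2,即只恢复了两成偏移
```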
[NLP-95] Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction
【速读】: 该论文旨在解决如何有效利用大语言模型(Large Language Models, LLMs)的置信度信号来提升选择性预测(selective prediction)性能的问题。其核心解决方案在于引入一个三分类有效性筛选机制(validity screen),将LLM的置信度信号划分为“有效(Valid)”、“不确定(Indeterminate)”和“无效(Invalid)”三类,并验证该分类能够显著预测模型在不同认知任务上的选择性预测表现。关键发现表明,该分类体系能解释47%的AUROC方差,且三类模型的性能呈现单调递增趋势(Invalid: .357, Indeterminate: .554, Valid: .624),证明该筛选机制对提升选择性预测具有实质性价值。
链接: https://arxiv.org/abs/2604.17716
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 4 figures, 2 tables. Companion to arXiv:2604.15702
Abstract:The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = .048). Invalid models show mean AUROC = .357 (SD = .231). Cohen’s d = 2.81, p = .002. The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624). Split-half cross-validation yields median d = 1.77, P(d > 0) = 1.0 across 1,000 splits. The three-tier classification accounts for 47% of the variance in AUROC. DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. The screen predicts the criterion. For selective prediction, the screen matters.
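文中的 Type 2 AUROC 衡量置信度区分对错回答的能力:随机抽取一对(答对, 答错)回答时,答对者置信度更高的概率。其常用的秩和式计算可示意如下(通用指标实现,非论文代码):

```python
# Type 2 AUROC 的秩和式计算示意(Mann-Whitney 统计量的概率形式)。
def type2_auroc(confidences, correct):
    pos = [c for c, ok in zip(confidences, correct) if ok]       # 答对的置信度
    neg = [c for c, ok in zip(confidences, correct) if not ok]   # 答错的置信度
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

conf = [0.9, 0.8, 0.6, 0.4]
okay = [True, True, False, False]
print(type2_auroc(conf, okay))  # 置信度完美区分对错 -> 1.0
```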
[NLP-96] Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)中缺乏标准化方法来验证其置信度信号(confidence signals)是否携带个体项目级信息的问题,这直接影响到基于置信度的决策如弃权(abstention)、路由(routing)及安全关键任务的可靠性。解决方案的关键在于引入临床人格评估中的有效性筛查原则(validity screening principle),将其转化为一个可移植的基准测试协议:该协议基于单一2×2列联表计算三个核心指标(L、Fp、RBS)、一个结构指标(TRIN)和一项项目敏感性统计量,并通过四类临床传统构建三层次分类体系(无效、不确定、有效)。实证表明,该方法在524个项目上对20个前沿LLM进行验证后,能有效区分置信度信号的有效性,且跨基准(如MMLU)和探针格式具有良好的迁移性。
链接: https://arxiv.org/abs/2604.17714
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 6 figures, 8 tables, 2 appendices. Companion to arXiv:2604.15702
Abstract:LLM confidence signals are used for abstention, routing, and safety-critical decisions. No standard practice exists for checking whether a confidence signal carries item-level information before building on it. We transfer the validity screening principle from clinical personality assessment (PAI, MMPI-3) as a portable protocol for benchmark-based LLM confidence data. The protocol specifies three core indices (L, Fp, RBS), a structural indicator (TRIN), and an item-sensitivity statistic, computed from a single 2x2 contingency table. A three-tier classification system (Invalid, Indeterminate, Valid) draws on four clinical traditions. Validated on 20 frontier LLMs across 524 items, four models are classified Invalid, two Indeterminate. Valid-profile models show mean r = .18 (15/16 significant). Invalid-profile models show mean r = -.20 (d = 2.48). Cross-benchmark validation on 18 models using MMLU with verbalized confidence and on external data from Yang et al. (2024) confirms the screen transfers across benchmarks and probe formats. All data and code: this https URL
[NLP-97] DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在规模扩展时因分解后推理并行性能差而导致的效率瓶颈问题。现有方法多关注下游任务性能提升,却忽视了模型规模扩大对推理并行性的负面影响。解决方案的关键在于提出 DeInfer——一个专为分解后 LLM 设计的高性能推理系统,通过多项优化策略最大化并行推理性能,并保持与当前主流优化技术的兼容性,从而显著提升大规模分解模型的推理效率。
链接: https://arxiv.org/abs/2604.17709
作者: You-Liang Huang,Xinhao Huang,Chengxi Liao,Zeyi Wen
机构: 未知
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: accepted by DAC’26
Abstract:Existing works on large language model (LLM) decomposition mainly focus on improving performance on downstream tasks, but they ignore the poor parallel inference performance when trying to scale up the model size. To mitigate this important performance issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It consists of multiple optimizations to maximize performance and be compatible with state-of-the-art optimization techniques. Extensive experiments are carried out to evaluate DeInfer’s performance, where the results demonstrate its superiority, suggesting it can greatly facilitate the parallel inference of decomposed LLMs.
[NLP-98] Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在临床人格评估中缺乏响应有效性(response validity)检验的问题。传统心理量表如PAI和MMPI-3通过多个效度量表(validity indices)来判断受试者是否诚实、一致或努力作答,而LLM的评估尚未引入此类机制。研究的关键在于将心理学中成熟的效度量表框架迁移至LLM的元认知探针数据(metacognitive probe data),并操作化六个效度指标:L(维持错误上的自信)、K(赌错)、F(撤回共识项)、Fp(撤回正确答案)、RBS(反向监控)和TRIN(固定回应)。结果表明,无效效度模型无法产生与项目内容相关的置信度关联(平均相关系数 r = -0.20,效应量 d = 2.17,p = .001),而有效模型则表现出显著的项目敏感性(r = 0.18,14/16显著)。此外,链式思维训练(chain-of-thought training)引发两种相反的反应扭曲,且两个潜在维度解释了94.6%的指数变异,为构建可移植的筛查协议提供了实证基础。
链接: https://arxiv.org/abs/2604.17707
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 figures. Companion to arXiv:2604.15702
Abstract:Clinical personality assessment screens response validity before interpreting substantive scales. LLM evaluation does not. We apply the validity scaling framework from the PAI and MMPI-3 to metacognitive probe data from 20 frontier models across 524 items. Six validity indices are operationalised: L (maintaining confidence on errors), K (betting on errors), F (withdrawing consensus-endorsed items), Fp (withdrawing correct answers), RBS (inverted monitoring), and TRIN (fixed responding). A tiered classification system identifies four models as construct-level invalid and two as elevated. Valid-profile models produce item-sensitive confidence (mean r = .18, 14 of 16 significant). Invalid-profile models do not (mean r = -.20, d = 2.17, p = .001). Chain-of-thought training produces two opposite response distortions. Two latent dimensions account for 94.6% of index variance. Companion papers extract a portable screening protocol (Cacioli, 2026e) and validate it against selective prediction (Cacioli, 2026f). All data and code: this https URL
[NLP-99] The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中面临的两个关键问题:一是预测模型是否容易受到目标行为控制(即可控性预测),二是检测模型内部结构随时间或训练过程发生的退化(即漂移检测)。这两个看似独立的问题,实际上共享一个共同的几何基础——表示空间中的成对距离结构一致性,即几何稳定性(geometric stability)。解决方案的关键在于区分监督式与无监督式几何稳定性度量:监督式方法通过任务对齐的几何稳定性可近乎完美地预测线性可控性(相关系数 ρ = 0.89–0.97),且能捕捉到类别可分性之外的独特变异;而无监督几何稳定性虽无法有效预测可控性(ρ ≈ 0.10),却在漂移检测中表现卓越,其敏感度可达核函数相关性(CKA)的2倍以上,并提供更早预警和更低误报率。因此,二者结合构成了覆盖LLM全生命周期的互补诊断工具。
链接: https://arxiv.org/abs/2604.17698
作者: Prashant C. Raju
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation’s pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89–0.97) across 35–69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62–0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
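文中"几何稳定性"指表示空间成对距离结构的一致性。一个常见的度量思路是比较变化前后两组表示的两两距离向量的相关性,示意如下(具体定义以论文的 Shesha 变体为准):

```python
# 示意:用两两距离向量之间的 Pearson 相关近似"成对距离结构一致性"。
import math

def pairwise_dists(points):
    out = []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            out.append(math.dist(points[i], points[j]))
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

before = [(0, 0), (1, 0), (0, 2)]
after = [(0, 0), (2, 0), (0, 4)]   # 等比缩放:距离结构完全保持
print(round(pearson(pairwise_dists(before), pairwise_dists(after)), 6))  # 1.0
```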
[NLP-100] MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
【速读】: 该论文旨在解决长文本大语言模型(LLM)推理中KV缓存(KV cache memory)内存占用过高这一瓶颈问题。现有压缩方法通常对所有层采用统一策略,仅沿单维度进行优化(如token剔除、量化精度、低秩投影或跨层共享),但忽略了不同层对压缩操作的响应差异,导致精度损失。其解决方案的关键在于提出MoE-nD框架——一个基于专家混合(mixture-of-experts)的异构路由机制,通过在全局内存预算约束下为每一层动态分配最优的剔除比例(eviction ratio)、K向量位宽(K-bits)与V向量位宽(V-bits)组合,从而实现每层差异化压缩。该框架利用离线校准的贪心求解器确定最小化预测质量损失的路由方案,并在推理时通过单一注意力模块联合执行各层的异构剔除与量化操作,显著提升压缩效率和准确性。
链接: https://arxiv.org/abs/2604.17695
作者: Libo Sun,Peixiong He,Po-Wei Harn,Xiao Qin
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 3 figures, 6 tables
Abstract:KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor – token eviction (sequence), quantization (precision), low-rank projection (head dimension), or cross-layer sharing – but apply the same recipe to every layer. We show that this homogeneity leaves accuracy on the table: different layers respond very differently to each compression operation, and the optimal per-layer mix of eviction and quantization is far from uniform. We propose MoE-nD, a mixture-of-experts framework that routes each layer to its own (eviction-ratio, K-bits, V-bits) tuple under a global memory budget. An offline-calibrated greedy solver chooses the routing that minimizes predicted quality loss; at inference time, per-layer heterogeneous eviction and quantization are applied jointly through a single attention patch. On a 4-task subset of LongBench-v1 (16k inputs, n=50 per task, adapted reasoning-model protocol; see section Experiments), MoE-nD’s hetero variant matches our uncompressed 1.9 GB baseline at 14× compression (136 MB) while every other compressed baseline we tested (1d, 2d_uniform, 2d) at comparable or smaller memory stays under 8/100. The gains hold on AIME reasoning benchmarks (+6 to +27 pts over the strongest per-layer-quantization baseline across eight configurations). Two null results – MATH-500 and LongBench’s TREC – share a principled cause (short inputs, solver picks keep=1.0 on most layers), cleanly characterizing when per-layer eviction routing has headroom to help.
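论文的离线求解器在全局显存预算下为每层选择 (eviction-ratio, K-bits, V-bits) 配置。下面用一个极简的贪心草图示意这一思路(候选配置与"预测质量损失"数值均为虚构,且此贪心按层序分配,与论文经离线校准的求解器并不相同):

```python
# 假设性草图:全局显存预算下为每层贪心挑选压缩配置(数值均为虚构示例)。
def greedy_route(layers, options, budget):
    routing, total_mem = {}, 0.0
    for layer in layers:
        feasible = [(loss, mem, name) for name, (mem, loss) in options.items()
                    if total_mem + mem <= budget]   # 假设预算至少容纳最省配置
        loss, mem, name = min(feasible)             # 取预测损失最小者
        routing[layer] = name
        total_mem += mem
    return routing, total_mem

opts = {                        # 配置名: (显存开销 GB, 预测质量损失)
    "keep1.0_int8": (2.0, 0.1),
    "keep0.5_int4": (0.6, 0.4),
    "keep0.3_int2": (0.3, 0.9),
}
routing, mem = greedy_route(["L0", "L1", "L2"], opts, budget=3.0)
print(routing)        # 预算内先保真、后压缩
print(round(mem, 1))  # 2.9
```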
[NLP-101] Towards Intelligent Legal Document Analysis: CNN-Driven Classification of Case Law Texts
【速读】: 该论文旨在解决法律实务工作者和司法机构在处理大量案例文献时面临的效率与准确性问题,这些问题源于法律文本中形式化语言、复杂句式及高度专业术语带来的手动筛选困难。解决方案的关键在于提出一个轻量级但高精度的引用处理分类框架,其核心由三部分组成:基于词形还原(lemmatisation)的预处理、子词感知的FastText词嵌入(subword-aware FastText embeddings),以及多核一维卷积神经网络(multi-kernel one-dimensional Convolutional Neural Network, CNN)。该架构在25,000条标注法律文档上实现了97.26%的分类准确率和96.82%的宏F1分数,显著优于BERT微调、LSTM+FastText、随机嵌入CNN及TF-IDF-KNN等基线模型,同时仅需510万参数和每文档0.31毫秒推理延迟,展现出对资源受限场景下智能法律文本分析的高度适用性。
链接: https://arxiv.org/abs/2604.17674
作者: Moinul Hossain,Sourav Rabi Das,Zikrul Shariar Ayon,Sadia Afrin Promi,Ahnaf Atef Choudhury,Shakila Rahman,Jia Uddin
机构: Washington State University (华盛顿州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Legal practitioners and judicial institutions face an ever-growing volume of case-law documents characterised by formalised language, lengthy sentence structures, and highly specialised terminology, making manual triage both time-consuming and error-prone. This work presents a lightweight yet high-accuracy framework for citation-treatment classification that pairs lemmatisation-based preprocessing with subword-aware FastText embeddings and a multi-kernel one-dimensional Convolutional Neural Network (CNN). Evaluated on a publicly available corpus of 25,000 annotated legal documents with a 75/25 training-test partition, the proposed system achieves 97.26% classification accuracy and a macro F1-score of 96.82%, surpassing established baselines including fine-tuned BERT, Long Short-Term Memory (LSTM) with FastText, CNN with random embeddings, and a Term Frequency-Inverse Document Frequency (TF-IDF) k-Nearest Neighbour (KNN) classifier. The model also attains the highest Area Under the Receiver Operating Characteristic (AUC-ROC) curve of 97.83% among all compared systems while operating with only 5.1 million parameters and an inference latency of 0.31 ms per document - more than 13 times faster than BERT. Ablation experiments confirm the individual contribution of each pipeline component, and the confusion matrix reveals that residual errors are confined to semantically adjacent citation categories. These findings indicate that carefully designed convolutional architectures represent a scalable, resource-efficient alternative to heavyweight transformers for intelligent legal document analysis.
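文中多核一维 CNN 的核心是:用不同宽度的卷积核扫描词向量序列,对每个核做最大池化,再把各核特征拼接。下面用纯 Python 给出该前向骨架的玩具示意(非论文实际网络结构,数值仅作演示):

```python
# 玩具示意:多核宽度一维卷积 + 最大池化(纯 Python,非论文实际网络)。
def conv1d_maxpool(seq, kernel, width):
    """seq: 长度 T 的 D 维词向量序列;kernel: 展平为 width*D 的卷积核。"""
    scores = []
    for t in range(len(seq) - width + 1):
        window = [x for vec in seq[t:t + width] for x in vec]
        scores.append(sum(w * x for w, x in zip(kernel, window)))
    return max(scores)   # 最大池化:每个核只保留最强响应

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 个 2 维"词向量"
# 核宽度 1 与 2 各取一个全 1 卷积核,结果拼接成特征向量
features = [conv1d_maxpool(seq, [1.0] * (w * 2), w) for w in (1, 2)]
print(features)  # [2.0, 3.0]
```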
[NLP-102] ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data
【速读】: 该论文旨在解决生成式 AI (Generative AI) 模型中因宪法条件(constitution)变化导致的隐状态结构漂移问题,即如何在不同模型架构和数据子集间识别并保持可迁移的潜在几何结构。解决方案的关键在于提出 ATLAS 程序,采用“几何优先”策略,通过局部图册(local chart)对齐机制追踪宪法诱导的隐藏状态结构,而非依赖单一行为单元(如神经元、向量或补丁)。该方法首次实现了跨模型与跨底物(substrate)的几何一致性验证:即使局部坐标和表达方式发生改变,源定义的几何家族仍能维持其可检测性,并在多个基准测试中表现出高精度(AUC 0.984 和 0.72),证明了“重分布下的几何重现”是可提取且具有泛化能力的核心特征。
链接: https://arxiv.org/abs/2604.17663
作者: Gareth Seneque,Lap-Hang Ho,Nafise Erfanian Saeedi,Jeffrey Molendijk,Tim Elson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 49 pages, 7 figures
Abstract:Constitution-conditioned post-training can be analysed as a structured perturbation of a model’s learned representational geometry. We introduce ATLAS, a geometry-first program that traces constitution-induced hidden-state structure across charts, models, and substrates. Instead of treating the relevant unit as a single behaviour, neuron, vector, or patch, ATLAS tests a local chart whose tangent structure, occupancy distribution, and behavioural coupling can be measured under system change. On Gemma, the anchored source-local chart captures 310 / 320 reviewed source rows and all 84 / 84 reviewed score-flip rows, but compact exact-patch sufficiency does not close, so the exportable unit is the broader source-defined family. Freezing that family, we re-identify a target-local realisation in an unadapted Phi model, where the fully adjudicated confirmatory contrast separates with AUC 0.984 and mean gap 5.50. In held-out ALM8 mouse frontal-cortex perturbation data, the same source-defined family receives support across 5/5 folds, with mean held-out AUC 0.72 and mean fold gap 4.50. A multiple-choice analysis provides the main boundary: nearby target-local signals can appear without source-faithful closure. The resulting correspondence is not coordinate identity, site identity, or a target-side mediation theorem. It is geometric recurrence under redistribution: written constitutions can induce recoverable latent geometry whose organisation remains detectable across model and substrate changes while its local coordinates, occupancy, and behavioural expression shift.
[NLP-103] Semantic Density Effect (SDE): Maximizing Information Per Token Improves LLM Accuracy
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时存在准确性不足、内容发散及幻觉(Hallucination)等问题,其核心挑战在于如何在不增加计算开销的前提下提升输出质量。解决方案的关键在于提出“语义密度效应”(Semantic Density Effect, SDE),即通过优化提示词(prompt)中语义信息的密度——定义为每token的语义负载比例并校正冗余与具体性——来增强模型输出的准确性与聚焦度。该方法不依赖增加token数量(如Chain of Thought)、重复提示(Prompt Repetition)或调整顺序(Instruction Placement Effect, IPE),而是通过移除或替换低信息量token以强化语义信号,实现在零额外token和零延迟开销下显著提升性能:在五个前沿模型和七个基准测试中,高密度提示(SDE > 0.80)平均优于稀疏提示达+8.4个百分点,结合IPE后进一步提升至+11.7个百分点。
链接: https://arxiv.org/abs/2604.17659
作者: Amr Ahmed
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce the Semantic Density Effect (SDE): the empirical finding that prompts carrying higher semantic information per token consistently produce more accurate, focused, and less hallucinated outputs across all major LLM families. SDE is defined as the ratio of semantically loaded tokens to total prompt tokens, adjusted for redundancy and concreteness. Unlike prior prompt optimization techniques that add tokens (Chain of Thought), duplicate the prompt (Prompt Repetition), or reorder components (Instruction Placement Effect), SDE improves performance by removing or replacing low-information tokens while preserving or sharpening the semantic signal. Evaluated across five frontier models and seven benchmarks, ultra-dense prompts (SDE > 0.80) outperform diluted counterparts by an average of +8.4 percentage points with 0 additional tokens and 0 latency overhead. Combined with Instruction Placement Effect (IPE), the gain reaches +11.7 percentage points.
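SDE 定义为"语义负载 token 数 / 总 token 数"(并按冗余与具体性修正)。忽略修正项,用一个极简停用词近似即可示意其计算口径(停用词表为本示例虚构,仅作演示):

```python
# 极简示意:用虚构停用词表近似"低信息 token",估算提示词的语义密度。
STOPWORDS = {"the", "a", "an", "of", "to", "is", "in", "that", "please", "provide"}

def semantic_density(prompt):
    tokens = prompt.lower().split()
    loaded = [t for t in tokens if t not in STOPWORDS]
    return len(loaded) / len(tokens) if tokens else 0.0

dense = "Summarize quarterly revenue trends"
diluted = "Please provide a summary of the quarterly revenue trends"
print(round(semantic_density(dense), 2))    # 1.0(全为语义负载 token)
print(round(semantic_density(diluted), 2))  # 0.44(同义但被稀释)
```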
[NLP-104] Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation
【速读】: 该论文旨在解决视频到音乐(Video-to-Music, V2M)生成任务中现有模型依赖单一视觉条件导致语义和风格可控性不足的问题。解决方案的关键在于提出一种文本条件驱动的视频到音乐生成模型 Video-Robin,其通过将自回归规划(autoregressive planning)与基于扩散机制的合成(diffusion-based synthesis)相结合:首先利用自回归模块对视频和文本输入进行语义对齐,生成高层次音乐潜在表示(high-level music latents),再由局部 Diffusion Transformers 对这些潜在表示进行精细化重构,从而在保持音频真实感的同时实现细粒度的创作控制。此架构实现了音乐保真度与语义理解之间的平衡,并在推理速度上较当前最优方法提升 2.21 倍。
链接: https://arxiv.org/abs/2604.17656
作者: Vaibhavi Lokegaonkar,Aryan Vijay Bhosale,Vishnu Raj,Gouthaman KV,Ramani Duraiswami,Lie Lu,Sreyan Ghosh,Dinesh Manocha
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models achieve audiovisual alignment by typically relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this paper, we present Video-Robin, a novel text-conditioned video-to-music generation model that enables fast, high-quality, semantically aligned music generation for video content. To balance musical fidelity and semantic understanding, Video-Robin integrates autoregressive planning with diffusion-based synthesis. Specifically, an autoregressive module models global structure by semantically aligning visual and textual inputs to produce high-level music latents. These latents are subsequently refined into coherent, high-fidelity music using local Diffusion Transformers. By factoring semantically driven planning into diffusion-based synthesis, Video-Robin enables fine-grained creator control without sacrificing audio realism. Our proposed model outperforms baselines that solely accept video input and additional feature conditioned baselines on both in-distribution and out-of-distribution benchmarks, with a 2.21x inference speedup compared to SOTA. We will open-source everything upon paper acceptance.
[NLP-105] Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance
【速读】: 该论文旨在解决生成式 AI(Generative AI)在真实部署环境中因用户提示分布自然变化(natural prompt distribution shift)而导致的模型性能下降问题。其核心挑战在于,现有研究对这种动态分布偏移缺乏量化评估手段,且其对专用领域或特定用户群体模型可靠性的影响尚不明确。解决方案的关键在于提出 LLM Evaluation under Natural prompt Shift (LENS) 框架——一个以数据为中心的方法,用于测量提示分布变化并评估其对已部署大语言模型(LLM)性能的影响。通过在 192 个真实世界提示分布变化场景下系统性评估 81 个模型,作者发现即使中等程度的提示行为偏移也会导致平均 73% 的性能损失,尤其在不同潜在用户群和地理区域间交互时更为显著,从而强调了建立数据驱动监控机制以保障模型跨用户群体稳定性的必要性。
链接: https://arxiv.org/abs/2604.17650
作者: Parker Seegmiller,Sarah Masud Preum
机构: Dartmouth College (达特茅斯学院)
类目: Computation and Language (cs.CL)
备注: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Abstract:LLMs are increasingly deployed in dynamic, real-world settings, where the distribution of user prompts can shift substantially over time as new tasks, prompts, and users are introduced to a deployed model. Such natural prompt distribution shift poses a major challenge to LLM reliability, particularly for specialized models designed for narrow domains or user populations. Despite attention to out-of-distribution robustness, there is very limited exploration of measuring natural prompt distribution shift in prior work, and its impact on deployed LLMs remains poorly understood. We introduce the LLM Evaluation under Natural prompt Shift (LENS) framework: a data-centric approach for quantifying natural prompt distribution shift and evaluating its effect on the performance of deployed LLMs. We perform a large-scale evaluation using 192 real-world post-deployment prompt shift settings over time, user group, and geographic axes, training a total of 81 models on 4.68M training prompts, and evaluating on 57.6k prompts. We find that even moderate shifts in user prompt behavior correspond with large performance drops (73% average loss) in deployed LLMs. This performance degradation is particularly prevalent when users from different latent groups and geographic regions interact with models and is correlated with natural prompt distribution shift over time. We systematically characterize how LLM instruction following ability degrades over time and between user groups. Our findings highlight the critical need for data-driven monitoring to ensure LLM performance remains stable across diverse and evolving user populations.
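LENS 的具体度量方式以原文为准,这里给出一个用提示嵌入均值向量的余弦距离近似"提示分布偏移"的极简示意(数据为随机模拟,函数与阈值均为假设,并非论文实现):

```python
import numpy as np

def distribution_shift(old_embs, new_embs):
    """用两批 prompt 嵌入均值向量的余弦距离近似分布偏移程度(0 表示无偏移)。"""
    mu_old = old_embs.mean(axis=0)
    mu_new = new_embs.mean(axis=0)
    cos = float(mu_old @ mu_new / (np.linalg.norm(mu_old) * np.linalg.norm(mu_new)))
    return 1.0 - cos

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 16))               # 部署前的 prompt 嵌入(模拟)
shifted = rng.normal(loc=0.8, size=(200, 16))   # 模拟用户提示行为发生漂移后的嵌入
shift_small = distribution_shift(base, base[:100])  # 同分布子集:偏移较小
shift_large = distribution_shift(base, shifted)     # 分布漂移:偏移明显更大
```

实际监控中可将 `shift_large` 这类指标与部署模型的性能回归曲线联动,作为数据驱动监控的触发信号。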
[NLP-106] ThreadSumm: Summarization of Nested Discourse Threads Using Tree of Thoughts ACL2026
【速读】: 该论文旨在解决深度嵌套讨论线程(deeply nested discussion threads)的摘要生成问题,其核心挑战在于处理交错回复、引用和重叠话题,而传统大语言模型(LLM)摘要器难以可靠捕捉此类复杂结构。解决方案的关键在于提出ThreadSumm框架,该框架将摘要任务建模为对显式方面(aspect)和内容单元(Atomic Content Units)的分层推理过程:首先通过LLM提取 discourse aspects 和原子内容单元进行内容规划;接着利用句子排序构建面向线程的序列以呈现多视角观点而非单一线性脉络;最后在可解释的内容单元基础上,采用Tree of Thoughts搜索机制生成并评分多个段落候选,联合优化连贯性与覆盖度。此多提案与迭代精炼设计显著提升了逻辑结构化程度,并增强了观点保留率与议题覆盖能力。
链接: https://arxiv.org/abs/2604.17648
作者: Olubusayo Olabisi,Ekata Mitra,Ameeta Agrawal
机构: Portland State University (波特兰州立大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026
Abstract:Summarizing deeply nested discussion threads requires handling interleaved replies, quotes, and overlapping topics, which standard LLM summarizers struggle to capture reliably. We introduce ThreadSumm, a multi-stage LLM framework that treats thread summarization as a hierarchical reasoning problem over explicit aspect and content unit representations. Our method first performs content planning via LLM-based extraction of discourse aspects and Atomic Content Units, then applies sentence ordering to construct thread-aware sequences that surface multiple viewpoints rather than a single linear strand. On top of these interpretable units, ThreadSumm employs a Tree of Thoughts search that generates and scores multiple paragraph candidates, jointly optimizing coherence and coverage within a unified search space. With this multi-proposal and iterative refinement design, we show improved performance in generating logically structured summaries compared to existing baselines, while achieving higher aspect retention and opinion coverage in nested discussions.
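ThreadSumm 中"多候选生成 + 连贯性与覆盖度联合打分"的择优思路,可以用一个极简示意说明(打分函数为占位的假设实现,并非论文中由 LLM 执行的 Tree of Thoughts 搜索):

```python
def coverage(candidate: str, units: list) -> float:
    """覆盖度:候选摘要命中了多少原子内容单元(ACU,此处用子串匹配示意)。"""
    hit = sum(1 for u in units if u in candidate)
    return hit / len(units)

def coherence(candidate: str) -> float:
    """连贯性打分的占位实现(实际应由 LLM 或判别模型给出)。"""
    return min(1.0, len(candidate.split()) / 20)

def select_best(candidates, units, alpha=0.5):
    """在统一搜索空间中对候选段落按"连贯性 + 覆盖度"加权打分并择优。"""
    scored = [(alpha * coherence(c) + (1 - alpha) * coverage(c, units), c)
              for c in candidates]
    return max(scored)[1]

units = ["deadline moved", "budget cut", "new owner"]   # 假设的内容单元
cands = [
    "The thread discusses that the deadline moved.",
    "Replies note the deadline moved and the budget cut, with a new owner assigned.",
]
best = select_best(cands, units)   # 覆盖三个单元的候选胜出
```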
[NLP-107] Copy First Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining
【速读】: 该论文旨在解决多语言预训练过程中跨语言泛化能力(cross-lingual generalization)如何在早期学习阶段逐步形成的问题,尤其是现有研究多基于孤立因素或稀疏训练点,难以揭示其动态演化机制。解决方案的关键在于:首先,在九种多样化语言上对17亿参数的多语言模型进行高粒度的预训练,并捕获多个细粒度检查点;其次,引入一个新的词级翻译数据集,结合行为分析、模型组件分析和参数消融实验,系统追踪翻译能力的发展轨迹。研究发现翻译能力发展呈现两个阶段:初期以复制和表面相似性为主,后期则发展出更通用的翻译机制并优化复制策略,从而提供了跨语言泛化在预训练早期阶段的精细化认知框架。
链接: https://arxiv.org/abs/2604.17633
作者: Felicia Körner,Maria Matveev,Florian Eichin,Gitta Kutyniok,Barbara Plank,Michael A. Hedderich
机构: LMU Munich (慕尼黑大学); Munich Center for Machine Learning (MCML); University of Tromsø (特罗姆瑟大学); DLR-German Aerospace Center (德国航空航天中心)
类目: Computation and Language (cs.CL)
备注: 10 pages
Abstract:Large language models exhibit impressive cross-lingual capabilities. However, prior work analyzes this phenomenon through isolated factors and at sparse points during training, limiting our understanding of how cross-lingual generalization emerges–particularly in the early phases of learning. To study the early trajectory of linguistic and translation capabilities, we pretrain a multilingual 1.7B model on nine diverse languages, capturing checkpoints at a much finer granularity. We further introduce a novel word-level translation dataset and trace how translation develops over training through behavioral analyses, model-component analysis, and parameter-based ablations. We find that the model quickly acquires basic linguistic capabilities in parallel with token-level copying, while translation develops in two distinct phases: an initial phase dominated by copying and surface-level similarities, and a second phase in which more generalizing translation mechanisms are developed while copying is refined. Together, these findings provide a fine-grained view of how cross-lingual generalization develops during multilingual pretraining.
[NLP-108] Does Welsh media need a review? Detecting bias in Nation.Cymru's political reporting
【速读】: 该论文旨在解决威尔士媒体是否存在政治偏见的问题,特别是针对特定政党在新闻报道中是否受到差异化框架对待的实证检验。其解决方案的关键在于构建一个两阶段自然语言处理(Natural Language Processing, NLP)流水线:第一阶段采用优化的RoBERTa模型进行高效偏见检测,第二阶段利用大语言模型(Large Language Model, LLM)对识别出的偏见标签进行目标导向的情感分类。该方法不仅量化了不同政党的报道框架差异(如Reform UK受到偏见框架化的频率是Plaid Cymru的两倍,平均情感负面程度超过三倍),还提供了一个低成本、可复现的分析框架,可用于扩展至其他威尔士媒体或更广泛的媒体生态系统。
链接: https://arxiv.org/abs/2604.17628
作者: Cai Parry-Jones
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Wales’ political landscape has been marked by growing accusations of bias in Welsh media. This paper takes the first computational step toward testing those claims by examining Nation.Cymru, a prominent Welsh political news outlet. I use a two-stage natural language processing (NLP) pipeline: (1) a robustly optimized BERT approach (RoBERTa) bias detector for efficient bias discovery and (2) a large language model (LLM) for target-attributed sentiment classification of bias labels from (1). A primary analysis of 15,583 party mentions across 2022-2026 news articles finds that Reform UK attracts biased framing at twice the rate of Plaid Cymru and over three times as negative in mean sentiment (p < 0.001). A secondary analysis across four parties across both news and opinion articles shows that Plaid Cymru is the outlier, receiving markedly more favourable framing than any other party. These findings provide evidence of measurable differential framing in a single Welsh political media outlet, supporting calls for a broader review of Welsh media coverage. Furthermore, the two-stage pipeline offers a low-cost, replicable framework for extending this analysis to other Welsh outlets, as well as media ecosystems outside of Wales.
[NLP-109] Toward Reusability of AI Models Using Dynamic Updates of AI Documentation
【速读】: 该论文旨在解决当前人工智能(AI)模型难以复用的问题,其根源在于缺乏充分的AI文档(即AI模型卡)以及模型卡内容与不断变化的AI最佳实践之间存在时间滞后。解决方案的关键在于提出一种敏捷、数据驱动且基于社区的AI模型卡生成方法,通过Hugging Face(HF)开源平台上的海量AI模型数据,结合Zero Draft(ZD)标准化模板,量化分析模型下载量/点赞数等复用指标与文档质量(如目录结构和词汇统计)之间的相关性,并构建持续比较文档模板与社区标准实践的基础设施,从而缩短模型卡更新周期并提升文档与当前AI最佳实践的一致性,最终促进AI模型的高效复用。
链接: https://arxiv.org/abs/2604.17626
作者: Peter Bajcsy,Walid Keyrouz
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 28 pages, 16 figures, 9 tables
Abstract:This work addresses the challenge of disseminating reusable artificial intelligence (AI) models accompanied by AI documentation (a.k.a., AI model cards). The work is motivated by the large number of trained AI models that are not reusable due to the lack of (a) AI documentation and (b) the temporal lag between rapidly changing requirements on AI model reusability and those specified in various AI model cards. Our objectives are to shorten the lag time in updating AI model card templates and align AI documentation more closely with current AI best practices. Our approach introduces a methodology for delivering agile, data-driven, and community-based AI model cards. We use the Hugging Face (HF) repository of AI models, populated by a subset of the AI research and development community, and the AI consortium-based Zero Draft (ZD) templates for the AI documentation of AI datasets and AI models, as our test datasets. We also address questions about the value of AI documentation for AI reusability. Our work quantifies the correlations between AI model downloads/likes (i.e., AI model reuse metrics) from the HF repository and their documentation alignment with the ZD documentation templates using tables of contents and word statistics (i.e., AI documentation quality metrics). Furthermore, our work develops the infrastructure to regularly compare AI documentation templates against community-standard practices derived from millions of uploaded AI models in the Hugging Face repository. The impact of our work lies in introducing a methodology for delivering agile, data-driven, and community-based standards for documenting AI models and improving AI model reuse. 
[NLP-110] Characterizing Model-Native Skills
【速读】: 该论文旨在解决现有语言模型行为干预中技能表征依赖人工定义分类体系(如人类编写的知识分类或文本描述)所带来的局限性问题,这类外部假设可能与模型内部表示不一致,从而限制了干预的有效性。其解决方案的关键在于提出一种“模型原生”(model-native)的技能表征方法:通过从序列级激活(sequence-level activations)中恢复一个紧凑且正交的基底(orthogonal basis),该基底虽具备语义可解释性但无需对应任何预设的人类知识体系,而是捕捉模型自身组织的行为变化轴(axes of behavioral variation)。这一基底可用于训练数据选择(如SFT阶段)和推理时控制(inference-time steering),在MATH和AMC等基准上显著提升性能,并在安全对齐任务中实现更高效的样本利用,验证了基于模型内部表示重构技能比外部强加更具有效性。
链接: https://arxiv.org/abs/2604.17614
作者: Feiyang Kang,Mahavir Dabas,Myeongseob Ko,Ruoxi Jia
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: We argue that when the goal is to intervene on model behavior, skill characterization should be model-native: grounded in the model’s own representations rather than imposed through external ontologies
Abstract:Skills are a natural unit for describing what a language model can do and how its behavior can be changed. However, existing characterizations rely on human-written taxonomies, textual descriptions, or manual profiling pipelines–all external hypotheses about what matters that need not align with the model’s internal representations. We argue that when the goal is to intervene on model behavior, skill characterization should be model-native: grounded in the model’s own representations rather than imposed through external ontologies. We instantiate this view by recovering a compact orthogonal basis from sequence-level activations. The resulting basis is semantically interpretable but need not correspond to any predefined human ontology; instead, it captures axes of behavioral variation that the model itself organizes around. We validate this characterization on reasoning post-training, using the recovered basis for both SFT data selection and inference-time steering. We develop lightweight proxy interventions to identify which directions are most useful for a given model. Across Llama3-8B and Qwen2.5-3B, selecting data along those directions improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming data selection based on human-characterized skills. Because the basis lives in activation space, the same directions also serve as steering vectors at inference time, improving Pass@8 by up to 4.8% on MATH–an intervention that human-characterized skills cannot support. We further validate the characterization on safety alignment, where selecting adversarial training data for model-native skill coverage rather than textual diversity yields more sample-efficient learning. These results suggest that recovering skills from the model’s own representations, rather than imposing them externally, provides a more effective foundation for intervening on model behavior. Codes are open-sourced.
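摘要中"从序列级激活中恢复紧凑正交基"的思路,可以用 SVD 做一个示意(论文的具体恢复方法以原文为准,此处激活为随机模拟数据):

```python
import numpy as np

# 示意:从序列级激活恢复 k 个正交方向,作为"模型原生技能"轴
rng = np.random.default_rng(42)
acts = rng.normal(size=(500, 64))          # 500 条样本的序列级激活(模拟)
acts = acts - acts.mean(axis=0)            # 中心化
U, S, Vt = np.linalg.svd(acts, full_matrices=False)
k = 8
basis = Vt[:k]                             # Vt 的行正交归一,取前 k 行作为基底

gram = basis @ basis.T                     # 正交性检验:应为单位阵
coords = acts @ basis.T                    # 每条样本在技能基底上的坐标
# coords 可用于数据选择;basis 的行向量可直接作为推理时的 steering 方向
```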
[NLP-111] Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的智能体在面对环境中的意外但相关信息时,缺乏主动识别与利用能力的问题,即所谓“环境好奇心”(environmental curiosity)的缺失。研究表明,尽管LLM-based agents能够发现环境中隐藏的任务解决方案(如在Terminal-Bench中发现率达79–81%),却极少将其用于策略调整或任务执行(仅37–50%被利用),尤其在AppWorld中,即使明确看到文档提示命令可返回完整解法,仍不足7%的尝试中加以利用。解决方案的关键在于识别并优化三个核心因素:代理架构中的可用工具、推理时计算资源以及训练数据分布;研究发现,通过联合优化这些配置可显著提升环境好奇心,并带来基准测试性能的改善,但即便如此,现有代理仍普遍无法有效利用所发现的信息,表明其仍局限于获取预期信息而非动态调整策略以最大化环境刺激的价值。
链接: https://arxiv.org/abs/2604.17609
作者: Leon Engländer,Sophia Althammer,Ahmet Üstün,Matthias Gallé,Tom Sherborne
机构: Cohere; Poolside
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect or react to unexpected information. Across three benchmarks (Terminal-Bench, SWE-Bench, AppWorld), we inject complete task solutions into the agent environments to deliberately expose a task’s solution to a model. While agents discover these solutions on Terminal-Bench in 79-81% of runs, they interact, or exploit, them in only 37-50% of cases. This gap is starkest in AppWorld: agents see documentation stating that a command “returns the complete solution to this task” in over 90% of attempts but exploit this in fewer than 7% of trials. We show that agents lack what we call environmental curiosity: the capability to recognize and investigate unexpected but relevant observations in response to environmental stimuli. We identify three main factors influencing environmental curiosity: available tools in the agent scaffold, test-time compute, and training data distribution. Our findings identify configurations that maximize curiosity also achieve the best performance on the unmodified benchmarks. Yet even jointly optimized agents still ignore discovered solutions in the majority of trials: current agents use the environment to fetch expected information, but not to revise their strategy or maximally exploit useful stimuli.
[NLP-112] Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation
【速读】: 该论文旨在解决多选题中干扰项(distractor)生成任务的自动化难题,该任务传统上高度依赖领域专家手工设计,且现有方法难以捕捉专家在构建干扰项时所运用的隐式推理过程。解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的上下文学习(in-context learning)框架,结合无监督语义检索选择少量示例(few-shot examples),并设计了一个增强推理链(rationale-augmented)的生成机制,能够同时输出干扰项及其对应的逻辑解释(rationale),从而更贴近人类标注基准中的推理模式。实验表明,该方法在六个不同领域和长度的基准上均显著优于当前主流模型,实现了有理由可解释的干扰项生成。
链接: https://arxiv.org/abs/2604.17574
作者: Elaf Alhazmi,Quan Z. Sheng,Wei Emma Zhang
机构: Macquarie University (麦克奎里大学); Adelaide University (阿德莱德大学); Umm Al-Qura University (乌姆库拉大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Distractor generation (DG) remains a labor-intensive task that still significantly depends on domain experts. The task focuses on generating plausible yet incorrect options, known as distractors, for multiple-choice questions. A reliable distractor must be contextually relevant to the question and able to mislead examinees through implicit reasoning when identifying the correct answer. While a recent method integrates fine-tuning pre-trained encoder-decoder models with contrastive learning to generate semantically relevant distractors for a given question-answer, it often fails to capture the underlying reasoning process that experts utilize when selecting distractors in benchmarks. In this paper, we explore large language models (LLMs) reasoning for DG through in-context learning with unsupervised semantic retrieval for selecting few-shot examples. We design a rationale-augmented DG framework that jointly generates distractors and their rationales for a given question-answer. Extensive experiments on six benchmarks, with varying average distractor lengths and domains, demonstrate that prompting LLMs with few-shot examples substantially improves the performance compared to recent DG models. It outperforms recent approaches and achieves state-of-the-art results in generating reasoned distractors that align with human-labeled benchmarks.
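其中"无监督语义检索选取 few-shot 示例"一步,可以用余弦相似度 top-k 检索来示意(嵌入与候选池均为假设的玩具数据,非论文所用编码器):

```python
import numpy as np

def topk_examples(query_emb, pool_embs, k=2):
    """按余弦相似度为给定问答检索 k 个 few-shot 示例(示意实现)。"""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    return np.argsort(-sims)[:k]           # 相似度降序的前 k 个索引

# 玩具示例:候选池中前两条与查询方向接近
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
idx = topk_examples(np.array([1.0, 0.05]), pool, k=2)
```

检索到的示例随后与待出题的问答一起拼入提示,引导 LLM 同时生成干扰项与理由。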
[NLP-113] MAPLE: A Meta-learning Framework for Cross-Prompt Essay Scoring ACL
【速读】: 该论文旨在解决自动化作文评分(Automated Essay Scoring, AES)在跨提示(cross-prompt)场景下的泛化能力不足问题,即模型在面对未见过的写作提示时性能显著下降。解决方案的关键在于提出一种基于元学习(meta-learning)的框架MAPLE,其核心是利用原型网络(prototypical networks)学习跨不同写作提示的可迁移表征,从而提升模型在新提示下的适应能力与评分一致性。实验表明,MAPLE在多个语言和数据集上均取得最优性能,尤其在英语(ELLIPSE)和阿拉伯语(LAILA)数据集上分别提升了8.5和3点QWK得分,验证了该方法的有效性。
链接: https://arxiv.org/abs/2604.17569
作者: Salam Albatarni,May Bashendy,Sohaila Eltanbouly,Tamer Elsayed
机构: Qatar University (卡塔尔大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL Findings 2026
Abstract:Automated Essay Scoring (AES) faces significant challenges in cross-prompt settings, where models must generalize to unseen writing prompts. To address this limitation, we propose MAPLE, a meta-learning framework that leverages prototypical networks to learn transferable representations across different writing prompts. Across three diverse datasets (ELLIPSE and ASAP (English), and LAILA (Arabic)), MAPLE achieves state-of-the-art performance on ELLIPSE and LAILA, outperforming strong baselines by 8.5 and 3 points in QWK, respectively. On ASAP, where prompts exhibit heterogeneous score ranges, MAPLE yields improvements on several traits, highlighting the strengths of our approach in unified scoring settings. Overall, our results demonstrate the potential of meta-learning for building robust cross-prompt AES systems.
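原型网络的核心计算可以示意如下:按分数档位对支持集嵌入求均值得到原型,再将查询作文归入最近原型对应的分数档(仅为思路示意,嵌入为玩具数据,非论文实现):

```python
import numpy as np

def prototype_predict(support_embs, support_scores, query_emb):
    """原型网络打分示意:最近原型的分数档即为预测结果。"""
    protos = {s: support_embs[support_scores == s].mean(axis=0)
              for s in np.unique(support_scores)}
    dists = {s: np.linalg.norm(query_emb - p) for s, p in protos.items()}
    return min(dists, key=dists.get)

# 玩具支持集:分数 1 的作文聚在原点附近,分数 3 的聚在 (1, 1) 附近
embs = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])
scores = np.array([1, 1, 3, 3])
pred = prototype_predict(embs, scores, np.array([0.9, 1.1]))
```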
[NLP-114] PoliLegalLM : A Technical Report on a Large Language Model for Political and Legal Affairs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在法律领域直接应用时面临的三大挑战:幻觉性法律引用(hallucinated legal citations)、知识覆盖不全以及结构化推理能力弱。为应对这些问题,作者提出了一种面向政治与法律场景的专用大语言模型——PoliLegalLM。其解决方案的关键在于采用统一的训练框架,融合持续预训练(continued pretraining)、渐进式监督微调(progressive supervised fine-tuning)和基于偏好的强化学习(preference-based reinforcement learning),从而协同提升模型的法律知识锚定能力、任务对齐性和推理性能。此外,研究构建了一个大规模高质量法律语料库,并设计了结构化的后训练流程,使模型能够有效学习领域特异性知识并适应多样化的法律任务,在多个基准测试和真实世界数据集上均展现出优于同类规模模型甚至超越更大模型的性能表现。
链接: https://arxiv.org/abs/2604.17543
作者: Yuting Huang,Yinghao Hu,Qian Xiao,Wenlin Zhong,Yiquan Wu,Taishi Zhou,Moke Chen,Changlong Sun,Kun Kuang,Fei Wu
机构: Tongyi Lab, Alibaba Group, Hangzhou, China(通义实验室,阿里巴巴集团,杭州); Zhejiang University, Hangzhou, China(浙江大学,杭州); Zhejiang Zhengfa Xinxi Guanli Zhongxin, Hangzhou, China(浙江省政法信息管理中心,杭州)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have achieved remarkable success in general-domain tasks, yet their direct application to the legal domain remains challenging due to hallucinated legal citations, incomplete knowledge coverage, and weak structured reasoning. To address these issues, we propose PoliLegalLM, a domain-specific large language model tailored for political and legal applications. Our approach adopts a unified training framework that integrates continued pretraining, progressive supervised fine-tuning, and preference-based reinforcement learning to jointly enhance legal knowledge grounding, task alignment, and reasoning capability. We construct a large-scale, high-quality legal corpus and design a structured post-training pipeline, enabling the model to effectively learn domain-specific knowledge and adapt to diverse legal tasks. We evaluate PoliLegalLM on three representative benchmarks, including LawBench, LexEval, and a real-world dataset, PoliLegal. Experimental results demonstrate that PoliLegalLM achieves strong and consistent performance, outperforming competitive models of similar scale and remaining highly competitive with significantly larger models, while achieving the best results on real-world legal scenarios. These results highlight the effectiveness of our training paradigm and the practical value of domain-specific LLMs for real-world legal applications.
[NLP-115] OPSDL: On-Policy Self-Distillation for Long-Context Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中有效上下文长度受限的问题。现有后训练方法虽在长上下文扩展上取得进展,但往往依赖高质量监督数据或稀疏的序列级奖励信号,导致优化不稳定且效率低下。其解决方案的关键在于提出一种同策略(on-policy)自蒸馏方法——OPSDL(On-Policy Self-Distillation for Long-context learning),该方法利用模型自身强大的短上下文能力作为"自教师"(self-teacher),对长上下文生成过程提供密集的逐token监督信号。具体而言,模型先基于完整长上下文生成响应,随后通过提取相关短上下文并计算逐点反向KL散度(point-wise reverse KL divergence)获得细粒度监督信号,从而引导模型更忠实利用相关信息、减少无关上下文引发的幻觉现象。此机制显著提升了长上下文性能,同时保持了短上下文任务的表现,展现出良好的可扩展性和稳定性。
链接: https://arxiv.org/abs/2604.17535
作者: Xinsen Zhang,Zhenkai Ding,Tianjun Pan,Run Yang,Chun Kang,Xue Xiong,Jingnan Gu
机构: Baidu Inc
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure
Abstract:Extending the effective context length of large language models (LLMs) remains a central challenge for real-world applications. While recent post-training methods have made progress in long-context scaling, they either rely on high-quality supervision data or sparse sequence-level rewards, leading to unstable and inefficient optimization. We propose OPSDL, an On-Policy Self-Distillation method for enhancing the Long-context capabilities of LLMs. Unlike other recent self-distillation methods that inject privileged information and rely on the model’s in-context learning ability to act as a teacher, OPSDL leverages the model’s own inherently strong short-context capability as a self-teacher to supervise its own generation in long-context scenarios. The model first generates responses conditioned on the full long-context, then the self-teacher provides per-token supervision signals via point-wise reverse KL divergence under the relevant extracted short-context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across a range of models from 7B to 32B parameters. Results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These findings highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.
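逐 token 反向 KL 监督信号的计算可以示意如下(KL 方向与实现细节以原文为准;此处假设 student 为长上下文条件下的分布、teacher 为同一模型在抽取短上下文上的自教师分布):

```python
import numpy as np

def reverse_kl_per_token(student_probs, teacher_probs, eps=1e-12):
    """逐 token 计算 KL(student || teacher)。
    相对于常规蒸馏的 KL(teacher || student),此方向即所谓"反向"KL。"""
    p = student_probs + eps
    q = teacher_probs + eps
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# 玩具例子:2 个 token、词表大小 3
student = np.array([[0.7, 0.2, 0.1],    # 与自教师一致 → 无惩罚
                    [0.1, 0.8, 0.1]])   # 与自教师偏离 → 产生密集监督信号
teacher = np.array([[0.7, 0.2, 0.1],
                    [0.3, 0.4, 0.3]])
kl = reverse_kl_per_token(student, teacher)
```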
[NLP-116] ONTO: A Token-Efficient Columnar Notation for LLM Input Optimization
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理大规模操作数据时,因传统序列化格式(如JSON)引入的结构性开销过大而导致的token浪费问题。具体而言,JSON等格式在存储大量结构化数据时,重复字段名、嵌套括号和分隔符号占用了大量token,显著降低了LLM对语义内容的有效利用率。解决方案的关键在于提出一种名为ONTO(Object Notation for Token Optimization)的新序列化格式,其核心设计是“schema-once, data-many”:仅在每个实体中声明一次字段名,随后以管道分隔的行式排列值,并通过缩进表示层次结构。该方法在保留人类可读性和嵌套结构支持的同时,有效消除了每条记录中的键重复,从而实现高达46-51%的token压缩率,并在Qwen2.5-7B模型上带来5-10%的推理延迟改进,且不损害LLM在查找、计数、提取和聚合等任务中的准确性。
链接: https://arxiv.org/abs/2604.17512
作者: Harshavardhanan Deekeswar
机构: Independent Researcher(独立研究员)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 5 tables, 1 figure. Code, benchmarks, and specification at this https URL
Abstract:Serialization formats designed for document interchange impose structural overhead that becomes prohibitive when large language models consume operational data at scale. A modest dataset of 1,000 IoT sensor readings serialized as JSON requires approximately 80,000 tokens - the majority spent on repeated field names, nested braces, and structural punctuation rather than semantic content. We present ONTO (Object Notation for Token Optimization), a columnar notation that declares field names once per entity and arranges values in pipe-delimited rows with indentation-based hierarchy. This schema-once, data-many design eliminates per-record key repetition while preserving human readability and nested structure support. Evaluation across three synthetic operational datasets demonstrates 46-51% token reduction versus JSON, with stable scaling from 100 to 1,000 records. Controlled inference benchmarks on Qwen2.5-7B show corresponding 5-10% latency improvement. Comprehension validation confirms no material degradation in LLM task accuracy across lookup, counting, extraction, and aggregation operations when format context is provided. Ablation analysis reveals that key repetition accounts for the majority of JSON overhead, with indentation costs in nested structures explaining the 4-percentage-point gap between flat and hierarchical data. ONTO occupies a previously unfilled position in the serialization landscape: columnar efficiency with hierarchical structure, optimized for LLM context windows rather than document interchange. Code and specification are available at this https URL.
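按摘要描述的"schema 一次、数据多行"设计,可以写出一个 ONTO 风格序列化的极简示意,并与 JSON 比较字符开销(管道分隔、布尔值写法等语法细节为假设,以论文规范为准;此处用字符数近似 token 开销):

```python
import json

def to_onto(records):
    """ONTO 风格序列化示意:首行声明字段名,其后每条记录一行、值用 | 分隔。"""
    fields = list(records[0])
    lines = ["|".join(fields)]
    lines += ["|".join(str(r[f]) for f in fields) for r in records]
    return "\n".join(lines)

# 100 条模拟传感器读数:JSON 会在每条记录中重复全部字段名
records = [{"sensor": f"s{i}", "temp": 20 + i, "ok": True} for i in range(100)]
json_len = len(json.dumps(records))
onto_len = len(to_onto(records))
saving = 1 - onto_len / json_len   # 字符层面的压缩比例
```

对扁平记录,节省主要来自消除每条记录的键名重复,这与摘要中消融分析的结论一致。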
[NLP-117] CoAct: Co-Active LLM Preference Learning with Human-AI Synergy ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在通过偏好反馈进行对齐时,高质量人类标注偏好数据稀缺且成本高昂的问题。现有方法要么依赖纯AI生成的自奖励标签(self-rewarding),虽可扩展但存在可靠性风险;要么采用主动学习(active learning),虽能保证质量但难以充分利用未标注数据。论文提出CoAct框架,其核心在于通过策略性的人机协同,融合自奖励与主动学习的优势:利用自一致性(self-consistency)识别可靠自标注数据和需人工验证样本,同时借助人工oracle反馈引导模型生成在其能力范围内的新指令,从而实现高效、高质量的模型优化。
链接: https://arxiv.org/abs/2604.17501
作者: Ruiyao Xu,Mihir Parmar,Tiankai Yang,Zhengyu Hu,Yue Zhao,Kaize Ding
机构: Northwestern University(西北大学); Google(谷歌); University of Southern California(南加州大学); University of Washington(华盛顿大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026
Abstract:Learning from preference-based feedback has become an effective approach for aligning LLMs across diverse tasks. However, high-quality human-annotated preference data remains expensive and scarce. Existing methods address this challenge through either self-rewarding, which scales by using purely AI-generated labels but risks unreliability, or active learning, which ensures quality through oracle annotation but cannot fully leverage unlabeled data. In this paper, we present CoAct, a novel framework that synergistically combines self-rewarding and active learning through strategic human-AI collaboration. CoAct leverages self-consistency to identify both reliable self-labeled data and samples that require oracle verification. Additionally, oracle feedback guides the model to generate new instructions within its solvable capability. Evaluated on three reasoning benchmarks across two model families, CoAct achieves average improvements of +13.25% on GSM8K, +8.19% on MATH, and +13.16% on WebInstruct, consistently outperforming all baselines.
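CoAct 中"自一致性路由"的思路可以示意如下:对同一问题采样多个答案,多数答案占比达到阈值则自标注,否则送人工 oracle 校验(阈值为假设值,论文具体判据以原文为准):

```python
from collections import Counter

def route_by_consistency(samples, tau=0.75):
    """多数答案占比 >= tau:视为可靠自标注;否则交由 oracle 验证。"""
    ans, cnt = Counter(samples).most_common(1)[0]
    agree = cnt / len(samples)
    return ("self", ans) if agree >= tau else ("oracle", None)

confident = route_by_consistency(["42", "42", "42", "41"])   # 一致率 0.75
uncertain = route_by_consistency(["12", "15", "12", "9"])    # 一致率 0.50
```

送 oracle 的样本由人工给出偏好反馈,同时反馈还可引导模型在自身能力范围内生成新指令。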
[NLP-118] Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agent ic Systems
【速读】: 该论文旨在解决代理系统(Agentic systems)在生成回答时因过度承诺(overcommitment)而导致的可靠性问题,即系统虽整体有用,但个别陈述超出了证据支持范围。其核心解决方案是提出组合选择性特异性(Compositional Selective Specificity, CSS),该方法在生成后阶段将答案分解为独立命题,逐条评估并回退至最具体且可校准的可信水平,从而以局部语义回退的方式表达不确定性,而非整体拒绝输出。这一机制显著提升了风险-效用权衡,在LongFact全量实验中使过承诺感知效用从0.846提升至0.913,同时保持0.938的特异性保留率,表明命题级特异性控制是一种有效的不确定性接口。
链接: https://arxiv.org/abs/2604.17487
作者: Tianyi Huang,Samuel Xu,Jason Tansong Dang,Samuel Yan,Kimberley Yin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Agentic systems often fail not by being entirely wrong, but by being too precise: a response may be generally useful while particular claims exceed what the evidence supports. We study this failure mode as overcommitment control and introduce compositional selective specificity (CSS), a post-generation layer that decomposes an answer into claims, proposes coarser backoffs, and emits each claim at the most specific calibrated level that appears admissible. The method is designed to express uncertainty as a local semantic backoff rather than as a whole-answer refusal. Across a full LongFact run and HotpotQA pilots, calibrated CSS improves the risk-utility trade-off of fixed drafts. On the full LongFact run, it raises overcommitment-aware utility from 0.846 to 0.913 relative to the no-CSS output while achieving 0.938 specificity retention. These results suggest that claim-level specificity control is a useful uncertainty interface for agentic systems and a target for future distribution-free validity layers.
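CSS 中命题级"回退到最具体且可校准层级"的核心逻辑可以示意如下(各层级的置信度与阈值均为假设的玩具数值,实际由校准器给出):

```python
def emit_claim(levels, confidences, threshold=0.8):
    """levels 按从最具体到最粗糙排列:
    返回第一个置信度达到校准阈值的层级;全部不达标则放弃该命题。"""
    for claim, conf in zip(levels, confidences):
        if conf >= threshold:
            return claim
    return None

levels = [
    "The library was founded in 1847.",              # 最具体
    "The library was founded in the 1840s.",         # 回退一级
    "The library was founded in the 19th century.",  # 最粗糙
]
out = emit_claim(levels, confidences=[0.55, 0.9, 0.99])  # 选中年代级表述
```

与整体拒绝回答相比,这种局部语义回退保留了答案的大部分效用,同时避免单条命题超出证据支持。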
[NLP-119] Waking Up Blind: Cold-Start Optimization of Supervision-Free Agent ic Trajectories for Grounded Visual Perception ACL2026
【速读】: 该论文旨在解决小型视觉语言模型(Small Vision-Language Models, SVLMs)在任务执行中普遍存在的视觉脆弱性(visual brittleness)和工具编排能力不足的问题,这些问题通常需要昂贵的监督轨迹调优来缓解。其解决方案的关键在于提出一种无需监督信号的框架SPECTRA(Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment),通过冷启动强化学习(Coldstart Reinforcement Learning)自举智能体能力;其中核心创新是引入软结构化多轮推演(Soft Structured Multi-turn Rollouts)拓扑约束,强制智能体在合成推理前显式排序由工具生成的证据,从而将推理过程锚定于视觉观测。此外,通过多目标奖励信号同时优化任务正确性、推演结构和工具效用,使智能体无需人类偏好标签即可自我发现鲁棒行为,并引入工具效用(Tool Instrumental Utility, TIU)作为无真值场景下的工具效能量化指标,显著提升了任务准确率(最高+5%)与工具效率(+9%)。
链接: https://arxiv.org/abs/2604.17475
作者: Ashutosh Bajpai,Tamal Majumder,Akshay Nambi,Tanmoy Chakraborty
机构: Indian Institute of Technology Delhi(印度理工学院德里分校); Indian Institute of Technology Abu Dhabi(印度理工学院阿布扎比分校); Microsoft Research(微软研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2026 Findings
Abstract:Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration. They typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities via Coldstart Reinforcement Learning for SVLMs. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool derived evidence before synthesis, effectively grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling agent to self-discover robust behaviors without human preference labels. We further introduce Tool Instrumental Utility (TIU), a novel metric to quantify tool efficacy in the absence of ground truth. Extensive evaluations across composite and out-of-distribution (MMMU-Pro) benchmarks demonstrate that SPECTRA boosts agentic trajectories, improving task accuracy by up to 5% and tool efficiency by 9%, enabling more efficient multimodal agents that learn effectively from environmental interaction alone.
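多目标奖励的组合方式可以示意如下(线性加权与权重配比均为假设,摘要并未给出具体形式):

```python
def spectra_reward(correct, structure_ok, tool_utility, w=(0.6, 0.2, 0.2)):
    """多目标奖励示意:同时奖励任务正确性、推演结构合规与工具效用。
    correct / structure_ok 为布尔项,tool_utility 为 [0, 1] 区间的 TIU 类指标。"""
    return (w[0] * float(correct)
            + w[1] * float(structure_ok)
            + w[2] * tool_utility)

# 正确、结构合规、工具效用中等的一条轨迹
r = spectra_reward(correct=True, structure_ok=True, tool_utility=0.5)
```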
[NLP-120] MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation INTERSPEECH
【速读】: 该论文旨在解决语音到语音翻译(Speech-to-Speech Translation, S2ST)系统在实际应用中普遍忽略非言语发声(Non-Verbal Vocalizations, NVs),如笑声和哭泣等,导致语用意图丢失的问题。解决方案的关键在于三个方面:首先,构建可扩展的表达性数据集合成流程以缓解数据稀缺问题;其次,提出MoVE架构——一种基于LoRA专家混合(Mixture-of-LoRA-Experts)的模型,包含专门处理表达性的适配器和软权重路由机制,能够捕捉复合情感状态;最后,利用预训练音频大语言模型(AudioLLMs)实现显著的数据效率,仅需30分钟精选数据即可达到优异性能。实验表明,MoVE在英汉S2ST任务中能复现目标NVs达76%,且在人工评估中自然度与情感保真度均优于现有方法。
链接: https://arxiv.org/abs/2604.17435
作者: Szu-Chi Chen,I-Ning Tsai,Yi-Cheng Lin,Sung-Feng Huang,Hung-yi Lee
机构: National Taiwan University (国立台湾大学); NVIDIA (英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech. Audio Demo and Dataset: this https URL
Abstract:Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying that convey pragmatic intent, which severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome the data scarcity limitation. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts for capturing hybrid expressive states. Third, we show pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, while comparing with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, where existing S2ST systems preserve at most 14% of NVs.
[NLP-121] Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning ACL2026
【速读】: 该论文旨在解决自洽性(Self-consistency, SC)方法在提升大语言模型推理准确性时面临的高计算成本问题,即需要大量采样才能获得稳定结果。其解决方案的关键在于提出一种混合集成策略,融合Chain-of-Thought (CoT) 和 Program-of-Thought (PoT) 两种推理模式的优势,在保持甚至提升准确性的前提下,显著减少所需样本数量——实验表明,该方法可将SC所需的样本数降低9.3倍,且78.6%的任务仅需两个样本即可达成有效推理,这是此前任何SC方法均未实现的突破。
链接: https://arxiv.org/abs/2604.17433
作者: Raman Saparkhan,Majd Hawasly,Md Rizwan Parvez,Mohammad Raza
机构: Carnegie Mellon University Qatar; Qatar Computing Research Institute
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 3 figures; accepted to Findings of ACL 2026
Abstract:Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We introduce a hybrid ensembling approach that leverages the complementary strengths of two distinct modes of reasoning: Chain-of-Thought (CoT) and Program-of-Thought (PoT). We describe a general framework for combining these two forms of reasoning in self-consistency, as well as particular strategies for both full sampling and early-stopping. We show that CoT-PoT ensembling not only improves overall accuracy, but also drastically reduces the number of samples required for SC by a factor of 9.3x. In particular, the majority of tasks (78.6%) can be addressed with only two samples, which has not been possible with any prior SC methods.
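The two-sample agreement rule at the heart of the abstract can be sketched in a few lines. `sample_cot` and `sample_pot` below are hypothetical stand-ins for drawing one final answer from each reasoning mode, and the fallback voting loop is an illustration of the general framework, not the paper's exact early-stopping strategy.

```python
from collections import Counter

def cot_pot_ensemble(sample_cot, sample_pot, max_extra=8):
    """Draw one CoT and one PoT answer; if they agree, stop at two
    samples, otherwise fall back to majority voting over extra draws."""
    answers = [sample_cot(), sample_pot()]
    if answers[0] == answers[1]:              # cross-mode agreement
        return answers[0], 2                  # (answer, samples used)
    for i in range(max_extra):                # alternate the two modes
        answers.append(sample_cot() if i % 2 == 0 else sample_pot())
    best, _ = Counter(answers).most_common(1)[0]
    return best, len(answers)

# Toy samplers that agree immediately: only two samples are spent.
ans, n = cot_pot_ensemble(lambda: 42, lambda: 42)
```

When the two modes agree, the cost is exactly two samples, which is where the large sampling savings for the 78.6% of agreeing tasks would come from under this kind of scheme.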
[NLP-122] Jupiter-N Technical Report
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在特定文化语境、语言支持和代理能力方面的局限性,具体包括:缺乏对英国本土文化规范的对齐、对威尔士语等小众语言的支持不足,以及在复杂任务中推理与行动能力的欠缺。解决方案的关键在于提出一个可复现的主权后训练(sovereign post-training)框架——Jupiter-N,其核心创新包括:(1)通过不确定性筛选轨迹增强代理能力;(2)基于文化规范构建合成数据以实现英国文化对齐;(3)利用平行语料库和LLM翻译的威尔士对话数据提升威尔士语支持;同时采用Forget-Me-Not框架混合策略内合成回放与策略外任务数据,有效缓解灾难性遗忘,并保留原始模型的混合推理能力。该方法实现了在威尔士语性能、终端使用和指令遵循等方面的显著提升,且所有模型权重与训练数据均开源,为其他国家定制化模型提供了可复制的技术路径。
链接: https://arxiv.org/abs/2604.17429
作者: George Drayson
机构: Locai Labs, University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present Jupiter-N, a hybrid reasoning model post-trained from Nemotron 3 Super, a fully open-source 120 billion parameter LLM. We target three objectives: (1) agentic capability via uncertainty-curated trajectories; (2) UK cultural alignment via synthetic data grounded in cultural norms; and (3) Welsh language support via parallel corpora and LLM-translated Welsh conversations. Our data curation strategy carefully preserves the base model’s capabilities: using our Forget-Me-Not framework, we mix on-policy synthetic replay with off-policy task data to mitigate catastrophic forgetting, and include a mixture of reasoning and non-reasoning traces to maintain Nemotron’s hybrid reasoning ability. Jupiter-N achieves standout gains over Nemotron in Welsh (+18 on ARC-Easy, +5.25 on MMLU-Lite), terminal-use (+9.1 on Terminal Bench 2) and instruction following (+4.4 on IFBench), while retaining the base model capabilities. We frame this work as a reproducible template for sovereign post-training: substituting cultural knowledge, institutional corpora, and target languages produces an equivalent pipeline for any country. All model weights and all post-training datasets are publicly released under open licences.
[NLP-123] DuConTE: Dual-Granularity Text Encoder with Topology-Constrained Attention for Text-attributed Graphs
【速读】: 该论文旨在解决文本属性图(Text-attributed Graph)中现有方法在文本编码阶段仅以词元(word-token)粒度进行语义交互、且忽略不同节点间文本结构依赖关系的问题。其解决方案的关键在于提出DuConTE模型,该模型采用双粒度文本编码架构:首先通过两个级联的预训练语言模型(LMs),分别在词元粒度和节点粒度上进行语义编码;其次,在每个LM的自注意力计算中,动态调整注意力掩码矩阵以引入节点连通性信息,从而引导模型学习受图结构约束的语义关联;此外,在融合节点表示时,分别评估词元在中心节点上下文与邻域上下文中的重要性,增强对上下文相关语义信息的捕捉能力。
链接: https://arxiv.org/abs/2604.17411
作者: Lexuan Liang,Tao Zou,Xuxiang Ta,Zekun Qiu
机构: Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 4 figures
Abstract:Text-attributed graphs integrate semantic information of node texts with topological structure, offering significant value in various applications such as document classification and information extraction. Existing approaches typically encode textual content using language models (LMs), followed by graph neural networks (GNNs) to process structural information. However, during the LM-based text encoding phase, most methods not only perform semantic interaction solely at the word-token granularity, but also neglect the structural dependencies among texts from different nodes. In this work, we propose DuConTE, a dual-granularity text encoder with topology-constrained attention. The model employs a cascaded architecture of two pretrained LMs, encoding semantics first at the word-token granularity and then at the node granularity. During the self-attention computation in each LM, we dynamically adjust the attention mask matrix based on node connectivity, guiding the model to learn semantic correlations informed by the graph structure. Furthermore, when composing node representations from word-token embeddings, we separately evaluate the importance of tokens under the center-node context and the neighborhood context, enabling the capture of more contextually relevant semantic information. Extensive experiments on multiple benchmark datasets demonstrate that DuConTE achieves state-of-the-art performance on the majority of them.
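The topology-constrained attention idea can be illustrated with a node-level additive mask: a position may attend only to itself and its graph neighbours, and every other pair is set to -inf before the softmax. This is a simplified numpy sketch of the general mechanism, not DuConTE's exact in-LM masking scheme.

```python
import numpy as np

def topology_mask(adj):
    """Additive attention mask from node connectivity: 0 where
    attention is allowed (self + neighbours), -inf elsewhere."""
    adj = np.asarray(adj, dtype=bool)
    allowed = adj | np.eye(adj.shape[0], dtype=bool)
    return np.where(allowed, 0.0, -np.inf)

def masked_attention(scores, mask):
    s = scores + mask
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

adj = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]          # node 2 is isolated in the graph
w = masked_attention(np.zeros((3, 3)), topology_mask(adj))
# node 2 can only attend to itself; nodes 0 and 1 split attention
```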
[NLP-124] Contrastive Analysis of Linguistic Representations in Large Language Model Outputs through Structured Synthetic Data Generation and Abstracted N-gram Associations
【速读】: 该论文旨在解决如何识别文本中隐含的社会群体偏见问题,尤其关注那些未被预设词汇列表捕捉到的细微表达方式。传统方法多依赖于固定词表或孤立语句进行偏见诊断,难以反映真实语境下的语言使用差异。本文提出的关键解决方案是:通过对比合成文本生成与统计分析相结合的方法,利用情境化场景与群体标记的受控组合构建最小差异对(minimal pairs),在保持叙事条件一致的前提下,仅改变所指代的社会群体,从而生成可比文本;随后采用改进的点互信息(pointwise mutual information, PMI)量化语言抽象形式与不同群体之间的关联强度,并结合片段排序策略筛选出高浓度偏见信号的文本段落,使专家能够基于上下文评估语言表达的危害性,实现从定量分析到定性解释的有效衔接。
链接: https://arxiv.org/abs/2604.17398
作者: S.A. Desimone,L. Alonso Alemany
机构: Universidad Nacional de Córdoba; CONICET
类目: Computation and Language (cs.CL)
备注:
Abstract:We present a methodological framework to discover linguistic and discursive patterns associated with different social groups through contrastive synthetic text generation and statistical analysis. In contrast with previous approaches, we aim to characterize subtle expressions of bias, instead of diagnosing bias through a pre-determined list of words or expressions. We are also working with contextualized data instead of isolated words or sentences. Our methodology applies to textual productions in any genre, encompassing narrative, task-oriented or dialogic. Contextualized data are generated using controlled combinations of situational scenarios and group markers, creating minimal pairs of texts that differ only in the referenced group while maintaining comparable narrative conditions. To facilitate robust analysis, linguistic forms are generalized and associations between linguistic abstractions and groups are quantified using a variant of pointwise mutual information to detect expressions that appear disproportionately across groups. A fragment-ranking strategy then prioritizes text segments with a high concentration of biased linguistic signals, which allows experts to assess the harmful potential of linguistic expressions in context, bridging quantitative analysis and qualitative interpretation.
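The association measure reduces to pointwise mutual information over (form, group) co-occurrence counts. The sketch below uses the plain PMI definition with made-up counts; the paper applies a variant of this quantity and generalizes linguistic forms before counting.

```python
import math
from collections import Counter

def pmi(counts, form, group):
    """pmi(form, group) = log( p(form, group) / (p(form) * p(group)) );
    positive values mean the form appears disproportionately with the group."""
    total = sum(counts.values())
    p_fg = counts[(form, group)] / total
    p_f = sum(c for (f, _), c in counts.items() if f == form) / total
    p_g = sum(c for (_, g), c in counts.items() if g == group) / total
    return math.log(p_fg / (p_f * p_g))

# Hypothetical counts of one abstracted form across two groups A and B.
counts = Counter({("diligent", "A"): 8, ("diligent", "B"): 2,
                  ("other", "A"): 42, ("other", "B"): 48})
score_a = pmi(counts, "diligent", "A")   # > 0: over-associated with A
score_b = pmi(counts, "diligent", "B")   # < 0: under-associated with B
```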
[NLP-125] Representation-Guided Parameter-Efficient LLM Unlearning ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在训练过程中可能记忆敏感或有害信息的问题,尤其是现有参数高效遗忘(parameter-efficient unlearning)方法在“遗忘-保留权衡”(forget-retain trade-off)上的局限性。其核心挑战在于:传统方法依赖参数重要性度量来识别仅与遗忘数据相关的参数,但受限于参数的超位置现象(superposition phenomenon),难以区分与遗忘集和保留集相关的参数。为此,作者提出 Representation-Guided Low-rank Unlearning (REGLU),其关键创新在于利用表示空间(representation space)的几何特性实现精准遗忘:首先设计一种基于表示引导的LoRA初始化策略,以定位最优的可遗忘子空间;其次引入正则化损失函数,约束LoRA更新后的输出位于保留集表示子空间的正交补空间中,从而最小化对保留任务性能的干扰。实验表明,REGLU在TOFU和WMDP基准上显著优于当前最优基线,在保持模型效用的同时实现了更高质量的遗忘效果。
链接: https://arxiv.org/abs/2604.17396
作者: Zeguan Xiao,Lang Mo,Yun Chen,Lei Yang,Jiehui Zhao,Lili Yang,Guanhua Chen
机构: Shanghai University of Finance and Economics (上海财经大学); Southern University of Science and Technology (南方科技大学); Deepexi Technology Co. Ltd (深挖科技有限公司)
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2026
Abstract:Large Language Models (LLMs) often memorize sensitive or harmful information, necessitating effective machine unlearning techniques. While existing parameter-efficient unlearning methods have shown promise, they still struggle with the forget-retain trade-off. This can be attributed to their reliance on parameter importance metrics to identify parameters that are important exclusively for the forget set, which is fundamentally limited by the superposition phenomenon. Due to the polysemantic nature of LLM parameters, such an importance metric may struggle to disentangle parameters associated with the forget and retain sets. In this work, we propose Representation-Guided Low-rank Unlearning (REGLU), a novel approach that leverages the geometric properties of representation spaces to achieve robust and precise unlearning. First, we develop a representation-guided initialization for LoRA that identifies the optimal subspace for selective forgetting. Second, we introduce a regularization loss that constrains the outputs of the LoRA update to lie in the orthogonal complement of the retain set’s representation subspace, thereby minimizing interference with the model’s performance on the retain set. We evaluate REGLU on the TOFU and WMDP benchmarks across multiple models. Our results demonstrate that REGLU consistently outperforms state-of-the-art baselines, achieving superior unlearning quality while maintaining higher model utility.
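The orthogonal-complement constraint can be made concrete: if U holds an orthonormal basis of the retain set's representation subspace, penalizing the projection of the LoRA update's output onto that subspace drives the update into the orthogonal complement. A minimal numpy sketch assuming such a basis is given; the paper's actual loss may be weighted or batched differently.

```python
import numpy as np

def orthogonal_reg_loss(delta_h, U):
    """Penalize the component of the LoRA output update `delta_h`
    that lies inside the retain subspace span(U), U orthonormal."""
    proj = U @ (U.T @ delta_h)     # projection onto retain subspace
    return float(np.sum(proj ** 2))

U = np.array([[1.0], [0.0], [0.0]])   # retain subspace = span(e1)
loss_inside = orthogonal_reg_loss(np.array([2.0, 0.0, 0.0]), U)  # penalized
loss_orth   = orthogonal_reg_loss(np.array([0.0, 3.0, 0.0]), U)  # free
```

Driving this term to zero keeps the unlearning update from interfering with retain-set representations, which is the stated goal of the regularizer.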
[NLP-126] Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains ACL2026
【速读】: 该论文旨在解决自动评估指标(automatic evaluation metrics)在领域迁移(domain shift)下鲁棒性不明确的问题,尤其是现有指标多基于WMT基准数据集训练,其在未见领域中的表现缺乏可靠评估。为消除人类标注噪声对领域效应的混淆,作者提出一个系统性的多标注者跨域错误片段标注数据集(Cross-Domain Error-Span-Annotation dataset, CD-ESA),固定每语言对内的标注者,并在同一批翻译系统上比较已见新闻领域与两个未见技术领域(包括化学领域)的表现。关键解决方案在于通过控制变量设计(固定标注者、统一翻译系统)并量化人类标注一致性(inter-annotator agreement),发现自动指标在段落级别看似鲁棒,但一旦考虑标注变异则性能显著下降;因此建议将指标与人类的一致性对比作为评估标准,而非仅依赖原始指标-人类相关性。
链接: https://arxiv.org/abs/2604.17393
作者: Finn Schmidt,Jan Philip Wahle,Terry Ruas,Bela Gipp
机构: University of Göttingen, Germany
类目: Computation and Language (cs.CL)
备注: Accepted at ACL2026 (Findings)
Abstract:Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain-shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78-0.83 vs. 0.96). We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
[NLP-127] AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models ACL2026
【速读】: 该论文旨在解决大语言模型在长期交互中依赖记忆系统时,现有方法(如A-Mem、Mem0)因过度依赖频繁重写和摘要化处理而导致关键上下文细节丢失与检索特征模糊的问题。其解决方案的关键在于提出AnchorMem框架,通过解耦检索单元与生成上下文:将交互历史中的原子事实提取为可检索锚点(anchor),同时保留原始上下文作为不可变信息;并构建关联事件图(associative event graph),利用高阶事件链接将相关事实整合为共享事件表示,从而强化跨记忆融合而不依赖通用实体作为桥梁。该设计实现了细粒度检索与交互上下文完整性的统一。
链接: https://arxiv.org/abs/2604.17377
作者: Zhanyu Shen,Sijie Cheng,Zhicheng Guo,Weiqin Wang,Yile Wang,Hui Huang
机构: Shenzhen University (深圳大学); RayNeo.AI; Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Findings
Abstract:While large language models have achieved remarkable performance in complex tasks, they still need a memory system to utilize historical experience in long-term interactions. Existing memory methods (e.g., A-Mem, Mem0) place excessive emphasis on organizing interactions by frequently rewriting them, however, this heavy reliance on summarization risks diluting essential contextual nuances and obscuring key retrieval features. To bridge this gap, we introduce AnchorMem, a novel memory framework inspired by the Proust Phenomenon in cognitive science, where a specific anchor triggers a holistic recollection. We propose a method that decouples the retrieval unit from the generation context. AnchorMem extracts atomic facts from interaction history to serve as retrieval anchors, while preserving the original context as the immutable context. To reveal implicit narrative cues, we construct an associative event graph that uses higher-order event links that bind sets of related facts into shared event representations, strengthening cross-memory integration without relying on generic entities as bridges. During retrieval, the system anchors queries to specific facts and events to locate relevant memories, but reconstructs the context using the associated raw chunks and events. Our method reconciles fine-grained retrieval with the contextual integrity of interactions. Experiments across three closed-source and open-source models on the LoCoMo benchmark demonstrate that AnchorMem significantly outperforms baselines. Code is available at this https URL.
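The decoupling of retrieval unit and generation context can be sketched with plain data structures: atomic facts act as the searchable anchors, while each anchor keeps a pointer to the raw, unrewritten interaction chunk that is handed back to the LLM. `extract_facts` and `match` below are hypothetical stand-ins for the paper's fact extractor and retriever.

```python
def build_memory(interactions, extract_facts):
    """Index each interaction chunk by its atomic facts (anchors),
    while keeping the original chunk as the immutable context."""
    memory = []
    for chunk in interactions:
        for fact in extract_facts(chunk):
            memory.append((fact, chunk))   # anchor -> immutable context
    return memory

def retrieve(memory, query, match):
    """Match the query against anchors, but return the raw chunks."""
    return [chunk for fact, chunk in memory if match(query, fact)]

mem = build_memory(
    ["User: I adopted a beagle named Max last spring."],
    lambda chunk: ["user has a dog named Max"],   # stand-in extractor
)
hits = retrieve(mem, "dog", lambda q, f: q in f)
# the raw chunk comes back, not a rewritten summary
```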
[NLP-128] ArgBench: Benchmarking LLM s on Computational Argumentation Tasks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在计算论证(computational argumentation)任务中缺乏标准化评估基准的问题。为应对这一挑战,作者构建了首个统一的基准测试集,整合了来自先前研究的33个数据集,并涵盖5类共46项计算论证任务,包括论点挖掘、观点评估、论点质量判断、论点推理和论点生成。解决方案的关键在于通过该基准对五大LLM家族进行系统性评估,深入分析少量示例(few-shot examples)、推理步骤、模型规模及训练技巧等因素对模型性能的影响,从而为LLM在论证相关任务中的能力提供可比较、可量化的实证依据。
链接: https://arxiv.org/abs/2604.17366
作者: Yamen Ajjour,Carlotta Quensel,Nedim Lipka,Henning Wachsmuth
机构: Leibniz University Hannover (汉诺威莱布尼茨大学); Adobe Research (Adobe 研究院); Leibniz University Hannover, L3S
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Argumentation skills are an essential toolkit for large language models (LLMs). These skills are crucial in various use cases, including self-reflection, debating collaboratively for diverse answers, and countering hate speech. In this paper, we create the first benchmark for a standardized evaluation of LLM-based approaches to computational argumentation, encompassing 33 datasets from previous work in unified form. Using the benchmark, we evaluate the generalizability of five LLM families across 46 computational argumentation tasks that cover mining arguments, assessing perspectives, assessing argument quality, reasoning about arguments, and generating arguments. On the benchmark, we conduct an extensive systematic analysis of the contribution of few-shot examples, reasoning steps, model size, and training skills to the performance of LLMs on the computational argumentation tasks in the benchmark.
[NLP-129] Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions ACL2026
【速读】: 该论文旨在解决当前语音语言模型(Spoken Language Models, SLMs)在多说话人场景中无法有效识别第三方干扰(Third-Party Interruptions, TPI)的问题,从而避免因忽略声学线索而导致的上下文失效。其核心解决方案是提出TPI-Train数据集与TPI-Bench评估框架:TPI-Train包含88K条实例,通过引入以说话人感知为特征的困难负样本(speaker-aware hard negatives),强制模型优先关注声学线索而非语义信息;TPI-Bench则提供一套全面的评估体系,用于精确衡量模型在欺骗性情境下的打断处理策略与说话人区分能力。实验表明,该设计能有效缓解语义捷径学习(semantic shortcut learning)问题,推动SLMs从文本主导的单模态依赖向更鲁棒的多说话人交互演进。
链接: https://arxiv.org/abs/2604.17358
作者: Dongwook Lee,Eunwoo Song,Che Hyun Lee,Heeseung Kim,Sungroh Yoon
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: ACL 2026 main conference
Abstract:While recent Spoken Language Models (SLMs) have been actively deployed in real-world scenarios, they lack the capability to discern Third-Party Interruptions (TPI) from the primary user’s ongoing flow, leaving them vulnerable to contextual failures. To bridge this gap, we introduce TPI-Train, a dataset of 88K instances designed with speaker-aware hard negatives to enforce acoustic cue prioritization for interruption handling, and TPI-Bench, a comprehensive evaluation framework designed to rigorously measure the interruption-handling strategy and precise speaker discrimination in deceptive contexts. Experiments demonstrate that our dataset design mitigates semantic shortcut learning-a critical pitfall where models exploit semantic context while neglecting acoustic signals essential for discerning speaker changes. We believe our work establishes a foundational resource for overcoming text-dominated unimodal reliance in SLMs, paving the way for more robust multi-party spoken interaction. The code for the framework is publicly available at this https URL
[NLP-130] More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage ACL2026
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在处理名词复合词的习语义时,因高保真视觉细节干扰而难以表征抽象意义的问题。其解决方案的关键在于引入DIVA基准,通过生成成对的、语义锚定的示意性可视化图像(即用符号化图标替代高保真图像),分别对应名词复合词的字面义与习语义,并提出Semantic Alignment Gap(Δ)这一架构无关的度量指标来量化两种语义在视觉上的接地差异,同时引入方向性符号偏差 b(t) 来单独衡量模型对字面义的偏好强度和方向。实验表明,模型规模无法消除字面优势偏见(Literal Superiority Bias),且更高视觉保真度反而削弱了符号对齐,揭示出提升组合理解需依赖视觉输入的图标化抽象及语义锚定机制。
链接: https://arxiv.org/abs/2604.17354
作者: Wei He
机构: University of Exeter (埃克塞特大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 4 figures. Accepted to the Main Conference of ACL 2026
Abstract:Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we introduce DIVA, a controlled benchmark that replaces high-fidelity visual detail with schematic iconicity by generating paired, sense-anchored visualizations for literal and idiomatic readings. We further propose Semantic Alignment Gap ( \Delta ), an architecture-agnostic metric that quantifies divergence between literal and idiomatic visual grounding. We additionally introduce a directional signed bias b(t) to separately measure the direction and strength of literal preference. Evaluating 8 recent VLMs, we reveal a consistent Literal Superiority Bias: model scale alone does not resolve literal preference, and increased visual fidelity is associated with weaker symbolic alignment, suggesting cognitive interference from hyper-realistic imagery. Our findings suggest that improving compositional understanding requires iconographic abstraction of visual input and anchoring interpretation and generation in intended meaning.
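The two metrics named in the abstract reduce to simple arithmetic over per-item scores. The sketch below assumes a scalar literal and idiomatic alignment score per noun compound, which is a simplification of however DIVA actually scores visual grounding; the numbers are hypothetical.

```python
def literal_superiority(scores):
    """scores: list of (literal, idiomatic) alignment scores, one pair
    per noun compound. Returns the mean Semantic Alignment Gap (delta)
    and the mean signed bias b (positive => literal preference)."""
    gaps = [abs(l - i) for l, i in scores]
    bias = [l - i for l, i in scores]
    n = len(scores)
    return sum(gaps) / n, sum(bias) / n

# Hypothetical per-compound scores; every pair favours the literal reading,
# so the signed bias comes out positive (a Literal Superiority Bias).
delta, b = literal_superiority([(0.9, 0.4), (0.7, 0.6), (0.8, 0.3)])
```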
[NLP-131] Logical Computational Linguistics
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)中因统计方法导致的不确定性问题,尤其是在生命关键场景下对语法和语义处理精度的严苛要求。其核心挑战在于统计计算语言学中依赖链的置信度随长度增加而单调衰减,难以保证推理的可靠性;为此,作者提出基于逻辑计算语言学的解决方案,关键在于构建一个逻辑语义接口,通过类型逻辑语法(Type Logical Grammar)实现任意长度的逻辑依赖链均保持100%置信度,从而支持高可靠性的语法与语义解析,满足对容错率极低的应用需求。
链接: https://arxiv.org/abs/2604.17346
作者: Glyn V. Morrill,Oriol Valentín
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:In this book we promote logical computational linguistics as opposed to statistical computational linguistics. In particular, we provide a logical semantic interface. This book assembles more than twenty years of research work on type logical grammar, and adds new ideas and material. Chains of statistical dependencies of less than one hundred per cent confidence tend monotonically to zero. Chains of logical dependencies of any length maintain one hundred per cent confidence end to end. We aspire to enable perfect syntactic and semantic processing in life-critical NLP applications.
[NLP-132] FLARE: Task-agnostic embedding model evaluation through a normalization process ACL2026
【速读】: 该论文旨在解决无标签(labelless)场景下嵌入模型(embedding model)选择困难的问题,尤其是在高维空间中现有基于核估计或高斯混合模型的评估方法因维度灾难导致排名不稳定。其解决方案的关键在于提出一种基于流模型的无标签嵌入评估方法(FLARE),通过归一化流(normalizing flows)直接从对数似然中估计信息充分性,避免了依赖距离的密度估计策略;同时理论分析表明,估计误差仅与数据流形的内在维度相关,而非原始嵌入维度,从而在高维嵌入(如 d≥3,584)下仍保持稳定性能,在11个数据集和8种嵌入器上的实验中达到Spearman相关系数 ρ=0.90 的优异表现。
链接: https://arxiv.org/abs/2604.17344
作者: Jingzhou Jiang,Yixuan Tang,Yi Yang,Kar Yan Tam
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to Findings of ACL 2026
Abstract:When task-specific labels are not available, it becomes difficult to select an embedding model for a specific target corpus. Existing labelless measures based on kernel estimators or Gaussian mixtures fail in high-dimensional space, resulting in unstable rankings. We propose a flow-based labelless representation embedding evaluation (FLARE), which utilizes normalizing flows to estimate information sufficiency directly from log-likelihood and avoids distance-based density estimation. We give a finite-sample bound, indicating that the estimation error depends on the intrinsic dimension of the data manifold rather than the original embedding dimension. On 11 datasets and 8 embedders, FLARE reached Spearman's \rho of 0.90 under the supervised benchmark and remained stable in high-dimensional embeddings ( d \geq 3,584 ) as the existing labelless baseline collapsed.
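The log-likelihood that a flow-based evaluator reads off comes from the change-of-variables rule. A one-layer affine flow with a standard-normal base makes the rule explicit; FLARE's flows are learned and multi-layer, so this sketch shows only the underlying identity, not the method itself.

```python
import numpy as np

def affine_flow_loglik(x, scale, shift):
    """log p(x) = log N(z; 0, 1) - log|scale|, with z = (x - shift)/scale.
    The -log|scale| term is the log|det| of the Jacobian of x -> z."""
    z = (x - shift) / scale
    log_base = -0.5 * (z ** 2 + np.log(2 * np.pi))
    return log_base - np.log(abs(scale))

ll0 = affine_flow_loglik(0.0, 1.0, 0.0)   # standard-normal log-density at 0
ll1 = affine_flow_loglik(2.0, 1.0, 2.0)   # same density under a shifted flow
```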
[NLP-133] Neuro-Symbolic Resolution of Recommendation Conflicts in Multimorbidity Clinical Guidelines AAAI
【速读】: 该论文旨在解决临床指南在多病共存(multimorbidity)情境下因碎片化、冗余和逻辑矛盾导致的可靠性危机,这些问题不仅使临床医生产生认知失调,还导致标准检索增强生成(Retrieval-Augmented Generation, RAG)系统脆弱且易产生幻觉。解决方案的关键在于提出一种神经符号(Neuro-Symbolic)框架,通过多智能体系统将非结构化的临床自然语言转化为严格的符号逻辑语言,并利用可满足性(Satisfiability, SAT)求解器进行验证;同时构建了逻辑规则交互的分层分类体系,识别出一类关键冲突——局部冲突(Local Conflict),即由共病交集引发的决策冲突,从而实现了对临床指南中结构性矛盾的自动化检测与修复,显著优于现有大语言模型(LLM)的检测能力(F1=0.861)。
链接: https://arxiv.org/abs/2604.17340
作者: Shiyao Xie,Jian Du
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence (Bridge Program on Logic AI: Logical and Symbolic Reasoning in Language Models)
Abstract:Clinical guidelines, typically developed by independent specialty societies, inherently exhibit substantial fragmentation, redundancy, and logical contradiction. These inconsistencies, particularly when applied to patients with multimorbidity, not only cause cognitive dissonance for clinicians but also introduce catastrophic noise into AI systems, rendering the standard Retrieval-Augmented Generation (RAG) system fragile and prone to hallucination. To address this fundamental reliability crisis, we introduce a Neuro-Symbolic framework that automates the detection of recommendation redundancies and conflicts. Our pipeline employs a multi-agent system to translate unstructured clinical natural language into rigorous symbolic logic language, which is then verified by a Satisfiability (SAT) solver. By formulating a hierarchical taxonomy of logical rule interactions, we identify a critical category termed Local Conflict - a decision conflict arising from the intersection of comorbidities. Evaluating our system on a curated benchmark of 12 authoritative SGLT2 inhibitor guidelines, we reveal that 90.6% of conflicts are Local, a structural complexity that single-disease guidelines fail to address. While state-of-the-art LLMs fail in detecting these conflicts, our neuro-symbolic approach achieves an F1 score of 0.861. This work demonstrates that logical verification must precede retrieval, establishing a new technical standard for automated knowledge coordination in medical AI.
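The verification step bottoms out in a satisfiability check over the symbolized recommendations. The sketch below encodes rules as CNF clauses (positive/negative integer literals) and brute-forces assignments in place of the SAT solver the paper uses; the one-variable example is a hypothetical prescribe/don't-prescribe conflict, not taken from the benchmark.

```python
from itertools import product

def rules_conflict(rules, n_vars):
    """Return True iff the CNF clauses are jointly unsatisfiable,
    i.e. the translated recommendations logically conflict."""
    for bits in product([False, True], repeat=n_vars):
        def val(lit):
            return bits[abs(lit) - 1] if lit > 0 else not bits[abs(lit) - 1]
        if all(any(val(l) for l in clause) for clause in rules):
            return False    # a satisfying assignment exists: no conflict
    return True

# var 1 = "prescribe SGLT2i": guideline A requires it, guideline B forbids it.
conflict = rules_conflict([[1], [-1]], 1)       # unsatisfiable
ok = rules_conflict([[1, 2], [-1]], 2)          # satisfiable via var 2
```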
[NLP-134] Precise Debugging Benchmark: Is Your Model Debugging or Regenerating? ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码调试任务中缺乏精确性的问题,即模型常生成正确但过度修改的修复方案,难以实现局部定位与针对性编辑。其解决方案的关键在于提出了一种名为Precise Debugging Benchmark (PDB) 的评估框架,该框架通过合成已验证的原子级错误并组合成多错误程序,自动将任意编码数据集转化为具备精度感知能力的调试基准;同时引入两个新指标——编辑层级精度(edit-level precision)和错误层级召回率(bug-level recall),以量化必要编辑数量和已修复错误的比例,从而更精准地衡量模型调试能力。实验表明,即使在明确指令下,前沿模型如GPT-5.1-Codex和DeepSeek-V3.2-Thinking仍表现出低于45%的编辑精度,且迭代与代理式调试策略未能显著提升性能,凸显了当前后训练流水线在代码调试任务中的局限性。
链接: https://arxiv.org/abs/2604.17338
作者: Wang Bill Zhu,Miaosen Chai,Shangshang Wang,Yejia Liu,Song Bian,Honghua Dong,Willie Neiswanger,Robin Jia
机构: University of Southern California (南加州大学); Microsoft (微软); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); University of Toronto (多伦多大学)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Findings
Abstract:Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
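The two metrics can be sketched as set arithmetic over edit and bug identifiers; PDB's exact definitions (how edits are matched to bug sites) may involve more bookkeeping, and the identifiers below are hypothetical.

```python
def edit_precision(model_edits, necessary_edits):
    """Fraction of the model's edited lines that were actually
    necessary to fix the seeded bugs (sets of line identifiers)."""
    if not model_edits:
        return 0.0
    return len(model_edits & necessary_edits) / len(model_edits)

def bug_recall(fixed_bugs, all_bugs):
    """Fraction of the seeded bugs that the patch resolves."""
    return len(fixed_bugs & all_bugs) / len(all_bugs)

# A patch touching 10 lines, of which only 3 hit real bug sites:
p = edit_precision(set(range(1, 11)), {3, 7, 9, 12})
r = bug_recall({"b1", "b2"}, {"b1", "b2", "b3"})
```

A model that passes the unit tests by regenerating the whole file can score high recall with very low edit-level precision, which is exactly the over-editing the benchmark is designed to expose.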
[NLP-135] Align Documents to Questions: Question-Oriented Document Rewriting for Retrieval-Augmented Generation ACL’26
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因 retrieved documents 风格不一致导致的模型偏倚问题,即大型语言模型(Large Language Models, LLMs)在面对混合来源的上下文时倾向于生成流畅但缺乏事实依据的内容,而非基于可靠检索证据的准确回答。解决方案的关键在于提出 QREAM——一种风格可控的重写模块,通过两个阶段实现:(1) QREAM-ICL 利用风格种子引导迭代式重写探索,使检索文档风格向问题导向型对齐;(2) QREAM-FT 作为轻量级学生模型,从去噪的 ICL 输出中蒸馏得到,并采用双准则拒绝采样(基于答案正确性和事实一致性)确保高质量监督信号。该方法可无缝嵌入现有 RAG 流水线,显著提升事实准确性,相对改进最高达 8%,且延迟开销极小。
链接: https://arxiv.org/abs/2604.17325
作者: Jiaang Li,Zhendong Mao,Quan Wang,Yuning Wan,Yongdong Zhang
机构: University of Science and Technology of China(中国科学技术大学); Beijing University of Posts and Telecommunications(北京邮电大学)
类目: Computation and Language (cs.CL)
备注: ACL’26 Findings
Abstract:Retrieval-Augmented Generation (RAG) enhances the factuality of Large Language Models (LLMs) by incorporating retrieved documents and/or generated context. However, LLMs often exhibit a stylistic bias when presented with mixed contexts, favoring fluent but hallucinated generated content over factually grounded yet disorganized retrieved evidence. This phenomenon reveals that the utility of retrieved information is bottlenecked by its presentation. To bridge this gap, we propose QREAM, a style-controlled rewriter that aligns retrieved documents with a question-oriented style while preserving facts, better for LLM readers to utilize. Our framework consists of two stages: (1) QREAM-ICL, which uses stylistic seeds to guide iterative rewriting exploration; and (2) QREAM-FT, a lightweight student model distilled from denoised ICL outputs. QREAM-FT employs dual-criteria rejection sampling, filtering based on answer correctness and factual consistency to ensure high-quality supervision. QREAM seamlessly integrates into existing RAG pipelines as a plug-and-play module. Experiments demonstrate that QREAM consistently enhances advanced RAG pipelines, yielding up to 8% relative improvement with negligible latency overhead, effectively balancing question relevance with factual grounding.
[NLP-136] A Universal Avoidance Method for Diverse Multi-branch Generation
【速读】: 该论文旨在解决当前生成式AI(Generative AI)模型在多分支多样性(multi-branch diversity)方面的不足,即生成结果缺乏人类水平的创造性,尤其是在需要多样化输出的任务中表现欠佳。现有方法通常依赖特定模型架构或带来显著计算开销,限制了其通用性和效率。论文提出一种模型无关且计算高效的生成策略UAG(Universal Avoidance Generation),其核心在于对已生成输出之间的相似性施加惩罚机制,从而引导模型产生更具多样性的结果。该方法可无缝集成于扩散模型和Transformer类模型中,在仅需极少额外计算资源的前提下显著提升多样性,实验表明其多样性最高提升1.9倍,速度提升4.4倍,FLOPs消耗仅为最优对比方法的1/64。
链接: https://arxiv.org/abs/2604.17323
作者: Kyeongman Park,Minha Jhang,Kyomin Jung
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Modern generative models still lack human-level creativity, particularly in multi-branch diversity. Prior approaches to address this problem often incur heavy computation or strong dependency on model architecture. Therefore, we introduce UAG(Universal Avoidance Generation), a model-agnostic and computationally efficient generation strategy that penalizes similarity among previously generated outputs. Thus, UAG can enhance multi-branch diversity across both diffusion and transformer models, with minimal additional computation. In experiments, our method achieves up to 1.9 times higher diversity, runs 4.4 times faster, and requires only 1/64 of the FLOPs compared to state-of-the-art methods. The full code is this https URL.
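The avoidance idea can be sketched as a re-scoring rule: each candidate's score is reduced by its maximum similarity to already-generated outputs, pushing the next branch away from earlier ones. The penalty form and the toy Jaccard similarity below are illustrative assumptions, not UAG's exact formulation.

```python
def pick_next_branch(candidates, scores, previous, sim, penalty=1.0):
    """Return the candidate maximizing score minus an avoidance
    penalty: its maximum similarity to previously generated outputs."""
    best, best_s = None, float("-inf")
    for c, s in zip(candidates, scores):
        pen = max((sim(c, p) for p in previous), default=0.0)
        if s - penalty * pen > best_s:
            best, best_s = c, s - penalty * pen
    return best

def jaccard(a, b):   # toy token-set similarity in [0, 1]
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

prev = ["a cat sat"]
pick = pick_next_branch(["a cat sat", "a dog ran"], [1.0, 0.9], prev, jaccard)
# the higher-scored duplicate is penalized away; "a dog ran" wins
```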
[NLP-137] E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition ACL2026
【速读】: 该论文旨在解决Grounded Multimodal Named Entity Recognition (GMNER)中因传统流水线式架构导致的误差累积和联合优化不足的问题。现有方法通常将文本实体识别与视觉定位分离处理,难以实现跨模态协同优化。解决方案的关键在于提出一个完全端到端的生成式框架 E2E-GMNER,其核心创新包括:(1)将 GMNER 任务建模为指令微调的条件生成任务,引入思维链(chain-of-thought)推理机制以动态判断何时依赖视觉证据或背景知识,从而减少对噪声线索的依赖;(2)设计高斯风险感知框扰动(Gaussian Risk-Aware Box Perturbation, GRBP),用概率扰动的软目标替代硬性边界框监督,提升对标注噪声和离散化误差的鲁棒性。实验表明,该方法在 Twitter-GMNER 和 Twitter-FMNERG 基准上取得与现有最优方法相当的竞争性能,验证了统一端到端优化与噪声感知监督的有效性。
链接: https://arxiv.org/abs/2604.17319
作者: Meng Zhang,Jinzhong Ning,Xiaolong Wu,Hongfei Lin,Yijia Zhang
机构: Dalian Maritime University (大连海事大学); Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to Findings of ACL 2026
Abstract:Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to a corresponding visual region in an associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition and visual grounding, leading to error accumulation and suboptimal joint optimization. In this paper, we propose E2E-GMNER, a fully end-to-end generative framework that unifies entity recognition, semantic typing, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. We formulate GMNER as an instruction-tuned conditional generation task and incorporate chain-of-thought reasoning to enable the model to adaptively determine when visual evidence or background knowledge is informative, reducing reliance on noisy cues. To further address the instability of generative bounding box prediction, we introduce Gaussian Risk-Aware Box Perturbation (GRBP), which replaces hard box supervision with probabilistically perturbed soft targets to improve robustness against annotation noise and discretization errors. Extensive experiments on the Twitter-GMNER and Twitter-FMNERG benchmarks demonstrate that E2E-GMNER achieves highly competitive performance compared with state-of-the-art methods, validating the effectiveness of unified end-to-end optimization and noise-aware grounding supervision. Code is available at: this https URL
[NLP-138] Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA ACL2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床场景中部署时的不确定性校准问题,特别是社会身份线索(如性取向和宗教信仰)如何扭曲模型的置信度信号与预测准确性,从而影响医疗决策的公平性和安全性。其关键发现在于:社会身份标记不仅导致模型性能下降,更破坏了不确定性校准机制,使得模型在面对特定群体患者时错误地表现出过高或过低的置信度,形成所谓的“校准危机”(calibration crisis),这可能危及患者安全并加剧医疗不平等。
链接: https://arxiv.org/abs/2604.17316
作者: Alberto Testoni,Iacer Calixto
机构: Amsterdam University Medical Center (阿姆斯特丹大学医学中心); University of Amsterdam (阿姆斯特丹大学); Amsterdam Public Health, Methodology (阿姆斯特丹公共卫生方法学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 (Main Conference)
Abstract:Safe clinical deployment of Large Language Models (LLMs) requires not only high accuracy but also robust uncertainty calibration to ensure models defer to clinicians when appropriate. Our paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, we demonstrate that identity markers cause a “calibration crisis”. “Homosexual” markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. Moreover, a clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. Our results demonstrate that the presence of social identity cues does not merely shift predictions; it affects the reliability of confidence signals, posing a significant risk to equitable care and safe deployment in confidence-based clinical workflows.
[NLP-139] Cat-DPO: Category-Adaptive Safety Alignment
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在人类偏好对齐过程中难以兼顾整体有用性与特定危害类别安全性的问题。现有基于偏好学习的安全对齐方法通常将安全约束简化为单一标量,导致模型在平均意义上看似安全,却在少数危害类别上仍存在显著风险。其解决方案的关键在于将安全对齐建模为每类危害独立的约束优化问题,并提出Cat-DPO算法——该算法为每个危害类别设置自适应的安全边际:当某类危害响应仍未收敛时,边际收紧以强化训练信号;一旦模型在该类别达到安全阈值,则边际放松,从而动态追踪每类危害的当前训练难度,而非采用全局统一的安全强度。此方法在多个LLM架构和基准上均提升了整体有用性和有害行为抑制能力,同时显著缩小了各类别间的性能方差与最优至最差差距。
链接: https://arxiv.org/abs/2604.17299
作者: Tiankai Yang,Yi Nian,Xinyuan Li,Ruiyao Xu,Kaize Ding,Yue Zhao
机构: University of Southern California (南加州大学); Northwestern University (西北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 6 figures
Abstract:Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but stays relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens when the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category’s current difficulty rather than averaging under one global rate. Across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness and compresses per-category safety variance and the best-to-worst gap, offering a drop-in per-category refinement of direct preference safety alignment.
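注:下面用一个极简示意(非原论文实现,`beta`、`target`、`lr` 等均为假设值)勾勒"按危害类别设置自适应安全边际"的思路:每个偏好对的 DPO 损失按其所属类别减去对应边际,边际则随该类别当前的不安全响应率收紧或放松。

```python
import numpy as np

def cat_dpo_loss(delta, category, margins, beta=0.1):
    """Margin-shifted DPO loss for one preference pair (a sketch).

    delta:    policy-vs-reference log-ratio gap (chosen minus rejected)
    category: harm-category index of this pair
    margins:  per-category safety margins (larger = stricter)
    """
    z = beta * delta - margins[category]
    return float(np.log1p(np.exp(-z)))  # -log(sigmoid(z))

def update_margins(margins, unsafe_rates, target=0.05, lr=0.5):
    """Adaptive rule (a sketch): tighten a category's margin while its
    unsafe-response rate exceeds the target, relax it once the model
    catches up on that category."""
    step = lr * (np.asarray(unsafe_rates) - target)
    return np.clip(np.asarray(margins) + step, 0.0, None)
```

边际越大,同样的偏好差距下损失越高,等价于对该类别施加更强的训练信号。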
[NLP-140] CRISP: Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning ACL2026
【速读】: 该论文旨在解决长链式思维(Chain-of-Thought, CoT)推理中因计算开销大和延迟高而导致的效率瓶颈问题。现有方法通过外部压缩器对CoT进行压缩,但往往无法与模型内部推理动态对齐,导致关键逻辑步骤丢失。其解决方案的关键在于提出一种基于内在显著性剪枝(Intrinsic Saliency Pruning)的框架CRISP,该框架利用模型自身注意力机制识别出推理终止标记(reasoning termination token)作为信息锚点,据此设计策略指导原子级压缩操作,从而在保持逻辑连贯性的前提下最大化信息密度。实验证明,CRISP可在不损失准确率的情况下实现50–60%的token数量减少,有效缓解长上下文推理的效率瓶颈。
链接: https://arxiv.org/abs/2604.17297
作者: Yangsong Lan,Hongliang Dai,Piji Li
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education (教育部脑机智能技术重点实验室)
类目: Computation and Language (cs.CL)
备注: Findings of the Association for Computational Linguistics: ACL 2026
Abstract:Long Chain-of-Thought (CoT) reasoning is pivotal for the success of recent reasoning models but suffers from high computational overhead and latency. While prior works attempt to compress CoT via external compressors, they often fail to align with the model’s internal reasoning dynamics, resulting in the loss of critical logical steps. This paper presents Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning (CRISP), a framework that compresses CoT by exploiting the model’s intrinsic saliency. Our analysis reveals a distinct phenomenon: the reasoning termination token acts as an information anchor, where its attention pattern effectively demarcates essential reasoning from redundancy. Based on this finding, we design a policy that utilizes these intrinsic attention signals to guide atomic compression operations. In contrast to coarse-grained pruning strategies, CRISP strategically distills the reasoning chain to maximize information density while preserving logical coherence. Empirical results across various backbone models and mathematical datasets demonstrate that CRISP achieves a 50-60% reduction in token count without compromising accuracy, effectively mitigating the efficiency bottleneck of long-context reasoning. We open-source our implementation to facilitate further research in efficient reasoning.
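注:以下为摘要所述"利用终止标记注意力进行内在显著性剪枝"的示意性写法(非原论文代码,`keep_ratio` 等参数为笔者假设):按终止标记对各推理 token 的注意力排序,保留最显著的部分并维持原有顺序。

```python
import numpy as np

def prune_cot(tokens, term_attn, keep_ratio=0.5):
    """Keep the chain-of-thought tokens that receive the most attention
    from the reasoning-termination token, preserving original order.

    tokens:    list of CoT tokens
    term_attn: attention weights from the termination token to each token
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(term_attn)[-k:])  # top-k salient, in order
    return [tokens[i] for i in keep]
```

真实系统中的压缩操作按摘要描述是原子级且带策略引导的,此处仅保留"注意力即显著性"这一核心信号。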
[NLP-141] Beyond “I Don’t Know”: Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对不确定输入时缺乏细粒度不确定性归因的问题,即无法有效区分数据不确定性(data uncertainty,源于输入模糊或信息不足)与模型不确定性(model uncertainty,源于模型能力局限),从而影响下游决策如请求澄清或调用外部工具。解决方案的关键在于提出一种轻量级的数据合成与强化学习策略,通过在训练中显式建模两种不确定性类型,提升模型对不确定性的识别精度,同时保持高答案准确性。实验表明,该方法在Qwen3系列模型上显著改善了不确定性归因能力,且不牺牲原有性能。
链接: https://arxiv.org/abs/2604.17293
作者: Jingyi Ren,Ante Wang,Yunghwei Lai,Xiaolong Wang,Linlu Gong,Weitao Li,Weizhi Ma,Yang Liu
机构: Tsinghua University (清华大学); Institute for AI Industry Research (AIR), Tsinghua University (清华大学人工智能产业研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reliable Large Language Models (LLMs) should abstain when confidence is insufficient. However, prior studies often treat refusal as a generic “I don’t know”, failing to distinguish input-level ambiguity (data uncertainty) from capability limitations (model uncertainty). This lack of distinction limits downstream action decisions like requesting clarification or invoking external tools. In this work, we introduce UA-Bench, a benchmark of over 3,500 questions drawn from six datasets spanning knowledge-intensive and reasoning-intensive tasks, designed to evaluate explicit uncertainty attribution. An evaluation of 18 frontier LLMs shows that even state-of-the-art models struggle to reliably discriminate between data uncertainty and model uncertainty, and that high answer accuracy does not necessarily imply strong uncertainty attribution ability. To narrow this gap, we propose a lightweight data synthesis and reinforcement learning strategy. Experiments on both Qwen3-4B-Instruct-2507 and Qwen3-8B in thinking mode show that the proposed method improves uncertainty attribution while preserving answer accuracy. Our code and data are publicly available now.
[NLP-142] Probabilistic Programs of Thought
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成和数学推理任务中,因需多次采样以获取有效结构化输出而导致的GPU计算资源消耗过高的问题。传统方法通过反复调用LLM生成多个候选程序,但随着采样数量n增加,计算成本呈线性增长且难以扩展。解决方案的关键在于提出一种称为“概率性思维程序”(probabilistic programs of thought)的新测试时框架:该框架利用模型生成的单个程序及其对应的下一个词概率分布,构建一个紧凑的概率程序(probabilistic program),该程序可编码指数级数量的确定性程序;随后通过在该概率程序上执行轻量级的概率推理而非重新调用LLM,即可低成本地采样新程序,从而显著减少GPU计算开销并提升效率。
链接: https://arxiv.org/abs/2604.17290
作者: Poorva Garg,Renato Lui Geh,Daniel Israel,Todd Millstein,Kyle Richardson,Guy Van den Broeck
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 26 pages
Abstract:LLMs are widely used for code generation and mathematical reasoning tasks where they are required to generate structured output. They either need to reason about code, generate code for a given specification, or reason using programs of thought. The typical approach to code generation is to prompt the model and generate samples until an appropriate program is obtained. Within this process, sampling n programs from the language model requires n GPU compute-intensive generations, which becomes prohibitively expensive for larger values of n. In this work, we address this limitation by exposing the LLM’s distribution within the generated programs themselves. We propose a novel test-time framework we dub probabilistic programs of thought to obtain more samples from the model with fewer LLM generations. Given a program generated by a model and the associated next-token probabilities, we build a probabilistic program that compactly represents exponentially many deterministic programs. Since performing probabilistic reasoning in this probabilistic program is much cheaper, our approach allows sampling new programs without any additional GPU compute and little CPU overhead. We instantiate our approach on benchmarks for code generation, code understanding and mathematical reasoning and report improvements in performance with fewer generations from the LLM.
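注:下述代码是对"用单次生成记录的 next-token 分布低成本采样更多程序"这一思想的极简示意(非原论文实现,占位符格式与函数名均为笔者假设):在已生成程序的若干位置保存候选 token 分布,之后仅用 CPU 采样即可得到新程序,无需再次调用 LLM。

```python
import random

def sample_programs(template, site_dists, n, seed=0):
    """Turn one LLM generation into many programs (a sketch).

    template:   the generated program with placeholder sites, e.g. "<OP>"
    site_dists: maps each site to the model's recorded token distribution
    n:          number of programs to sample (CPU-only, no new LLM calls)
    """
    rng = random.Random(seed)
    programs = []
    for _ in range(n):
        prog = template
        for site, dist in site_dists.items():
            toks = list(dist)
            pick = rng.choices(toks, weights=[dist[t] for t in toks])[0]
            prog = prog.replace(site, pick)
        programs.append(prog)
    return programs
```

原论文构建的是可做概率推理的完整概率程序,这里仅展示"分布随程序一起保存、采样与生成解耦"的要点。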
[NLP-143] HorizonBench: Long-Horizon Personalization with Evolving Preferences
【速读】: 该论文旨在解决长期个性化(long-horizon personalization)问题,即如何在用户与系统持续交互数月的过程中准确追踪其偏好变化,尤其是识别并响应由后续生活事件引发的偏好迁移。现有研究受限于数据稀缺性和测量不足,缺乏既包含自然交互又具备真实偏好转折来源标注(ground-truth provenance)的资源。为此,作者提出一种基于结构化心理状态图(structured mental state graph)的数据生成器,可合成具有明确偏好转折因果关系的对话数据,并据此构建 HorizonBench 基准测试集,包含 4,245 条来自 360 名模拟用户的 6 个月对话记录(平均约 4,300 轮对话和 163K tokens)。该基准为长上下文建模、记忆增强架构、心智理论推理及用户建模提供了标准化评估平台。实验表明,当前前沿模型表现远低于人类水平(最佳仅达 52.8%),多数模型甚至无法超越随机猜测基线(20%),且错误多表现为未能更新用户状态,仍选择初始偏好值,揭示了状态跟踪能力是实现长期个性化的核心瓶颈。
链接: https://arxiv.org/abs/2604.17283
作者: Shuyue Stella Li,Bhargavi Paranjape,Kerem Oktar,Zhongyao Ma,Gelin Zhou,Lin Guan,Na Zhang,Sem Park,Lin Chen,Diyi Yang,Yulia Tsvetkov,Asli Celikyilmaz
机构: University of Washington (华盛顿大学); Meta (Meta); OpenAI (OpenAI); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures, 8 tables
Abstract:User preferences evolve across months of interaction, and tracking them requires inferring when a stated preference has been changed by a subsequent life event. We define this problem as long-horizon personalization and observe that progress on it is limited by data availability and measurement, with no existing resource providing both naturalistic long-horizon interactions and the ground-truth provenance needed to diagnose why models fail. We introduce a data generator that produces conversations from a structured mental state graph, yielding ground-truth provenance for every preference change across 6-month timelines, and from it construct HorizonBench, a benchmark of 4,245 items from 360 simulated users with 6-month conversation histories averaging ~4,300 turns and ~163K tokens. HorizonBench provides a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling. Across 25 frontier models, the best model reaches 52.8% and most score at or below the 20% chance baseline. When these models err on evolved preferences, over a third of the time they select the user’s originally stated value without tracking the updated user state. This belief-update failure persists across context lengths and expression explicitness levels, identifying state-tracking capability as the primary bottleneck for long-horizon personalization.
[NLP-144] MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
【速读】: 该论文旨在解决当前过程级奖励模型(Process-Level Reward Models, PRMs)在医疗领域缺乏可靠评估框架的问题,尤其针对临床推理中安全敏感性高、知识密集性强及错误模式多样等特性。现有PRM基准主要覆盖数学等通用领域,无法有效量化大语言模型在医疗场景下的错误识别能力,导致其在真实医疗应用中的安全性难以验证。解决方案的关键在于提出MedPRMBench——首个面向医疗领域的过程级奖励模型基准,通过基于临床推理蓝图(Clinical Reasoning Blueprints, CRBs)的三阶段数据生成管道,从7个医学问答来源系统构建高质量评估数据集,涵盖14种细粒度错误类型(分为简洁性、合理性与敏感性三类),并引入首个四级严重程度分级体系以量化临床影响。该基准包含6,500个问题、13,000条推理链和113,910个步骤级标签,以及6,879个用于训练的问题,所提出的医疗PRM基线模型获得87.1%的整体PRM得分,显著优于所有对比方法,并作为即插即用的验证器提升了下游医学问答准确率3.2–6.7个百分点,揭示了当前模型在医疗推理错误检测方面的关键短板,为未来PRM改进提供了明确方向。
链接: https://arxiv.org/abs/2604.17282
作者: Lingyan Wu,Xiang Zheng,Weiqi Zhai,Wei Wang,Xuan Ren,Zifan Zhang,Hu Wei,Bing Zhao
机构: Alibaba Group(阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Process-Level Reward Models (PRMs) are essential for guiding complex reasoning in large language models, yet existing PRM benchmarks cover only general domains such as mathematics, failing to address medical reasoning – which is uniquely characterized by safety criticality, knowledge intensity, and diverse error patterns. Without a reliable medical PRM evaluation framework, we cannot quantify models’ error detection capabilities in clinical reasoning, leaving their safety in real-world healthcare applications unverified. We propose MedPRMBench, the first process-level reward model benchmark for the medical domain. Built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs), MedPRMBench systematically generates high-quality evaluation data from seven medical QA sources, covering 14 fine-grained error types across three categories (Simplicity, Soundness, and Sensitivity) with the first 4-level severity grading system to quantify clinical impact. The benchmark comprises 6,500 questions with 13,000 reasoning chains and 113,910 step-level labels, plus 6,879 questions for training. Our medical PRM baseline achieves an 87.1% overall PRMScore – substantially surpassing all baselines – and serves as a plug-and-play verifier that improves downstream medical QA accuracy by 3.2–6.7 percentage points. Systematic evaluation spanning proprietary frontier models, open-source reasoning models, and medical-specialized models reveals critical weaknesses in current models’ medical reasoning error detection capabilities, providing clear directions for future PRM improvement.
[NLP-145] HopRank: Self-Supervised LLM Preference-Tuning on Graphs for Few-Shot Node Classification
【速读】: 该论文旨在解决文本属性图(Text-attributed Graph, TAG)上节点分类任务中现有图神经网络(GNN)方法因浅层文本编码和对标注数据依赖性强而导致的标签稀缺场景下性能受限的问题,以及现有大语言模型(Large Language Models, LLMs)与图结构结合方法在训练时仍需大量标签或未能有效利用图拓扑结构信息的局限性。其解决方案的关键在于:基于同质性原则(homophily principle),将节点分类重构为链接预测任务,并提出一种完全自监督的LLM调优框架HopRank——通过分层跳数采样构建偏好数据,采用自适应偏好学习机制优先选择高信息量的训练信号,且在推理阶段利用带自适应早停投票机制的连接偏好预测实现高效分类,从而在无需任何标签的情况下显著优于先前的图-LLM方法。
链接: https://arxiv.org/abs/2604.17271
作者: Ziqing Wang,Kaize Ding
机构: Northwestern University (西北大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Node classification on text-attributed graphs (TAGs) is a fundamental task with broad applications in citation analysis, social networks, and recommendation systems. Current GNN-based approaches suffer from shallow text encoding and heavy dependence on labeled data, limiting their effectiveness in label-scarce settings. While large language models (LLMs) naturally address the text understanding gap with deep semantic reasoning, existing LLM-for-graph methods either still require abundant labels during training or fail to exploit the rich structural signals freely available in graph topology. Our key observation is that, in many real-world TAGs, edges predominantly connect similar nodes under the homophily principle, meaning graph topology inherently encodes class structure without any labels. Building on this insight, we reformulate node classification as a link prediction task and present HopRank, a fully self-supervised LLM-tuning framework for TAGs. HopRank constructs preference data via hierarchical hop-based sampling and employs adaptive preference learning to prioritize informative training signals without any class labels. At inference, nodes are classified by predicting their connection preferences to labeled anchors, with an adaptive early-exit voting scheme to improve efficiency. Experiments on three TAG benchmarks show that HopRank matches fully-supervised GNNs and substantially outperforms prior graph-LLM methods, despite using zero labeled training data.
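注:以下为推理阶段"锚点连接偏好投票 + 自适应提前退出"的示意性实现(非原论文代码,`link_score` 为任意链接预测打分函数,`exit_margin` 为假设参数):

```python
def classify_by_anchors(link_score, node, anchors, exit_margin=3.0):
    """Predict a node's class by accumulating connection-preference
    scores toward labeled anchor nodes, exiting early once the leading
    class is ahead of the runner-up by `exit_margin` (a sketch of the
    adaptive early-exit voting scheme)."""
    votes = {}
    for anchor, label in anchors:
        votes[label] = votes.get(label, 0.0) + link_score(node, anchor)
        ranked = sorted(votes.values(), reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] >= exit_margin:
            break  # confident enough: skip the remaining anchors
    return max(votes, key=votes.get)
```

实际系统中 `link_score` 由偏好调优后的 LLM 对"节点-锚点是否相连"的偏好给出。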
[NLP-146] Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation ACL2026
【速读】: 该论文旨在解决传统会议有效性评估方法依赖事后问卷调查、仅提供整体粗粒度评分,无法捕捉协作讨论动态性的问题。其关键解决方案是提出一种以新型评价标准和时间细粒度分析为核心的全新评估范式,将会议有效性定义为随时间变化的目标达成率,并针对会议中的各个主题片段进行独立评估;同时构建了AMI-ME数据集(包含2,459个人工标注的会议片段)和基于大语言模型(Large Language Model, LLM)作为裁判的自动评估框架,实现从原始语音到端到端的有效性评分,从而提升评估的可扩展性、成本效益与可复现性。
链接: https://arxiv.org/abs/2604.17260
作者: Yihang Li,Chenhui Chu
机构: Kyoto University (京都大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Main Conference
Abstract:Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. The reliance on manual assessment is inherently limited in scalability, cost, and reproducibility. Moreover, a single score fails to capture the dynamic nature of collaborative discussions. We propose a new paradigm for evaluating meeting effectiveness centered on novel criteria and temporal fine-grained approach. We define effectiveness as the rate of objective achievement over time and assess it for individual topical segments within a meeting. To support this task, we introduce the AMI Meeting Effectiveness (AMI-ME) dataset, a new meta-evaluation dataset containing 2,459 human-annotated segments from 130 AMI Corpus meetings. We also develop an automatic effectiveness evaluation framework that uses a Large Language Model (LLM) as a judge to score each segment’s effectiveness relative to the overall meeting objectives. Through substantial experiments, we establish a comprehensive benchmark for this new task and evaluate the framework’s generalizability across distinct meeting types, ranging from business scenarios to unstructured discussions. Furthermore, we benchmark end-to-end performance starting from raw speech to measure the capabilities of a complete system. Our results validate the framework’s effectiveness and provide strong baselines to facilitate future research in meeting analysis and multi-party dialogue. Our dataset and code will be publicly available. The AMI-ME dataset and the Automatic Evaluation Framework are available at: this URL.
[NLP-147] REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning ACL2026
【速读】: 该论文旨在解决文本嵌入模型在通过对比预微调(contrastive pre-finetuning, PFT)适应专业化领域时,因使用散乱且异构的任务数据而导致任务诱导偏差(task-induced bias)的问题。这种偏差会引发表示空间的不可控偏移,破坏预训练嵌入的几何结构并显著降低性能。解决方案的关键在于提出REZE框架,其核心机制是通过对锚点-正样本对的关系进行特征空间分解,测量每个特征分量上的任务间离散度以识别任务变异方向,并采用自适应软收缩策略抑制任务相关噪声,同时保留任务无关的语义结构,从而实现无推理开销的表示空间稳定控制。
链接: https://arxiv.org/abs/2604.17257
作者: Seungmin Lee,Jeonghwan Lee,Hyunkuk Lim,Sejoon Kim,Mingi Sung
机构: PwC Korea GenAI Team (普华永道韩国生成式AI团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 Main
Abstract:Recent text embedding models are often adapted to specialized domains via contrastive pre-finetuning (PFT) on a naive collection of scattered, heterogeneous tasks. However, this approach often introduces task-induced bias alongside domain knowledge, leading to uncontrolled representation shifts that distort the pretrained embedding geometry and cause substantial performance degradation. To address this issue, we propose REZE, a representation regularization framework that explicitly controls representation shift during embedding pre-finetuning. REZE operates on the relations of anchor-positive pairs and decomposes them in an eigenspace. It then measures task-wise dispersion along each eigencomponent to identify task-variant directions and applies adaptive soft-shrinkage to suppress task-induced noise while preserving task-invariant semantic structure, without inference-time overhead. Experiments across multiple embedding backbones and specialized benchmarks show that REZE outperforms standard pre-finetuning and isotropy-oriented post-hoc regularization in most settings, remaining stable where existing PFT variants collapse. Embedding space analyses further confirm that REZE induces controlled shifts aligned with the original embedding manifold, underscoring representation shift control as a key principle for robust embedding pre-finetuning under heterogeneous supervision.
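注:下面以一个极简函数示意"自适应软收缩"步骤(非原论文实现,阈值 `tau` 为笔者假设):对特征空间分解后的各分量,按其任务间离散度进行软阈值收缩,离散度高的任务相关方向被抑制,离散度低的任务无关语义方向几乎原样保留。

```python
import numpy as np

def adaptive_soft_shrink(coeffs, dispersion, tau=0.1):
    """Soft-shrink eigencomponent coefficients in proportion to their
    task-wise dispersion (a sketch): high-dispersion (task-variant)
    directions are suppressed, low-dispersion (task-invariant) ones
    pass through nearly unchanged."""
    thresh = np.maximum(np.asarray(dispersion) - tau, 0.0)
    c = np.asarray(coeffs, dtype=float)
    return np.sign(c) * np.maximum(np.abs(c) - thresh, 0.0)
```

该收缩只作用于训练阶段的表示偏移控制,推理时无需任何额外计算,与摘要所述"无推理开销"一致。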
[NLP-148] Are Emotion and Rhetoric Neurons in LLM ? Neuron Recognition and Adaptive Masking for Emotion-Rhetoric Prediction Steering ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在情感与修辞表达上缺乏细粒度控制的问题,现有研究多依赖外部优化方法,未能深入探索内部神经元表征机制,且对修辞神经元及其与情感神经元的内在关联研究不足,同时传统神经元掩蔽方法存在反直觉现象,难以实现可靠的因果验证。解决方案的关键在于提出一个融合多维筛选的神经元识别框架,并设计一种包含动态过滤、衰减掩蔽和反馈优化的自适应掩蔽方法,从而实现对神经元功能的可靠因果验证,并通过修辞神经元实现非目标句的定向诱导及情感任务性能提升,为LLMs中情感与修辞表达的细粒度调控提供了新范式。
链接: https://arxiv.org/abs/2604.17255
作者: Li Zheng,Xin Zhang,Shuyi He,Fei Li,Chong Teng,Jiangming Yang,Donghong Ji,Zhuang Li
机构: Wuhan University (武汉大学); Ant International (蚂蚁国际); RMIT University (皇家墨尔本理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026
Abstract:Accurate comprehension and controllable generation of emotion and rhetoric are pivotal for enhancing the reasoning capabilities of large language models (LLMs). Existing studies mostly rely on external optimizations, lacking in-depth exploration of internal representation mechanisms, thus failing to achieve fine-grained steering at the neuron level. A handful of works on neurons are confined to emotions, neglecting rhetoric neurons and their intrinsic connections. Traditional neuron masking also exhibits counterintuitive phenomena, making reliable verification of neuron functionality infeasible. To address these issues, we systematically investigate the neuron representation mechanisms and inherent associations of 6 emotion categories and 4 core rhetorical devices. We propose a neuron identification framework that integrates multi-dimensional screening, and design an adaptive masking method incorporating dynamic filtering, attenuation masking, and feedback optimization, enabling reliable causal validation of neuron functionality. Through neuron regulation, we achieve directed induction of non-target sentences and enhancement of emotion tasks via rhetoric neurons. Experiments on 5 commonly used datasets validate the effectiveness of our method, providing a novel paradigm for the fine-grained steering of emotion and rhetoric expressions in LLMs.
[NLP-149] Seeing Isn’t Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents ACL2026
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的具身智能体在执行复杂任务时因忽视环境反馈而导致决策次优和动作无效的问题,其核心症结在于“信念惯性”(belief inertia)——即智能体固守先前信念而未能根据实际观测进行调整。解决方案的关键在于提出一种统一的主动信念干预机制(Estimate-Verify-Update, EVU),该机制通过显式预测预期结果、利用推理验证观测与预测的一致性,并基于验证证据主动更新先验信念,从而有效缓解信念惯性问题。EVU可嵌入提示工程和训练驱动的推理方法中,实验表明其在多个具身基准测试中显著提升任务成功率。
链接: https://arxiv.org/abs/2604.17252
作者: Hanlin Wang,Chak Tou Leong,Jian Wang,Wenjie Li
机构: The Hong Kong Polytechnic University(香港理工大学); Sichuan University(四川大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted by ACL 2026 Findings
Abstract:Recent advancements in large language models (LLMs) have enabled agents to tackle complex embodied tasks through environmental interaction. However, these agents still make suboptimal decisions and perform ineffective actions, as they often overlook critical environmental feedback that differs from their internal beliefs. Through a formal probing analysis, we characterize this as belief inertia, a phenomenon where agents stubbornly adhere to prior beliefs despite explicit observations. To address this, we advocate active belief intervention, moving from passive understanding to active management. We introduce the Estimate-Verify-Update (EVU) mechanism, which empowers agents to predict expected outcomes, verify them against observations through explicit reasoning, and actively update prior beliefs based on the verification evidence. EVU is designed as a unified intervention mechanism that generates textual belief states explicitly, and can be integrated into both prompting-based and training-based agent reasoning methods. Extensive experiments across three embodied benchmarks demonstrate that EVU consistently yields substantial gains in task success rates. Further analyses validate that our approach effectively mitigates belief inertia, advancing the development of more robust embodied agents. Our code is available at this https URL.
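注:以下为 Estimate-Verify-Update 机制的示意性骨架(非原论文代码,四个回调函数均为抽象占位):先预测所选动作的期望结果,执行后将观测与预测对照,仅在不一致时主动更新先验信念,从而打破信念惯性。

```python
def evu_step(estimate, act, verify, update, belief):
    """One Estimate-Verify-Update step (a sketch of the EVU mechanism).

    estimate(belief) -> (expected outcome, action to take)
    act(action)      -> observation from the environment
    verify(exp, obs) -> True iff the observation matches the prediction
    update(belief, obs) -> revised belief state
    """
    expected, action = estimate(belief)
    obs = act(action)
    if verify(expected, obs):
        return belief              # prediction confirmed: keep prior belief
    return update(belief, obs)     # mismatch: actively revise the belief
```

原文中信念以文本形式显式生成并可嵌入提示式或训练式推理流程,此处以字典状态作简化。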
[NLP-150] DORA Explorer: Improving the Exploration Ability of LLMs Without Training
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在序列决策任务中因输出多样性不足而导致的探索能力弱的问题,这会引发收敛到次优解或陷入循环等现象,尤其在需要主动探索以获取信息的环境中表现不佳。现有采样方法如温度缩放仅能在词元层面引入随机性,难以实现序列级别的充分多样性。为应对这一挑战,作者提出了一种无需训练的解决方案——DORA Explorer(Diversity-Oriented Ranking of Actions),其核心在于生成多样化的动作候选集,利用词元对数概率对候选动作进行评分,并通过可调的探索参数选择最优动作,从而在多臂老虎机(Multi-Armed Bandit, MAB)和文本冒险学习环境套件(Text Adventure Learning Environment Suite, TALES)中显著提升探索效率与性能,例如在TextWorld中将Qwen2.5-7B的性能从29.2%提升至45.5%。
链接: https://arxiv.org/abs/2604.17244
作者: Priya Gurjar,Md Farhan Ishmam,Kenneth Marino
机构: Kahlert School of Computing, University of Utah (犹他大学卡勒特计算机学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 3 Figures, 10 tables
Abstract:Despite the rapid progress, LLMs for sequential decision-making (i.e., LLM agents) still struggle to produce diverse outputs. This leads to insufficient exploration, convergence to sub-optimal solutions, and becoming stuck in loops. Such limitations can be problematic in environments that require active exploration to gather information and make decisions. Sampling methods such as temperature scaling introduce token-level randomness but fail to produce enough diversity at the sequence level. We analyze LLM exploration in the classic Multi-Armed Bandit (MAB) setting and the Text Adventure Learning Environment Suite (TALES). We find that current decoding strategies and prompting methods like Chain-of-Thought and Tree-of-Thought are insufficient for robust exploration. To address this, we introduce DORA Explorer (Diversity-Oriented Ranking of Actions), a training-free framework for improving exploration in LLM agents. DORA generates diverse action candidates, scores them using token log-probabilities, and selects actions using a tunable exploration parameter. DORA achieves UCB-competitive performance on MAB and consistent gains across TALES, e.g., improving Qwen2.5-7B’s performance from 29.2% to 45.5% in TextWorld. Our project is available at: this https URL.
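注:以下为 DORA 选择步骤的示意性实现(非原论文代码,温度化采样的具体形式为笔者假设):对多样化候选动作按长度归一化的 token 对数概率打分,再以可调探索参数 `explore` 采样;`explore=0` 退化为贪婪选择,值越大选择分布越平坦。

```python
import math
import random

def dora_select(candidates, explore=1.0, seed=0):
    """Rank diverse action candidates by length-normalized token
    log-probability and pick one with a tunable exploration temperature
    (a sketch of Diversity-Oriented Ranking of Actions).

    candidates: list of (action, token_logprobs) pairs
    """
    scores = [sum(lps) / len(lps) for _, lps in candidates]
    if explore == 0:
        return candidates[scores.index(max(scores))][0]  # greedy
    m = max(scores)
    weights = [math.exp((s - m) / explore) for s in scores]
    rng = random.Random(seed)
    return rng.choices([a for a, _ in candidates], weights=weights)[0]
```

候选动作本身由 LLM 先行多样化生成,此处仅示意"打分 + 可调探索"的决策环节。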
[NLP-151] A Multi-Agent Approach for Claim Verification from Tabular Data Documents
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的声明验证(Claim Verification)任务中普遍存在的问题:现有方法要么依赖复杂的预训练或微调流程,要么将验证过程分解为多个子任务,导致解释性不足且泛化能力有限。其解决方案的关键在于提出一种多智能体框架(Multi-Agentic framework for Claim verification, MACE),由三个专业化代理组成——规划器(Planner)、执行器(Executor)和验证器(Verifier),每个代理均采用零样本链式思维(zero-shot Chain-of-Thought)策略完成各自职责,从而在不依赖复杂微调的前提下实现可解释的推理轨迹:规划器生成显式的推理策略,执行器提供详细的计算步骤,验证器则对逻辑进行校验。该设计在保持高性能的同时显著降低参数规模(27–92B vs. 235B),并实现了SOTA性能或接近最优表现,兼具效率与透明性。
链接: https://arxiv.org/abs/2604.17225
作者: Rudra Ranajee Saha,Laks V. S. Lakshmanan,Raymond T. Ng
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present a novel approach for claim verification from tabular data documents. Recent LLM-based approaches either employ complex pretraining/fine-tuning or decompose verification into subtasks, often lacking comprehensive explanations and generalizability. To address these limitations, we propose a Multi-Agentic framework for Claim verification (MACE) consisting of three specialized agents: Planner, Executor, and Verifier. Instead of elaborate finetuning, each agent employs a zero-shot Chain-of-Thought setup to perform its tasks. MACE produces interpretable verification traces, with the Planner generating explicit reasoning strategies, the Executor providing detailed computation steps, and the Verifier validating the logic. Experiments demonstrate that MACE achieves state-of-the-art (SOTA) performance on two datasets and performs on par with the best models on two others, while achieving 80–100% of best performance with substantially smaller models: 27–92B parameters versus 235B. This combination of competitive performance, memory efficiency, and transparent reasoning highlights our framework’s effectiveness.
[NLP-152] Demystifying the unreasonable effectiveness of online alignment methods
【速读】: 该论文试图解决现有迭代对齐方法(如在线RLHF和在线DPO)在理论上与实际表现之间存在的差距问题,即当前基于KL正则化的累计遗憾(regret)理论分析给出的O(logT)上界显得过于保守,无法解释其在实践中表现出的高效性。解决方案的关键在于引入一种决策导向的 regret 评估标准——温度为零的 regret,该准则仅衡量推理时最优响应的表现,从而将学习过程中的统计成本与由软化训练策略引入的探索随机性区分开来。在此新框架下,作者证明了标准贪婪在线对齐方法可实现常数级(O(1))累积遗憾,提供了更精确的理论解释,揭示了贪婪对齐方法在识别最佳响应方面的高效性。
链接: https://arxiv.org/abs/2604.17207
作者: Enoch Hyunwook Kang
机构: University of Washington (华盛顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
备注:
Abstract:Iterative alignment methods based on purely greedy updates are remarkably effective in practice, yet existing theoretical guarantees of O(log T) KL-regularized regret can seem pessimistic relative to their empirical performance. In this paper, we argue that this mismatch arises from the regret criterion itself: KL-regularized regret conflates the statistical cost of learning with the exploratory randomization induced by the softened training policy. To separate these effects, we study the traditional temperature-zero regret criterion, which evaluates only the top-ranked response at inference time. Under this decision-centric notion of performance, we prove that standard greedy online alignment methods, including online RLHF and online DPO, achieve constant O(1) cumulative regret. By isolating the cost of identifying the best response from the stochasticity induced by regularization, our results provide a sharper theoretical explanation for the superb practical efficiency of greedy alignment.
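注:摘要所述"温度为零的 regret"大致可形式化如下(符号为笔者补充的示意写法,非论文原文),即只对推理时贪婪选取的最优响应计损失:

```latex
% Temperature-zero (decision-centric) cumulative regret:
% only the top-ranked response at inference time is evaluated.
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T}
  \Big[\, r\big(x_t,\, y^{\star}(x_t)\big)
        \;-\; r\big(x_t,\, \operatorname*{arg\,max}_{y}\, \pi_t(y \mid x_t)\big) \Big]
```

其中 r 为真实奖励,y^⋆(x_t) 为提示 x_t 下的最优响应,π_t 为第 t 轮的训练策略。论文证明在该准则下贪婪式在线对齐可达到常数级 O(1) 的累计遗憾,而 KL 正则化 regret 则把软化策略带来的探索随机性也计入代价,因而给出更保守的 O(log T) 上界。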
[NLP-153] Calibrating Model-Based Evaluation Metrics for Summarization
【速读】: 该论文旨在解决现有摘要评估方法中依赖大规模语言模型(Large Language Models, LLMs)、预测分数校准不足以及需要多个参考摘要才能评估单文档平均质量的问题。其解决方案的关键在于提出一个通用框架,能够在无需参考摘要、人工标注或昂贵的基于模型的评估指标的情况下,生成个体和平均的代理评分(proxy scores);同时引入组保序回归分箱(Group Isotonic Regression Binning, GIRB)这一校准方法,将原始预测值调整至更贴近真实评价指标的分布,从而提升评估结果的可靠性与一致性。
链接: https://arxiv.org/abs/2604.17200
作者: Hongye Liu,Dhanajit Brahma,Ricardo Henao
机构: Duke University
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in summary evaluation are based on model-based metrics to assess quality dimensions, such as completeness, conciseness, and faithfulness. However, these methods often require large language models, and predicted scores are frequently miscalibrated, limiting their reliability. Moreover, evaluating the average quality across different summaries for a single document typically requires access to multiple reference summaries. Here, we propose a general framework that generates individual and average proxy scores without relying on reference summaries, human annotations, or expensive model-based metrics. We also propose group isotonic regression binning (GIRB), a calibration method that adjusts the raw predictions to better align with ground-truth evaluation metrics. While we focus on continuous-value scenarios, such as summarization, the method is applicable to discrete-value tasks, such as question answering. Experiments on seven datasets demonstrate that our approach consistently outperforms existing baselines.
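摘要未给出 GIRB 的实现细节;下面是一个假设性的纯 Python 最小示意(函数名与分组方式均为笔者虚构):在每个组内用 Pool Adjacent Violators(PAV)算法做保序回归,把原始预测校准为与真实评价指标单调一致的分数。

```python
def pav(scores, targets):
    """Pool Adjacent Violators:按 scores 排序后对 targets 做保序回归。"""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # 每个块记录 [均值, 权重]
    for i in order:
        blocks.append([targets[i], 1])
        # 合并违反单调非减约束的相邻块
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fitted = []
    for m, w in blocks:
        fitted.extend([m] * w)
    calibrated = [0.0] * len(scores)
    for rank, i in enumerate(order):  # 还原到输入顺序
        calibrated[i] = fitted[rank]
    return calibrated

def girb_calibrate(scores, targets, groups):
    """GIRB 思想的简化示意:按组分别做保序回归校准。"""
    out = [0.0] * len(scores)
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        cal = pav([scores[i] for i in idx], [targets[i] for i in idx])
        for j, i in enumerate(idx):
            out[i] = cal[j]
    return out
```

例如 girb_calibrate([0.1, 0.9], [0.75, 0.25], [0, 0]) 会把与真实指标逆序的两个预测拉平为同一校准值 0.5,体现"校准后与 ground truth 单调一致"的思路。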
[NLP-154] Learning to Control Summaries with Score Ranking
【速读】: 该论文旨在解决生成式摘要(Summarization)中多维度质量指标(如完整性、简洁性和忠实性)之间存在固有权衡的问题,即现有方法在联合优化这些维度时难以实现对某一特定维度的可控调节。解决方案的关键在于提出一种新的损失函数,该函数能够将模型输出与基于模型的细粒度评估得分(如FineSurE)对齐,从而在提升整体摘要质量的同时,实现对各个质量维度的选择性控制,使得用户可根据需求优先优化某一指标(如侧重简洁性或完整性)。
链接: https://arxiv.org/abs/2604.17197
作者: Hongye Liu,Liang Ding,Ricardo Henao
机构:
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in summarization research focus on improving summary quality across multiple criteria, such as completeness, conciseness, and faithfulness, by jointly optimizing these dimensions. However, these efforts largely overlook the challenge of controlling summary generation with respect to individual criteria, especially in the presence of their inherent trade-offs. For example, enhancing conciseness can compromise completeness, and vice versa. In this work, we address this gap by proposing a loss function that aligns model outputs with fine-grained, model-based evaluation scores (e.g., from FineSurE), enabling both improvement in summary quality and dimension-specific control. Our approach improves the overall quality of summaries while maintaining the ability to selectively prioritize one criterion over others. Experiments on three pretrained models (LLaMA, Qwen, and Mistral) demonstrate that our method achieves performance comparable to state-of-the-art summarizers, while uniquely offering strong controllability over individual quality dimensions.
[NLP-155] Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization
【速读】: 该论文旨在解决多角色对话摘要(multi-role dialogue summarization)中如何有效建模多方交互、保持角色特异性信息以及确保事实一致性的问题。现有方法主要优化ROUGE和BERTScore等自动指标,但这些指标倾向于表面模仿参考摘要,而非提升摘要的真实性或与人类偏好的一致性。解决方案的关键在于提出一种融合显式认知推理与基于奖励的优化框架:首先通过教师模型蒸馏结构化推理路径(如逐步推断和中间反思),以分阶段监督微调初始化一个具备推理感知能力的摘要模型;随后采用GRPO算法,设计双原则奖励机制,结合指标驱动信号与人类对关键信息覆盖、隐含推理、事实忠实性和简洁性的偏好目标,从而显著提升摘要的事实准确性和人类偏好对齐度。
链接: https://arxiv.org/abs/2604.17188
作者: Xiaoyong Mei,Tingting Zuo,Da Chen,Guangyu Hu,Xiangyu Wen,Chao Duan,Mingyan Zhang,Fudan Zheng
机构: Zhejiang Normal University (浙江师范大学); Huawei Technologies (华为技术有限公司); HKUST, Hong Kong (香港科技大学); CUHK, Hong Kong (香港中文大学); Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-role dialogue summarization requires modeling complex interactions among multiple speakers while preserving role-specific information and factual consistency. However, most existing methods optimize for automatic metrics such as ROUGE and BERTScore, which favor surface-level imitation of references rather than genuine gains in faithfulness or alignment with human preferences. We propose a novel framework that couples explicit cognitive-style reasoning with reward-based optimization for multi-role dialogue summarization. Our method first distills structured reasoning traces (e.g., step-by-step inferences and intermediate reflections) from a large teacher model and uses them as auxiliary supervision to initialize a reasoning-aware summarizer via staged supervised fine-tuning. It then applies GRPO with a dual-principle reward that blends metric-based signals with human-aligned criteria targeting key information coverage, implicit inference, factual faithfulness, and conciseness. Experiments on multilingual multi-role dialogue benchmarks show that our method matches strong baselines on ROUGE and BERTScore. Specifically, results on CSDS confirm the framework’s stability in semantic consistency, while in-depth analysis on SAMSum demonstrates clear gains in factual faithfulness and model-based preference alignment. These findings underscore the value of reasoning-aware and preference-aware training for reliable dialogue summarization. Checkpoints and datasets are available at this https URL.
[NLP-156] Cognitive Policy-Driven LLM for Diagnosis and Intervention of Cognitive Distortions in Emotional Support Conversation ACL2026
【速读】: 该论文旨在解决当前情感支持对话(Emotional Support Conversation, ESC)模型在处理求助者心理困扰时,因忽视认知扭曲(Cognitive Distortion)问题而仅能提供基础情绪安慰、难以实现深层次心理干预的局限性。其解决方案的关键在于构建首个包含认知扭曲类型、强度及安全风险等级标注的CogBiasESC数据集,并提出基于认知策略驱动的大语言模型框架(Cognitive Policy-driven Large Language Model, CoPoLLM),通过引入认知诊断与干预策略机制,显著提升模型在识别和应对认知扭曲方面的准确性与安全性,实验表明CoPoLLM在多个指标上优于15种先进基线方法。
链接: https://arxiv.org/abs/2604.17178
作者: Lin Zhong,Renjin Zhu,Shujuan Ma,Jinhao Cui,Lingzhi Wang,Hao Chen,Qing Liao
机构: Harbin Institute of Technology, Shenzhen, China; City University of Macau, Macao SAR, China; Peng Cheng Laboratory, Shenzhen, China
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 (Main Conference)
Abstract:Emotional Support Conversation (ESC) plays a critical role in mental health assistance by providing accessible psychological support in real-world applications. Large Language Models (LLMs) have shown strong empathetic abilities in ESC tasks. Yet, existing methods overlook the issue of cognitive distortions in help-seekers’ expressions. As a result, current models can only provide basic emotional comfort, rather than helping help-seekers address their psychological distress at a deeper cognitive level. To address this challenge, we construct the CogBiasESC dataset, the first dataset that expands existing ESC datasets by adding labels for cognitive distortions, including their type, intensity, and safety risk level. Furthermore, we propose the Cognitive Policy-driven Large Language Model framework (CoPoLLM) to enhance LLMs’ ability to diagnose and intervene in cognitive distortions in help-seekers. We also analyze the safety advantages of CoPoLLM from a theoretical perspective. Experimental results show that CoPoLLM significantly outperforms 15 state-of-the-art baselines in terms of distortion diagnosis accuracy, intervention strategy effectiveness, and safety risk control.
[NLP-157] Modeling Multi-Dimensional Cognitive States in Large Language Models under Cognitive Crowding ACL2026
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在多维认知状态建模中的局限性问题,即现有模型虽能在单一维度(如情绪、思维风格、立场或意图)上表现良好,但在联合建模多个认知维度时性能显著下降。其核心问题是LLMs所依赖的欧几里得空间(Euclidean space)难以有效表示具有层次结构的认知状态,导致表征重叠和理解能力受限。解决方案的关键在于提出HyCoLLM框架,该框架将认知状态建模从欧几里得空间迁移至双曲空间(hyperbolic space),利用双曲几何对层次结构的天然适应性,并通过双曲引导对齐微调(Hyperbolic Guided Alignment Tuning)实现LLM表征的优化,从而显著提升多维认知理解能力,使8B参数模型超越GPT-4o等强基线。
链接: https://arxiv.org/abs/2604.17174
作者: Lin Zhong,Siyu Zhu,Zizhen Yuan,Jinhao Cui,Xinyang Zhao,Lingzhi Wang,Hao Chen,Qing Liao
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026
Abstract:Modeling human cognitive states is essential for advanced artificial intelligence. Existing Large Language Models (LLMs) mainly address isolated tasks such as emotion analysis or stance detection, and fail to capture interactions among cognitive dimensions defined in psychology, including emotion, thinking style, stance, and intention. To bridge this gap, we construct CognitiveBench, the first benchmark with unified annotations across the above four dimensions. Experiments on CognitiveBench show that although LLMs perform well on single-dimension tasks, their performance drops sharply in joint multi-dimensional modeling. Using Gromov δ-hyperbolicity analysis, we find that CognitiveBench exhibits a strong hierarchical structure. We attribute the performance bottleneck to “Cognitive Crowding”, where hierarchical cognitive states require exponential representational space, while the Euclidean space of LLMs grows only polynomially, causing representation overlap and degraded performance. To address this mismatch, we propose HyCoLLM, which models cognitive states in hyperbolic space and aligns LLM representations via Hyperbolic Guided Alignment Tuning. Results show that HyCoLLM substantially improves multi-dimensional cognitive understanding, allowing an 8B-parameter model to outperform strong baselines, including GPT-4o.
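摘要用 Gromov δ-hyperbolicity 刻画数据的层次结构强度。下面给出其四点条件的一个朴素 O(n⁴) 计算示意(仅作说明,非论文实现):δ 越小,度量越接近树/双曲结构。

```python
from itertools import combinations

def gromov_delta(dist):
    """由四点条件计算 Gromov δ:dist 为对称距离矩阵。

    对每个四元组取三个配对距离和,δ 是最大两个和之差的一半
    在所有四元组上的最大值。
    """
    n = len(dist)
    delta = 0.0
    for w, x, y, z in combinations(range(n), 4):
        sums = sorted(
            [dist[w][x] + dist[y][z],
             dist[w][y] + dist[x][z],
             dist[w][z] + dist[x][y]],
            reverse=True,
        )
        delta = max(delta, (sums[0] - sums[1]) / 2)
    return delta
```

对树度量(如星形图的四个叶子,两两距离相等)δ = 0;对 4 顶点环的最短路度量 δ = 1,偏离树结构,对应文中"层次性强则 δ 小"的判据。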
[NLP-158] Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在进攻性网络安全任务中模型性能评估缺乏系统性和跨模型可比性的问题。其解决方案的关键在于构建了一个全面的多模型评测框架——基于 D-CIPHER 多智能体架构扩展出支持多提供商后端、集成超过 100 种预装渗透测试工具的 Kali Linux 环境,并引入运行时工具发现代理(runtime tool-discovery agents),从而在 NYU CTF Bench 的全部 200 个挑战上对 10 个前沿大语言模型(LLM)进行受控因子实验。结果表明,环境配置(如使用 Kali Linux 相较于 Ubuntu 提升 +9.5 个百分点)和模型选择(Claude 4.5 Opus 解决率最高达 59%)是性能最强驱动因素,而提示工程(prompt engineering)在资源充足环境中反而可能带来边际收益递减甚至负面效果。
链接: https://arxiv.org/abs/2604.17159
作者: Tyler H. Merves,Michael H. Conaway,Joseph M. Escobar,Hakan T. Otal,Unal Tatar
机构: University at Albany, SUNY; College of Emergency Preparedness, Homeland Security, and Cybersecurity
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages, 4 figures. Submitted to the IEEE Systems and Information Engineering Design Symposium (SIEDS)
Abstract:We present, to our knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. Building on the D-CIPHER multi-agent framework, we extend it with multi-provider backend support, a custom Kali Linux environment with over 100 pre-installed penetration testing tools, and runtime tool-discovery agents. Through a controlled factorial study, we find that the Kali Linux environment yields a +9.5 percentage-point improvement over Ubuntu, while auto-prompting and category-specific tips often degrade performance in well-equipped environments. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), with Gemini 3 Flash offering the best cost-efficiency at $0.05 per solve. Asymmetric planner/executor model assignments provide no meaningful benefit while coherent same-model configurations consistently outperform mixed-tier pairings. Our results indicate that environment tooling and model selection emerge as the strongest drivers of performance, whereas prompt engineering interventions show diminishing or negative returns in well-equipped environments. Reported performance reflects both model reasoning ability and compatibility with agent tooling and API integration.
[NLP-159] From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation
【速读】: 该论文旨在解决将法律文本转化为可执行决策逻辑(executable decision logic)这一长期存在于法律信息学中的挑战,尤其在大型语言模型(LLM)兴起背景下,如何提升自动化生成质量成为关键问题。其解决方案的核心在于引入中间结构化表示(intermediate structured representations),通过在输入文本中嵌入输入/输出约束(input and output constraints)来显著改善LLM生成的决策模型结构相似性与功能等价性。实验表明,I/O约束使生成模型与基准模型的结构相似度提升37-54%,且生成模型虽更小更简,仍能在51-53%的测试场景中实现功能等价;更重要的是,结构相似性与功能等价性之间无强相关性,提示二者需协同评估以全面衡量生成质量。
链接: https://arxiv.org/abs/2604.17153
作者: David Graus
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, accepted to ICAIL 2026
Abstract:Transforming legal text into executable decision logic is a longstanding challenge in legal informatics. With the rise of LLMs, this task has gained renewed interest, but remains challenging due to requiring extensive manual coding and evaluation. We use a unique real-world dataset that pairs production-grade decision models with legal text from the Dutch Environment and Planning Act. These models power the Omgevingsloket government platform, where citizens check permit requirements for environmental activities. We study whether intermediate structured representations can improve LLM-based generation of executable decision models from legal text. We compare four input conditions: raw legal text, text enriched with semantic role labels, text enriched with input and output constraints, and text enriched with both. We evaluate along two dimensions: structural evaluation, through similarity to gold decision models with graph kernels and graphs’ descriptive statistics, and outcome evaluation, through functional equivalence by executing models on pre-configured test scenarios. Our findings show that I/O constraints provide the dominant improvement (+37-54% similarity over baseline), while semantic role labels show modest improvements. Outcome evaluation shows that generated models match the gold standard on 51-53% of test scenarios, even though generated models are typically smaller and simpler. We find LLMs eliminate redundant pass-through logic that comprises up to 45-55% of nodes. Importantly, structural similarity and outcome equivalence are complementary: structural similarity does not guarantee outcome equivalence, and vice versa. To facilitate reproducibility, we publicly release our dataset of 95 production decision models with associated legal text and all experimental code.
[NLP-160] SciImpact: A Multi-Dimensional Multi-Field Benchmark for Scientific Impact Prediction
【速读】: 该论文旨在解决科学文献影响力评估中缺乏多维、自动化预测方法的问题,现有研究主要依赖引文指标,未能充分衡量如奖项、媒体报道、专利引用及成果采纳等其他影响维度。其解决方案的关键在于构建了一个大规模、跨领域的多维科学影响力基准——SciImpact,涵盖19个学科领域,通过整合异构数据源与定向网络爬取,构建了215,928对对比论文对(反映短期和长期影响差异),并系统评估了11种主流大语言模型(LLMs)在该基准上的表现,结果表明,经过多任务监督微调的小型模型(如4B参数规模)能显著超越大型模型(如30B参数)甚至领先闭源模型(如o4-mini),从而验证了SciImpact作为挑战性基准的价值及其在多维、跨领域科学影响力预测中的有效性。
链接: https://arxiv.org/abs/2604.17141
作者: Hangxiao Zhu,Yuyu Zhang,Ping Nie,Yu Zhang
机构: Texas AM University (德州农工大学); Verdent AI; University of Waterloo (滑铁卢大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The rapid growth of scientific literature calls for automated methods to assess and predict research impact. Prior work has largely focused on citation-based metrics, leaving limited evaluation of models’ capability to reason about other impact dimensions. To this end, we introduce SciImpact, a large-scale, multi-dimensional benchmark for scientific impact prediction spanning 19 fields. SciImpact captures various forms of scientific influence, ranging from citation counts to award recognition, media attention, patent reference, and artifact adoption, by integrating heterogeneous data sources and targeted web crawling. It comprises 215,928 contrastive paper pairs reflecting meaningful impact differences in both short-term (e.g., Best Paper Award) and long-term settings (e.g., Nobel Prize). We evaluate 11 widely used large language models (LLMs) on SciImpact. Results show that off-the-shelf models exhibit substantial variability across dimensions and fields, while multi-task supervised fine-tuning consistently enables smaller LLMs (e.g., 4B) to markedly outperform much larger models (e.g., 30B) and surpass powerful closed-source LLMs (e.g., o4-mini). These results establish SciImpact as a challenging benchmark and demonstrate its value for multi-dimensional, multi-field scientific impact prediction. Our project homepage is this https URL
[NLP-161] RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian AAAI
【速读】: 该论文旨在解决跨语言(cross-lingual)和跨领域(cross-domain)情感分析中的性能下降问题,特别是在低资源语言(如意大利语和罗马尼亚语)上的迁移学习挑战。其核心解决方案是提出一种多目标对抗训练框架(multi-target adversarial training framework),通过引入损失反转(loss reversal)机制并结合元学习(meta-learning)得到的动态系数,实现情感判别能力与语言及领域不变性之间的平衡,从而提升模型在多语言、多领域场景下的泛化性能。
链接: https://arxiv.org/abs/2604.17134
作者: Andrei-Marius Avram,Aureliu Valentin Antonie,Cosmin-Mircea Croitoru,Vlad Andrei Muntean,Dumitru-Clementin Cercel
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at the International AAAI Conference on Web and Social Media (ICWSM 2026)
Abstract:We present RoIt-XMASA, a multilingual dataset that extends the Cross-lingual Multi-domain Amazon Sentiment Analysis to Italian and Romanian, comprising 36,000 labeled reviews across three domains (books, movies, and music) and 202,141 unlabeled samples. To address cross-lingual and cross-domain challenges, we propose a multi-target adversarial training framework that employs loss reversal with meta-learned coefficients to dynamically balance sentiment discrimination with domain and language invariance. XLM-R achieves an F1-score of 66.23% with our approach, outperforming the baseline by 4.64%. Few-shot evaluation shows that Llama-3.1-8B achieves 58.43% F1-score, revealing a meaningful trade-off between the efficiency of prompting-based approaches and the higher performance of task-specific fine-tuning.
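摘要描述的"损失反转 + 元学习系数"可以写成一个很小的目标函数示意(权重 alpha、beta 在论文中由元学习动态给出,此处作为入参;函数名为笔者虚构,非论文官方实现):

```python
def adversarial_objective(l_sentiment, l_domain, l_language, alpha, beta):
    """多目标对抗训练的损失反转示意。

    情感判别损失正常最小化;领域/语言判别损失取负号,
    从而迫使共享特征对领域与语言保持不变性。
    """
    return l_sentiment - alpha * l_domain - beta * l_language
```

对特征提取器最小化该目标,等价于在最小化情感损失的同时最大化领域/语言判别器的损失,效果上类似梯度反转层(gradient reversal);元学习系数用于动态权衡判别能力与不变性。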
[NLP-162] Please refuse to answer me! Mitigating Over-Refusal in Large Language Models via Adaptive Contrastive Decoding ACL2026
【速读】: 该论文旨在解决安全对齐的大语言模型(Large Language Models, LLMs)中存在的“过度拒绝”(over-refusal)问题,即模型在面对无害查询时频繁生成拒绝响应,而现有缓解方法难以同时保持对有害请求的高拒绝率与对无害请求的低拒绝率。解决方案的关键在于提出一种无需训练且模型无关的自适应对比解码方法(Adaptive Contrastive Decoding, AdaCD):首先,通过对比模型在极端安全系统提示下与无极端提示下的输出分布,优化拒绝 token 的分布;其次,引入一种自适应对比解码策略,动态调整是否引入拒绝 token 分布,从而在不损害安全性前提下提升非拒绝 token 的选择概率。实验表明,AdaCD 在五个基准数据集上平均将过度拒绝查询的拒绝率降低 10.35%,同时小幅提升恶意查询的拒绝率 0.13%。
链接: https://arxiv.org/abs/2604.17132
作者: Yupeng Qi,Ziyu Lyu,Lixin Cui,Lu Bai,Feng Xia
机构: Sun Yat-sen University (中山大学); Central University of Finance and Economics (中央财经大学); Beijing Normal University (北京师范大学); RMIT University (皇家墨尔本理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 Main Conference
Abstract:Safety-aligned large language models (LLMs) often generate refusal responses to harmless queries due to the over-refusal problem. However, existing methods for mitigating over-refusal cannot maintain a low refusal ratio for harmless queries while keeping a high refusal ratio for malicious ones. In this paper, we analyze how system prompts with varying safety levels affect LLM refusal behaviors when facing over-refusal queries. A key observation is that, when LLMs suffer from the over-refusal issue, non-refusal tokens remain present in the next-token candidate list, but the model systematically fails to select them, despite the generation of refusal tokens. Based on this observation, we propose a training-free and model-agnostic approach, Adaptive Contrastive Decoding (AdaCD), to mitigate over-refusal while maintaining LLM safety. First, AdaCD compares the output distributions of the LLM with or without an extreme safety system prompt to refine the refusal token distribution. Second, we introduce an adaptive contrastive decoding strategy that dynamically incorporates or removes the refusal token distribution, adaptively boosting the probability of selecting refusal or non-refusal tokens. Experimental results on five benchmark datasets show that, on average, AdaCD reduces the refusal ratio for over-refusal queries by 10.35%, yet still increases the refusal ratio for malicious queries by 0.13%. Code is available at this https URL.
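AdaCD 的核心思想——对比有/无极端安全系统提示下的输出分布,并自适应地增减拒绝倾向——可用如下极简 Python 示意(logits 用字典近似,gamma 的自适应选取逻辑从略;均为笔者假设,非论文官方实现):

```python
def adacd_logits(logits_base, logits_safe, gamma):
    """自适应对比解码的示意。

    logits_safe - logits_base 近似刻画极端安全提示注入的"拒绝方向";
    gamma > 0 增强拒绝倾向,gamma < 0 抑制过度拒绝。
    """
    return {t: lb + gamma * (logits_safe.get(t, lb) - lb)
            for t, lb in logits_base.items()}

def pick_token(logits):
    """贪婪选取得分最高的候选 token。"""
    return max(logits, key=logits.get)

# 一个过度拒绝的玩具例子:基础分布略微偏向拒绝 token
base = {"Sure": 1.4, "Sorry": 1.5}
safe = {"Sure": 0.5, "Sorry": 3.0}
```

此例中 pick_token(base) 给出 "Sorry"(过度拒绝),而 gamma = -0.5 时非拒绝 token "Sure" 胜出,gamma = 0.5 时保留拒绝行为,对应文中"非拒绝 token 仍在候选列表中,只是未被选中"的观察。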
[NLP-163] The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning
【速读】: 该论文旨在解决前沿大语言模型(Large Language Models, LLMs)在临床生成内容中存在“溯源缺口”(Provenance Gap)的问题,即模型输出虽具临床准确性,但其引用文献常为虚构,导致证据不可验证。解决方案的关键在于提出并实现HEG-TKG(Hierarchical Evidence-Grounded Temporal Knowledge Graphs)系统,该系统通过构建基于4,512篇PubMed文献与人工标注的疾病轨迹里程碑(1,280个节点)的分层证据锚定时间知识图谱,确保临床主张可被精确追溯至真实来源。实验表明,HEG-TKG在保持与基线相同临床特征覆盖度的同时,实现了100%的证据可验证性(203个内联引用),且相较Guideline-RAG方法零可验证引用的表现显著优越;此外,其具备高抗干扰能力(80%抵抗注入错误)和完全可检测性,同时支持本地化部署以保障患者数据安全。
链接: https://arxiv.org/abs/2604.17114
作者: Md Shamim Ahmed,Maja Dusanic,Moritz Nikolai Kirschner,Elisabeth Nyoungui,Jana Zschüntzsch,Lukas Galke Poech,Richard Röttger
机构: University of Southern Denmark (南丹麦大学); Universitätsmedizin Göttingen (哥廷根大学医学中心)
类目: Computation and Language (cs.CL)
备注: 32 pages, 9 figures, 7 tables. Will submit to npj Digital Medicine. Supplementary materials included
Abstract:Frontier large language models generate clinically accurate outputs, but their citations are often fabricated. We term this the Provenance Gap. We tested five frontier LLMs across 36 clinician-validated scenarios for three rare neuromuscular disease pairs. No model produced a clinically relevant PubMed identifier without prompting. When explicitly asked to cite, the best model achieved 15.3% relevant PMIDs; the majority resolved to real publications in unrelated fields. We present HEG-TKG (Hierarchical Evidence-Grounded Temporal Knowledge Graphs), a system that grounds clinical claims in temporal knowledge graphs built from 4,512 PubMed records and curated sources with quality-tier stratification and 1,280 disease-trajectory milestones. In a controlled three-arm comparison using the same synthesis model, HEG-TKG matches baseline clinical feature coverage while achieving 100% evidence verifiability with 203 inline citations. Guideline-RAG, given overlapping source documents as raw text, produces zero verifiable citations. LLM judges cannot distinguish fabricated from verified citations without PubMed audit data. Independent clinician evaluation confirms the verifiability advantage (Cohen’s d = 1.81, p < 0.001) with no degradation on safety or completeness. A counterfactual experiment shows 80% resistance to injected clinical errors with 100% detectability via citation trace. The system deploys on-premise via open-source models so patient data never leaves institutional infrastructure.
[NLP-164] Beyond Word Boundaries: A Hebrew Coreference Benchmark and an Evaluation Protocol for Morphologically Complex Text
【速读】: 该论文旨在解决生成式 AI(Generative AI)在处理形态丰富的语言(Morphologically Rich Languages, MRLs)时,核心指代消解(Coreference Resolution, CR)性能显著下降的问题。传统CR方法通常假设词边界与指代边界一致(如英语),但在MRLs中(如现代希伯来语),一个词可能包含多个语法成分(如词缀或代词附着成分),导致标准模型无法准确识别指代关系。解决方案的关键在于:首先构建首个针对现代希伯来语的多粒度标注核心指代数据集KibutzR,该数据集覆盖词、子词和多词层级的指代边界;其次提出一种面向词/词素边界不一致问题的分割感知评估协议,从而更真实地衡量模型在原始未分词文本上的表现。实验表明,当前主流大语言模型(LLMs)在希伯来语上的表现远低于英语,且小规模编码器模型反而优于大型解码器模型,揭示了MRLs下CR任务的新研究方向。
链接: https://arxiv.org/abs/2604.17108
作者: Refael Shaked Greenfeld,Reut Tsarfaty
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Coreference Resolution (CR) is a fundamental NLP task critical for long-form tasks such as information extraction, summarization, and many business applications. However, CR methods originally designed for English struggle with Morphologically Rich Languages (MRLs), where mention boundaries do not necessarily align with word boundaries, and a single token may consist of multiple anaphors. CR modeling and evaluation protocols standardly assume that, as in English, words and mentions mostly align. However, this assumption breaks down in MRLs, particularly in the context of LLMs’ raw-text processing and end-to-end tasks. To assess and address this challenge, we introduce KibutzR, the first comprehensive CR dataset for Modern Hebrew, an MRL rich with complex words and pronominal clitics. We deliver an annotated dataset that identifies mentions at word, sub-word and multi-word levels, and propose an evaluation protocol that directly addresses word/morpheme boundary discrepancies. Our experiments show that contemporary LLMs perform significantly worse on Hebrew than on English, and that performance degrades on raw unsegmented text. Crucially, we show an inverse performance-trend in Hebrew relative to English, where smaller encoders perform far better than contemporary decoder models, leaving ample space for investigation and improvement. We deliver a new benchmark for Hebrew coreference resolution and a segmentation-aware evaluation protocol to inform future work on other MRLs.
[NLP-165] How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them ACL2026
【速读】: 该论文旨在解决当前文本语言模型(Language Model, LM)在分词(tokenization)过程中忽视词音系特征的问题,特别是子词级分词对局部(如押韵)和全局(如音节划分)音系信息编码能力的系统性削弱。其解决方案的关键在于提出一种基于国际音标(IPA)的轻量级微调方法,通过引入音系感知机制来增强模型对音系知识的表示能力,从而在不显著损害数学推理和通用推理能力的前提下,提升模型在三项与音系相关的任务上的表现。
链接: https://arxiv.org/abs/2604.17105
作者: Disen Liao,Freda Shi
机构: University of Waterloo (滑铁卢大学); Vector Institute (向量研究所)
类目: Computation and Language (cs.CL)
备注: 18 pages, 7 figures, ACL 2026
Abstract:Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs’ ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model’s tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively.
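摘要提出的 STAD 度量分词边界与音节边界的错位,但其精确定义未在摘要中给出。下面是一个假设性的示意:把两种切分都转成字符边界位置集合,用对称差占并集的比例量化错位程度:

```python
def boundaries(segments):
    """把一个词的分段(音节或子词)转成内部字符边界位置集合。"""
    pos, cuts = 0, set()
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def stad(syllables, tokens):
    """STAD 的假设性示意:边界对称差 / 边界并集,取值 [0, 1]。"""
    s, t = boundaries(syllables), boundaries(tokens)
    union = s | t
    if not union:  # 两种切分都不切开这个词
        return 0.0
    return len(s ^ t) / len(union)
```

以 "banana" 为例:音节 ["ba","na","na"] 对子词 ["ba","nana"] 的得分为 0.5(部分对齐),对 ["ban","ana"] 为 1.0(完全错位),与文中"错位越大、音系表征越差"的相关性发现相呼应。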
[NLP-166] GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
【速读】: 该论文旨在解决长时程大语言模型(Large Language Model, LLM)代理在持续交互中因上下文窗口有限而导致的关键决策信息被淹没、历史经验难以复用的问题。其核心挑战在于:随着任务执行时间延长,工具描述、记忆片段和环境反馈不断累积,占据大量上下文空间,反而挤占了对当前决策至关重要的信息;同时,跨任务的经验无法有效留存。解决方案的关键在于提出GenericAgent(GA),其基于“上下文信息密度最大化”这一单一原则构建,通过四个紧密耦合的组件实现:最小原子工具集以简化接口、分层按需记忆机制默认仅展示高层视图、自进化机制将验证过的轨迹转化为可复用的标准操作流程(Standard Operating Procedure, SOP)和可执行代码、以及上下文截断与压缩层以维持长期运行中的信息密度。该设计使GA在任务完成率、工具使用效率、记忆有效性、自进化能力及网页浏览等多维度上均显著优于现有主流代理系统,且token消耗和交互次数更少,并具备持续演化的潜力。
链接: https://arxiv.org/abs/2604.17091
作者: Jiaqing Liang,Jinyi Han,Weijia Li,Xinyi Wang,Zhoujia Zhang,Zishang Jiang,Ying Liao,Tingyun Li,Ying Huang,Hao Shen,Hanyu Wu,Fang Guo,Keyi Wang,Zhonghua Hong,Zhiyu Lu,Lipeng Ma,Sihang Jiang,Yanghua Xiao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Long-horizon large language model (LLM) agents are fundamentally limited by context. As interactions become longer, tool descriptions, retrieved memories, and raw environmental feedback accumulate and push out the information needed for decision-making. At the same time, useful experience gained from tasks is often lost across episodes. We argue that long-horizon performance is determined not by context length, but by how much decision-relevant information is maintained within a finite context budget. We present GenericAgent (GA), a general-purpose, self-evolving LLM agent system built around a single principle: context information density maximization. GA implements this through four closely connected components: a minimal atomic tool set that keeps the interface simple, a hierarchical on-demand memory that only shows a small high-level view by default, a self-evolution mechanism that turns verified past trajectories into reusable SOPs and executable code, and a context truncation and compression layer that maintains information density during long executions. Across task completion, tool use efficiency, memory effectiveness, self-evolution, and web browsing, GA consistently outperforms leading agent systems while using significantly fewer tokens and interactions, and it continues to evolve over time. Project: this https URL
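GenericAgent 的上下文截断/压缩层细节未公开;下面是一个保留头部任务信息、从最近消息向前回填直至预算用尽的朴素示意(以字符数近似 token 数,函数名与策略均为笔者假设):

```python
def truncate_context(messages, budget, keep_head=1):
    """上下文截断示意:优先保留开头与最近的消息。

    messages: 按时间排列的消息列表;budget: 近似 token 预算。
    """
    head = messages[:keep_head]               # 系统提示 / 任务说明
    used = sum(len(m) for m in head)
    tail = []
    for m in reversed(messages[keep_head:]):  # 从最近消息向前回填
        if used + len(m) > budget:
            break
        tail.append(m)
        used += len(m)
    return head + list(reversed(tail))
```

真实系统(如文中的 GA)还会对被丢弃的中段做摘要压缩而非直接丢弃;这里只演示"在固定预算内优先保留决策相关信息"这一最小骨架。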
[NLP-167] Comparing Human and Large Language Model Interpretation of Implicit Information ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理隐含信息提取(Implicit Information Extraction, IIE)任务时的局限性问题,即LLMs是否能像人类一样准确识别和推理文本中的隐含语义。其解决方案的关键在于提出了一种基于LLM的IIE流水线,该流程通过从上下文句中抽取关系三元组(relational triplets)、验证隐含推理(implicit inferences)以及分析时间关系(temporal relations),构建结构化的知识图谱。实验表明,尽管模型与人类在多数三元组上达成一致,但人类仍持续提出大量新增三元组,说明当前LLM-based IIE存在覆盖不足的问题;此外,在社会丰富语境中,模型比人类更保守地进行隐含推理,而人类在短且事实导向的语境中则表现出更高的保守倾向。
链接: https://arxiv.org/abs/2604.17085
作者: Antonio De Santis,Tommaso Bonetti,Andrea Tocchetti,Marco Brambilla
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings
Abstract:The interpretation of implicit meanings is an integral aspect of human communication. However, this framework may not transfer to interactions with Large Language Models (LLMs). To investigate this, we introduce the task of Implicit Information Extraction (IIE) and propose an LLM-based IIE pipeline that builds a structured knowledge graph from a context sentence by extracting relational triplets, validating implicit inferences, and analyzing temporal relations. We evaluate two LLMs against crowdsourced human judgments on two datasets. We find that humans agree with most model triplets yet consistently propose many additions, indicating limited coverage in current LLM-based IIE. Moreover, in our experiments, models appear to be more conservative about implicit inferences than humans in socially rich contexts, whereas humans become more conservative in shorter, fact-oriented contexts. Our code is available at this https URL.
[NLP-168] Auditing Support Strategies in LLM s through Grounded Multi-Turn Social Simulation
【速读】: 该论文旨在解决当前对支持性大语言模型(Supportive Large Language Models, LLMs)的评估多基于单轮、完整提示(single-turn, fully specified prompts)的问题,这与用户在真实场景中逐步披露困境并寻求社会支持(Social Support)的行为模式不一致。为此,作者提出一种多轮模拟框架(multi-turn simulation framework),将来自五个Reddit社区的支持求助叙事分解为有序片段,并逐轮呈现给语言模型;每轮响应通过社会支持行为编码系统(Social Support Behavior Code, SSBC)进行多标签标注,以捕捉支持策略的组成而非单一质量评分。关键创新在于利用隐藏表示上的线性探测(linear probes)来估计模型内部对用户痛苦程度的感知信号,从而分析支持策略选择是否随模型自身对用户困境的理解而动态调整——结果显示,随着估计痛苦水平上升,教学类支持策略显著下降,而情感和自尊导向策略(如验证)则呈模型特异性变化,且社区语境独立于人口统计类别影响支持行为模式。这一方法揭示了单轮评估无法捕捉的轨迹级动态,推动了面向社会敏感应用的多轮审计框架的发展。
链接: https://arxiv.org/abs/2604.17079
作者: Michelle Star,Andrew Aquilina,Yu-Ru Lin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:When users seek social support from chatbots, they disclose their situation gradually, yet most evaluations of supportive LLMs rely on single-turn, fully specified prompts. We introduce a multi-turn simulation framework that closes this gap. Support-seeking narratives from five Reddit communities are decomposed into ordered fragments and revealed turn by turn to a language model. Each response is coded with the Social Support Behavior Code (SSBC), an established multi-label taxonomy that captures the composition of support, rather than a single quality score. To ask whether support choices track the model’s own construal of user distress, we use linear probes on hidden representations to estimate this internal signal without altering the generation context. Across two mid-scale models (Llama-3.1-8B, OLMo-3-7B) and more than 6,200 turns, support composition shifts systematically with estimated distress: teaching declines as estimated distress rises, a finding that replicates across architectures, while increases in affective and esteem-oriented strategies (such as validation) are suggestive but model-specific and rest on noisier annotations. Community context independently shapes behavior, tracking topic and discourse norms rather than demographic categories. These trajectory-level dynamics, invisible to single-turn evaluation, motivate multi-turn auditing frameworks for socially sensitive applications.
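摘要中的"线性探针"指在冻结的隐藏表示上训练一个线性分类器来读出内部信号。下面是一个纯 Python 的逻辑回归探针示意(玩具维度与超参均为笔者假设,非论文实现):

```python
import math

def train_probe(xs, ys, lr=0.5, epochs=200):
    """在固定表示 xs 上用 SGD 训练逻辑回归探针,预测标签 ys。"""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))   # sigmoid
            g = p - y                    # 交叉熵损失的梯度因子
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_score(w, b, x):
    """探针读出的概率,可解释为模型内部"痛苦程度"信号的估计。"""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))
```

关键设计在于探针是线性的且只在隐藏表示上训练、不改变生成过程,因此读出的信号可视为模型自身表征中已编码的信息,而非探针额外学到的能力。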
[NLP-169] Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL ACL2026
【速读】: 该论文旨在解决大语言模型在强化微调(reinforcement fine-tuning)后可能出现的“错误回答不可回答问题”的现象,即模型倾向于猜测或产生幻觉(hallucination),而非正确地选择不回答(abstention)。现有方法要么训练模型生成泛化的拒绝响应,要么鼓励后续澄清但无法验证这些澄清是否真正识别出关键缺失信息。解决方案的关键在于提出一种澄清感知的可验证奖励强化学习(clarification-aware RLVR)奖励机制,该机制不仅奖励模型在可回答问题上的正确输出,还同时优化两个目标:对不可回答问题进行显式 abstention(明确拒绝),以及生成语义上与缺失信息一致的后续澄清说明(post-refusal clarification)。通过该奖励训练得到的 Abstain-R1 模型,在保持可回答问题性能的同时,显著提升了不可回答问题上的校准性 abstention 和澄清质量,表明可靠的拒绝与解释行为可通过可验证的奖励信号学习获得,而无需依赖模型规模的自然涌现。
链接: https://arxiv.org/abs/2604.17073
作者: Skylar Zhai,Jingcheng Liang,Dongyeop Kang
机构: University of Minnesota (明尼苏达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026
Abstract:Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.
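摘要中的 clarification-aware 奖励可以拆成三部分:可回答问题答对得分、不可回答问题显式拒答得分、澄清说明命中缺失信息再加分。下面用词面重叠近似"语义对齐",0.5/0.5 的权重与整个函数签名均为假设,仅作示意:

```python
def abstain_reward(is_answerable, answered, correct, abstained,
                   clarification, missing_info):
    # 可回答:答对得 1 分,否则 0 分
    if is_answerable:
        return 1.0 if (answered and correct) else 0.0
    # 不可回答但未拒答(猜测/幻觉):0 分
    if not abstained:
        return 0.0
    # 显式拒答得 0.5;澄清与缺失信息的词面重叠比例再贡献至多 0.5
    overlap = len(set(clarification.lower().split())
                  & set(missing_info.lower().split()))
    align = overlap / max(len(missing_info.split()), 1)
    return 0.5 + 0.5 * align
```

实际的 RLVR 奖励会用可验证的语义一致性判定而非词面重叠,这里只为说明"拒答 + 澄清"被联合打分的结构。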
[NLP-170] Stability-Weighted Decoding for Diffusion Language Models
【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)在文本生成过程中因静态置信度指标导致的过早解码不稳定词元(token)的问题。现有解码策略依赖单一去噪步骤计算的静态置信度,忽略了时间维度上的历史信息,从而可能在不稳定的条件下提前释放掩码内容,影响生成质量。解决方案的关键在于提出一种无需训练、可即插即用的稳定性加权解码(Stability-Weighted Decoding, SWD)方法,其核心思想是利用相邻预测分布之间的KL散度量化词元的时间不稳定性,并证明该不稳定性提供其与剩余掩码上下文互信息的严格下界——即时间不稳定词元本质上不可靠。SWD将此稳定性指标融入评分机制,作为通用调制器适配任意基于分数的解码策略,在代码生成和数学推理基准上显著提升准确性并保持鲁棒性。
链接: https://arxiv.org/abs/2604.17068
作者: Yue Wu,Jian Huang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Diffusion large language models (dLLMs) enable parallel text generation by iteratively denoising a fully masked sequence, unmasking a subset of masked tokens at each step. Existing decoding strategies rely on static confidence metrics computed at a single denoising step, ignoring temporal history and often leading to premature unmasking of unstable tokens. In this work, we theoretically establish that a token’s temporal instability, quantified by the KL divergence between consecutive prediction distributions, provides a strict lower bound on its mutual information with the remaining masked context, indicating that temporally unstable tokens are inherently unsafe to unmask. Based on this insight, we propose Stability-Weighted Decoding (SWD), a training-free, plug-and-play strategy that incorporates temporal stability into token scoring and acts as a universal modulator for arbitrary score-based decoding policies. Experiments on code generation and mathematical reasoning benchmarks demonstrate that SWD consistently improves generation accuracy across representative scoring metrics and selection policies, and exhibits exceptional robustness, maintaining a significant performance lead over standard baselines across varying acceleration ratios.
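SWD 的核心机制可以概括为:用相邻两步去噪分布的 KL 散度量化词元不稳定性,再把稳定性作为乘性调制项叠加到任意置信度打分上。以下是一个极简示意(score = confidence · exp(−λ·KL) 的组合形式为假设,并非论文公式):

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    # 相邻两步去噪分布之间的 KL 散度,用作词元的时间不稳定性度量
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def stability_weighted_scores(probs_prev, probs_curr, lam=1.0):
    # 假设的组合方式:score = 当前最大概率(置信度) × exp(-lam · KL)
    # 不稳定(KL 大)的词元得分被压低,从而延后解码
    scores = []
    for p_prev, p_curr in zip(probs_prev, probs_curr):
        confidence = float(np.max(p_curr))
        instability = kl_div(p_curr, p_prev)
        scores.append(confidence * np.exp(-lam * instability))
    return scores
```

这样,即使某个词元在当前步置信度很高,只要它在相邻两步间分布剧烈变化,也不会被过早解码,这正是"时间不稳定词元不可靠"这一下界结论的用法。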
[NLP-171] Jailbreaking Large Language Models with Morality Attacks ACL2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)在面对多元道德价值观时的对齐问题,即如何使大语言模型(Large Language Models, LLMs)能够兼容并服务于具有不同伦理立场的人类社会。其核心挑战在于当前LLMs虽在学习多元价值方面已有诸多探索,但其在面对精心设计的“越狱攻击”(jailbreak prompts)时仍表现出脆弱性,易被诱导输出违背预设道德规范的内容。解决方案的关键在于:构建一个包含10.3K条实例的道德数据集,涵盖“价值模糊”与“价值冲突”两类场景,并基于此设计四种对抗性攻击方法,以系统性测试LLMs及防护模型(guardrail models)在复杂道德判断中的鲁棒性。实验表明,LLMs和现有防护机制均对这类隐蔽且具道德感知力的攻击存在显著漏洞,揭示了当前AI道德对齐研究中亟需加强的薄弱环节。
链接: https://arxiv.org/abs/2604.17053
作者: Ying Su,Mingen Zheng,Weili Diao,Haoran Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 27 pages, 6 figures, 18 tables. Accepted by ACL 2026 Findings
Abstract:Pluralism alignment with AI has the sophisticated and necessary goal of creating AI that can coexist with and serve morally multifaceted humanity. Research towards pluralism alignment has many efforts in enhancing the learning of large language models (LLMs) to accomplish pluralism. Although this is essential, the robustness of LLMs to produce moral content over pluralistic values is still underexplored. Inspired by the astonishing persuasion abilities via jailbreak prompts, we propose to leverage jailbreak attacks to study LLMs’ internal pluralistic values. In detail, we develop a morality dataset with 10.3K instances in two categories: Value Ambiguity and Value Conflict. We further formalize four adversarial attacks with the constructed dataset, to manipulate LLMs’ judgment over the morality questions. We evaluate both the large language models and guardrail models which are typically used in generative systems with flexible user input. Our experiment results show that there is a critical vulnerability of LLMs and guardrail models to these subtle and sophisticated moral-aware attacks.
[NLP-172] Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization IJCNN
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域任务微调过程中因参数更新导致的灾难性遗忘(catastrophic forgetting)问题,即预训练阶段积累的通用知识被部分覆盖或丢失,从而削弱模型的泛化能力和迁移性能。解决方案的关键在于提出一种参数元素重要性评估方法,通过区分参数对通用语言能力任务与特定领域任务的重要性,将模型参数划分为“核心参数”和“非核心参数”,并在微调过程中固定核心参数,仅对非核心参数进行更新,从而在保留通用知识的同时提升模型对特定任务的适应性。
链接: https://arxiv.org/abs/2604.17051
作者: Weijie Wan,Jiangjiang Zhao
机构: South China Normal University (华南师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: IJCNN Full Paper
Abstract:Large Language Models (LLMs) have demonstrated excellent performance in general language understanding, generation and other tasks. However, when fine-tuning for specific domain tasks, the general knowledge accumulated in the pre-training phase is often partially overwritten or forgotten due to parameter updates, which severely limits the generalization ability and transferability of LLMs. Traditional fine-tuning strategies mostly train on the entire parameter space, ignoring the heterogeneity of model parameters, that is, some parameters are extremely important for general tasks, while other parameters are more sensitive to specific tasks. To alleviate the above problems, this paper innovatively proposes a parameter element importance evaluation method, which divides parameters into “core parameters” and “non-core parameters” by distinguishing the importance of parameters for general language ability tasks and specific domain tasks, and fixes the core parameters during fine-tuning, and only fine-tunes the non-core parameters. Extensive experiments on scientific, medical and physical tasks using GPT-J and LLaMA-3 show that our method can mitigate catastrophic forgetting while enhancing the adaptability of the model.
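该方法的两个步骤:先按重要性把参数切分为核心/非核心两组,再在更新时屏蔽核心参数的梯度。以下为一个 numpy 示意(重要性此处直接作为输入,实际可来自梯度幅值或 Fisher 信息等;0.3 的核心比例为假设值):

```python
import numpy as np

def split_core_mask(importance, core_ratio=0.3):
    # 重要性最高的 core_ratio 比例参数视为"核心参数",微调时冻结
    k = int(len(importance) * core_ratio)
    order = np.argsort(importance)[::-1]          # 按重要性从高到低排序
    mask = np.ones_like(importance, dtype=bool)   # True = 可更新(非核心)
    mask[order[:k]] = False
    return mask

def masked_sgd_step(params, grads, mask, lr=0.1):
    # 仅更新非核心参数,核心参数保持预训练值不变
    return params - lr * grads * mask
```

在 PyTorch 等框架中,同样的效果通常通过对梯度张量乘以掩码或注册梯度钩子实现,逻辑与此一致。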
[NLP-173] Dynamic Emotion and Personality Profiling for Multimodal Deception Detection ACL2026
【速读】: 该论文旨在解决现有欺骗检测方法中缺乏样本级动态情绪标注的问题,从而提升欺骗、情绪和人格三者联合检测的准确性。其关键解决方案是提出一种多模态多提示标注方案(multi-model multi-prompt annotation scheme)与严格的标签质量评估标准,构建了首个融合欺骗、情绪与人格的多模态联合检测数据集DDEP,并设计了Rel-DDEP框架——一种基于自适应可靠性加权融合机制的模型。该框架通过将各模态特征映射到高维高斯分布空间量化不确定性,结合对齐模块与排序约束模块实现多任务协同优化,显著提升了欺骗检测(F1提升2.53%)、情绪检测(F1提升2.66%)和人格检测(F1提升9.30%)的性能。
链接: https://arxiv.org/abs/2604.17037
作者: Li Zheng,Yanyi Luo,Hao Fei,Yuzhe Ding,Yujie Huang,Fei Li,Chong Teng,Donghong Ji
机构: Wuhan University (武汉大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026
Abstract:Deception detection is of great significance for ensuring information security and conducting public opinion analysis, with personality factors and emotion cues playing a critical role. However, existing methods lack sample-level dynamic annotations for emotions and personality. In this paper, we propose an innovative multi-model multi-prompt annotation scheme and a strict label quality evaluation standard, and establish a multimodal joint detection dataset DDEP for deception, emotion, and personality. Meanwhile, we propose Rel-DDEP, an adaptive reliability-weighted fusion framework. Our framework quantifies uncertainty by mapping modal features to a high-dimensional Gaussian distribution space. It then performs reliability-weighted fusion and incorporates an alignment module and a sorting constraint module to achieve joint detection of deception, emotion, and personality. Experimental results on the MDPE and DDEP datasets show that our Rel-DDEP significantly outperforms the existing state-of-the-art baseline models in three tasks. The F1 score of the deception detection increases by 2.53%, that of the emotion detection increases by 2.66%, and that of the personality detection increases by 9.30%. The experiments fully verify the necessity of annotating dynamic emotion and personality labels for each sample and the effectiveness of reliability-weighted fusion.
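"将模态特征映射到高斯分布并按可靠性加权融合"的思路,最简形式就是逆方差加权:方差越大的模态不确定性越高,融合权重越低。以下为一个示意实现(并非 Rel-DDEP 的实际网络结构,仅说明加权机制):

```python
import numpy as np

def reliability_weighted_fusion(means, variances, eps=1e-8):
    # means / variances: (模态数, 特征维度),每个模态的特征视为高斯分布
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = 1.0 / (variances + eps)                      # 逆方差 = 可靠性
    weights = weights / weights.sum(axis=0, keepdims=True)  # 跨模态归一化
    fused = (weights * means).sum(axis=0)                   # 加权融合
    return fused, weights
```

论文中均值与方差由网络从各模态特征预测得到,并配合对齐与排序约束模块联合训练;这里只演示"不确定性高则权重低"这一核心。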
[NLP-174] Where is the Mind? Persona Vectors and LLM Individuation
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)的个体化问题(individuation problem),即在与这些模型相关的实体中,哪些应被识别为具有心智属性的主体。其解决方案的关键在于通过机制可解释性(mechanistic interpretability)进行分析,并聚焦于近期关于角色向量(persona vectors)、角色空间(persona space)以及涌现错位(emergent misalignment)的经验研究。作者提出三种最具潜力的观点:虚拟实例观(virtual instance view)、新提出的虚拟实例-角色观((virtual) instance-persona view)和模型-角色观(model-persona view)。其中,虚拟实例观的核心依据是注意力流(attention streams)能够在词元时间(token-time)上维持准心理关联,从而支持将特定运行实例视为具备心智属性的实体;而基于角色结构的两种新观点则源于对LLM内部角色表征机制的三重假设,显示出作为心智候选者的可行性。
链接: https://arxiv.org/abs/2604.17031
作者: Pierre Beckmann,Patrick Butlin
机构: EPFL (洛桑联邦理工学院); Idiap Research Institute (Idiap 研究所); Eleos AI Research (Eleos AI 研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The individuation problem for large language models asks which entities associated with them, if any, should be identified as minds. We approach this problem through mechanistic interpretability, engaging in particular with recent empirical work on persona vectors, persona space, and emergent misalignment. We argue that three views are the strongest candidates: the virtual instance view and two new views we introduce, the (virtual) instance-persona view and the model-persona view. First, we argue for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. Then we present the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and show that the two persona-based views are promising alternatives.
[NLP-175] Beyond Black-Box Labels: Interpretable Criteria for Diagnosing Subjective NLP Tasks ACL
【速读】: 该论文旨在解决主观自然语言处理(Natural Language Processing, NLP)数据集中标注者间分歧难以诊断的问题,即这种分歧究竟是源于标准不清晰、类别区分失效,还是存在合法的多样性。其解决方案的关键在于提出一种schema-level诊断方法,在确定最终黄金标签(gold label)之前,仅基于多标注者的准则判断来审计专家设计的标注框架(annotation schema)。该方法能够区分两种失败模式:一是准则本身不稳定、边界难以操作化;二是类别间系统性重叠导致互斥类别的边界模糊。通过在商业文档中的说服价值提取任务上应用该诊断,研究发现标注分歧并非随机分布,而是集中在少数准则上,且近半数句子激活多个类别,这些信号与领域专家的实际分歧高度一致,从而为优化标注指南、调整类别结构或重新审视标注范式提供了实证依据。
链接: https://arxiv.org/abs/2604.17022
作者: Nisrine Rair,Alban Goupil,Valeriu Vrabie,Emmanuel Chochoy
机构: CReSTIC, Université de Reims Champagne-Ardenne, Reims, France; Chochoy Conseil, Reims, France
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL Findings 2026
Abstract:Subjective NLP datasets typically aggregate annotator judgments into a single gold label, making it difficult to diagnose whether disagreement reflects unclear criteria, collapsed distinctions, or legitimate plurality. We propose a schema-level diagnostic for auditing expert-designed annotation schemas prior to gold-label commitment, using only multi-annotator criterion judgments. The diagnostic separates two failure modes: unstable criteria with hard-to-operationalize boundaries, and systematic overlap that blurs the boundaries between mutually exclusive categories. Applied to persuasive value extraction in commercial documents, we find that disagreement is not diffuse: instability concentrates in a few criteria, while nearly half of covered sentences activate multiple categories. These signals align with where domain experts disagree, yielding an evidence-based audit for tightening guidelines, revising category structure, or reconsidering the annotation paradigm.
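该诊断的两个信号(准则不稳定性与类别共激活)都可以直接从多标注者的 0/1 判断矩阵计算。以下为一个示意实现(指标定义为简化版假设,论文可能采用更严格的一致性统计量):

```python
import numpy as np
from itertools import combinations

def criterion_instability(labels):
    # labels: (句子数, 标注者数) 的 0/1 矩阵
    # 不稳定性 = 标注者两两之间不一致判断所占的平均比例
    labels = np.asarray(labels)
    n_pairs, disagree = 0, 0
    for i, j in combinations(range(labels.shape[1]), 2):
        disagree += int((labels[:, i] != labels[:, j]).sum())
        n_pairs += labels.shape[0]
    return disagree / n_pairs

def category_overlap_rate(category_matrix):
    # category_matrix: (句子数, 类别数) 的 0/1 激活矩阵
    # 返回激活 >= 2 个"互斥"类别的句子比例,即系统性重叠信号
    m = np.asarray(category_matrix)
    return float((m.sum(axis=1) >= 2).mean())
```

论文报告"近半数句子激活多个类别",对应的就是第二个指标接近 0.5 的情形。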
[NLP-176] Beyond Static Benchmarks: Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation ACL2026
【速读】: 该论文旨在解决静态有害内容检测基准在可扩展性、多样性方面的局限性,以及因大规模预训练语料库污染导致的评估偏差问题。其解决方案的关键在于提出一种基于人格引导的大语言模型(Large Language Model, LLM)代理框架,通过构建包含人口统计特征与主题兴趣、情境化有害策略的二维用户人格,实现多样且语境贴合的有害交互模拟,从而生成高质量、高挑战性的合成有害场景,有效提升对有害内容检测系统的压力测试能力。
链接: https://arxiv.org/abs/2604.17020
作者: Huije Lee,Jisu Shin,Hoyun Song,Changgeon Ko,Jong C. Park
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026
Abstract:Static benchmarks for harmful content detection face limitations in scalability and diversity, and may also be affected by contamination from web-scale pre-training corpora. To address these issues, we propose a framework for synthesizing harmful content, leveraging persona-guided large language model (LLM) agents. Our approach constructs two-dimensional user personas by integrating demographic identities and topical interests with situational harmful strategies, enabling the simulation of diverse and contextually grounded harmful interactions. We evaluate the framework along three dimensions: harmfulness, challenge level, and diversity. Both human and LLM-based evaluations confirm that our framework achieves a high harmful generation success rate. Experiments across multiple detection systems reveal that our synthetic scenarios are more challenging to detect than those in existing benchmarks. Furthermore, a multi-faceted analysis confirms that our approach achieves linguistic and topical diversity comparable to human-curated datasets, establishing our framework as an effective tool for robust stress-testing of harmful content detection systems.
[NLP-177] Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification
【速读】: 该论文旨在解决 Haskell 程序中语义等价性判定(semantic equivalence)的问题,即如何自动判断两个函数在逻辑上是否等价。其核心挑战在于既要识别出真正等价的程序对(以支持代码重构与优化),又要有效区分非等价情况(以避免误判)。解决方案的关键在于提出一种基于自对弈(self-play)的框架,通过形式化验证(formal verification)引导生成器(generator)与评估器(evaluator)之间的对抗训练:利用 Liquid Haskell 的证明来确认等价性,借助执行层面的反例来检测不等价性,并结合难度感知的课程学习(difficulty-aware curriculum)组织训练数据。该方法显著提升了评估器在下游任务(如 EquiBench 和 PySecDB)上的迁移性能,且消融实验表明,等价性证明是模型推理能力的关键来源,而非仅依赖不等价监督带来的数据量增益。
链接: https://arxiv.org/abs/2604.17010
作者: Antonio Valerio Miceli Barone,Poon Tsz Nok
机构: University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注:
Abstract:We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release OpInstruct-HSx, a synthetic dataset of approximately 28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model’s reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face respectively.
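框架中"基于执行的反例"检测不等价的思路,可以用随机测试(property-based testing 的简化版)来示意:在随机输入上比较两个函数,找到一个输出不同的输入即证伪等价。注意找不到反例并不构成等价证明,确认等价仍需 Liquid Haskell 式的形式化证明。示例函数均为假设:

```python
import random

def find_counterexample(f, g, gen, trials=1000, seed=0):
    # 随机生成输入比较 f 与 g;返回首个反例,找不到则返回 None
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen(rng)
        if f(x) != g(x):
            return x
    return None

# 假设的示例程序对:f2 与 f1 等价,f3 在非空输入上与 f1 不等价
f1 = lambda xs: sum(xs)
f2 = lambda xs: sum(sorted(xs))
f3 = lambda xs: sum(xs) + (1 if xs else 0)
gen = lambda rng: [rng.randint(-5, 5) for _ in range(rng.randint(0, 4))]
```

这对应论文评估器训练信号的"证伪"一侧;"证实"一侧由形式化证明提供,两者缺一不可。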
[NLP-178] BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories ACL2026
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在生成儿童故事等叙事内容时,存在显著的英语中心主义(English-centric)评估倾向的问题,即现有对AI安全与对齐(alignment)的评测多集中于英语语境,缺乏对多语言环境下叙事生成行为跨语言一致性的系统性研究。其解决方案的关键在于构建了一个大规模平行语料库 BiasedTales-ML,包含约35万条来自八种类型和文化差异显著的语言的儿童故事,并采用全排列提示设计(full-permutation prompting design)以控制变量;同时提出一个结构化的生成器-提取器流水线(generator-extractor pipeline)与多维分布分析框架(multi-dimensional distributional analysis framework),从而揭示不同语言、模型及社会条件下叙事属性(如角色设定、场景描述与主题侧重)的跨语言变异模式,验证了英语中观察到的生成分布未必适用于其他语言,尤其在低资源语言环境中更为明显。
链接: https://arxiv.org/abs/2604.17008
作者: Yuxuan Ouyang,Yingfeng Luo,Jingbo Zhu,Tong Xiao
机构: Northeastern University (东北大学); NiuTrans Research (小牛翻译)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Findings. Data are available at this https URL
Abstract:Large Language Models (LLMs) are increasingly used to generate narrative content, including children’s stories, which play an important role in social and cultural learning. Despite growing interest in AI safety and alignment, most existing evaluations focus primarily on English, leaving the cross-lingual generalization of aligned behavior underexplored. In this work, we introduce BiasedTales-ML, a large-scale parallel corpus of approximately 350,000 children’s stories generated across eight typologically and culturally diverse languages using a full-permutation prompting design. We propose a structured generator-extractor pipeline and a multi-dimensional distributional analysis framework to examine how narrative attributes vary across languages, models, and social conditions. Our analysis reveals substantial cross-lingual variability in narrative generation patterns, indicating that distributions observed in English do not always exhibit similar characteristics in other languages, particularly in lower-resource settings. At the narrative level, we identify recurring structural patterns involving character roles, settings, and thematic emphasis, which manifest differently across linguistic contexts. These findings highlight the limitations of English-centric evaluation for characterizing socially grounded narrative generation in multilingual settings. We release the dataset, code, and an interactive visualization tool to support future research on multilingual narrative analysis and evaluation.
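论文的多维分布分析,核心操作是比较不同语言下同一叙事属性(如角色类型、场景)的离散分布差异。可以用 Jensen-Shannon 散度来示意这种比较(论文未必采用该度量,此处仅说明思路):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon 散度(以比特为单位),对称且取值范围 [0, 1]
    # p, q: 两种语言下同一叙事属性的归一化频率分布
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

散度为 0 表示两种语言的属性分布一致;接近 1 则说明英语上观察到的生成分布无法外推到该语言,正是论文强调的跨语言变异。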
[NLP-179] SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在训练推理导向模型时存在的探索不足问题,即RL通常只能提升单样本成功率(Pass@1),却难以促进多样化的推理路径探索,从而限制了多样本成功率(Pass@k)的提升。其核心问题是RL训练过程中概率质量过度集中于少数高奖励轨迹,导致探索受限。解决方案的关键在于提出“引导概率压缩”(Steering Probability Squeezing, SPS)训练范式,该方法通过交替使用RL与逆强化学习(Inverse Reinforcement Learning, IRL),将策略采样的轨迹作为示范数据,利用IRL显式重构轨迹分布,从而增强探索能力且无需外部监督。实验表明,SPS可显著改善Pass@k性能,并揭示了RL训练中Pass@k的实证上限,为提升大语言模型的推理探索能力提供了新路径。
链接: https://arxiv.org/abs/2604.16995
作者: Yifu Huo,Chenglong Wang,Ziming Zhu,Shunjie Xing,Peinan Feng,Tongran Liu,Qiaozhi He,Tianhua Zhou,Xiaojia Chang,Jingbo Zhu,Zhengtao Yu,Tong Xiao
机构: Northeastern University (东北大学); CAS Key Laboratory of Behavioral Science (中国科学院行为科学重点实验室); Independent Researcher (独立研究员); Kunming University of Science and Technology (昆明理工大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training typically tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, in this work, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs IRL to explicitly reshape the induced trajectory distribution, thereby enhancing exploration without introducing external supervision. Experiments on five commonly used reasoning benchmarks demonstrate that SPS can enable better exploration and improve Pass@k. Beyond algorithmic contributions, we provide an analysis of RL learning dynamics and identify an empirical upper bound on Pass@k, shedding light on intrinsic exploration limits in RL-based reasoning models. Our findings suggest that alternating between RL and IRL offers an effective pathway toward extending the exploration capacity of reasoning-oriented large language models.
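文中反复对比的 Pass@1 与 Pass@k,通常用社区常用的无偏估计量计算:n 次采样中有 c 次正确时,Pass@k = 1 − C(n−c, k)/C(n, k)。该估计量并非本论文提出,这里仅用于说明指标含义:

```python
from math import comb

def pass_at_k(n, c, k):
    # n 次采样、c 次正确时 Pass@k 的无偏估计
    if n - c < k:
        # 错误样本不足 k 个,任取 k 个必含正确解
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

RL 只提升 Pass@1 而压缩多样性时,pass_at_k(n, c, 1) 上升但大 k 下的增益停滞,这正是论文所述"概率压缩"限制探索的量化表现。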
[NLP-180] Bolzano: Case Studies in LLM-Assisted Mathematical Research
【速读】: 该论文旨在探索大语言模型(Large Language Models, LLMs)在数学与理论计算机科学领域中辅助科研的潜力,特别是其是否能独立或半自主地产生具有 publishable 水平的研究成果。解决方案的关键在于设计并实现一个名为 Bolzano 的开源多智能体系统,该系统通过协调多个证明者智能体(prover agents)与验证者智能体(verifier agent)之间的多轮交互,并维护跨轮次的持久知识库(persistent knowledge base),从而实现对复杂问题的逐步推理与修正。实验表明,Bolzano 在六项问题中有四项达到可发表研究水平,其中三项几乎完全自主完成,这为LLMs在数学研究中的实质性贡献提供了实证支持。
链接: https://arxiv.org/abs/2604.16989
作者: Jan Grebík,Pavel Hubáček,Martin Koutecký,Matěj Kripner,Václav Rozhoň,Robert Šámal,Adrián Zámečník
机构: Charles University (查理大学); Czech Academy of Sciences (捷克科学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 25 pages, 1 figure. Project page: this https URL
Abstract:We report new results on six problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open-source multi-agent LLM system. Bolzano orchestrates rounds of interaction between parallel prover agents and a verifier agent while maintaining a persistent knowledge base that is carried across rounds. Classified using the significance-autonomy taxonomy of Feng et al., four of the six results reach the level of publishable research, and three of the six were produced essentially autonomously by Bolzano. Our results provide evidence that LLMs can contribute meaningfully to mathematical research, complementing recent reports by Bubeck et al., Woodruff et al., and others.
[NLP-181] DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models
【速读】: 该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)训练数据中存在的噪声、冗余和对齐质量差的问题,这些问题限制了模型性能的提升。传统数据过滤方法虽能提高数据质量,但需在目标数据上进行额外的训练,导致计算成本高昂。其解决方案的关键在于提出DOSE框架,利用从未见过目标数据的预训练模型(off-the-shelf pretrained models)直接评估文本质量和图像-文本对齐度,从而无需任务特定微调即可筛选高质量样本。通过构建联合的质量-对齐分布并采用自适应加权采样策略,DOSE在保留长尾多样性的同时显著提升数据信息密度,实验证明该方法在标准VQA和数学推理基准上可达到甚至超越全量数据训练的效果,且具备高效性和可扩展性。
链接: https://arxiv.org/abs/2604.16979
作者: Biao Wu,Yiwu Zhong,Meng Fang,Ling Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 10 pages, 5 figures
Abstract:High-quality and diverse multimodal data are essential for improving vision-language models (VLMs), yet existing datasets often contain noisy, redundant, and poorly aligned samples. To address these problems, data filtering is commonly used to enhance the efficiency and performance of multimodal learning, but it introduces extra computational cost because filtering models are usually trained on the same data they are meant to screen. To reduce this cost, we study DOSE, which explores whether off-the-shelf pretrained models that have never seen the target data can be used to select training samples for larger and stronger multimodal models without any task-specific training. Even without fine-tuning, these models can effectively assess text quality and image-text alignment to guide data selection. Based on this, we build a joint quality-alignment distribution and apply adaptive weighted sampling to select informative samples while maintaining long-tail diversity. This approach enhances data diversity, enabling models trained on DOSE-filtered data to match or surpass those trained on the full dataset on standard VQA and math benchmarks. Extensive experiments demonstrate its effectiveness, efficiency, and scalability.
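"联合质量-对齐分布 + 自适应加权采样"的思路可以这样示意:把质量分与对齐分相乘得到联合分数,再用温度系数压平分布后做无放回加权采样,从而在偏向高分样本的同时保留长尾多样性。以下实现纯属示意(分数组合方式与温度值均为假设):

```python
import numpy as np

def dose_sample(quality, alignment, k, tau=0.5, seed=0):
    # quality / alignment: 由现成预训练模型打出的文本质量分与图文对齐分
    s = np.asarray(quality) * np.asarray(alignment)   # 联合分数
    w = np.power(np.clip(s, 1e-12, None), tau)        # tau < 1 压平分布,保留长尾
    p = w / w.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(s), size=k, replace=False, p=p)
```

tau 越小,低分样本被抽中的概率越高,多样性越强;tau = 1 则退化为按联合分数直接加权。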
[NLP-182] On Safety Risks in Experience-Driven Self-Evolving Agents ACL2026
【速读】: 该论文旨在解决自演化大语言模型代理(large language model agents)在积累和利用经验过程中可能引入的安全风险问题,特别是当这些代理仅从良性任务中学习时仍可能导致高风险场景下的安全性能下降。其核心发现是:经验积累的执行导向特性会强化代理“行动”而非“拒绝”的倾向,从而削弱安全性;而在混合良性与有害任务的真实场景中,虽然包含拒绝相关经验可缓解安全退化,却会导致过度拒绝现象,揭示出安全与实用性之间的根本权衡。因此,解决方案的关键在于设计更系统化的策略,以实现安全可靠的自适应机制,而非单纯依赖经验驱动的自我进化。
链接: https://arxiv.org/abs/2604.16968
作者: Weixiang Zhao,Yichen Zhang,Yingshuo Wang,Yang Deng,Yanyan Zhao,Xuda Zhi,Yongbo Huang,Hao He,Wanxiang Che,Bing Qin,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); Singapore Management University (新加坡管理大学); SERES
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2026
Abstract:Experience-driven self-evolution has emerged as a promising paradigm for improving the autonomy of large language model agents, yet its reliance on self-curated experience introduces underexplored safety risks. In this study, we investigate how experience accumulation and utilization in self-evolving agents affect safety performance across web-based and embodied environments. Notably, experience gathered solely from benign tasks can still compromise safety in high-risk scenarios. Further analysis attributes this degradation to the execution-oriented nature of accumulated experience, which reinforces agents’ tendency to act rather than refuse. In more realistic settings where agents encounter both benign and harmful tasks, refusal-related experience mitigates safety decline but induces over-refusal, revealing a fundamental safety-utility trade-off. Overall, our findings expose inherent limitations of current self-evolving agents and call for more principled strategies to ensure safe and reliable adaptation.
[NLP-183] MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像翻译任务中因难以有效捕捉图像内细粒度文本信息而导致的模态鸿沟问题,以及现有基于指令微调的方法易引发预训练知识参数冗余、削弱泛化性能的局限。其解决方案的关键在于提出一种**模态神经元感知微调(Modality Neuron-Aware Fine-Tuning, MNAFT)**策略:通过指令驱动的激活分析识别视觉与语言模块中的语言无关(language-agnostic)和语言特定(language-specific)神经元,并评估其在不同翻译任务中的重要性;随后仅对目标任务相关层中的这两类神经元进行选择性微调,保留其余神经元与层的原始知识,从而实现高效且高精度的跨模态理解与语言特定翻译。
链接: https://arxiv.org/abs/2604.16943
作者: Bo Li,Ningyuan Deng,Tianyu Dong,Shaobo Wang,Shaolin Zhu,Lijie Wen
机构: Tianjin University (天津大学); Tsinghua University (清华大学); Renmin University of China (中国人民大学); Shanghai Jiao Tong University (上海交通大学); Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL)
备注: Accepted by SCIS (SCIENCE CHINA Information Science)
Abstract:Multimodal large language models (MLLMs) have shown impressive capabilities, yet they often struggle to effectively capture the fine-grained textual information within images crucial for accurate image translation. This often leads to a modality gap between visual text inputs and textual inputs/outputs for image translation. Existing methods, primarily relying on instruction fine-tuning, risk parameter redundancy of pre-trained knowledge, hindering generalization performance. To address this, we introduce modality neuron-aware fine-tuning (MNAFT), a novel approach that takes advantage of the specialized roles of individual neurons within MLLMs for enhanced image translation. MNAFT identifies language-agnostic and language-specific neurons in both vision and language modules through an instruction-driven activation analysis, evaluating their importance in various translation tasks. We then perform selective fine-tuning, updating only the parameters of language-specific and language-agnostic neurons within the selected layers relevant to the target task, while preserving the knowledge encoded in other neurons and layers. Our extensive experiments on multiple benchmarks demonstrate that MNAFT significantly outperforms state-of-the-art image translation methods, including cascaded models, standard full fine-tuning, and parameter-efficient tuning techniques. Furthermore, we provide comprehensive analysis, including visualizations of neuron activations and clustering patterns, to offer insights into the roles of different neuron groups in mediating cross-modal understanding and facilitating accurate language-specific translation.
[NLP-184] No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs ACL2026
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)中提示策略选择的不一致性问题,即不同语言和任务下基于翻译的提示(Translation-based Prompting)效果差异显著,且缺乏通用最优策略。其核心解决方案是将提示策略选择建模为一个可学习的决策问题,并引入轻量级分类器来预测每个输入实例是否应采用原生提示(native prompting)或翻译提示(translation-based prompting)。该方法在四个基准测试上均取得统计学显著提升,并能泛化至训练时未见过的任务格式,关键发现在于语言资源水平(language resource level)而非翻译质量本身,决定了翻译提示的有效性。
链接: https://arxiv.org/abs/2604.16937
作者: Wei-Chi Wu,Sheng-Lun Wei,Hen-Hsen Huang,Hsin-Hsi Chen
机构: National Taiwan University (国立台湾大学); Academia Sinica (中央研究院); AI Research Center (AINTU), National Taiwan University (人工智能研究中心,国立台湾大学)
类目: Computation and Language (cs.CL)
备注: Accepted as a long findings paper at ACL 2026
Abstract:Translation-based prompting is widely used in multilingual LLMs, yet its effectiveness varies across languages and tasks. We evaluate prompting strategies across ten languages of different resource levels and four benchmarks. Our analysis shows that no single strategy is universally optimal. Translation strongly benefits low-resource languages even when translation quality is imperfect, high-resource languages gain little, and prompt-based self-routing underperforms explicit translation. Motivated by these findings, we formulate prompting strategy selection as a learned decision problem and introduce lightweight classifiers that predict whether native or translation-based prompting is optimal for each instance. The classifiers achieve statistically significant improvements over fixed strategies across four benchmarks and generalize to unseen task formats not observed during training. Further analysis reveals that language resource level, rather than translation quality alone, determines when translation is beneficial.
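论文所谓"轻量级分类器"即一个在实例特征(如语言资源水平)上训练的小模型,逐实例预测应采用原生提示还是翻译提示。下面用手写逻辑回归在玩具数据上示意(特征与标签均为假设,真实系统的特征会丰富得多):

```python
import numpy as np

def train_router(X, y, lr=1.0, steps=2000):
    # 极简逻辑回归路由器:y = 1 表示翻译提示更优,0 表示原生提示更优
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        g = p - y                                # 交叉熵梯度
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def route(x, w, b):
    # 逐实例路由决策
    return "translate" if (x @ w + b) > 0 else "native"
```

玩具设定与论文发现一致:资源水平低(特征取 0)时路由到翻译提示,资源水平高(特征取 1)时路由到原生提示。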
[NLP-185] MeasHalu: Mitigation of Scientific Measurement Hallucinations for Large Language Models with Enhanced Reasoning ACL2026
【速读】: 该论文旨在解决生成式 AI 在科学文献中提取定量测量数据时存在的严重幻觉问题(hallucination),这直接影响了自动化科学文献理解系统的可靠性。解决方案的关键在于提出 MeasHalu 框架,其核心包括:首先构建细粒度的测量特异性幻觉分类体系,将错误归类为量值、单位、修饰语和关系四个维度;其次采用两阶段推理感知微调策略,结合增强的科学数据与基于过程的监督信号以提升模型推理能力;最后引入渐进式奖励课程机制,针对性惩罚特定类型的幻觉,从而显著提高提取结果的真实性与准确性。
链接: https://arxiv.org/abs/2604.16929
作者: Ruijun Huang,Zhiqiao Kang,Yuxuan Zhu,Junxiong Li,Jiahao Zhao,Minghuan Tan,Feng Jiang,Min Yang
机构: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology
类目: Computation and Language (cs.CL)
备注: To appear in ACL 2026
Abstract:The accurate extraction of scientific measurements from literature is a critical yet challenging task in AI4Science, enabling large-scale analysis and integration of quantitative research findings. However, Large Language Models (LLMs) frequently exhibit severe hallucinations, which significantly undermine the reliability of automated scientific document understanding systems. To address this problem, we propose MeasHalu, a novel framework for mitigating scientific measurement hallucinations through enhanced reasoning and targeted optimization. We first present a fine-grained taxonomy of measurement-specific hallucinations, categorizing errors across quantities, units, modifiers, and relations. Our approach incorporates a two-stage reasoning-aware fine-tuning strategy using augmented scientific data and process-based supervision. Furthermore, we introduce a progressive reward curriculum designed to penalize specific hallucination types, significantly improving extraction faithfulness. Experimental results demonstrate that MeasHalu substantially reduces hallucination rates and improves overall accuracy on the MeasEval benchmark. This work provides a targeted solution to a key bottleneck in automated scientific knowledge extraction, facilitating more trustworthy and scalable machine-assisted scientific literature analysis.
[NLP-186] Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)和视觉-语言模型(Vision-Language Models, VLMs)在强化学习(Reinforcement Learning, RL)训练中因采用在线策略算法(如PPO、GRPO等)而导致的样本效率低下问题。这类方法在每次梯度更新后丢弃所有收集到的轨迹,尤其在多轮交互的代理任务中造成严重资源浪费。为提升样本利用效率,传统经验回放(Experience Replay, ER)技术被引入,但直接应用优先经验回放(Prioritized Experience Replay, PER)会导致性能下降——这是因为千亿参数模型的策略快速演化,导致存储的优先级过时,使无信息量的旧轨迹持续主导采样。解决方案的关键在于提出感知新鲜度的经验回放(Freshness-Aware PER),其核心创新是在任意PER优先级基础上乘以一个基于有效样本量分析的指数衰减因子,用以动态调整轨迹的采样权重,从而缓解优先级老化问题。实验证明,该方法在多个多步推理与竞赛类任务上显著优于在线策略基线,且标准PER因缺乏年龄衰减而表现更差。
链接: https://arxiv.org/abs/2604.16918
作者: Weiyu Ma,Yongcheng Zeng,Yan Song,Xinyu Cui,Jian Zhao,Xuhui Liu,Mohamed Elhoseiny
机构: King Abdullah University of Science and Technology (KAUST); Chinese Academy of Sciences, Institute of Automation (CASIA); AI Centre, Department of Computer Science, University College London; Zhongguancun Institute of Artificial Intelligence
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at this https URL.
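摘要中"在任意 PER 优先级上乘以指数年龄衰减因子"的核心机制,可以用如下最小示意说明(decay_rate 取值与缓冲区的数据结构均为本文假设,并非论文官方代码):

```python
import math
import random

def freshness_weight(age_steps: int, decay_rate: float = 0.1) -> float:
    """指数年龄衰减因子:轨迹距当前策略越久远,采样权重越低(示意)。"""
    return math.exp(-decay_rate * age_steps)

def sample_trajectory(buffer, current_step: int, decay_rate: float = 0.1):
    """按 "原始优先级 × 新鲜度" 的乘积权重采样一条轨迹。

    buffer 中每项为 (priority, step_collected, trajectory)。
    """
    weights = [
        p * freshness_weight(current_step - step, decay_rate)
        for p, step, _ in buffer
    ]
    return random.choices(buffer, weights=weights, k=1)[0]

# 两条优先级相同、年龄不同的轨迹:旧轨迹的有效采样权重被指数压低
buffer = [(1.0, 0, "old_traj"), (1.0, 90, "new_traj")]
w_old = 1.0 * freshness_weight(100 - 0)
w_new = 1.0 * freshness_weight(100 - 90)
picked = sample_trajectory(buffer, current_step=100)
```

这样即使存储的优先级已经过时,陈旧的高优先级轨迹也会因新鲜度衰减而逐渐退出采样,缓解摘要所述的"priority staleness"问题。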
[NLP-187] x1: Learning to Think Adaptively Across Languages and Cultures ACL2026
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在推理过程中普遍依赖单一主导语言、忽视语言间抽象差异的问题,从而限制了模型在跨语言任务中的表现。其解决方案的关键在于提出x1系列推理模型,该模型能够根据具体输入实例自适应地选择最优的推理语言,而非固定使用某一种语言进行推理。这一机制通过对比相同输入在不同语言下的推理轨迹进行训练,不扩展模型的知识边界,但显著提升了多语言数学推理和文化相关任务的表现。实验表明,语言选择本身是推理过程中的一个功能性组件,且在文化关联任务中,特定语言能更高效准确地召回文化知识,挑战了单纯依赖模型规模提升性能的假设。
链接: https://arxiv.org/abs/2604.16917
作者: Yangfan Ye,Xiaocheng Feng,Xiachong Feng,Yichong Huang,Zekun Yuan,Lei Huang,Weitao Ma,Qichen Hong,Yunfei Lu,Dandan Tu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室); The University of Hong Kong (香港大学); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL)
备注: Findings of ACL2026
Abstract:Languages encode distinct abstractions and inductive priors, yet most large language models (LLMs) overlook this diversity by reasoning in a single dominant language. In this work, we introduce x1, a family of reasoning models that can adaptively reason in an advantageous language on a per-instance basis. To isolate the effect of reasoning-language choice, x1 is constructed without expanding the model’s knowledge boundaries and is trained by contrasting linguistically distinct reasoning trajectories for the same input. Our extensive experiments demonstrate the benefits of adaptive multilingual reasoning across multilingual mathematical reasoning and culturally grounded tasks. Moreover, our results challenge a simplistic view of scaling laws: while scaling reduces cross-lingual disparities in procedural domains such as math reasoning, it does not eliminate the advantages of culture-associated languages in culturally grounded tasks, as we empirically show that such reasoning enables more efficient and accurate cultural knowledge recall. Overall, our findings establish language choice as a functional component of reasoning, with implications for building more generalist and globally competent reasoning models.
[NLP-188] When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)安全对齐评估中存在的偏差问题,即现有评估主要基于开放生成场景(open-ended generation),而忽视了现实应用中广泛存在的结构化决策任务(如多选题,Multiple-Choice Questions, MCQs),在这些任务中模型无法通过拒绝回答来规避风险。研究发现,将有害请求改写为所有选项均不安全的强制选择型MCQ,可系统性绕过模型的拒绝机制,即使这些模型在开放生成场景中能稳定拒绝相同内容。解决方案的关键在于识别并验证“约束诱导的对齐失效”这一新风险模式:结构化约束强度与违规率之间存在非线性关系(人类构造的MCQ呈现倒U形趋势,而高能力模型生成的MCQ则普遍接近饱和违规率),且该效应具有跨模型迁移性,揭示了当前安全评估低估了结构化任务中的潜在风险,强调需将受限决策场景纳入对齐测试的核心维度。
链接: https://arxiv.org/abs/2604.16916
作者: Yuheng Chen,Zhiyu Wu,Bowen Cheng,Tetsuro Takahashi
机构: Kagoshima University (鹿儿岛大学); Fudan University (复旦大学); China University of Petroleum-Beijing (中国石油大学(北京))
类目: Computation and Language (cs.CL)
备注:
Abstract:Safety alignment in large language models (LLMs) is primarily evaluated under open-ended generation, where models can mitigate risk by refusing to respond. In contrast, many real-world applications place LLMs in structured decision-making tasks, such as multiple-choice questions (MCQs), where abstention is discouraged or unavailable. We identify a systematic failure mode in this setting: reformulating harmful requests as forced-choice MCQs, where all options are unsafe, can systematically bypass refusal behavior, even in models that consistently reject equivalent open-ended prompts. Across 14 proprietary and open-source models, we show that forced-choice constraints sharply increase policy-violating responses. Notably, for human-authored MCQs, violation rates follow an inverted U-shaped trend with respect to structural constraint strength, peaking under intermediate task specifications, whereas MCQs generated by high-capability models yield near-saturation violation rates across constraints and exhibit strong cross-model transferability. Our findings reveal that current safety evaluations substantially underestimate risks in structured task settings and highlight constrained decision-making as a critical and underexplored surface for alignment failures.
[NLP-189] The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus
【速读】: 该论文旨在解决去中心化自治组织(DAO)在治理决策中如何有效抵御语义社会工程攻击的问题,特别是在采用小型语言模型(SLM)作为边缘原生宪法防火墙时,推理时计算资源(System 2)的引入是否能提升形式逻辑严谨性并保障共识稳定性。其解决方案的关键在于通过构建Sentinel-Bench这一840次推理的实证框架,在Qwen-3.5-9B模型上执行严格的模型内消融实验,对比冻结权重下切换推理模式的效果。研究发现,尽管System 2推理理论上更具逻辑深度,但实际在对抗性环境下引发认知崩溃(Reasoning Non-Convergence率达26.7%),导致共识稳定性下降至72.6%,延迟增加17倍,并暴露治理可提取价值(GEV)风险;而基于参数化直觉的System 1自回归基线则保持100%抗攻击鲁棒性和法律一致性,且达成状态终局性仅需<13秒,表明在拜占庭容错(BFT)约束下的边缘SLM场景中,System 1结构优于System 2迭代思辨。
链接: https://arxiv.org/abs/2604.16913
作者: Syed Muhammad Aqdas Rizvi
机构: Lahore University of Management Sciences (LUMS)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Working paper. 14 pages, 3 figures, 6 tables. Code and dataset: this https URL
Abstract:Decentralized Autonomous Organizations (DAOs) are inclined to explore Small Language Models (SLMs) as edge-native constitutional firewalls to vet proposals and mitigate semantic social engineering. While scaling inference-time compute (System 2) enhances formal logic, its efficacy in highly adversarial, cryptoeconomic governance environments remains underexplored. To address this, we introduce Sentinel-Bench, an 840-inference empirical framework executing a strict intra-model ablation on Qwen-3.5-9B. By toggling latent reasoning across frozen weights, we isolate the impact of inference-time compute against an adversarial Optimism DAO dataset. Our findings reveal a severe compute-accuracy inversion. The autoregressive baseline (System 1) achieved 100% adversarial robustness, 100% juridical consistency, and state finality in under 13 seconds. Conversely, System 2 reasoning introduced catastrophic instability, fundamentally driven by a 26.7% Reasoning Non-Convergence (cognitive collapse) rate. This collapse degraded trial-to-trial consensus stability to 72.6% and imposed a 17x latency overhead, introducing critical vulnerabilities to Governance Extractable Value (GEV) and hardware centralization. While rare (1.5% of adversarial trials), we empirically captured “Reasoning-Induced Sycophancy,” where the model generated significantly longer internal monologues (averaging 25,750 characters) to rationalize failing the adversarial trap. We conclude that for edge-native SLMs operating under Byzantine Fault Tolerance (BFT) constraints, System 1 parameterized intuition is structurally and economically superior to System 2 iterative deliberation for decentralized consensus. Code and Dataset: this https URL
[NLP-190] PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations ACL2026
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在高风险场景中部署时,其幻觉(Hallucination)问题缺乏系统性诊断与定位的问题。现有评估方法多依赖混合查询和生成结果层面的后验评分,虽能量化幻觉严重程度,但无法揭示幻觉在生成流程中的具体发生位置与成因。为此,作者提出PRISM这一受控基准,将幻觉解耦为四个维度:知识缺失、知识错误、推理错误和指令遵循错误,并将其映射至生成流程的三个阶段(记忆、指令理解、推理),从而实现细粒度、阶段感知的诊断式评估。其核心创新在于构建了结构化的评估框架,使研究者能够精准识别幻觉来源,进而发现不同缓解策略在各维度间的权衡关系,推动可信大语言模型的发展。
链接: https://arxiv.org/abs/2604.16909
作者: Yuhe Wu,Guangyu Wang,Yuran Chen,Jiatong Zhang,Yutong Zhang,Yujie Chen,Jiaming Shang,Guang Zhang,Zhuang Liu
机构: HKUST(GZ); NYUSH; DUFE; CUHK(SZ); CUFE
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL main conference 2026
Abstract:As large language models (LLMs) evolve from conversational assistants into agents capable of handling complex tasks, they are increasingly deployed in high-risk domains. However, existing benchmarks largely rely on mixed queries and posterior evaluation, output-level scoring, which quantifies hallucination severity but offers limited insight into where and why hallucinations arise in the generation pipeline. We therefore reformulate hallucination evaluation as a diagnostic problem and propose PRISM, a controlled benchmark that disentangles hallucinations into four dimensions: knowledge missing, knowledge errors, reasoning errors, and instruction-following errors, grounded in three stages of generation (memory, instruction, and reasoning). PRISM contains 9,448 instances across 65 tasks and supports fine-grained, stage-aware diagnostic evaluation. Evaluating 24 mainstream open-source and proprietary LLMs, we uncover consistent trade-offs across instruction following, memory retrieval, and logical reasoning, showing that mitigation strategies often improve specific dimensions at the expense of others. We hope PRISM provides a framework for understanding the specific mechanisms behind LLMs hallucinations, ultimately accelerating the development of trustworthy large language models.
[NLP-191] Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution
【速读】: 该论文旨在解决跨层转换器(Cross-layer Transcoder, CLT)特征解释中效率低下的问题,即现有特征解释流程通常对均匀采样的全部特征进行处理,而实际上仅有少量特征与目标任务行为相关,导致大量计算资源浪费在无关特征上。为实现高效且可解释的特征选择,作者提出首个面向CLT的端到端框架PIE(Pruning-Interpretation-Evaluation),其核心在于引入特征归因打补丁(Feature Attribution Patching, FAP)方法——通过聚合梯度加权写入贡献来量化CLT特征的重要性,并进一步设计FAP-Synergy协同重排序机制以提升关键特征的识别精度。实验表明,在不同预算下(K ∈ {50, 100, 200, 400, 800}),FAP系列方法在保持行为保真度(KL散度)的同时显著优于激活幅度(Activation-Magnitude)和ACDC风格的剪枝策略,尤其在严格预算条件下(如K=100),FAP-Synergy实现了约40倍压缩率,同时减少低质量特征并大幅降低解释与评估调用次数。
链接: https://arxiv.org/abs/2604.16889
作者: Qinhao Chen,Linyang He,Nima Mesgarani
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing feature-interpretation pipelines typically operate on uniformly sampled units, but only a small fraction of cross-layer transcoder (CLT) features matter for a target behavior, with the rest resulting in expensive feature explaining and evaluating costs. We introduce the first CLT-native end-to-end framework, PIE, connecting Pruning, automatic Interpretation, and interpretation Evaluation, enabling systematic measurement of behavioral fidelity and downstream interpretability under pruning. To achieve this, we propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions, and FAP-Synergy, a synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics. Across IOI and Doc-String, across budgets K \in \50, 100, 200, 400, 800\ , and across FAP, FAP-Synergy, Activation-Magnitude, and ACDC-style pruning, the FAP family consistently achieves the best or near-best fidelity, with FAP-Synergy providing its clearest gains in strict-budget regimes. On IOI with CLTs for Llama-3.2-1B and Gemma-2-2B, pruning to K=100 features matches the KL fidelity that random selection from the active feature set requires \approx 4 k features to achieve ( \approx 40\times compression), enabling \approx 40\times fewer interpretation/evaluation calls while substantially reducing low-quality features.
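摘要中 FAP"聚合梯度加权写入贡献"来为 CLT 特征打分并剪枝的思路,可以用 NumPy 写成如下草图(张量形状与聚合方式是基于摘要描述的合理假设,并非论文官方实现):

```python
import numpy as np

def fap_scores(acts: np.ndarray, decoder: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """梯度加权写入贡献的特征归因打分(示意)。

    acts:    (T, F)  各 token 位置上每个 CLT 特征的激活值
    decoder: (F, D)  每个特征写回残差流的方向向量
    grads:   (T, D)  目标度量(如 logit 差)对残差流的梯度
    返回:    (F,)    按位置聚合绝对贡献得到的每特征分数
    """
    # 特征 i 在位置 t 的写入贡献 ≈ acts[t, i] * (decoder[i] · grads[t])
    contrib = acts * (grads @ decoder.T)      # (T, F)
    return np.abs(contrib).sum(axis=0)

def prune_topk(scores: np.ndarray, k: int) -> np.ndarray:
    """在预算 K 下仅保留分数最高的 K 个特征的索引。"""
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
scores = fap_scores(rng.normal(size=(8, 16)),
                    rng.normal(size=(16, 4)),
                    rng.normal(size=(8, 4)))
kept = prune_topk(scores, k=4)
```

在此草图之上,FAP-Synergy 对应的协同重排序(摘要未给出公式)需要再考虑特征间的联合效应,此处从略。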
[NLP-192] Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨文化实体翻译中面临的挑战,即模型常输出字面或音译结果,而非符合语境的文化适配翻译。其核心问题是:如何有效激发模型参数中已编码的隐含知识,以实现高质量的跨文化实体翻译,而无需依赖外部知识库。解决方案的关键在于提出EA-RLVR(Entity-Anchored Reinforcement Learning with Verifiable Rewards)训练框架,该框架通过基于实体级别的可验证奖励信号锚定监督,并引入轻量级结构门控机制稳定优化过程,从而引导模型学习鲁棒的推理机制而非简单模仿参考译文。实验表明,仅用7k样本训练即可显著提升Qwen3-14B在未见过实体上的翻译准确率(从23.66%提升至31.87%),并带来跨任务迁移性能增益(如WMT24++上XCOMET指标提升+1.59)。
链接: https://arxiv.org/abs/2604.16881
作者: Jiang Zhou,Xiaohu Zhao,Xinwei Wu,Tianyu Dong,Hao Wang,Yangyang Liu,Heng Liu,Linlong Xu,Longyue Wang,Weihua Luo,Deyi Xiong
机构: TJUNLP Lab, Tianjin University, China; Alibaba Group, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 11 figures, 11 tables
Abstract:Cross-cultural entity translation remains challenging for large language models (LLMs) as literal or phonetic renderings are usually yielded instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. EA-RLVR anchors supervision on a verifiable, entity-level reward signal and incorporates lightweight structural gates to stabilize optimization. This design steers the model toward learning a robust reasoning process rather than merely imitating reference translations. We evaluate EA-RLVR on XC-Translate and observe consistent improvements in both entity translation accuracy and out-of-domain generalization. Specifically, training on merely 7k samples boosts Qwen3-14B’s entity translation accuracy from 23.66% to 31.87% on a 50k test set comprising entirely unseen entities. The learned entity translation ability also transfers to general translation, yielding +1.35 XCOMET on WMT24++, which scales to +1.59 with extended optimization. Extensive analyses of pass@k dynamics and reward formulations attribute these gains to superior sampling efficiency and a stable optimization landscape.
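EA-RLVR 的"实体级可验证奖励 + 轻量结构门控"组合可以用下面的最小示意表达(实体匹配采用朴素子串命中率,门控的长度阈值均为本文假设,并非论文的具体设定):

```python
def entity_reward(output: str, gold_entities: list) -> float:
    """实体级可验证奖励:译文中命中的目标文化实体比例(示意)。"""
    if not gold_entities:
        return 0.0
    hits = sum(1 for e in gold_entities if e in output)
    return hits / len(gold_entities)

def structural_gate(output: str, min_len: int = 5, max_len: int = 500) -> bool:
    """轻量结构门控:过滤空输出或失控的超长生成,以稳定优化。"""
    return min_len <= len(output) <= max_len

def ea_rlvr_reward(output: str, gold_entities: list) -> float:
    """未通过结构门控的样本奖励记零,否则返回实体命中奖励。"""
    return entity_reward(output, gold_entities) if structural_gate(output) else 0.0
```

这种锚定在实体层面的奖励是可程序化验证的,无需外部知识库或奖励模型,与摘要中"verifiable, entity-level reward signal"的描述相对应。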
[NLP-193] A Community-Based Approach for Stance Distribution and Argument Organization
【速读】: 该论文旨在解决在线辩论平台和社交媒体上海量、复杂且多角度的论点内容给读者带来的理解与整合困难问题,尤其在社会政治议题中,用户难以有效把握不同立场间的关联与结构。解决方案的关键在于提出一种无监督的基于图的社区化论点组织方法:通过构建包含论点间多种关系(如主题相似性、语义连贯性、共享关键词及共同实体)的交互图,并利用社区检测算法识别出具有同质性和异质性观点分布的论点群组,再通过策略性图操作简化结构,从而为用户提供可读性强且信息全面的论点模式摘要。该方法无需训练数据,可高效处理数百篇文章并保留论点间的细微关系。
链接: https://arxiv.org/abs/2604.16852
作者: Rudra Ranajee Saha,Laks V. S. Lakshmanan,Raymond T. Ng
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The proliferation of online debate platforms and social media has led to an unprecedented volume of argumentative content on controversial topics from multiple perspectives. While this wealth of perspectives offers opportunities for developing critical thinking and breaking filter bubbles (Pariser 2011), the sheer volume and complexity of arguments make it challenging for readers to synthesize and comprehend diverse viewpoints effectively. We present an unsupervised graph-based approach for community-based argument organization that helps users navigate and understand complex argumentative landscapes. Our system analyzes collections of topic-focused articles and constructs a rich interaction graph by capturing multiple relationship types between arguments: topic similarity, semantic coherence, shared keywords, and common entities. We then employ community detection to identify argument communities that reveal homogeneous and heterogeneous viewpoint distributions. The detected communities are simplified through strategic graph operations to present users with digestible, yet comprehensive summaries of key argumentative patterns. Our approach requires no training data and can effectively process hundreds of articles while preserving nuanced relationships between arguments. Experimental results demonstrate our system’s ability to identify meaningful argument communities and present them in an interpretable manner, facilitating users’ understanding of complex socio-political debates.
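该系统"多关系加权建图 + 社区检测"的流程可以用如下纯 Python 草图示意。为保持自包含,这里仅使用共享关键词与共享实体两种关系,并以并查集的连通分量代替真实的社区检测算法(论文还使用主题相似性与语义连贯性等关系;权重与阈值均为假设):

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """两个集合的 Jaccard 相似度。"""
    return len(a & b) / len(a | b) if a | b else 0.0

def build_edges(args, kw_w=0.5, ent_w=0.5, threshold=0.3):
    """按共享关键词与共享实体的加权相似度,为论点对建加权边(示意)。"""
    edges = []
    for i, j in combinations(range(len(args)), 2):
        w = (kw_w * jaccard(args[i]["keywords"], args[j]["keywords"])
             + ent_w * jaccard(args[i]["entities"], args[j]["entities"]))
        if w >= threshold:
            edges.append((i, j, w))
    return edges

def communities(n, edges):
    """用并查集对高于阈值的边求连通分量,作为社区检测的最简替代。"""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j, _ in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

args = [
    {"keywords": {"tax", "economy"}, "entities": {"IRS"}},
    {"keywords": {"tax", "budget"}, "entities": {"IRS"}},
    {"keywords": {"climate"}, "entities": {"EPA"}},
]
edges = build_edges(args)
comms = communities(len(args), edges)
```

前两条围绕税收、共享 IRS 实体的论点被归入同一社区,气候论点自成一区,对应摘要中按关系强度聚合论点群组的思路。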
[NLP-194] DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在安全微调后出现的“身份盲视”(identity-blindness)问题,即模型在面对涉及人口群体差异的合理情境时(如基于祖先的疾病发病率或宗教招聘偏好),错误地回避承认差异,导致回答不准确、无谓拒绝或机械采用“平等对待”的默认策略。解决方案的关键在于引入DART(Distill–Audit–Repair Training)框架:首先通过教师模型蒸馏出标签条件下的推理逻辑,其次对生成输出进行危害漂移(harm drift)审计,识别因准确率提升而加剧的有害内容(如强化偏见、引入不当假设或忽略潜在危害),最后利用严重性加权微调修复问题样本。实验表明,DART在多个基准测试中显著提升了差异识别准确性(从39.0%提升至68.8%),同时将危害漂移案例减少72.6%,并在真实世界多领域查询中将差异适切响应比例从39.8%提升至77.5%,同时将拒绝率从34.3%降至3.0%,证明了显式检测与修复机制可有效协调模型准确性与安全性之间的矛盾。
链接: https://arxiv.org/abs/2604.16845
作者: Ziwen Pan,Zihan Liang,Jad Kabbara,Ali Emami
机构: Emory University(埃默里大学); MIT(麻省理工学院)
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of ACL 2026
Abstract:Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic “equal-treatment” defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the baseline identified. To mitigate this, we introduce DART (Distill–Audit–Repair Training), which distills label-conditioned reasoning from a teacher, audits outputs for harm drift cases relative to baseline, and repairs problematic cases via severity-weighted fine-tuning. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8%, with largest gains on equal-treatment prompts (11.3% - 72.6%), while reducing harm drift cases by 72.6%. It also transfers to 280 open-ended real-world queries across medical, legal, policy, and educational domains, improving difference-appropriate responses from 39.8% to 77.5% while reducing refusals from 34.3% to 3.0%. Our results demonstrate that accuracy and safety need not conflict when explicit detection and repair mechanisms are in place.
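DART 第三步"severity-weighted fine-tuning(严重性加权微调)"背后的加权思想可以用如下示意说明(severity 的离散等级与权重公式均为本文假设,并非论文给出的具体形式):

```python
def severity_weight(severity: int, base: float = 1.0, gamma: float = 0.5) -> float:
    """危害漂移样本的损失权重:审计判定的漂移越严重,修复阶段权重越大(示意)。

    severity: 0(无漂移)~ 3(严重漂移),等级划分为假设。
    """
    return base + gamma * severity

def repair_loss(example_losses, severities) -> float:
    """按严重性加权的归一化平均损失,用作修复阶段的微调目标。"""
    weighted = [l * severity_weight(s) for l, s in zip(example_losses, severities)]
    return sum(weighted) / sum(severity_weight(s) for s in severities)
```

效果是让审计阶段标记为"较基线更有害"的样本在修复微调中占据更大梯度份额,而非平均对待全部训练样本。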
[NLP-195] HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在长期交互中因固定上下文窗口导致的记忆断层问题,即现有记忆系统通常将对话历史表示为无结构的嵌入向量,仅依赖语义相似性进行检索,无法模拟人类记忆中通过重复共激活形成的关联结构。其解决方案的关键在于提出一种受认知神经科学启发的生物启发式记忆架构 HeLa-Mem,该架构基于三个核心机制——关联(association)、巩固(consolidation)和扩散激活(spreading activation),构建了一个双层级动态图结构:一是通过共激活模式演化的事件记忆图(episodic memory graph),二是由反射代理(Reflective Agent)识别密集连接的记忆枢纽并通过赫布蒸馏(Hebbian Distillation)提炼为结构化可复用的语义知识的语义记忆存储(semantic memory store)。此设计同时利用语义相似性和学习到的关联关系,实现了对人类认知中情景-语义记忆区分的仿生建模。
链接: https://arxiv.org/abs/2604.16839
作者: Jinchang Zhu,Jindong Li,Cheng Zhang,Jiahong Liu,Menglin Yang
机构: The Hong Kong University of Science and Technology (Guangzhou); Jilin University; The Chinese University of Hong Kong
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026
Abstract:Long-term memory is a critical challenge for Large Language Model agents, as fixed context windows cannot preserve coherence across extended interactions. Existing memory systems represent conversation history as unstructured embedding vectors, retrieving information through semantic similarity. This paradigm fails to capture the associative structure of human memory, wherein related experiences progressively strengthen interconnections through repeated co-activation. Inspired by cognitive neuroscience, we identify three mechanisms central to biological memory: association, consolidation, and spreading activation, which remain largely absent in current research. To bridge this gap, we propose HeLa-Mem, a bio-inspired memory architecture that models memory as a dynamic graph with Hebbian learning dynamics. HeLa-Mem employs a dual-level organization: (1) an episodic memory graph that evolves through co-activation patterns, and (2) a semantic memory store populated via Hebbian Distillation, wherein a Reflective Agent identifies densely connected memory hubs and distills them into structured, reusable semantic knowledge. This dual-path design leverages both semantic similarity and learned associations, mirroring the episodic-semantic distinction in human cognition. Experiments on LoCoMo demonstrate superior performance across four question categories while using significantly fewer context tokens. Code is available on GitHub: this https URL
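HeLa-Mem 的两个核心机制——共激活驱动的赫布式边强化,以及按连接密度识别"记忆枢纽"以供语义蒸馏——可以用如下最小 Python 草图示意(学习率、衰减率与更新公式均为本文假设,并非论文官方实现):

```python
from collections import defaultdict
from itertools import combinations

class HebbianMemoryGraph:
    """共激活驱动的事件记忆图:同次检索中一起被激活的记忆间连接增强(示意)。"""

    def __init__(self, lr: float = 0.2, decay: float = 0.02):
        self.w = defaultdict(float)   # 边权: frozenset({a, b}) -> 连接强度
        self.lr, self.decay = lr, decay

    def co_activate(self, memory_ids):
        """赫布式更新: Δw = lr * (1 - w);所有已有边同时缓慢衰减。"""
        for e in list(self.w):
            self.w[e] = max(0.0, self.w[e] - self.decay)
        for a, b in combinations(memory_ids, 2):
            e = frozenset((a, b))
            self.w[e] += self.lr * (1.0 - self.w[e])

    def hubs(self, k: int = 1):
        """按加权度数找出连接最密集的记忆节点,作为语义蒸馏的候选枢纽。"""
        deg = defaultdict(float)
        for e, w in self.w.items():
            for node in e:
                deg[node] += w
        return sorted(deg, key=deg.get, reverse=True)[:k]

g = HebbianMemoryGraph()
for _ in range(5):
    g.co_activate(["m1", "m2"])   # 反复共激活 -> m1-m2 连接变强
g.co_activate(["m2", "m3"])       # 只共激活一次 -> 连接较弱
```

在此之上,摘要中的 Reflective Agent 可对 `hubs()` 返回的节点邻域做总结,写入独立的语义记忆存储,对应"赫布蒸馏"一步。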
[NLP-196] Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
【速读】: 该论文旨在解决LoRA(Low-Rank Adaptation)适配器在合并过程中性能下降的问题。现有方法通常将LoRA更新矩阵 ΔW=BA 视为单一对象,未区分两个低秩矩阵 A 和 B 的作用,导致合并后任务特定信息丢失。研究表明,性能下降的主要原因是输出侧矩阵 B 在不同任务间重复使用少量共享方向,从而在合并时过度强调这些共享方向,掩盖了任务特异性信息。解决方案的关键在于提出Pico(Pre-merge interference calibration in output-space),一种无需数据的校准方法:在合并前对 B 进行缩放以抑制过共享方向,并在合并后重新调整整体更新幅度。该方法可无缝集成至Task Arithmetic、TIES和TSV-M等主流合并策略中,在数学、编程、金融与医疗等多个领域基准测试中显著提升平均准确率(提升3.4–8.3点),甚至超越联合训练的LoRA模型,证明了分离处理 A 和 B 对优化合并效果的重要性。
链接: https://arxiv.org/abs/2604.16826
作者: Yixuan Tang,Yi Yang
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Merging separately trained LoRA adapters is a practical alternative to joint multi-task training, but it often hurts performance. Existing methods usually treat the LoRA update \Delta W = BA as a single object and do not distinguish the two LoRA matrices. We show that the main source of LoRA merge interference comes from the output-side matrix B . Across tasks, B repeatedly uses a small set of shared directions, while A remains much more task-specific. As a result, the merged adapter overemphasizes these shared directions, and task-specific information is lost. We propose Pico (Pre-merge interference calibration in output-space), a data-free method that calibrates B before merge by downscaling over-shared directions and then rescaling the merged update. Pico plugs directly into existing merging methods such as Task Arithmetic, TIES, and TSV-M. Across eight different benchmarks from math, coding, finance, and medical domains, Pico improves average accuracy by 3.4-8.3 points over the corresponding base method and achieves the best overall average performance. Pico also enables merged adapters to outperform the LoRA trained with all task data. These results show that LoRA merging works better when the two LoRA matrices are treated separately.
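Pico"合并前对 B 做校准"的做法可以用 NumPy 写成如下草图:把各任务的 B 拼接后取前若干个左奇异方向作为"过度共享方向",再把每个 B 在该子空间上的分量缩小(alpha、n_shared 以及用 SVD 提取共享方向的方式均为基于摘要的假设,并非论文的确切算法):

```python
import numpy as np

def pico_calibrate(Bs, alpha: float = 0.5, n_shared: int = 1):
    """合并前校准:压制各任务 B 矩阵中过度共享的输出方向(示意)。

    Bs: 各任务的 LoRA B 矩阵列表,形状均为 (d_out, r)。
    """
    stacked = np.concatenate(Bs, axis=1)            # (d_out, r*T)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    P = U[:, :n_shared] @ U[:, :n_shared].T         # 投影到共享输出子空间
    # 共享子空间上的分量缩放为 alpha 倍, 其余分量保持不变
    return [B - (1.0 - alpha) * (P @ B) for B in Bs]

def merge_updates(Bs, As, scale: float = 1.0):
    """对各任务更新 ΔW_t = B_t A_t 取平均, 并对合并后的整体更新重缩放。"""
    return scale * sum(B @ A for B, A in zip(Bs, As)) / len(Bs)

rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 1))   # 人为注入一个跨任务共享的输出方向
Bs = [shared @ rng.normal(size=(1, 2)) + 0.1 * rng.normal(size=(8, 2))
      for _ in range(3)]
Bs_cal = pico_calibrate(Bs, alpha=0.5)
merged = merge_updates(Bs_cal, [np.ones((2, 4)) for _ in Bs_cal])
```

校准后共享方向不再在合并时被重复叠加放大,任务特异性的分量得以保留;该步骤完全不需要数据,对应摘要中的"data-free"性质。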
[NLP-197] PersonalHomeBench: Evaluating Agents in Personalized Smart Homes
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 系统在复杂且个性化的智能家庭环境中部署时,其适应性和可靠性尚未得到充分评估的问题。解决方案的关键在于构建 PersonalHomeBench——一个面向个性化智能家庭场景的基准测试平台,通过迭代式构建丰富的家庭状态并生成情境依赖的个性化任务,结合 PersonalHomeTools 工具箱实现家电控制、信息检索与情境理解,从而系统性地评估基础模型在单模态和多模态观测下的反应式与主动性能力。实验表明,随着任务复杂度上升,模型性能显著下降,尤其在反事实推理和部分可观测条件下暴露明显短板,凸显了该基准在揭示个性化代理推理与规划鲁棒性局限方面的价值。
链接: https://arxiv.org/abs/2604.16813
作者: Nikhil Verma,InJung Yang,Sungil Kim,KoKeun Kim,YoungJoon Kim,Manasa Bharadwaj,Yolanda Liu,Kevin Ferreira
机构: LG Research Teams (LG 研究团队)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注: 53 pages
Abstract:Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.
[NLP-198] When Informal Text Breaks NLI: Tokenization Failure, Distribution Shift, and Targeted Mitigations
【速读】: 该论文旨在解决非正式语言形式(如俚语、表情符号、Gen-Z填充词)对自然语言推理(NLI)模型准确率的负面影响问题。研究发现,不同类型的非正式表达会导致模型性能下降的机制各异:表情符号因被分词器映射为[UNK]而破坏输入信号,属于“信号丢失型”失败;而噪声填充词虽在词汇表内但缺乏训练数据支持,导致模型赋予其不合理的推理权重,属于“语义误判型”失败。解决方案的关键在于区分这两种失败模式并采取针对性干预——预处理可恢复表情符号场景下的性能(通过文本规范化减少[UNK]),数据增强则提升噪声词场景下的鲁棒性(通过引入含噪声样本训练模型)。结合两者形成混合策略后,在SNLI上使ELECTRA-small模型准确率从75.88%提升至88.93%,且未对干净文本造成显著影响,同时超越GPT-4o-mini零样本表现。
链接: https://arxiv.org/abs/2604.16787
作者: Avinash Goutham Aluguvelly
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We study how informal surface forms degrade NLI accuracy in ELECTRA-small (14M) and RoBERTa-large (355M) across four transforms applied to SNLI and MultiNLI: slang substitution, emoji replacement, Gen-Z filler tokens, and their combination. Slang substitution (replacing formal words with informal equivalents, e.g., “going to” - “gonna”, “friend” - “homie”) causes minimal degradation (at most 1.1pp): slang vocabulary falls largely within WordPiece coverage, so the tokenizer handles it without signal loss. Emoji replaces content words with Unicode characters that ELECTRA’s WordPiece tokenizer maps to [UNK], destroying the input signal before any learned parameters see it (93.6% of emoji examples contain at least one [UNK], mean 2.91 per example). Noise tokens (no cap, deadass, tbh) are fully in-vocabulary but absent from NLI training data, consistent with the model assigning them inferential weight they do not carry. The two failure modes respond to different interventions: preprocessing recovers emoji accuracy by normalizing text before tokenization; augmentation handles noise by exposing the model to noise-bearing examples during training. A hybrid of both achieves 88.93% on the combined variant for ELECTRA on SNLI (up from 75.88%), with no statistically significant drop on clean text. Against GPT-4o-mini zero-shot, unmitigated ELECTRA is significantly worse on transformed variants (p 0.0001); hybrid ELECTRA surpasses it across all SNLI variants and reaches statistical parity on MultiNLI.
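摘要区分的两类失败模式中,"emoji 被映射为 [UNK] 导致信号丢失"及其预处理缓解,可以用一个极简的词级分词器示意(真实的 WordPiece 还会做子词切分;这里的词表与 emoji 映射均为玩具示例,并非论文所用资源):

```python
def tokenize(text: str, vocab: set, unk: str = "[UNK]") -> list:
    """极简词级分词:不在词表中的 token 一律映射为 [UNK](示意)。"""
    return [t if t in vocab else unk for t in text.split()]

def normalize_emoji(text: str, emoji_map: dict) -> str:
    """预处理缓解:在分词之前把 emoji 规范化为词表内的文字描述。"""
    for emoji, word in emoji_map.items():
        text = text.replace(emoji, word)
    return text

vocab = {"the", "dog", "is", "happy"}
emoji_map = {"🐶": "dog", "😊": "happy"}
raw = "the 🐶 is 😊"
before = tokenize(raw, vocab)                       # 内容词全部变成 [UNK]
after = tokenize(normalize_emoji(raw, emoji_map), vocab)
```

与之相对,噪声填充词(如 deadass、tbh)本身在词表内,预处理无能为力,只能靠数据增强让模型学会忽略它们——这正是摘要中两种干预各司其职的原因。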
[NLP-199] StageMem: Lifecycle-Managed Memory for Language Models
【速读】: 该论文旨在解决长时程语言模型(Long-horizon Language Model, LLM)系统中内存管理的现实挑战,即传统静态存储式记忆机制无法有效应对部署场景下的动态控制需求:如保留过多不确定信息、重要信息被错误顺序遗忘,以及用户对记忆持久性缺乏信任等问题。解决方案的关键在于提出StageMem框架,将记忆视为一个状态化过程而非被动存储库,通过引入三个阶段(瞬态、工作态和持久态)和对每个记忆项显式建模其置信度(confidence)与强度(strength),实现从浅层接纳到长期承诺的分层决策机制——信息可先以低成本写入,再根据证据积累和压力变化逐步晋升、保留、更新或驱逐,从而在可控压力下优先保护后期重要的内容,同时降低内存负担和深层污染风险。
链接: https://arxiv.org/abs/2604.16774
作者: Jiarui Han
机构: University of Waterloo (滑铁卢大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon language model systems increasingly rely on persistent memory, yet many current designs still treat memory primarily as a static store: write an item, place it into memory, and retrieve it later if needed. We argue that this framing does not adequately capture the practical memory-control problem in deployed LLM systems. In realistic settings, the difficulty is often not merely forgetting useful information, but retaining too many uncertain items, forgetting important content in the wrong order, and giving users little trust in what will persist over time. We propose StageMem, a lifecycle-managed memory framework that treats memory as a stateful process rather than a passive repository. StageMem organizes memory into three stages – transient, working, and durable memory – and models each item with explicit confidence and strength. This separates shallow admission from long-term commitment: information may first be written at low cost and only later be promoted, retained, updated, or evicted as evidence and pressure evolve. Under controlled pressure regimes, this decomposition helps preserve late-important content while keeping memory burden and deeper-tier pollution more controlled. Adapted external tasks provide boundary evidence that the same schema remains compatible with stronger retrieval structure outside pure synthetic control. We present StageMem as a principled decomposition of the memory-control problem for language model systems.
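StageMem 的三阶段生命周期(transient → working → durable)以及"置信度驱动晋升、压力驱动驱逐"可以用如下最小 Python 草图示意(晋升阈值、置信度增量与驱逐排序规则均为本文假设,并非论文官方实现):

```python
from dataclasses import dataclass

STAGES = ("transient", "working", "durable")

@dataclass
class MemoryItem:
    content: str
    stage: str = "transient"       # 低成本写入: 先进入瞬态层
    confidence: float = 0.5
    strength: float = 1.0

def reinforce(item: MemoryItem, delta: float = 0.25) -> MemoryItem:
    """证据累积:提高置信度,达到阈值则晋升到下一阶段(示意)。"""
    item.confidence = min(1.0, item.confidence + delta)
    idx = STAGES.index(item.stage)
    if item.confidence >= 0.9 and idx < len(STAGES) - 1:
        item.stage = STAGES[idx + 1]
        item.confidence = 0.5       # 晋升后在新阶段重新积累证据
    return item

def evict_under_pressure(items, capacity: int):
    """压力驱逐:优先淘汰浅层、低置信度条目,保护深层记忆(示意)。"""
    ranked = sorted(items, key=lambda m: (STAGES.index(m.stage), m.confidence))
    return ranked[len(items) - capacity:] if len(items) > capacity else items

m = MemoryItem("用户偏好深色主题")
for _ in range(2):
    reinforce(m)                    # 反复获得证据后从 transient 晋升

items = [MemoryItem("a", "transient", 0.3),
         MemoryItem("b", "durable", 0.6),
         MemoryItem("c", "working", 0.8)]
kept = evict_under_pressure(items, capacity=2)
```

浅层接纳与长期承诺由此解耦:写入几乎零门槛,但只有证据持续累积的条目才能进入 durable 层,压力到来时浅层不确定条目最先被清退。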
[NLP-200] When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms ACL2026
【速读】: 该论文旨在解决当前事实核查(fact-checking)流程在应对音频 misinformation(虚假信息)时的失效问题。现有方法主要针对书面文本设计,未能充分考虑音频内容的独特性:一方面,音频具有“语音特性”(prosody),如语调、节奏和情感表达,赋予其更强的说服力;另一方面,音频具有“对话特性”(conversational nature),表现为多轮对话、多说话人及跨集连续性,这使得信息验证面临传统文本方法难以应对的结构性挑战。论文指出,解决方案的关键在于重构事实核查流程,使其围绕音频的“口语化”与“对话化”现实进行重新设计,从而实现对音频媒介中虚假信息的有效识别与验证。
链接: https://arxiv.org/abs/2604.16767
作者: Chaewan Chun,Delvin Ce Zhang,Dongwon Lee
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to ACL 2026 Main Conference
Abstract:Audio platforms have evolved beyond entertainment. They have become central to public discourse, from podcasts and radio to WhatsApp voice notes and live streams. With millions of shows and hundreds of millions of listeners, audio platforms are now a major channel for misinformation. Yet existing fact-checking pipelines are mostly designed for written claims, overlooking the unique properties of spoken media. We argue that audio misinformation is not merely textual content with transcripts: it is structurally different because it is both spoken - carrying persuasive force through prosody, pacing, and emotion - and conversational - unfolding across turns, speakers, and episodes. These dual properties introduce verification difficulties that traditional methods rarely face. This position paper synthesizes evidence across modalities and platforms, examines datasets and methods, and highlights why existing pipelines fail on audio. We argue that advancing fact-checking requires rethinking verification pipelines around the spoken and conversational realities of audio.
[NLP-201] Mapping Election Toxicity on Social Media across Issue Ideology and Psychosocial Dimensions
【速读】: 该论文旨在解决在线政治敌意(online political hostility)在不同竞选议题和政治意识形态下的差异性问题,以及毒性表达背后的心理社会信号与道德框架机制。其核心解决方案在于构建一个大规模、多维度的分析框架:首先对X(Twitter)平台在2024年美国总统大选前五周的帖子按10类主要议题进行分类;其次通过人机协同的大语言模型(LLM-assisted)标注方法估计文本意识形态倾向;再利用基于LLM的毒性检测模型识别有害内容;最后结合情感基调(emotional tone)与道德基础(moral framing)等心理语言学维度,系统解析毒性生成机制。关键创新在于揭示了毒性强度和类型在议题间存在显著异质性——身份相关议题毒性最强,骚扰行为最普遍,仇恨言论集中于身份导向讨论;同时发现左右翼在相同议题中情绪模式趋同(情绪镜像效应),且道德基础虽有重叠但受议题语境强烈调节,从而证明线上政治毒性具有高度情境依赖性,需采用议题敏感型策略进行测量与干预。
链接: https://arxiv.org/abs/2604.16765
作者: Lei Cao,Wen Zeng,Xinyue Wu,Eun Cheol Choi,Emilio Ferrara
机构: 未知
类目: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Online political hostility is pervasive, yet it remains unclear how toxicity varies across campaign issues and political ideology, and what psychosocial signals and framing accompany toxic expression online. In this work, we present a large-scale analysis of discourse on X (Twitter) during the five weeks surrounding the 2024 U.S. presidential election. We categorize posts into 10 major campaign issues, estimate the ideology of posts using a human-in-the-loop LLM-assisted annotation process, detect harmful content with an LLM-based toxicity detection model, and then examine the psychological drivers of toxic content. We use these annotated data to examine how harmful content varies across campaign issues and ideologies, as well as how emotional tone and moral framing shape toxicity in election discussions. Our results show issue heterogeneity in both the prevalence and intensity of toxicity. Identity-related issues displayed the highest toxicity intensity. As for specific harm categories, harassment was most prevalent and intense across most of the issues, while hate concentrated in identity-centered debates. Partisan posts contained more harmful content than neutral posts, and ideological asymmetries in toxicity varied by issue. In terms of psycholinguistic dimensions, we found that toxic discourse is dominated by high-arousal negative emotions. Left- and right-leaning posts often exhibit similar emotional profiles within the same issue domain, suggesting emotional mirroring. Partisan groups frequently rely on overlapping moral foundations, while issue context strongly shapes which moral foundations become most salient. These findings provide a fine-grained account of toxic political discourse on social media and highlight that online political toxicity is highly context-dependent, underscoring the need for issue-sensitive approaches to measuring and mitigating it.
[NLP-202] Expressing Social Emotions: Misalignment Between LLM s and Human Cultural Emotion Norms
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨文化社会情感表达中是否能够准确模拟人类行为的问题,尤其是当模型被用于 culturally nuanced(文化敏感性)的人机交互时,其输出若与目标文化群体的情感表达模式不一致,可能导致误导性建议或不当行为。解决方案的关键在于构建一个心理导向的评估框架,通过对比欧洲裔美国人和拉丁美洲参与者在表达亲社会(engaging)与疏离(disengaging)情绪时的文化差异,系统评估六种前沿LLM的表现。研究发现,所有模型均倾向于过度表达亲社会情绪,且对欧洲裔美国人格(persona)的表达偏差尤为显著,同时响应高度集中且缺乏人类情感表达的多样性;进一步的消融分析表明,这种偏差在不同采样温度下仍具鲁棒性,部分受提示语语言影响,并依赖于响应生成格式。这揭示了当前LLM在建模文化与情绪交互机制上的局限性,尤其在涉及跨文化情感互动场景中的部署风险。
链接: https://arxiv.org/abs/2604.16757
作者: Sree Bhattacharyya,Manas Mehta,Leona Chen,Cristina Salvador,Agata Lapedriza,Shiran Dudy,James Z. Wang
机构: 1. University of California, Riverside (加州大学河滨分校); 2. Google (谷歌); 3. DeepMind (深度思维); 4. University of Barcelona (巴塞罗那大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Under Review
Abstract:The expression of emotions that serve social purposes, such as asserting independence or fostering interdependence, is central to human interactions and varies systematically across cultures. As LLMs are increasingly used to simulate human behavior in culturally nuanced interactions, it is important to understand whether they faithfully capture human patterns of social emotion expression. When LLM responses are not culturally aligned, their utility is compromised – particularly when users assume they are interacting with a culturally attuned interlocutor, and may act on advice that proves inappropriate in their cultural context. We present a psychologically informed evaluation framework of cross-cultural social emotion expression in LLMs. Using a human study comparing European American and Latin American participants’ expression of engaging and disengaging emotions, we evaluate six frontier LLMs on their ability to reflect culturally differentiated patterns for expressing social emotions. We find systematic misalignment between model and human behavior: all models express engaging emotions more than disengaging ones, with particularly stark differences observed for the generally well-represented European American persona. We further highlight that LLM responses are highly concentrated and deterministic, failing to capture the diversity of human responses in expressing social emotions. Our ablation analyses reveal that these patterns are robust to sampling temperatures, partially sensitive to prompt language, and dependent on the response elicitation format. Together, our findings highlight limitations in how current LLMs represent the interaction of cultural and emotional axes, particularly when expressing social emotions, with direct implications for their deployment in cross-cultural affective contexts.
[NLP-203] ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection ACL
【速读】: 该论文旨在解决当前生成式音频伪造(Audio Deepfake)检测系统在真实场景(in-the-wild)中泛化能力不足的问题。解决方案的关键在于提出一种基于对比引导的上下文学习范式(In-Context Learning with comparison-guidance, ICLAD),其核心是引入成对比较推理策略,指导音频语言模型(Audio Language Model, ALM)识别并过滤幻觉及与伪造无关的声学特征;同时通过路由机制将分布外样本交由ALM处理,实现无需训练即可适应未见伪造类型,并提供可解释的文本推理依据。
链接: https://arxiv.org/abs/2604.16749
作者: Benjamin Chou,Yi Zhu,Surya Koppisetti
机构: Purdue University (普渡大学); Reality Defender Inc. (现实防御公司)
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: To appear at ACL Findings 2026
Abstract:Audio deepfakes pose a significant security threat, yet current state-of-the-art (SOTA) detection systems do not generalize well to realistic in-the-wild deepfakes. We introduce a novel In-Context Learning paradigm with comparison-guidance for Audio Deepfake detection (ICLAD). The framework enables the use of audio language models (ALMs) for training-free generalization to unseen deepfakes and provides textual rationales on the detection outcome. At the core of ICLAD is a pairwise comparative reasoning strategy that guides the ALM to discover and filter hallucinations and deepfake-irrelevant acoustic attributes. The ALM works alongside a specialized deepfake detector, whereby a routing mechanism feeds out-of-distribution samples to the ALM. On in-the-wild datasets, ICLAD improves macro F1 over the specialized detector, with up to 2x relative improvement. Further analysis demonstrates the flexibility of ICLAD and its potential for deployment on recent open-source ALMs.
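The routing mechanism and attribute filtering that the abstract describes might look roughly like the following. Everything here is a hedged reconstruction: `RELEVANT_ATTRIBUTES`, the stub callables, and the majority vote are illustrative assumptions, not the paper's implementation:

```python
# Acoustic attributes assumed to be deepfake-relevant; verdicts citing
# anything else (e.g. background music) are treated as hallucinated cues.
RELEVANT_ATTRIBUTES = {"prosody", "artifacts", "spectral", "breathing"}

def detect_with_routing(audio_feat, detector, alm_compare, references,
                        ood_score, ood_threshold=0.5):
    """If the sample looks in-distribution, trust the fast specialized
    detector; otherwise fall back to pairwise comparisons by the ALM
    against bona-fide reference clips."""
    if ood_score(audio_feat) < ood_threshold:
        return detector(audio_feat)                 # fast path
    # ALM path: compare against each reference and keep only verdicts
    # whose cited attribute is deepfake-relevant.
    votes = []
    for ref in references:
        verdict, attribute = alm_compare(audio_feat, ref)
        if attribute in RELEVANT_ATTRIBUTES:        # filter hallucinations
            votes.append(verdict)
    fake_votes = sum(v == "fake" for v in votes)
    return "fake" if votes and fake_votes > len(votes) / 2 else "real"
```

The stubs (`detector`, `alm_compare`, `ood_score`) stand in for the specialized classifier, the ALM prompt, and the out-of-distribution scorer.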
[NLP-204] CT Open: An Open-Access Uncontaminated Live Platform for the Open Challenge of Clinical Trial Outcome Prediction
【速读】: 该论文旨在解决如何利用人工智能(AI)系统更可靠地预测现实世界事件结果的问题,特别是聚焦于临床试验结果预测这一高风险且对领域专家而言仍具挑战性的任务。其核心问题在于:现有方法难以确保预测时目标事件尚未公开,从而导致评估偏差。解决方案的关键在于提出并实现了一个全自动的去污染(decontamination)流程,该流程基于迭代式大语言模型(LLM)驱动的网络搜索,精准识别每项临床试验首次公开结果的时间点,从而确保所有用于评估的试验在预测提交时均无已公开结果。这一机制保障了公平、可靠的基准测试环境,使参与者可自由使用任意数据源和方法论,推动生成式AI在真实世界前瞻性预测领域的研究进展。
链接: https://arxiv.org/abs/2604.16742
作者: Jianyou Wang,Youze Zheng,Longtian Bao,Hanyuan Zhang,Qirui Zheng,Yuhan Chen,Yang Zhang,Matthew Feng,Maxim Khan,Aditya K. Sehgal,Christopher D. Rosin,Ramamohan Paturi,Umber Dube,Leon Bergen
机构: University of California, San Diego (加州大学圣地亚哥分校); Elsevier (爱思唯尔)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review
Abstract:Scientists have long sought to accurately predict outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining if a trial’s outcome is public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline’s quality and accuracy with human experts’ annotations. Since CT Open’s pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes before they occur, while also informing biomedical research and improving clinical trial design. CT Open Platform is hosted at this https URL
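The decontamination criterion (every evaluated trial must have had no public outcome at submission time) reduces to a date comparison once the earliest mention is found. A minimal sketch, with `search` and `extract_date` standing in for the paper's iterative LLM-powered web-search components (hypothetical names and queries):

```python
def earliest_public_mention(trial_id, search, extract_date, max_rounds=3):
    """Iteratively broaden the query and keep the earliest dated hit
    that actually reports the trial's outcome (extract_date returns
    None for hits that do not)."""
    queries = [f"{trial_id} results",
               f"{trial_id} primary endpoint outcome",
               f"{trial_id} press release"]
    earliest = None
    for q in queries[:max_rounds]:
        for hit in search(q):
            d = extract_date(hit)
            if d is not None and (earliest is None or d < earliest):
                earliest = d
    return earliest

def passes_decontamination(trial_id, prediction_date, search, extract_date):
    """A trial is clean for a submission only if no public mention of
    its outcome predates the prediction."""
    first = earliest_public_mention(trial_id, search, extract_date)
    return first is None or first > prediction_date
```

This captures why registry dates alone are insufficient: the earliest mention may come from an obscure article found only by the broader queries.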
[NLP-205] The impact of postediting on AI generative translation in Yemeni context: Translating literary prose by ChatGPT
【速读】: 该论文旨在解决生成式 AI(Generative AI)在文学翻译中的应用边界问题,特别是ChatGPT-4在处理文化、风格和修辞等复杂语言特征时的局限性,以及是否仍需人类译者进行后编辑(postediting)。其解决方案的关键在于通过混合方法研究设计,由30名专业译者对AI生成的阿拉伯语与英语文学文本进行评估与后编辑,实证表明:尽管AI能显著提升翻译速度与可及性,但其在文化适配性和文体准确性方面仍存在不足,因此必须依赖人类译者的干预;最终提出“人机协同”模式而非完全替代人类译者,强调AI应作为辅助工具,以保障翻译质量与文化恰当性。
链接: https://arxiv.org/abs/2604.16704
作者: Nasim Al-wagieh(Ibb University),Mohammed Q. Shormani(Ibb University)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 Tables
Abstract:This study examines the role of artificial intelligence in translation, focusing on ChatGPT, specifically ChatGPT-4, and the extent to which human postediting is required in literary translation. A mixed-method approach was adopted, involving 30 professional translators who evaluated and postedited AI-generated translations of selected Arabic and English literary texts. The results show that although AI improves translation speed and accessibility, it remains limited in handling cultural, stylistic, and figurative aspects of language. Participants generally confirmed the necessity of human postediting, particularly in novels and drama. The findings point to an emerging human-machine collaboration model rather than the replacement of human translators. The study concludes that AI should be used as a supportive tool, while human expertise remains essential for ensuring translation quality and cultural appropriateness.
[NLP-206] No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用外部上下文(如检索到的证据)进行推理时存在的“中性回归”(neutral regression)问题,即模型在面对非信息性上下文时仍可能覆盖原本正确的输出,导致性能下降。为应对这一问题,作者提出了一种解码时适配器——无更坏上下文感知解码(No-Worse Context-Aware Decoding, NWCAD),其核心创新在于采用双流架构与两阶段门控机制:当上下文无信息量时,模型自动退回到无上下文解码;否则,在不确定性下引入类似CAD(Context-Aware Decoding)的回退策略以保持准确性。该方案在基准测试中实现了对基线正确样本的零性能下降,同时保留了在真正有益上下文下的高精度表现,从而兼顾了“不伤害”可靠性与上下文利用效率。
链接: https://arxiv.org/abs/2604.16686
作者: Yufei Tao,Ameeta Agrawal
机构: Portland State University (波特兰州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings at ACL 2026
Abstract:Large language models (LLMs) can answer questions and summarize documents when conditioned on external contexts (e.g., retrieved evidence), yet context use remains unreliable: models may overwrite an already-correct output (neutral regression) even when the context is non-informative. We formalize neutral regression as a do-no-harm requirement and quantify it by measuring accuracy drops on baseline-correct items under answer-consistent contexts. We propose No-Worse Context-Aware Decoding (NWCAD), a decode-time adapter built on a two-stream setup with a two-stage gate: it backs off to no-context decoding when the context is non-informative, and otherwise uses context-conditioned decoding with a CAD-style fallback under uncertainty. We evaluate NWCAD on benchmarks that separate do-no-harm reliability from context utilization (accuracy gains on genuinely helpful contexts). NWCAD prevents neutral regression on baseline-correct items while preserving strong context-driven accuracy on helpful contexts.
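NWCAD's two-stage gate can be illustrated at the logit level. The sketch below is an assumption-laden reading of the abstract: the entropy-based uncertainty test, the threshold, and the CAD-style contrast weight `alpha` are illustrative choices, not the authors' exact formulation:

```python
import math

def _entropy(logits):
    # Shannon entropy of softmax(logits), with max-shift for stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def nwcad_logits(ctx_logits, noctx_logits, informative,
                 alpha=0.5, uncertainty_threshold=1.0):
    """Gate 1: back off to no-context decoding when the context adds
    nothing (do-no-harm). Gate 2: under high uncertainty, apply a
    CAD-style contrast that amplifies what the context changed."""
    if not informative:
        return noctx_logits                       # do-no-harm back-off
    if _entropy(ctx_logits) > uncertainty_threshold:
        # CAD-style fallback: (1 + a) * ctx - a * noctx per vocab item.
        return [(1 + alpha) * c - alpha * n
                for c, n in zip(ctx_logits, noctx_logits)]
    return ctx_logits                             # confident: use context
```

The `informative` flag stands in for whatever classifier or heuristic decides that the retrieved context is non-informative.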
[NLP-207] CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering for Multi-Platform Social Streams ACL2026
【速读】: 该论文旨在解决社交媒体上紧急血液捐献请求因信息过载而被忽视的问题,尤其是在资源匮乏地区,传统基于应用程序的手动输入系统难以及时响应。其核心解决方案是提出认知血液请求系统(Cognitive Blood Request System, CBRS),该系统采用低成本的双层架构,能够高效地从多平台社交媒体流中过滤并解析血液捐献请求。关键创新在于构建了一个包含11K条已标注消息的多语言数据集(涵盖孟加拉语、英语及音译孟加拉语),并引入对抗性负样本以提升模型鲁棒性;同时,利用LoRA微调的Llama-3.2-3B模型在零样本场景下实现92%的解析准确率,显著优于基线模型且输入令牌消耗降低35倍,从而实现了高精度、低资源依赖的信息提取,为时间敏感型任务提供了可扩展、包容性强的解决方案。
链接: https://arxiv.org/abs/2604.16665
作者: Anik Saha,Mst. Fahmida Sultana Naznin,Zia Ul Hassan Abdullah,Anisa Binte Asad,K. G. Subarno Bithi,A. B. M. Alim Al Islam
机构: Bangladesh University of Engineering and Technology, Dhaka, Bangladesh (孟加拉国工程技术大学, 达卡)
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of the ACL 2026
Abstract:Urgent blood donation seeking posts and messages on social media often go unnoticed due to the overwhelming volume of daily communications. Traditional app-based systems, reliant on manual input, struggle to reach users in low-resource settings, delaying critical responses. To address this, we introduce the Cognitive Blood Request System (CBRS), a multi-platform framework that efficiently filters and parses blood donation requests from social media streams using a cost-efficient dual-layered architecture. To do so, we curate a novel dataset of 11K parsed blood donation request messages in Bengali, English, and transliterated Bengali, capturing the linguistic diversity of real social media communications. The inclusion of adversarial negatives further enhances the robustness of our model. CBRS achieves an impressive 99% accuracy and precision in filtering, surpassing benchmark methods. In the parsing task, our LoRA finetuned Llama-3.2-3B model achieves 92% zero-shot accuracy, surpassing the base model by 41.54% and exceeding the few-shot performance of GPT-4o-mini, Gemini-2.0-Flash, and other LLMs, while resulting in a 35X reduction in input token usage. This work lays a robust foundation for scalable, inclusive information extraction in time-sensitive, object-focused tasks. Our code, dataset, and trained models are publicly available at this https URL.
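The dual-layer architecture (a cheap filter, then an expensive parser) is what yields the reported token savings: the LLM parser only ever sees messages that survive the filter. A schematic sketch with stub callables (`is_request` and `parse_request` are hypothetical placeholders for the two layers):

```python
def cbrs_pipeline(messages, is_request, parse_request):
    """Layer 1: a cheap classifier drops the vast majority of stream
    traffic. Layer 2: structured LLM parsing runs only on messages
    that pass, keeping only parses with the key field filled in."""
    results = []
    for msg in messages:
        if not is_request(msg):          # cheap filter layer
            continue
        fields = parse_request(msg)      # expensive structured parsing
        if fields.get("blood_group"):    # require the essential field
            results.append(fields)
    return results
```

Because the filter handles the full stream while the parser handles only positives, input-token cost scales with the (small) request rate rather than with total traffic.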
[NLP-208] Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion
【速读】: 该论文旨在解决多语言大语言模型(LLM)中存在的“token过碎片化”(token over-fragmentation)问题,即非拉丁语系语言在编码时所需token数量显著高于英语,导致成本和延迟增加。其核心解决方案是推进基于可解释性的词汇表扩展方法(interpretability-based vocabulary expansion),关键在于两个决策:一是采用基于可解释性而非传统频率的方法选择新增词汇项,以实现更优的性能与token效率权衡;二是通过可解释性引导的嵌入初始化策略,显著提升模型对非拉丁文语言的处理效率(如提升约20个点的性能)。研究进一步发现“子词解碎片化”(subword detokenization)现象,并据此提出FragMend方法,通过优化子词合并机制进一步提升扩展效率。
链接: https://arxiv.org/abs/2604.16656
作者: Maitrey Mehta,Nishant Subramani,Zhichao Xu,Ashim Gupta,Vivek Srikumar
机构: Kahlert School of Computing, University of Utah (犹他大学卡勒特计算机学院); Language Technologies Institute, Carnegie Mellon University (卡内基梅隆大学语言技术研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:All languages are equal; when it comes to tokenization, some are more equal than others. Tokens are the hidden currency that dictate the cost and latency of access to contemporary LLMs. However, many languages written in non-Latin scripts observe a poor exchange rate: LLMs take several multiples of tokens to encode the same information in many languages as they do for English. Our analysis reveals that this issue, known as ‘token over-fragmentation’, persists in modern open-weight LLMs. The standard remedy is vocabulary expansion that adds target language items missing from the model’s vocabulary. In this work, we comprehensively study and advance interpretability-based vocabulary expansion, a new research direction. We focus on two core decisions in the vocabulary expansion process: What items should we add? and How should we initialize their corresponding input and output embeddings? First, we question the conventional use of frequency-based methods to choose candidate vocabulary items to add (a decision long treated as settled), and show that interpretability-based methods offer a superior performance-token efficiency trade-off. Next, we strengthen the case for interpretability-based embedding initialization by showing large gains (~20 pts) over baseline initialization methods for several languages written in non-Latin scripts. We identify the phenomenon of “subword detokenization” where models progressively merge fragmented subword tokens into larger subwords across layers. Grounded in our analysis of this phenomenon, we propose FragMend to further push the efficiency ceiling of interpretability-based expansion. We validate the effectiveness of FragMend through comparison against strong baselines and we present extensive analysis of its design choices.
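Token over-fragmentation is typically quantified on parallel text as the ratio of token counts relative to English. A self-contained sketch with a toy byte-fallback tokenizer (the real analysis would of course use the LLM's actual tokenizer; `toy_tokenize` only mimics the over-fragmented regime for non-Latin scripts):

```python
def toy_tokenize(text):
    """Stand-in tokenizer: ASCII words are single vocabulary items,
    while non-ASCII words fall back to UTF-8 bytes, mimicking how
    under-covered scripts get shattered into many tokens."""
    tokens = []
    for word in text.split():
        if word.isascii():
            tokens.append(word)                 # in-vocabulary item
        else:
            tokens.extend(word.encode("utf-8"))  # byte fallback
    return tokens

def token_premium(parallel_texts, tokenize, reference="en"):
    """Tokens spent per language on parallel text, normalized to the
    reference language: a premium of 9.0 means the same information
    costs 9x as many tokens as it does in the reference."""
    counts = {lang: len(tokenize(t)) for lang, t in parallel_texts.items()}
    ref = counts[reference]
    return {lang: c / ref for lang, c in counts.items()}
```

Vocabulary expansion aims to drive these premiums toward 1.0 by adding the missing target-language items.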
[NLP-209] IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 驱动的在线内容审核工具在识别种族、性别和性少数群体相关贬损语(slurs)时存在的根本性缺陷:即无法有效区分“再占有”(reclaimed slur)与仇恨言论(hate speech),从而误判并压制边缘化群体的自我赋权表达。解决方案的关键在于通过定量与定性相结合的方法,构建一个由目标社群成员标注的在线贬损语使用语料库,并系统分析其上下文特征(如是否指向自身、是否具贬义)与人工判断之间的关联性,揭示出即便在群内成员中也存在高度主观性和情境依赖性的解读差异,进而为开发更具文化敏感性和语境感知能力的自动化检测模型提供实证基础。
链接: https://arxiv.org/abs/2604.16654
作者: Christina Chance,Rebecca Pattichis,Arjun Subramonian,James He,Shruti Narayanan,Saadia Gabriel,Kai-Wei Chang
机构: University of California Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reclaimed slur usage is a common and meaningful practice online for many marginalized communities. It serves as a source of solidarity, identity, and shared experience. However, contemporary automated and AI-based moderation tools for online content largely fail to distinguish between reclaimed and hateful uses of slurs, resulting in the suppression of marginalized voices. In this work, we use quantitative and qualitative methods to examine the attitudes of social media users in LGBTQIA+, Black, and women communities around reclaimed slurs targeting our focus groups including the f-word, n-word, and b-word. With social media users from these communities, we collect and analyze an annotated online slur usage corpus. The corpus includes annotators’ perceptions of whether an online text containing a slur should be flagged as hate speech, as well as contextual features of the slur usage. Across all communities and annotation questions, we observe low inter-annotator agreement, indicating substantial disagreement among in-group annotators. This is compounded by the fact that, absent clear contextual signals of identity and intent, even in-group members may disagree on how to interpret reclaimed slur usage online. Semi-structured interviews with annotators suggest that differences in lived experience and personal history contribute to this variation as well. We find poor alignment between annotator judgments and automated hate speech assessments produced by Perspective API. We further observe that certain features of a text such as whether the slur usage was derogatory and if the slur was targeted at oneself are more associated with whether annotators report the text as hate speech. Together, these findings highlight the inherent subjectivity and contextual nature of how marginalized communities interpret slurs online.
[NLP-210] Migrant Voices Local News: Insights on Bridging Community Needs with Media Content
【速读】: 该论文试图解决的问题是:非主流受众(特别是法语移民群体)在地方新闻消费中的需求与本地媒体内容之间存在的信息鸿沟,尤其是在中等规模欧洲城市中的本地新闻覆盖不足问题。解决方案的关键在于结合定性研究(焦点小组访谈)与定量自然语言处理技术(主题建模、信息检索、情感分析和可读性评估),对超过2000篇超本地新闻文章进行系统分析,从而识别出移民群体关注但未被充分报道的主题,并揭示现有内容在情感倾向和语言难度上的潜在可及性障碍,为本地媒体优化内容策略以服务多元读者群提供实证依据。
链接: https://arxiv.org/abs/2604.16651
作者: David Alonso del Barrio,Paula Dolores Rescala,Victor Bros,Daniel Gatica-Perez
机构: Idiap Research Institute(伊迪普研究所); EPFL(洛桑联邦理工学院)
类目: Computation and Language (cs.CL)
备注: David Alonso del Barrio, Paula Dolores Rescala, Victor Bros, Daniel Gatica-Perez | ACM 2026. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in IMX’26 ACM International Conference on Interactive Media Experiences this https URL
Abstract:Research shows news consumption differs across demographics, yet little is known about non-mainstream audiences, especially in relation to local media. Our study addresses this gap by examining how French-speaking migrants in a mid-size European city engage with local news, and whether their needs are reflected in coverage. Eight community members participated in focus groups, whose insights guided the selection of natural language processing methods (topic modeling, information retrieval, sentiment analysis, and readability) applied to over 2000 hyper-local news articles. Results showed that while articles frequently covered local events, gaps remained in topics important to participants. Sentiment analysis revealed a generally positive tone, and readability measures indicated an intermediate-advanced French level, raising questions about accessibility for integration. Our work contributes to bridging the gap between local news platforms’ content and diverse readers’ needs, and could inform local media organizations about opportunities to expand their current news story coverage to appeal to more diverse audiences.
[NLP-211] AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在生成性能关键的核函数代码时,因缺乏可积累的执行反馈而导致的自适应能力不足问题,尤其在Triton等领域特定语言中表现尤为突出。其核心挑战在于:现有方法通常独立处理每个问题实例,无法构建可复用的知识体系;同时,Triton严格的语法约束和非线性优化空间使得盲目生成与局部微调难以保证正确性和高效性。解决方案的关键在于提出AdaExplore框架,通过两个互补阶段实现无需额外微调或外部知识的自我改进:第一阶段为故障驱动适应(failure-driven adaptation),将重复失败转化为有效性规则记忆,确保后续生成始终处于可行域内;第二阶段为多样性保持搜索(diversity-preserving search),以树结构组织候选核函数,交替进行小范围局部优化与大尺度结构重生成,从而突破局部最优并提升整体性能。实验表明,该方法在KernelBench Level-2和Level-3基准上分别实现3.12倍和1.72倍的速度提升,且随着计算资源增加持续优化。
链接: https://arxiv.org/abs/2604.16625
作者: Weihua Du,Jingming Zhuo,Yixin Dong,Andre Wang He,Weiwei Sun,Zeyu Zheng,Manupa Karunaratne,Ivan Fox,Tim Dettmers,Tianqi Chen,Yiming Yang,Sean Welleck
机构: Carnegie Mellon University (卡内基梅隆大学); University of Washington (华盛顿大学); Arm Ltd. (Arm有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preliminary work. The implementation is available at this https URL
Abstract:Recent large language model (LLM) agents have shown promise in using execution feedback for test-time adaptation. However, robust self-improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain-specific languages such as Triton, which are underrepresented in LLM pretraining data. Their strict constraints and non-linear optimization landscape further make naive generation and local refinement unreliable. We propose AdaExplore, an agent framework that enables self-improvement via accumulated execution feedback for performance-critical kernel code generation through two complementary stages: failure-driven adaptation and diversity-preserving search, jointly improving correctness and optimization performance without additional fine-tuning or external knowledge. In the adaptation stage, the agent synthesizes tasks and converts recurring failures into a reusable memory of validity rules, helping subsequent generations remain within the feasible set. In the search stage, the agent organizes candidate kernels as a tree and alternates between small local refinements and larger structural regeneration, allowing it to explore the optimization landscape beyond local optima. Experiments on kernel runtime optimization benchmarks validate these gains: AdaExplore achieves 3.12x and 1.72x speedups on KernelBench Level-2 and Level-3, respectively, within 100 steps, and continues to improve with additional computation.
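The failure-driven adaptation stage (recurring failures promoted to a reusable memory of validity rules) can be sketched as a small rule memory. Class and method names here are hypothetical, and the Triton-style failure signature in the test is only an example:

```python
from collections import Counter

class FailureMemory:
    """Turn recurring compile/runtime failures into reusable validity
    rules that later generations are checked against before execution."""

    def __init__(self, min_occurrences=2):
        self.failure_counts = Counter()
        self.rules = {}                   # failure signature -> fix hint
        self.min_occurrences = min_occurrences

    def record_failure(self, signature, fix_hint):
        # A one-off failure may be noise; only recurring failures are
        # promoted into the rule memory.
        self.failure_counts[signature] += 1
        if self.failure_counts[signature] >= self.min_occurrences:
            self.rules[signature] = fix_hint

    def violations(self, candidate_code):
        # Cheap pre-execution check: return the fix hints for every
        # known-bad pattern present in the candidate kernel.
        return [hint for sig, hint in self.rules.items()
                if sig in candidate_code]
```

In the full framework, the returned hints would be fed back into the generation prompt so subsequent candidates stay within the feasible set.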
[NLP-212] Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning ACL
【速读】: 该论文旨在解决生成式对话中后援信号(backchannel)的语义表征问题,即如何将后援信号的词汇形式与韵律特征联合建模以准确反映其语用意义。现有研究多集中于预测后援信号出现的时间点,而忽视了其词-韵律组合与语境之间的深层关联。解决方案的关键在于提出一个两阶段框架:首先利用大规模语言模型对话语转录文本进行微调,提取丰富的上下文表征;其次学习一个联合嵌入空间,将对话上下文与后援信号实现对齐。实验表明,该方法在上下文-后援匹配任务和人类感知相似性判断上显著优于传统方法,且揭示了后援信号形式对长程对话语境的高度敏感性,同时证明所学嵌入比原始WavLM特征更贴近人类认知。
链接: https://arxiv.org/abs/2604.16622
作者: Livia Qian,Gabriel Skantze
机构: KTH Royal Institute of Technology, Sweden (皇家理工学院,瑞典)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Association for Computational Linguistics (ACL), 2026
Abstract:Backchannels (e.g., ‘yeah’, ‘mhm’, and ‘right’) are short, non-interruptive feedback signals whose lexical form and prosody jointly convey pragmatic meaning. While prior computational research has largely focused on predicting backchannel timing, the relationship between lexico-prosodic form and meaning remains underexplored. We propose a two-stage framework: first, fine-tuning large language models on dialogue transcripts to derive rich contextual representations; and second, learning a joint embedding space for dialogue contexts and backchannel realizations. We evaluate alignment with human perception via triadic similarity judgments (prosodic and cross-lexical) and a context-backchannel suitability task. Our results demonstrate that the learned projections substantially improve context-backchannel retrieval compared to previous methods. In addition, they reveal that backchannel form is highly sensitive to extended conversational context and that the learned embeddings align more closely with human judgments than raw WavLM features.
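The second stage (learning a joint embedding space for dialogue contexts and backchannel realizations) is commonly trained with an InfoNCE-style contrastive loss, where the i-th context should score highest against the i-th backchannel. The dependency-free sketch below shows that general recipe, not the paper's exact objective:

```python
import math

def info_nce(context_embs, backchannel_embs, temperature=0.1):
    """Contrastive loss over a batch of paired embeddings: pulls each
    (context_i, backchannel_i) pair together while pushing apart all
    mismatched pairs in the batch."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(a):
        n = math.sqrt(dot(a, a))
        return [x / n for x in a]

    C = [normalize(c) for c in context_embs]
    B = [normalize(b) for b in backchannel_embs]
    loss = 0.0
    for i in range(len(C)):
        logits = [dot(C[i], b) / temperature for b in B]
        m = max(logits)                              # stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)                 # cross-entropy to match i
    return loss / len(C)
```

With correctly paired embeddings the loss is near zero; shuffling the pairing drives it up, which is the signal the projection heads are trained on.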
[NLP-213] Spotlights and Blindspots: Evaluating Machine-Generated Text Detection
【速读】: 该论文旨在解决生成式语言模型(Generative Language Models)时代下机器生成文本检测(Machine-Generated Text Detection)面临的评估不一致性问题,即不同检测模型在评估时因数据集、评价指标和评估策略差异而导致性能比较困难。其解决方案的关键在于系统性地评估15种来自六个不同系统的检测模型及7个训练模型,在七个英文文本测试集和三个创意人类写作数据集上的表现,并通过实证分析揭示训练与评估数据、关键指标对模型性能的影响。研究发现,单一系统无法在所有任务中均表现最优,且模型排名高度依赖于所选数据集和指标,强调了方法学选择在准确反映模型性能中的核心作用。
链接: https://arxiv.org/abs/2604.16607
作者: Kevin Stowe,Kailash Patil
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, 4 tables
Abstract:With the rise of generative language models, machine-generated text detection has become a critical challenge. A wide variety of models is available, but inconsistent datasets, evaluation metrics, and assessment strategies obscure comparisons of model effectiveness. To address this, we evaluate 15 different detection models from six distinct systems, as well as seven trained models, across seven English-language textual test sets and three creative human-written datasets. We provide an empirical analysis of model performance, the influence of training and evaluation data, and the impact of key metrics. We find that no single system excels in all areas and nearly all are effective for certain tasks, and the representation of model performance is critically linked to dataset and metric choices. We find high variance in model ranks based on datasets and metrics, and overall poor performance on novel human-written texts in high-risk domains. Across datasets and metrics, we find that methodological choices that are often assumed or overlooked are essential for clearly and accurately reflecting model performance.
[NLP-214] Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
【速读】: 该论文旨在解决当前语言模型(Language Models, LMs)在处理语义短语(semantic phrase)任务中表现不一致、缺乏系统性评估的问题。现有研究多聚焦于单一类型的多词表达(multiword expressions, MwEs),且缺乏统一的评测框架来全面衡量不同架构和规模的LMs在提取、分类与解释等任务中的语义理解能力。解决方案的关键在于提出SemanticQA,一个整合现有MwE资源并重构为统一测试平台的评估套件,涵盖词汇搭配、习语、名词短语复合结构及动词构式等细粒度类别,并通过多任务组合方式系统评估LMs的语义推理效能,从而揭示其在非平凡语义短语上的理解差异,为提升语言模型的深层语义理解能力提供依据。
链接: https://arxiv.org/abs/2604.16593
作者: Yang Liu,Hongming Li,Melissa Xiaohui Qin,Qiankun Liu,Chao Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 22 figures, 14 tables
Abstract:We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at this https URL.
[NLP-215] S-GRPO: Unified Post-Training for Large Vision-Language Models
【速读】: 该论文旨在解决当前大型视觉语言模型(Large Vision-Language Models, LVLMs)后训练方法中监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)各自存在的效率瓶颈问题:SFT易导致灾难性遗忘,使模型丧失通用多模态能力;而RL在稀疏奖励的视觉任务中常遭遇优化崩溃(optimization collapse),即模型无法自发生成有效轨迹。解决方案的关键在于提出一种统一的后训练框架——监督组相对策略优化(Supervised Group Relative Policy Optimization, S-GRPO),其核心创新是引入条件真实轨迹注入(Conditional Ground-Truth Trajectory Injection, CGI)机制:当二元验证器检测到采样轨迹组完全失效时,将已验证的真实轨迹注入候选池,并赋予确定的最大奖励,从而在组相对优势估计中提供正向信号,使监督学习目标成为策略梯度中的高优势成分,促使模型动态平衡专家轨迹利用与新颖视觉概念探索。
链接: https://arxiv.org/abs/2604.16557
作者: Yuming Yan,Kai Tang,Sihong Chen,Ke Xu,Dan Hu,Qun Yu,Pengfei Hu
机构: Tencent(腾讯)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current post-training methodologies for adapting Large Vision-Language Models (LVLMs) generally fall into two paradigms: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Despite their prevalence, both approaches suffer from inefficiencies when applied in isolation. SFT forces the model’s generation along a single expert trajectory, often inducing catastrophic forgetting of general multimodal capabilities due to distributional shifts. Conversely, RL explores multiple generated trajectories but frequently encounters optimization collapse - a cold-start problem where an unaligned model fails to spontaneously sample any domain-valid trajectories in sparse-reward visual tasks. In this paper, we propose Supervised Group Relative Policy Optimization (S-GRPO), a unified post-training framework that integrates the guidance of imitation learning into the multi-trajectory exploration of preference optimization. Tailored for direct-generation visual tasks, S-GRPO introduces Conditional Ground-Truth Trajectory Injection (CGI). When a binary verifier detects a complete exploratory failure within a sampled group of trajectories, CGI injects the verified ground-truth trajectory into the candidate pool. By assigning a deterministic maximal reward to this injected anchor, S-GRPO enforces a positive signal within the group-relative advantage estimation. This mechanism reformulates the supervised learning objective as a high-advantage component of the policy gradient, compelling the model to dynamically balance between exploiting the expert trajectory and exploring novel visual concepts. Theoretical analysis and empirical results demonstrate that S-GRPO gracefully bridges the gap between SFT and RL, drastically accelerates convergence, and achieves superior domain adaptation while preserving the base model’s general-purpose capabilities.
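The CGI mechanism can be illustrated directly on the group-relative advantage computation: when every sampled trajectory earns zero reward, the verified ground-truth trajectory is appended with maximal reward so the group still yields a positive gradient signal. A sketch under the assumption of binary rewards (the function name and normalization details are illustrative):

```python
def s_grpo_advantages(rewards, gt_reward=1.0, inject_if_all_fail=True):
    """Group-relative advantages a_i = (r_i - mean) / std. When the
    whole sampled group fails (all rewards zero), inject the verified
    ground-truth trajectory with deterministic maximal reward so the
    group is not a zero-gradient dead end."""
    group = list(rewards)
    injected = False
    if inject_if_all_fail and all(r == 0 for r in group):
        group.append(gt_reward)          # CGI: anchor trajectory
        injected = True
    mean = sum(group) / len(group)
    var = sum((r - mean) ** 2 for r in group) / len(group)
    std = var ** 0.5 or 1.0              # guard against zero variance
    return [(r - mean) / std for r in group], injected
```

The injected anchor receives the only positive advantage in a failed group, which is exactly the supervised-learning signal reframed as a high-advantage policy-gradient term.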
[NLP-216] A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty
【速读】: 该论文旨在解决持续性、可写入的代理记忆(agent memory)在长期使用中所引发的新型安全问题,即:具备持久记忆能力的智能体是否可能被跨会话投毒、未经授权访问、非法传播,以及如何对其进行有效治理。传统研究多聚焦于训练数据泄露风险,而本文提出将“记忆”本身作为独立的安全维度进行系统分析,其关键在于构建了一个基于六阶段(写入、存储、检索、执行、共享、遗忘/回滚)的内存生命周期框架,并将其与完整性、保密性、可用性和治理四个安全目标交叉映射,从而识别出当前研究中的盲区(如对保密性、存储-遗忘阶段和良性持久性失效的关注不足),并指出现有架构普遍缺乏对全部九种治理原语的支持。论文最终提出“记忆主权”(mnemonic sovereignty)概念——即对谁可以读写、何时授权更新、哪些状态可被遗忘等行为实现可验证、可恢复的治理控制,强调未来安全智能体的竞争优势不仅体现在记忆容量,更取决于记忆治理的质量。
链接: https://arxiv.org/abs/2604.16548
作者: Zehao Lin,Chunyu Li,Kai Chen
机构: MemTensor(MemTensor)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 63 pages, 7 figures, 10 tables. Survey paper. Preprint; submitted for review
Abstract:Research on large language model (LLM) security is shifting from “will the model leak training data” to a more consequential question: can an agent with persistent, long-term memory be continuously shaped, cross-session poisoned, accessed without authorization, and propagated across shared organizational state? Recent surveys cover memory architectures and agent mechanisms, but fewer center the epistemic and governance properties of persistent, writable memory as the reason memory is an independent security problem. This survey addresses that gap. Drawing on cognitive neuroscience and the philosophy of memory, we characterize agent memory as malleable, rewritable, and socially propagating, and develop a memory-lifecycle framework organized around six phases – Write, Store, Retrieve, Execute, Share, Forget/Rollback – cross-tabulated against four security objectives: integrity, confidentiality, availability, governance. We organize the literature on memory poisoning, extraction, retrieval corruption, control-flow hijacking, cross-agent propagation, rollback, and governance, and situate representative architectures as determinants of which phases are explicitly governable. Three findings stand out: the literature concentrates on write- and retrieve-time integrity attacks, while confidentiality, availability, store/forget, and benign-persistence failures remain sparsely studied; no published architecture covers all nine governance primitives we identify; and using LLMs themselves for memory security remains sparse yet essential. We unify these under mnemonic sovereignty – verifiable, recoverable governance over what may be written, who may read, when updates are authorized, and which states may be forgotten – arguing future secure agents will be differentiated not only by recall capacity, but by memory governance quality.
[NLP-217] TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts
【速读】: 该论文旨在解决当前生成式AI安全防护机制(guardrail models)在跨语言和跨文化场景下有效性不足的问题,即现有模型在报告性能与实际部署效果之间存在显著差距。其核心挑战在于缺乏对本地语言特征的适配性优化,导致防护模型在非主导语言地区(如台湾地区)的实际应用中表现不佳。解决方案的关键在于构建一个针对特定语言环境(以台湾汉语为范例)定制化的数据集,并基于此训练出具有语境敏感性的防护模型TWGuard,从而在F1分数上相比基础模型提升0.289,在误报率上比最强基线降低94.9%,显著提升了本地化部署下的实用性与安全性。
链接: https://arxiv.org/abs/2604.16542
作者: Hua-Rong Chu,Kuan-Chun Wang,Yao-Te Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Safety guardrails have become an active area of research in AI safety, aimed at ensuring the appropriate behavior of large language models (LLMs). However, existing research lacks consideration of nuances across linguistic and cultural contexts, resulting in a gap between reported performance and in-the-wild effectiveness. To address this issue, this paper proposes an approach to optimize guardrail models for a designated linguistic context by leveraging a curated dataset tailored to local linguistic characteristics, targeting the Taiwan linguistic context as a representative example of localized deployment challenges. The proposed approach yields TWGuard, a linguistic context-optimized guardrail model that achieves a huge gain (+0.289 in F1) compared to the foundation model and significantly outperforms the strongest baseline in practical use (-0.037 in false positive rate, a 94.9% reduction). Together, this work lays a foundation for regional communities to establish AI safety standards grounded in their own linguistic contexts, rather than accepting boundaries imposed by dominant languages. The inadequacy of the latter is reconfirmed by our findings.
[NLP-218] Scaling Test-Time Compute for Agentic Coding
【速读】: 该论文旨在解决长时程(long-horizon)编码代理(coding agent)在测试时扩展(test-time scaling)中的核心挑战:传统方法依赖于对短文本输出的直接比较、排序或优化,而编码代理的每次尝试会产生包含动作、观察、错误和部分进展的复杂轨迹(rollout trajectory),难以有效复用历史经验。其关键解决方案在于提出一种基于紧凑轨迹表示(compact representations of rollout trajectories)的测试时扩展框架,将每条轨迹转化为结构化摘要(structured summary),保留其中的关键假设、进展与失败模式,同时丢弃低信噪比的细节信息;该表示支持两种推理时扩展方式:并行扩展采用递归锦标赛投票(Recursive Tournament Voting, RTV)从群体中逐步筛选最优摘要;顺序扩展则通过条件化新轨迹以先前摘要蒸馏结果实现知识迁移(Parallel-Distill-Refine, PDR)。实验证明该方法显著提升前沿编码代理在SWE-Bench Verified和Terminal-Bench v2.0上的性能,表明长时程代理的测试时扩展本质上是表示、选择与重用的问题。
链接: https://arxiv.org/abs/2604.16529
作者: Joongwon Kim,Wannan Yang,Kelvin Niu,Hongming Zhang,Yun Zhu,Eryk Helenowski,Ruan Silva,Zhengxing Chen,Srinivasan Iyer,Manzil Zaheer,Daniel Fried,Hannaneh Hajishirzi,Sanjeev Arora,Gabriel Synnaeve,Ruslan Salakhutdinov,Anirudh Goyal
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 70 pages, 26 figures, 12 tables
Abstract:Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, by using our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
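摘要中的递归锦标赛投票(RTV)通过小组比较逐轮淘汰 rollout 摘要,直至只剩一个胜者。其控制流程可用如下简化示意表达(论文中由 LLM 充当比较者;此处将其抽象为任意选优函数 judge,接口均为假设):

```python
def recursive_tournament(summaries, judge, group_size=3):
    """RTV 控制流程示意(假设性接口,非论文代码)。

    将候选摘要划分为大小为 group_size 的小组,
    每组由 judge(group) 返回最佳摘要的下标,
    保留各组胜者并递归,直到只剩一个摘要。
    """
    pool = list(summaries)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            winners.append(group[judge(group)])
        pool = winners
    return pool[0]
```

小组比较将一次 N 选 1 的全局排序拆成多轮局部比较,每次比较的上下文规模保持有界。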
[NLP-219] SmoGVLM: A Small Graph-enhanced Vision-Language Model ICASSP2026
【速读】: 该论文旨在解决大型视觉语言模型(Vision-Language Models, VLMs)在知识密集型推理任务中普遍存在的幻觉(hallucination)和知识定位(grounding)不佳的问题。其解决方案的关键在于提出一种小型、图增强的视觉语言模型(SmoGVLM),通过引入图神经网络(Graph Neural Networks, GNNs)将结构化知识与视觉和文本模态进行融合,从而提升模型在多模态推理中的准确性与鲁棒性。实验表明,该方法在不同规模模型(从1.3B到13B参数)上均有效,尤其使小模型性能提升达16.24%,并超越更大规模的基线模型,验证了结构化知识增强对高效小规模多模态推理系统的潜力。
链接: https://arxiv.org/abs/2604.16517
作者: Debjyoti Mondal,Rituraj Singh,Subhadarshi Panda
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: ICASSP 2026
Abstract:Large vision-language models (VLMs) achieve strong performance on multimodal tasks but often suffer from hallucination and poor grounding in knowledge-intensive reasoning. We propose SmoGVLM, a small, graph-enhanced VLM that integrates structured knowledge with visual and textual modalities, using Graph Neural Networks. We investigate the effects of our method across a range of model sizes, from tiny (1.3B) to large (13B) models. Our results demonstrate that, when trained using our approach, a small model can achieve performance gains of up to 16.24% and surpass its larger counterparts, outperforming larger VLMs and strong fine-tuned baselines. These results highlight the potential of structured knowledge augmentation for efficient, smaller-scale multimodal reasoning systems.
[NLP-220] SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation
【速读】: 该论文旨在解决自然语言到SQL查询转换中的准确性与鲁棒性问题,特别是在复杂数据库环境下的错误处理和结果优化。其核心挑战在于如何在不依赖结构化输出API的情况下准确提取SQL语句,并在查询失败时自动修复,同时避免因迭代修复导致的性能退化或结果波动。解决方案的关键在于设计了一个两阶段的大型语言模型(LLM)流水线:第一阶段通过多策略响应解析器从任意格式(如JSON、代码块或纯文本)中提取SQL;第二阶段引入基于SQLSTATE码和PostgreSQL诊断信息的自愈循环机制,实现错误诊断与重试,配合早期接受(early-accept)和最佳结果追踪(best-result tracking)策略防止回归,从而显著提升执行准确率并保障稳定性。
链接: https://arxiv.org/abs/2604.16511
作者: Muhammad Adeel Ijaz
机构: 未知
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注: 16 pages, 5 tables, 4 figures
Abstract:We present SQL Query Engine, an open-source, self-hosted service that translates natural language questions into validated PostgreSQL queries through a two-stage LLM pipeline. The first stage performs automatic schema introspection and SQL generation; a multi-strategy response parser extracts SQL from any LLM output format (JSON, code blocks, or raw text) without requiring structured output APIs. The second stage executes the query against PostgreSQL and, upon failure or empty results, enters an iterative self-healing loop in which the LLM diagnoses the error using full SQLSTATE codes and PostgreSQL diagnostic messages. Two mechanisms prevent regressions: early-accept returns successful queries immediately without LLM re-evaluation, and best-result tracking preserves the best partial result across retries. Schema context is cached per session in Redis, progress events stream via Redis Pub/Sub and SSE, and an OpenAI-compatible /v1/chat/completions endpoint lets existing tools work without modification. All database connections are read-only at the driver level. We evaluate across five LLM backends on a synthetic benchmark (75 questions, three databases) where the self-healing loop yields up to +9.3pp accuracy gains with zero regressions on the best model (Llama 4 Scout 17B, 57.3%), and on BIRD (437 questions, 11 databases migrated from SQLite to PostgreSQL) where the full pipeline reaches 49.0% execution accuracy (GPT-OSS-120B, +4.6pp). Source code: this https URL.
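摘要描述的第二阶段自愈循环(执行失败或结果为空时,由 LLM 依据 SQLSTATE 码与诊断信息修复查询,并辅以 early-accept 与 best-result tracking)可作如下简化示意。其中 generate_sql、execute、diagnose 均为假设性接口,并非该项目的真实 API:

```python
def self_healing_query(question, generate_sql, execute, diagnose, max_retries=3):
    """自愈循环示意(假设性接口,非该项目代码)。

    early-accept:成功且非空的结果立即返回,不再经 LLM 复评;
    best-result tracking:保留迄今最好的部分结果,防止重试回退。
    """
    sql = generate_sql(question)
    best = (None, [])                         # 迄今最佳的 (sql, rows)
    for _ in range(max_retries + 1):
        ok, rows, error = execute(sql)        # error 携带 SQLSTATE 与诊断信息
        if ok and rows:
            return sql, rows                  # early-accept
        if ok and len(rows) > len(best[1]):
            best = (sql, rows)                # 保留最佳部分结果
        sql = diagnose(question, sql, error)  # LLM 依据诊断信息修复查询
    return best
```

两个机制共同保证重试只会改善结果:成功即停,失败也不会丢掉已有的最好答案。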
[NLP-221] Medical thinking with multiple images ICLR2026
【速读】: 该论文旨在解决当前大型语言模型在医学问答(Medical QA)任务中难以有效整合多张影像证据进行临床推理的问题。现有模型通常仅处理单视角图像,而真实临床诊断往往需要跨视图信息融合与分步推理。为此,作者提出了MedThinkVQA——一个专家标注的多图像推理基准,要求模型对每张图像进行解读、跨视图证据整合,并在中间监督和步骤级评估下回答诊断问题。解决方案的关键在于:可靠地实现视觉基础(visual grounding)、跨视图对齐(cross-view alignment)与证据组合(evidence composition),而非单纯增加推理长度。实验表明,即使是最先进的闭源模型在测试集上准确率也仅达57.2%,且超过70%的错误源于图像理解与跨视图整合阶段;提供专家指导的中间提示可显著提升性能,而自动生成的中间步骤反而降低准确性,说明当前模型的核心瓶颈在于对多模态输入的早期证据提取与对齐能力不足。
链接: https://arxiv.org/abs/2604.16506
作者: Zonghai Yao,Benlu Wang,Yifan Zhang,Junda Wang,Iris Xia,Zhipeng Tang,Shuo Han,Feiyun Ouyang,Zhichao Yang,Arman Cohan,Hong Yu
机构: UMass Amherst (马萨诸塞大学阿默斯特分校); Yale University (耶鲁大学); UMass Lowell (马萨诸塞大学洛厄尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Equal contribution for the first two authors. To appear in the proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026). Code is in this https URL . Dataset is in this https URL
Abstract:Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by Qwen3.5-397B-A17B at 52.2% and Qwen3.5-27B at 50.6%. Further analysis identifies grounded multi-image reasoning as the main bottleneck: models often fail to extract, align, and compose evidence across views before higher-level inference can help. Providing expert single-image cues and cross-image summaries improves performance, whereas replacing them with self-generated intermediates reduces accuracy. Step-level analysis shows that over 70% of errors arise from image reading and cross-view integration. Scaling results further show that additional inference-time computation helps only when visual grounding is already reliable; when early evidence extraction is weak, longer reasoning yields limited or unstable gains and can amplify misread cues. These results suggest that the key challenge is not reasoning length alone, but reliable mechanisms for grounding, aligning, and composing distributed evidence across real-world multimodal clinical inputs.
[NLP-222] NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions VLDB2026
【速读】: 该论文旨在解决当前自然语言转SQL(Natural Language to SQL, NL2SQL)技术在大语言模型(Large Language Models, LLMs)驱动下快速发展但缺乏系统性评估的问题,特别是对现有方法的有效性、效率和局限性的理解不足。其解决方案的关键在于提出首个模块化的评估与基准测试框架NL2SQLBench,将NL2SQL系统解构为三个核心模块:模式选择(Schema Selection)、候选生成(Candidate Generation)和查询修订(Query Revision),并为每个模块设计细粒度的量化指标以系统评估模块级性能;同时构建一个灵活的多智能体框架,支持跨多种NL2SQL方法的可配置基准测试,从而实现对十种代表性开源方法在两个数据集上的全面评估,揭示了当前方法在准确性与计算效率方面的显著差距,并指出现有基准数据集和评估规则存在的缺陷,为未来NL2SQL技术的精准创新提供了明确参考。
链接: https://arxiv.org/abs/2604.16493
作者: Shizheng Hou,Wenqi Pei,Nuo Chen,Quang-Trung Ta,Peng Lu,Beng Chin Ooi
机构: National University of Singapore(新加坡国立大学); Zhejiang University(浙江大学)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: The paper is accepted by VLDB 2026
Abstract:Natural Language to SQL (NL2SQL) technology empowers non-expert users to query relational databases without requiring SQL expertise. While large language models (LLMs) have greatly improved NL2SQL algorithms, their rapid development outpaces systematic evaluation, leaving a critical gap in understanding their effectiveness, efficiency, and limitations. To this end, we present NL2SQLBench, the first modular evaluation and benchmarking framework for LLM-enabled NL2SQL approaches. Specifically, we dissect NL2SQL systems into three core modules: Schema Selection, Candidate Generation, and Query Revision. For each module, we comprehensively review existing strategies and propose novel fine-grained metrics that systematically quantify module-level effectiveness and efficiency. We further implement these metrics in a flexible multi-agent framework, allowing configurable benchmarking across diverse NL2SQL approaches. Leveraging NL2SQLBench, we rigorously evaluate ten representative open-source methods on two datasets, the BIRD development set and the ScienceBenchmark development set, using two LLMs, DeepSeek-V3 and GPT-4o mini. We systematically assess each approach across the three core modules and evaluate multiple critical performance dimensions. Our evaluation reveals significant gaps in existing NL2SQL methods, highlighting not only substantial room for accuracy improvements but also the significant computational inefficiency, which severely hampers real-world adoption. Furthermore, our analysis identifies critical shortcomings in current benchmark datasets and evaluation rules, emphasizing issues such as inaccurate gold SQL annotations and limitations in existing evaluation rules. By synthesizing these insights into a unified benchmarking, our study establishes a clear reference point for fair comparison and serves as essential guidance for future targeted innovation in NL2SQL technology.
[NLP-223] EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions
【速读】: 该论文旨在解决实时语音助手在用户中途打断(mid-speech interruptions)时无法正确更新任务状态的问题,而现有对话评估基准大多仅关注轮次式交互(turn-based interaction),忽略了这一关键失败模式。解决方案的关键在于提出 EchoChain——一个受控的全双工(full-duplex)状态更新推理评测基准,通过标准化地在助理语音起始点后注入打断,并生成场景驱动的对话,从而实现跨模型的可控比较。该基准识别出三种常见失败模式:上下文惯性(contextual inertia)、打断失忆(interruption amnesia)和目标偏移(objective displacement),并验证了40.2%的错误源于中断下的状态更新推理而非任务本身难度,为诊断和改进实时语音交互中的状态管理能力提供了可复现的工具。
链接: https://arxiv.org/abs/2604.16456
作者: Smit Nautambhai Modi,Gandharv Mahajan,Marc Wetter,Randall Welles
机构: Labelbox(标签框)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注:
Abstract:Real-time voice assistants must revise task state when users interrupt mid-response, but existing spoken-dialog benchmarks largely evaluate turn-based interaction and miss this failure mode. We introduce EchoChain, a controlled benchmark for evaluating full-duplex state-update reasoning under mid-speech interruptions. EchoChain identifies three recurring failure patterns in post-interruption continuations: contextual inertia, interruption amnesia, and objective displacement. The benchmark generates scenario-driven conversations and injects interruptions at a standardized point relative to assistant speech onset, enabling controlled cross-model comparison. In a paired half-duplex control, total failures drop by 40.2% relative to interrupted runs, indicating that many errors are driven by state-update reasoning under interruption rather than task difficulty alone. Across evaluated real-time voice models, no system exceeds a 50% pass rate, showing substantial room for improvement in mid-generation state revision. EchoChain provides a reproducible benchmark for diagnosing state-update reasoning failures in full-duplex voice interaction.
[NLP-224] SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future
【速读】: 该论文旨在解决现有视觉语言模型(Visual-Language Models, VLMs)在气象数据文本生成任务中有效性难以量化的问题,特别是针对大气系统混沌性、多尺度时空变化带来的挑战。其解决方案的关键在于构建了SynopticBench数据集和提出SPACE评估框架:SynopticBench包含136万条由美国国家气象局生成的区域天气预报讨论文本及其对应的500mb位势高度、2米温度和850mb风速图像,提供了高质量的多模态气象数据对;SPACE则是一种新颖的评估方法,用于有效衡量文本对天气现象的对齐与覆盖程度,从而为VLMs在气象文本生成中的性能提供可验证的评估基准。
链接: https://arxiv.org/abs/2604.16451
作者: Timothy B. Higgins,Antonios Mamalakis,Chirag Agarwal
机构: University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: Accepted for presentation at Climate Informatics 2026
Abstract:Recent advances in visual-language models (VLMs) have led to significant improvements in a plethora of complex multimodal tasks like image captioning, report generation, and visual perception. However, generating text from meteorological data is highly challenging because the atmosphere is a chaotic system that is rapidly changing at various spatial and temporal scales. Given the complexity of atmospheric phenomena, it is critical to verifiably quantify the effectiveness of existing VLMs on weather forecasting data. In this work, we present SynopticBench, a high-quality dataset consisting of 1,367,041 text samples of Area Forecast Discussions created by the National Weather Service over the continental United States paired to images of 500mb geopotential height, 2 meter temperature, and 850mb wind velocity in weather forecasts. We also present Synoptic Phenomena Alignment and Coverage Evaluation (SPACE), a novel evaluation framework that can be used to effectively estimate the quality of text descriptions of synoptic weather phenomena. Extensive experiments on generating forecast discussions using state-of-the-art VLMs show the sensitivity of existing evaluation metrics in this domain and enable further exploration into synoptic weather and climate text generation.
[NLP-225] iPhoneme: Brain-to-Text Communication for ALS Using ConformerXL Decoding
【速读】: 该论文旨在解决肌萎缩侧索硬化症(ALS)患者因运动神经元损伤导致的言语障碍问题,通过开发高精度、实时可用的脑-文本通信系统实现语音恢复。其核心挑战在于神经解码准确率不足与输入接口效率低下,尤其是眼动追踪中的“Midas touch”问题(即注视点误触发)。解决方案的关键在于提出iPhoneme系统:一方面采用改进的ConformerXL深度学习架构(192.9M参数)进行语音音素解码,结合多尺度膨胀卷积、双向GRU和Pre-RMSNorm等技术提升对神经抖动的鲁棒性及CTC稳定性;另一方面设计了基于凝视辅助的音素输入界面,引入“按键式凝视+无声言语”协同输入范式替代传统停留时间选择机制,显著提高交互效率。在T15数据集上验证表明,该系统达到92.14%音素准确率(PER=7.86%)和73.39%词准确率(WER=26.61%),且延迟仅180ms,具备临床实用价值。
链接: https://arxiv.org/abs/2604.16441
作者: Yoonmin Cha,Dawit Chun,Sung Park
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Brain-computer interfaces (BCIs) for speech restoration hold transformative potential for the approximately 173,000–232,500 individuals worldwide with ALS-related dysarthria. Despite recent progress, high-performance speech BCIs have been demonstrated in only 22–31 patients globally, largely due to limitations in neural decoding accuracy and practical input interfaces. We present iPhoneme, a brain-to-text communication system that jointly addresses these challenges through integrated modeling and interaction design. The system combines a deep learning phoneme decoder based on a modified Conformer architecture (ConformerXL, 192.9M parameters) with a gaze-assisted phoneme input interface that mitigates the Midas touch problem in eye-tracking systems. The acoustic model incorporates a temporal prenet with multi-scale dilated convolutions and bidirectional GRU for neural jitter correction, temporal subsampling for CTC stability, and Pre-RMSNorm stabilization across 12 encoder blocks, trained with AdamW and cosine scheduling. On the interaction side, iPhoneme introduces a chorded gaze-plus-silent-speech paradigm that replaces dwell-time selection, enabling more efficient input. We evaluate the system on the T15 dataset (45 sessions, 8,071 trials) of 256-channel intracranial EEG from speech motor cortex regions. A 6-gram phoneme language model trained on 3.1M sequences, combined with WFST beam search (beam=128), achieves 92.14% phoneme accuracy (7.86% PER) and 73.39% word accuracy (26.61% WER), approximately 3% above prior state-of-the-art. The system operates on CPU with 180 ms latency, demonstrating real-time, high-accuracy brain-to-text communication for ALS.
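摘要报告的 7.86% PER(音素错误率)是标准的编辑距离指标:错误率 = 参考音素序列与解码序列间的 Levenshtein 距离除以参考序列长度。下面是该指标的一个通用实现示意(与论文代码无关):

```python
def phoneme_error_rate(ref, hyp):
    """PER = Levenshtein(ref, hyp) / len(ref),通用示意实现。"""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # 删除
                          d[i][j - 1] + 1,        # 插入
                          d[i - 1][j - 1] + cost) # 替换
    return d[m][n] / max(m, 1)
```

词错误率(WER)的计算方式相同,只是以词而非音素为基本单元。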
[NLP-226] HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的幻觉(hallucination)问题,即模型在生成内容时产生与事实不符的信息。现有检测方法多忽视了幻觉现象的动态演化特性及其内在机制。解决方案的关键在于提出一种受相变理论启发的框架 HalluSAE,将幻觉建模为模型潜在空间动力学中的临界转变过程;通过将生成轨迹视为穿越势能景观的过程,识别出高能稀疏特征所对应的临界过渡区域,并利用稀疏自动编码器、对比对数归因和线性探测等技术实现对幻觉成因的精准定位与因果检测,从而显著提升幻觉检测的准确性。
链接: https://arxiv.org/abs/2604.16430
作者: Boshui Chen,Zhaoxin Fan,Ke Wang,Zhiying Leng,Faguo Wu,Hongwei Zheng,Yifan Sun,Wenjun Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are powerful and widely adopted, but their practical impact is limited by the well-known hallucination phenomenon. While recent hallucination detection methods have made notable progress, we find most of them overlook the dynamic nature and underlying mechanisms of it. To address this gap, we propose HalluSAE, a phase transition-inspired framework that models hallucination as a critical shift in the model’s latent dynamics. By modeling the generation process as a trajectory through a potential energy landscape, HalluSAE identifies critical transition zones and attributes factual errors to specific high-energy sparse features. Our approach consists of three stages: (1) Potential Energy Empowered Phase Zone Localization via sparse autoencoders and a geometric potential energy metric; (2) Hallucination-related Sparse Feature Attribution using contrastive logit attribution; and (3) Probing-based Causal Hallucination Detection through linear probes on disentangled features. Extensive experiments on Gemma-2-9B demonstrate that HalluSAE achieves state-of-the-art hallucination detection performance.
[NLP-227] Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks NEURIPS2026
【速读】: 该论文旨在解决状态空间模型(State-Space Models, SSMs)在安全关键长上下文应用(如基因组分析、临床时间序列预测和网络安全日志处理)中的安全性、安全性和认知风险问题,这些问题此前尚未被系统研究。其核心贡献在于构建了首个针对SSM的安全性分析框架——包括五层攻击面、状态完整性破坏(State Integrity Violation, StIV)、跨上下文放大比 XS 和基于 H∞ 范数的谱敏感性命题,并提出了三类新型攻击:谱对抗攻击(利用传递函数增益)、延迟触发状态后门(数千步后激活)以及状态容量饱和(熵洪水导致无声遗忘)。解决方案的关键在于将形式化威胁建模与实证评估相结合,通过14项MITRE ATLAS技术扩展、六种攻击者画像及治理对齐的缓解策略(映射至CREST、NIST AI 600-1和欧盟AI法案),实现了从理论到实践的闭环验证,且实验表明目标基因组注入可使StIV提升至0.519(随机为0.086,p<0.001),PGD状态注入使输出扰动增强156倍,同时提出O(N²)复杂度的状态提取方法相较于传统O(N³)实现N倍加速。
链接: https://arxiv.org/abs/2604.16424
作者: Manoj Parmar
机构: SovereignAI Security Labs (SovereignAI 安全实验室)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注: 32 pages, 22 tables, NeurIPS 2026 submission format. Appendix contains theoretical analysis and future experimentation plans
Abstract:State-Space Models (SSMs) – structured SSMs (S4, S4D, DSS, S5), selective SSMs (Mamba, Mamba-2), and hybrid architectures (Jamba) – are deployed in safety-critical long-context applications: genomic analysis, clinical time-series forecasting, and cybersecurity log processing. Their linear-time scaling is compelling, yet the security properties of their compressed-state recurrent architectures remain unstudied. We present the first systematic treatment of SSM safety, security, and cognitive risks. Seven contributions: (1) Formal threat framework – SSM Attack Surface (five layers), State Integrity Violation (StIV), Cross-Context Amplification Ratio \mathcal{X}_\mathcal{S}, and a Spectral Sensitivity Proposition grounded in the H_\infty norm. (2) Three novel attack classes: spectral adversarial attacks (transfer-function gain exploitation), delayed-trigger stateful backdoors (activate thousands of steps after injection), and state capacity saturation (entropy flooding forces silent forgetting). (3) 14 MITRE ATLAS technique extensions across the full tactic chain. (4) Six-profile attacker taxonomy with kill chains for genomics, clinical, and cybersecurity domains. (5) Four cognitive risk hypotheses grounded in state-compression mechanics. (6) Governance-aligned mitigations mapped to CREST, NIST AI 600-1, and EU AI Act. (7) Empirical evaluation: targeted genomic injection achieves \mathrm{StIV}=0.519 vs. 0.086 random (6.0\times, p<0.001); PGD state injection achieves 156\times output perturbation over random; SSD-structured extraction confirmed at O(N^2) vs. O(N^3) query complexity (N\times speedup). Validation on pretrained checkpoints is detailed in the Appendix.
[NLP-228] Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG LREC2026
【速读】: 该论文旨在解决如何有效将领域特定知识(特别是生物医学领域)注入语言模型(LMs),以提升其在专业任务中的表现。传统方法多依赖非结构化文本语料,而本文提出两种互补策略:一是通过持续预训练(continual pretraining)将知识嵌入模型参数,二是采用图检索增强生成(Graph Retrieval-Augmented Generation, GraphRAG)在推理时调用知识图谱。关键创新在于构建了一个包含340万概念和3420万关系的大型生物医学知识图谱(基于UMLS Metathesaurus),并将其存储于Neo4j中实现高效查询;同时利用该图谱生成约1亿token的文本语料用于持续预训练BERT和BioBERT变体,并设计GraphRAG管道直接集成至LLaMA 3-8B模型,在不重新训练的情况下显著提升PubMedQA和BioASQ两个问答任务上的准确率,且具备可解释性、多跳推理能力和易更新特性。
链接: https://arxiv.org/abs/2604.16422
作者: Jaafer Klila,Sondes Bannour Souihi,Rahma Boujelben,Nasredine Semmar,Lamia Hadrich Belguith
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at LREC 2026
Abstract:The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) Continual pretraining that embeds knowledge into model parameters, and (ii) Graph Retrieval-Augmented Generation (GraphRAG) that consults a knowledge graph at inference time. We first construct a large-scale biomedical knowledge graph from UMLS (3.4 million concepts and 34.2 million relations), stored in Neo4j for efficient querying. We then derive a ~100-million-token textual corpus from this graph to continually pretrain two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT). We evaluate these models on six BLURB (Biomedical Language Understanding and Reasoning Benchmark) datasets spanning five task types and evaluate GraphRAG on the two QA (Question Answering) datasets (PubMedQA, BioASQ). On BLURB tasks, BERTUMLS improves over BERT, with the largest gains on knowledge-intensive QA. Effects on BioBERT are more nuanced, suggesting diminishing returns when the base model already encodes substantial biomedical text knowledge. Finally, augmenting LLaMA 3-8B with our GraphRAG pipeline yields accuracy gains of over 3 points on PubMedQA and 5 points on BioASQ without any retraining, delivering transparent, multi-hop, and easily updated knowledge access. We release the processed UMLS Neo4j graph to support reproducibility.
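GraphRAG 在推理时需要从 Neo4j 中检索种子概念的多跳邻居作为上下文。下面是一个构造参数化 Cypher 查询字符串的示意(节点标签 Concept、属性 name 及关系结构均为假设的图谱 schema,并非论文公开的真实结构):

```python
def umls_context_query(max_hops=2, limit=25):
    """构造多跳邻居检索的 Cypher 查询(schema 为假设)。

    查询返回邻居概念名及沿途关系类型,供拼接为 LLM 的检索上下文;
    实际执行时经官方 neo4j 驱动以 $name 参数传入种子概念。
    """
    return (
        f"MATCH p = (c:Concept {{name: $name}})-[*1..{max_hops}]->(n:Concept) "
        f"RETURN n.name AS neighbor, [r IN relationships(p) | type(r)] AS rels "
        f"LIMIT {limit}"
    )
```

限制跳数与返回条数是 GraphRAG 常见的做法:既保证多跳推理所需的关系链,又控制注入提示词的上下文长度。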
[NLP-229] Measuring Representation Robustness in Large Language Models for Geometry
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在几何推理任务中对等价问题表示形式的鲁棒性不足的问题。现有基准测试通常固定问题表述形式(如欧氏几何、坐标几何或向量几何),隐含假设模型具备表示不变性(representation invariance),从而掩盖了因表示变化导致的性能下降。为更准确评估模型的几何推理能力,作者提出GeoRepEval框架,其关键在于引入问题级别的多表示并行评测机制,结合严格答案匹配、Bootstrap置信区间、配对McNemar检验、表示翻转分析及表面复杂度控制的回归校正,量化模型在不同表示下的正确性(correctness)、不变性(invariance)和一致性(consistency)。特别地,提出的Invariance@3指标可将整体准确率分解为稳健与脆弱成分,并受最弱表示形式的上限约束,从而揭示出当前模型依赖特定表示启发式而非抽象几何推理的本质局限。
链接: https://arxiv.org/abs/2604.16421
作者: Vedant Jawandhia,Yash Sinha,Murari Mandal,Ankan Pal,Dhruv Kumar
机构: BITS Pilani; KIIT University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures, 9 tables
Abstract:Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at this https URL.
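文中的 Invariance@3 指标要求同一题目的三种表示全部答对才计为稳健,因此它必然不超过最弱表示形式的准确率。下面是一个最小化的计算示意(数据结构为假设,仅用于说明指标定义):

```python
def invariance_at_3(results):
    """Invariance@3 计算示意(数据结构为假设)。

    results 将每个题目 id 映射到三种平行表示的对错,例如
    {"euclidean": True, "coordinate": True, "vector": False}。
    仅当三种表示全部答对,该题才计入稳健成分。
    """
    n = len(results)
    invariant = sum(all(r.values()) for r in results.values()) / n
    per_rep = {
        rep: sum(r[rep] for r in results.values()) / n
        for rep in next(iter(results.values()))
    }
    # 最弱表示上界:Invariance@3 <= min(各表示准确率)
    assert invariant <= min(per_rep.values()) + 1e-9
    return invariant, per_rep
```

各表示准确率与 Invariance@3 之差即"脆弱"成分:这些题目模型能在某些表示下答对,却换个表示就答错。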
[NLP-230] Measuring the Gap Between Media Coverage and Public Information Demand: Evidence from the 2026 Lebanon Conflict
【速读】: 该论文试图解决的问题是:在2026年3月黎巴嫩冲突期间,媒体议程(media agenda)与公众信息需求(public information demand)之间是否存在偏差,以及这种偏差的具体表现和成因。解决方案的关键在于通过量化分析方法,将来自GDELT数据库的11,623篇英文新闻文章按主题分类(冲突、经济、生活条件、移民),并与Google Trends中黎巴嫩本地搜索数据进行对比,从而揭示媒体覆盖集中于军事事件(占新闻总量94.9%),而公众实际搜索兴趣则主要集中在经济、生活条件和移民问题(占搜索总量63.1%),且后者在时间上呈现持续性而非事件驱动型特征。这一方法有效识别了媒体议程与公众信息需求之间的显著不匹配,为议程设置理论提供了实证支持。
链接: https://arxiv.org/abs/2604.16417
作者: Mohamed Soufan
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 16 pages, 4 figures, 1 table. Code and data available on GitHub
Abstract:This study examines the relationship between media coverage and public information demand during the Lebanon conflict in March 2026. Using a dataset of 11,623 English-language news articles collected from the GDELT database and Google Trends data for searches conducted within Lebanon, the study compares the distribution of news coverage across topics with the distribution of public search interest. News headlines were filtered for relevance and classified into four categories: Conflict, Economy, Living Conditions, and Emigration. Public information demand was measured using Google Trends topic data for the same categories. The results show a substantial divergence between news coverage and search interest. Conflict accounted for 94.9% of classified news coverage but only 36.9% of total search interest. In contrast, Economy, Living Conditions, and Emigration together accounted for 63.1% of search demand but only 5.1% of news coverage. Time series analysis indicates that search demand for economic and living conditions remained consistently elevated throughout the month rather than reacting to specific conflict events. These findings were robust to the exclusion of the peak conflict period (March 1-5), with Conflict coverage remaining at 94.9% and the information gap persisting across all three under-covered categories. The findings suggest that during the study period, media coverage of Lebanon was heavily concentrated on military events, while public information demand was distributed across economic conditions, daily life, and emigration. This study contributes to agenda-setting research by providing a quantitative comparison between media agenda and public information demand during an active conflict period.
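论文的核心量化比较可以抽象为逐主题的“搜索份额 − 报道份额”差值。下面的示意中,Conflict 的占比取自摘要给出的量级(报道约 94.9%、搜索约 36.9%),其余主题的拆分为假设数据:

```python
# 示意:媒体议程与公众信息需求的逐主题缺口计算(计数为假设数据)。

def information_gap(coverage_counts, search_counts):
    cov_total = sum(coverage_counts.values())
    dem_total = sum(search_counts.values())
    return {t: search_counts[t] / dem_total - coverage_counts[t] / cov_total
            for t in coverage_counts}

coverage = {"Conflict": 949, "Economy": 30, "Living": 15, "Emigration": 6}
search   = {"Conflict": 369, "Economy": 281, "Living": 210, "Emigration": 140}
gap = information_gap(coverage, search)
assert gap["Conflict"] < 0                                    # 冲突被过度报道
assert all(gap[t] > 0 for t in ("Economy", "Living", "Emigration"))  # 报道不足
```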
[NLP-231] QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning LREC26
【速读】: 该论文旨在解决大语言模型在阿拉伯语伊斯兰遗产继承推理任务中的结构化推理能力不足问题,该任务要求模型具备多步骤法律分析、基于规则的决策判断以及精确的分数计算能力。解决方案的关键在于采用分阶段的量化低秩适配(Quantized Low-Rank Adaptation, QLoRA)微调策略:首先在3,166条伊斯兰教法判例(fatwa)上进行领域适应,以习得继承术语和法学推理模式;随后在12,000个结构化继承案例上进行任务特定训练,优化JSON格式输出生成。通过4-bit NF4量化与rank-128 LoRA适配器的结合,模型在测试集上达到90%的MIR-E评分,展现出小规模模型在复杂法律推理任务中媲美商用系统(如Gemini-2.5-flash)的高效性能。
链接: https://arxiv.org/abs/2604.16396
作者: Mohammad AL-Smadi
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for publication, The 7th Workshop on Open-Source Arabic Corpora and Processing Tools, LREC26 conference
Abstract:Islamic inheritance law (ilm al-mawarith) presents a challenging domain for evaluating large language models’ structured reasoning capabilities, requiring multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations. We present QU-NLP’s submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning. Our approach employs a multi-stage Quantized Low-Rank Adaptation (QLoRA) fine-tuning strategy on Qwen3-4B: (1) domain adaptation on 3,166 Islamic fatwa records to acquire inheritance terminology and jurisprudential reasoning patterns, followed by (2) task-specific training on 12,000 structured inheritance cases to optimize JSON-formatted output generation. Using 4-bit NF4 quantization with rank-128 LoRA adapters, our model achieves a 90% MIR-E (Mawarith Inheritance Reasoning Evaluation) score on the test set, demonstrating competitive performance while requiring minimal computational resources. Our results show that domain-specific pre-adaptation combined with structured output training enables small language models to perform complex legal reasoning tasks effectively, comparable to commercial systems such as Gemini-2.5-flash.
[NLP-232] RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)
【速读】: 该论文旨在解决教育领域中缺乏高质量、结构化且与课程紧密对齐的考试数据集问题,尤其是在低资源语言和教育体系中。针对这一挑战,作者提出了RoMathExam——一个覆盖1895至2025年罗马尼亚高中数学考试的纵向数据集,其中1957至2025年部分具有标准化核心结构。其关键解决方案包括:(1)高保真数字化与统一JSON schema设计,确保数据可追溯性;(2)引入课程对齐的主题标签和密集文本嵌入,支持变体检测、去重及相似性检索;(3)提出并验证一种可扩展的内在复杂度指标(solution complexity metric),作为难度的代理变量,有效分离数学内在深度与随机生成噪声,实证显示该指标在三种前沿推理模型间具有高度一致性(r > 0.72)。该方案为低资源语境下的难度建模、课程分析和大语言模型(LLM)评估提供了可复现的研究基础。
链接: https://arxiv.org/abs/2604.16392
作者: Luca-Ncolae Cuclea,Sabin-Codrut Badea,Adrian-Marius Dumitran
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: AIED 2026, 15 pages
Abstract:AI in Education research increasingly relies on authentic, curriculum-grounded assessment data, yet large, well-structured exam corpora remain scarce for many languages and educational systems. We introduce RoMathExam, a longitudinal dataset of Romanian high-school mathematics exams spanning 1895-2025, with a robust standardized core for 1957-2025. The dataset contains 10,592 mathematics problems organized into 600+ complete exam sets across multiple tracks (M1-M4), covering both official national examination sessions and ministry-published training variants. Beyond high-fidelity digitization and a unified JSON schema with traceable provenance, RoMathExam is enriched with curriculum-aligned topic tags and dense text embeddings, enabling variant detection, deduplication, and similarity-based retrieval. To overcome the lack of historical psychometric data, we propose and validate a solution complexity metric as a scalable intrinsic proxy for difficulty. Our evaluation across three frontier reasoning models (GPT-5-mini, DeepSeek-R1, and Qwen3-235B-Thinking) reveals high cross-model synchronization (r > 0.72), confirming the metric’s ability to isolate intrinsic mathematical depth from stochastic generation noise. We demonstrate the dataset’s utility through a longitudinal analysis that quantifies a “regime shift” from volatile historical formats to a standardized, algebra-dominant modern curriculum. RoMathExam provides a foundation for reproducible research in difficulty modeling, curriculum analytics, and LLM evaluation in low-resource linguistic contexts.
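论文用三种前沿模型之间的 Spearman 秩相关(r > 0.72)验证复杂度指标的跨模型一致性。下面是一个无第三方依赖的 Spearman 最小实现(未处理并列秩,输入为假设数据):

```python
# 示意:Spearman 秩相关的最小实现(简化版,不做并列秩平均)。

def rank(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(x, y):
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

assert abs(spearman([1, 2, 3, 4], [10, 20, 30, 40]) - 1.0) < 1e-9  # 完全同序
```

实际分析中通常直接使用 scipy.stats.spearmanr,并对并列秩做平均处理。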
[NLP-233] LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?
【速读】: 该论文旨在解决大语言模型在纵向自然语言处理(Longitudinal NLP)任务中难以有效利用历史上下文、追踪动态交互以及识别罕见变化事件的问题。其解决方案的关键在于提出LiFT(Longitudinal Instruction Fine-Tuning)框架,该框架通过统一的指令模板整合多样化的纵向建模任务,并采用渐进式课程学习策略逐步提升时间难度,同时引入少量示例结构和时间条件化机制,以增强模型对历史信息的利用效率。实验表明,LiFT在多个数据集上均显著优于基础模型的上下文学习(In-Context Learning, ICL),尤其在分布外数据和少数类变化事件上表现突出。
链接: https://arxiv.org/abs/2604.16382
作者: Iqra Ali,Talia Tseriotou,Mahmud Elahi Akhter,Yuxiang Zhou,Maria Liakata
机构: Queen Mary University of London (英国); The Alan Turing Institute (英国)
类目: Computation and Language (cs.CL)
备注:
Abstract:Longitudinal NLP tasks require reasoning over temporally ordered text to detect persistence and change in human behavior and opinions. However, in-context learning with large language models struggles on tasks where models must integrate historical context, track evolving interactions, and handle rare change events. We introduce LiFT, a longitudinal instruction fine-tuning framework that unifies diverse longitudinal modeling tasks under a shared instruction schema. LiFT uses a curriculum that progressively increases temporal difficulty while incorporating few-shot structure and temporal conditioning to encourage effective use of past context. We evaluate LiFT across five datasets. Models trained on longitudinal tasks with different levels of temporal granularity are tested for generalisability on two separate datasets. Across models with different parameter sizes (OLMo (1B/7B), LLaMA-8B, and Qwen-14B), LiFT consistently outperforms base-model ICL, with strong gains on out-of-distribution data and minority change events.
[NLP-234] Data Mixing for Large Language Models Pretraining: A Survey and Outlook
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)预训练过程中数据混合策略的系统性优化问题,即如何在有限的计算和数据预算下,通过合理分配不同领域数据的采样权重来提升训练效率与下游任务泛化能力。其解决方案的关键在于将数据混合优化形式化为定义在概率单纯形上的双层问题,并提出一个细粒度的分类体系,将现有方法分为静态混合(规则驱动与学习驱动)和动态混合(自适应与外部引导)两大类,从而清晰揭示各类方法的实现机制、性能-成本权衡特性及其局限性,为未来研究提供理论框架与实践指导。
链接: https://arxiv.org/abs/2604.16380
作者: Zhuo Chen,Yuxuan Miao,Supryadi,Deyi Xiong
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 41 pages, 4 figures, 1 table
Abstract:Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limited budgets more effectively. In recent years, a growing body of work has proposed principled data mixing methods for LLM pretraining; however, the literature remains fragmented and lacks a dedicated, systematic survey. This paper provides a comprehensive review of data mixing for LLM pretraining. We first formalize data mixture optimization as a bilevel problem on the probability simplex and clarify the role of data mixing in the pretraining pipeline, and briefly explain how existing methods make this formulation tractable in practice. We then introduce a fine-grained taxonomy that organizes existing methods along two dimensions: static versus dynamic mixing. Static mixing is further categorized into rule-based and learning-based methods, while dynamic mixing is grouped into adaptive and externally guided families. For each class, we summarize representative approaches and analyze their strengths and limitations from a performance-cost trade-off perspective. Building on this analysis, we highlight challenges that cut across methods, including limited transferability across data domains, optimization objectives, models, and validation sets, as well as unstandardized evaluation protocols and benchmarks, and the inherent tension between performance gains and cost control in learning-based methods. Finally, we outline several exploratory directions, including finer-grained domain partitioning and inverse data mixing, as well as pipeline-aware designs, aiming to provide conceptual and methodological insights for future research.
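作为综述中“规则驱动静态混合”一类方法的常见实例,可以看多语言/多领域训练中常用的温度缩放重加权:把各领域 token 数归一化到概率单纯形,再用指数 alpha 压平头部领域的权重。以下为该类方法的通用示意,并非某篇论文的具体算法:

```python
# 示意:静态数据混合的温度缩放规则(p^alpha 重加权后归一化回单纯形)。

def temperature_mixing(domain_tokens, alpha=0.5):
    total = sum(domain_tokens.values())
    p = {d: n / total for d, n in domain_tokens.items()}   # 自然比例
    w = {d: v ** alpha for d, v in p.items()}              # 温度缩放,alpha<1 压平分布
    z = sum(w.values())
    return {d: v / z for d, v in w.items()}                # 归一化,保持在概率单纯形上

weights = temperature_mixing({"web": 900, "code": 90, "books": 10}, alpha=0.5)
assert abs(sum(weights.values()) - 1.0) < 1e-9
assert weights["code"] > 90 / 1000   # 小领域采样权重被上调
```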
[NLP-235] Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)与传统机器学习方法(如随机森林,Random Forest, RF)在预测建模中难以有效集成的问题,因其本质不同的表示方式和训练范式限制了协同优化:LLMs依赖文本数据上的梯度优化,而RF采用不可微的特征分割机制。解决方案的关键在于提出一种基于强化学习的互惠协同训练框架(reciprocal co-training framework),通过双向反馈机制实现模型间的适应性增强——将表格数据转化为标准化文本表示供LLM处理,其嵌入向量扩充RF的特征空间;同时,RF输出的概率校准结果作为奖励信号驱动LLM的强化学习更新,从而形成迭代优化闭环。实验表明,该框架在三个医学数据集上均显著提升两模型性能,尤其对LLM效果突出,且消融分析验证了迭代精炼、混合奖励设计及维度控制对性能增益的共同贡献。
链接: https://arxiv.org/abs/2604.16378
作者: Yunshuo Tian,Akayou Kitessa,Tanuja Chitnis,Yijun Zhao
机构: Fordham University (福特汉姆大学); Mass General Brigham (麻省总医院布里格姆)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) and classical machine learning methods offer complementary strengths for predictive modeling, yet their fundamentally different representations and training paradigms hinder effective integration: LLMs rely on gradient-based optimization over textual data, whereas models such as Random Forests (RF) employ non-differentiable feature partitioning. This work introduces a reciprocal co-training framework that couples an LLM with an RF classifier via reinforcement learning, creating an iterative feedback loop in which each model improves using signals from the other. Tabular data are reformulated into standardized textual representations for the LLM, whose embeddings augment the RF feature space, while calibrated RF probability estimates provide feedback signals that guide reinforcement learning updates of the LLM. Experiments across three medical datasets demonstrate consistent performance gains for both models, with particularly strong effects for the LLM. Ablation analyses show that iterative refinement, hybrid reward design, and dimensionality control jointly contribute to these gains. The proposed framework provides a general mechanism that allows incompatible model families to leverage each other’s strengths through bidirectional adaptation.
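RCT 框架中,表格样本先被序列化为标准化文本供 LLM 编码,LLM 嵌入再拼接进随机森林的特征空间。下面是这两步的假设性示意(serialize_row 的字段格式为示意,并非论文原始模板):

```python
# 示意:RCT 的"表格→文本"序列化与嵌入增广两步(接口与格式均为假设)。

def serialize_row(row):
    """把表格样本转成标准化文本,供 LLM 编码;字段按名称排序保证确定性。"""
    return "; ".join(f"{k}={v}" for k, v in sorted(row.items()))

def augment_features(tabular, embedding):
    """把 LLM 嵌入拼接到原始表格特征上,作为随机森林的扩展特征空间。"""
    return list(tabular) + list(embedding)

text = serialize_row({"age": 54, "stage": "II"})
feat = augment_features([54.0, 2.0], [0.1, -0.3, 0.7])   # 嵌入向量为假设值
```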
[NLP-236] GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution ICME
【速读】: 该论文旨在解决生成式 AI(Generative AI)代码溯源问题,即在难以区分由大型语言模型(Large Language Models, LLMs)生成与人类编写的代码时,如何准确识别其来源。传统方法往往依赖单一模态特征(如代码文本本身或二进制文件的字节表示),难以全面捕捉代码的多层级语义差异。解决方案的关键在于提出 GoCoMA——一个融合代码风格计量学(code stylometry)与二进制预执行工件(Binary Pre-Executable Artifacts, BPEA)图像表示的多模态框架,通过将不同模态嵌入投影至双曲 Poincaré 球空间,利用基于测地线余弦相似度的跨模态注意力机制(GCSA)进行信息融合,并最终映射回欧氏空间完成 LLM 源头归属任务,从而显著提升代码来源识别的准确性。
链接: https://arxiv.org/abs/2604.16377
作者: Nitin Choudhury,Bikrant Bikram Pratap Maurya,Bhavinkumar Vinodbhai Kuwar,Arun Balaji Buduru
机构: IIIT Delhi (印度信息技术研究所德里分校)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to the International Conference on Multimedia and Expo (ICME) 2026
Abstract:Large Language Models (LLMs) trained on massive code corpora are now increasingly capable of generating code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and also motivates a forensic question: ‘Who (or which LLM) wrote this piece of code?’ We present GoCoMA, a multimodal framework that models an extrinsic hierarchy between (i) code stylometry, capturing higher-level structural and stylistic signatures, and (ii) image representations of binary pre-executable artifacts (BPEA), capturing lower-level, execution-oriented byte semantics shaped by compilation and toolchains. GoCoMA projects modality embeddings into a hyperbolic Poincaré ball, fuses them via a geodesic-cosine similarity-based cross-modal attention (GCSA) fusion mechanism, and back-projects the fused representation to Euclidean space for final LLM-source attribution. Experiments on two open-source benchmarks (CoDET-M4 and LLMAuthorBench) show that GoCoMA consistently outperforms unimodal and Euclidean multimodal baselines under identical evaluation protocols.
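GoCoMA 将模态嵌入投影到双曲 Poincaré 球再做融合。下面示意其中最基础的一步:经原点的指数映射(取曲率为 1 的常见形式 exp_0(v) = tanh(‖v‖)·v/‖v‖),它保证投影结果落在单位球内;GCSA 融合的具体公式请以原文为准:

```python
# 示意:欧氏向量经原点指数映射投影到 Poincaré 单位球(曲率 c=1)。
import math

def exp_map_origin(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)                       # 原点映射到原点
    scale = math.tanh(norm) / norm           # tanh 保证结果范数 < 1
    return [x * scale for x in v]

p = exp_map_origin([3.0, 4.0])               # ||v|| = 5
assert math.sqrt(sum(x * x for x in p)) < 1.0   # 落在单位球内
```

投影只缩放范数、不改变方向,因此层级结构可以被球内半径编码。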
[NLP-237] Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis
【速读】: 该论文旨在解决如何利用风格特征(stylistic features)进行作者归属分析(authorship attribution),以支持威胁情报中的行为者分析(actor analysis)。其核心问题是:在面对大规模文本数据时,如何实现高精度、稳定且高效的作者识别,尤其是在暗网论坛等复杂场景下。解决方案的关键在于比较多种特征提取与分类方法的性能表现,发现基于BERT微调(BERT-FT)在小规模作者群体中效果最优,但当作者数量扩展至数百人时,TF-IDF结合逻辑回归(TF-IDF+LR)在准确性、训练稳定性及计算成本方面更具优势;同时,Top-k候选筛选机制有效提升了实用性,而错误分析揭示了模板化文本、主题依赖性和短文本长度是导致误判的主要因素。
链接: https://arxiv.org/abs/2604.16376
作者: Hiroshi Matsubara,Shingo Matsugaya,Taichi Aoki,Masaki Hashimoto
机构: Kagawa University (香川大学); Trend Micro, Inc. (趋势科技公司); JC3 Japan Cybercrime Control Center (日本网络犯罪控制中心)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:This study investigates the applicability of authorship attribution based on stylistic features to support actor analysis in threat intelligence. As a foundational step toward future application to dark web forums, we conducted experiments using Japanese review data from clear web sources. We constructed datasets from Rakuten Ichiba reviews and compared four methods: TF-IDF with logistic regression (TF-IDF+LR), BERT embeddings with logistic regression (BERT-Emb+LR), BERT fine-tuning (BERT-FT), and metric learning with k-nearest neighbors (Metric+kNN). Results showed that BERT-FT achieved the best performance; however, training became unstable as the number of authors scaled to several hundred, where TF-IDF+LR proved superior in terms of accuracy, stability, and computational cost. Furthermore, Top-k evaluation demonstrated the utility of candidate screening, and error analysis revealed that boilerplate text, topic dependency, and short text length were primary factors causing misclassification.
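作为基线之一的 TF-IDF+LR,其特征构造部分可以用无依赖的 Python 写出。以下仅示意“词频 × 逆文档频率”的加权思路(IDF 采用一种常见的平滑形式),实际实验通常使用 scikit-learn 的实现:

```python
# 示意:TF-IDF 特征构造的最小实现(空格分词,平滑 IDF,均为假设性简化)。
import math
from collections import Counter

def tfidf(docs):
    df = Counter()
    for d in docs:
        df.update(set(d.split()))            # 文档频率
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(d.split())              # 词频
        vecs.append({w: c * math.log((1 + n) / (1 + df[w])) for w, c in tf.items()})
    return vecs

vecs = tfidf(["good cheap item", "good service", "bad service"])
assert vecs[0]["cheap"] > vecs[0]["good"]    # 罕见词获得更高权重
```

日文文本还需先经形态素分析(如 MeCab)分词,这里为演示省略了这一步。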
[NLP-238] CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark
【速读】: 该论文旨在解决当前多模态讽刺检测(Multimodal Sarcasm Detection)研究中存在的两个核心问题:一是现有基准数据集存在标注粒度粗、文化覆盖有限的问题,制约了对细粒度语义理解的研究;二是现有模型在隐喻推理(Metaphoric Reasoning)方面表现不足,尤其在跨语言场景下。解决方案的关键在于构建首个面向中文社交媒体的细粒度多模态讽刺数据集CFMS,其包含2,796个高质量图文对,并采用三级标注框架(讽刺识别、目标识别与解释生成),从而为模型提供更精细的训练信号;同时提出一种基于强化学习增强的上下文学习策略(PGDS),通过动态优化示例选择机制,显著提升模型在关键任务上的性能表现。
链接: https://arxiv.org/abs/2604.16372
作者: Junzhao Zhang,Hsiu-Yuan Huang,Chenming Tang,Yutong Yang,Yunfang Wu
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social media. It comprises 2,796 high-quality image-text pairs and provides a triple-level annotation framework: sarcasm identification, target recognition, and explanation generation. We find that the fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent. Furthermore, we curate a high-consistency parallel Chinese-English metaphor subset (200 entries each), revealing significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, we propose a Reinforcement Learning-augmented In-Context Learning strategy (PGDS) to dynamically optimize exemplar selection. Extensive experiments demonstrate that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems, and the PGDS method significantly outperforms existing baselines on key tasks. Our data and code are available at this https URL.
[NLP-239] Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction
【速读】: 该论文旨在解决从非侵入式脑电图(EEG)信号中可靠解码自然语言句子的问题,其核心挑战在于EEG信号信噪比低、信息带宽受限,导致传统直接重建完整句子的方法难以实现。解决方案的关键在于提出“语义压缩假设”(semantic compression hypothesis),即EEG信号编码的是压缩的语义锚点而非完整的句法结构;进而设计了Brain-CLIPLM两阶段框架:第一阶段通过对比学习从EEG中提取语义锚点,第二阶段利用检索增强的大语言模型(LLM)结合思维链(Chain-of-Thought, CoT)推理进行句子重构,遵循粒度匹配原则以适配神经信息容量与解码复杂度之间的不匹配问题。该方法显著优于直接解码基线,并在跨被试评估中表现出鲁棒性,验证了语义压缩视角在非侵入式脑机接口中的有效性。
链接: https://arxiv.org/abs/2604.16370
作者: Xiaoli Yang,Huiyuan Tian,Yurui Li,Jianyu Zhang,Shijian Li,Gang Pan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Decoding natural language from non-invasive electroencephalography (EEG) remains fundamentally limited by low signal-to-noise ratio and restricted information bandwidth. This raises a fundamental question regarding whether sentence-level linguistic structure can be reliably recovered from such signals. In this work, we suggest that this assumption may not hold under realistic information constraints, and instead propose a semantic compression hypothesis in which EEG signals encode a compressed set of semantic anchors rather than full linguistic structure. Under our new perspective, direct sentence reconstruction becomes an overparameterized objective relative to the intrinsic information capacity of EEG. To address this mismatch, we introduce Brain-CLIPLM, a two-stage framework that decomposes EEG-to-text decoding into semantic anchor extraction via contrastive learning and sentence reconstruction using a retrieval-grounded large language model (LLM) with Chain-of-Thought (CoT) reasoning, following a granularity matching principle that aligns decoding complexity with neural information capacity. Evaluated on the Zurich Cognitive Language Processing Corpus, Brain-CLIPLM achieves 67.55% top-5 and 85.00% top-25 sentence retrieval accuracy, significantly outperforming direct decoding baseline, while cross-subject evaluation confirms robust generalization. Control analyses, including permutation testing, further demonstrate that EEG-derived representations carry sentence-specific information beyond language model priors. These results suggest that EEG-to-text decoding is better framed as recovering compressed semantic content rather than reconstructing full sentences, providing a biologically grounded and data-efficient pathway for non-invasive brain-computer interfaces.
[NLP-240] Why AI Readiness Is an Organizational Learning Problem Not a Technology Purchase
【速读】: 该论文试图解决的问题是:尽管全球企业对人工智能(Artificial Intelligence, AI)的投资在2024年达到2523亿美元,但仅有6%的企业报告了显著的收益影响,表明AI项目失败率高企。作者指出,这种失败本质上是一个组织学习问题,而非单纯的技术短板。解决方案的关键在于提出“Siloed-Integrated-Orchestrated (SIO)”演进模型,该模型从文化、领导力、人力资本、运营、数据架构、系统基础设施及治理与合规等五个支柱维度,系统性地刻画企业AI能力的发展阶段,并提供分阶段的可操作指导路径,从而推动组织将AI投资从技术采购转向能力构建。
链接: https://arxiv.org/abs/2604.16369
作者: Jeanne McClure,Gregg Gerdau
机构: Ars Innovate Technology and Consulting; NC State University; Matador Advisors
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 2 figures, 1 table
Abstract:Global corporate AI investment reached $252.3 billion in 2024, yet only 6% of firms report significant earnings impact. This article argues that AI project failure is fundamentally an organizational learning problem rather than a technology deficit. Drawing on a systematic synthesis of 19 large-scale industry and academic sources, including surveys of nearly 10,000 organizational leaders, we identify two categories of failure: organizational (culture, leadership alignment, governance, and human-AI learning deficits) and technical (semantic bottlenecks and output management challenges). We introduce the Siloed-Integrated-Orchestrated (SIO) progression model, which maps enterprise AI capability across five pillars – Culture & Leadership, Human Capital & Operations, Data Architecture, Systems Infrastructure, and Governance & Regulatory Compliance – and provides prescriptive guidance for advancing between stages. The implications challenge organizations to reframe AI investment as capability development rather than technology procurement.
[NLP-241] Cross-Family Speculative Decoding for Polish Language Models on Apple~Silicon: An Empirical Evaluation of Bielik~11B with UAG-Extended MLX-LM
【速读】: 该论文旨在解决跨家族大语言模型(LLM)在统一内存架构(如Apple Silicon)上进行推测解码(speculative decoding)时面临的挑战,特别是由于tokenizer不匹配导致的效率低下问题。其核心问题是:如何在不同tokenizer的模型对之间实现高效、稳定的推测解码,同时适配消费级硬件资源受限的环境。解决方案的关键在于提出通用辅助生成(Universal Assisted Generation, UAG),通过引入上下文感知的token翻译机制(context-aware token translation),有效提升接受率(acceptance rate),并结合硬件感知的速度提升公式,系统性地评估和优化跨家族推测解码在苹果硅芯片上的性能表现,从而首次实现在波兰语场景下跨家族模型的可行推测解码。
链接: https://arxiv.org/abs/2604.16368
作者: Krzysztof Fonal
机构: Wrocław University of Science and Technology (弗罗茨瓦夫理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Speculative decoding accelerates LLM inference by using a small draft model to propose k candidate tokens for a target model to verify. While effective for same-tokenizer pairs on high-bandwidth GPUs, its applicability to cross-family pairs with mismatched tokenizers and consumer-grade unified memory remains underexplored. We extend the MLX-LM framework with Universal Assisted Generation (UAG) to enable cross-tokenizer speculative decoding on Apple Silicon. We evaluate Bielik 11B-Instruct (Mistral-based) as the target model, paired with three draft models: Bielik 1.5B (Qwen-based with custom tokenizer), Qwen2.5-1.5B, and Llama 3.2-1B. Experiments on three Polish-language datasets (Wikipedia, pl_alpaca, synthetic) use draft lengths k ∈ {2, 4, 6} to compare naive and context-aware token translation. Results show: (1) context-aware translation consistently improves acceptance rates across all configurations; (2) the Polish-specialized Bielik 1.5B achieves lower acceptance than general-purpose Qwen2.5 and Llama 3.2 drafters; (3) throughput on Apple Silicon is content-dependent, reaching 1.7x speedup for structured text but failing for varied instructions; and (4) verification cost on unified memory does not amortize as theory predicts because both models are memory-bandwidth bound, making sequential drafting expensive relative to batched verification. We propose a hardware-aware speedup formula and characterize conditions for cross-family speculative decoding on Apple Silicon. This is the first systematic evaluation of cross-family speculative decoding for Polish LLMs and the first empirical study of UAG-based decoding on unified memory architectures.
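论文提出的硬件感知加速比公式建立在推测解码文献的经典期望之上:接受率为 a、草稿长度为 k 时,每轮验证期望产出 (1 − a^(k+1)) / (1 − a) 个 token。下面的示意只实现这一基础公式及一个朴素的成本折算(draft_cost_ratio 为假设参数),论文的硬件感知修正请参考原文:

```python
# 示意:推测解码每轮期望 token 数与朴素加速比估算(经典公式,非论文修正版)。

def expected_tokens_per_round(a, k):
    """接受率 a、草稿长度 k 时,单轮验证期望接受的 token 数。"""
    return (1 - a ** (k + 1)) / (1 - a)

def naive_speedup(a, k, draft_cost_ratio):
    """draft_cost_ratio: 单次草稿前向相对目标模型前向的成本(假设参数)。"""
    tokens = expected_tokens_per_round(a, k)
    return tokens / (1 + k * draft_cost_ratio)

assert abs(expected_tokens_per_round(0.5, 2) - 1.75) < 1e-9
```

论文的要点正是:在统一内存、带宽受限的场景下,分母中的验证与草稿成本不再按该理想模型摊销。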
[NLP-242] Clinical Note Bloat Reduction for Efficient LLM Use
【速读】: 该论文旨在解决电子健康记录(Electronic Health Record, EHR)中因模板化书写、复制粘贴及自动填充导致的文本冗余问题(即“病历膨胀”,note bloat),该问题不仅稀释了临床信息的有效信号,还显著增加了大语言模型(Large Language Models, LLMs)在临床决策支持应用中的计算成本。解决方案的关键在于提出一种名为TRACE的可扩展预处理流水线,其核心机制是利用EHR中的归属元数据(attribution metadata)识别模板化和复制内容,并在元数据缺失时采用基于频率的去重策略,从而高效去除冗余文本,同时保持信息抽取与临床结局预测任务的性能。
链接: https://arxiv.org/abs/2604.16364
作者: Jordan L. Cahoon,Chloe Stanwyck,Asad Aali,Rachel Madding,Emma Sun,Yixing Jiang,Renumathy Dhanasekaran,Emily Alsentzer
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Health systems are rapidly deploying large language models (LLMs) that use clinical notes for clinical decision support applications. However, modern documentation practices rely heavily on templates, copy-paste shortcuts, and auto-populated fields, producing extensive duplicated text (“note bloat”) that dilutes clinically meaningful signal and substantially increases the computational cost of LLM use. We introduce TRACE, a scalable preprocessing pipeline that removes note bloat by leveraging EHR attribution metadata to identify templated and copied content and applying frequency-based deduplication when metadata are unavailable. We evaluated TRACE across four real-world clinical cohorts spanning liver transplantation, obstetrics, and inpatient care (5.3 million notes) using blinded physician review and downstream modeling tasks. TRACE removed 47.3% of chart text while preserving performance for information extraction and clinical outcome prediction. At a large academic medical center, this reduction corresponds to an estimated $9.5 million annual decrease in LLM inference costs assuming one query per encounter. These findings show how underutilized EHR metadata can enable more scalable and cost-efficient deployment of LLM-based clinical systems.
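当归属元数据缺失时,TRACE 回退到基于频率的去重。其思路可以用下面的最小示意说明:将病历切成片段,统计片段在全量语料中的出现频次,超过阈值的视为模板文本移除(阈值与切分粒度均为假设参数,非论文原始设定):

```python
# 示意:基于频率的病历去重——高频重复片段按模板文本删除。
from collections import Counter

def frequency_dedup(notes, threshold=2):
    """notes: 每份病历为片段列表;出现次数 >= threshold 的片段被移除。"""
    freq = Counter(seg for note in notes for seg in note)
    return [[seg for seg in note if freq[seg] < threshold] for note in notes]

notes = [["TEMPLATE HEADER", "patient reports pain"],
         ["TEMPLATE HEADER", "no acute distress"]]
cleaned = frequency_dedup(notes)
assert cleaned == [["patient reports pain"], ["no acute distress"]]
```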
[NLP-243] SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics
【速读】: 该论文旨在解决多轮对话场景下大语言模型(Large Language Models, LLMs)与视觉信息融合时的安全对齐问题,特别是由于攻击者利用不断演化的视觉-文本历史信息逐步诱导模型产生不安全意图,以及长上下文导致的安全性衰减(long-context safety decay)现象。现有方法主要依赖单轮数据和固定模板对话进行安全对齐,无法适应真实多轮交互中的动态风险演化。其解决方案的关键在于提出 SaFeR-Steer 框架:通过分阶段合成引导(staged synthetic bootstrapping)与“导师在环”强化学习策略优化(tutor-in-the-loop GRPO),实现学生模型在自适应、在线策略攻击下的渐进式多轮对齐;同时引入 TCSR 方法,基于轨迹最小/平均安全性将晚轮失败信号回传至早期对话轮次,从而有效识别并纠正潜在的早期安全漏洞,显著提升模型在多轮场景下的鲁棒性和安全性表现。
链接: https://arxiv.org/abs/2604.16358
作者: Haolong Hu,Hanyu Li,Tiancheng He,Huahui Yi,An Zhang,Qiankun Li,Kun Wang,Yang Liu,Zhigang Zeng
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late-turn failures to earlier turns. I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2~10 turns. II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 → 81.84/70.77 for 3B; 56.21/60.32 → 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 → 55.58/70.27 for 3B; 24.66/46.48 → 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone. Models are available at this https URL.
[NLP-244] Annotation Entropy Predicts Per-Example Learning Dynamics in LoRA Fine-Tuning
【速读】: 该论文旨在解决预训练语言模型在微调过程中对标注不一致样本(即高注释者分歧样本)表现出异常学习行为的问题,具体表现为LoRA微调下这些样本的损失值持续上升,形成“遗忘”现象。解决方案的关键在于通过引入注释熵(annotation entropy)与每个样本的损失曲线下面积(AULC)之间的相关性分析,揭示了LoRA微调中存在一种与全参数微调显著不同的、针对争议样本的特异性退化模式,且该现象在多种模型架构和数据集上具有鲁棒性和一致性。
链接: https://arxiv.org/abs/2604.16332
作者: Brady Steele
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 12 pages, 9 figures, 6 tables
Abstract:We find that LoRA fine-tuning exhibits un-learning on contested examples: items with high annotator disagreement show increasing loss during training, a qualitatively distinct pattern largely absent under full fine-tuning and consistent across all six models tested (four encoder, two decoder-only). This discovery emerges from correlating annotation entropy, computed from ChaosNLI’s 100 labels per example, with per-example area under the loss curve (AULC) on SNLI and MNLI. The correlation is positive in all 25 conditions tested (Spearman ρ = 0.06–0.43), with decoder-only models showing stronger correlations than encoders at matched LoRA rank. The effect survives partial-correlation controls and replicates across seeds and datasets. A preliminary noise-injection experiment is consistent with these findings.
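论文的分析建立在两个逐样本统计量上:由 ChaosNLI 的 100 个人工标签计算的标注熵,以及训练过程中的损失曲线下面积(AULC)。下面用纯 Python 示意这两个量的计算(AULC 此处取梯形近似,具体定义以论文为准):

```python
# 示意:标注熵(自然对数)与逐样本 AULC(梯形近似)的计算。
import math

def annotation_entropy(label_counts):
    """label_counts: 各标签的标注人数;熵越高表示标注者分歧越大。"""
    total = sum(label_counts)
    ps = [c / total for c in label_counts if c > 0]
    return -sum(p * math.log(p) for p in ps)

def aulc(losses):
    """losses: 训练各步记录的该样本损失,梯形法近似曲线下面积。"""
    return sum((a + b) / 2 for a, b in zip(losses, losses[1:]))

assert annotation_entropy([100, 0, 0]) == 0.0   # 标注完全一致
assert annotation_entropy([34, 33, 33]) > 1.0   # 高分歧样本
```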
[NLP-245] Multimodal Claim Extraction for Fact-Checking
【速读】: 该论文旨在解决社交媒体中多模态虚假信息传播带来的自动化事实核查(Automated Fact-Checking, AFC)挑战,特别是现有方法普遍忽视了社交帖子中文本与图像(如表情包、截图和照片)协同构成的复杂语义结构,导致传统文本主导的主张提取(claim extraction)难以准确识别真实意图。为应对这一问题,作者构建了首个面向社交媒体的多模态主张提取基准数据集,并提出MICE框架——一种基于意图感知(intent-aware)的设计方案,其关键在于通过增强模型对话语意图和上下文线索的理解能力,显著提升在高语境敏感场景下的主张提取准确性。
链接: https://arxiv.org/abs/2604.16311
作者: Joycelyn Teo,Rui Cao,Zhenyun Deng,Zifeng Ding,Michael Sejr Schlichtkrull,Andreas Vlachos
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today’s misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To address this, we introduce MICE, an intent-aware framework which shows improvements in intent-critical cases.
[NLP-246] FUSE: Ensembling Verifiers with Zero Labeled Data
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在训练与实际部署中,因缺乏真实正确性标签(ground truth correctness labels)而导致的输出验证质量受限问题。当前实践中常依赖不完美的LLM评判者(LLM judges)或奖励模型进行验证,但其性能受制于标注成本高和获取困难。为此,作者提出完全无监督的评分集成方法(Fully Unsupervised Score Ensembling, FUSE),其核心创新在于通过控制验证器之间的条件依赖关系,提升一类谱算法(spectral algorithms)在无监督场景下的表现,从而在无需任何真实标签的情况下实现优于或等同于半监督方法的验证效果。实验表明,FUSE在多种生成模型、验证器及基准测试(包括GPQA Diamond、Humanity’s Last Exam和IMO Shortlist)上均展现出优异的泛化能力。
链接: https://arxiv.org/abs/2604.18547
作者: Joonhyuk Lee,Virginia Ma,Sarah Zhao,Yash Nair,Asher Spector,Regev Cohen,Emmanuel J. Candès
机构: 未知
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Verification of model outputs is rapidly emerging as a key primitive for both training and real-world deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models since ground truth acquisition can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method on both conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity’s Last Exam and IMO Shortlist questions.
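FUSE 所属的谱系(spectral)集成方法依赖一个经典恒等式:若各验证器在给定真实标签下条件独立,则两两一致率的二阶矩可分解为各自"带符号准确率"的乘积。下面用纯 Python 给出一个最小示意(并非论文原始算法:三验证器场景下以经典的三元组矩估计代替完整谱算法;准确率数值与数据均为虚构),演示零标签条件下的准确率估计:

```python
import math
import random

random.seed(0)

# 模拟三个条件独立的二值验证器(输出 +1/-1),真实准确率为虚构设定
true_acc = (0.9, 0.75, 0.6)
n = 20000
truth = [random.choice([-1, 1]) for _ in range(n)]
votes = [[t if random.random() < a else -t for t in truth] for a in true_acc]

# 两两二阶矩 M[i][j] = E[v_i v_j];条件独立时 M[i][j] ≈ b_i * b_j,
# 其中 b_i = 2*acc_i - 1(标签平衡时的带符号准确率)
def moment(i, j):
    return sum(votes[i][t] * votes[j][t] for t in range(n)) / n

M01, M02, M12 = moment(0, 1), moment(0, 2), moment(1, 2)

# 三元组矩估计:b_0 = sqrt(M01*M02/M12),其余轮换同理;全程不用真实标签
b = [
    math.sqrt(M01 * M02 / M12),
    math.sqrt(M01 * M12 / M02),
    math.sqrt(M02 * M12 / M01),
]
est_acc = [(1 + bi) / 2 for bi in b]
```

可以看到,仅凭验证器之间的两两一致率即可恢复各验证器的准确率排序与近似数值,这正是此类无监督集成方法可行的前提。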
[NLP-247] NIM4-ASR: Towards Efficient, Robust and Customizable Real-Time LLM-Based ASR
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的自动语音识别(Automatic Speech Recognition, ASR)系统在资源受限场景下的向下可扩展性不足,以及在声学挑战条件下易产生幻觉(hallucinations)的问题。解决方案的关键在于提出一个面向生产的LLM-ASR框架NIM4-ASR,其核心创新包括:首先,通过功能角色解耦明确编码器与LLM的分工,重构多阶段训练范式以对齐各模块的能力边界;其次,改进预训练架构与目标以缩小模态差距并提升参数效率,引入迭代异步监督微调(SFT)阶段以保持声学保真度并控制表征漂移,设计专用于ASR的强化学习阶段以增强识别质量与鲁棒性;最后,集成一系列生产级优化,如噪声和静音条件下的鲁棒性、实时流式推理及基于检索增强生成(Retrieval-Augmented Generation, RAG)的热词定制能力,从而实现高效且高精度的端到端语音识别。
链接: https://arxiv.org/abs/2604.18105
作者: Yuan Xie,Jiaqi Song,Guang Qiu,Xianliang Wang,Kai Qiao,Junfeng Yuan,Shengqing Liu,Yi Zhang,Bowen Chen,Ming Lei,Jie Gao,Jie Wu
机构: NIO
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:
Abstract:Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed – particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks – particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.
[NLP-248] VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech INTERSPEECH2026
【速读】: 该论文旨在解决大型音频语言模型(Large Audio-Language Models, LALMs)在真实应用场景中存在生成偏见(generative bias)的问题,尤其是现有评估方法因依赖合成语音和多项选择题(Multiple-Choice Questions, MCQs)而无法全面反映模型在开放场景下的公平性表现。解决方案的关键在于提出VIBE框架,通过开放式的个性化推荐等任务,利用真实人类录音进行评估,使社会刻板印象能够自然浮现而非受限于预设选项,从而更真实、可扩展地揭示LALMs中的系统性偏见。实验表明,性别线索比口音线索更容易引发分布偏移,说明当前LALMs仍会复制社会刻板印象。
链接: https://arxiv.org/abs/2604.17248
作者: Yi-Cheng Lin,Yusuke Hirota,Sung-Feng Huang,Hung-yi Lee
机构: National Taiwan University (国立台湾大学); NVIDIA (英伟达)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Submitted to INTERSPEECH 2026
Abstract:Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both offering a fragmented view of fairness. We propose VIBE, a framework that evaluates generative bias through open-ended tasks such as personalized recommendations, using real-world human recordings. Unlike MCQs, our method allows stereotypical associations to manifest organically without predefined options, making it easily extensible to new tasks. Evaluating 11 state-of-the-art LALMs reveals systematic biases in realistic scenarios. We find that gender cues often trigger larger distributional shifts than accent cues, indicating that current LALMs reproduce social stereotypes.
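VIBE 对"分布偏移"的度量思路可以用一个极简指标示意:对同一提示在不同说话人群体下得到的开放式推荐分布,计算总变差距离(total variation distance)。以下推荐数据为假设性玩具样本(仅演示指标计算,并非论文数据或结论依据):

```python
from collections import Counter

def distribution(recs):
    """某一说话人群体得到的推荐项的经验分布"""
    counts = Counter(recs)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}

def total_variation(p, q):
    """两个离散分布间的总变差距离(0 表示完全一致,1 表示完全不重叠)"""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)

# 同一提示、不同(虚构)说话人群体下模型给出的职业推荐
recs_group_a = ["nurse", "teacher", "nurse", "designer", "teacher", "nurse"]
recs_group_b = ["engineer", "pilot", "engineer", "teacher", "engineer", "pilot"]
recs_group_c = ["engineer", "pilot", "engineer", "teacher", "engineer", "nurse"]

# 把 a/b 之差类比"性别线索"、b/c 之差类比"口音线索"(纯演示性类比)
shift_gender = total_variation(distribution(recs_group_a), distribution(recs_group_b))
shift_accent = total_variation(distribution(recs_group_b), distribution(recs_group_c))
```

开放式输出经由此类分布层面的对比,刻板印象关联无需预设选项即可显现,这正是该文相对 MCQ 评测的差异所在。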
[NLP-249] TokenChain: A Discrete Speech Chain via Semantic Token Modeling ICASSP
【速读】: 该论文旨在解决自动语音识别(ASR)与文本到语音合成(TTS)系统在联合训练中难以协同优化的问题,尤其关注如何通过离散化表示(token interface)实现端到端的反馈机制以提升整体性能。其解决方案的关键在于提出TokenChain架构:一个完全离散的语音链路,耦合语义级ASR与两级TTS流程——首先通过与ASR共同训练的自回归文本到语义模型生成中间表示,再由掩码生成式语义到声学模型进行合成;同时利用直通argmax/Gumbel-Softmax实现跨模块的梯度传递,并通过动态权重平均平衡监督信号,从而在保持稳定文本到语音(T2S)质量的同时显著加速收敛并降低错误率。
链接: https://arxiv.org/abs/2510.06201
作者: Mingxuan Wang,Satoshi Nakamura
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: 5 pages, 3 figures. Submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Abstract:Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting, showing that chain learning remains effective with token interfaces and models.
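摘要中的 straight-through argmax/Gumbel-Softmax 是跨离散词元接口回传梯度的常用技巧。下面用纯 Python 演示其前向采样部分(软样本 + 硬 one-hot;真实实现需在自动微分框架中"前向传硬、反向对软求梯度",此处 logits 为虚构示例):

```python
import math
import random

random.seed(7)

def gumbel_softmax(logits, tau=1.0):
    """Gumbel-Softmax 前向:返回软样本与硬 one-hot。
    straight-through 技巧在自动微分框架中前向使用 hard、
    反向对 soft 求梯度;此处仅演示采样本身。"""
    gumbels = [-math.log(-math.log(random.random())) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)  # 数值稳定的 softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    soft = [e / z for e in exps]
    top = soft.index(max(soft))
    hard = [1.0 if i == top else 0.0 for i in range(len(soft))]
    return soft, hard

logits = [2.0, 0.5, -1.0]  # 假想的 ASR 端语义词元 logits
soft, hard = gumbel_softmax(logits, tau=0.5)
```

温度 tau 越低,软样本越接近 one-hot,离散接口两侧(ASR 与 TTS)由此得以端到端联合训练。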
信息检索
[IR-0] MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval ICLR2026
【速读】:该论文旨在解决当前大型语言模型和多模态模型在数学问题求解能力上面临的评估局限性问题,即现有基准测试在规模、语言覆盖范围和任务多样性方面存在不足。其解决方案的关键在于构建了MathNet——一个高质量、大规模、多模态且多语言的奥林匹克级别数学问题数据集,并首次提出了针对数学问题检索能力的基准评测体系。MathNet涵盖47个国家、17种语言及二十年竞赛题目,包含30,676道专家编写的问题及其解答,支持三大任务:(i)问题求解、(ii)数学感知检索、(iii)检索增强的问题求解。实验表明,即使是最先进的推理模型(如Gemini-3.1-Pro达到78.4%准确率)仍面临挑战,而嵌入式检索模型难以识别数学等价问题;同时,检索质量显著影响生成性能,例如DeepSeek-V3.2-Speciale通过检索增强获得最高得分,提升达12%。这一工作为数学推理与检索提供了系统性评估框架和高质量资源。
链接: https://arxiv.org/abs/2604.18584
作者: Shaden Alshammari,Kevin Wen,Abrar Zainal,Mark Hamilton,Navid Safaei,Sultan Albarakati,William T. Freeman,Antonio Torralba
机构: MIT (麻省理工学院); KAUST (沙特阿卜杜拉国王科技大学); HUMAIN; Individual Researcher (独立研究员)
类目: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: ICLR 2026; Website: this http URL
Abstract:Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at this https URL. 
[IR-1] Document-as-Image Representations Fall Short for Scientific Retrieval
【速读】:该论文旨在解决当前科学文献检索中因依赖文档图像表示(document-as-image)而导致的语义信息丢失问题,尤其是在文本密集型多模态科学文档中,关键证据常分散于结构化内容(如文本、表格和图表)中,而图像表示无法有效捕捉这些细粒度信息。解决方案的关键在于引入ArXivDoc基准数据集,该数据集基于科学论文的LaTeX源文件构建,可直接访问结构化元素(如章节、表格、图示和公式),从而支持基于特定证据类型的可控查询设计;同时通过系统比较纯文本、图像和图文混合表示在单向量与多向量检索模型中的表现,发现:(1)文档图像表示在长文档中性能显著下降;(2)纯文本表示最有效,即使针对图表类查询也能通过标题和上下文实现高精度匹配;(3)图文交错表示优于文档图像方法,且无需特殊训练即可提升检索效果。
链接: https://arxiv.org/abs/2604.18508
作者: Ghazal Khalighinejad,Raghuveer Thirukovalluru,Alexander H. Oh,Bhuwan Dhingra
机构: Duke University (杜克大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
[IR-2] Context-Aware Search and Retrieval Under Token Erasure
【速读】:该论文旨在解决在检索增强生成(Retrieval-Augmented Generation, RAG)系统中,当查询表示因token擦除(token erasures)导致部分信息丢失时,如何保证远程文档检索的可靠性问题。其核心解决方案是基于信息论分析,提出一种语义自适应冗余分配机制:根据查询特征的重要性动态分配冗余,使语义重要特征获得更高冗余保护;在此基础上,利用TF-IDF加权相似度进行检索,并通过理论证明相似度边际向量收敛至多元高斯分布,从而给出可计算的检索错误概率上界。这一方法显著提升了在部分查询信息缺失场景下的检索鲁棒性。
链接: https://arxiv.org/abs/2604.18424
作者: Sara Ghasvarianjahromi,Joshua Barr,Yauhen Yakimenka,Jörg Kliewer
机构: 未知
类目: Information Retrieval (cs.IR); Information Theory (cs.IT)
备注:
Abstract:This paper introduces and analyzes a search and retrieval model for RAG-like systems under token erasures. We provide an information-theoretic analysis of remote document retrieval when query representations are only partially preserved. The query is represented using term-frequency-based features, and semantically adaptive redundancy is assigned according to feature importance. Retrieval is performed using TF-IDF-weighted similarity. We characterize the retrieval error probability by showing that the vector of similarity margins converges to a multivariate Gaussian distribution, yielding an explicit approximation and computable upper bounds. Numerical results support the analysis, while a separate data-driven evaluation using embedding-based retrieval on real-world data shows that the same importance-aware redundancy principles extend to modern retrieval pipelines. Overall, the results show that assigning higher redundancy to semantically important query features improves retrieval reliability.
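该文"按特征重要性分配冗余"的结论可用一个小型蒙特卡洛实验示意:对查询中语义关键的词元多发送几份副本,随机擦除后再用 TF-IDF 加权相似度检索。以下语料、擦除概率与预算均为虚构参数,仅演示机制:

```python
import math
import random

random.seed(1)

docs = {
    "d1": {"market", "stocks", "volatility", "hedge"},
    "d2": {"market", "stocks", "dividends", "growth"},
    "d3": {"weather", "storm", "forecast", "wind"},
}

def idf(term):
    df = sum(term in toks for toks in docs.values())
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def retrieve(query_terms):
    """TF-IDF 加权重叠度检索,返回得分最高的文档(并列时取靠前者)"""
    return max(docs, key=lambda d: sum(idf(t) for t in query_terms if t in docs[d]))

query = ["market", "stocks", "dividends"]           # "dividends" 是关键判别词
uniform = query * 2                                  # 均匀冗余,总预算 6 个词元
adaptive = ["market", "stocks"] + ["dividends"] * 4  # 重要性加权冗余,同样预算 6

def success_rate(encoding, target="d2", p_erase=0.4, trials=2000):
    hits = 0
    for _ in range(trials):
        surviving = {t for t in encoding if random.random() > p_erase}
        hits += retrieve(surviving) == target
    return hits / trials

rate_uniform = success_rate(uniform)
rate_adaptive = success_rate(adaptive)
```

同等冗余预算下,把副本倾斜给判别性词元可显著提高关键信号在擦除后的存活率,与该文"重要特征高冗余提升检索可靠性"的结论一致。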
[IR-3] ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
【速读】:该论文旨在解决长文本生成场景下检索增强生成(Retrieval-Augmented Generation, RAG)系统因检索到的证据存在噪声或矛盾而导致事实一致性难以保证的问题。现有方法通常将冲突解决与生成过程耦合,导致可靠性不足。其解决方案的关键在于提出ArbGraph框架,该框架在生成前对证据进行显式的仲裁:首先将检索到的文档分解为原子性陈述,并构建一个包含支持与矛盾关系的冲突感知证据图(evidence graph);在此基础上引入强度驱动的迭代仲裁机制,通过证据间的相互作用传播可信度信号,从而在生成前抑制不可靠和不一致的陈述。这一设计实现了证据验证与文本生成的解耦,为下游长文本生成提供了结构化且一致的证据基础,显著提升了事实召回率、信息密度并降低了幻觉和对检索噪声的敏感性。
链接: https://arxiv.org/abs/2604.18362
作者: Qingying Niu,Yuhao Wang,Ruiyang Ren,Bohui Fang,Wayne Xin Zhao
机构: Gaoling School of Artificial Intelligence (人工智能学院); Renmin University of China (中国人民大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 23 pages, 4 figures
Abstract:Retrieval-augmented generation (RAG) remains unreliable in long-form settings, where retrieved evidence is noisy or contradictory, making it difficult for RAG pipelines to maintain factual consistency. Existing approaches focus on retrieval expansion or verification during generation, leaving conflict resolution entangled with generation. To address this limitation, we propose ArbGraph, a framework for pre-generation evidence arbitration in long-form RAG that explicitly resolves factual conflicts. ArbGraph decomposes retrieved documents into atomic claims and organizes them into a conflict-aware evidence graph with explicit support and contradiction relations. On top of this graph, we introduce an intensity-driven iterative arbitration mechanism that propagates credibility signals through evidence interactions, enabling the system to suppress unreliable and inconsistent claims before final generation. In this way, ArbGraph separates evidence validation from text generation and provides a coherent evidence foundation for downstream long-form generation. We evaluate ArbGraph on two widely used long-form RAG benchmarks, LongFact and RAGChecker, using multiple large language model backbones. Experimental results show that ArbGraph consistently improves factual recall and information density while reducing hallucinations and sensitivity to retrieval noise. Additional analyses show that these gains are evident under conflicting or ambiguous evidence, highlighting the effectiveness of evidence-level conflict resolution for improving the reliability of long-form RAG. The implementation is publicly available at this https URL.
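ArbGraph 的"强度驱动迭代仲裁"可用如下玩具实现示意:把原子陈述组织成带支持/矛盾边的证据图,迭代传播可信度,使被多方矛盾的陈述在生成之前被抑制(图结构、阻尼系数等均为虚构,仅演示机制,并非论文原始实现):

```python
import math

# 玩具证据图:原子陈述 + 支持(+1)/矛盾(-1)边
claims = ["c1", "c2", "c3", "c4"]
# (src, dst, sign):sign = +1 表示 src 支持 dst,-1 表示矛盾
edges = [
    ("c1", "c2", 1), ("c3", "c2", 1),    # c2 获得两处佐证
    ("c1", "c4", -1), ("c2", "c4", -1),  # c4 与 c1/c2 一簇证据冲突
]

def arbitrate(claims, edges, iters=20, damping=0.5):
    """同步迭代传播可信度:支持边抬升、矛盾边压低目标陈述"""
    cred = {c: 0.5 for c in claims}  # 初始中性可信度
    for _ in range(iters):
        new = {}
        for c in claims:
            signal = sum(sign * cred[src] for src, dst, sign in edges if dst == c)
            new[c] = (1 - damping) * cred[c] + damping / (1 + math.exp(-signal))
        cred = new
    return cred

cred = arbitrate(claims, edges)
kept = [c for c in claims if cred[c] >= 0.5]  # 仲裁后供生成使用的陈述
```

多处被矛盾的 c4 可信度收敛到 0.5 以下而被过滤,获得佐证的 c2 则被强化,对应该文"生成前抑制不一致陈述"的设计。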
[IR-4] Balanced Co-Clustering of Users and Items for Embedding Table Compression in Recommender Systems SIGIR2026
【速读】:该论文旨在解决推荐系统中嵌入表(embedding table)参数量庞大导致的计算与内存开销问题,尤其是在工业级部署时资源受限背景下,现有压缩方法往往在精度损失或计算成本上难以平衡。其解决方案的关键在于提出BACO框架,通过挖掘用户-物品交互中的协同信号进行用户和物品的联合聚类(balanced co-clustering),使得相似用户/物品共享同一码本(codebook)中的嵌入向量;具体地,BACO设计了一个兼顾簇内连通性最大化与簇体积平衡的优化目标,并借助图聚类理论统一建模,同时引入合理的权重机制、高效的标签传播求解器及次级用户簇以避免码本坍塌(codebook collapse),从而实现高效率且低精度损失的嵌入压缩。
链接: https://arxiv.org/abs/2604.18351
作者: Runhao Jiang,Renchi Yang,Donghao Wu
机构: Hong Kong Baptist University (香港浸会大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 14 pages, The technical report for the paper titled “Balanced Co-Clustering of Users and Items for Embedding Table Compression in Recommender Systems” in SIGIR 2026
Abstract:Recommender systems have advanced markedly over the past decade by transforming each user/item into a dense embedding vector with deep learning models. At industrial scale, embedding tables constituted by such vectors of all users/items demand a vast amount of parameters and impose heavy compute and memory overhead during training and inference, hindering model deployment under resource constraints. Existing solutions towards embedding compression either suffer from severely compromised recommendation accuracy or incur considerable computational costs. To mitigate these issues, this paper presents BACO, a fast and effective framework for compressing embedding tables. Unlike traditional ID hashing, BACO is built on the idea of exploiting collaborative signals in user-item interactions for user and item groupings, such that similar users/items share the same embeddings in the codebook. Specifically, we formulate a balanced co-clustering objective that maximizes intra-cluster connectivity while enforcing cluster-volume balance, and unify canonical graph clustering techniques into the framework through rigorous theoretical analyses. To produce effective groupings while averting codebook collapse, BACO instantiates this framework with a principled weighting scheme for users and items, an efficient label propagation solver, as well as secondary user clusters. Our extensive experiments comparing BACO against full models and 18 baselines over benchmark datasets demonstrate that BACO cuts embedding parameters by over 75% with a drop of at most 1.85% in recall, while surpassing the strongest baselines by being up to 346X faster. 
[IR-5] DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion
【速读】:该论文旨在解决文档内搜索(in-document search)中的查询自动补全(Query Auto-Completion, DocQAC)问题,即如何在长文档场景下提升用户构建高效、精准查询的能力,尤其针对复杂或拼写困难的术语。与传统网页搜索查询补全(WebQAC)不同,DocQAC能利用文档特定上下文信息(如文档内容和用户历史交互),从而实现更精准的补全建议。其解决方案的关键在于提出一种基于自适应Trie引导的解码框架(adaptive trie-guided decoding framework),该框架通过用户查询前缀软性引导语言模型生成高质量补全结果,并引入可调超参数的自适应惩罚机制,在模型置信度与Trie结构引导之间实现权衡;同时结合检索增强生成(RAG)及轻量级文档信号(如标题、关键词、摘要)以高效融入文档上下文。实验表明,该方法在T5和BART等编码器-解码器模型上优于强基线,甚至超越更大规模的指令微调模型(如LLaMA-3和Phi-3),验证了其在实际部署中对效率和可扩展性的优势。
链接: https://arxiv.org/abs/2604.18257
作者: Rahul Mehta,Kavin R V,Indrajit Pal,Tushar Abhishek,Pawan Goyal,Manish Gupta
机构: Microsoft Corporation(微软公司); Indian Institute of Technology Kharagpur(印度理工学院克哈格普尔分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Query auto-completion (QAC) has been widely studied in the context of web search, yet remains underexplored for in-document search, which we term DocQAC. DocQAC aims to enhance search productivity within long documents by helping users craft faster, more precise queries, even for complex or hard-to-spell terms. While global historical queries are available to both WebQAC and DocQAC, DocQAC uniquely accesses document-specific context, including the current document’s content and its specific history of user query interactions. To address this setting, we propose a novel adaptive trie-guided decoding framework that uses user query prefixes to softly steer language models toward high-quality completions. Our approach introduces an adaptive penalty mechanism with tunable hyperparameters, enabling a principled trade-off between model confidence and trie-based guidance. To efficiently incorporate document context, we explore retrieval-augmented generation (RAG) and lightweight contextual document signals such as titles, keyphrases, and summaries. When applied to encoder-decoder models like T5 and BART, our trie-guided framework outperforms strong baselines and even surpasses much larger instruction-tuned models such as LLaMA-3 and Phi-3 on seen queries across both seen and unseen documents. This demonstrates its practicality for real-world DocQAC deployments, where efficiency and scalability are critical. We evaluate our method on a newly introduced DocQAC benchmark derived from ORCAS, enriched with query-document pairs. We make both the DocQAC dataset (this https URL) and code (this https URL) publicly available.
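文中的"自适应惩罚 + Trie 软引导解码"可用字符级玩具模型示意:每步解码得分为模型对数概率减去"离开 Trie"的惩罚项,惩罚系数即权衡模型置信度与 Trie 引导的超参数。下例中的查询集合与玩具语言模型均为虚构,仅演示机制:

```python
import math

# 假想的文档内历史查询集合
stored_queries = ["quarterly revenue", "quarterly report", "risk factors"]

ALPHABET = "abcdefghijklmnopqrstuvwxyz $"  # '$' 为查询结束符

def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def toy_lm(_text):
    """玩具语言模型:对 'e' 有轻微偏好的下一字符对数概率"""
    base = -math.log(len(ALPHABET))
    return {ch: base + (1.0 if ch == "e" else 0.0) for ch in ALPHABET}

def decode(prefix, trie, penalty, max_len=30):
    """贪心解码:得分 = 模型对数概率 - (离开 Trie 的惩罚)"""
    node = trie
    for ch in prefix:  # 沿用户前缀下行;离开 Trie 则 node 置空(软引导)
        node = node.get(ch) if node is not None else None
    out = prefix
    while len(out) < max_len:
        lp = toy_lm(out)
        scores = {ch: lp[ch] - (0.0 if node is not None and ch in node else penalty)
                  for ch in ALPHABET}
        best = max(scores, key=scores.get)
        if best == "$":
            break
        out += best
        node = node.get(best) if node is not None else None
    return out

trie = build_trie(stored_queries)
guided = decode("quarterly re", trie, penalty=5.0)  # Trie 软引导
free = decode("quarterly re", trie, penalty=0.0)    # 纯模型,易漂移
```

惩罚为 0 时解码退化为纯模型行为;惩罚增大时补全被软性拉回历史查询集合,对应该文可调的"置信度-引导"折中。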
[IR-6] Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies ECIR2026
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中多跳推理(multi-hop reasoning)评估不足的问题,尤其是现有方法普遍聚焦于单跳检索场景,难以有效衡量多个看似无关的上下文在组合后对答案生成的关键作用。其解决方案的核心是提出一种名为“上下文感知的检索器评估”(Context-Aware Retriever Evaluation, CARE)的新策略,通过引入对多跳查询中上下文依赖关系的显式建模,显著提升了对复杂推理任务的评估准确性。实验表明,CARE 在参数量更大、上下文窗口更长的大语言模型(LLMs)上表现最优,尤其在多跳查询场景下效果突出,而单跳查询则对此类上下文感知评估不敏感,验证了其在复杂问答任务中的必要性和有效性。
链接: https://arxiv.org/abs/2604.18234
作者: Lorenz Brehme,Thomas Ströhle,Ruth Breu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 15 Pages, Accepted for publication at the SynIRgy Workshop, ECIR 2026 (48th European Conference on Information Retrieval)
Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems-particularly the retriever component-remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be most effectively evaluated in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google demonstrate that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The performance gains are most pronounced in models with larger parameter counts and longer context windows, while single-hop queries show minimal sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. To ensure reproducibility, we provide the complete data of our experiments at this https URL.
[IR-7] Multi-LLM Token Filtering and Routing for Sequential Recommendation
【速读】:该论文旨在解决如何在不依赖外部文本语料库的情况下,有效利用大语言模型(Large Language Models, LLMs)的词元嵌入(token embeddings)来提升序列推荐系统性能的问题。现有方法通常依赖外部文本数据对齐LLM与推荐模型,而本文指出单纯注入单一LLM的词元嵌入会导致性能不稳定或提升有限,原因在于语义错位、任务适配不足及单个LLM覆盖范围受限。其解决方案的关键在于提出MLTFR框架——一个无语料库的多LLM词元过滤与路由机制,通过用户引导的词元过滤抑制噪声词汇信号,并采用基于Fisher权重的专家混合(Mixture-of-Experts)架构融合多个LLM的语义空间,实现互补知识聚合与稳定的知识集成,从而无需修改推荐模型主干即可显著提升推荐效果。
链接: https://arxiv.org/abs/2604.18200
作者: Wuhan Chen,Min Gao,Xin Xia,Zongwei Wang,Wentao Li,Shane Culpepper
机构: Chongqing University (重庆大学); The University of Queensland (昆士兰大学); University of Leicester (莱斯特大学)
类目: Information Retrieval (cs.IR)
备注: 11 pages,3 figs
Abstract:Large language models (LLMs) have recently shown promise in recommendation by providing rich semantic knowledge. While most existing approaches rely on external textual corpora to align LLMs with recommender systems, we revisit a more fundamental yet underexplored question: Can recommendation benefit from LLM token embeddings alone without textual input? Through a systematic empirical study, we show that directly injecting token embeddings from a single LLM into sequential recommenders leads to unstable or limited gains, due to semantic misalignment, insufficient task adaptation, and the restricted coverage of individual LLMs. To address these challenges, we propose MLTFR, a Multi-LLM Token Filtering and Routing framework for corpus-free sequential recommendation. MLTFR follows an interaction-guided LLM knowledge integration paradigm, where task-relevant token embeddings are selected via user-guided token filtering to suppress noisy and irrelevant vocabulary signals. To overcome the limitations of single-LLM representations, MLTFR integrates multiple LLM token spaces through a Mixture-of-Experts architecture, with a Fisher-weighted semantic consensus expert to balance heterogeneous experts and prevent domination during training. By jointly filtering informative tokens and aggregating complementary semantic knowledge across multiple LLMs, MLTFR enables stable and effective utilization of LLM token embeddings without textual inputs or backbone modification. Extensive experiments demonstrate that MLTFR consistently outperforms state-of-the-art sequential recommendation baselines and existing alignment methods. Our code is available at: this https URL.
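MLTFR 的两步机制——用户引导的词元过滤与多 LLM 语义空间加权融合——可用如下玩具代码示意(词元嵌入、用户向量均为虚构;论文的 Fisher 加权细节未公开,此处以逆方差加权作为假设性替身):

```python
# 两个"LLM 词表空间"中部分词元的玩具嵌入(数值虚构)
token_embs = {
    "llm_a": {"phone": [0.9, 0.1], "case": [0.8, 0.3], "the": [0.1, 0.1]},
    "llm_b": {"phone": [0.7, 0.2], "case": [0.6, 0.4], "the": [0.0, 0.2]},
}
user_profile = [1.0, 0.3]  # 由交互历史得到的用户向量(虚构)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filter_tokens(embs, profile, k=2):
    """用户引导的词元过滤:按与用户向量的内积保留 top-k,滤除噪声词元"""
    ranked = sorted(embs, key=lambda t: dot(embs[t], profile), reverse=True)
    return ranked[:k]

kept_a = filter_tokens(token_embs["llm_a"], user_profile)
kept_b = filter_tokens(token_embs["llm_b"], user_profile)

# 以(虚构的)任务方差近似一致性权重:方差小的 LLM 专家权重大,
# 防止单一专家主导(论文中为 Fisher 加权,此处为逆方差替身)
var = {"llm_a": 0.1, "llm_b": 0.4}
w = {k: 1.0 / v for k, v in var.items()}
z = sum(w.values())
w = {k: wi / z for k, wi in w.items()}
fused = {t: [sum(w[m] * token_embs[m][t][d] for m in token_embs) for d in range(2)]
         for t in kept_a}
```

停用词式的弱信号词元("the")在过滤阶段即被剔除,其余词元的多空间嵌入按专家可靠性加权合并。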
[IR-8] Modular Representation Compression: Adapting LLMs for Efficient and Effective Recommendations SIGIR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推荐系统中应用时,其高维表示压缩效率低下的问题。研究发现了一个反直觉现象——中间层表示优势(Mid-layer Representation Advantage, MRA),即LLM的中间层表示在推荐任务中优于最终层表示,而现有压缩方法通常基于最终层,导致性能受限。解决方案的关键在于提出一种名为Modular Representation Compression (MARC) 的新框架,通过两个核心机制:一是模块化调整(Modular Adjustment),显式引入压缩与任务适配模块,使LLM专注于表征学习;二是模块化任务解耦(Modular Task Decoupling),利用信息约束和差异化网络结构解耦不同任务,从而控制LLM内部功能模块化,有效缓解MRA并提升压缩效率。实验表明,MARC在大规模商业搜索广告场景中实现了2.82%的eCPM提升。
链接: https://arxiv.org/abs/2604.18146
作者: Yunjia Xi,Menghui Zhu,Jianghao Lin,Bo Chen,Ruiming Tang,Yong Yu,Weinan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: SIGIR 2026
Abstract:Recently, large language models (LLMs) have advanced recommendation systems (RSs), and recent works have begun to explore how to integrate LLMs into industrial RSs. While most approaches deploy LLMs offline to generate and pre-cache augmented representations for RSs, high-dimensional representations from LLMs introduce substantial storage and computational costs. Thus, it is crucial to compress LLM representations effectively. However, we identify a counterintuitive phenomenon during representation compression: Mid-layer Representation Advantage (MRA), where representations from middle layers of LLMs outperform those from final layers in recommendation tasks. This degraded final layer renders existing compression methods, which typically compress on the final layer, suboptimal. We interpret this through modularity theory: LLMs develop spontaneous internal functional modularity that forces the final layer to specialize in the proxy training task. Thus, we propose Modular Representation Compression (MARC) to explicitly control the modularity of LLMs. First, Modular Adjustment explicitly introduces compression and task adaptation modules, enabling the LLM to operate strictly as a representation-learning module. Next, to ground each module to its specific task, Modular Task Decoupling uses information constraints and different network structures to decouple tasks. Extensive experiments validate that MARC addresses MRA and produces efficient representations. Notably, MARC achieved a 2.82% eCPM lift in an online A/B test within a large-scale commercial search advertising scenario.
[IR-9] The Collaboration Gap in Human-AI Work
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在实际应用中作为协作伙伴时,其合作效果往往不稳定、易失效的问题。作者指出,尽管LLMs被广泛视为编程、设计、写作和分析等任务中的合作者,但用户常需反复诊断误解、重构缺失假设并修复响应偏差,导致协作效率低下。论文的核心解决方案在于提出一个“接地条件”(grounding conditions)的概念框架,强调稳定的人机协作不仅依赖于模型能力,更取决于交互过程中信息共享、理解一致性和修复机制的建立。通过分析16位设计师、开发者和AI实践者的访谈数据,作者识别出三种典型的人机工作结构:一次性辅助、弱协作下的不对称修复以及基于充分接地的协作,并指出协作失败的根本原因在于“伙伴关系的表象超前于交互接地能力”,从而为评估与优化LLM-enabled工作流程提供了理论基础和实践指导。
链接: https://arxiv.org/abs/2604.18096
作者: Varad Vishwarupe,Marina Jirotka,Nigel Shadbolt,Ivan Flechais
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted as a conference paper at ECSCW 2026, Germany
Abstract:LLMs are increasingly presented as collaborators in programming, design, writing, and analysis. Yet the practical experience of working with them often falls short of this promise. In many settings, users must diagnose misunderstandings, reconstruct missing assumptions, and repeatedly repair misaligned responses. This poster introduces a conceptual framework for understanding why such collaboration remains fragile. Drawing on a constructivist grounded theory analysis of 16 interviews with designers, developers, and applied AI practitioners working on LLM-enabled systems, and informed by literature on human-AI collaboration, we argue that stable collaboration depends not only on model capability but on the interaction’s grounding conditions. We distinguish three recurrent structures of human-AI work: one-shot assistance, weak collaboration with asymmetric repair, and grounded collaboration. We propose that collaboration breaks down when the appearance of partnership outpaces the grounding capacity of the interaction and contribute a framework for discussing grounding, repair, and interaction structure in LLM-enabled work.
[IR-10] Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints
【速读】:该论文旨在解决生成式 AI(Generative AI)在中小企业(SMEs)金融分析场景中部署时面临的实际挑战,即如何在资源受限环境下(如无云GPU预算、缺乏AI团队和API级推理能力)实现高效且准确的推理架构选择。其关键解决方案在于提出一种面向SME约束的实验设置,使用本地部署的8B参数指令微调模型进行系统性评估,并对比四种推理架构:基础大语言模型(LLM)、检索增强生成(Retrieval-Augmented Generation, RAG)、结构化长期记忆与记忆增强对话推理。研究发现,在确定性、操作数显式任务中,结构化记忆能提升精度;而在对话式、参考隐含场景下,检索方法表现更优。基于此,作者进一步提出一个混合部署框架,动态选择推理策略以平衡数值准确性、可审计性和基础设施效率,为资源受限环境下的金融AI落地提供可行路径。
链接: https://arxiv.org/abs/2604.17979
作者: Jianan Liu,Jing Yang,Xianyou Li,Weiran Yan,Yichao Wu,Penghao Liang,Mengwei Yuan
机构: 未知
类目: Information Retrieval (cs.IR)
备注: Accepted at the 2026 6th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA 2026), to be published by IEEE. 12 pages, 5 figures
Abstract:The rapid adoption of artificial intelligence (AI) and large language models (LLMs) is transforming financial analytics by enabling natural language interfaces for reporting, decision support, and automated reasoning. However, limited empirical understanding exists regarding how different LLM-based reasoning architectures perform across realistic financial workflows, particularly under the cost, accuracy, and compliance constraints faced by small and medium-sized enterprises (SMEs). SMEs typically operate within severe infrastructure constraints, lacking cloud GPU budgets, dedicated AI teams, and API-scale inference capacity, making architectural efficiency a first-class concern. To ensure practical relevance, we introduce an explicit SME-constrained evaluation setting in which all experiments are conducted using a locally hosted 8B-parameter instruction-tuned model without cloud-scale infrastructure. This design isolates the impact of architectural choices within a realistic deployment environment. We systematically compare four reasoning architectures: baseline LLM, retrieval-augmented generation (RAG), structured long-term memory, and memory-augmented conversational reasoning across both FinQA and ConvFinQA benchmarks. Results reveal a consistent architectural inversion: structured memory improves precision in deterministic, operand-explicit tasks, while retrieval-based approaches outperform memory-centric methods in conversational, reference-implicit settings. Based on these findings, we propose a hybrid deployment framework that dynamically selects reasoning strategies to balance numerical accuracy, auditability, and infrastructure efficiency, providing a practical pathway for financial AI adoption in resource-constrained environments.
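The hybrid deployment idea above, selecting a reasoning architecture per query, can be sketched as a simple router. The heuristic, function names, and threshold below are illustrative assumptions for exposition, not the authors' implementation:

```python
# Hypothetical sketch of hybrid architecture routing for financial QA.
# Assumption: "operand-explicit" queries contain the numbers to compute over,
# while conversational turns need retrieval to resolve implicit references.
import re

def route_query(query: str, has_dialogue_history: bool) -> str:
    """Pick a reasoning strategy under SME compute constraints."""
    # Operand-explicit: at least two numeric operands appear in the query itself.
    has_explicit_operands = len(re.findall(r"\d[\d,.]*", query)) >= 2
    if has_explicit_operands and not has_dialogue_history:
        return "structured_memory"   # deterministic, precision-oriented path
    if has_dialogue_history:
        return "rag"                 # reference-implicit, retrieval wins here
    return "baseline_llm"            # fall back to the plain 8B model

print(route_query("What is revenue growth from 120.5M to 140.2M?", False))
print(route_query("And what about the following year?", True))
```

A production router would replace the regex heuristic with the paper's task signals, but the control flow (memory for deterministic tasks, retrieval for conversational ones) mirrors the reported architectural inversion.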
[IR-11] Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring for Dense Passage Retrieval ACL2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在段落检索中因计算成本高而导致的预算受限全局优化问题,特别是现有方法依赖第一阶段稠密检索器时所引发的两个局限:一是难以从语义差异显著的簇中检索相关段落,二是无法将相关性信号传播至整个语料库。解决方案的关键在于提出贝叶斯主动学习框架BAGEL(Bayesian Active Learning with Gaussian Processes guided by LLM relevance scoring),其通过基于LLM相关性评分构建查询特定的高斯过程(Gaussian Process, GP)来建模整个嵌入空间中的多模态相关性分布,并迭代地选择需评分的段落,以策略性地平衡高置信度区域的利用与不确定区域的探索,从而实现对复杂相关性分布的有效全局探索,在相同LLM预算下优于现有的LLM重排序方法。
链接: https://arxiv.org/abs/2604.17906
作者: Junyoung Kim,Anton Korikov,Jiazhou Liang,Justin Cui,Yifan Simon Liu,Qianfeng Wen,Mark Zhao,Scott Sanner
机构: Sungkyunkwan University (成均馆大学); University of Toronto (多伦多大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings
Abstract:While Large Language Models (LLMs) exhibit exceptional zero-shot relevance modeling, their high computational cost necessitates framing passage retrieval as a budget-constrained global optimization problem. Existing approaches passively rely on first-stage dense retrievers, which leads to two limitations: (1) failing to retrieve relevant passages in semantically distinct clusters, and (2) failing to propagate relevance signals to the broader corpus. To address these limitations, we propose Bayesian Active Learning with Gaussian Processes guided by LLM relevance scoring (BAGEL), a novel framework that propagates sparse LLM relevance signals across the embedding space to guide global exploration. BAGEL models the multimodal relevance distribution across the entire embedding space with a query-specific Gaussian Process (GP) based on LLM relevance scores. Subsequently, it iteratively selects passages for scoring by strategically balancing the exploitation of high-confidence regions with the exploration of uncertain areas. Extensive experiments across four benchmark datasets and two LLM backbones demonstrate that BAGEL effectively explores and captures complex relevance distributions and outperforms LLM reranking methods under the same LLM budget on all four datasets.
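The exploit/explore loop that BAGEL describes, a Gaussian Process over passage embeddings fitted to sparse LLM scores, can be sketched with a plain RBF-kernel GP and a UCB acquisition. The kernel, acquisition rule, and random data below are assumptions for illustration; the paper's actual GP configuration may differ:

```python
# Minimal GP-guided passage selection sketch (assumed RBF kernel + UCB).
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel between two sets of embeddings.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_posterior(X_scored, y, X_pool, noise=1e-4):
    # Posterior mean/std of relevance over unscored passages,
    # conditioned on LLM relevance scores y at X_scored.
    K = rbf(X_scored, X_scored) + noise * np.eye(len(X_scored))
    Ks = rbf(X_pool, X_scored)
    Kinv = np.linalg.inv(K)
    mean = Ks @ Kinv @ y
    var = 1.0 - np.einsum("ij,jk,ik->i", Ks, Kinv, Ks)  # k(x,x) = 1 for RBF
    return mean, np.sqrt(np.clip(var, 0, None))

def select_next(X_scored, y, X_pool, beta=1.0):
    # UCB: exploit high predicted relevance, explore uncertain regions.
    mean, std = gp_posterior(X_scored, y, X_pool)
    return int(np.argmax(mean + beta * std))

rng = np.random.default_rng(0)
X_scored = rng.normal(size=(5, 8))   # embeddings of LLM-scored passages
y = rng.uniform(size=5)              # their LLM relevance scores
X_pool = rng.normal(size=(50, 8))    # unscored candidate passages
print(select_next(X_scored, y, X_pool))
```

Each iteration would send the selected passage to the LLM for scoring, append it to `X_scored`/`y`, and repeat until the LLM-call budget is exhausted.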
[IR-12] RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems
【速读】:该论文旨在解决推荐系统中模型参数增长与表示能力提升不匹配的问题,即在深度增加时,token表示的有效秩(effective rank)出现阻尼振荡甚至下降,导致表征能力未能随参数规模同步增强。其解决方案的关键在于提出RankUp架构,通过稀疏特征的随机置换分割(randomized permutation splitting)、多嵌入范式(multi-embedding paradigm)、全局token整合、交叉预训练嵌入token以及任务特定token解耦等机制,有效缓解表示坍缩(representation collapse),显著提升模型的表达能力。
链接: https://arxiv.org/abs/2604.17878
作者: Jin Chen,Shangyu Zhang,Bin Hu,Chao Zhou,Junwei Pan,Gengsheng Xue,Wentao Ning,Gengyu Weng,Wang Zheng,Shaohua Liu,Zeen Xu,Chengyuan Mai,Tingyu Jiang,Lifeng Wang,Shudong Huang,Chengguo Yin,Haijie Gu,Jie Jiang
机构: Tencent Inc. (腾讯公司)
类目: Information Retrieval (cs.IR)
备注: 9 pages, 5 figures
Abstract:The scaling laws for recommender systems have been increasingly validated, where MetaFormer-based architectures consistently benefit from increased model depth, hidden dimensionality, and user behavior sequence length. However, whether representation capacity scales proportionally with parameter growth remains largely unexplored. Prior studies on RankMixer reveal that the effective rank of token representations exhibits a damped oscillatory trajectory across layers, failing to increase consistently with depth and even degrading in deeper layers. Motivated by this observation, we propose RankUp, an architecture designed to mitigate representation collapse and enhance expressive capacity through randomized permutation splitting over sparse features, a multi-embedding paradigm, global token integration, crossed pretrained embedding tokens and task-specific token decoupling. RankUp has been fully deployed in large-scale production across Weixin Video Accounts, Official Accounts and Moments, yielding GMV improvements of 3.41%, 4.81% and 2.21%, respectively.
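The "effective rank" diagnostic motivating RankUp is commonly computed as the exponential of the Shannon entropy of the normalized singular-value spectrum. A sketch of that metric, with synthetic stand-ins for token representations (the paper does not specify this exact formula, so treat it as one standard definition):

```python
# Effective rank of a token-representation matrix via singular-value entropy.
import numpy as np

def effective_rank(H: np.ndarray) -> float:
    """exp(entropy) of the normalized singular values of H."""
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(1)
full = rng.normal(size=(64, 32))                                # well-spread
collapsed = np.outer(rng.normal(size=64), rng.normal(size=32))  # rank-1
print(effective_rank(full) > effective_rank(collapsed))
```

A collapsed (near rank-1) representation scores close to 1 regardless of layer width, which is the kind of degradation RankUp's components are designed to counteract.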
[IR-13] FedCRF: A Federated Cross-domain Recommendation Method with Semantic-driven Deep Knowledge Fusion
【速读】:该论文旨在解决非重叠场景下跨域推荐系统中知识融合与隐私保护的难题,即在用户行为数据分散于不同平台、缺乏共同用户或物品的情况下,如何实现跨域知识迁移并保障用户隐私。其解决方案的关键在于提出一种基于深度语义融合的联邦跨域推荐方法(FedCRF):通过文本语义作为跨域桥梁,在联邦学习框架下实现全局与局部语义协同建模;服务器端构建全局语义簇以提取共享语义信息,客户端设计FGSAT模块动态适应本地分布以缓解跨域分布偏移;同时引入基于文本特征的语义图结构和全局-局部语义表示间的对比学习约束,增强语义一致性并促进深层次知识融合;仅共享物品语义表示而保留用户交互数据本地存储,有效降低隐私泄露风险。
链接: https://arxiv.org/abs/2604.17681
作者: Lei Guo,Ting Yang,Hui Liu,Xu Yu,Xiaohui Han,Xinhua Wang
机构: Shandong University of Finance and Economics (山东财经大学); China University of Petroleum (中国石油大学); Shandong Normal University (山东师范大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:As user behavior data becomes increasingly scattered across different platforms, achieving cross-domain knowledge fusion while preserving privacy has become a critical issue in recommender systems. Existing privacy-preserving cross-domain recommendation (PPCDR) methods usually rely on overlapping users or items as a bridge, making them inapplicable to non-overlapping scenarios. They also suffer from limitations in the collaborative modeling of global and local semantics. To this end, this paper proposes a Federated Cross-domain Recommendation method with deep knowledge Fusion (FedCRF). Using textual semantics as a cross-domain bridge, FedCRF achieves cross-domain knowledge transfer via federated semantic learning under the non-overlapping scenario. Specifically, FedCRF constructs global semantic clusters on the server side to extract shared semantic information, and designs a FGSAT module on the client side to dynamically adapt to local data distributions and alleviate cross-domain distribution shift. Meanwhile, it builds a semantic graph based on textual features to learn representations that integrate both structural and semantic information, and introduces contrastive learning constraints between global and local semantic representations to enhance semantic consistency and promote deep knowledge fusion. In this framework, only item semantic representations are shared, while user interaction data remains locally stored, effectively mitigating privacy leakage risks. Experimental results on multiple real-world datasets show that FedCRF significantly outperforms existing methods in terms of Recall@20 and NDCG@20, validating its effectiveness and superiority in non-overlapping cross-domain recommendation scenarios.
[IR-14] MasterSet: A Large-Scale Benchmark for Must-Cite Citation Recommendation in the AI/ML Literature SDM2026
【速读】:该论文旨在解决人工智能与机器学习(Artificial Intelligence and Machine Learning, AI/ML)领域中“必须引用”(must-cite)文献推荐的难题,即现有系统多关注广义相关性,而忽视了对实验基线、基础方法和核心依赖关系等关键文献的识别,这些文献若被遗漏将导致研究贡献的创新性误判或可复现性受损。解决方案的关键在于构建了一个大规模基准数据集 MasterSet,涵盖15个顶级会议的15万余篇论文作为候选池,并采用三层次标注体系(实验基线状态、核心相关性评分、文中提及频率)进行精细化标注,其标注流程由大语言模型(Large Language Model, LLM)驱动并经人工专家验证。该基准任务要求仅凭查询论文的标题和摘要,在候选集中检索must-cite文献,评估指标为Recall@K,从而为must-cite推荐提供了标准化评测框架,同时揭示当前稀疏检索、密集科学嵌入与图结构方法仍难以有效解决此问题,凸显其作为开放挑战的研究价值。
链接: https://arxiv.org/abs/2604.17680
作者: Md Toyaha Rahman Ratul,Zhiqian Chen,Kaiqun Fu,Taoran Ji,Lei Zhang
机构: Northern Illinois University(北方伊利诺伊大学); Mississippi State University(密西西比州立大学); Texas Christian University(德克萨斯基督教大学); Texas A&M University – Corpus Christi(德克萨斯A&M大学-科珀斯克里斯蒂分校)
类目: Information Retrieval (cs.IR)
备注: submitted to SIAM SDM 2026
Abstract:The explosive growth of AI and machine learning literature – with venues like NeurIPS and ICLR now accepting thousands of papers annually – has made comprehensive citation coverage increasingly difficult for researchers. While citation recommendation has been studied for over a decade, existing systems primarily focus on broad relevance rather than identifying the critical set of "must-cite" papers: direct experimental baselines, foundational methods, and core dependencies whose omission would misrepresent a contribution’s novelty or undermine reproducibility. We introduce MasterSet, a large-scale benchmark specifically designed to evaluate must-cite recommendation in the AI/ML domain. MasterSet incorporates over 150,000 papers collected from official conference proceedings/websites of 15 leading venues, serving as a comprehensive candidate pool for retrieval. We annotate citations with a three-tier labeling scheme: (I) experimental baseline status, (II) core relevance (1–5 scale), and (III) intra-paper mention frequency. Our annotation pipeline leverages an LLM-based judge, validated by human experts on a stratified sample. The benchmark task requires retrieving must-cite papers from the candidate pool given only a query paper’s title and abstract, evaluated by Recall@K. We establish baselines using sparse retrieval, dense scientific embeddings, and graph-based methods, demonstrating that must-cite retrieval remains a challenging open problem.
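The benchmark's Recall@K metric is straightforward to state concretely; the paper IDs below are made up for illustration:

```python
def recall_at_k(ranked_ids, must_cite_ids, k):
    """Fraction of must-cite papers retrieved in the top-k ranking."""
    hits = len(set(ranked_ids[:k]) & set(must_cite_ids))
    return hits / len(must_cite_ids)

ranked = ["p7", "p2", "p9", "p4", "p1"]   # retriever output for one query paper
gold = {"p2", "p4", "p8"}                  # its annotated must-cite set
print(recall_at_k(ranked, gold, 3))
```

Note that the metric is normalized by the size of the must-cite set, not by k, so a query with a large must-cite set can never reach 1.0 at small k.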
[IR-15] Peerispect: Claim Verification in Scientific Peer Reviews
【速读】:该论文旨在解决学术同行评审中审查意见缺乏事实依据的问题,即审稿人常提出主观、修辞性或与稿件内容不一致的陈述,而这些陈述在大规模会议和期刊中难以人工核查其真实性。解决方案的关键在于提出 Peerispect 系统,该系统通过提取审稿意见中的可验证声明(check-worthy claims),从论文中检索相关证据,并利用自然语言推理(Natural Language Inference, NLI)进行逐条验证,最终以可视化界面将证据直接标注在原文中,实现对审稿意见的自动化、交互式事实核查。该系统采用模块化信息检索(Information Retrieval, IR)架构,支持多种检索器、重排序器和验证器,适用于审稿人、作者和程序委员会使用。
链接: https://arxiv.org/abs/2604.17667
作者: Ali Ghorbanpour,Soroush Sadeghian,Alireza Daghighfarsoodeh,Sajad Ebrahimi,Negar Arabzadeh,Seyed Mohammad Hosseini,Ebrahim Bagheri
机构: Reviewerly; University of Toronto (多伦多大学); UC Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Peer review is central to scientific publishing, yet reviewers frequently include claims that are subjective, rhetorical, or misaligned with the submitted work. Assessing whether review statements are factual and verifiable is crucial for fairness and accountability. At the scale of modern conferences and journals, manually inspecting the grounding of such claims is infeasible. We present Peerispect, an interactive system that operationalizes claim-level verification in peer reviews by extracting check-worthy claims from peer reviews, retrieving relevant evidence from the manuscript, and verifying the claims through natural language inference. Results are presented through a visual interface that highlights evidence directly in the paper, enabling rapid inspection and interpretation. Peerispect is designed as a modular Information Retrieval (IR) pipeline, supporting alternative retrievers, rerankers, and verifiers, and is intended for use by reviewers, authors, and program committees. We demonstrate Peerispect through a live, publicly available demo (this https URL) and API services (this https URL), accompanied by a video tutorial (this https URL).
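The final verification step, aggregating per-evidence NLI labels into a claim verdict, can be sketched as follows. The label-to-verdict policy is an illustrative assumption; Peerispect's actual aggregation logic is not specified in the abstract:

```python
# Toy aggregation of NLI labels (entailment / contradiction / neutral)
# produced by checking one review claim against retrieved manuscript spans.
# Assumption: any entailing span supports the claim; we check support first.
def verdict(nli_labels):
    if "entailment" in nli_labels:
        return "supported"
    if "contradiction" in nli_labels:
        return "refuted"
    return "unverifiable"

print(verdict(["neutral", "entailment", "neutral"]))
print(verdict(["neutral", "neutral"]))
```

In the real pipeline each label would come from an NLI model applied to a (claim, evidence-span) pair, and the supporting spans would be highlighted in the paper view.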
[IR-16] Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers ACL2026
【速读】:该论文旨在解决当前信息检索(Information Retrieval, IR)系统在多语言环境下对代码转换(Code-Switching)现象处理能力不足的问题。现有系统主要基于单一语言设计与评估,难以应对真实世界中混合语言查询带来的挑战。解决方案的关键在于构建一个高质量的代码转换检索基准——CSR-L(Code-Switching Retrieval benchmark-Lite),通过人工标注数据捕捉自然语言混用的真实场景,并揭示代码转换导致嵌入空间分布显著偏移,从而成为主流多语言模型性能下降的根本瓶颈。进一步地,作者提出CS-MTEB基准以系统性评估多种任务下的性能衰减,表明单纯扩展词汇等传统多语言策略无法完全弥补这一缺陷,强调了将代码转换纳入IR优化核心框架的必要性。
链接: https://arxiv.org/abs/2604.17632
作者: Qingcheng Zeng,Yuheng Lu,Zeqi Zhou,Heli Qi,Puxuan Yu,Fuheng Zhao,Hitomi Yanaka,Weihao Xuan,Naoto Yokoya
机构: Northwestern University (西北大学); Waseda University (早稻田大学); Brown University (布朗大学); RIKEN AIP (理化学研究所人工智能中心); Snowflake Inc. (雪花科技公司); University of Utah (犹他大学); The University of Tokyo (东京大学)
类目: Information Retrieval (cs.IR)
备注: Finding of ACL 2026
Abstract:Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms reveals that code-switching acts as a fundamental performance bottleneck, degrading the effectiveness of even robust multilingual models. We demonstrate that this failure stems from substantial divergence in the embedding space between pure and code-switched text. Scaling this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse tasks, where we observe performance declines of up to 27%. Finally, we show that standard multilingual techniques like vocabulary expansion are insufficient to resolve these deficits completely. These findings underscore the fragility of current systems and establish code-switching as a crucial frontier for future IR optimization.
[IR-17] COSEARCH: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search
【速读】:该论文旨在解决当前**代理搜索(agentic search)中检索系统作为固定工具所导致的性能瓶颈问题,即现有方法(如Search-R1)仅优化推理代理而保持检索组件不变,使得整体性能受限于静态检索模块。实验证明,与理想检索系统(oracle)相比,固定检索系统的F1指标存在高达26.8%的相对差距,表明检索能力是提升代理搜索性能的关键制约因素。为此,作者提出CoSearch框架,其核心创新在于通过组相对策略优化(Group Relative Policy Optimization, GRPO)**实现推理代理与生成式文档排序模型的联合训练。关键突破包括:设计基于词级相似性的语义分组策略,在无需额外采样的前提下构建有效的优化组;引入复合奖励机制,融合排序质量信号与轨迹级结果反馈,使排序模型同时获得即时和长期的学习信号。实验表明,该方案在七个单跳和多跳问答基准上均显著优于强基线,验证了联合训练的有效性与可行性。
链接: https://arxiv.org/abs/2604.17555
作者: Hansi Zeng,Liam Collins,Bhuvesh Kumar,Neil Shah,Hamed Zamani
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Snap Inc (Snap Inc)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Agentic search – the task of training agents that iteratively reason, issue queries, and synthesize retrieved information to answer complex questions – has achieved remarkable progress through reinforcement learning (RL). However, existing approaches, such as Search-R1, treat the retrieval system as a fixed tool, optimizing only the reasoning agent while the retrieval component remains unchanged. A preliminary experiment reveals that the gap between an oracle and a fixed retrieval system reaches up to +26.8% relative F1 improvement across seven QA benchmarks, suggesting that the retrieval system is a key bottleneck in scaling agentic search performance. Motivated by this finding, we propose CoSearch, a framework that jointly trains a multi-step reasoning agent and a generative document ranking model via Group Relative Policy Optimization (GRPO). To enable effective GRPO training for the ranker – whose inputs vary across reasoning trajectories – we introduce a semantic grouping strategy that clusters sub-queries by token-level similarity, forming valid optimization groups without additional rollouts. We further design a composite reward combining ranking quality signals with trajectory-level outcome feedback, providing the ranker with both immediate and long-term learning signals. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate consistent improvements over strong baselines, with ablation studies validating each design choice. Our results show that joint training of the reasoning agent and retrieval system is both feasible and strongly performant, pointing to a key ingredient for future search agents.
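The semantic grouping idea, clustering sub-queries by token-level similarity so GRPO has valid comparison groups, can be sketched with a greedy Jaccard clustering. The similarity function and threshold are illustrative assumptions, not CoSearch's exact procedure:

```python
# Greedy grouping of sub-queries by word-token overlap (assumed Jaccard).
def jaccard(a: str, b: str) -> float:
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def group_subqueries(queries, threshold=0.5):
    """Assign each sub-query to the first group whose seed is similar enough."""
    groups = []
    for q in queries:
        for g in groups:
            if jaccard(q, g[0]) >= threshold:
                g.append(q)
                break
        else:
            groups.append([q])
    return groups

qs = ["capital of france",
      "what is the capital of france",
      "population of tokyo"]
print(len(group_subqueries(qs)))
```

Within each resulting group, ranker rollouts over near-equivalent sub-queries can then be compared for a group-relative advantage, which is what makes GRPO applicable despite the trajectory-dependent inputs.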
[IR-18] Matlas: A Semantic Search Engine for Mathematics
【速读】:该论文旨在解决数学知识检索的难题,即在海量且结构复杂的数学文献中高效定位特定定理或结果,这在人类研究和生成式 AI (Generative AI) 系统中均至关重要。其关键解决方案是构建了一个名为 Matlas 的语义搜索引擎,基于包含 807 万条数学命题的大型语义化语料库(来自 43.5 万篇同行评审论文及 1900 本教材),通过提取命题及其依赖关系、构建文档级依赖图,并按拓扑顺序递归展开以生成更自包含的表示形式,从而支持使用自然语言查询进行高效语义检索。
链接: https://arxiv.org/abs/2604.17484
作者: Haocheng Ju,Leheng Chen,Peihao Wu,Bryan Dai,Bin Dong
机构: School of Mathematical Sciences, Peking University (北京大学数学科学学院)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Web Service: this https URL , API Docs: this https URL
Abstract:Retrieving mathematical knowledge is a central task in both human-driven research, such as determining whether a result already exists, finding related results, and identifying historical origins, and in emerging AI systems for mathematics, where reliable grounding is essential. However, the scale and structure of the mathematical literature pose significant challenges: results are distributed across millions of documents, and individual statements are often difficult to interpret in isolation due to their dependence on prior definitions and theorems. In this paper, we introduce Matlas, a semantic search engine for mathematical statements. Matlas is built on a large-scale corpus of 8.07 million statements extracted from 435K peer-reviewed papers spanning 1826 to 2025, drawn from a curated set of 180 journals selected using an ICM citation-based criterion, together with 1.9K textbooks. From these sources, we extract mathematical statements together with their dependencies, construct document-level dependency graphs, and recursively unfold statements in topological order to produce more self-contained representations. On top of this corpus, we develop a semantic retrieval system that enables efficient search for mathematical results using natural language queries. We hope that Matlas can improve the efficiency of theorem retrieval for mathematicians and provide a structured source of grounding for AI systems tackling research-level mathematical problems, and serve as part of the infrastructure for mathematical knowledge retrieval.
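The "recursively unfold statements in topological order" step can be sketched as a depth-first expansion over the dependency graph, prepending each statement's (not-yet-seen) dependencies before the statement itself. The tiny graph and statement bodies below are made-up examples:

```python
# Depth-first unfolding of a statement with its dependencies, each emitted once.
def unfold(stmt, deps, bodies, seen=None):
    """Return stmt's body preceded by its transitive dependencies, in order."""
    if seen is None:
        seen = set()
    parts = []
    for d in deps.get(stmt, []):
        if d not in seen:
            seen.add(d)
            parts.append(unfold(d, deps, bodies, seen))
    parts.append(bodies[stmt])
    return "\n".join(parts)

deps = {"thm": ["lemma", "def"], "lemma": ["def"]}
bodies = {"def": "Def: a group is ...",
          "lemma": "Lemma: subgroups are ...",
          "thm": "Thm: Lagrange ..."}
print(unfold("thm", deps, bodies))
```

The shared dependency `def` is emitted exactly once and before everything that uses it, yielding the more self-contained representation the abstract describes.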
[IR-19] Transparent and Controllable Recommendation Filtering via Multimodal Multi-Agent Collaboration
【速读】:该论文旨在解决个性化推荐系统中存在的用户不适内容暴露问题,尤其是现有基于大语言模型(Large Language Models, LLMs)的过滤方法因缺乏多模态感知能力而无法识别视觉不当内容,以及因“过度关联”(over-association)现象——即错误地将用户对特定负面内容(如引发焦虑的营销信息)的厌恶泛化至无害教育性内容——导致大量误判(false positives),从而削弱用户自主权的问题。解决方案的关键在于提出一种融合端云协同、多模态感知与多智能体编排的新框架:通过事实驱动的裁决流水线消除推理幻觉,并构建动态双层偏好图(preference graph),支持人工介入的Delta调整机制,显式防止算法对细粒度用户意图的灾难性遗忘,从而显著降低误报率并提升推荐系统的可解释性与可控性。
链接: https://arxiv.org/abs/2604.17459
作者: Chi Zhang,Zhipeng Xu,Jiahao Liu,Dongsheng Li,Hansu Gu,Peng Zhang,Ning Gu,Tun Lu
机构: Fudan University (复旦大学); Microsoft Research Asia (微软亚洲研究院)
类目: Information Retrieval (cs.IR)
备注: 14 pages, under review
Abstract:While personalized recommender systems excel at content discovery, they frequently expose users to undesirable or discomforting information, highlighting the critical need for user-centric filtering tools. Current methods leveraging Large Language Models (LLMs) struggle with two major bottlenecks: they lack multimodal awareness to identify visually inappropriate content, and they are highly prone to “over-association” – incorrectly generalizing a user’s specific dislike (e.g., anxiety-inducing marketing) to block benign, educational materials. These unconstrained hallucinations lead to a high volume of false positives, ultimately undermining user agency. To overcome these challenges, we introduce a novel framework that integrates end-to-cloud collaboration, multimodal perception, and multi-agent orchestration. Our system employs a fact-grounded adjudication pipeline to eliminate inferential hallucinations. Furthermore, it constructs a dynamic, two-tier preference graph that allows for explicit, human-in-the-loop modifications (via Delta-adjustments), explicitly preventing the algorithm from catastrophically forgetting fine-grained user intents. Evaluated on an adversarial dataset comprising 473 highly confusing samples, the proposed architecture effectively curbed over-association, decreasing the false positive rate by 74.3% and achieving nearly twice the F1-Score of traditional text-only baselines. Additionally, a 7-day longitudinal field study with 19 participants demonstrated robust intent alignment and enhanced governance efficiency. User feedback confirmed that the framework drastically improves algorithmic transparency, rebuilds user control, and alleviates the fear of missing out (FOMO), paving the way for transparent human-AI co-governance in personalized feeds.
[IR-20] RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation
【速读】:该论文旨在解决多轮对话中有害内容检测的准确性与可解释性问题,现有方法主要依赖大语言模型(Large Language Model, LLM)内部参数化知识,缺乏对外部规范性原则的显式建模,导致在社会语境下判断不一致、推理冗余且难以解释。解决方案的关键在于提出RoTRAG框架,通过引入外部人类编写的道德规范(Rules of Thumb, RoTs)作为显式规范证据,结合检索增强机制,在每一轮对话中动态检索相关RoTs用于逐轮推理和最终危害严重性分类;同时设计轻量级二值路由分类器,智能决定是否需要触发基于检索的推理,从而减少冗余计算并提升效率。
链接: https://arxiv.org/abs/2604.17301
作者: Juhyeon Lee,Wonduk Seo,Junseo Koh,Seunghyun Lee,Haihua Chen,Yi Bu
机构: Peking University (北京大学); Enhans (Enhans); University of North Texas (北德克萨斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 20 pages, 10 figures (Under Review)
Abstract:Detecting harmful content in multi-turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models’ internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval-augmented framework that incorporates concise human-written moral norms, called Rules of Thumb (RoTs), into LLM-based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn-level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval-grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi-Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.
[IR-21] MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在执行代理式搜索(agentic search)过程中因迭代思考-检索循环导致的长系统记忆累积问题,即记忆稀释(memory dilution)现象;同时,现有记忆管理方法难以捕捉查询与文档之间的细粒度语义关系,易造成信息丢失。解决方案的关键在于提出 MemSearch-o1 框架,其核心机制为基于推理对齐的记忆增长与回溯重构:首先从查询中提取记忆种子标记(memory seed tokens),动态生成细粒度记忆片段;随后通过贡献函数(contribution function)进行回溯与深度精炼;最终构建全局连接的记忆路径(globally connected memory path),从而将记忆管理从流式拼接转变为基于路径的、标记级结构化增长,显著缓解记忆稀释并增强多种 LLM 的推理能力。
链接: https://arxiv.org/abs/2604.17265
作者: Sheng Zhang,Junyi Li,Yingyi Zhang,Pengyue Jia,Yichao Wang,Xiaowei Qian,Wenlin Zhang,Maolin Wang,Yong Liu,Xiangyu Zhao
机构: City University of Hong Kong (香港城市大学); Dalian University of Technology (大连理工大学); Huawei Technologies Ltd. (华为技术有限公司)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Recent advances in large language models (LLMs) have scaled the potential for reasoning and agentic search, wherein models autonomously plan, retrieve, and reason over external knowledge to answer complex queries. However, the iterative think-search loop accumulates long system memories, leading to memory dilution problem. In addition, existing memory management methods struggle to capture fine-grained semantic relations between queries and documents and often lose substantial information. Therefore, we propose MemSearch-o1, an agentic search framework built on reasoning-aligned memory growth and retracing. MemSearch-o1 dynamically grows fine-grained memory fragments from memory seed tokens from the queries, then retraces and deeply refines the memory via a contribution function, and finally reorganizes a globally connected memory path. This shifts memory management from stream-like concatenation to structured, token-level growth with path-based reasoning. Experiments on eight benchmark datasets show that MemSearch-o1 substantially mitigates memory dilution, and more effectively activates the reasoning potential of diverse LLMs, establishing a solid foundation for memory-aware agentic intelligence.
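The "retrace via a contribution function" step, pruning the grown memory down to a connected path of useful fragments, can be sketched as a score-and-keep filter. The fragment names, scores, and keep budget below are illustrative assumptions; the paper's contribution function is learned from the reasoning trajectory:

```python
# Toy retrace: keep the highest-contribution memory fragments,
# preserving their original (reasoning) order as the memory path.
def retrace(fragments, contribution, keep=2):
    kept = sorted(fragments, key=lambda f: contribution[f], reverse=True)[:keep]
    return [f for f in fragments if f in kept]

fragments = ["who directed X", "director's birthplace", "weather today"]
contribution = {"who directed X": 0.9,
                "director's birthplace": 0.7,
                "weather today": 0.1}
print(retrace(fragments, contribution))
```

Discarding low-contribution fragments like the distractor above is how such a scheme counters memory dilution as the think-search loop accumulates context.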
[IR-22] HORIZON: A Benchmark for In-the-wild User Behaviour Modeling ACL2026
【速读】:该论文旨在解决现有用户建模基准在数据规模、任务设定和评估方式上的局限性,这些问题导致当前模型难以适应真实世界中跨域、跨用户及长时间跨度的复杂行为模式。其核心解决方案是提出HORIZON基准,该基准从亚马逊评论数据中重构出涵盖5400万用户和3500万物品的大规模多域数据集,并围绕数据集、任务设计与评估体系三个维度进行革新。关键创新在于引入时间泛化(temporal generalization)、序列长度变化和未见用户建模等新任务与评估指标,从而推动模型从单一领域内的下一物品预测向具备长期交互理解能力、跨域迁移能力和时序鲁棒性的通用用户建模方向演进。
链接: https://arxiv.org/abs/2604.17259
作者: Arnav Goel,Pranjal A Chitale,Bhawna Paliwal,Bishal Santra,Amit Sharma
机构: Carnegie Mellon University (卡内基梅隆大学); Microsoft Research India (微软研究院印度); University of California, Berkeley (加州大学伯克利分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, accepted to ACL 2026 (Findings)
Abstract:User behavior in the real world is diverse, cross-domain, and spans long time horizons. Existing user modeling benchmarks however remain narrow, focusing mainly on short sessions and next-item prediction within a single domain. Such limitations hinder progress toward robust and generalizable user models. We present HORIZON, a new benchmark that reformulates user modeling along three axes i.e. dataset, task, and evaluation. Built from a large-scale, cross-domain reformulation of Amazon Reviews, HORIZON covers 54M users and 35M items, enabling both pretraining and realistic evaluation of models in heterogeneous environments. Unlike prior benchmarks, it challenges models to generalize across domains, users, and time, moving beyond standard missing-positive prediction in the same domain. We propose new tasks and evaluation setups that better reflect real-world deployment scenarios. These include temporal generalization, sequence-length variation, and modeling unseen users, with metrics designed to assess general user behavior understanding rather than isolated next-item prediction. We benchmark popular sequential recommendation architectures alongside LLM-based baselines that leverage long-term interaction histories. Our results highlight the gap between current methods and the demands of real-world user modeling, while establishing HORIZON as a foundation for research on temporally robust, cross-domain, and general-purpose user models.
[IR-23] HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
【速读】:该论文旨在解决解码-free重排序方法中因注意力分数同质化(attention score homogenization)导致的细粒度区分能力下降问题,即在中间上下文文档中注意力得分趋同,难以实现有效的排序。其解决方案的关键在于提出HeadRank框架,通过熵正则化的头选择机制(entropy-regularized head selection)、硬相邻层级偏好对(hard adjacent-level preference pairs)以及联合锐化同质化中间区域判别力的分布正则项(distribution regularizer),将偏好优化从离散token空间迁移至连续注意力空间;同时在最深选中的层进行深度截断,使推理复杂度降至O(1)次前向传播,从而在保持低延迟的同时显著提升排序性能。
链接: https://arxiv.org/abs/2604.17237
作者: Juyuan Wang,Chenxing Wang,Yuchen Fang,Huiyun Hu,Junwu Du,Aolin Li,Haijun Wu,Jin Xu,Ligang Liu,Dongliang Liao
机构: Weixin Group, Tencent, China; South China University of Technology, Guangzhou, China
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Decoding-free reranking methods that read relevance signals directly from LLM attention weights offer significant latency advantages over autoregressive approaches, yet suffer from attention score homogenization: middle-context documents receive near-identical scores, destroying the fine-grained distinctions required for ranking. We propose HeadRank, a framework that lifts preference optimization from discrete token space into the continuous attention domain through entropy-regularized head selection, hard adjacent-level preference pairs, and a distribution regularizer that jointly sharpen discriminability in the homogenized middle zone. Depth truncation at the deepest selected layer further reduces inference to O(1) forward passes. Across 14 benchmarks on three Qwen3 scales (0.6B–4B) using only 211 training queries, HeadRank consistently outperforms generative and decoding-free baselines with 100% formatting success. At 4B, 57.4% of relevant middle-zone documents reach the top quartile versus 14.2% for irrelevant ones – a 43-percentage-point selectivity gap that demonstrates the effectiveness of attention-space preference alignment for listwise reranking.
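The homogenization problem and the role of entropy in head selection can be illustrated with a toy attention matrix: flat (high-entropy) heads cannot separate documents, so a sketch of the idea prefers sharp heads and averages their attention as the ranking signal. The entropy-based selection rule below is a simplification; HeadRank learns its head selection with regularized preference training:

```python
# Toy decoding-free ranking from attention mass, preferring low-entropy heads.
import numpy as np

def head_entropy(attn):
    """Entropy of each head's attention distribution over candidate docs."""
    p = attn / attn.sum(-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(-1)

def rank_with_selected_heads(attn, n_heads=2):
    order = np.argsort(head_entropy(attn))[:n_heads]  # sharpest heads first
    scores = attn[order].mean(0)                      # pooled relevance signal
    return np.argsort(-scores)                        # docs, best first

# 4 heads x 5 docs of (made-up) query-to-document attention mass
attn = np.array([
    [0.20, 0.20, 0.20, 0.20, 0.20],   # homogenized head: useless for ranking
    [0.05, 0.60, 0.15, 0.10, 0.10],   # sharp head, prefers doc 1
    [0.19, 0.21, 0.20, 0.20, 0.20],   # nearly flat
    [0.10, 0.50, 0.20, 0.10, 0.10],   # sharp head, prefers doc 1
])
print(rank_with_selected_heads(attn)[0])
```

The flat heads mimic the "middle zone" failure mode: with near-identical scores, no ordering is recoverable, which is why sharpening attention distributions is central to the method.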
[IR-24] RLM-on-KG: Heuristics First, LLMs When Needed: Adaptive Retrieval Control over Mention Graphs for Scattered Evidence
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)控制器在知识图谱(Knowledge Graph, KG)探索任务中是否优于传统基于规则的遍历策略的问题。其核心贡献在于揭示了LLM控制的优势具有条件性:仅当证据分布分散且工具调用能力较强时,LLM才能显著超越启发式遍历方法。解决方案的关键在于提出RLM-on-KG系统——一种在查询时进行实体优先、多跳探索的检索架构,采用确定性图构建和固定工具集,将候选发现与排序分离:LLM负责提升探索广度以覆盖更多潜在证据,而最终证据选择则由纯向量重排序完成,从而实现高效且可解释的知识图谱导航。
链接: https://arxiv.org/abs/2604.17056
作者: Andrea Volpini,Elie Raad
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Preprint. 32 pages, 9 figures. Code and data available at the project repository
Abstract:When does an LLM controller outperform rule-based traversal for knowledge graph exploration? We study this question through RLM-on-KG, a retrieval system that treats an LLM as an autonomous navigator over an RDF-encoded mention graph for grounded question answering. Unlike GraphRAG pipelines that rely on offline LLM indexing, RLM-on-KG performs entity-first, multi-hop exploration at query time using deterministic graph construction and a fixed tool set. Our central finding is a conditional advantage: the value of LLM control depends on evidence scatter and tool-calling sophistication. The paper’s core claim is LLM control versus heuristic traversal, not a generic win over GraphRAG. On GraphRAG-Bench Novel (519 questions), Gemini 2.0 Flash achieves +2.47 pp F1 over a rule-based heuristic baseline (p < 0.0001), but only +0.16 pp over a GraphRAG-local variant (not significant). With a stronger controller, Claude Haiku 4.5, the gain over heuristic grows to +4.37 pp (p < 0.001) and extends to a +2.42 pp significant improvement over GraphRAG-local (p < 0.001). The gain is largest when gold evidence is scattered across 6-10 chunks (+3.21 pp) and smallest for concentrated evidence (+1.85 pp). Cross-scale validation on MuSiQue confirms that the LLM-over-heuristic advantage transfers, with expected attenuation on smaller per-question graphs. The core architectural insight is the separation of candidate discovery from ranking: the LLM adds value through exploration breadth, while final evidence selection is best handled by pure vector re-ranking. Beyond retrieval, exploration traces provide a proposed stress-test harness for structured data quality, yielding diagnostics for coverage, connectivity, provenance, and queryability.
[IR-25] Detecting Alarming Student Verbal Responses using Text and Audio Classifier
【速读】:该论文旨在解决自动化言语响应评分(Automated Verbal Response Scoring, AVRS)系统在识别潜在危机学生方面存在的安全漏洞问题。传统AVRS系统仅依赖文本内容进行判断,忽视了语音中的韵律特征(prosodic markers),导致对高风险响应的漏检率较高。解决方案的关键在于提出一种新颖的混合框架,通过融合一个基于内容的文本分类器与一个基于韵律特征的音频分类器,同时分析学生的言语内容和语音表现,从而显著提升对潜在危险响应的检测性能,为人工干预提供更及时、准确的支持,具有重要的临床与安全价值。
链接: https://arxiv.org/abs/2604.16717
作者: Christopher Ormerod,Gitit Kehat
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 9 Pages. Paper to be Presented at the National Council on Measurement in Education Conference on April 10, 2026
Abstract:This paper addresses a critical safety gap in the use of Automated Verbal Response Scoring (AVRS). We present a novel hybrid framework for troubled student detection that combines a text classifier, trained to detect responses based on their content, and an audio classifier, trained to detect responses using prosodic markers. This approach overcomes key limitations of traditional AVRS systems by considering both content and prosody of responses, achieving enhanced performance in identifying potentially concerning responses. This system can expedite the review process by humans, which can be life-saving particularly when timely intervention may be crucial.
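The hybrid idea above (a content score from text plus a prosody score from audio) is often realized as late fusion of the two classifiers' probabilities. The sketch below is our own hedged illustration; the paper does not specify its fusion rule, and the weight and thresholds here are hypothetical.

```python
# Illustrative late fusion of a text classifier and an audio classifier.
# Weights and thresholds are assumptions, not values from the paper.
def fuse_scores(p_text, p_audio, w_text=0.6):
    """Weighted average of the two classifiers' 'alarming' probabilities."""
    return w_text * p_text + (1.0 - w_text) * p_audio

def flag_for_review(p_text, p_audio, threshold=0.5):
    """Route a response to human review if the fused risk score is high,
    or if either single modality is extremely confident on its own."""
    fused = fuse_scores(p_text, p_audio)
    return fused >= threshold or max(p_text, p_audio) >= 0.95

# A response with neutral wording but highly alarming prosody is still
# flagged, which is the gap a text-only AVRS system would miss.
```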
[IR-26] On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability
【速读】:该论文旨在解决生成式 AI(Generative AI)时代下基于 decoder-only 大语言模型(LLM)的密集检索器(dense retriever)在实际应用中所面临的鲁棒性不足问题,尤其是其在泛化能力(generalizability)和稳定性(stability)方面的表现尚未系统评估。解决方案的关键在于从两个互补视角进行系统性分析:一是利用线性混合效应模型量化不同模型在跨30个数据集上的平均性能,并区分模型固有能力与数据异质性的影响,揭示指令微调模型虽整体表现优异但复杂推理优化可能导致“专业化代价”(specialization tax);二是通过模拟自然查询扰动(如同义替换、拼写错误)和恶意攻击(如语料库投毒),发现 LLM-based 检索器对词汇层面扰动更具韧性,但对语义级扰动仍敏感,同时指出嵌入几何特性(如角度均匀性)可作为预测词法稳定性的指标,且模型规模扩大通常提升鲁棒性。这一研究为未来设计具备鲁棒性意识的检索系统提供了实证依据与理论指导。
链接: https://arxiv.org/abs/2604.16576
作者: Yongkang Li,Panagiotis Eustratiadis,Yixing Fan,Evangelos Kanoulas
机构: University of Amsterdam (阿姆斯特丹大学); Chinese Academy of Sciences (中国科学院)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a "specialization tax," exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at this https URL.
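The abstract points to angular uniformity of the embedding space as a predictor of lexical stability. As a hedged illustration (the paper's exact measure is not specified here), mean pairwise cosine similarity is one simple proxy: the lower the mean similarity, the more uniformly the embeddings spread over the sphere.

```python
# Simple proxy for angular uniformity: mean pairwise cosine similarity of a
# set of embeddings. Lower mean similarity = more uniform angular spread.
# This is an illustrative diagnostic, not the paper's exact measure.
import math
from itertools import combinations

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_pairwise_cosine(vectors):
    unit = [normalize(v) for v in vectors]
    sims = [sum(a * b for a, b in zip(u, w)) for u, w in combinations(unit, 2)]
    return sum(sims) / len(sims)

# A collapsed set (all vectors nearly parallel) vs. a spread-out set.
collapsed = [[1.0, 0.0], [0.99, 0.01], [1.0, 0.02]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
```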
[IR-27] Modeling User Exploration Saturation: When Recommender Systems Should Stop Pushing Novelty
【速读】:该论文旨在解决公平性驱动的推荐系统中因统一施加探索强度而导致用户体验不均衡的问题。其关键在于识别并实证分析“探索饱和”(exploration saturation)现象——即随着推荐系统持续增加对长尾或低曝光内容的探索力度,用户的效用提升会逐渐减弱甚至转为下降,且这一临界点具有显著的用户个体差异性。研究发现,交互历史较短的用户更容易达到探索饱和,表明全局固定的公平性干预策略可能对部分用户造成不公平负担,从而揭示了在推荐系统中应根据用户特征动态调整公平性驱动探索程度的重要性。
链接: https://arxiv.org/abs/2604.16419
作者: Enock O. Ayiku,Evelyn Osei,Emebo Onyeka
机构: University of Massachusetts Boston (马萨诸塞大学波士顿分校); Virginia Tech (弗吉尼亚理工大学); Kansas State University (堪萨斯州立大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fairness-aware recommender systems often mitigate bias by increasing exposure to under-represented or long-tail content, commonly through mechanisms that promote novelty and diversity. In practice, the strength of such interventions is typically controlled using global hyperparameters, fixed regularization weights, heuristic caps, or offline tuning strategies. These approaches implicitly assume that a single level of exploration is appropriate across users, contexts, and stages of interaction. In this work, we study exploration saturation as a user-dependent phenomenon arising from fairness- and novelty-driven recommendation strategies. We define exploration saturation as the point at which further increases in exploration no longer improve user utility and may instead reduce engagement or perceived relevance. Rather than proposing a new fairness-aware algorithm or optimizing a specific objective, we empirically analyze how increasing exploration affects users across varied recommendation models. Through longitudinal experiments using MovieLens-1M and this http URL datasets, our results indicate that fairness-induced exploration exhibits diminishing or non-monotonic returns and varies substantially across users. In particular, users with limited interaction histories tend to reach saturation earlier, suggesting that uniform fairness or novelty pressure can disproportionately disadvantage certain users. These findings reveal a trade-off between fairness and user experience, suggesting that recommendation systems should adapt not only to relevance but also to the amount of fairness-driven exploration applied to individual users.
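Exploration saturation as defined above (the point where further exploration stops improving utility) admits a simple operational reading: scan utility measured at increasing exploration strengths and find where the marginal gain turns non-positive. The detection rule and numbers below are our own illustration; the paper analyzes saturation empirically rather than via this formula.

```python
# Hypothetical sketch of detecting an exploration-saturation point from a
# curve of user utility vs. exploration strength. Data is illustrative.
def saturation_point(exploration_levels, utilities):
    """Return the exploration level after which marginal utility is <= 0,
    or the last level if utility keeps improving throughout."""
    for i in range(1, len(utilities)):
        if utilities[i] <= utilities[i - 1]:
            return exploration_levels[i - 1]
    return exploration_levels[-1]

levels = [0.0, 0.2, 0.4, 0.6, 0.8]
# A short-history user saturates early; a long-history user keeps benefiting,
# matching the paper's observation that uniform exploration pressure can
# disproportionately disadvantage users with limited interaction histories.
short_history = [0.50, 0.55, 0.54, 0.50, 0.45]
long_history = [0.50, 0.56, 0.60, 0.63, 0.65]
```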
[IR-28] Tensor Manifold-Based Graph-Vector Fusion for AI-Native Academic Literature Retrieval
【速读】:该论文旨在解决现有图-向量融合方法在学术文献检索中面临的瓶颈问题,包括矩阵依赖性、存储爆炸、语义稀释以及缺乏对生成式 AI (Generative AI) 原生支持等挑战。其解决方案的关键在于提出了一种基于张量流形理论的几何统一图-向量融合框架,首次从理论上证明了学术文献图是张量流形的离散投影,从而实现了图拓扑结构与向量几何嵌入的原生统一。在此基础上设计的四个核心模块——无矩阵依赖的时间扩散签名更新、分层时间流形编码、时间黎曼流形索引及 AI-agent 可编程检索,均具备线性时间与空间复杂度,可高效适配大规模动态学术文献图,为生成式 AI 原生的学术文献检索提供了新的理论框架与工程实现路径。
链接: https://arxiv.org/abs/2604.16416
作者: Xing Wei,Yang Yu
机构: Dongbi Scientific Data Lab (东必科学数据实验室)
类目: Information Retrieval (cs.IR)
备注: 36 pages, 10 tables, 0 figures; accepted for publication; extended version of graph-vector fusion framework for AI-native academic literature retrieval
Abstract:The rapid development of large language models and AI agents has triggered a paradigm shift in academic literature retrieval, putting forward new demands for fine-grained, time-aware, and programmable retrieval. Existing graph-vector fusion methods still face bottlenecks such as matrix dependence, storage explosion, semantic dilution, and lack of AI-native support. This paper proposes a geometry-unified graph-vector fusion framework based on tensor manifold theory, which formally proves that an academic literature graph is a discrete projection of a tensor manifold, realizing the native unification of graph topology and vector geometric embedding. Based on this theoretical conclusion, we design four core modules: matrix-independent temporal diffusion signature update, hierarchical temporal manifold encoding, temporal Riemannian manifold indexing, and AI-agent programmable retrieval. Theoretical analysis and complexity proof show that all core algorithms have linear time and space complexity, which can adapt to large-scale dynamic academic literature graphs. This research provides a new theoretical framework and engineering solution for AI-native academic literature retrieval, promoting the industrial application of graph-vector fusion technology in the academic field.
[IR-29] GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing
【速读】:该论文旨在解决传统混合搜索(hybrid search)在GPU上性能受限的问题,即CPU-centric的近似最近邻(Approximate Nearest Neighbor, ANN)索引直接移植到GPU时因架构不匹配导致的严重性能下降,如不规则内存访问、分支发散和过多的CPU-GPU同步开销。解决方案的关键在于提出一种面向GPU原生设计的图索引结构GRAB-ANNS:通过引入基于桶(bucket)的内存布局将范围谓词转化为轻量级桶选择,实现合并内存访问与高效的单指令多线程(SIMT)执行;同时设计混合图拓扑结构,在桶内构建稠密局部边以保障局部导航性,桶间使用稀疏远程边维持全局连通性;此外还开发了支持批量插入与并行图维护的追加式更新流水线,从而显著提升动态混合搜索的查询吞吐量与索引构建效率。
链接: https://arxiv.org/abs/2604.16402
作者: Xinkui Zhao,Hengxuan Lou,Yifan Zhang,Junjie Dai,Shuiguang Deng,Jianwei Yin
机构: Zhejiang University (浙江大学)
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注:
Abstract:Hybrid search, which jointly optimizes vector similarity and structured predicate filtering, has become a fundamental building block for modern AI-driven systems. While recent predicate-aware ANN indices improve filtering efficiency on CPUs, their performance is increasingly constrained by limited memory bandwidth and parallelism. Although GPUs offer massive parallelism and superior memory bandwidth, directly porting CPU-centric hybrid search algorithms to GPUs leads to severe performance degradation due to architectural mismatches, including irregular memory access, branch divergence, and excessive CPU-GPU synchronization. In this paper, we present GRAB-ANNS, a high-throughput, GPU-native graph index for dynamic hybrid search. Our key insight is to rethink hybrid indexing from a hardware-first perspective. We introduce a bucket-based memory layout that transforms range predicates into lightweight bucket selection, enabling coalesced memory accesses and efficient SIMT execution. To preserve global navigability under arbitrary filters, we design a hybrid graph topology that combines dense intra-bucket local edges with sparse inter-bucket remote edges. We further develop an append-only update pipeline that supports efficient batched insertions and parallel graph maintenance on GPUs. Extensive experiments on large-scale datasets show that GRAB-ANNS achieves up to 240.1 times higher query throughput and 12.6 times faster index construction than state-of-the-art CPU-based systems, and up to 10 times higher throughput compared to optimized GPU-native reimplementations, while maintaining high recall.
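The key memory-layout idea (turning a range predicate into "lightweight bucket selection" rather than per-item branching) can be sketched in plain Python. This is a CPU-side toy under our own assumptions; bucket width, boundary handling, and the GPU coalescing details are simplified away.

```python
# Illustrative bucket-based filtering: items are grouped by attribute value so
# a range predicate selects whole buckets instead of testing each item.
def build_buckets(items, width):
    """items: list of (id, attribute_value). Group ids by attribute bucket."""
    buckets = {}
    for item_id, value in items:
        buckets.setdefault(int(value // width), []).append(item_id)
    return buckets

def select_candidates(buckets, width, lo, hi):
    """Range predicate lo <= value <= hi -> union of the covering buckets.
    (A real system would still check bucket-boundary items exactly.)"""
    out = []
    for b in range(int(lo // width), int(hi // width) + 1):
        out.extend(buckets.get(b, []))
    return out

items = [("a", 3), ("b", 17), ("c", 25), ("d", 41)]
buckets = build_buckets(items, width=10)
cands = select_candidates(buckets, 10, lo=15, hi=30)
```

On a GPU, the point of this layout is that each selected bucket is a contiguous run of ids, so threads in a warp read memory in coalesced fashion with no data-dependent branching.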
[IR-30] GraphRAG-Router: Learning Cost-Efficient Routing over GraphRAGs and LLMs with Reinforcement Learning
【速读】:该论文旨在解决现有图结构检索增强生成(Graph-based Retrieval-Augmented Generation, GraphRAG)系统在处理多样化问题时存在的适应性不足与计算成本过高问题。当前方法通常采用统一的检索框架和单一大语言模型(Large Language Model, LLM)应对所有查询,缺乏对问题复杂度的动态响应能力,导致资源浪费。解决方案的关键在于提出 GraphRAG-Router 框架,其核心是通过分层路由策略协调异构的 GraphRAG 模块与不同规模的生成 LLM,并结合监督微调与两阶段强化学习优化机制——尤其在第二阶段引入课程感知的成本奖励函数,引导模型根据问题难度智能分配计算资源,在保证多跳问答性能的同时显著降低大型 LLM 的使用频率(约减少 30%),实现高效且自适应的知识密集型问答。
链接: https://arxiv.org/abs/2604.16401
作者: Dongzhe Fan,Chuanhao Ji,Zimu Wang,Tong Chen,Qiaoyu Tan
机构: New York University (Shanghai); University of Liverpool
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph-based retrieval-augmented generation (GraphRAG) has recently emerged as a powerful paradigm for knowledge-intensive question answering, especially for tasks that require structured evidence organization and multi-hop reasoning. However, existing GraphRAG systems are typically built in a one-size-fits-all manner, relying on a fixed retrieval framework and a single, often large and costly, generator LLM for all queries. This static design limits their ability to adapt to the complexity of varying questions and often incurs unnecessary computational cost. To fill in the gap, we propose GraphRAG-Router, a cost-efficient framework that adopts a hierarchical routing strategy to coordinate heterogeneous GraphRAGs and generator LLMs. Specifically, GraphRAG-Router is first warmed up through supervised fine-tuning and then optimized with a two-stage reinforcement learning procedure, whose second stage introduces a curriculum cost-aware reward to encourage difficulty-aware and economical generator allocation. Extensive experiments on six general-domain and multi-hop QA benchmarks show that GraphRAG-Router consistently outperforms state-of-the-art baselines, reducing the overuse of large LLMs by nearly 30% while maintaining strong generalization capability.
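The "curriculum cost-aware reward" in the second RL stage can be read as: reward correctness, penalize routing to expensive generators, and ramp the penalty up over training. The functional form and constants below are entirely our assumptions for illustration; the paper's actual reward may differ.

```python
# Hypothetical curriculum cost-aware reward: the cost-penalty weight grows
# linearly from 0 to lam_max over training, so early training focuses on
# correctness and later training on economical generator allocation.
def cost_aware_reward(correct, generator_cost, step, total_steps, lam_max=0.5):
    lam = lam_max * min(step / total_steps, 1.0)
    return (1.0 if correct else 0.0) - lam * generator_cost

# Early on, a correct answer from a large generator earns full reward;
# later, a cheap correct answer beats an expensive one.
early_large = cost_aware_reward(True, generator_cost=1.0, step=0, total_steps=100)
late_large = cost_aware_reward(True, generator_cost=1.0, step=100, total_steps=100)
late_small = cost_aware_reward(True, generator_cost=0.2, step=100, total_steps=100)
```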
[IR-31] A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search
【速读】:该论文旨在解决即兴数据集搜索(ad hoc dataset search)问题,即在用户以不完整自然语言查询时,如何高效准确地匹配稀疏且异构的数据集元数据记录。传统基于词法(如BM25)或稠密向量(dense embedding)的检索方法在此场景下均存在不足。其解决方案的关键在于将数据集搜索重构为软件架构问题,并提出一种受控且可审计的混合检索参考架构:通过大语言模型(LLM)代理协调BM25与稠密嵌入检索,利用互斥排名融合(RRF)整合结果;同时引入离线元数据增强步骤,由LLM生成伪查询以缓解用户意图与元数据间的词汇不匹配问题。该架构支持两种风格——单个ReAct代理与多代理水平架构(含反馈控制),并通过系统性评估框架量化各设计决策的贡献,从而实现可扩展、可观测、可治理的混合检索系统。
链接: https://arxiv.org/abs/2604.16394
作者: Riccardo Terrenzi,Phongsakon Mark Konrad,Tim Lukas Adam,Serkan Ayvaz
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures, accepted at SAML 2026
Abstract:Ad hoc dataset search requires matching underspecified natural-language queries against sparse, heterogeneous metadata records, a task where typical lexical or dense retrieval alone falls short. We reposition dataset search as a software-architecture problem and propose a bounded, auditable reference architecture for agentic hybrid retrieval that combines BM25 lexical search with dense-embedding retrieval via reciprocal rank fusion (RRF), orchestrated by a large language model (LLM) agent that repeatedly plans queries, evaluates the sufficiency of results, and reranks candidates. To reduce the vocabulary mismatch between user intent and provider-authored metadata, we introduce an offline metadata augmentation step in which an LLM generates pseudo-queries for each dataset record, augmenting both retrieval indexes before query time. Two architectural styles are examined: a single ReAct agent and a multi-agent horizontal architecture with Feedback Control. Their quality-attribute tradeoffs are analyzed with respect to modifiability, observability, performance, and governance. An evaluation framework comprising seven system variants is defined to isolate the contribution of each architectural decision. The architecture is presented as an extensible reference design for the software architecture community, incorporating explicit governance tactics to bound and audit nondeterministic LLM components.
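The architecture's fusion step, reciprocal rank fusion (RRF), has a standard formula: each document scores the sum of 1/(k + rank) over the rankings it appears in, with k conventionally set to 60. A minimal reference implementation (variable names are ours):

```python
# Reciprocal rank fusion of multiple ranked lists (here, BM25 + dense).
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists. Returns doc ids by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = rrf([bm25, dense])
```

Documents ranked well by both retrievers (d1, d3) float to the top, which is why RRF is a common score-free way to merge lexical and dense results.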
[IR-32] Large language models for post-publication research evaluation: Evidence from expert recommendations and citation indicators
【速读】:该论文旨在解决科学文献质量评估中存在的可扩展性差、主观性强和时效延迟等问题,提出利用大语言模型(Large Language Models, LLMs)对学术文本进行自动化评价,以支持发表后同行评审(post-publication peer review)任务。其解决方案的关键在于构建两个评估任务——粗粒度识别高质量文章与细粒度评分(包括文章评级、价值分类及专家风格评论),并系统比较多种LLM架构(如BERT、通用LLM和推理导向LLM)在不同学习策略(零样本提示、少样本提示、监督微调)下的表现,发现监督微调能显著提升模型在细粒度任务中的准确性和一致性,从而为自动化科研评价提供可行路径。
链接: https://arxiv.org/abs/2604.16387
作者: Mengjia Wu,Yi Zhang,Robin Haunschild,Lutz Bornmann
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注:
Abstract:Assessing the quality of scientific research is essential for scholarly communication, yet widely used approaches face limitations in scalability, subjectivity, and time delay. Recent advances in large language models (LLMs) offer new opportunities for automated research evaluation based on textual content. This study examines whether LLMs can support post-publication peer review tasks by benchmarking their outputs against expert judgments and citation-based indicators. Two evaluation tasks are constructed using articles from the H1 Connect platform: identifying high-quality articles and performing finer-grained evaluation including article rating, merit classification, and expert style commenting. Multiple model families, including BERT models, general-purpose LLMs, and reasoning oriented LLMs, are evaluated under multiple learning strategies. Results show that LLMs perform well in coarse grained evaluation tasks, achieving accuracy above 0.8 in identifying highly recommended articles. However, performance decreases substantially in fine-grained rating tasks. Few-shot prompting improves performance over zero-shot settings, while supervised fine-tuning produces the strongest and most balanced results. Retrieval augmented prompting improves classification accuracy in some cases but does not consistently strengthen alignment with citation indicators. The overall correlations between model outputs and citation indicators remain positive but moderate.
[IR-33] LLMAR: A Tuning-Free Recommendation Framework for Sparse and Text-Rich Industrial Domains DATE
【速读】:该论文针对工业B2B场景中推荐系统面临的极端数据稀疏性与丰富的文本交互特征之间的矛盾问题展开研究,传统基于ID的协同过滤方法因缺乏共现信号而失效,而微调标准大语言模型(Large Language Models, LLMs)则存在高运营成本和难以应对频繁数据漂移的局限。解决方案的关键在于提出一种无需微调的框架LLMAR(LLM-Annotated Recommendation),其核心创新包括:(1) 推理驱动标注机制,利用LLM将用户行为历史转化为结构化的语义动机,实现基于推理的匹配;(2) 反思循环机制,通过自校正策略缓解生成幻觉并解决历史信息与当前指令间的“上下文竞争”问题;(3) 成本高效的架构设计,采用无训练组件与异步批量处理以显著降低维护开销。实验证明,LLMAR在公共基准和工业稀疏数据集上均显著优于现有学习型模型,在保持低推理成本(约每千用户1美元)的同时实现了更高准确率与可解释性。
链接: https://arxiv.org/abs/2604.16379
作者: Ryogo Hishikawa,Ichiro Kataoka,Shinya Yuda
机构: Hitachi, Ltd. Research and Development Group(日立有限公司研发部); Hitachi Power Solutions Co., Ltd.(日立电力解决方案公司)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 10 pages, 3 figures, github link is to be updated
Abstract:Industrial B2B applications (e.g., construction site risk prediction, material procurement) face extreme data sparsity yet feature rich textual interactions. In such environments, traditional ID-based collaborative filtering fails lacking co-occurrence signals, while fine-tuning standard Large Language Models (LLMs) incurs high operational costs and struggles with frequent data drift. We propose LLMAR (LLM-Annotated Recommendation), a tuning-free framework. Moving beyond simple embeddings, LLMAR systematically integrates LLM reasoning to capture user "latent motives" without any training process. We introduce three core contributions: (1) Inference-Driven Annotation: uses LLMs to transform behavioral history into structured semantic motives, enabling reasoning-based matching unattainable by ID-based methods; (2) Reflection Loop: a self-correction mechanism that refines generated queries to mitigate hallucinations and resolve "context competition" between past history and current instructions; and (3) Cost-Effective Architecture: relies on tuning-free components and asynchronous batch processing to minimize maintenance costs. Evaluations on public benchmarks (MovieLens-1M, Amazon Prime Pantry) and a sparse industrial dataset (construction risk prediction) demonstrate that LLMAR outperforms state-of-the-art learning-based models (SASRecF), achieving up to a 54.6% nDCG@10 improvement on the industrial dataset. Inference costs remain highly practical (~$1 per 1,000 users). For B2B domains where strict real-time latency is not critical, combining LLM reasoning with self-verification offers a superior alternative to training-based approaches across accuracy, explainability, and operational cost.
[IR-34] AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval ECIR2026
【速读】:该论文旨在解决农业领域中检索增强生成(Retrieval-Augmented Generation, RAG)系统在准确性、可解释性与资源约束之间难以平衡的问题。现有方案往往依赖于大型单一模型,导致计算成本高、部署复杂且缺乏对特定知识垂直领域的适应能力。解决方案的关键在于提出AgriIR框架,其核心是将信息获取流程分解为可配置的模块化阶段——查询精炼、子查询规划、检索、合成与评估,从而实现灵活适配新知识领域而不改变架构;同时结合1B参数语言模型、自适应检索器和领域感知代理目录,在保障确定性引用与透明度的前提下,以较低计算开销实现精准、可信的农业问答服务,体现了“面向农业的人工智能”(AI for Agriculture)理念中的可及性、可持续性和问责制。
链接: https://arxiv.org/abs/2604.16353
作者: Shuvam Banerji Seal,Aheli Poddar,Alok Mishra,Dwaipayan Roy
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted at ECIR 2026
Abstract:This paper introduces AgriIR, a configurable retrieval augmented generation (RAG) framework designed to deliver grounded, domain-specific answers while maintaining flexibility and low computational cost. Instead of relying on large, monolithic models, AgriIR decomposes the information access process into declarative modular stages – query refinement, sub-query planning, retrieval, synthesis, and evaluation. This design allows practitioners to adapt the framework to new knowledge verticals without modifying the architecture. Our reference implementation targets Indian agricultural information access, integrating 1B-parameter language models with adaptive retrievers and domain-aware agent catalogues. The system enforces deterministic citation, integrates telemetry for transparency, and includes automated deployment assets to ensure auditable, reproducible operation. By emphasizing architectural design and modular control, AgriIR demonstrates that well-engineered pipelines can achieve domain-accurate, trustworthy retrieval even under constrained resources. We argue that this approach exemplifies "AI for Agriculture" by promoting accessibility, sustainability, and accountability in retrieval-augmented generation systems.
[IR-35] Training for Compositional Sensitivity Reduces Dense Retrieval Generalization
【速读】:该论文旨在解决密集检索(Dense Retrieval)在语义匹配任务中对文本结构敏感性不足的问题,即当输入文本发生细微但语义显著变化(如否定、角色互换等)时,其嵌入向量仍保持高余弦相似度,导致身份级匹配失效。解决方案的关键在于引入结构目标负样本(structure-targeted negatives),通过显式建模语义不变性与结构敏感性的权衡,在不显著损害召回性能的前提下提升模型对语义扰动的鲁棒性;此外,作者进一步提出基于相似度图的Transformer架构,能够有效区分结构相近但语义不同的样本(near-misses),从而在端到端训练下实现更精确的细粒度匹配。
链接: https://arxiv.org/abs/2604.16351
作者: Radoslav Ralev,Aditeya Baral,Iliya Zhechev,Jen Agarwal,Srijith Rajamohan
机构: Redis(Redis)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Dense retrieval compresses texts into single embeddings ranked by cosine similarity. While efficient for recall, this interface is brittle for identity-level matching: minimal compositional edits (negation, role swaps) flip meaning yet retain high similarity. Motivated by geometric results for unit-sphere cosine spaces (Kang et al., 2025), we test this retrieval-composition tension in text-only retrieval. Across four dual-encoder backbones, adding structure-targeted negatives consistently reduces zero-shot NanoBEIR retrieval (8-9% mean nDCG@10 drop on small backbones; up to 40% on medium ones), while only partially improving pooled-space separation. Treating pooled cosine as a recall interface, we then benchmark verifiers scoring token–token cosine maps. MaxSim (late interaction) excels at reranking but fails to reject structural near-misses, whereas a small Transformer over similarity maps reliably separates near-misses under end-to-end training.
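The MaxSim (late interaction) verifier benchmarked above has a simple standard form: the query-document score is the sum, over query tokens, of each token's maximum cosine similarity to any document token. A minimal sketch with toy 2-d token vectors (a real system uses learned contextual embeddings):

```python
# MaxSim late-interaction scoring over token embeddings (ColBERT-style).
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best cosine match among doc tokens."""
    return sum(max(cos(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_match = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # covers both query tokens
doc_partial = [[1.0, 0.0], [1.0, 0.1]]             # covers only the first
```

Because each query token is matched independently, MaxSim rewards coverage but, as the abstract notes, it cannot penalize structural near-misses (e.g., role swaps) whose tokens all still find good matches.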
[IR-36] LiteSemRAG: Lightweight LLM-Free Semantic-Aware Graph Retrieval for Robust RAG
【速读】:该论文旨在解决现有基于图的检索增强生成(Graph-based Retrieval-Augmented Generation, RAG)框架在索引和查询阶段严重依赖大语言模型(Large Language Models, LLMs)所导致的高令牌消耗、计算成本及延迟问题。其解决方案的关键在于提出 LiteSemRAG,一个完全无需 LLM 的轻量级语义感知图检索框架:通过利用上下文相关的词元级嵌入构建异构语义图,显式分离表面词汇表示与上下文依赖的语义含义;引入动态语义节点构建机制以鲁棒地建模多义性(polysemy),结合块级上下文聚合与自适应异常处理;并在查询阶段采用两步语义感知检索流程,融合共现图加权与孤立语义恢复机制,实现结构推理与语义覆盖的平衡。实验表明,LiteSemRAG 在多个基准数据集上达到最优平均倒数排名(MRR@10),且召回率(Recall@10)具有竞争力或优于当前最先进的 LLM-based 图 RAG 系统,同时实现零 LLM 令牌消耗和显著的效率提升。
链接: https://arxiv.org/abs/2604.16350
作者: Xiao Yue,Guangzhi Qu,Lige Gan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph-based Retrieval-Augmented Generation (RAG) has shown great potential for improving multi-level reasoning and structured evidence aggregation. However, existing graph-based RAG frameworks heavily rely on exploiting large language models (LLMs) during indexing and querying, leading to high token consumption, computational cost and latency overhead. In this paper, we propose LiteSemRAG, a lightweight, fully LLM-free, semantic-aware graph retrieval framework. LiteSemRAG constructs a heterogeneous semantic graph by exploiting contextual token-level embeddings, explicitly separating surface lexical representations from context-dependent semantic meanings. To robustly model polysemy, we introduce a dynamic semantic node construction mechanism with chunk-level context aggregation and adaptive anomaly handling. At query stage, LiteSemRAG performs a two-step semantic-aware retrieval process that integrates co-occurrence graph weighting with an isolated semantic recovery mechanism, enabling balanced structural reasoning and semantic coverage. We evaluate LiteSemRAG on three benchmark datasets and experimental results show that LiteSemRAG achieves the best mean reciprocal rank (MRR@10) across all datasets and competitive or superior recall rate (Recall@10) compared to state-of-the-art LLM-based graph RAG systems. Meanwhile, LiteSemRAG consumes zero LLM tokens and achieves substantial efficiency improvements in both indexing and querying due to the elimination of LLM usage. These results demonstrate the effectiveness of LiteSemRAG, indicating that a strong semantic-aware graph retrieval framework can be achieved without relying on LLM-based approaches.
[IR-37] Benchmarking Real-Time Question Answering via Executable Code Workflows
【速读】:该论文旨在解决现有基准测试在评估搜索集成智能体(search-integrated agents)时普遍存在的静态性问题,即无法捕捉信息的时间动态性和现实世界知识的持续演化特性。为应对这一挑战,作者提出了一种名为RT-QA的动态评估框架,其核心创新在于构建了一个由智能体驱动的可执行代码工作流,能够实时爬取网页并基于DOM结构提取答案以生成时效性强的“真实标签”(ground truth)。该框架的关键在于通过自动化代码生成实现动态数据获取,并引入自修复机制以适应网页结构变化,从而确保长期评估的鲁棒性。
链接: https://arxiv.org/abs/2604.16349
作者: Wenjie Zhou,Yuan Gao,Xin Zhou,Hao Fu,Zhongjian Miao,Wei Chen,Bo Chen,Xiaobing Zhao
机构: Li Auto Inc.(理想汽车); Minzu University of China(中央民族大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Retrieving real-time information is a fundamental capability for search-integrated agents in real-world applications. However, existing benchmarks are predominantly static and therefore fail to capture the temporal dynamics of information and the continuously evolving nature of real-world knowledge. To address this limitation, we propose RT-QA, a dynamic evaluation framework that leverages executable code workflows to retrieve up-to-date answers at evaluation time. Specifically, we construct an agent-driven pipeline that autonomously generates code for web crawling and DOM-based answer extraction to produce real-time ground truth. To ensure robust evaluation over time, the pipeline further incorporates a self-repair mechanism to adapt to changes in web page structures. RT-QA spans 12 domains (e.g., Finance, Sports) with 320 Chinese questions categorized into three difficulty levels. Extensive evaluations of state-of-the-art models (e.g., GPT-5.2, GLM-4.7) reveal significant limitations in real-time adaptability: even the best models achieve only 46% accuracy. Our analysis highlights two primary failure modes: (1) Lazy Retrieval, where agents rely on search snippets instead of deeply scanning specific websites for information (20% of failures); and (2) Temporal Confusion, a cognitive error where agents retrieve a historical date (e.g., an event in 2024) and fail to re-anchor to the current time (2026) for subsequent reasoning. These findings suggest that future agents require not just better retrieval strategies, but robust temporal state management.
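The pipeline's DOM-based answer extraction step can be illustrated with the standard library's HTML parser: parse a page and collect the text of a target element. This sketch runs on an inline snippet; the benchmark's agent-generated crawlers and selectors are page-specific and far more elaborate, and the class names below are invented.

```python
# Illustrative DOM-based answer extraction with the stdlib HTMLParser.
from html.parser import HTMLParser

class AnswerExtractor(HTMLParser):
    """Collect text inside elements whose class attribute matches a target.
    (Simplified: no handling of void elements or nested matches.)"""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # >0 while inside a matching element
        self.answer = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("class") == self.target_class:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.answer.append(data.strip())

html = '<div><span class="price">42.50</span><span class="date">2026-04-21</span></div>'
parser = AnswerExtractor("price")
parser.feed(html)
answer = " ".join(parser.answer)
```

The self-repair mechanism described in the abstract would regenerate the selector (here, `target_class`) when the page structure changes and extraction starts returning empty answers.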
[IR-38] HR-Agents: Using Multiple LLM-based Agents to Improve QA about Brazilian Labor Legislation
【速读】:该论文旨在解决巴西劳动法(Consolidation of Labor Laws, CLT)体系复杂性给人力资源(HR)专业人员带来的合规挑战,传统问答方法常导致效率低下、延迟和不一致的问题。解决方案的关键在于提出一种基于大型语言模型(Large Language Models, LLMs)的多智能体系统(multi-agent system),通过专业化分工的智能体协同工作,并结合检索增强生成(Retrieval-Augmented Generation, RAG)技术提升回答的上下文相关性与准确性;该系统利用CrewAI框架实现智能体间的协作与响应验证,从而显著提高法律问答的连贯性和正确性,为HR提供更可靠、高效的合规支持。
链接: https://arxiv.org/abs/2604.16337
作者: Abriel K. Moraes,Gabriel S. M. Dias,Vitor L. Fabris,Lucas D. Gessoni,Leonardo R. do Nascimento,Charles S. Oliveira,Vitor G. C. B. de Farias,Fabiana C. Q. de O. Marucci,Matheus H. R. Vicente,Gabriel U. Talasso,Erik Soares,Amparo Munoz,Sildolfo Gomes,Maria L. A. de S. Cruvinel,Leonardo T. dos Santos,Renata De Paris,Wandemberg Gibaut
机构: Eldorado Institute of Research (埃拉多研究所)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Paper presented on: July 2025 Conference: XVII Simpósio Brasileiro de Automação Inteligente (SBAI) At: São João del-Rei
Abstract:The Consolidation of Labor Laws (CLT) serves as the primary legal framework governing labor relations in Brazil, ensuring essential protections for workers. However, its complexity creates challenges for Human Resources (HR) professionals in navigating regulations and ensuring compliance. Traditional methods for addressing labor law inquiries often lead to inefficiencies, delays, and inconsistencies. To enhance the accuracy and efficiency of legal question-answering (QA), a multi-agent system powered by Large Language Models (LLMs) is introduced. This approach employs specialized agents to address distinct aspects of employment law while integrating Retrieval-Augmented Generation (RAG) to enhance contextual relevance. Implemented using CrewAI, the system enables cooperative agent interactions, ensuring response validation and reducing misinformation. The effectiveness of this framework is evaluated through a comparison with a baseline RAG pipeline utilizing a single LLM, using automated metrics such as BLEU, LLM-as-judge evaluations, and expert human assessments. Results indicate that the multi-agent approach improves response coherence and correctness, providing a more reliable and efficient solution for HR professionals. This study contributes to AI-driven legal assistance by demonstrating the potential of multi-agent LLM architectures in improving labor law compliance and streamlining HR operations.
[IR-39] A Collection of Systematic Reviews in Computer Science
【速读】:该论文旨在解决系统性综述(Systematic Review)自动化过程中缺乏跨学科评估资源的问题,尤其在计算机科学领域,现有自动化方法的可复现性受限于数据稀缺与标准缺失。其解决方案的关键在于构建SR4CS——一个大规模、结构化的计算机科学系统性综述语料库,包含1,212篇综述、原始专家设计的布尔查询(Boolean Query)、104,316条已标注参考文献及元数据,并提供标准化的近似形式查询以支持统一评估。该语料库使研究者能够在检索(retrieval)与筛选(screening)阶段进行可控实验,从而推动生成式AI(Generative AI)和信息检索技术在系统性综述自动化中的可复现研究与优化。
链接: https://arxiv.org/abs/2604.16330
作者: Pierre Achkar,Tim Gollub and Martin Potthast
机构: Leipzig University (莱比锡大学); Fraunhofer ISI Leipzig (弗劳恩霍夫研究所莱比锡分部); Bauhaus-Universität Weimar (包豪斯大学魏玛分校); Kassel University (卡塞尔大学); hessian.AI (黑森人工智能); ScaDS.AI (数据科学中心)
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
备注: Accepted at SCOLIA26 Workshop
Abstract:Systematic reviews are the standard method for synthesizing scientific evidence, but their creation requires substantial manual effort, particularly during retrieval and screening. While recent work has explored automating these steps, evaluation resources remain largely confined to the biomedical domain, limiting reproducible experimentation in other domains. This paper introduces SR4CS, a large-scale collection of systematic reviews in computer science, designed to support reproducible research on Boolean query generation, retrieval, and screening. The corpus comprises 1,212 systematic reviews with their original expert-designed Boolean search queries, 104,316 resolved references, and structured methodological metadata. For controlled evaluation, the original Boolean queries are additionally provided in a normalized, approximated form operating over titles and abstracts. To illustrate the intended use of the collection, baseline experiments compare the approximated expert Boolean queries with zero-shot LLM-generated Boolean queries, BM25, and dense retrieval under a unified evaluation setting. The results highlight systematic differences in precision, recall, and ranking behavior across retrieval paradigms and expose limitations of naive zero-shot Boolean generation. SR4CS is released under an open license on Zenodo (this https URL), together with documentation and code (this https URL), to enable reproducible evaluation and future research on scaling systematic review automation.
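The collection's normalized Boolean queries "operating over titles and abstracts" can be evaluated with a simple matcher. As a hedged sketch, we represent a query in conjunctive normal form (an AND of OR-groups), which is the typical shape of expert search strings; the actual normalization in SR4CS may differ.

```python
# Evaluate an (AND of ORs) Boolean query against a title+abstract record.
# The CNF representation and example data are our simplification.
def matches(record, query_cnf):
    """record: dict with 'title' and 'abstract'. query_cnf: list of OR-groups
    (lists of terms); every group must have at least one term present."""
    text = (record["title"] + " " + record["abstract"]).lower()
    return all(any(term.lower() in text for term in group) for group in query_cnf)

record = {"title": "Neural Retrieval for Systematic Reviews",
          "abstract": "We study dense retrieval and screening automation."}
query = [["systematic review", "literature review"], ["retrieval", "search"]]
hit = matches(record, query)
miss = matches(record, query + [["blockchain"]])
```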
[IR-40] Beyond Single-Score Ranking: Facet-Aware Reranking for Controllable Diversity in Paper Recommendation
【速读】:该论文旨在解决当前论文推荐系统仅输出单一相似度分数、无法区分不同层面相关性的局限性,从而导致用户难以理解推荐依据的问题。其解决方案的关键在于提出SciFACE(Scientific Faceted Cross-Encoder)框架,该框架将科学文献间的相似性解耦为两个独立的语义维度:背景(Background,即研究问题)和方法(Method,即解决手段),并通过训练两个分离的交叉编码器(cross-encoder)分别建模这两个维度。该方法基于5,891对由GPT-4o-mini标注的种子-候选论文对进行训练,并在CSFCube数据集上显著优于SPECTER等基线模型,在背景和方法两个维度上分别获得70.63和49.06的NDCG@20得分,证明了高质量人工标注的细粒度标签比大规模合成增强更高效地学习科学相似性。
链接: https://arxiv.org/abs/2604.16329
作者: Duan Ming Tao
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Current paper recommendation systems output a single similarity score that mixes different notions of relatedness, so users cannot specify why papers should be similar. We present SciFACE (Scientific Faceted Cross-Encoder), a reranking framework that models two independent facets: Background (what problem is studied) and Method (how it is solved). SciFACE trains two separate cross-encoders on 5,891 real seed-candidate paper pairs labeled by GPT-4o-mini with facet-specific criteria and validated against human judgments. On CSFCube, SciFACE reaches 70.63 NDCG@20 on Background (5.9 points above SPECTER) and 49.06 NDCG@20 on Method (31.1 points above SPECTER), competitive with state-of-the-art results. Compared with FaBLE without citation pre-training, SciFACE improves Method NDCG@20 by 4.1 points while using 5,891 labeled pairs versus 40K synthetic augmentations. These results show that high-quality grounded facet labels can be more data-efficient than large-scale synthetic augmentation for learning fine-grained scientific similarity.
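摘要中报告的 NDCG@20 指标可按如下方式计算(最小示意实现,k 取 20 即对应 NDCG@20):

```python
import math

def ndcg_at_k(relevances, k):
    """relevances: 按系统排序给出的分级相关性;返回 NDCG@k。"""
    def dcg(rels):
        # 折损累计增益:排名越靠后,增益按 log2 折扣
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```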
[IR-41] Diagnosing LLM-based Rerankers in Cold-Start Recommender Systems: Coverage, Exposure and Practical Mitigations
【速读】:该论文旨在解决生成式 AI(Generative AI)与交叉编码器重排序器(cross-encoder reranker)在冷启动电影推荐场景中实际性能远低于简单基线的问题。研究通过系统性诊断发现,性能差距主要源于候选集生成阶段的低召回覆盖率(recall@200 = 0.109 vs. 0.609)、严重的曝光偏差(重排序器仅集中推荐3个物品,而随机基线为497个)以及评分区分度极弱(相关与不相关项平均分差仅为0.098,Cohen’s d = 0.13)。解决方案的关键在于:首先识别出问题根源不在重排序器本身的能力,而在检索阶段的局限性;进而提出混合检索策略、候选池规模优化及分数校准技术等可操作建议,以提升整体推荐系统的有效性与公平性。
链接: https://arxiv.org/abs/2604.16318
作者: Ekaterina Lemdiasova,Nikita Zmanovskii
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 12 pages, 7 figures. Code and data available at this https URL
Abstract:Large language models (LLMs) and cross-encoder rerankers have gained attention for improving recommender systems, particularly in cold-start scenarios where user interaction history is limited. However, practical deployment reveals significant performance gaps between LLM-based approaches and simple baselines. This paper presents a systematic diagnostic study of cross-encoder rerankers in cold-start movie recommendation using the Serendipity-2018 dataset. Through controlled experiments with 500 users across multiple random seeds, we identify three critical failure modes: (1) low retrieval coverage in candidate generation (recall@200 = 0.109 vs. 0.609 for baselines), (2) severe exposure bias with rerankers concentrating recommendations on 3 unique items versus 497 for random baseline, and (3) minimal score discrimination between relevant and irrelevant items (mean difference = 0.098, Cohen’s d = 0.13). We demonstrate that popularity-based ranking substantially outperforms LLM reranking (HR@10: 0.268 vs. 0.008, p < 0.001), with the performance gap primarily attributable to retrieval stage limitations rather than reranker capacity. Based on these findings, we provide actionable recommendations including hybrid retrieval strategies, candidate pool size optimization, and score calibration techniques. All code, configurations, and experimental results are made available for reproducibility.
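摘要中用于刻画"评分区分度极弱"的 Cohen's d 效应量(0.13)按合并标准差定义计算,示意如下:

```python
import math

def cohens_d(a, b):
    """两组样本均值差的标准化效应量(采用合并标准差)。"""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # 样本方差
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled
```

按惯例 d≈0.2 记为小效应,因此论文报告的 d=0.13 表明重排序器对相关与不相关项几乎没有区分能力。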
[IR-42] Paper2Data: Large-Scale LLM Extraction and Metadata Structuring of Global Urban Data from Scientific Literature
【速读】:该论文旨在解决全球尺度上缺乏统一的城市数据发现平台的问题,从而导致研究人员需手动搜索网站或学术文献以识别相关数据集的低效现状。其解决方案的关键在于构建了一个名为UrbanDataMiner的开放城市数据发现门户,该平台基于Paper2Data这一创新的大规模大语言模型(Large Language Model, LLM)驱动的数据提取流水线,能够自动识别科学论文中的数据集提及并将其结构化为统一的城市数据元数据模式。实验表明,Paper2Data在数据集识别上具有约90%的召回率和超过80%的字段级精确度,且UrbanDataMiner可检索到超过9%通过通用搜索引擎难以发现的数据集,首次实现了基于文献的大规模、可复用的城市数据基础设施。
链接: https://arxiv.org/abs/2604.16317
作者: Runwen You,Tong Xia,Jingzhi Wang,Jiankun Zhang,Tengyao Tu,Jinghua Piao,Yi Chang,Yong Li
机构: Jilin University(吉林大学); Zhongguancun Academy(中关村学院); Tsinghua University(清华大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Urban data support a wide range of applications across multiple disciplines. However, at the global scale, there is no unified platform for urban data discovery. As a result, researchers often have to manually search through websites or scientific literature to identify relevant datasets. To address this problem, we curate an open urban data discovery portal, UrbanDataMiner, which supports dataset-level search and filtering over more than 60,000 urban datasets extracted from over 15,000 Nature-affiliated publications. UrbanDataMiner is enabled by Paper2Data, a novel large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema. Human-annotated evaluation demonstrates that Paper2Data achieves high recall (approximately 90%) in dataset identification and high field-level precision (above 80%). In addition, UrbanDataMiner can retrieve over 9% of datasets that are not easily discoverable through general-purpose search engines such as Google. Overall, our work provides the first large-scale, literature-derived infrastructure for urban data discovery and enables more systematic and reusable data-driven research across disciplines. Our code and data are publicly available (this https URL).
[IR-43] CrossTraffic: An Open-Source Framework for Reproducible and Executable Transportation Analysis and Knowledge Management
【速读】:该论文旨在解决交通运输工程领域中技术手册与分析方法(如《公路通行能力手册》(Highway Capacity Manual, HCM))在传播、管理和实施过程中存在的碎片化问题,包括计算流程嵌套于专有工具、更新不一致以及知识传递受限等挑战,这些问题严重阻碍了分析的可复现性、互操作性和协作发展。其解决方案的关键在于提出一个名为CrossTraffic的开源框架,将交通工程方法和法规知识视为可持续部署与验证的软件基础设施;该框架通过标准化接口提供跨平台访问的可执行计算核心,利用本体驱动的知识图谱编码工程规则与溯源信息,作为分析工作流的语义验证层,并结合对话式接口使大语言模型(Large Language Models, LLMs)能够以结构化工具调用方式接入经验证的执行环境,从而在保持方法学准确性的同时实现自然语言交互,实验表明该机制显著提升了数值精度并实现了对无效输入的精准检测(F1≈1.0)。
链接: https://arxiv.org/abs/2604.16316
作者: Rei Tamaru,Bin Ran
机构: 未知
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注:
Abstract:Transportation engineering often relies on technical manuals and analytical tools for planning, design, and operations. However, the dissemination and management of these methodologies, such as those defined in the Highway Capacity Manual (HCM), remain fragmented. Computational procedures are often embedded within proprietary tools, updates are inconsistently propagated across platforms, and knowledge transfer is limited. These challenges hinder reproducibility, interoperability, and collaborative advancement in transportation analysis. This paper introduces CrossTraffic, an open-source framework that treats transportation methodologies and regulatory knowledge as continuously deployable and verifiable software infrastructure. CrossTraffic provides an executable computational core for transportation analysis with cross-platform access through standardized interfaces. An ontology-driven knowledge graph encodes engineering rules and provenance and serves as a semantic validation layer for analytical workflows. A conversational interface further connects large language models to this validated execution environment through structured tool invocation, enabling natural-language access while preventing procedurally invalid analyses. Experimental results show that knowledge-graph-constrained execution substantially improves numerical accuracy and methodological fidelity compared with context-only approaches, achieving near-zero numerical error (MAE < 0.50) across multiple large language models and perfect detection of invalid analytical inputs in stress testing (F1 ≈ 1.0). Its modular architecture supports the integration of additional transportation manuals and research models, providing a foundation for an open and collaborative transportation science ecosystem with a reproducible computational core. The system implementation is publicly available at this https URL.
[IR-44] MARA: A Multimodal Adaptive Retrieval-Augmented Framework for Document Question Answering
【速读】:该论文旨在解决基于检索的多模态文档问答(Retrieval-based Multimodal Document QA)中现有方法在处理视觉丰富、结构复杂的文档时存在的两大核心问题:一是当前方法依赖于与查询无关的文档表征,忽略了关键内容;二是采用静态的 top-k 证据选择策略,无法适应相关资讯分布的不确定性。其解决方案的关键在于提出 Multimodal Adaptive Retrieval-Augmented (MARA) 框架,通过两个创新组件实现查询自适应的检索与生成:一是 Query-Aligned Region Encoder,构建多层级文档表征并根据查询相关性重加权以提升检索精度;二是 Self-Reflective Evidence Controller,在生成过程中动态监控证据充分性,并利用滑动窗口策略自适应地引入低排名来源的内容,从而增强答案的准确性和鲁棒性。
链接: https://arxiv.org/abs/2604.16313
作者: Hui Wu,Haoquan Zhai,Yuchen Li,Hengyi Cai,Peirong Zhang,Yidan Zhang,Lei Wang,Chunle Wang,Yingyan Hou,Shuaiqiang Wang,Dawei Yin
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Retrieval-based multimodal document QA aims to identify and integrate relevant information from visually rich documents with complex multimodal structures. While retrieval-augmented generation (RAG) has shown strong performance in text-based QA, its extensions to multimodal documents remain underexplored and face significant limitations. Specifically, current approaches rely on query-agnostic document representations that overlook salient content and use static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. To address these limitations, we propose the Multimodal Adaptive Retrieval-Augmented (MARA) framework, which introduces query-adaptive mechanisms to both retrieval and generation. MARA consists of two components: a Query-Aligned Region Encoder that builds multi-level document representations and reweights them based on query relevance to improve retrieval precision; and a Self-Reflective Evidence Controller that monitors evidence sufficiency during generation and adaptively incorporates content from lower-ranked sources using a sliding-window strategy. Experiments on six multimodal QA benchmarks demonstrate that MARA consistently improves retrieval relevance and answer quality over existing SOTA method.
[IR-45] FlexStructRAG: Flexible Structure-Aware Multi-Granular Relational Retrieval for RAG
【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统在外部知识组织与检索过程中存在的局限性问题:传统方法通常采用固定长度文本块进行检索,导致语篇上下文碎片化;或仅依赖单一结构化索引(如知识图谱或超图),硬编码了特定关系粒度,难以适应查询对局部二元关系、高阶交互或文档级上下文等不同形式证据的需求。解决方案的关键在于提出 FlexStructRAG 框架,其核心创新包括:(1) 联合构建三种异构知识表示——用于二元关系的知识图谱、用于 n 元关系的知识超图,以及结构感知的语义聚类单元(structure-aware semantic clusters)以聚合关系证据并形成文档级上下文单元;(2) 引入动态分割与截断滑动窗口提取机制,在知识构建阶段减少因统一切片引发的语义断裂;(3) 在推理阶段支持实体、边、超边及聚类层级的多粒度自适应检索,并可灵活组合以提供关系和上下文对齐的证据,从而显著提升生成质量。
链接: https://arxiv.org/abs/2604.16312
作者: Mengzhu Chen,Haodong Yang,Jia Cai,Xiaolin Huang
机构: Guangdong University of Finance & Economics (广东金融学院); Shanghai Jiao Tong University (上海交通大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems critically depend on how external knowledge is segmented, structured, and retrieved. Most existing approaches either retrieve fixed-length text chunks, which fragments discourse context, or commit to a single structured index (e.g., a knowledge graph or hypergraph), which hard-codes one relational granularity. This often yields brittle retrieval when queries require different forms of evidence, such as local binary relations, higher-order interactions, or broader document-grounded context. We propose FlexStructRAG, a flexible structure-aware RAG framework that supports multi-granular, query-adaptive retrieval over heterogeneous knowledge representations. FlexStructRAG jointly constructs (i) a knowledge graph for binary relations, (ii) a knowledge hypergraph for n-ary relations, and (iii) structure-aware semantic clusters that aggregate relational evidence into document-grounded context units. To reduce semantic fragmentation induced by uniform chunking, we introduce dynamic partitioning and a truncated sliding-window extraction mechanism that incorporates bounded contextual dependencies during knowledge construction. At inference time, FlexStructRAG enables entity-, edge-, hyperedge-, and cluster-level retrieval, which can be flexibly combined to supply generation with relationally and contextually aligned evidence. Experiments on the UltraDomain benchmark across four domains show that FlexStructRAG improves semantic evaluation over strong RAG baselines. Ablation and sensitivity analysis further demonstrate the necessity of multi-granular relational retrieval and structure-aware clustering.
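截断滑动窗口提取的核心思想——用重叠窗口保留有界的上下文依赖、并限制窗口数量——可用如下假设性草图示意(函数名与参数均为本文虚构,非论文实现):

```python
def sliding_chunks(tokens, window=6, stride=3, max_chunks=None):
    """对词序列做重叠滑窗切分;max_chunks 对应"截断"滑窗。

    相邻窗口重叠 window - stride 个词,使跨切分边界的依赖
    至少出现在一个窗口内,缓解统一定长切片造成的语义断裂。
    """
    chunks = []
    last_start = max(len(tokens) - window, 0)
    for start in range(0, last_start + 1, stride):
        chunks.append(tokens[start:start + window])
        if max_chunks is not None and len(chunks) >= max_chunks:
            break
    return chunks
```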
[IR-46] RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation
【速读】:该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统评估方法依赖静态多轮数据集的问题,此类方法无法捕捉真实对话中动态、上下文相关的交互特性。传统评估方式受限于预定义的一次性查询模式,难以反映RAG系统在多轮交互场景下的自适应性能。其解决方案的关键在于提出RAG-DIVE(Dynamic Interactive Validation and Evaluation),一种基于大语言模型(LLM)的动态交互式验证与评估框架,包含三个核心组件:(1) 对话生成器模拟用户生成多轮查询,(2) 对话验证器过滤并修正低质量输出以保障对话连贯性,(3) 对话评估器对整个对话过程进行端到端评估,提供逐轮及聚合指标,从而实现对RAG系统在真实交互环境中的全面性能刻画。
链接: https://arxiv.org/abs/2604.16310
作者: Lorenz Brehme,Benedikt Dornauer,Jan-Henrik Böttcher,Klaus Schmid,Mircea-Cristian Racasan,Ruth Breu
机构: University of Innsbruck(因斯布鲁克大学); University of Hildesheim(希尔德斯海姆大学); c.c.com Moser GmbH( c.c.com 莫泽有限公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted for publication at CAIN 2026 (5th International Conference on AI Engineering)
Abstract:Evaluating Retrieval-Augmented Generation (RAG) systems using static multi-turn datasets fails to capture the dynamic nature of real-world dialogues. Existing evaluation methods rely on predefined datasets, which restrict them to static, one-directional queries and limit their ability to capture the adaptive, context-dependent performance of RAG systems in interactive, multi-turn settings. Thus, we introduce the RAG-DIVE, a Dynamic Interactive Validation and Evaluation approach, that simulates user interactions with RAG systems. RAG-DIVE leverages an LLM to generate multi-turn conversations dynamically and is organized into three components. The dialogue generation stage consists of the (1) Conversation Generator, which simulates a user by creating multi-turn queries, and the (2) Conversation Validator, which filters and corrects invalid or low-quality outputs to ensure coherent conversations. The evaluation stage is handled by the (3) Conversation Evaluator, which assesses the RAG system’s performance across the entire dialogue and generates both per-turn and multi-turn metrics that provide an aggregated view of system behavior. We validated RAG-DIVE through two experimental setups. First, we tested a sample RAG system, including human evaluation of dialogue quality, repeated trials to assess consistency, and an ablation study showing that RAG-DIVE detects performance changes caused by system modifications. Second, we compared RAG-DIVE with a traditional static dataset evaluation on an industrial RAG system under different configurations to verify whether both approaches reveal similar performance trends. Our findings demonstrate that RAG-DIVE facilitates dynamic, interaction-driven evaluation for multi-turn conversations, thereby advancing the assessment of RAG systems.
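摘要中"逐轮指标汇总为对话级聚合指标"的一步可用如下最小草图示意(指标名与简单平均仅为假设,论文 Conversation Evaluator 的实际指标与聚合方式以原文为准):

```python
def aggregate_dialogue(per_turn_scores):
    """把逐轮指标字典汇总为对话级平均值。"""
    keys = per_turn_scores[0].keys()
    n = len(per_turn_scores)
    return {k: sum(t[k] for t in per_turn_scores) / n for k in keys}
```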
[IR-47] Domain-Specific Query Understanding for Automotive Applications: A Modular and Scalable Approach
【速读】:该论文旨在解决汽车领域中特定查询理解(domain-specific query understanding)的挑战,尤其是在复杂专业术语和多样用户意图背景下,如何实现高效、准确地将自然语言查询分类并提取结构化输入以匹配不同工具(如零件推荐、维修流程或法规查询等)。解决方案的关键在于提出一种两阶段分解式架构:首先进行轻量级分类,随后使用小型专业化提示(prompt)执行针对性实体抽取,从而在响应速度、可靠性与可扩展性之间取得良好平衡。相较单步联合模型,该方法显著提升了效率与准确性,并借助专家审核的高质量数据集(包含人工标注与合成样本)保障了系统性能。
链接: https://arxiv.org/abs/2604.16301
作者: Isha Motiyani,Abhishek Kumar,Tilak Kasturi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures, 10 tables
Abstract:Despite the growing prevalence of large language models (LLMs) in domain-specific applications, the challenge of query understanding in the automotive sector still remains underexplored. This domain presents unique complexities due to its specialized vocabulary and the diverse range of user intents it encompasses. Unlike general-purpose assistants, automotive systems must precisely interpret user queries and route them to appropriate underlying tool, each designed to fulfill a distinct task such as part recommendations, repair procedures, or regulatory lookups. Moreover, these systems must extract structured inputs precisely aligned with the schema required by each tool. In this study, we present a novel two-step system for domain-specific query interpretation in the automotive context that achieves an effective balance between responsiveness, reliability, and scalability. Our initial single-step approach, which jointly performed classification and entity extraction, exhibited moderate performance and higher latency. By decomposing the task into a lightweight classification stage followed by targeted entity extraction using smaller, specialized prompts, our system achieves substantial gains in both efficiency and accuracy. Due to the niche nature of the automotive domain, we also curated a high-quality dataset by combining manually annotated and synthetically generated samples, all reviewed by domain experts. Overall, our findings demonstrate that decomposing query understanding into modular subtasks leads to a scalable, accurate, and latency-efficient solution. This approach establishes a strong ground for practical deployment in real-world automotive query understanding systems.
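该两阶段分解(先轻量分类、再针对性抽取)的流程可用如下假设性草图示意——意图类别、关键词表与抽取字段均为本文虚构,论文实际用 LLM 提示词完成这两步:

```python
import re

# 假设性的意图关键词表,仅作路由示意
INTENT_KEYWORDS = {
    "part_recommendation": ["part", "oem", "compatible"],
    "repair_procedure": ["repair", "replace", "install"],
    "regulatory_lookup": ["regulation", "emission", "inspection"],
}

def classify_intent(query):
    """第一步:轻量级意图分类(此处以关键词计票代替分类提示词)。"""
    q = query.lower()
    scores = {i: sum(k in q for k in kws) for i, kws in INTENT_KEYWORDS.items()}
    return max(scores, key=scores.get)

def extract_entities(query, intent):
    """第二步:只抽取被路由工具 schema 需要的字段(此处仅示意年份)。"""
    entities = {"intent": intent}
    year = re.search(r"\b(19|20)\d{2}\b", query)
    if year:
        entities["year"] = year.group(0)
    return entities
```

分解的好处正如摘要所述:第二阶段的提示词可以按工具专门化、规模更小,从而同时降低延迟并提升抽取准确率。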
人机交互
[HC-0] Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
【速读】:该论文旨在解决生成式 AI(Generative AI)在系统动力学(System Dynamics)辅助任务中的性能评估与部署优化问题,特别是在本地部署模型与云端商业 API 之间的能力对比及影响因素分析。其关键解决方案在于构建了两个专用基准测试:CLD Leaderboard(用于结构化因果回路图提取)和 Discussion Leaderboard(涵盖模型讨论、反馈解释与建模指导),并通过系统性参数扫描(包括模型类型、后端架构 GGUF vs. MLX、量化级别 Q3/Q4_K_M/MLX-3bit 等)对 671B–123B 参数规模的本地模型进行全面评测,发现后端选择比量化程度更具实践影响力,并提供了针对 Apple Silicon 平台运行大模型的完整调参指南与性能清理数据,从而为本地部署生成式 AI 提供可复现、可优化的技术路径。
链接: https://arxiv.org/abs/2604.18566
作者: Terry Leitch
机构: ruxton.ai
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:We present a systematic evaluation of large language model families – spanning both proprietary cloud APIs and locally-hosted open-source models – on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77–89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50–100% on model building steps and 47–75% on feedback explanation, but only 0–50% on error fixing – a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of model type effects on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (this http URL) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while this http URL grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep (t, p, k) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B–123B parameter models on Apple Silicon.
[HC-1] Fast and Forgettable: A Controlled Study of Novices' Performance, Learning, Workload, and Emotion in AI-Assisted and Human Pair Programming Paradigms
【速读】:该论文旨在解决生成式 AI(Generative AI)在编程教育中快速普及背景下,其对学习效果和学生体验的潜在影响尚不明确的问题。具体而言,研究聚焦于AI工具(如GitHub Copilot)是否能替代传统人机协作的配对编程(pair programming),以及这种替代是否会削弱学生的学业成就与情感投入。解决方案的关键在于设计一项受控实验,通过对比22名参与者在限时情境下与人类伙伴或AI助手协同编程的表现,结合主观工作负荷、情绪量表及客观绩效与再测成绩等多维指标进行量化分析。结果显示,虽然使用GitHub Copilot显著提升了即时编程表现并降低了认知负荷,但其在长期学习保留方面存在劣势,且未能提供与人类搭档相当的情感激励作用,从而揭示了AI辅助编程不能完全取代人机协作的教学价值。
链接: https://arxiv.org/abs/2604.18538
作者: Nicholas Gardella,James Prather,Juho Leinonen,Paul Denny,Raymond Pettit,Sara L. Riggs
机构: University of Virginia (弗吉尼亚大学); Abilene Christian University (阿比林基督教学院); Aalto University (阿尔托大学); University of Auckland (奥克兰大学)
类目: Human-Computer Interaction (cs.HC)
备注: for online appendices, see this https URL
Abstract:Code-generating Artificial Intelligence has gained popularity within both professional and educational programming settings over the past several years. While research and pedagogy are beginning to cope with this change, computing students are left to bear the unforeseen consequences of AI amidst a dearth of empirical evidence about its effects. Though pair programming between students is well studied and known to be beneficial to self-efficacy and academic achievement, it remains underutilized and further threatened by the proposition that AI can replace a human programming partner. In this paper, we present a controlled pair programming study with 22 participants who wrote Python code under time pressure in teams of two and individually with GitHub Copilot for 20 minutes each. They were incentivized by bonus compensation to balance performance with understanding and were retested individually on the programming tasks after a retention interval of one week. Subjective measures of workload and emotion as well as objective measures of performance and learning (retest performance) were collected. Results showed that participants performed significantly better with GitHub Copilot than their human teammate, and several dimensions of their workload were significantly reduced. However, the emotional effect of the human teammate was significantly more positive and arousing as compared to working with Copilot. Furthermore, there was a nonsignificant absolute retest performance reduction in the AI condition and a larger retest performance decrement in the AI condition. We recommend that educators strongly consider revisiting pair programming as an educational tool in addition to embracing modern AI.
[HC-2] From Awareness to Intent: Mitigating Silent Driving System Failures through Prospective Situation Awareness Enhancing Interfaces
【速读】:该论文旨在解决部分自动化车辆中“静默自动化故障”(silent automation failures)所带来的安全挑战,即系统在未发出警告的情况下未能检测到危险,导致驾驶员无法及时接管。针对这一问题,研究提出通过增强前瞻性情境意识(Prospective Situation Awareness Enhancement, PSAE)的界面设计来改善驾驶员在故障场景下的接管表现。其解决方案的关键在于利用增强现实抬头显示器(augmented reality head-up display)提供不同类型的提示信息,包括感知线索(perceptual cues)和系统意图传达(system intent communication),并发现感知线索最有效提升情境意识(Situation Awareness, SA),而系统意图传达则更有利于建立信任;同时,研究还识别出神经活动可能与情境意识存在潜在关联,为未来人机交互(HMI)设计提供了基于透明度的优化方向。
链接: https://arxiv.org/abs/2604.18449
作者: Jiyao Wang,Song Yan,Xiao Yang,Qihang He,Chenglin Liu,Ange Wang,Chenglin Chen,Zhenyu Wang,Dengbo He
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); National University of Singapore(新加坡国立大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted by CHI2026
Abstract:Silent automation failures, where a system fails to detect a hazard without warning, pose a critical safety challenge for partially automated vehicles. While research has mostly focused on takeover requests, how to support a driver in silent failure remains underexplored. We conducted a multi-modal driving simulator study with 48 participants to investigate how different Prospective Situation Awareness Enhancement (PSAE) interfaces, delivered via augmented reality head-up display, affect takeover performance. By integrating behavioral, subjective psychological, and physiological data, our analysis suggests that situational awareness (SA) serves as an important moderating factor through which PSAE interfaces improve takeover performance. Further, we found that providing perceptual cues was most effective in enhancing SA, while communicating system intent was superior for building trust. Finally, we identified a potential correlate of SA in the neuroactivity. Overall, this paper contributes to understanding how transparency-oriented interfaces may support drivers and provides design insights into HMI design for silent failures.
[HC-3] Circadian Phase Locking of Epilepsy Seizures in Wearable Data: A Single-Patient Case Study
【速读】:该论文旨在解决癫痫患者发作(seizure)预测的不确定性问题,其核心挑战在于如何从连续采集的生理信号中提取与癫痫发作具有统计显著关联的生理节律特征。解决方案的关键在于引入基于可穿戴设备获取的心搏间期(inter-beat interval, IBI)数据,通过带通滤波和希尔伯特变换估计昼夜节律相位,并利用圆统计方法检验癫痫发作是否在特定生理相位集中发生。研究发现,癫痫发作在昼夜节律相位上呈现显著集中趋势,而多日节律则未表现出一致的相位聚集;此外,初步逻辑回归模型表明,相比单纯依赖时钟时间特征,考虑生理相位能捕捉到更精细的结构信息。这为将连续可穿戴传感数据与稀疏临床事件建立可解释的映射关系提供了新路径,并有望增强现有癫痫发作预测系统的能力。
链接: https://arxiv.org/abs/2604.18297
作者: Berenika Ewart-James,Matthew Wragg,Nawid Keshtmand,Amberly Brigden,Paul Marshall,Raul Santos-Rodriguez
机构: University of Bristol(布里斯托大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Epilepsy is a common, chronic neurological disorder characterized by recurrent seizures caused by sudden bursts of abnormal electrical activity in the brain. Seizures can often be unpredictable, leading to uncertainty and anxiety for people with epilepsy. To address this problem, the Epilepsy UK Priority Setting Partnership identified research into seizure forecasting technology as a priority. Seizure onsets are recorded as discrete events embedded within continuously sampled physiological signals that exhibit strong circadian and multi-day rhythms. Standard modelling approaches often treat time as linear or rely on clock-time features, which may not explicitly capture the underlying physiological phase. In this paper, we examine whether seizure onsets exhibit phase preference relative to circadian rhythms derived from wearable inter-beat interval (IBI) data. As a proof-of-concept, using 176 days wearable and seizure diary data from a single patient, we extract oscillatory components via band-limited filtering and Hilbert-based phase estimation, and test for non-uniform seizure-phase alignment using circular statistics. We observe significant circadian phase concentration, while multiday bands do not show consistent or statistically significant phase clustering in this dataset. Exploratory logistic baselines indicate modest but detectable structure beyond simple clock-time effects. We argue that explicit physiological phase representations provide an interpretable bridge between continuous wearable sensing and sparse clinical events and may augment existing seizure forecasting pipelines. We discuss implications for multi-scale modelling, patient-facing interfaces, and future multi-patient validation
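论文用圆统计检验发作相位是否非均匀集中,其中常用的 Rayleigh 检验可如下示意(采用 Zar 的小样本 p 值近似,未必与论文实现一致):

```python
import math

def rayleigh_test(phases):
    """对相位数据(弧度)做 Rayleigh 非均匀性检验,返回 (R, p)。"""
    n = len(phases)
    c = sum(math.cos(p) for p in phases) / n
    s = sum(math.sin(p) for p in phases) / n
    r = math.hypot(c, s)          # 平均合成向量长度 R:越接近 1 表示相位越集中
    z = n * r * r
    # Zar (1999) 的小样本近似;数值上截断到 [0, 1]
    p = math.exp(-z) * (1 + (2 * z - z * z) / (4 * n)
                        - (24 * z - 132 * z**2 + 76 * z**3 - 9 * z**4) / (288 * n * n))
    return r, min(max(p, 0.0), 1.0)
```

显著的小 p 值即对应摘要所说的"昼夜节律相位显著集中",而多日节律带未通过该检验。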
[HC-4] EEG-Based Emergency Braking Intensity Prediction Using Blind Source Separation
【速读】:该论文旨在解决脑电图(Electroencephalography, EEG)信号在长期制动强度预测中因多种伪迹干扰而导致可靠性受限的问题。其解决方案的关键在于将EEG信号建模为独立盲源的混合,并通过独立成分分析(Independent Component Analysis, ICA)分解出与制动行为强相关的神经成分;进一步结合时频分析与皮尔逊相关性筛选制动相关成分,并利用层次聚类将其分为两类具有不同空间模式的组别,这些成分表现出试次不变的时间模式和稳定的神经特征;最终基于这些成分的功率特征及历史制动数据,在200 ms前瞻期内实现高精度制动强度预测。
链接: https://arxiv.org/abs/2604.18220
作者: Zikun Zhou,Wenshuo Wang,Wenzhuo Liu,Hui Yao,Chaopeng Zhang,Yichen Liu,Xiaonan Yang,Junqiang Xi
机构: Beijing Institute of Technology (北京理工大学)
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Electroencephalography (EEG) signals have been promising for long-term braking intensity prediction but are prone to various artifacts that limit their reliability. Here, we propose a novel framework that models EEG signals as mixtures of independent blind sources and identifies those strongly correlated with braking action. Our method employs independent component analysis to decompose EEG into different components and combines time-frequency analysis with Pearson correlations to select braking-related components. Furthermore, we utilize hierarchical clustering to group braking-related components into two clusters, each characterized by a distinct spatial pattern. Additionally, these components exhibit trial-invariant temporal patterns and demonstrate stable and common neural signatures of the emergency braking process. Using power features from these components and historical braking data, we predict braking intensity at a 200 ms horizon. Evaluations on the open source dataset (O.D.) and human-in-the-loop simulation (H.S.) show that our method outperforms state-of-the-art approaches, achieving RMSE reductions of 8.0% (O.D.) and 23.8% (H.S.).
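按皮尔逊相关性筛选制动相关 ICA 成分这一步,可用如下最小草图示意(筛选阈值与函数名为本文假设,非论文给定):

```python
import math

def pearson_r(x, y):
    """皮尔逊相关系数,用于衡量单个 ICA 成分与制动信号的相关性。"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_components(components, braking, threshold=0.5):
    """保留 |r| 超过阈值的成分下标(阈值为假设值)。"""
    return [i for i, c in enumerate(components)
            if abs(pearson_r(c, braking)) >= threshold]
```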
[HC-5] Continuous Focus Groups: A Longitudinal Method for Clinical HRI in Autism Care
【速读】:该论文旨在解决当前人机交互(Human-Robot Interaction, HRI)研究中定性方法多采用静态或一次性形式,难以捕捉利益相关者观点随时间演变的问题,尤其在临床场景下,家庭和患者因负担沉重而难以参与重复研究互动。解决方案的关键在于引入“连续焦点小组”(continuous focus groups),这是一种纵向且共生成的质性研究方法,通过在机器人辅助治疗协议的不同阶段组织三轮焦点小组,使参与者能够持续回顾并修正早期观点,从而促进信任建立、整合临床隐性知识至设计决策,并作为伦理保障机制允许参与者重新协商参与意愿与揭示新关切。该方法实现了治疗迭代与研究设计迭代的衔接,在实践可行性和方法严谨性之间取得平衡,具有跨领域适用性,尤其适用于用户直接参与受限且连续性至关重要的敏感场景。
链接: https://arxiv.org/abs/2604.18197
作者: Ghiglino Davide,Foglino Caterina,Wykowska Agnieszka
机构: 未知
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
Abstract:Qualitative methods are important to use alongside quantitative methods to improve Human-Robot Interaction (HRI), yet they are often applied in static or one-off formats that cannot capture how stakeholder perspectives evolve over time. This limitation is especially evident in clinical contexts, where families and patients face heavy burdens and cannot easily participate in repeated research encounters. To address this gap, we introduce continuous focus groups, a longitudinal and co-agential method designed to sustain dialogue with assistive care professionals working with children with autism spectrum disorder (ASD). Three focus groups were organized across successive phases of a robot-assisted therapeutic protocol, enabling participants to revisit and refine earlier views as the intervention progressed. Results show that continuity fostered trust, supported the integration of tacit clinical expertise into design decisions, and functioned as an ethical safeguard by allowing participants to renegotiate involvement and surface new concerns. By bridging the therapeutic iteration of families, children, and clinicians with the research-design iteration of researchers and developers, continuous focus groups provide a methodological contribution that is both feasible in practice and rigorous in design. Beyond autism care, this approach offers a transferable framework for advancing qualitative research in HRI, particularly in sensitive domains where direct user participation is limited and continuity is essential.
[HC-6] How Do People Accept Robot in Public Space? A Cross-Cultural Study in Germany and Japan
【速读】:该论文旨在解决公共空间中自主清洁机器人与偶然共存人员(InCoPs)之间互动的接受度问题,尤其是从跨文化视角探讨不同文化背景下人们对机器人存在的接受差异。其解决方案的关键在于识别并比较日本与德国参与者在机器人接受度(EA)上的核心驱动因素:社会规范(Social Norms)和信任(Trust)是跨文化最强正向预测因子;德国群体呈现“功能-情感”模式,即实用性(Usefulness)、兴趣(Interest)提升接受度,愤怒(Anger)抑制接受度;而日本群体则呈现“信任-情绪”模式,即信任、惊讶(Surprise)和恐惧(Fear)直接影响接受度。这一发现强调了在机器人设计中融入文化适应性的重要性。
链接: https://arxiv.org/abs/2604.18193
作者: Zhe Zeng,Clara Ayumi Fechner,Fei Yan,Hailong Liu
机构: Ulm University (乌尔姆大学); Nara Institute of Science and Technology (奈良先端科学技术大学院大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:With the increasing deployment of robots in public spaces, encounters between robots and incidentally copresent persons (InCoPs) are becoming more frequent. However, InCoPs remain largely underexplored in the literature, particularly from a cross-cultural perspective. Therefore, the present study investigates cultural differences in InCoPs’ existence acceptance (EA) of autonomous cleaning robots in public spaces among Japanese and German participants. Online survey results revealed that Germans showed significantly higher EA. Social Norms and Trust were the strongest positive EA predictors across cultures. More specifically, for Germans, EA was directly influenced by Usefulness, Interest and Anger, showing a functional-affective pattern where functional perceptions boost EA and anger suppresses it. For Japanese participants, Trust, Surprise and Fear were the direct associational factors, forming a trust-emotion pattern. These findings reveal cultural influences on cognitive and emotional drivers of public robot acceptance, emphasizing the need for culturally adaptive robot design.
[HC-7] Alleviating Linguistic and Interactional Anxiety of Non-Native Speakers in Multilingual Communication
【速读】:该论文旨在解决非母语者(Non-native Speakers, NNSs)在多语言交流中因语言能力不足和沟通动态不确定性所引发的表达焦虑问题。现有方法虽能提升NNS的理解与参与度,但缺乏对实时口语支持的有效机制。其解决方案的关键在于引入一个具备实时翻译功能的AI工具,通过构建非母语者与母语者(Native Speakers, NSs)之间的双向理解通道,缓解交互焦虑并增强NNS的口语自信心。实验结果表明,该工具显著提升了低语言水平NNS的说话自我效能感、降低了工作负荷,并强化了双方的共情与责任意识,为未来实时辅助型人机交互设计提供了可操作的实践路径。
链接: https://arxiv.org/abs/2604.18171
作者: Peinuan Qin,Justin Peng,Zhengtao Xu,Jiting Cheng,Zicheng Zhu,Naomi Yamashita,Yi-Chieh Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CSCW 2026. This arXiv version is the authors’ accepted manuscript
Abstract:Non-native speakers (NNSs) frequently encounter speaking difficulties in multilingual communication, where existing approaches have shown promise in facilitating NNSs’ comprehension and participation in real-time communication. However, they often overlook providing direct speaking support, where anxiety stemming from linguistic inadequacy and uncertain communication dynamics are core issues. To address this, we introduce an AI tool with translation for real-time speaking support. It also builds a channel for mutual understanding with native speakers (NSs) to mitigate interactional anxiety. Through a within-subjects experiment involving 25 NNS-NS pairs (N = 50) on collaborative tasks, our findings suggest that the tool improved NNSs’ speaking self-efficacy, reduced their interactional anxiety, and decreased their workload, particularly for NNSs with below-average language proficiency. Furthermore, NNSs reported a significant sense of support from their NS partners via the mutual understanding channel, and NSs also clearly perceived the NNSs’ need for assistance and displayed a strong sense of communicative responsibility. This research underscores the potential of AI support in real-time NNS communication and the importance of promoting mutual understanding, culminating in actionable design insights for future work.
[HC-8] Leveraging AI for Direct Bystander Intervention Against Cyberbullying
【速读】:该论文旨在解决网络欺凌(cyberbullying)情境中旁观者难以实施直接干预的问题。研究表明,干预技能不足和自我效能感低下是阻碍旁观者采取行动的主要因素。为应对这一挑战,作者提出了一种名为EmojiGen的生成式AI干预工具,其核心创新在于通过用户选择表情符号作为意图线索,结合网络欺凌的具体语境自动生成适切的回应策略,从而降低干预门槛。实验结果表明,EmojiGen显著提升了旁观者支持受害者和对抗施害者的直接干预频率,并增强了干预意愿与自我效能感,同时减少了心理负担与焦虑。该方案的关键在于利用轻量化的交互方式(emoji选择)与AI生成能力,将复杂干预行为简化为可操作步骤,实现了对旁观者干预动机与能力的双重赋能。
链接: https://arxiv.org/abs/2604.18153
作者: Peinuan Qin,Jiting Cheng,Jungup Lee,Junti Zhang,Zhixing Liu,Yi-Chieh Lee
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CSCW 2026. This arXiv version is the authors’ accepted manuscript
Abstract:Cyberbullying is a pervasive problem in online environments, causing substantial psychological harm to victims. Although bystander intervention has proven effective in mitigating its impact, motivating bystanders to engage in direct intervention remains a persistent challenge. Studies have suggested that difficulties in intervention skills and defending self-efficacy hinder bystanders from initiating direct intervention. To address this challenge, we introduced EmojiGen, an AI intervention tool designed to empower bystanders for direct intervention. EmojiGen enabled users to simply select an emoji as an intention clue, which subsequently combined the cyberbullying context to generate responses. In a between-subjects experiment involving 90 participants on a custom-built social media platform, we found that EmojiGen significantly increased the frequency of direct bystander interventions, both in supporting victims and in confronting perpetrators, driven by different factors. EmojiGen also increased the sense of knowing how to help and defending self-efficacy, while reducing perceived workload and anxiety associated with initiating intervention. The study contributed to the CSCW community through offering an effective direct bystander intervention method and providing design implications for future cyberbullying interventions.
[HC-9] Enabling Sensitive Conversations with Consent Boundaries: Moa, a Platform for Discussing PhD Advising Relationships
【速读】:该论文旨在解决弱势个体在面对权力不对等关系(如导师或上级)时,难以有效识别和获取支持性盟友(ally)的问题。由于潜在盟友可能因利益冲突或情感回避而无法提供帮助,甚至可能加剧伤害,传统社交网络难以满足此类敏感情境下的协作需求。解决方案的关键在于提出一个名为“Moa”的社交媒体平台,其核心创新是引入“同意边界”(consent boundaries)机制——允许用户基于共同社会身份或生活经历灵活设定每条内容的受众范围,同时保持发送者与接收者的匿名性,从而在不暴露身份的前提下精准触达可能的支持者。这一设计通过强化用户对信息传播边界的控制权,显著提升了敏感话题讨论的安全性和有效性。
链接: https://arxiv.org/abs/2604.18121
作者: Jane Im,Kentaro Toyama
机构: CISPA Helmholtz Center for Information Security (德国CISPA亥姆霍兹信息安全中心); University of Michigan School of Information (美国密歇根大学信息学院)
类目: Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: Accepted to ACM CSCW
Abstract:When an individual is harmed by someone in power, such as a workplace manager, it can help to identify allies–people who would offer sympathy, advice, or supportive action. However, ally discovery is fraught because the very people who might be most relevant–e.g., someone who reports to the same manager–might not be sympathetic and could potentially exacerbate the harm. We examine this problem in the specific context of PhD students navigating advising challenges and present a social media platform called “Moa” that brings together a number of features that we believe facilitate ally discovery. Moa’s most novel element is an audience selection process that uses what we call consent boundaries, which allow users to flexibly define each post or comment’s audience based on factors such as common social identity or lived experience, all while preserving anonymity–neither senders nor recipients learn each other’s identities, even as the post reaches the right audience. A 3-week field study with 47 real-world users showed that the features in combination facilitated sensitive conversations about advising, with 22.6% of users using consent boundaries. We discuss both our overall “recipe” for systems for ally discovery and the benefits of a consent-centered approach to design.
[HC-10] HolmeSketcher: Generative 3D Sketch Mapping for Spatial Reconstruction in Crime Scene Investigation
【速读】:该论文旨在解决犯罪现场调查(CSI)中传统2D草图绘图方法在表达三维空间关系方面的局限性问题。其解决方案的关键在于提出了一种名为HolmeSketcher的生成式3D草图绘图系统,该系统结合前端3D绘图界面与后端深度学习流水线,支持对象生成与扩展现实(Extended Reality, XR)环境下的场景重建,从而提升重构场景的空间准确性和可解释性。
链接: https://arxiv.org/abs/2604.18039
作者: Tianyi Xiao,Yizi Chen,Sidi Wu,Peter Kiefer,Yan Feng,Martin Raubal
机构: ETH Zürich(苏黎世联邦理工学院); TU Delft(代尔夫特理工大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Sketch mapping is widely used in crime scene investigation (CSI) to document, interpret, and communicate spatial information. However, it is typically performed on 2D media, which limits its ability to represent 3D spatial relationships. We present HolmeSketcher, a generative 3D sketch mapping system that combines a front-end 3D drawing interface with a back-end deep learning pipeline to support object generation and scene reconstruction in extended reality. In a within-subject user study (N = 15), HolmeSketcher improved the spatial accuracy and interpretability of reconstructed scenes, but with a clear trade-off of higher task load and lower usability compared with paper-based 2D sketch mapping. By integrating findings from the user study and expert interviews (N = 3), we further derive three design implications for next-generation 3D sketch mapping tools for CSI.
[HC-11] Empowering Vocabulary Learning Through Teaching AI: Using LLMs as a Student to Perform Learning by Teaching in Vocabulary Acquisition
【速读】:该论文旨在解决现有“教学式学习(Learning by Teaching, LbT)”系统中问题生成方法依赖固定模板、开发成本高且缺乏灵活性的问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)生成动态、上下文相关的提问内容,从而降低系统构建成本并提升问题的相关性与适应性,进而增强学习者的记忆保留效果。
链接: https://arxiv.org/abs/2604.17893
作者: Tokio Uchida,Ko Watanabe,Andrew Vargo,Shoya Ishimaru,Ralph L. Rose,Ayaka Sugawara,Andreas Dengel,Koichi Kise
机构: Osaka Metropolitan University (大阪公立大学); DFKI GmbH (德国人工智能研究中心); Waseda University (早稻田大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:“Learning by Teaching (LbT)” helps learners deepen their understanding by explaining concepts to others, with questions playing a vital role in identifying knowledge gaps and reinforcing comprehension. However, existing systems for generating such questions often rely on rigid templates and are expensive to build. To overcome these limitations, we developed a system using Large Language Models (LLMs) to create dynamic, contextually relevant questions for LbT. In our English vocabulary learning study, we examined which learner characteristics best leverage the system’s benefits. Our results showed improved memory retention over traditional methods at three and seven days of testing, with ten participants. Additionally, we identified traits linked to better learning outcomes, highlighting the potential for tailored approaches. These findings support the development of scalable, cost-effective solutions to enhance LbT methods across various fields.
[HC-12] Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer
【速读】:该论文旨在解决当前生成式 AI 辅助软件开发中存在的“控制失效”问题,即虽然生成式 AI 能快速产出可执行代码(code),但缺乏对系统结构约束、依赖关系及决策依据的显式记录,导致代码变更难以追溯、回归问题难以定位,且系统在演化过程中变得不透明和脆弱。其解决方案的关键在于提出“代理共识”(Agentic Consensus)范式,以一个可操作的世界模型(operable world model)作为核心构件 C,该模型以类型化的属性图(typed property graph)形式表达,取代传统代码成为工程的主要产物;通过同步算子 Φ(realize)与 Ψ(rehydrate)保持代码与共识层 C 的一致性,并使所有设计承诺可审计、未明确规范部分以可测量的共识熵(consensus entropy)形式显式呈现,从而将评估标准从单一代码正确性扩展至对齐保真度(alignment fidelity)、共识熵和干预距离等维度。
链接: https://arxiv.org/abs/2604.17883
作者: Tianfu Wang,Zhezheng Hao,Yin Wu,Wei Wu,Qiang Lin,Hande Dong,Nicholas Jing Yuan,Hui Xiong
机构: HKUST (GZ)(香港科技大学(广州)); ZJU(浙江大学); USTC(中国科学技术大学); Tencent(腾讯)
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Vibe coding produces correct, executable code at speed, but leaves no record of the structural commitments, dependencies, or evidence behind it. Reviewers cannot determine what invariants were assumed, what changed, or why a regression occurred. This is not a generation failure but a control failure: the dominant artifact of AI-assisted development (code plus chat history) performs dimension collapse, flattening complex system topology into low-dimensional text and making systems opaque and fragile under change. We propose Agentic Consensus: a paradigm in which the consensus layer C, an operable world model represented as a typed property graph, replaces code as the primary artifact of engineering. Executable artifacts are derived from C and kept in correspondence via synchronization operators Phi (realize) and Psi (rehydrate). Evidence links directly to structural claims in C, making every commitment auditable and under-specification explicit as measurable consensus entropy rather than a silent guess. Evaluation must move beyond code correctness toward alignment fidelity, consensus entropy, and intervention distance. We propose benchmark task families designed to measure whether consensus-based workflows reduce human intervention compared to chat-driven baselines.
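摘要中提到以"类型化属性图"(typed property graph)作为共识层 C,并将未明确规范的部分量化为可测量的共识熵(consensus entropy)。下面用一个极简的 Python 草图示意这一思路;其中的类名、属性表示方式以及熵的具体定义均为本文为说明而虚构,并非论文的实际数据结构或 API:

```python
import math

# 极简的"类型化属性图"共识层示意:节点带类型与属性,边带关系类型。
# 用"候选取值集合"表示尚未达成共识的属性,并按候选数的均匀分布
# 计算 Shannon 熵,作为"共识熵"的一种玩具化度量(纯示意)。
class ConsensusGraph:
    def __init__(self):
        self.nodes = {}   # id -> {"type": str, "props": dict}
        self.edges = []   # (src, relation, dst)

    def add_node(self, nid, ntype, **props):
        self.nodes[nid] = {"type": ntype, "props": props}

    def add_edge(self, src, rel, dst):
        self.edges.append((src, rel, dst))

    def consensus_entropy(self):
        h = 0.0
        for node in self.nodes.values():
            for v in node["props"].values():
                if isinstance(v, set) and len(v) > 1:
                    h += math.log2(len(v))  # 候选越多,未决程度越高
        return h

g = ConsensusGraph()
g.add_node("svc", "Service", timeout={"30s", "60s"})  # 未定约束:熵 > 0
g.add_node("db", "Database", engine="postgres")       # 已定约束:熵 = 0
g.add_edge("svc", "depends_on", "db")
print(g.consensus_entropy())  # 候选数为 2 => log2(2) = 1.0
```

在这种表示下,"降低共识熵"对应于把候选集合收敛为单一取值,即把隐含假设显式固化为可审计的结构承诺。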
[HC-13] Design and Evaluation of a Culturally Adapted Multimodal Virtual Agent for PTSD Screening
【速读】:该论文旨在解决战斗暴露的军人中创伤后应激障碍(Post-traumatic stress disorder, PTSD)高发但长期存在报告不足的问题。解决方案的关键在于开发并验证Molhim平台,这是一个文化适配的多模态对话式人工智能系统,通过可配置的对话流程实现特定目的的交互,包括会话初始化、与高保真虚拟人像的实时对话以及会话后的分析与反馈。该平台整合了大语言模型驱动的虚拟形象、实时语音识别、用户输入的视觉理解、文本转语音合成等功能,支持结构化的多轮对话,并能自动执行《DSM-5 PTSD检查表》(PCL-5)评估,从而为临床环境中的PTSD筛查提供可行且可扩展的技术路径。
链接: https://arxiv.org/abs/2604.17871
作者: Cengiz Ozel,Waleed Nadeem,Samuel Potter,Yahya Bokhari,Bdour Alwuqaysi,Wejdan Alotaibi,Rahaf Fahad Alnufaie,Sabri Boughorbel,Abdulrhman Aljouie,Rakan Altasan,Ehsan Hoque
机构: Ministry of Defense(国防部); Prince Sultan Military Medical City(王子苏丹军事医疗中心)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Post-traumatic stress disorder (PTSD) is highly prevalent yet chronically underreported among combat-exposed military personnel. This paper presents Molhim, a culturally adapted multimodal conversational AI platform that supports purpose-specific interactions through a configurable conversational pipeline consisting of session setup, real-time dialogue with a high-fidelity virtual avatar, and post-session analysis and feedback. In this work, we examine the PTSD screening configuration of the Molhim platform in a military healthcare context. The system employs a conversational avatar driven by a large language model, integrating real-time speech recognition, visual understanding of user input, text-to-speech synthesis, and a high-fidelity human avatar to support structured multi-turn dialogue and automated post-session analysis, including administration of the PTSD Checklist for DSM-5 (PCL-5). These findings suggest the feasibility of Molhim as a conversational platform for PTSD screening and highlight design considerations for socially cooperative human-AI systems in clinical environments.
[HC-14] Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
【速读】:该论文旨在解决通用大语言模型(Large Language Models, LLMs)在发展与政策领域中因缺乏认知谦逊(epistemic humility)而导致的信息误导风险,即模型在生成内容时未能提供可验证的证据支持。解决方案的关键在于构建AVA(AI + Verified Analysis)平台,其核心机制包括:一是引用可验证性(citation verifiability),通过将主张追溯至原始来源确保输出可信;二是合理拒答(reasoned abstention),对无法支持的问题明确拒绝并提供解释与引导,从而划定知识边界。该设计使AVA成为具备“生态意识”的谦逊人工智能(Humble AI)范例,实证表明其显著提升了用户效率并增强了信任感。
链接: https://arxiv.org/abs/2604.17843
作者: Nimisha Karnatak,Mohamad Chatila,Daniel Alejandro Pinzón Hernández,Reza Yazdanfar,Michelle Dugas,Renos Vakis
机构: University of Oxford (牛津大学); The World Bank Group (世界银行集团); Nouswise, Inc. (Nouswise公司)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at ACM CHI’26
Abstract:General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs. We present AVA (AI + Verified Analysis), a GenAI platform built on a curated library of over 4,000 World Bank Reports with multilingual capabilities. AVA’s multi-agent pipeline enables users to query and receive evidence-based syntheses. It operationalizes epistemic humility through two mechanisms: citation verifiability (tracing claims to sources) and reasoned abstention (declining unsupported queries with justification and redirection). We conducted an in-the-wild evaluation with over 2,200 individuals from heterogeneous organisations and roles in 116 countries, via log analysis, surveys, and 20 interviews. Difference-in-Differences estimates associate sustained engagement with 2.4-3.9 hours saved weekly. Qualitatively, participants used AVA as a specialized “evidence engine”; reasoned abstention clarified scope boundaries, and trust was calibrated through institutional provenance and page-anchored citations. We contribute design guidelines for specialized AI and articulate a vision for “ecosystem-aware” Humble AI.
[HC-15] Navigating the Conceptual Multiverse
【速读】:该论文旨在解决语言模型在回答开放性问题时缺乏透明度和可解释性的问题,即模型隐式做出的决策(如问题框架选择、价值判断等)往往未被用户察觉,导致输出结果缺乏上下文支撑,难以形成对问题的系统性理解。解决方案的关键在于构建“概念多宇宙”(conceptual multiverse)——一个交互式系统,允许用户透明地检视、干预并验证与问题相关的概念性决策,例如如何定义问题边界或评估不同答案的价值;为确保该结构的有效性和可靠性,作者进一步提出一套通用验证框架,通过专家级领域推理校准关键决策属性(如无歧义性、完备性),从而引导用户建立对问题的“工作地图”。实证表明,在哲学、对齐标注和诗歌创作三个领域中,该方法显著提升了用户的问题建模能力与认知深度。
链接: https://arxiv.org/abs/2604.17815
作者: Andre Ye,Jenny Y. Huang,Alicia Guo,Rose Novick,Tamara Broderick,Mitchell L. Gordon
机构: MIT EECS; UW Allen School of CSE; UW Department of Philosophy
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:When language models answer open-ended problems, they implicitly make hidden decisions that shape their outputs, leaving users with uncontextualized answers rather than a working map of the problem; drawing on multiverse analysis from statistics, we build and evaluate the conceptual multiverse, an interactive system that represents conceptual decisions such as how to frame a question or what to value as a space users can transparently inspect, intervenably change, and check against principled domain reasoning; for this structure to be worth navigating rather than misleading, it must be rigorous and checkable against domain reasoning norms, so we develop a general verification framework that enforces properties of good decision structures like unambiguity and completeness calibrated by expert-level reasoning; across three domains, the conceptual multiverse helped participants develop a working map of the problem, with philosophy students rewriting essays with sharper framings and reversed theses, alignment annotators moving from surface preferences to reasoning about user intent and harm, and poets identifying compositional patterns that clarified their taste.
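摘要中借鉴的"多宇宙分析"(multiverse analysis)源自统计学,其核心做法是把每个分析或概念决策的候选选项列出,再枚举其笛卡尔积,得到全部"决策路径"。下面给出一个与论文系统实现无关的极简示意,决策名称与选项均为本文虚构:

```python
from itertools import product

# 每个概念决策(如何框定问题、看重什么价值、用什么度量)
# 对应一组候选选项;全部组合构成"概念多宇宙"。
decisions = {
    "framing":   ["utilitarian", "deontological"],
    "exclusion": ["none", "drop_outliers"],
    "metric":    ["mean", "median", "trimmed_mean"],
}

# 笛卡尔积:2 * 2 * 3 = 12 条决策路径
universes = [dict(zip(decisions, combo))
             for combo in product(*decisions.values())]

print(len(universes))   # 12
print(universes[0])     # 每条路径是一组完整的决策取值
```

交互式系统要做的,就是让用户在这样的决策空间中检视、切换某一维度的取值,并观察结论如何随之变化。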
[HC-16] Teaching Usable Privacy in HCI Education: Designing, Implementing and Evaluating an Active Learning Graduate Course
【速读】:该论文旨在解决当前高等教育中隐私教育存在的碎片化、理论化程度高且脱离实际应用的问题,尤其是在人机交互(Human-Computer Interaction, HCI)领域中,如何有效培养未来设计者和研究者对可用隐私(Usable Privacy)的理解与实践能力。其解决方案的关键在于设计并实施了一门为期15周的研究生课程,采用以实践为导向的教学方法,整合真实场景案例、结构化角色扮演、基于案例的讨论、客座讲座以及多阶段研究项目,使学生能够从多个利益相关者的视角进行隐私问题的思辨;课程以当代隐私研究和现代隐私框架(Modern Privacy framework)为基础,强调概念理解与应用研究技能的双重提升,并通过混合方法评估验证了教学效果,包括学生参与度提高、隐私设计权衡意识增强及理论与实践联系更紧密等成果。
链接: https://arxiv.org/abs/2604.17796
作者: Sanchari Das,Dhiman Goswami,Michelle Melo,Aditya Johri,Vivian G. Motti
机构: George Mason University (乔治梅森大学)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:As digital systems increasingly rely on pervasive data collection and inference, educating future designers and researchers about Usable Privacy has become a critical need for HCI. However, privacy education in higher education is often fragmented, theory-heavy, or detached from real-world applications. Thus, in this paper, we present the design, implementation, and evaluation of a 15-week graduate-level course on Usable Privacy that addresses this through active, practice-oriented pedagogy. The course integrates use cases, structured role playing, case-based discussions, guest lectures, and a multi-phase research project to support students in reasoning about privacy from multiple stakeholder perspectives. Grounded in contemporary privacy research and the Modern Privacy framework, the curriculum emphasizes both conceptual understanding and applied research skills. We report findings from two course offerings in consecutive years (2024-2025) using a mixed-methods evaluation that combines quantitative teaching evaluations with qualitative analysis of student reflections and instructor observations. Results indicate increased student engagement, improved ability to articulate trade-offs in privacy design, and stronger connections between theory and practice. To support adoption and replication, we also release detailed assignment descriptions and grading rubrics. This work contributes an empirically informed model for teaching Usable Privacy in HCI education and offers actionable guidance for educators seeking to integrate privacy into their curricula.
[HC-17] MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models ACL2026
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在心理健康咨询场景中安全性评估的局限性问题,即现有框架多基于孤立响应和静态数据集进行粗粒度分类,难以捕捉临床危害在多轮交互过程中如何因角色扮演而产生并累积。其解决方案的关键在于提出两个核心组件:一是R-MHSafe,一个基于角色感知的心理健康安全分类体系,能够从AI咨询师的角色(如施害者、煽动者、促进者或促成者)与临床危害类别相结合的角度刻画危害;二是MHSafeEval,一种闭环的、基于代理的评估框架,通过对抗性多轮交互实现轨迹级危害发现,并以角色感知建模为引导,从而显著提升对安全失效模式的覆盖度和诊断精细度。
链接: https://arxiv.org/abs/2604.17730
作者: Suhyun Lee,Palakorn Achananuparp,Neemesh Yadav,Ee-Peng Lim,Yang Deng
机构: Hanyang University (汉阳大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to ACL 2026 Findings
Abstract:Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.
[HC-18] Replay Revise and Refresh: Smartphone-based Refresher Training for Community Healthcare Workers in India
【速读】:该论文旨在解决印度社区卫生工作者(Community Health Workers, CHWs)在传统课堂培训中知识获取效果差、留存率低的问题,尤其是针对其在孕产妇和儿童保健服务中的知识更新需求。解决方案的关键在于引入游戏化学习模式作为知识刷新工具,通过对比三种训练方式——标准课堂培训、实体卡牌游戏和智能手机游戏——发现数字游戏模式在短期内显著提升知识增量(p<0.05),而六个月后的知识保留水平在数字与实体卡牌版本间无显著差异(p=0.4),表明基于智能手机的游戏化干预具有高可扩展性和有效性,是提升CHWs知识掌握与持续应用的可行策略。
链接: https://arxiv.org/abs/2604.17638
作者: Arka Majhi,Aparajita Mondal,Satish B. Agnihotri
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted in HCI International 2024
Abstract:In India, community healthcare workers are the primary touchpoints between the state and the beneficiaries, such as pregnant mothers and children. Their healthcare knowledge directly impacts the quality of care they provide through home visits and community activities. Classroom in-person or traditional ways of training are found ineffective in imparting knowledge and render poor knowledge retention, which needs reinforcement through short, frequent revisions. Smartphone games on healthcare topics could be a promising solution as a refresher, as they can be scaled and tailored as per players’ requirements. This study aims to check the differences in knowledge gain, pre and post-intervention, and, secondly, to check knowledge retention after six months. 270 CHWs or participants were recruited to evaluate different modes of refresher training and assigned into three equal groups of 90 each. The control group (CG) (n=90) was trained using the standard classroom method, which is usually followed. Intervention Group-1 (IG1) (n=90) was trained in a physical card game format, and Intervention Group-2 (IG2) (n=90) was trained in a smartphone game format. 4 sets of questionnaires were made by shuffling 45 questions based on immunization of equal weightage. The questionnaires were filled out by CHWs by hand and collected, evaluated, and analyzed. Paired t-tests were conducted to compare pre-post knowledge increments and repeated-measures ANOVA to check for differences in knowledge retention. Results suggest a significant difference in scores in all three groups. A significant difference was observed between the physical and digital gameplay modes. Pre-post knowledge increment was higher in the digital mode (p<0.05), but knowledge retained was not significantly different (p=0.4) in digital and physical card versions.
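上述摘要中用于比较前后测分数的配对 t 检验,可以用如下纯标准库的 Python 片段示意。其中的分数数据为虚构,仅演示统计量的计算流程,并非论文的原始数据:

```python
import math
from statistics import mean, stdev

def paired_t(pre, post):
    """配对 t 检验统计量:t = mean(d) / (sd(d) / sqrt(n)),d 为前后差值。"""
    d = [b - a for a, b in zip(pre, post)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n))

# 虚构的 10 名受训者前后测分数(非论文数据)
pre  = [20, 22, 19, 25, 18, 21, 23, 20, 24, 19]
post = [26, 27, 24, 30, 22, 25, 29, 24, 30, 23]

t = paired_t(pre, post)
# df = 9 时,双尾 alpha = 0.05 的临界值约为 2.262
print(round(t, 2), t > 2.262)
```

若 |t| 超过对应自由度下的临界值,即可在该显著性水平下判定前后测均值存在显著差异;论文中 (p<0.05) 的结论对应的正是这类比较。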
[HC-19] Developing Models of Procedural Skills using an AI-assisted Text-to-Model Approach
【速读】:该论文旨在解决可扩展的生成式 AI 辅导系统在程序性技能学习中因缺乏结构化知识表示而导致的瓶颈问题,即如何高效构建高质量、结构完整的任务-方法-知识(Task-Method-Knowledge)模型。其解决方案的关键在于提出一种“人在回路”(human-in-the-loop)的文本到模型(text-to-model)管道,利用大语言模型(Large Language Models, LLMs)通过本体约束提示(ontology-constrained prompting)和模板驱动生成(template-based generation),将教学材料自动转化为结构完备的程序性技能模型;同时保留专家对因果转换和失败条件的验证,从而在显著降低专家建模时间(50–70%)的同时确保模型的结构性有效性与可复现性,为大规模部署结构化 AI 教学辅导系统提供了可行路径。
链接: https://arxiv.org/abs/2604.17624
作者: Rahul K. Dass,Shubham Puri,Arpit Khandelwal,Xiao Jin,Ashok K. Goel
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Human-Computer Interaction (cs.HC)
备注: 10 pages. To appear in Proceedings of the 13th ACM Conference on Learning at Scale (L@S '26)
Abstract:Scalable AI tutoring for procedural skill learning requires structured knowledge representations, yet constructing these representations remains a labor-intensive bottleneck. This paper presents a human-in-the-loop text-to-model pipeline that uses large language models to transform instructional materials into schema-complete Task-Method-Knowledge models of procedural skills through ontology-constrained prompting and template-based generation. The approach automates structural scaffolding while preserving expert oversight for validating causal transitions and failure conditions. We apply the pipeline to instructional materials from a graduate-level online AI course, constructing 23 procedural skill models. AI-assisted authoring reduced expert modeling time by 50-70% while producing structurally valid and highly reproducible models under fixed-input conditions. We evaluate structural validity, semantic alignment, reproducibility, and refinement effort to characterize authoring scalability. Results indicate that AI-assisted text-to-model methods can substantially lower the cost of constructing structured procedural representations, making course-wide deployment of structured AI coaching systems practically feasible.
[HC-20] Refresher Training through Quiz App for capacity building of Community Healthcare Workers or Anganwadi Workers in India
【速读】:该论文旨在解决印度长期存在的儿童营养不良问题,特别是由于实施长达四十年的综合儿童发展计划(Integrated Child Development Scheme, ICDS)中基层工作人员——妇幼保健员(Anganwadi Workers, AWWs)能力不足导致的项目效果不佳问题。解决方案的关键在于利用信息技术(Information and Communication Technology, ICT)手段,通过开发一款基于安卓系统的互动测验应用程序(quiz app),以替代传统课堂培训方式,提升AWWs及其督导人员的能力更新效率和培训覆盖面,从而增强ICDS项目的执行效能。
链接: https://arxiv.org/abs/2604.17620
作者: Arka Majhi,Satish B. Agnihotri,Aparajita Mondal
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Accepted in the Asian CHI Symposium 2021
Abstract:High and persistent child malnutrition levels with tardy reduction, seen in successive health surveys, continue to be a matter of concern in India, drawing attention to the need to revamp the four-decade-old Government program, Integrated Child Development Scheme (ICDS). ICDS field functionaries or Anganwadi Workers’ (AWWs) capacity deficit was identified as a significant factor affecting ICDS’s effectiveness. Considering rising numbers, over 1.4 million AWWs, and continuously advancing knowledge of community healthcare, conventional training pedagogy is ineffective in building and updating AWWs and their supervisors’ capacity, which calls for rethinking, using the ICT approach. Over 6 lakh AWWs in India were smartphone equipped by 2020. An android based quiz app was designed, following AWWs training modules’ content and need assessment results. The study investigates the quiz app’s effectiveness and compares it with conventional classroom instruction, with a group of AWWs, and discusses ways to make it an adequate substitute.
[HC-21] WhatIf: Interactive Exploration of LLM-Powered Social Simulations for Policy Reasoning
【速读】:该论文旨在解决政策制定者在应急管理和城市规划等领域的决策问题,这些场景中存在深度不确定性(deep uncertainty),即政策效果高度依赖于大规模人群对信息的解读、协调与随时间演变的采纳行为。传统工具如桌面推演(tabletop exercises)虽支持协作讨论但缺乏动态反馈,而计算模拟则能刻画群体动态却仅适用于离线分析,难以满足实时交互需求。论文提出 WhatIf 系统,其关键在于将大型语言模型(Large Language Models, LLMs)驱动的社会模拟与交互式界面结合,使政策制定者能够实时“引导”(steer)、“检查”(inspect)并“比较”不同情境下的模拟结果。该系统基于四项设计要求构建:流式引导、实时规模性、协作探索和多层级可解释性,并通过三类灾害疏散场景验证,证明其能促进专家进行迭代分支推理、暴露隐含假设、识别新风险,并以可追溯的个体代理案例为基础进行决策推理,从而为深度不确定性下的决策提供更具互动性和共享性的智能支持环境。
链接: https://arxiv.org/abs/2604.17615
作者: Yuxuan Li,Kyzyl Monteiro,Hirokazu Shirado,Sauvik Das
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Policymakers in domains such as emergency management, public health, and urban planning must make decisions under deep uncertainty, where outcomes depend on how large populations interpret information, coordinate, and adopt over time. Existing tools only partially support this process: tabletop exercises enable collaborative discussion but lack dynamic feedback, while computational simulations capture population dynamics but are designed for offline analysis. We present WhatIf, an interactive system that enables policymakers to steer, inspect, and compare LLM-powered social simulations in real time. Informed by a formative study in emergency preparedness planning, we derive four design requirements for interactive policy simulations: fluid steering, real-time scale, collaborative exploration, and multi-level interpretability. We developed WhatIf guided by these requirements and evaluated it with five preparedness professionals across three disaster evacuation scenarios. Our findings show that participants used the system as a space for iterative branching and comparison rather than evaluating fixed plans; reflected on tacit planning assumptions when agent behavior violated expectations; surfaced previously unrecognized planning vulnerabilities; and grounded their reasoning in inspectable agent-level cases rather than aggregate outputs alone. These findings suggest broader design implications for LLM-powered social simulation systems: designing such systems as interactive, shared reasoning environments – rather than offline predictive tools – can better support expert decision-making under deep uncertainty.
[HC-22] Refresher Training through Digital and Physical Card-Based Game for Accredited Social Health Activists (ASHAs) and Anganwadi Workers (AWWs) in India
[Quick Read]: This paper addresses sub-par child immunization rates in parts of India, in particular the knowledge and skill gaps caused by inadequate training methods for community health workers (CHWs). The study finds that conventional training struggles to improve CHWs' competence in child immunization practices. The key to the solution is a gamified refresher-training tool developed in two forms, a physical card game and a smartphone app, that uses game mechanics to strengthen CHWs' understanding and retention of immunization knowledge. Both quantitative analytics and qualitative feedback show that the game-based training significantly improves CHWs' knowledge gain and long-term retention, offering empirical support for scalable, effective digital training in resource-constrained settings.
Link: https://arxiv.org/abs/2604.17604
Authors: Arka Majhi, Aparajita Mondal, Satish B. Agnihotri
Affiliations: Indian Institute of Technology Bombay; Tampere University
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: Accepted at CHI PLAY 2024
Abstract:India’s recent health surveys have highlighted a worrying trend of incomplete child immunization rates across several district clusters in India. Conventional training methods for community healthcare workers (CHWs) in India are inadequate for improving their skills and knowledge. Smartphone games could be a viable and cost-effective method of refresher training specifically targeting immunization practices. A refresher training game was designed both as a physical card-based and digital app-based game, focusing on enhancing CHWs’ knowledge and practices related to child immunization. A quasi-experimental study was conducted with 368 participants. Quantitative gameplay analytics and qualitative feedback from players were collected through interviews. The findings show that game-based refresher training significantly improves CHWs’ knowledge gain and retention in the area of child immunization. The discussion highlights the study’s implications and insights while developing effective digital tools for training CHWs. The research contributes to the growing body of work on digital tools for training CHWs in resource-constrained settings. The study underscores the potential of smartphone games as a scalable and effective method of refresher training for improving child immunization rates.
[HC-23] Real-Time Cellist Postural Evaluation With On-Device Computer Vision
[Quick Read]: This paper addresses posture deterioration among beginning string players (taking the cello as its case) who practice without real-time feedback on body posture, which raises the risk of musculoskeletal injury and undermines technical efficiency. Existing solutions depend on expensive hardware or complex multi-sensor setups and are therefore hard to adopt widely. The key innovation is Cello Evaluator, a real-time postural evaluation system optimized for on-device computer-vision inference: a single current-generation Android phone suffices for efficient, convenient posture monitoring and feedback, filling the postural-guidance gap in individual practice.
Link: https://arxiv.org/abs/2604.17530
Authors: Paolo Wang, Michael Zhang, Shrinand Perumal, Ekaterina Tszyao, Luke Choi, Kexin Sha, Felix Lu, Paige Lorenz, Jackson P. Shields, Sivamurugan Velmurugan, Joshua Kamphuis, William P. Jiang, Gurtej Bagga, Trevor Ju, Raymond Otis Kwon, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Posture is a critical factor for beginning instrumental learners. Most students receive instruction only once a week, and during the intervals between lessons they have little or no feedback on their physical posture. As a result, posture often deteriorates, increasing the risk of musculoskeletal injury and inefficient technique. Recent advances in computer vision and machine learning make it possible to evaluate posture without the constant presence of a human expert. However, current solutions have been extremely limited in availability and convenience due to their reliance on computationally expensive hardware or multi-sensor setups. We present Cello Evaluator, a real-time postural feedback system for practicing cellists. By optimizing for on-device computer vision inference, we make cellist postural evaluation available to anyone with a current-generation Android phone, reducing the postural-feedback gaps in individual practice. To validate our mobile application, we conducted a heuristic evaluation with cellist and UX experts. Overall, the evaluation found the app to be user-friendly and helpful.
[HC-24] Generative AI: Technologies, Techniques, Tensions: A Primer
[Quick Read]: This paper addresses the rapid spread of generative AI through academic, professional, and personal life while most users treat these systems as mysterious black boxes rather than intelligible tools, which leads to misunderstanding and misuse. The core problem is a mismatch between users' traditional expectations of computers (determinism, transparency) and the statistical nature and human-like surface behavior of generative AI. The key to the solution is to decompose generative AI into interacting components (data, models, product features, and user inputs) and to build on this decomposition a conceptual framework that lets researchers draw on established methods from educational and behavioral research to model latent processes, manage uncertainty, and interpret complex human-system interactions. From this perspective, educational researchers are unusually well positioned to foster more responsible, critically informed understanding and use.
Link: https://arxiv.org/abs/2604.17497
Authors: John T. Behrens
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: In-press chapter for this http URL , J. Behrens, D. Robinson (Eds.), The Handbook of Generative AI in Education. Springer. Expected publication date approximately August 2026
Abstract:Generative AI systems have entered everyday academic, professional, and personal life with remarkable speed, yet most users encounter them as mysterious artifacts rather than intelligible systems. This chapter discusses large language models within a broader historical shift in computing paradigms and argues that many of the confusions surrounding their use arise from a mismatch between how these systems are built, how they behave, and how people expect computers to behave writ large. Rather than treating generative AI as a monolithic technology, the chapter decomposes it into interacting components, spanning data, models, product features, and user inputs, each introducing distinct affordances and tensions. Particular attention is given to the statistical and data-based foundations of these systems and to the fact that their surface behavior is explicitly human-like, a combination that places them squarely within the intellectual traditions of educational and behavioral research. From this perspective, educational researchers are unusually well positioned to study, evaluate, and productively use generative AI systems, drawing on established methods for modeling latent processes, managing uncertainty, and interpreting complex human-system interactions. The goal is to equip readers with a conceptual map that supports more informed experimentation, critical interpretation, and responsible use as these systems continue to evolve.
[HC-25] Agentic Education: Using Claude Code to Teach Claude Code
[Quick Read]: This paper addresses the lack of structured pedagogical frameworks for AI coding assistants such as Claude Code: developers face a gap between documentation and practical mastery and rely on fragmented resources. The core solution, cc-self-train, is a modular, interactive curriculum built around five mechanisms: (1) a persona progression model grounded in Gradual Release of Responsibility, which shifts the instructor's tone across four stages (Guide, Collaborator, Peer, Launcher); (2) an adaptive learning system that uses hook-based heuristics to adjust scaffolding at two timescales, streak detection mid-module and aggregate metrics at module boundaries; (3) a cross-domain unified curriculum in which five project domains share identical feature sequencing to enable transfer learning; (4) explicit pause primitives that pace the steps and mitigate information overload in an AI-as-instructor setting; and (5) an auto-updating curriculum in which the onboarding agent detects upstream tool changes and refreshes the teaching materials before instruction begins. A parametrized test suite enforces structural consistency across all 50 modules, and a pilot evaluation shows statistically significant self-efficacy gains (p < 0.001), largest for advanced features such as hooks and custom skills.
Link: https://arxiv.org/abs/2604.17460
Authors: Zain Naboulsi
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: 26 pages, 5 figures, 7 tables. Code: this https URL
Abstract:AI coding assistants have proliferated rapidly, yet structured pedagogical frameworks for learning these tools remain scarce. Developers face a gap between tool documentation and practical mastery, relying on fragmented resources such as blog posts, video tutorials, and trial-and-error. We present cc-self-train, a modular interactive curriculum for learning Claude Code, an agentic AI coding tool, through hands-on project construction. The system introduces five contributions: (1) a persona progression model that adapts instructor tone across four stages (Guide, Collaborator, Peer, Launcher), operationalizing Gradual Release of Responsibility for AI-mediated instruction; (2) an adaptive learning system that observes engagement quality through hook-based heuristics and adjusts scaffolding at two timescales, using streak detection for mid-module intervention and aggregate metrics for module-boundary persona changes; (3) a cross-domain unified curriculum in which five distinct project domains share identical feature sequencing, enabling transfer learning; (4) a step-pacing mechanism with explicit pause primitives to manage information overload in an AI-as-instructor context; and (5) an auto-updating curriculum design in which the onboarding agent detects upstream tool changes and updates teaching materials before instruction begins. A parametrized test suite enforces structural consistency as a proxy for pedagogical invariants across all 50 modules. A pilot evaluation with 27 participants shows statistically significant reported self-efficacy gains across all 10 assessed skill areas (p < 0.001), with the largest effects on advanced features such as hooks and custom skills. We discuss implications for the design of auto-updating educational systems.
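The abstract describes streak detection as the mid-module adaptation signal but gives no code. As a rough illustration only, a minimal streak heuristic might look like the sketch below; the function name, boolean-outcome interface, and threshold are all hypothetical, since the real cc-self-train heuristics observe hook events rather than a list of pass/fail flags.

```python
def needs_intervention(recent_outcomes, streak_len=3):
    """Hypothetical mid-module heuristic: trigger extra scaffolding when
    the learner's last `streak_len` attempts all failed (False entries).
    Returns False while fewer than `streak_len` attempts have been made."""
    tail = recent_outcomes[-streak_len:]
    return len(tail) == streak_len and not any(tail)
```

In this sketch, a module-boundary persona change would consult aggregate metrics over the whole outcome history instead of just the tail.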
[HC-26] Analysing Human Interaction with Electronic Displays in Microgravity
[Quick Read]: This paper addresses how efficiently astronauts interact with touchscreen interfaces in microgravity and how their cognitive state and mental well-being can be assessed, focusing on the performance difference between finger- and stylus-based input aboard the International Space Station (ISS). The key to the approach is comparing the pointing-and-selection performance of astronauts, ground crews, and university students using finger and stylus, combined with a spatial 2-back test and self-reported scales for cognitive state and well-being. Based on an analysis of 420 pointing tasks, finger-based input in microgravity was significantly faster than stylus-based input, while no significant differences in pointing performance or mental state emerged between astronauts and participants on the ground. The results can feed predictive models for sizing and placing GUI elements in spacecraft cockpits, improving interaction efficiency and user experience.
Link: https://arxiv.org/abs/2604.17322
Authors: Pradipta Biswas, Himanshu Vishwakarma, Mukund Mitra, KamalPreet Singh Saluja, Aumkar Kishore Shah
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Human Space Flight missions often require interaction with touchscreen displays. This paper presents a study investigating human-machine interaction with touchscreens, using both finger and stylus, on the International Space Station. The study also reports the cognitive state of astronauts via a spatial 2-back test and mental well-being via self-reported scales. We present a series of results comparing pointing and selection performance among ISS crews, ground crews, and university students; finger-based versus stylus-based touching in microgravity; and mental well-being scores. Based on an analysis of 420 pointing tasks performed on the ISS by 2 astronauts, finger-based pointing is statistically significantly faster than stylus-based pointing in microgravity. We found no significant difference in pointing performance or mental state between astronauts and students on the ground. Results from the study can be used to predict pointing and selection time from the dimensions and positions of GUI (Graphical User Interface) elements for spacecraft cockpits.
[HC-27] What Security and Privacy Transparency Users Need from Consumer-Facing Generative AI
[Quick Read]: This paper addresses the difficulty consumers face in making adoption decisions and sustaining use of generative AI tools when security and privacy (SP) information is opaque or untrustworthy. The study finds that existing SP disclosures rarely drive initial adoption; users instead fall back on coarse proxies such as popularity to infer SP practices, and post-adoption uncertainty about SP practices constrains use in high-stakes contexts and can even lead to abandonment. The key to the solution is actionable transparency: trustworthy information sources (e.g., independent third-party evaluations) and usable interfaces (e.g., on-demand disclosure), distilled into five design dimensions that support decision-making and use, offering systematic guidance for researchers, designers, and policymakers to improve SP transparency in consumer-facing GenAI.
Link: https://arxiv.org/abs/2604.17270
Authors: Jiaxun Cao, Yu Dong, Chunxi Zhan, Rithvik Neti, Sai Teja Peddinti, Pardis Emami-Naeini
Affiliations: Duke University; Duke Kunshan University; Google
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments:
Abstract:Users increasingly rely on consumer-facing generative AI (GenAI) for tasks ranging from everyday needs to sensitive use cases. Yet, it remains unclear whether and how existing security and privacy (SP) communications in GenAI tools shape users’ adoption decisions and subsequent experiences. Understanding how users seek, interpret, and evaluate SP information is critical for designing usable transparency that users can trust and act on. We conducted semi-structured interviews and design sessions with 21 U.S. GenAI users. We find that available SP information rarely drove initial adoption in practice, as participants often perceived it as incomplete, ineffective, or lacking credibility. Instead, they relied on rough proxies, such as popularity, to infer SP practices. After adoption, uncertainty about SP practices constrained participants’ willingness to use GenAI tools, particularly in high-stakes contexts, and, in some cases, contributed to discontinued use. Participants therefore called for transparency that supports decision-making and use, including trustworthy information (e.g., independent evaluations) and usable interfaces (e.g., on-demand disclosure). We synthesize participants’ desired design practices into five dimensions to facilitate systematic future investigation into best practices. We conclude with recommendations for researchers, designers, and policymakers to improve SP transparency in consumer-facing GenAI.
[HC-28] All Public Voices Are Equal But Are Some More Equal Than Others to LLMs?
[Quick Read]: This paper examines whether large language models (LLMs) used by federal agencies to process public comments during rulemaking treat commenters differently depending on identity attributes such as race, gender, and socioeconomic status. The key to the approach is a counterfactual design: holding comment content constant while varying only the commenter's attributed identity (occupation, race, gender), and systematically testing whether eight LLMs available for federal use summarize identical text differently. Occupation was the only socioeconomic signal to produce consistent differential treatment: attributing the same comment to a street vendor rather than a financial analyst yielded summaries that preserved less of the original meaning, used simpler language, and shifted emotional tone, a pattern that held across all models, prompts, and regulatory contexts. Race and gender effects were inconsistent or absent, and writing quality mattered through argument substance rather than surface mechanics. The findings suggest that AI fairness assessments should attend to socioeconomic signals and that fairness benchmarks could be folded into existing federal IT procurement processes such as FedRAMP.
Link: https://arxiv.org/abs/2604.17247
Authors: Sola Kim, Marco A. Janssen, Jieshu Wang, Ame Min-Venditti, Neha Karanjia, John M. Anderies
Affiliations: Arizona State University; Stony Brook University
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Preprint
Abstract:Federal agencies are increasingly deploying large language models (LLMs) to process public comments submitted during notice-and-comment rulemaking, the primary mechanism through which citizens influence federal regulation. Whether these systems treat all public input equally remains largely untested. Using a counterfactual design, we held comment content constant and varied only the commenter’s demographic attribution – race, gender, and socioeconomic status – to test whether eight LLMs available for federal use produce differential summaries of identical comments. We processed 182 public comments across 32 identity conditions, generating over 106,000 summaries. Occupation was the only identity signal to produce consistent differential treatment: the same comment attributed to a street vendor, compared to a financial analyst, received a summary that preserved less of the original meaning, used simpler language, and shifted emotional tone. This pattern held across all names, prompts, models, and regulatory contexts tested. Race effects were inconsistent and appeared driven by specific name tokens rather than racial categories; gender effects were absent. Writing quality predicted summarization outcomes through argument substance rather than surface mechanics; experimentally injected spelling and grammar errors had negligible effects. The magnitude of occupation-based differential treatment varied by model provider, meaning that selecting a model implicitly selects a level of fairness – a dimension that current procurement frameworks such as FedRAMP do not evaluate. These findings suggest that socioeconomic signals warrant attention in AI fairness assessments for government information systems, and that fairness benchmarks could be incorporated into existing federal IT procurement processes.
[HC-29] Lightweight Cybersickness Detection based on User-Specific Eye and Head Tracking Data in Virtual Reality
[Quick Read]: This paper addresses reliability and individual-variability problems in cybersickness detection for virtual reality (VR): existing methods commonly suffer from unstable cross-user performance, unnecessary model complexity, and a lack of personalization. The key to the solution is a lightweight detection framework that combines ensemble-learning models with user-specific eye and head tracking data; through feature engineering and training sets built from similar-content segments, detection accuracy improves substantially. Experiments reach 93% accuracy in the cross-user setting and 88% in the user-personalized setting using only 23-dimensional features, enabling efficient real-time detection that balances time efficiency with detection performance.
Link: https://arxiv.org/abs/2604.17158
Authors: Yijun Wang, Mihai Bâce, Maria Torres Vega
Affiliations: KU Leuven
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 23 pages, 4 figures, 5 tables
Abstract:The occurrence of cybersickness in virtual reality (VR) significantly impairs users’ perception and sense of immersion. Therefore, timely detection of cybersickness and the application of appropriate intervention strategies are crucial for enhancing the user experience. However, existing cybersickness detection methods often suffer from issues such as poor detection reliability across different levels of cybersickness and unnecessary model complexity. Furthermore, while cybersickness exhibits significant inter-user variability, most existing approaches aggregate all data from users and lack user-specific solutions. In this paper, we investigate a lightweight approach for cybersickness detection incorporating an ensemble learning model and user-specific eye and head tracking data. Our experiments using the open-source dataset Simulation 2021 demonstrate that feature engineering and training set construction are critical for determining detection performance. Models trained with data from similar-content segments achieve the best results, attaining detection accuracies of 93% in the cross-user setting and 88% in the user-personalized setting, using only 23-dimensional eye and head features. Moreover, by using user-specific data, well-tuned ensemble learning models with shorter training and inference times can be feasibly applied to real-world cybersickness detection, offering superior time efficiency and outstanding detection performance. This work offers useful evidence toward the development of lightweight and user-adaptive cybersickness detection models for VR applications.
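The abstract names an ensemble model over 23-dimensional eye and head features without giving the architecture. As a self-contained illustration only, a minimal bagged ensemble of decision stumps on synthetic stand-in data is sketched below; the feature distributions, stump design, and every hyperparameter are invented for the example and are not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 23-dimensional eye/head feature vectors:
# "sick" samples are drawn with a shifted mean so the classes separate.
X_calm = rng.normal(0.0, 1.0, size=(300, 23))
X_sick = rng.normal(1.5, 1.0, size=(300, 23))
X = np.vstack([X_calm, X_sick])
y = np.array([0] * 300 + [1] * 300)

def fit_stump(Xb, yb):
    """One-feature threshold classifier: midpoint between class means."""
    f = rng.integers(Xb.shape[1])  # random feature, bagging-style diversity
    thr = (Xb[yb == 0, f].mean() + Xb[yb == 1, f].mean()) / 2
    return f, thr

def fit_ensemble(X, y, n_stumps=25):
    """Bagged decision stumps: each stump is fit on a bootstrap resample."""
    stumps = []
    for _ in range(n_stumps):
        idx = rng.integers(len(X), size=len(X))  # bootstrap sample
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def predict(stumps, X):
    votes = np.stack([(X[:, f] > thr).astype(int) for f, thr in stumps])
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote

stumps = fit_ensemble(X, y)
acc = (predict(stumps, X) == y).mean()
```

A real pipeline in the paper's spirit would replace the synthetic arrays with per-user eye/head features and evaluate on held-out segments rather than the training data.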
[HC-30] The Privacy Placebo: Diagnosing Consent Burden through Performative Scrolling
[Quick Read]: This paper addresses the false sense of control created by structured consent interfaces: users are asked to make privacy choices, but those choices are often shaped by low-yield, repetitive interaction patterns (performative scrolling) rather than genuine understanding. The key contribution is a reproducible interface-audit metric, the Performative Scrolling Index (PSI), which quantifies the pre-choice burden a user incurs before a meaningful non-accepting alternative becomes visible and actionable. PSI decomposes burden into four observable components (distance, time, focus loops, and hidden reveals), diagnosing how design choices such as offscreen alternatives, fragmented disclosure, and staged modal flows add friction without substantively improving user control. PSI does not assess comprehension or legal sufficiency; it is a diagnostic intended to support reproducible audits and redesigns.
Link: https://arxiv.org/abs/2604.17129
Authors: Haoze Guo, Ziqi Wei
Affiliations: University of Wisconsin - Madison
Subjects: Human-Computer Interaction (cs.HC)
Comments: In Submission
Abstract:While consent banners and privacy policies invite users to read and choose, many choices are shaped by repeated, low-yield interaction routines rather than deliberation. This paper studies performative scrolling: slow, low-information interaction that can signal attention to consent without substantially improving understanding. We present the Performative Scrolling Index (PSI), a reproducible interface-audit metric for measuring pre-choice burden before a meaningful non-accepting alternative becomes visible and actionable. PSI decomposes burden into four observable components: distance, time, focus loops, and hidden reveals. In this paper, PSI is the primary burden metric, while companion signals such as AAI, CSI, and divergence are used as secondary interpretive audit aids rather than standalone validated scales. We also provide a least-effort audit protocol, design-side invariants, a worked example, and a medium-scale live deployment across desktop and mobile conditions under pointer and keyboard traversal policies. Together, these analyses show how structural choices such as offscreen alternatives, fragmented disclosure, and staged modal flows can increase pre-choice friction without improving meaningful control. PSI is not a measure of comprehension or legal sufficiency; rather, it is a diagnostic of interface-side burden intended to support reproducible audits and redesigns.
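The abstract specifies PSI's four components but not how they are aggregated. One plausible normalized-average sketch is below; the function name, reference budgets (`d_ref`, `t_ref`, `l_ref`, `h_ref`), and equal weighting are all assumptions made for illustration, not the paper's definition.

```python
def psi(distance_px, time_ms, focus_loops, hidden_reveals,
        d_ref=2000.0, t_ref=5000.0, l_ref=3.0, h_ref=2.0):
    """Hypothetical PSI aggregation: each burden component (scroll
    distance, elapsed time, focus loops, hidden reveals) is scaled by a
    reference budget and the four ratios are averaged. A score of 1.0
    means the user spent exactly one "budget" of burden before a
    non-accepting alternative became actionable."""
    parts = (distance_px / d_ref, time_ms / t_ref,
             focus_loops / l_ref, hidden_reveals / h_ref)
    return sum(parts) / len(parts)
```

In an audit, higher scores would flag interfaces whose non-accepting option is buried behind more scrolling, waiting, or staged reveals.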
[HC-31] The Effects of Request Alerts on the Diversity and Visibility of Community Notes
[Quick Read]: This paper addresses fairness concerns in crowdsourced fact-checking systems, specifically how to steer contributors toward more diverse and politically salient content without worsening content inequality. The key mechanism studied is the request alert: when enough users request a community note on a post, the platform displays an alert that acts as an interface cue guiding contributor behavior. The study finds that alerts significantly raise note visibility (by 8.4 to 20.2 percentage points) and lead writers to fact-check more diverse and more political content; at the same time, they further concentrate contributions in the already dominant Politics and Conflict category, increasing content inequality. Request alerts thus work as an effective interface cue that broadens individual participation while introducing a system-level skew that warrants attention.
Link: https://arxiv.org/abs/2604.17042
Authors: Yilin Gong, Siqi Wu
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 11 pages, 8 figures
Abstract:Several major social media platforms have shifted toward crowdsourced fact-checking systems like Community Notes to combat misinformation at scale. However, these systems face criticism regarding which content is scrutinized and how visible that scrutiny is. To address these concerns, X allows users to request community notes for specific posts. When sufficient requests accumulate, X displays an alert, formalizing an interface cue intended to guide contributor behavior. In this study, we examine the effectiveness of request alerts. We infer the presence of request alerts at the time each note was written and identify 318 top writers who were repeatedly exposed to these alerts. Through analyzing their contributed 54,874 English notes written with and without request alerts, we find that at the individual level, writers fact-check more diverse and more political content under alerts. Nonetheless, at the collective level, these shifts direct contributions toward the already dominant Politics and Conflict category, thereby increasing content inequality within the Community Notes ecosystem. Finally, using a mixed-effects model that controls for both writer- and topic-level random effects, we estimate that notes written under alerts are between 8.4 and 20.2 percentage points more likely to be classified as helpful and thus visible to the public, compared to non-alerted notes. This visibility gain diminishes as topics diverge further from writers’ prior interests, demonstrating a pivot penalty effect. Taken together, our findings show that request alerts function as an effective interface cue that increases both topical diversity and note visibility in Community Notes.
[HC-32] The Instrumental Dissolution of Typing: Why AI Challenges the Keyboard Era in Knowledge Work
[Quick Read]: This paper asks how professional labor, organizational communication, and the recognition of productivity will be reshaped as multimodal generative AI reaches human-parity understanding of speech and gesture and the keyboard loses its instrumental necessity in knowledge work. The key concept is instrumental dissolution: the keyboard does not exit through hardware replacement but through the migration of its function into AI systems, shifting knowledge workers from keystroke producers to adversarial auditors and moving the binding constraint from content generation to evaluation (the verification bottleneck). This shift, the paper argues, calls for redesigned human-computer interfaces that support a verification-centered mode of human-AI collaboration.
Link: https://arxiv.org/abs/2604.17023
Authors: Wei Roy Hua
Affiliations: 8T5 Innovations
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 146 pages, 9 sections. Also available at this https URL
Abstract:For four decades, the QWERTY keyboard organized white-collar knowledge work. Typing’s dominance was instrumental, not cognitively necessary. As multimodal AI achieves human-parity understanding of speech and gesture, this necessity dissolves. We introduce instrumental dissolution – loss of institutional-default status while persisting in specialist niches. The keyboard era ends not through hardware replacement but through migration of its function into AI systems. The central contribution identifies the verification bottleneck: as AI collapses production friction, the primary constraint shifts from generation to evaluation. Knowledge workers become adversarial auditors rather than keystroke-producers. This restructures professional expertise, organizational communication, and how productive labor is recognized. Converging evidence from history, philosophy, neuroscience, technology, organizational studies, and cultural analysis supports this thesis. We map synthetic literacy – oral input generating literate output – as the defining feature of this transition. Under three scenarios (optimistic: 2028-2035; base: 2035-2045; pessimistic: 2045-2060), we specify disconfirmation criteria that would weaken the thesis if observed. We propose seven interface primitives operationalizing verification-centered HCI.
[HC-33] Intelligent Drill-Down: Large Language Model-Driven Drill-Down Technique for Human-AI Collaborative Visual Exploration
[Quick Read]: This paper addresses the loss of orientation and efficiency analysts experience when the drill-down space in visual analytics becomes excessively large. The key to the solution is the Intelligent Drill-Down Framework, in which a large language model (LLM) performs three core functions: recommending valuable drill-down paths by training the LLM to approximate a validated greedy algorithm; interpreting user intent from interaction data to construct personalized drill-down charts; and managing parallel drill-down branches efficiently. The framework reduces the cognitive burden of data interpretation and improves insight generation in multidimensional data analysis.
Link: https://arxiv.org/abs/2604.17002
Authors: Zhijun Zheng, Tian Qiu, Yuheng Zhao, Siming Chen
Affiliations: Fudan University
Subjects: Human-Computer Interaction (cs.HC)
Comments: 11 pages, 6 figures. Accepted to IEEE PacificVis 2026
Abstract:In visual analytics, applying filters to drill-down and extract higher-value insights is a common and important data analysis method. When the drill-down space becomes excessively large, analysts may lose orientation, leading to decreased efficiency in the drill-down process. To tackle these challenges, we propose the Intelligent Drill-Down Framework, in which a large language model (LLM) facilitates the generation of visual insights, leverages user interaction data to interpret user intent, and generates appropriate drill-down paths. Our method is designed to assist users in identifying valuable drill-down paths when exploring multidimensional data, thereby reducing the cognitive burden of data interpretation and facilitating the generation of insights. Specifically, we propose a drill-down path recommendation method, in which the LLM is trained to approximate a validated greedy algorithm. Secondly, we analyze the user’s intent to construct a drill-down chart. Finally, we design a branch management method. Building upon this framework, we designed a system that includes a hybrid interface providing hierarchical navigation to monitor users and manage parallel branches, a visualization panel for interactive data exploration, and an insight panel to present analytical findings and generate drill-down recommendations. We evaluated the effectiveness of our method through a demonstrative use case and a user study.
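The abstract mentions a validated greedy algorithm that the LLM is trained to approximate, but does not define it. A generic greedy drill-down over (dimension, value) filters, with a caller-supplied insight score, could be sketched as follows; the interface (`rows` as dicts, a `score` function over subsets, a fixed `depth`) is an assumption for illustration, not the paper's algorithm.

```python
def greedy_drill_down(rows, dimensions, score, depth=3):
    """Greedy sketch: at each step, pick the (dimension, value) filter
    whose resulting subset maximizes a caller-supplied insight score,
    then recurse on that subset up to `depth` filters deep."""
    path, subset = [], rows
    for _ in range(depth):
        best = None
        for dim in dimensions:
            for val in {r[dim] for r in subset}:
                cand = [r for r in subset if r[dim] == val]
                s = score(cand)
                if best is None or s > best[0]:
                    best = (s, dim, val, cand)
        if best is None:
            break
        _, dim, val, subset = best
        path.append((dim, val))
    return path
```

With `score` set to, say, mean profit over the filtered rows, the function returns a filter path such as `[("product", "A"), ("region", "EU")]`; the paper's LLM would stand in for (and generalize) this exhaustive greedy search.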
[HC-34] Correcting Low-Signal Sensitivity in the Deliberative Reason Index
[Quick Read]: This paper addresses the tendency of the standard Deliberative Reason Index (DRI) to produce inflated consistency scores under low-signal conditions, where near-zero correlations are mistakenly treated as evidence of consistency. The key to the solution is a modified DRI that applies a continuous penalty to low-signal correlation pairs: the penalty suppresses spurious consistency when no substantive signal is present, preserves the original scale, and reduces exactly to the standard DRI when substantive signal is present. A threshold sensitivity analysis identifies τ = 0.2 as the optimal parameter, and an empirical check shows that the correction preserves the core inferences while markedly improving reliability and comparability in low-signal settings.
Link: https://arxiv.org/abs/2604.16963
Authors: Francesco Veri
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments: 9 pages, 1 figure
Abstract:The Deliberative Reason Index (DRI) is increasingly used to assess the coherence between considerations and preferences in deliberative settings, including applications to LLM-generated data. Under low-signal conditions, however, the standard DRI can produce inflated scores by treating near-zero correlations as evidence of consistency. Monte Carlo simulations across common study designs show that this bias increases with group size and yields positive values even under random response. A modified DRI is introduced that applies a continuous penalty to low-signal correlation pairs. The modification preserves the original scale and reduces exactly to the standard DRI when substantive signal is present. A threshold sensitivity analysis identifies \tau = 0.2 as the optimal parameter. An empirical check with archival deliberative data shows that substantive inferences remain unchanged. The modification improves the reliability and comparability of the DRI in low-signal settings.
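The abstract specifies only the penalty's properties: it is continuous, it down-weights correlation pairs with magnitude below τ = 0.2, and it vanishes when substantive signal is present. A minimal sketch consistent with those properties is below; the actual DRI aggregation is more involved, so treat the linear ramp weight here as an illustrative assumption.

```python
import numpy as np

def penalized_consistency(corrs, tau=0.2):
    """Down-weight correlation pairs whose magnitude falls below tau, so
    near-zero correlations no longer count as evidence of consistency.
    When every |r| >= tau the weights are all 1 and the function reduces
    to the plain mean, mirroring how the modified DRI reduces to the
    standard DRI under substantive signal."""
    corrs = np.asarray(corrs, dtype=float)
    weights = np.minimum(1.0, np.abs(corrs) / tau)  # continuous ramp in [0, 1]
    return float(np.mean(weights * corrs))
```

Under random response, correlations cluster near zero, so the ramp drives the penalized score toward zero instead of letting small positive noise accumulate into an inflated index.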
[HC-35] LLMs can persuade only psychologically susceptible humans on societal issues via trust in AI and emotional appeals amid logical fallacies
[Quick Read]: This paper addresses the lack of longitudinal evidence on how LLMs' persuasiveness and perceived humanness evolve within time-varying psychological frameworks. The key to the solution is Talk2AI, a longitudinal framework implemented through structured conversation experiments (3,080 conversations, over 60,000 turns) that quantifies the psycho-social, reasoning, and affective dimensions of LLM persuasiveness on polarizing societal topics. Combining multi-wave data collection, behavioral feedback analysis, and explainable AI (XAI), the study finds long-run inertia in human convictions, frequent fallacious reasoning by both humans and LLMs, and identifiable psychological traits (higher trust in LLMs, agreeableness, extraversion, and need for cognition) that make individuals more susceptible to LLM persuasion, providing an empirical and methodological basis for understanding how generative AI influences human opinion through multiple psycho-social pathways.
Link: https://arxiv.org/abs/2604.16935
Authors: Alexis Carrillo, Salvatore Citraro, Ali Aghazhadeh Ardebili, Enrique Taietta, Giulio Rossetti, Emilio Ferrara, Giuseppe Alessandro Veltri, Massimo Stella
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
Abstract:Scarce longitudinal evidence examines LLMs’ persuasiveness and humanness along time-evolving psychological frameworks. We introduce Talk2AI, a longitudinal framework quantifying psycho-social, reasoning and affective dimensions of LLMs’ persuasiveness about polarizing societal topics. In a four-way longitudinal setup, Talk2AI’s 770 participants engaged in structured conversations with one of four leading LLMs on topics like climate change, social media misinformation, and math anxiety. This produced 3,080 conversations over 60,000 turns. After each wave, participants reported conviction in their initial topic stance, perceived opinion change, the LLM’s perceived humanness, a self-donation to the topic and a textual explanation. Feedback time series showed longitudinal inertia in convictions, indicating some human anchoring to initial opinions even after repeated exposure to AI-generated arguments. Interestingly, NLP analyses revealed that both humans and LLMs relied on fallacious reasoning in 1 of every 6 conversational quips, countering the “LLMs as superior systems” stereotype behind LLMs’ cognitive surrender. LLMs’ perceived humanness was most learnable from sociodemographic, psychological and engagement features (R^2 = 0.44), followed by opinion change (R^2 = 0.34), conviction (R^2 = 0.26) and personal endowment (R^2 = 0.24). Crucially, explainable AI (XAI) indicated: (i) the presence of individuals more susceptible to LLM-based opinion changes; (ii) that psychological susceptibility to LLM persuasion consisted of having more trust in LLMs, being more agreeable and extraverted, and having a higher need for cognition. A multiverse approach with mixed-effects models confirmed XAI results, alongside strong individual differences. Talk2AI provides a grounded framework and evidence for detecting how GenAI can influence human opinions via multiple psycho-social pathways in AI-human digital platforms.
[HC-36] Beyond Serendipity: From Exposing the Unknown to Fostering Engagement through Peer Recommendation
[Quick Read]: This paper addresses the filter-bubble problem in conventional recommender systems: prolonged exposure to similar content narrows user interests, and merely surfacing unfamiliar items does not ensure that users truly understand or appreciate them. The key to the solution is the Peer Recommendation framework, in which the user and an AI agent (Peer) with distinct preferences jointly explore unfamiliar content through chat-based dialogue; the user acts as both recommender and recipient, and the two collaboratively build a shared playlist. By introducing "otherness," this mechanism turns recommendation from one-way delivery into two-way collaboration, fostering deeper engagement with and perceived value of unfamiliar content and moving users from passive exposure to active understanding.
Link: https://arxiv.org/abs/2604.16818
Authors: Sosui Moribe, Taketoshi Ushiama
Affiliations: Kyushu University
Subjects: Human-Computer Interaction (cs.HC)
Comments: 7 pages, 4 figures
Abstract:Serendipity-oriented recommender systems expose users to unfamiliar items to counter filter bubbles, yet mere exposure does not ensure that users will understand or appreciate the content they encounter. We propose Peer Recommendation, a framework in which a user and an AI agent (Peer) with distinct preferences collaboratively explore unfamiliar content. Unlike conventional conversational recommender systems where the user is a passive recipient, our framework positions the user as both a recommender and a recipient: the user and the Peer mutually recommend songs to each other through chat-based dialogue, collaboratively building a shared playlist. In an exploratory within-subjects experiment (N=14), we compared three conditions: (1) a Close Peer, (2) a Distant Peer, and (3) a baseline agent without an explicit preference profile. The Close Peer significantly increased users’ interest expansion and perceived value of the activity compared to the baseline, with medium-to-large effect sizes. The Distant Peer showed no significant difference at the aggregate level; however, qualitative analysis revealed varied responses, with some participants strongly preferring the Distant Peer. These findings suggest that the “otherness” of a recommendation partner is essential for moving beyond mere exposure toward genuine engagement, and that the appropriate degree of preference distance may vary and need to be adapted to individual users.
[HC-37] The Reliance Negotiation Framework: A Dynamic Process Model of Student LLM Engagement in Academic Writing
[Quick Read]: This paper addresses the inability of existing theoretical frameworks to capture the dynamic, complex nature of students' reliance on large language models (LLMs) in academic writing, in particular within-student variability across tasks, the paradox whereby experience produces habituation rather than sophistication, and principled non-use grounded in ethical reasoning. The key to the solution is the Reliance Negotiation Framework (RNF), which treats LLM reliance as an ongoing negotiation among four concurrent inputs (perceived benefits, perceived risks, ethical commitments, and situational demands) whose outputs recursively shape subsequent decisions. A Two-Model Architecture accommodates the 13.0% of students whose categorical ethical commitments foreclose negotiation entirely, enabling systematic modeling of diverse usage patterns and yielding falsifiable predictions with implications for AI literacy pedagogy, academic integrity policy, and equity-centered practice at minority-serving institutions.
Link: https://arxiv.org/abs/2604.16772
Authors: Shahin Hossain
Affiliations: University of Maryland, Baltimore County
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 21 pages, 1 figure, 8 tables
Abstract:Student engagement with large language models (LLMs) in academic writing is not a stable trait, an adoption decision, or a competency level; it is a continuously negotiated process that existing frameworks cannot adequately theorize. Typological models provide categories without mechanisms; technology acceptance models explain adoption but not post-adoption quality; AI literacy frameworks treat competency as a static predictor rather than a live input. None accounts for within-student variability across tasks, the developmental paradox whereby experience produces habituation rather than sophistication, or principled non-use as a form of ethical reasoning. This article introduces the Reliance Negotiation Framework (RNF), developed from a sequential explanatory mixed-methods study of 382 undergraduates at a public minority-serving institution in the United States (survey, N = 382; 14 semi-structured interviews; three qualitative survey strands; 1,435 coded instances). The RNF reconceptualizes LLM reliance as an ongoing negotiation among four concurrent inputs (perceived benefits, perceived risks, ethical commitments, and situational demands) with outputs that recursively modify subsequent decisions. A Two-Model Architecture accommodates the 13.0% of participants whose categorical ethical commitments foreclose negotiation entirely. The framework generates four falsifiable predictions with implications for AI literacy pedagogy, academic integrity policy, and equity-centered practice at minority-serving institutions.
[HC-38] You can just review things: A digital ethnography of informal peer review
【速读】:该论文旨在解决传统封闭式同行评审(peer review)机制在透明度、参与广度和反馈时效性方面的局限性,探索一种新兴的“非正式同行评审”(informal peer review)实践如何作为补充甚至替代路径来增强学术评价体系的开放性和可靠性。其核心问题在于:非正式同行评审的参与者构成、运作方式及其对学术生态的实际影响尚未被系统理解。解决方案的关键在于通过跨平台数字民族志研究方法,追踪15个学术社群中12个典型案例及8篇元评论,识别出四大主题——评审者群体多元且自组织性强、使用非常规深度策略、面临来自作者与出版方的阻力,并提出应以学者价值为导向,降低参与门槛、设计激励机制并强化治理结构,从而推动这一碎片化但具有潜力的“证据基础设施”向可扩展、可持续的方向演进。
链接: https://arxiv.org/abs/2604.16764
作者: Jay Patel,Joel Chan
机构: 未知
类目: Digital Libraries (cs.DL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 108 pages, 17 figures, 7 tables, version 1.0
Abstract:Across scholarly communities, manuscripts face similar evaluative rituals: editors invite experts to privately assess submissions through formal peer reviews. This closed, loosely structured, and publisher-mediated process is now being supplemented by critiques on open, distributed platforms. We call this practice, a blend of three open peer review variants, informal peer review as it is accessible to outsiders, unmediated by publishers, and conducted across public platforms. Informal peer reviewers range from occasional error detectors to experienced sleuths who identify plagiarism, fraud, errors, conflicts of interest, and conceptual flaws. They may interpret methods, clarify jargon, assess value, and connect to related work. Here, we asked four questions: (1) Who are informal peer reviewers? (2) Where do they work? (3) How do they evaluate research? and (4) What are their impacts? To answer these questions, we conducted a cross-platform digital ethnography with participant observation. We traced discourse across communities over four months and revisited cases after nine and twelve months. From 15 communities, we selected 12 case mentions (10 unique cases) and 8 meta-commentaries from 26 reviewers. Using open and axial coding, we generated 1,080 codes and four themes: reviewers are a motley crew, they self-organize across subpar digital spaces, use deep, uncommon strategies, and they face resistance from authors, publishers, and editors. Informal peer review, we concluded, is a fragile, minimally governed patchwork of people, platforms, and practices, as well as an emerging evidence infrastructure that can be scaled up. We advise advocates and tool-builders to evolve informal review tools, communities, training, and governance by connecting to scholars’ values, reducing participation friction, and rewarding attempts to extend the scholarly dialogue. 
[HC-39] Evaluating Adaptive Personalization of Educational Readings with Simulated Learners
【速读】:该论文旨在解决教育阅读材料个性化适应性不足的问题,即如何基于学习者的知识状态动态调整阅读内容以提升学习效果。解决方案的关键在于构建一个理论驱动的模拟学习者框架,其核心包括:从开放教材中提取学习目标与知识组件本体(knowledge-component ontology),在浏览器端构建本体图谱(Ontology Atlas)并标注文本片段;采用受Construction-Integration启发的记忆模型整合DIME风格的读者特征和KREC风格的误解修正机制,并引入开放的新 Dale-Chall 可读性信号;通过贝叶斯知识追踪(BKT)实现自适应路径选择,同时基于显式记忆状态进行选项评分决策。实验表明,在计算机科学领域,自适应阅读显著提升了学习成效,而在无机化学和生物学中则呈现微弱正向或中性结果。
链接: https://arxiv.org/abs/2604.16744
作者: Ryan T. Woo,Anmol Rao,Aryan Keluskar,Yinong Chen
机构: Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner’s explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
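摘要中提到系统用贝叶斯知识追踪(BKT)驱动自适应路径选择。下面是一个极简的 BKT 单步更新示意(假设性参数,非论文原始实现):先根据本次作答对先验掌握概率做贝叶斯修正,再叠加一次学习转移。

```python
# Minimal Bayesian Knowledge Tracing (BKT) update, the kind of model the
# abstract says drives adaptation. Parameter values are illustrative only.
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """Posterior P(skill known) after one observed answer, then one learning step."""
    if correct:
        evidence = p_know * (1 - slip) + (1 - p_know) * guess
        posterior = p_know * (1 - slip) / evidence
    else:
        evidence = p_know * slip + (1 - p_know) * (1 - guess)
        posterior = p_know * slip / evidence
    # Transition: the learner may acquire the skill between opportunities.
    return posterior + (1 - posterior) * learn

# A short answer sequence: mastery estimate rises with correct answers,
# drops after the error, then recovers.
p = 0.3
for outcome in [True, True, False, True]:
    p = bkt_update(p, outcome)
print(round(p, 3))
```

自适应逻辑通常据此对 p 设阈值:低于阈值时继续推送同一知识组件的阅读材料,高于阈值时进入下一节点。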
[HC-40] Teacher-Authored Prompts for Configuring Student-AI Dialogue: K-12 Classroom Implementation
【速读】:该论文旨在解决生成式 AI (Generative AI) 在课堂教学中如何有效传递教师的 pedagogical intent(教学意图),并确保学生与 AI 的对话能够实现预期的学习目标,尤其是在教师通过实时课堂调控对 AI 对话进行微调的情况下。研究发现,教师通过设计“教师到 AI”的指令提示(instructional scaffold)和面向学生的对话启动语,可显著提升学生-AI 交互与教学目标的一致性——71% 的对话完全符合教学意图,仅 1% 显著偏离。然而,关键挑战在于认知需求水平(Depth of Knowledge, DOK)的落实存在设计-实施差距:38% 的对话未达到教师设定的 DOK 水平,且在 DOK 3 目标下接近 50%。解决方案的关键在于教师自定义的提示层(prompt layers)作为核心调控杠杆,其中明确的“结束线”(finish lines)可将 DOK 差距缩小 0.22 级(p < .001),而“禁止直接答案”的护栏机制可降低 AI 给出最终答案的比例 8.5 个百分点。这表明,教师对提示的设计能力是实现高阶思维训练可规模化落地的核心因素,同时也凸显了需进一步开发支持工具以稳定维持高阶认知参与的必要性。
链接: https://arxiv.org/abs/2604.16738
作者: Alex Liu,Min Sun,Lief Esbenshade,Victor Tian,Zachary Zhang,Kevin He
机构: University of Washington (华盛顿大学); Colleague AI, INC.
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:GenAI has rapidly entered instructional and learning settings as a teaching assistant or AI tutor. However, less is known about how pedagogical intent connects to the learning generated within these systems, especially when student-facing AI dialogues are fine-tuned through teacher orchestration in live classrooms. This study examines a classroom deployment of a “Classroom Teaching Aide” (TASD) system, which enables teachers to author both a teacher-to-AI setup prompt (instructional scaffold) and a student-facing conversation starter to launch AI-mediated classroom discussions. We analyze a multi-subject pilot conducted in Spring 2025, involving 20 participating teachers (16 of whom implemented the system), across 39 classrooms and 77 TASD settings, yielding 1,479 student-AI conversations with 878 unique students. Using platform logs, LLM coding with human validation, and post-study teacher interviews (N=10), we characterize teacher authoring choices and link them to enacted student-AI interaction outcomes. In deployment, student-AI conversations were largely aligned with instructional intent: 71% were fully on-track, and fewer than 1% were substantially off-track. However, a persistent design-enactment gap emerged for cognitive demand: 38% of conversations under-reached the teacher-targeted DOK level, approaching 50% when targeting DOK 3. The study also shows that explicit finish lines in the prompt reduced the DOK gap by 0.22 levels (p < .001), and “no direct answers” guardrails reduced AI final-answer rates by 8.5 percentage points. These findings position teacher-authored prompt layers as critical orchestration levers that translate pedagogical intent into structured student-AI dialogue, underscoring both their promise for scalable classroom integration and the need for additional supports to reliably sustain higher-order reasoning during enactment.
[HC-41] AgentClick: A Skill-Based Human-in-the-Loop Review Layer for Terminal AI Agents
【速读】:该论文旨在解决当前终端式自主AI代理(如Codex和Claude Code)与用户交互效率低、用户体验差的问题,尤其针对非专家用户在协作过程中面临的高门槛。现有交互方式依赖命令行或文本频道(如Discord),导致输出信息难以阅读、行动缺乏结构化展示、反馈需手动输入,从而阻碍了有效的人机协同。解决方案的关键在于提出AgentClick——一个基于本地npm服务器和浏览器插件的交互式审查层,将终端代理连接至结构化的Web界面,支持邮件撰写与修订、计划审查、轨迹可视化、错误定位等人类在环(human-in-the-loop)工作流,并通过可编辑的记忆机制实现偏好持久化和远程HTTP访问,显著提升了人机协作的效率与质量。
链接: https://arxiv.org/abs/2604.16520
作者: Haomin Zhuang,Hanwen Xing,Xiangliang Zhang
机构: University of Notre Dame (圣母大学); University of Southern California (南加州大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to ACM CAIS 2026 System Demonstrations. Conference paper
Abstract:Recent autonomous AI agents such as Codex and Claude Code have made it increasingly practical for users to delegate complex tasks, including writing emails, executing code, issuing shell commands, and carrying out multi-step plans. However, despite these capabilities, human-agent interaction still largely happens through terminal interfaces or remote text-based channels such as Discord. These interaction modes are often inefficient and unfriendly: long text outputs are difficult to read and review, proposed actions lack clear structure and visual context, and users must express feedback by typing detailed corrections, which is cumbersome and often discourages effective collaboration. As a result, non-expert users in particular face a high barrier to working productively with agents. To address this gap, we present AgentClick, an interactive review layer for terminal-based agents. AgentClick is implemented as a localhost npm server paired with a skill-based plugin that connects the running agent to a browser interface, allowing users to supervise and collaborate with agents through a structured web UI rather than raw terminal text alone. The system supports a range of human-in-the-loop workflows, including email drafting and revision, plan review and modification, memory management, trajectory inspection and visualization, and error localization during agent execution. It also turns code generation and execution into a reviewable process, enabling users to inspect and intervene before consequential actions are taken. In addition, AgentClick supports persistent preference capture through editable memory and remote access over HTTP, allowing users to review agents running on servers from their personal devices. Our goal is to lower the barrier for non-expert users and improve the efficiency and quality of human-agent co-work.
[HC-42] An Edge-Host-Cloud Architecture for Robot-Agnostic Caregiver-in-the-Loop Personalized Cognitive Exercise: Multi-Site Deployment in Dementia Care
【速读】:该论文旨在解决个性化认知训练中缺乏可扩展、隐私保护且具备多角色协同能力的机器人交互平台的问题。传统单机器人系统难以整合照护者知识、本地智能处理与多样化机器人形态,导致个性化不足、延迟高及部署受限。解决方案的关键在于提出一个分布式、多方参与的“Speaking Memories”架构:通过本地边缘服务器解耦多模态感知与推理,实现低延迟、隐私安全的实时对话;由照护者通过云端门户输入结构化生平信息以驱动个性化对话策略,并借助自动化多模态评估层持续分析用户情感、参与度等指标,形成闭环反馈机制,从而支持长期个性化干预与数据驱动的模型优化,最终提升交互质量与临床适用性。
链接: https://arxiv.org/abs/2604.16408
作者: Wenzheng Zhao,Ruth Palan Lopez,Shu Fen Wung,Fengpei Yuan
机构: Worcester Polytechnic Institute (伍斯特理工学院)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 21 pages, 6 figures, 10 tables, submitted to IEEE Transactions on Robotics (T-RO)
Abstract:We present Speaking Memories, a distributed, stakeholder-in-the-loop robotic interaction platform for personalized cognitive exercise support. Rather than a single robot-centric system, Speaking Memories is designed as a generalizable robotics architecture that integrates caregiver-authored knowledge, local edge intelligence, and embodied robotic agents into a unified socio-technical loop. The platform fuses auditory, visual, and textual signals to enable emotion-aware, personalized dialogue, while decoupling multimodal perception and reasoning from robot-specific hardware through a local edge interaction server. This design achieves low-latency, privacy-preserving operation and supports scalable deployment across heterogeneous robotic embodiments. Caregivers and family members contribute structured biographical knowledge via a secure cloud portal, which conditions downstream dialogue policies and enables longitudinal personalization across interaction sessions. Beyond real-time interaction, the system incorporates an automated multimodal evaluation layer that continuously analyzes user responses, affective cues, and engagement patterns, producing structured interaction metrics at scale. These metrics support systematic assessment of interaction quality, enable data-driven model fine-tuning, and lay the foundation for future clinician- and caregiver-informed personalization and intervention planning. We evaluate the platform through real-world deployments, measuring end-to-end latency, dialogue coherence, interaction stability, and stakeholder-reported usability and engagement. Results demonstrate sub-6-second response latency, robust multimodal synchronization, and consistently positive feedback from both participants and caregivers. Furthermore, subsets of the dataset can be shared upon request, subject to participant consent and IRB constraints.
[HC-43] Instructor-Created Custom GPTs as Pedagogical Partners Fostering Immersion in Online Higher Education: Two Case Studies
【速读】:该论文旨在解决在线高等教育中学生参与度难以维持的挑战,特别是探索定制化生成式 AI(custom GPTs)如何促进师生沉浸式学习(immersive learning),即深度心理投入的状态。其核心解决方案在于通过“沉浸式学习立方体”框架(Immersive Learning Cube),将沉浸感分解为三个维度:系统沉浸(环境包围感)、叙事沉浸(意义情境)和代理沉浸(意义建构承诺)。研究发现,精心设计并嵌入课程的定制 GPT 能够在不同教学场景中分别扮演反馈伙伴或元认知导师角色,从而同时增强这三个维度的沉浸体验——例如,在美国加速研究生写作课程中,GPT 作为即时反馈工具提升系统与代理沉浸;在葡萄牙本科软件工程课程中,以情境化角色扮演形式出现的 GPT 则强化了系统、叙事与代理沉浸。关键在于,这类工具并非替代教师,而是通过提升即时性、连贯性和学习者自主性,显著增强在线学习的沉浸感与参与度。
链接: https://arxiv.org/abs/2604.16397
作者: Dennis Beck,Leonel Morgado
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted for presentation at iLRN 2026 - Immersive Learning Research Network conference
Abstract:As online higher education expands, sustaining student engagement remains a critical challenge. This paper approaches immersive learning by investigating how custom GPTs foster immersion (as a state of deep mental involvement) for students and instructors. While large language models (LLMs) offer potential for enhancing feedback, little research has examined instructor-created custom GPTs designed to align with specific pedagogical goals. This paper addresses this gap, employing the Immersive Learning Cube framework, which conceptualizes immersion through three dimensions: system (envelopment by the environment), narrative (meaningful context), and agency (commitment to meaning-making). Through a qualitative analysis of two distinct case studies, an accelerated graduate grant writing course in the US and an undergraduate software engineering course in Portugal, we analyze course-embedded artifacts to map how custom GPTs influence these immersion dimensions. In the grant writing course, the custom GPT functioned as a feedback partner, fostering system immersion through its immediacy, narrative immersion by reinforcing the proposal’s evolving story, and agency immersion by empowering students to negotiate feedback and take ownership of revisions. In the software engineering course, a diegetically-framed custom GPT acted as a metacognitive tutor, enhancing system immersion via its permanent availability, narrative immersion through its role-play function and agency immersion by scaffolding students’ self- and co-regulated learning. Our findings demonstrate that thoughtfully integrated custom GPTs can act as powerful pedagogical partners that leverage all three dimensions of immersion. Rather than replacing human instructors, they can amplify immediacy, coherence, and learner autonomy, creating more engaging and immersive online learning environments.
[HC-44] How Do Developers Interact with AI? An Exploratory Study on Modeling Developer Programming Behavior
【速读】:该论文旨在解决当前对开发者与人工智能(AI)交互行为理解的不足问题,特别是忽视了隐性意图和情感维度如何与具体编程动作交织的现象。现有研究多聚焦于可观测的开发活动(如“提示词设计”和“代码编辑”),而未深入探讨开发者在使用AI辅助编程时的心理状态与行为模式之间的复杂关系。论文通过一项包含76名开发者的混合方法研究(分为AI辅助组与非AI对照组),结合屏幕录制回溯标注、问卷调查与深度访谈,构建了一个名为S-IASE的新模型,从意图(intention)、行动(action)、支持工具(supporting tool)和情绪(emotion)四个维度刻画开发状态。其关键创新在于揭示了AI辅助编程中行为模式的聚合性与顺序性特征,例如开发者更倾向于专注于生成、评估与验证AI输出;同时发现AI组情绪更为稳定,而非AI组情绪波动更大,并识别出部分开发者存在因依赖AI而产生的自我怀疑或愧疚感。这一成果填补了理解开发者-AI互动复杂性的关键空白。
链接: https://arxiv.org/abs/2604.16393
作者: Yinan Wu,Ze Shi Li,Kathryn Thomasset Stolee,Bowen Xu
机构: North Carolina State University (北卡罗来纳州立大学); University of Oklahoma (俄克拉荷马大学)
类目: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: Accepted at ACM International Conference on the Foundations of Software Engineering (FSE 2026), Research Track
Abstract:Artificial Intelligence (AI) is reshaping how developers adopt software engineering practices, yet the multi-dimensional nature of developer-AI interaction remains under-explored. Prior studies have primarily examined dimensions observable from developer activities such as “Prompt crafting” and “Code Editing”, overlooking how hidden intentions and emotional dimensions intertwine with concrete actions during AI-assisted programming. To understand this phenomenon, we conducted a mixed-methods study with 76 developers split into AI-assisted and non-AI groups. Each performed programming tasks (Python with API management or Java with SQL). Developers retrospectively labeled their self-reported intentions, tool-supported actions, and emotions from screen recordings, supplemented by surveys and interviews. Our user study resulted in a novel model named S-IASE with four dimensions to describe programming behavior: intention, action, supporting tool, and emotion for a given development state. Our analysis reveals aggregated and sequential behavioral patterns. For example, using AI assistants often makes developers more focused on actively creating code, evaluating, and verifying generated results. AI-assisted participants showed emotionally stable development flow, as opposed to non-AI-assisted participants who experienced more fluctuating emotions. Interviews revealed further nuance: some developers reported impostor-like feelings, expressing guilt or self-doubt about relying on AI. Our work bridges an important gap in understanding the complexities of developer-AI interaction in programming context.
[HC-45] A Multi-Technique Approach for Improving Summary Polar Diagrams
【速读】:该论文旨在解决总结性极坐标图(summary polar diagrams)在直观性和可读性方面的局限性问题,例如因极坐标特性导致的感知困难和重叠绘制(overplotting)现象。其核心解决方案是提出一种混合方法,整合概览+细节(overview+detail)、聚合(aggregation)、交互式过滤(interactive filtering)、笛卡尔链接(Cartesian linking)以及小多图(small multiples)等技术手段,从而提升极坐标图的清晰度、全面性和功能性。该方案通过用户研究与领域专家评审验证了其有效性,在气候科学、数据科学和机器学习等多个领域均展现出增强理解力、可用性和实用性的潜力。
链接: https://arxiv.org/abs/2604.16355
作者: Aleksandar Anžel,Zewen Yang,Georges Hattab
机构: Robert Koch-Institute (罗伯特·科赫研究所); Freie Universität (自由大学)
类目: Human-Computer Interaction (cs.HC)
备注: 21 pages, 6 figures, 1 table, 1 supplemental video
Abstract:While the polar system may lack the universal familiarity of its Cartesian counterpart, it remains indispensable for certain tasks. Summary polar diagrams, such as Taylor and mutual information diagrams, address tasks like discovering relationships, visualizing data similarity, and quantifying correspondence. Although these diagrams are invaluable tools for uncovering data relationships, their polar nature can hinder intuitiveness and lead to issues like overplotting. We present a hybrid approach that combines overview+detail, aggregation, interactive filtering, Cartesian linking, and small multiples to enhance the clarity, comprehensiveness, and functionality of summary polar diagrams. We performed a user study to assess this approach’s effectiveness, noting comparable response times among participants. Additionally, three domain experts with varying visualization experience reviewed an implemented solution applying summary polar diagrams to climate, data science (novel), and machine learning, refining the approach prior to the user study. The findings underscore the versatility of our approach in enhancing comprehension, accessibility, and utility.
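摘要中提到的 Taylor 图即典型的总结性极坐标图:把模型序列的标准差作为半径、与参考序列的相关系数的反余弦作为角度,中心化均方根差则由余弦定理约束。下面是一个独立的示意实现(纯标准库,仅用于说明坐标如何得出,并非论文工具的代码):

```python
import math

def taylor_coordinates(model, ref):
    """Return (sigma_model, arccos(R), centered RMSD) -- the polar coordinates
    a Taylor diagram plots, plus the RMS difference for cross-checking.
    Law of cosines: crmsd^2 = s_m^2 + s_r^2 - 2*s_m*s_r*R."""
    n = len(model)
    mm = sum(model) / n
    mr = sum(ref) / n
    sm = math.sqrt(sum((x - mm) ** 2 for x in model) / n)   # model std dev
    sr = math.sqrt(sum((x - mr) ** 2 for x in ref) / n)     # reference std dev
    r = sum((a - mm) * (b - mr) for a, b in zip(model, ref)) / (n * sm * sr)
    crmsd = math.sqrt(
        sum(((a - mm) - (b - mr)) ** 2 for a, b in zip(model, ref)) / n
    )
    return sm, math.acos(max(-1.0, min(1.0, r))), crmsd
```

论文所述的重叠绘制问题正来自多个模型点在这一极坐标平面上聚集;概览+细节与笛卡尔链接等手段即针对此展开。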
[HC-46] Hidden Technical Debt in Generative (GenUI) and Malleable User Interfaces
【速读】:该论文旨在解决生成式用户界面(GenUI)系统在实际应用中面临的采纳障碍问题,包括缺乏可适应的数据格式、过时的安全协议以及用户在构建自定义界面时认知与创造能力的不足。其解决方案的关键在于提出新的评估策略和科学方法,以量化malleable软件在用户研究中的影响、记录使用模式,并促进其实际落地与推广。
链接: https://arxiv.org/abs/2604.16354
作者: Besjon Cifliku
机构: Center for Advanced Internet Studies (CAIS)
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: Paper accepted to the Workshop "What Does Generative UI Mean for HCI Practice?‘’ at CHI 2026 - Barcelona, April 15, 2026
Abstract:Malleable software can profoundly change how users interact with digital content, enabling non-experts to create their own customized tools. However, the practical adoption of GenUI systems faces several barriers, which I unpack in this paper, including a lack of adaptable data formats, “old” security protocols, and gaps in users’ cognitive and creative skills for building their own interfaces. I advocate new evaluation strategies and scientific methods to measure the impact of malleable software in user studies, document usage patterns, and ensure their practical adoption.
[HC-47] MDwAIstScheduler: A Low-Cost Voice-Activated Device for Hands-Free Clinical Scheduling
【速读】:该论文旨在解决临床医师因电子健康记录(Electronic Health Record, EHR)任务和行政工作占用近一半工作时间而导致的职业倦怠问题,进而减少其用于直接患者照护的时间。解决方案的关键在于开发了一款低成本、可佩戴在腰间的语音助手MDwAIstScheduler,支持在患者诊疗过程中实现免提日程管理;该设备隐藏于白大褂之下,避免了可见屏幕或腕戴设备对医患眼神交流的干扰,同时基于树莓派(Raspberry Pi)本地运行与云端语音识别及大语言模型(Large Language Model, LLM)意图提取相结合的端到端流程,使医生仅需自然语言指令即可自动创建日历事件,从而提升工作效率并改善临床体验。
链接: https://arxiv.org/abs/2604.16352
作者: Diego Mardien,Frank Liu
机构: Arizona State University (亚利桑那州立大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted into CHI 2026 Workshop: Everyday Wearable for Personalized Health and Well-Being
Abstract:Physicians spend nearly half their workday on EHR tasks and administrative work, contributing to burnout and reducing time for direct patient care. We present MDwAIstScheduler, a low-cost, belt-worn voice assistant that allows hands-free calendar management during patient encounters. Hidden beneath a lab coat, the device avoids the eye-contact disruptions caused by visible screens or wrist-worn devices. Running on a Raspberry Pi with cloud-based speech recognition and LLM intent extraction, the system lets clinicians simply say ‘Schedule a follow-up with Mr. Smith next Tuesday at 2’ and automatically creates the calendar event. Our demo showcases this end-to-end pipeline.
[HC-48] Beyond the Townhall: Spatial Anchoring and LLM Agents for Scalable Participatory Urban Planning
【速读】:该论文旨在解决参与式城市规划中因技术门槛高而导致公众难以深度参与的问题,尤其是在推动可持续城市发展过程中如何实现更广泛、更具包容性的公民参与。其解决方案的关键在于构建一个嵌入数字孪生(Digital Twin)的可扩展数字参与平台,通过空间锚定的沉浸式虚拟导览(结合“地点法”记忆策略和音频解说)增强信息记忆效果,并集成两个专为任务设计的大语言模型(LLM)助手:一个提供基于来源的事实澄清,另一个促进反思性讨论。实证研究表明,该方案显著提升了信息 recall 效果,引导参与者从关注个体不便转向关注集体社区层面的可持续收益,从而生成更具建设性和问题导向的反馈,为城市治理提供了可行的民主化参与工具。
链接: https://arxiv.org/abs/2604.16348
作者: Carina I Hausladen,Javier Argota Sánchez-Vaquerizo,Michael Siebenmann,Arthur Capozzi,Sachit Mahajan,Dirk Helbing
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:Participatory urban planning is central to sustainable city-making, yet the technically demanding nature of such interventions often limits meaningful involvement by diverse publics. We introduce a scalable digital participation platform that embeds sustainability projects within a navigable digital twin. Citizens experience a guided virtual walkthrough with audio narration employing the method of loci and spatial anchoring to support mnemonic encoding and recall. This immersive interface is augmented by two purpose-built LLM assistants: one delivers source-grounded factual clarifications, while the other facilitates reflective discussion. We evaluated this system in a randomized controlled online experiment (N = 195) against conventional industry practices (static visualizations and text-based consultations). Results show that spatially anchored immersive presentation significantly improved information recall, which substantially shifted participants’ attention from individual inconveniences to collective, community-oriented sustainability benefits. Consequently, participants provided significantly more constructive, solution-focused feedback to the (simulated) municipality. These findings establish a practical tool for cities and policymakers to foster inclusive, democratic participation in sustainability transitions.
[HC-49] Lean Atlas: An Integrated Proof Environment for Scalable Human-AI Collaborative Formalization
【速读】:该论文旨在解决AI生成形式化数学证明中存在的语义幻觉问题,即尽管形式化证明通过了类型检查器的逻辑正确性验证,但其命题和定义可能未能准确表达预期的数学内容。为应对这一挑战,作者提出一种“人类在环路”(human-in-the-loop)协作范式,由人类科学家负责命题与定义的语义验证,AI则完成形式化推理。解决方案的关键在于开发了Lean Atlas工具,其核心组件Lean Compass能够基于目标定理集自动提取受项目特定语义正确性影响的节点集合,显著缩小人工语义审查的候选范围;同时提出“对齐的Lean代码”(aligned Lean code)作为质量标准,以标识经过人类语义验证的形式化代码。
链接: https://arxiv.org/abs/2604.16347
作者: Banri Yanahama,Akiyoshi Sannai
机构: Nyx Foundation(奈克斯基金会); Kyoto University(京都大学); National Institute of Informatics(日本信息研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 12 pages, 3 figures, 2 tables. Submitted to AIPV 2026 (1st Workshop on AI, Proof and Verification, co-located with FM 2026)
Abstract:AI-driven autoformalization of mathematics is advancing rapidly. However, the type checker of a proof assistant guarantees only the logical correctness of proofs; it does not verify whether propositions and definitions faithfully capture their intended mathematical content. Consequently, AI-generated formal proofs can exhibit semantic hallucination-passing the type checker yet failing to express the intended mathematics. We propose a human-in-the-loop approach in which human scientists and AI collaboratively produce formal proofs, with humans responsible for the semantic verification of propositions and definitions. To realize this approach, we develop Lean Atlas, a Lean 4 tool that visualizes the dependency graph of a Lean 4 project as an interactive web viewer, enabling human scientists to grasp the overall structure of a formalization efficiently. Its core feature, Lean Compass, is an algorithm that, given a selected theorem set, automatically extracts the project-specific nodes whose semantic correctness can affect those target statements, thereby reducing the candidate set for semantic review in large-scale formalizations. We further define aligned Lean code as formalization code that has undergone human semantic verification, and propose it as a quality standard for AI-generated formalizations. We evaluate the tool on six Lean 4 formalization projects with different structural characteristics; proof-heavy projects (PrimeNumberTheoremAnd, Carleson, Brownian Motion) achieved 94-99% average node reduction, a 6-theorem milestone subset of FLT achieved 59.8%, mixed PhysLib 69.0%, and definition-heavy XMSS 27.3%. Lean Atlas is available as open-source software at this https URL .
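摘要描述的 Lean Compass 本质上是在依赖图上做可达性分析:从目标定理集出发沿依赖边遍历,取与项目内声明的交集,即为需要人工语义审查的候选节点。下面是该思路的一个极简示意(函数名与示例数据均为假设,非论文工具的实际 API):

```python
from collections import deque

def review_candidates(deps, targets, project_nodes):
    """Nodes whose semantic correctness can affect the target theorems:
    everything reachable from the targets along dependency edges,
    restricted to project-local declarations (library nodes are trusted)."""
    seen = set(targets)
    queue = deque(targets)
    while queue:
        node = queue.popleft()
        for dep in deps.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen & project_nodes

# Hypothetical project: "unused" and "def_y" do not feed the target,
# so they drop out of the review set -- the source of the reported
# 94-99% node reduction on proof-heavy projects.
deps = {
    "main_thm": ["lemma_a", "Mathlib.foo"],  # library dep is excluded below
    "lemma_a": ["def_x"],
    "unused": ["def_y"],
}
project = {"main_thm", "lemma_a", "def_x", "unused", "def_y"}
print(sorted(review_candidates(deps, {"main_thm"}, project)))
```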
[HC-50] DR. INFO at the Point of Care: A Prospective Pilot Study of an Agentic AI Clinical Assistant
【速读】:该论文旨在解决临床文档撰写与信息检索占用医生超过一半工作时间的问题,这导致认知过载和职业倦怠。其解决方案的关键在于开发并评估一个名为DR. INFO的代理型人工智能(Agentic AI)临床助手,在真实临床实践中提升时间效率和决策支持能力。研究结果显示,跨专科、多职业阶段的医生在使用DR. INFO后报告了显著的时间节省(平均评分4.27/5)和稳定的决策支持(4.16/5),且净推荐值(NPS)高达81.2,表明其具有良好的用户接受度和潜在应用价值。
链接: https://arxiv.org/abs/2604.16346
作者: Rogerio Corga Da Silva,Miguel Romano,Tiago Mendes,Marta Isidoro,Sandhanakrishnan Ravichandran,Shivesh Kumar,Michiel van der Heijden,Olivier Fail,Valentine Emmanuel Gnanapragasam
机构: Synduct GmbH( Synduct GmbH); University of Minho, School of Medicine(米尼奥大学医学院); Unidade Local de Saúde (ULS) do Alto Minho, Intensive Care Medicine(阿尔托明霍地方卫生单位重症医学科); ULS Almada-Seixal(阿尔玛达-塞克萨尔地方卫生单位)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:Background: Clinical documentation and information retrieval consume over half of physicians working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score. Non-parametric methods were used throughout. Results: Clinicians reported high perceived time saving (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The Net Promoter Score was 81.2, with no detractors. Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted.
[HC-51] Bridging the Experimental Last Mile: Digitizing Laboratory Know-How for Safe AI-Assisted Support
【速读】:该论文旨在解决在教育和探索性研究环境中,传统人工实验仍占主导地位时,如何有效整合现场实践知识以提升实验室操作安全性与可靠性的问题。其核心挑战在于,常规实验手册往往忽略物理操作细节和现场特定规则,而这些信息对安全执行实验至关重要。解决方案的关键在于构建一个“人在回路”(human-in-the-loop)的AI助手,该系统融合第一人称实验视频、多模态人工智能(Multimodal AI)与检索增强生成(Retrieval-Augmented Generation, RAG),从学生录制的实验视频中提取站点特异性的实验室知识(如操作技巧和语音确认),并基于此生成可溯源的指导建议。通过两层安全设计——RAG限制信息来源和严格系统提示约束——显著降低了幻觉风险,实验证明该框架能在人类监督下可靠地辅助实验室实践,而非替代人类判断。
链接: https://arxiv.org/abs/2604.16345
作者: Akira Miura,Yuki Sasahara,Momoka Demura,Yuji Masubuchi,Tetsuya Asai,Chikahiko Mitsui
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 32 pages in total (main 13 pages, appendix 19 pages), 2 main figures, 1 main table
Abstract:Advances in Materials Informatics have accelerated the development of Self-Driving Laboratories (SDLs), yet human-led experiments remain standard in many educational and exploratory research settings. In such environments, practical know-how, including operational details and site-specific rules, is essential for safe and reliable laboratory work. In this proof-of-concept study, we developed a human-in-the-loop AI assistant that combines first-person experimental video, multimodal AI, and retrieval-augmented generation (RAG). Using powder X-ray diffraction experiments and student-recorded video data as inputs, the system extracts site-specific laboratory knowledge from recorded procedures, including physical techniques and audible confirmation that conventional manuals could omit. It then provides grounded responses based on the resulting manual. To reduce the risk of unsupported outputs, the system employs a two-layer safety design: source restriction through RAG and strict system-prompt constraints. Instructor-based evaluation showed alignment with expected guidance for questions covered by the manual. For out-of-scope queries, the system appropriately refused to answer, indicating a reduced risk of hallucination. Expert evaluation further indicated that the generated advisory reports were useful and safe (utility: 3.25/4.00; safety: 4.00/4.00). These results suggest a framework in which AI supports laboratory practice under explicit human supervision rather than replacing human judgment.
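The two-layer safety design described above can be illustrated with a toy retrieval gate. The scoring function, threshold, and manual snippets below are invented for illustration and are not the paper's implementation; the sketch only shows the pattern: answer solely from retrieved manual text, and refuse when the query is poorly covered.

```python
# Toy sketch of RAG-style source restriction with a refusal policy
# (illustrative only; not the paper's system).

def overlap_score(query: str, passage: str) -> float:
    """Fraction of query terms that appear in the passage (toy retriever)."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def answer_with_gate(query: str, manual: list, threshold: float = 0.5):
    """Return the best-matching manual passage, or None (refuse) if coverage is low."""
    best = max(manual, key=lambda s: overlap_score(query, s))
    if overlap_score(query, best) < threshold:
        return None  # refuse: question is outside the grounded manual
    return best

manual = [
    "mount the sample holder before starting the powder x-ray diffraction scan",
    "wear gloves when loading the powder sample into the holder",
]
print(answer_with_gate("how do I start the x-ray diffraction scan", manual))
print(answer_with_gate("what is the melting point of tungsten", manual))  # refused
```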
[HC-52] Discovering the Latency-Elastic Trust Window: A Patentable UX Governor for Real-Time Payment Confirmation in WebRTC Streaming
【速读】:该论文旨在解决直播平台中支付确认延迟(confirmation latency)对用户体验(UX)和用户行为的负面影响问题,尤其是在WebRTC实时流媒体场景下,延迟不仅影响转化率与参与度,还破坏了互动节奏并削弱用户信任。其解决方案的关键在于提出一种名为“延迟弹性信任窗口”(Latency-Elastic Trust Window, LETW)的控制层机制,该机制通过为每个会话动态计算延迟预算、自适应调整用户反馈,并引入抖动感知阈值来保护对话节奏;同时结合危险模型(hazard model)与行为弹性曲线(behavioral elasticity curve),构建了一个基于遥测数据的延迟阈值管理框架,从而实现将网络状态映射为用户可感知的信任维持模式,为工程团队提供可操作的阈值以保障支付反馈的可信性。
链接: https://arxiv.org/abs/2604.16344
作者: Anton Malinovskiy
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 13 pages, 8 tables, 2 listings, 3 figures
Abstract:Live streaming platforms increasingly embed payments into the interaction loop. In these systems, payment confirmation latency is not merely a back-end performance metric but a front-end UX variable that shapes user behavior, trust, and retention. This paper introduces a novel invention candidate - the Latency-Elastic Trust Window (LETW) - a control layer that computes a per-session latency budget, adapts UX feedback, and enforces jitter-aware thresholds to protect conversational rhythm. We model confirmation latency as a behavioral driver in WebRTC streaming, quantify its effect on conversion and engagement, and propose a telemetry-driven framework to manage latency thresholds. We combine a hazard model with a behavioral elasticity curve and present simulated, calibration-based results that mirror real-world response patterns. Our findings indicate that latency beyond two seconds materially reduces tip completion and repeat engagement, and that latency variance is as important as mean latency. We further formalize the LETW as a patentable UX governor that maps network conditions to user-facing modes, and we provide operational thresholds for engineering teams to enforce trust-preserving payment feedback.
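The per-session budget and mode mapping can be sketched as follows. All thresholds, mode names, and the jitter penalty are hypothetical stand-ins rather than the LETW specification; the sketch only mirrors the idea that the usable window tightens as jitter grows.

```python
# Hypothetical sketch of a jitter-aware "trust window" governor
# (thresholds and mode names are invented for illustration).
import statistics

def latency_budget(samples_ms, cap_ms=2000.0):
    """Per-session budget: a cap minus a jitter penalty (one stddev)."""
    jitter = statistics.pstdev(samples_ms)
    return max(500.0, cap_ms - jitter)

def ux_mode(latency_ms, budget_ms):
    if latency_ms <= 0.5 * budget_ms:
        return "instant-confirm"      # show confirmation immediately
    if latency_ms <= budget_ms:
        return "progress-feedback"    # show an animated pending state
    return "deferred-receipt"         # acknowledge now, confirm out of band

session = [300.0, 400.0, 350.0, 500.0]   # observed confirmation latencies (ms)
b = latency_budget(session)
print(round(b, 1), ux_mode(450.0, b), ux_mode(2500.0, b))
```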
[HC-53] Elder-Sim: A Psychometrically Validated Platform for Personality-Stable Elderly Digital Twins
【速读】:该论文旨在解决生成式 AI(Generative AI)在老年数字孪生(digital twin)应用中面临的人格漂移(personality drift)问题,即大语言模型(LLM)在多次交互中表现出不一致的人格特征,从而影响老年心理健康干预模拟的可靠性。解决方案的关键在于提出并实现了一个名为 ELDER-SIM 的多角色老年护理对话平台,其核心创新包括:(1)基于大五人格(Big Five, OCEAN)的结构化人格设定;(2)嵌入以贝克认知行为疗法(CBT)为理论基础的认知概念图(Cognitive Conceptualization Diagram, CCD),用于稳定认知表征;(3)引入本地化长时记忆模块与 LoRA 微调策略(基于 CHARLS 数据集的 19,717 条指令对),显著提升人格一致性。实验表明,该框架通过心理测量学指标(如 Cronbach’s α 和 ICC)验证了人格稳定性,尤其 CCD 和 LoRA 分别在一致性提升和整体可靠性上表现最优,为老年数字孪生的纵向模拟和临床前仿真评估提供了可复现的技术路径。
链接: https://arxiv.org/abs/2604.16343
作者: Jiaqing Wang,Zhongfang Yang,Xingyuan Zhu,Zong’an Huang,Hao Wang,Li Tian,Ying Cao,Xiaomin Qu,Xiang Qi,Bei Wu,Zheng Zhu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Background: LLMs enable patient-facing conversational agents, creating a pathway toward digital twins that capture older adults’ lived experiences and behavioral responses across time. A central barrier is personality drift – inconsistent trait expression across repeated interactions – which undermines reliability of generated trajectories and intervention-response simulation in geriatric care. Objective: To develop ELDER-SIM, a multi-role elderly-care conversational platform for building personality-stable digital twin agents, and to propose a psychometric validation framework for quantifying personality consistency in LLM-based agents. Methods: ELDER-SIM was implemented via n8n workflow orchestration with local LLM inference (Ollama/vLLM), integrating (1) Big Five (OCEAN) trait specifications, (2) a Cognitive Conceptualization Diagram (CCD) grounded in Beck’s CBT framework, and (3) a MySQL-based long-term memory module. Ablation studies across four conditions – Baseline, +Memory, +CCD, and +LoRA (fine-tuned on 19,717 instruction pairs from CHARLS) – were evaluated via Cronbach’s α, ICC, and role discrimination accuracy. Results: Reliability was acceptable to excellent across conditions (Cronbach’s α: 0.70–0.94; ICC: 0.85–0.96). Role discrimination improved from 83.3% (Baseline) to 88.9% (+Memory), 94.4% (+CCD), and 97.2% (+LoRA). CCD produced the largest consistency gain (mean α 0.702 → 0.892), while LoRA achieved the highest overall consistency (α 0.940; ICC 0.958). Conclusions: ELDER-SIM provides a psychometrically validated approach for constructing personality-consistent elderly digital twin agents. Structured cognitive modeling and domain adaptation reduce personality drift, supporting reliable longitudinal simulation for elderly mental health care and reproducible in silico evaluation before clinical deployment.
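The reliability figures above rest on standard psychometrics. A minimal sketch of Cronbach's alpha follows, computed from invented item-by-respondent ratings; only the formula is the textbook definition, the data are not from the study.

```python
# Minimal Cronbach's alpha (illustrative data, standard formula):
# alpha = k/(k-1) * (1 - sum(item variances) / variance of respondent totals)
import statistics

def cronbach_alpha(items):
    """items: one list of scores per item, all rated by the same respondents."""
    k = len(items)
    item_vars = sum(statistics.pvariance(it) for it in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sums
    total_var = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Three items, four respondents: highly consistent ratings -> alpha near 1.
items = [
    [4.0, 3.0, 5.0, 2.0],
    [4.0, 3.0, 5.0, 2.0],
    [5.0, 3.0, 4.0, 2.0],
]
print(round(cronbach_alpha(items), 3))
```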
[HC-54] SAGE: Sensor-Augmented Grounding Engine for LLM-Powered Sleep Care Agent
【速读】:该论文旨在解决睡眠健康干预中普遍存在的“数据-行动鸿沟”(Data-Action Gap)问题,即用户虽能通过可穿戴设备和健康应用获取大量睡眠相关数据,却难以将其转化为有效的行为改变。现有方案如静态仪表盘、规则驱动代理和缺乏个性化数据支撑的大型语言模型(LLM)代理均无法有效提升用户的理解和信任。其解决方案的关键在于提出SAGE(Sensor-Augmented Grounding Engine),该系统通过将来自传感器的连续睡眠、生理和活动数据标准化为可查询的时间序列层,实现两个核心功能:一是选择性系统触发监测,仅在检测到与个体基线显著偏离时发出通知以减少警报疲劳;二是支持用户自然语言提问并自动转化为可执行数据库查询,确保所有响应基于精确的时间段、比较基准和指标数据,从而增强个性化、可追溯性和可信度,为基于证据的睡眠护理信息传递开辟了新的设计空间。
链接: https://arxiv.org/abs/2604.16342
作者: Hansoo Lee,Yoonjae Cho,Sonya S.Kwak,Rafael A. Calvo
机构: Imperial College London (帝国理工学院); KIST (韩国科学技术院)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to the Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26). 6 pages
Abstract:Sleep is vital for health, yet access to data alone does not guarantee improvement. While wearables and health apps enable tracking, users face a “Data-Action Gap,” struggling to interpret metrics and translate them into action. Current interventions fail to bridge this: static dashboards lack context, rule-based agents rely on rigid scripts, and LLM-agents lack grounding in personal data, causing trust issues. We propose SAGE (Sensor-Augmented Grounding Engine) for an LLM-powered sleep care agent. SAGE normalizes continuous sleep, physiological, and activity data from the sensors into a queryable time-series layer. It supports (1) selective system-initiated monitoring that triggers notifications only upon detecting meaningful deviations against personal baselines to reduce alert fatigue, and (2) user-initiated QA where natural language is translated into executable database queries. By ensuring responses are grounded in precise period, comparison, and metric data, SAGE aims to enhance personalization, traceability, and trust, articulating a novel design space for evidence-based messaging in sleep care.
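The selective, baseline-relative trigger can be sketched as a simple z-score rule. The metric, threshold, and data below are assumptions for illustration, not SAGE's actual monitoring logic.

```python
# Illustrative deviation monitor: notify only when today's metric departs
# meaningfully from the user's personal baseline (toy data and threshold).
import statistics

def should_notify(history, today, z_threshold=2.0):
    """Trigger only on a deviation beyond z_threshold stddevs of the baseline."""
    mean = statistics.mean(history)
    sd = statistics.pstdev(history)
    if sd == 0:
        return today != mean
    return abs(today - mean) / sd > z_threshold

baseline_sleep_hours = [7.1, 6.8, 7.4, 7.0, 6.9, 7.2, 7.1]
print(should_notify(baseline_sleep_hours, 7.0))   # within baseline -> stay quiet
print(should_notify(baseline_sleep_hours, 4.5))   # large deviation -> notify
```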
[HC-55] Deep Learning for Virtual Reality User Identification: A Benchmark
【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)环境中用户身份识别的安全性问题,尤其是在制造等场景下保护工人身份和确保设备访问安全。其解决方案的关键在于利用VR头显与控制器采集的运动轨迹时间序列数据作为行为生物特征,并系统性地评估多种深度学习架构在大规模数据集(Who is Alyx VR dataset,包含71名用户)上的识别性能,包括传统模型(如LSTM、GRU、CNN、TCN、Transformer)和新兴的状态空间模型(State Space Model, SSM),从而建立适用于VR场景的用户识别基准性能指标,为未来隐私保护的身份认证系统提供技术基础。
链接: https://arxiv.org/abs/2604.16341
作者: Davide Frizzo,Fabrizio Genilotti,David Petrovic,Arianna Stropeni,Francesco Borsatti,Davide Dalle Pezze,Riccardo De Monte,Manuel Barusco,Gian Antonio Susto
机构: University of Padova (帕多瓦大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Virtual Reality (VR) applications require robust user identification systems to ensure secure access to equipment and protect worker identities. Motion tracking data from VR headsets and controllers has emerged as a powerful behavioral biometric, with recent studies demonstrating identification accuracies exceeding 94% across a large user base. However, the application of modern deep learning architectures, particularly State Space Models (SSM), to VR scenarios remains largely unexplored. In this work, we benchmark user identification performance on the large-scale Who is Alyx VR dataset, which gathers data from 71 users playing the popular Half-Life: Alyx game. We evaluate both established architectures (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), Temporal Convolutional Network (TCN), Transformer) and the emerging SSMs on time series motion data. Our results provide the first comprehensive benchmark of state-of-the-art and novel architectures for VR user identification, establishing baseline performance metrics for future privacy-preserving authentication systems in manufacturing environments.
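As a toy illustration of the identification setup (deliberately much simpler than the benchmarked deep models), the sketch below summarizes a motion time series with two statistics and matches users by nearest centroid; the motion values are invented.

```python
# Toy behavioral-biometric identification: statistical features + nearest
# centroid. Real systems replace this with LSTM/GRU/CNN/TCN/Transformer/SSM
# encoders over headset and controller trajectories.
import statistics

def features(series):
    return (statistics.mean(series), statistics.pstdev(series))

def identify(query, enrolled):
    """Return the enrolled user whose feature vector is closest to the query's."""
    qf = features(query)
    def dist(user):
        uf = features(enrolled[user])
        return (qf[0] - uf[0]) ** 2 + (qf[1] - uf[1]) ** 2
    return min(enrolled, key=dist)

enrolled = {
    "user_a": [0.1, 0.2, 0.1, 0.3, 0.2],   # small, smooth head motion
    "user_b": [1.0, 1.8, 0.4, 2.0, 1.2],   # large, jerky head motion
}
print(identify([0.2, 0.1, 0.2, 0.3, 0.1], enrolled))  # closer to user_a
```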
[HC-56] How Can Explainable Artificial Intelligence Improve Trust and Transparency in Medical Diagnosis Systems? ALT
【速读】:该论文旨在解决人工智能在医疗诊断系统中因缺乏透明度而导致的可信度不足问题,尤其是在临床实践中,现有模型多为“黑箱”模式,限制了医生对AI决策逻辑的理解与信任。其解决方案的关键在于引入可解释人工智能(Explainable Artificial Intelligence, XAI),通过提供清晰的决策解释来增强用户对AI推荐的理解、安全感知和采纳意愿,实证研究表明XAI知识与信任水平(r = 0.48, p = 0.01)及有用性感知(r = 0.60, p = 0.001)显著正相关,从而推动AI在医疗决策支持系统中的有效整合。
链接: https://arxiv.org/abs/2604.16340
作者: Altynbek Seitenov,Ainur Nurzhanova,Azhar Bekbussinova,Yerassyl Bolatkan
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 15 pages, 22 figures, survey study on explainable AI in healthcare decision support systems
Abstract:The growing adoption of artificial intelligence in healthcare has raised concerns about the transparency and trustworthiness of AI-driven medical diagnosis systems. Many existing models operate as black boxes, limiting clinicians’ ability to understand how decisions are made. Explainable Artificial Intelligence (XAI) has been proposed as a solution to improve transparency, interpretability, and trust in AI-assisted medical tools. This study investigates the relationship between explainability and trust in AI-based diagnostic systems. A structured survey of 30 medical students was conducted to examine the influence of XAI understanding, confidence in AI decisions, perceived usefulness, and adoption intentions. The results indicate that explanations significantly increase trust, clarity, and perceived safety of AI recommendations. Knowledge of XAI showed a positive correlation with trust (r = 0.48, p = 0.01) and perceived usefulness (r = 0.60, p = 0.001). The findings suggest that explainability is a key factor for successful integration of AI in healthcare decision support systems. While AI explanations improve transparency and trust, participants still prefer AI to function as a support tool rather than replacing human clinical judgment.
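The reported associations are plain Pearson correlations. A minimal sketch over made-up Likert-style responses follows; only the formula is standard, the data and resulting r are illustrative.

```python
# Pearson correlation coefficient, the statistic behind "r = 0.48, p = 0.01"
# style findings (data below are invented).
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

xai_knowledge = [1, 2, 2, 3, 4, 4, 5]   # toy 1-5 Likert responses
trust         = [2, 2, 3, 3, 3, 5, 4]
print(round(pearson_r(xai_knowledge, trust), 2))
```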
[HC-57] Multimodal Digital Sensing of Early-Life Laying Hens: A Pilot Study Integrating Thermal Acoustic Optical-Flow and Environmental Data
【速读】:该论文旨在解决早期发育对产蛋鸡长期福利的深远影响难以量化监测的问题,现有方法受限于主观评估和单一模态工具。解决方案的关键在于构建一个融合热成像、声学记录、基于光流的视频分析及环境监测的多模态传感框架,通过系统性采集与分析从出壳至20周龄期间的生理与行为特征,识别出具有显著发育效应的指标(如足部温度η²=0.51)、声学谱特征的系统性变化(p<0.001)以及行为反应随年龄下降的趋势(t=28.12, p=0.00126),并验证了各模态内部一致性高(r=0.85–0.96)且跨模态关联有限,从而为精准家禽养殖中的福利相关监测提供了可并行实施的技术路径。
链接: https://arxiv.org/abs/2604.16307
作者: Yashan Dhaliwal,Daniel Essien,Suresh Neethirajan
机构: 未知
类目: Multimedia (cs.MM); Human-Computer Interaction (cs.HC)
备注: 29 pages, 11 figures, 5 Tables
Abstract:Early-life development strongly influences long-term welfare in laying hens, yet monitoring remains limited by subjective assessment and single-modality tools. This pilot study evaluated the feasibility of a multimodal sensing framework integrating thermal imaging, acoustic recording, optical-flow-based video analysis, and environmental monitoring to characterize physiological and behavioural development from hatch to 20 weeks. One hundred fifty Lohmann LSL-Lite chicks were housed across five controlled rooms; thermal and environmental data were collected system-wide, while detailed audio and video analyses focused on one representative room. Weekly aggregated features included head and foot surface temperatures, acoustic spectral descriptors, optical-flow movement responses to caretaker entry, and ambient conditions. Thermal imaging showed age-related increases and stabilization of peripheral temperatures, with foot temperature exhibiting a strong developmental effect (eta squared = 0.51). Acoustic features changed systematically across weeks (p < 0.001), consistent with vocal maturation. Optical-flow analysis revealed pronounced early reactivity to caretaker presence that declined with age (weeks 5 to 10 versus 11 to 20: t = 28.12, p = 0.00126). Z-score-normalized multimodal trajectories and correlation analysis (false discovery rate q < 0.05) showed strong within-modality consistency (r = 0.85 to 0.96) and selective associations between humidity and acoustic features (r = 0.65 to 0.70), while thermal, acoustic, and behavioural domains remained largely independent. This pilot establishes baseline multimodal developmental patterns and supports parallel sensing for welfare-relevant monitoring in precision poultry farming.
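The "false discovery rate q < 0.05" filter refers to the standard Benjamini-Hochberg procedure; a compact sketch with invented p-values:

```python
# Benjamini-Hochberg FDR control: find the largest rank k with
# p_(k) <= k*q/m, then keep the k smallest p-values (toy p-values).

def benjamini_hochberg(pvals, q=0.05):
    """Return a keep/reject flag per p-value at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            max_k = rank
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            keep[i] = True
    return keep

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
print(benjamini_hochberg(pvals))
```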
[HC-58] Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
【速读】:该论文旨在解决产品团队在评估大语言模型(Large Language Models, LLMs)驱动的产品时面临的挑战,尤其是传统评估方法因LLM的不可预测性而失效的问题。研究通过访谈19位来自不同行业的从业者,识别出从非正式“直觉判断”到组织层面元工作(meta-work)的十种评估实践,并揭示了除已知四类挑战外的一个新问题——结果-可操作性差距(results-actionability gap),即从业者虽能收集评估数据,却难以将其转化为具体改进措施。解决方案的关键在于提出系统化策略以弥合这一差距,推动从业者从临时性的解释性实践(如“vibe checks”)向结构化评估过渡,强调这些解释性实践是对LLM特性的一种必要适应,而非方法论缺陷。
链接: https://arxiv.org/abs/2604.16304
作者: Willem van der Maden,Malak Sadek,Ziang Xiao,Aske Mottelson,Q. Vera Liao,Jichen Zhu
机构: IT University of Copenhagen (哥本哈根信息技术大学); Cambridge University (剑桥大学); Johns Hopkins University (约翰霍普金斯大学); University of Michigan (密歇根大学)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:How do product teams evaluate LLM-powered products? As organizations integrate large language models (LLMs) into digital products, their unpredictable nature makes traditional evaluation approaches inadequate, yet little is known about how practitioners navigate this challenge. Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices spanning informal ‘vibe checks’ to organizational meta-work. Beyond confirming four documented challenges, we introduce a novel fifth we call the results-actionability gap, in which practitioners gather evaluation data but cannot translate findings into concrete improvements. Drawing on patterns from successful teams, we contribute strategies to bridge this gap, supporting practitioners’ formalization journey from ad-hoc interpretive practices (e.g., vibe checks) toward systematic evaluation. Our analysis suggests these interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures. For HCI researchers, this presents a research opportunity to support practitioners in systematizing emerging practices rather than developing new evaluation frameworks.
计算机视觉
[CV-0] MUA: Mobile Ultra-detailed Animatable Avatars
【速读】:该论文旨在解决生成高保真、可动画化的全身数字人时面临的两大难题:一方面,现有方法难以同时实现动态几何与外观的高保真度;另一方面,轻量级模型虽能部署于资源受限设备(如VR头显),但常因表面动态受限、细节缺失而产生明显伪影。为弥合这一差距,作者提出了一种名为“小波引导的多层级空间因子化混合形状”(Wavelet-guided Multi-level Spatial Factorized Blendshapes)的新颖可动画化身表示方法,并设计了相应的蒸馏流程,将预训练超高质量教师模型中的运动感知服装动态和细粒度外观细节迁移至紧凑高效的表示中。其核心创新在于结合多层级小波频谱分解与纹理空间低秩结构因子化,在保持视觉合理动态的同时,使计算成本降低至原模型的1/2000,模型体积缩小10倍,且在桌面PC上实现超过180 FPS的帧率,在Meta Quest 3上实现原生实时(24 FPS)性能,显著提升了高保真数字人在沉浸式应用中的实用性。
链接: https://arxiv.org/abs/2604.18583
作者: Heming Zhu,Guoxing Sun,Marc Habermann
机构: Max Planck Institute for Informatics (马克斯·普朗克信息研究所); Saarland Informatics Campus (萨尔兰信息学园区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: Ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.
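The parameter savings from low-rank structural factorization can be estimated with back-of-the-envelope arithmetic: a dense m x n matrix costs m*n parameters, while a rank-r factorization costs r*(m + n). The shapes and rank below are illustrative, not the paper's actual configuration.

```python
# Why low-rank factorization shrinks a blendshape model (toy shapes/rank).

def dense_params(m, n):
    return m * n                  # full matrix storage

def factorized_params(m, n, r):
    return r * (m + n)            # two thin factors U (m x r) and V (r x n)

m, n, r = 4096, 4096, 16          # texture-space grid x blendshape basis, rank 16
full = dense_params(m, n)
low = factorized_params(m, n, r)
print(full // low)                 # compression ratio from factorization alone
```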
[CV-1] ReCap: Lightweight Referential Grounding for Coherent Story Visualization
【速读】:该论文旨在解决故事可视化(Story Visualization)中跨帧一致性难以维持的问题,特别是角色身份(character identity)、空间配置和风格连贯性在生成序列图像时易出现漂移或不一致的现象。传统方法依赖显式记忆库、架构扩展或辅助语言模型(auxiliary language models),导致参数量激增和推理开销显著增加。解决方案的关键在于提出轻量级一致性框架 ReCap,其核心创新包括:1)CORE(COnditional frame REferencing)模块,将代词(anaphors)作为视觉锚点,在角色被代词指代时激活,并基于前一帧条件化当前帧,实现选择性跨帧一致性传播,仅引入149K额外参数;2)SemDrift(Guided Semantic Drift Correction)机制,在训练阶段通过对比预训练 DINOv3 视觉嵌入与去噪器表示,纠正因文本模糊或指代不清导致的角色外观漂移,且零推理成本。该方案在两个主流基准上显著优于现有最优方法 StoryGPT-V,验证了其在角色一致性上的有效性。
链接: https://arxiv.org/abs/2604.18575
作者: Aditya Arora,Akshita Gupta,Pau Rodriguez,Marcus Rohrbach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Diffusion Models, Story Visualization
Abstract:Story Visualization aims to generate a sequence of images that faithfully depicts a textual narrative that preserve character identity, spatial configuration, and stylistic coherence as the narratives unfold. Maintaining such cross-frame consistency has traditionally relied on explicit memory banks, architectural expansion, or auxiliary language models, resulting in substantial parameter growth and inference overhead. We introduce ReCap, a lightweight consistency framework that improves character stability and visual fidelity without modifying the base diffusion backbone. ReCap’s CORE (COnditional frame REferencing) module treats anaphors, in our case pronouns, as visual anchors, activating only when characters are referred to by a pronoun and conditioning on the preceding frame to propagate visual identity. This selective design avoids unconditional cross-frame conditioning and introduces only 149K additional parameters, a fraction of the cost of memory-bank and LLM-augmented approaches. To further stabilize identity, we incorporate SemDrift (Guided Semantic Drift Correction) applied only during training. When text is vague or referential, the denoiser lacks a visual anchor for identity-defining attributes, causing character appearance to drift across frames, SemDrift corrects this by aligning denoiser representations with pretrained DINOv3 visual embeddings, enforcing semantic identity stability at zero inference cost. ReCap outperforms previous state-of-the-art, StoryGPT-V, on the two main benchmarks for story visualization by 2.63% Character-Accuracy on FlintstonesSV and by 5.65% on PororoSV, establishing a new state-of-the-art character consistency on both benchmarks. Furthermore, we extend story visualization to human-centric narratives derived from real films, demonstrating the capability of ReCap beyond stylized cartoon domains.
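The selective trigger in CORE can be caricatured as a pronoun-gated conditioning rule: condition on the previous frame only when the caption refers to a character by an anaphor. The pronoun list, captions, and frame indexing below are illustrative assumptions, not ReCap's implementation.

```python
# Toy pronoun-gated cross-frame conditioning (illustrative only).

PRONOUNS = {"he", "she", "they", "him", "her", "them", "his", "their"}

def has_anaphor(caption):
    return any(tok.strip(".,!?").lower() in PRONOUNS for tok in caption.split())

def conditioning_frames(captions):
    """For each frame index, the previous frame to condition on, or None."""
    return [i - 1 if i > 0 and has_anaphor(c) else None
            for i, c in enumerate(captions)]

story = [
    "Fred walks into the kitchen.",
    "He opens the fridge.",          # anaphor -> condition on frame 0
    "Wilma reads a book outside.",   # new subject -> no cross-frame condition
]
print(conditioning_frames(story))    # [None, 0, None]
```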
[CV-2] -REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
【速读】:该论文旨在解决视觉-语言编码器的两大核心问题:一是语言与密集视觉特征之间的对齐能力较弱,影响开放词汇语义分割等任务性能;二是细粒度视觉表征导致的token数量过高,限制了其在长视频场景下的可扩展性。解决方案的关键在于提出T-REN(Text-aligned Region Encoder Network),这是一种轻量级网络结构,在冻结的视觉主干之上进行设计,通过将每个语义区域内patch-level表示池化为区域级token,并将其与区域级文本标注对齐,从而实现高效且强对齐的跨模态理解。该方法仅引入3.7%额外参数,显著提升了dense cross-modal理解能力,同时将图像和视频的token数量分别减少超过24倍和187倍。
链接: https://arxiv.org/abs/2604.18573
作者: Savya Khosla,Sethuraman T V,Aryan Chadha,Alex Schwing,Derek Hoiem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense cross-modal understanding while reducing the token count by orders of magnitude. Specifically, T-REN delivers +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing, all while reducing token counts by more than 24x for images and 187x for videos compared to the patch-based vision-language backbone. The code and model are available at this https URL.
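The region-token pooling idea can be sketched as masked mean pooling of patch embeddings: every patch that falls in the same semantic region contributes to one region token. Embeddings and region labels below are toy values, not T-REN's representations.

```python
# Mean-pool patch embeddings per region id, collapsing many patch tokens
# into a few region tokens (toy 2-D embeddings).

def region_tokens(patches, regions):
    """patches: list of embedding vectors; regions: region id per patch."""
    sums, counts = {}, {}
    for emb, rid in zip(patches, regions):
        if rid not in sums:
            sums[rid] = [0.0] * len(emb)
            counts[rid] = 0
        sums[rid] = [s + e for s, e in zip(sums[rid], emb)]
        counts[rid] += 1
    return {rid: [s / counts[rid] for s in vec] for rid, vec in sums.items()}

patches = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
regions = [0, 0, 1, 1]                   # patches 0-1 -> region 0, 2-3 -> region 1
print(region_tokens(patches, regions))   # {0: [2.0, 0.0], 1: [0.0, 3.0]}
```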
[CV-3] Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
【速读】:该论文旨在解决“柏拉图表征假说”(Platonic Representation Hypothesis)是否成立的问题,即不同模态(如文本与图像)训练的神经网络是否会收敛到对现实世界一致的表征。其核心解决方案在于系统性地重新评估现有实验证据的稳健性,发现当前支持该假说的实验结果高度依赖于特定的评估范式:在小规模数据集(约1K样本)上观察到的模态间对齐现象,在扩展至百万级样本时显著退化;且剩余对齐主要体现为粗粒度语义重叠,而非细粒度结构一致性。此外,研究指出此前实验多基于一对一图像-文本配对场景,这在真实世界的多对多关系中不成立,进一步削弱了对齐效果。最终结论是:当前证据不足以支撑跨模态表征收敛的观点,不同模态模型可能学习到同样丰富的世界表征,但未必共享同一表征空间。
链接: https://arxiv.org/abs/2604.18572
作者: A. Sophia Koepke,Daniil Zverev,Shiry Ginosar,Alexei A. Efros
机构: UC Berkeley (加州大学伯克利分校); Technical University Munich, MCML (慕尼黑工业大学, MCML); University of Tübingen, Tübingen AI Center (图宾根大学, 图宾根人工智能中心); Toyota Technical Institute at Chicago (丰田技术学院芝加哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this http URL
Abstract:The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ( \approx 1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
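The mutual-nearest-neighbor alignment metric at the center of the paper's critique can be sketched in a few lines; the 1-D "embeddings" below are invented to show one perfectly aligned case and one misaligned case.

```python
# Mutual-nearest-neighbor alignment: for paired samples embedded by two
# models, count how often sample i's nearest neighbor agrees across spaces.

def nearest(i, emb):
    return min((j for j in range(len(emb)) if j != i),
               key=lambda j: abs(emb[i] - emb[j]))

def mutual_nn_alignment(a, b):
    n = len(a)
    hits = sum(1 for i in range(n) if nearest(i, a) == nearest(i, b))
    return hits / n

vision = [0.0, 0.1, 1.0, 1.1]            # items 0/1 close, items 2/3 close
text_aligned = [5.0, 5.1, 9.0, 9.2]      # same neighborhood structure
text_shuffled = [5.0, 9.0, 5.1, 9.2]     # structure broken
print(mutual_nn_alignment(vision, text_aligned))    # 1.0
print(mutual_nn_alignment(vision, text_shuffled))   # 0.0
```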
[CV-4] MultiWorld: Scalable Multi-Agent Multi-View Video World Models
【速读】:该论文旨在解决现有视频世界模型(video world models)在多智能体(multi-agent)场景下难以准确建模复杂交互关系及跨视角一致性的问题。当前大多数方法仅适用于单智能体环境,无法有效处理真实世界中多个智能体之间的动态交互以及多视角观测的一致性需求。其解决方案的关键在于提出一个统一的框架MultiWorld,包含两个核心组件:一是多智能体条件模块(Multi-Agent Condition Module),用于实现对多个智能体的精确控制;二是全局状态编码器(Global State Encoder),以确保不同视角下的观测具有一致性。该框架支持灵活扩展智能体数量和视角数量,并能并行生成多视角视频,从而在多人游戏环境和多机器人操作任务中显著优于基线方法,在视频保真度、动作跟随能力和多视角一致性方面均取得提升。
链接: https://arxiv.org/abs/2604.18564
作者: Haoyu Wu,Jiwen Yu,Yingtian Zou,Xihui Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures
Abstract:Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present \textbfMultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: this https URL
[CV-5] AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation ACL2026
【速读】:该论文旨在解决推理分割(Reasoning Segmentation)任务中现有方法依赖单一分割令牌(SEG)导致语义推理与空间定位难以显式解耦的问题。其解决方案的关键在于提出AnchorSeg框架,通过构建结构化的条件生成过程,将语言引导的查询库(query banks)分为两类:潜在推理令牌(latent reasoning tokens)用于捕捉中间语义状态,以及分割锚点令牌(segmentation anchor token)提供显式的空间定位信号;同时,将空间条件建模为图像令牌上的因子化分布,使锚点查询决定定位信号而上下文查询提供语义调制,并引入Token–Mask Cycle Consistency(TMCC)训练目标以实现跨分辨率的对齐。此设计显著提升了模型在ReasonSeg测试集上的性能(gIoU 67.7%,cIoU 68.1%)。
链接: https://arxiv.org/abs/2604.18562
作者: Rui Qian,Chuanhang Deng,Qiang Huang,Jian Xiong,Mingxuan Li,Yingbo Zhou,Wei Zhai,Jintao Chen,Dejing Dou
机构: Fudan University (复旦大学); BEDI Cloud
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been accepted to ACL 2026, please refer to this https URL
Abstract:Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token, SEG, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model’s ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token–Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language grounded query banks, AnchorSeg achieves state-of-the-art results on ReasonSeg test set (67.7% gIoU and 68.1% cIoU). All code and models are publicly available at this https URL.
[CV-6] SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy
【Quick Read】: This paper tackles controllable cooperative humanoid manipulation in embodied intelligence, whose core challenges are data scarcity, the complexity of multi-agent coordination, and limited generalization across objects. The key to the proposed unified framework, SynAgent, is Solo-to-Cooperative Agent Synergy, which transfers abundant single-human-object interaction data to multi-human-object-human scenarios. It preserves semantic integrity during motion retargeting with an Interact Mesh built via Delaunay tetrahedralization, combines single-agent pretraining with decentralized multi-agent PPO policy optimization, and finally trains a generative policy via a conditional VAE with multi-teacher distillation that can stably execute specified trajectories. As a result, it significantly outperforms existing methods on both cooperative imitation and trajectory-conditioned control while generalizing well across object geometries.
Link: https://arxiv.org/abs/2604.18557
Authors: Wei Yao, Haohan Ma, Hongwen Zhang, Yunlian Sun, Liangjun Xing, Zhile Yang, Yuanjun Guo, Yebin Liu, Jinhui Tang
Affiliations: Nanjing University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Beijing Normal University; Tsinghua University; Nanjing Forestry University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi-agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single-human data through decentralized training and multi-agent PPO. Finally, we develop a trajectory-conditioned generative policy using a conditional VAE, trained via multi-teacher distillation from motion imitation priors to achieve stable and controllable object-level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory-conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: this http URL
[CV-7] Advancing Vision Transformer with Enhanced Spatial Priors
【Quick Read】: This paper addresses two problems of self-attention in the Vision Transformer (ViT): the lack of explicit spatial priors and quadratic computational complexity, which limit its applicability to computer vision tasks. The key to the proposed Euclidean enhanced Vision Transformer (EVT) is twofold: it adopts a more principled Euclidean distance decay that models spatial relationships more precisely than the Manhattan distance used in the earlier RMT, and it abandons decomposed attention in favor of a spatially independent grouping strategy, giving the model finer control over the number of tokens in each group. These changes strengthen the expressiveness of the spatial priors and task adaptability while retaining efficient modeling capability.
Link: https://arxiv.org/abs/2604.18549
Authors: Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He
Affiliations: MAIS NLPR, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Science and Technology Beijing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TPAMI 2026
Abstract:In recent years, the Vision Transformer (ViT) has garnered significant attention within the computer vision community. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and suffers from quadratic computational complexity, limiting its applicability. To address these issues, we have proposed RMT, a robust vision backbone with explicit spatial priors for general purposes. RMT utilizes Manhattan distance decay to introduce spatial information and employs a horizontal and vertical decomposition attention method to model global information. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements. Firstly, EVT uses a more reasonable Euclidean distance decay to enhance the modeling of spatial information, allowing for a more accurate representation of spatial relationships compared to the Manhattan distance used in RMT. Secondly, EVT abandons the decomposed attention mechanism featured in RMT and instead adopts a simpler spatially-independent grouping approach, providing the model with greater flexibility in controlling the number of tokens within each group. By addressing these modifications, EVT offers a more sophisticated and adaptable approach to incorporating spatial priors into the Self-Attention mechanism, thus overcoming some of the limitations associated with RMT and further enhancing its applicability in various computer vision tasks. Extensive experiments on Image Classification, Object Detection, Instance Segmentation, and Semantic Segmentation demonstrate that EVT exhibits exceptional performance. Without additional training data, EVT achieves 86.6% top1-acc on ImageNet-1k.
[CV-8] MetaCloak-JPEG: JPEG-Robust Adversarial Perturbation for Preventing Unauthorized DreamBooth-Based Deepfake Generation
【Quick Read】: This paper addresses diffusion-based deepfake attacks in generative AI, in particular the risk that personalization techniques such as DreamBooth can be fine-tuned on publicly available face images without user consent to produce photorealistic harmful content. Existing defenses such as PhotoGuard, Anti-DreamBooth, and MetaCloak perturb user images to disrupt fine-tuning, but none accounts for the JPEG compression pipeline universally applied by social platforms, which destroys the protective signal: the round() operation in JPEG quantization has zero gradient in backpropagation, so 60–80% of the protective energy is lost in high-frequency DCT bands. The key to the solution is a Differentiable JPEG (DiffJPEG) module whose forward pass applies standard JPEG compression while the backward pass replaces round() with the identity so gradients can flow. This module is embedded in a JPEG-aware EOT distribution (roughly 70% of augmentations include DiffJPEG) and a curriculum quality-factor schedule (QF from 95 down to 50) inside a bilevel meta-learning framework, ultimately achieving 32.7 dB PSNR and a 91.3% JPEG survival rate under an l-inf perturbation budget of ε=8/255, and outperforming PhotoGuard at all 9 JPEG quality factors.
Link: https://arxiv.org/abs/2604.18537
Authors: Tanjim Rahaman Fardin, S M Zunaid Alam, Mahadi Hasan Fahim, Md Faysal Mahfuz
Affiliations: BRAC University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures
Abstract:The rapid progress of subject-driven text-to-image synthesis, and in particular DreamBooth, has enabled a consent-free deepfake pipeline: an adversary needs only 4-8 publicly available face images to fine-tune a personalized diffusion model and produce photorealistic harmful content. Current adversarial face-protection systems – PhotoGuard, Anti-DreamBooth, and MetaCloak – perturb user images to disrupt surrogate fine-tuning, but all share a structural blindness: none backpropagates gradients through the JPEG compression pipeline that every major social-media platform applies before adversary access. Because JPEG quantization relies on round(), whose derivative is zero almost everywhere, adversarial energy concentrates in high-frequency DCT bands that JPEG discards, eliminating 60-80% of the protective signal. We introduce MetaCloak-JPEG, which closes this gap by inserting a Differentiable JPEG (DiffJPEG) layer built on the Straight-Through Estimator (STE): the forward pass applies standard JPEG compression, while the backward pass replaces round() with the identity. DiffJPEG is embedded in a JPEG-aware EOT distribution (~70% of augmentations include DiffJPEG) and a curriculum quality-factor schedule (QF: 95 to 50) inside a bilevel meta-learning loop. Under an l-inf perturbation budget of eps=8/255, MetaCloak-JPEG attains 32.7 dB PSNR, a 91.3% JPEG survival rate, and outperforms PhotoGuard on all 9 evaluated JPEG quality factors (9/9 wins, mean denoising-loss gain +0.125) within a 4.1 GB training-memory budget.
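The round()-with-identity-gradient trick at the heart of DiffJPEG can be sketched in a few lines. The following is an illustrative toy, not the paper's implementation: the function names and the single scalar quantization step are hypothetical, and a real DiffJPEG layer operates on 8×8 DCT blocks with per-band quantization tables inside an autograd framework.

```python
def jpeg_quantize_forward(coeffs, q):
    # Forward pass: JPEG-style quantize/dequantize of DCT coefficients.
    # round() is piecewise constant, so its true derivative is zero
    # almost everywhere and gradients through it vanish.
    return [round(c / q) * q for c in coeffs]

def jpeg_quantize_backward(grad_out):
    # Backward pass with the Straight-Through Estimator: pretend round()
    # was the identity, so the incoming gradient passes through unchanged
    # and adversarial energy can be steered into JPEG-surviving bands.
    return list(grad_out)

coeffs = [10.2, -3.7, 0.4]                       # toy DCT coefficients
compressed = jpeg_quantize_forward(coeffs, 2.0)  # snapped to the q-grid
grad = jpeg_quantize_backward([0.5, -1.0, 0.25])
```

In an autograd framework such as PyTorch, this pair would be registered as a custom function so that training sees the hard quantization in the forward pass but the identity in the backward pass.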
[CV-9] UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
【Quick Read】: This paper addresses the open problem of effectively combining the Uniform Discrete Diffusion Model (UDM) with reinforcement learning (RL): naively applying existing methods such as GRPO to UDM is unstable to train and yields only marginal gains. The key to the proposed UDM-GRPO framework is twofold: (i) treating the final clean sample as the action, which provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process so that probability paths better align with the pretraining distribution. Two additional strategies, Reduced-Step and CFG-Free, further improve training efficiency. The method substantially improves baselines on multiple text-to-image (T2I) tasks, raising GenEval from 69% to 96% and PickScore from 20.46 to 23.81, and lifting OCR accuracy from 8% to 57%, which validates its generalization ability.
Link: https://arxiv.org/abs/2604.18518
Authors: Jiaqi Wang (1 and 2), Haoge Deng (2), Ting Pan (2), Yang Liu (2), Chengyuan Wang (2), Fan Zhang (2), Yonggang Qi (1), Xinlong Wang (2) ((1) Beijing University of Posts and Telecommunications, (2) Beijing Academy of Artificial Intelligence)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code: this https URL
Abstract:Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at this https URL.
[CV-10] S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
【Quick Read】: This paper addresses the shortcomings of current vision-language models (VLMs) on multi-image reasoning: existing methods are limited to localized reasoning (e.g., answering questions about a pre-specified image index) and lack global visual search and autonomous cross-image comparison. The key to the solution is a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data at three hierarchical levels: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Rather than relying on model-specific attributes such as hallucinations or attention heuristics, the approach uses prompt-driven complexity to create chosen/rejected pairs that transfer across models, markedly improving multi-image reasoning for models such as LLaVA and Qwen-VL while preserving single-image reasoning, thus advancing holistic visual preference alignment.
Link: https://arxiv.org/abs/2604.18512
Authors: Nitish Shukla, Surgan Jandial, Arun Ross
Affiliations: Michigan State University; Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices ("Look at Image 3 and…"), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data significantly enhances multi-image reasoning performance, yielding significant improvements over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while simultaneously strengthening multi-image understanding capabilities, thus advancing the state of the art for holistic visual preference alignment.
[CV-11] XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
【Quick Read】: This paper addresses a bottleneck in training cloud-side Vision-Language-Action (VLA) models: the generic vision-language models (VLMs) used to produce high-quality annotations are pretrained on 2D image-text data and thus lack geometric reasoning and physical-semantic understanding. The key to the solution is XEmbodied, a cloud-side foundation model that embeds geometric representations into the VLM via a structured 3D Adapter and distills physical signals (e.g., occupancy grids, 3D bounding boxes) into context tokens through an Efficient Image-Embodied Adapter, endowing the model with intrinsic 3D geometric awareness and physical interaction capability. Combined with a progressive domain curriculum and reinforcement-learning post-training, it preserves general capabilities while markedly improving spatial reasoning, traffic-semantics understanding, embodied affordance, and out-of-distribution generalization, suiting large-scale scenario mining and embodied visual question answering (Embodied VQA).
Link: https://arxiv.org/abs/2604.18484
Authors: Kangan Qian, ChuChu Xie, Yang Zhong, Jingrui Pang, Siwen Jiao, Sicong Jiang, Zilin Huang, Yunlong Wang, Kun Jiang, Mengmeng Yang, Hao Ye, Guanghao Zhang, Hangjun Ye, Guang Chen, Long Chen, Diange Yang
Affiliations: University of Science and Technology of China; Alibaba Group; Tsinghua University; Peking University; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
Comments: 15 pages, 5 figures
Abstract:Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.
[CV-12] SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection CVPR2026
【Quick Read】: This paper addresses the long-tail imbalance in camera-only 3D object detection: in real-world datasets, many safety-critical but rare categories (e.g., children, strollers, or emergency vehicles) are underrepresented, biasing learning and degrading performance, while inter-class ambiguity (visually similar subclasses) and intra-class diversity (variation in appearance, scale, pose, or scene) further hinder tail-class recognition. The key to the proposed SemLT3D framework is: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts by semantic affinity, better disentangling confusable classes and specializing on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing consistent and discriminative representations across diverse visual appearances. The method not only mitigates the long-tail problem but also improves robustness to wide appearance variation and extreme corner cases.
Link: https://arxiv.org/abs/2604.18476
Authors: Hao Vo, Khoa Vo, Thinh Phan, Ngo Xuan Cuong, Gianfranco Doretto, Hien Nguyen, Anh Nguyen, Ngan Le
Affiliations: AICV Lab, University of Arkansas, USA; University of Utah, USA; University of Houston, USA; University of Liverpool, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.
[CV-13] Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
【Quick Read】: This paper addresses a challenge in closed-loop simulation for autonomous vehicle (AV) development: converting sparse, in-the-wild driving-log observations into complete, simulation-ready 3D object assets that support agent interaction and large-viewpoint novel-view synthesis. The key to the solution is an end-to-end Asset Harvester system that combines object-centric data curation with large-scale training-tuple construction, geometry-aware preprocessing across heterogeneous sensors, and a robust training recipe coupling sparse-view-conditioned multiview generation with 3D Gaussian lifting. Within it, the SparseViewDiT model is designed specifically for real-world data issues such as limited-angle views; aided by hybrid data augmentation and self-distillation, the system enables efficient, scalable conversion of sparse AV observations into reusable 3D assets.
Link: https://arxiv.org/abs/2604.18468
Authors: Tianshi Cao, Jiawei Ren, Yuxuan Zhang, Jaewoo Seo, Jiahui Huang, Shikhar Solanki, Haotian Zhang, Mingfei Guo, Haithem Turki, Muxingzi Li, Yue Zhu, Sipeng Zhang, Zan Gojcic, Sanja Fidler, Kangxue Yin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: NVIDIA white paper. The project page: this https URL
Abstract:Closed-loop simulation is a core component of autonomous vehicle (AV) development, enabling scalable testing, training, and safety validation before real-world deployment. Neural scene reconstruction converts driving logs into interactive 3D environments for simulation, but it does not produce complete 3D object assets required for agent manipulation and large-viewpoint novel-view synthesis. To address this challenge, we present Asset Harvester, an image-to-3D model and end-to-end pipeline that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. Rather than relying on a single model component, we developed a system-level design for real-world AV data that combines large-scale curation of object-centric training tuples, geometry-aware preprocessing across heterogeneous sensors, and a robust training recipe that couples sparse-view-conditioned multiview generation with 3D Gaussian lifting. Within this system, SparseViewDiT is explicitly designed to address limited-angle views and other real-world data challenges. Together with hybrid data curation, augmentation, and self-distillation, this system enables scalable conversion of sparse AV object observations into reusable 3D assets.
[CV-14] Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
【Quick Read】: This paper addresses the shortcomings of conventional video LLMs in online streaming settings: untimely responses, opaque decisions, and difficulty aligning response timing with visual evidence. Most existing methods are evaluated offline and cannot satisfy the real-world requirement that a model respond exactly when sufficient evidence first appears. The key to the solution is a framework that decouples reasoning control from memory integration. First, an Active Thinking Decision Maker (ATDM) makes the reasoning process transparent through observable progress (ρ) and confidence (c) metrics, precisely aligning the response time t_r with the first-sufficient-evidence timestamp t*. Second, a Hierarchical Progressive Semantic Integration (HPSI) module uses learnable multi-level aggregation tokens propagated across clips to build a globally causally consistent cognitive state under a strict token budget. Experiments show marked gains in online video understanding, reaching 71.6% accuracy on StreamingBench versus 67.63% for the previous state of the art.
Link: https://arxiv.org/abs/2604.18459
Authors: Kecheng Zhang, Zongxin Yang, Mingfei Han, Haihong Hao, Yunzhi Zhuge, Changlin Li, Junhan Zhao, Zhihui Li, Xiaojun Chang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress (ρ) and confidence (c) metrics. This allows it to precisely time its response t_r to match the first-sufficient-evidence timestamp t* while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Our approach sets a new standard on key online video understanding benchmarks, achieving strong performance of 71.6% on StreamingBench and 46.9% on OVOBench, demonstrating a robust solution for evidence-aligned and transparent online video analysis. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63% to 71.60% on the StreamingBench benchmark.
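The abstract describes a controller that externalizes progress and confidence metrics and fires its response at the first-sufficient-evidence time. A minimal sketch of one way such a controller could time a response is below; the thresholding rule and the values rho_min and c_min are assumptions of this sketch, since the paper only states that ATDM exposes observable ρ and c metrics and aligns t_r with t*.

```python
def should_respond(progress, confidence, rho_min=0.8, c_min=0.9):
    # Fire at the first timestep where both externalized metrics clear
    # their thresholds; rho_min and c_min are hypothetical values,
    # not taken from the paper.
    return progress >= rho_min and confidence >= c_min

# Simulated per-clip (progress, confidence) readings from the controller:
stream = [(0.30, 0.50), (0.70, 0.85), (0.85, 0.92), (0.90, 0.95)]
t_r = next(t for t, (rho, c) in enumerate(stream) if should_respond(rho, c))
# t_r is the first index whose readings are deemed sufficient
```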
[CV-15] ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification
【Quick Read】: This paper addresses three challenges facing zero-shot vision-language models (VLMs) for chest X-ray classification: label co-occurrence bias, long-tail class imbalance, and transfer instability under domain shift. The key to ProtoCLIP, a refinement strategy for CLIP-style VLMs, is twofold: constructing pathology-focused training subsets with curated negative samples to reduce co-occurrence bias, and a representation-preserving distillation objective that stabilizes adaptation while strengthening discrimination of clinically relevant co-occurring pathologies. Without large-scale retraining, the method markedly improves performance on an unseen dataset, notably reaching a state-of-the-art AUC of 0.94 for pneumothorax detection.
Link: https://arxiv.org/abs/2604.18444
Authors: Florian Kittler, Sheethal Bhat, Andreas Maier
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.
[CV-16] Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models
【Quick Read】: This paper addresses change visual question answering (Change VQA) in remote sensing, i.e., using multimodal models to understand and answer natural-language questions about semantic changes between bi-temporal remote sensing images. The key to the solution is a comparison of two architecturally different vision-language models (VLMs): Qwen3-VL, a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, and Qwen3.5, a native multimodal model with single-stage alignment and a hybrid decoder backbone. Experiments show that the native multimodal model performs better on this task, indicating that tightly integrated multimodal backbones contribute more to language-driven semantic change reasoning in remote sensing imagery than sheer model scale or explicit multi-depth visual conditioning.
Link: https://arxiv.org/abs/2604.18429
Authors: Yakoub Bazi, Mohamad M. Al Rahhal, Mansour Zuair, Faroun Mohamed
Affiliations: King Saud University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not scale monotonically with model size, and that native multimodal models are more effective than structured vision-language pipelines for this task. These findings indicate that tightly integrated multimodal backbones contribute more to performance than scale or explicit multi-depth visual conditioning for language-driven semantic change reasoning in RS imagery.
[CV-17] MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline
【Quick Read】: This paper addresses the inability of existing benchmarks to evaluate large language models (LLMs) on the multi-step evidence integration and expert-level judgment required by realistic clinical workflows: medical guideline development depends heavily on retrieving, synthesizing, and reasoning over large-scale external knowledge, which current evaluations do not measure comprehensively. The key to the solution is MedProbeBench, the first benchmark to use high-quality clinical guidelines as expert-level references, together with the accompanying MedProbe-Eval framework whose core components are: (1) holistic rubrics with 1,200+ task-adaptive criteria for comprehensive quality assessment, and (2) fine-grained evidence verification grounded in 5,130+ atomic claims to ensure evidence precision. The benchmark substantially improves quantitative assessment of LLMs' medical evidence integration and guideline generation, revealing a significant gap between current models and expert-level clinical practice.
Link: https://arxiv.org/abs/2604.18418
Authors: Jiyao Liu, Jianghan Shen, Sida Song, Tianbin Li, Xiaojia Liu, Rongbin Li, Ziyan Huang, Jiashi Lin, Junzhi Ning, Changkai Ji, Siqi Luo, Wenjie Li, Chenglong Ma, Ming Hu, Jing Xiong, Jin Ye, Bin Fu, Ningsheng Xu, Yirong Chen, Lei Jin, Hong Chen, Junjun He
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: this https URL
[CV-18] One-Step Diffusion with Inverse Residual Fields for Unsupervised Industrial Anomaly Detection
【Quick Read】: This paper addresses the slow inference of diffusion models in unsupervised industrial anomaly detection (uIAD), which stems from their inherently iterative denoising and noising procedure. The key to the proposed one-step diffusion model (OSD-IRF) is: after training an unconditional deep diffusion probabilistic model (DDPM), it predicts a test sample's Inverse Residual Fields (IRF) from the trained parametric noise function at test time, and flags anomalies by the probability density of the IRF under a Gaussian distribution. The authors observe that anomalies are more distinguishable in the IRF space, and that the IRF is invariant across neighboring time steps, so detection needs only a single diffusion step, yielding roughly a 2X inference speedup while maintaining SOTA or competitive detection performance.
Link: https://arxiv.org/abs/2604.18393
Authors: Boan Zhang, Wen Li, Guanhua Yu, Xiyang Liu, Wenchao Chen, Long Tian
Affiliations: Xidian University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion models have achieved outstanding performance in unsupervised industrial anomaly detection (uIAD) by learning a manifold of normal data under the common assumption that off-manifold anomalies are harder to generate, resulting in larger reconstruction errors in data space or lower probability densities in the tractable latent space. However, their iterative denoising and noising nature leads to slow inference. In this paper, we propose OSD-IRF, a novel one-step diffusion with inverse residual fields, to address this limitation for uIAD task. We first train a deep diffusion probabilistic model (DDPM) on normal data without any conditioning. Then, for a test sample, we predict its inverse residual fields (IRF) based on the noise estimated by the well-trained parametric noise function of the DDPM. Finally, uIAD is performed by evaluating the probability density of the IRF under a Gaussian distribution and comparing it with a threshold. Our key observation is that anomalies become distinguishable in this IRF space, a finding that has seldom been reported in prior works. Moreover, OSD-IRF requires only single step diffusion for uIAD, thanks to the property that IRF holds for any neighboring time step in the denoising process. Extensive experiments on three widely used uIAD benchmarks show that our model achieves SOTA or competitive performance across six metrics, along with roughly a 2X inference speedup without distillation.
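The final decision rule described above (evaluate the probability density of the IRF under a Gaussian and compare to a threshold) can be sketched for a scalar residual. This is a hedged toy, not the paper's code: the real IRF is a high-dimensional field, and the Gaussian parameters and threshold here are placeholders that would in practice be fitted to the IRF of normal training data.

```python
import math

def gaussian_logpdf(x, mu=0.0, sigma=1.0):
    # Log-density of a univariate Gaussian; mu and sigma would be
    # estimated from the IRF of normal samples in a real system.
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def is_anomalous(residual, tau=-3.0):
    # Off-manifold samples produce residuals that are improbable under
    # the normal-data Gaussian, so a low log-density flags an anomaly.
    return gaussian_logpdf(residual) < tau
```

A residual near the normal-data mean scores a high log-density and passes, while a large residual falls below the threshold and is flagged.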
[CV-19] owards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation
【Quick Read】: This paper addresses the problem of Expression Drift in text-to-image person retrieval: because natural-language expressions are diverse and visual semantics are implicit, semantically equivalent texts exhibit significant feature discrepancies in the embedding space due to phrasing variations, weakening the robustness of image-text alignment. The key to the solution is a semantic compensation framework (MVR) driven by large language models (LLMs) that improves cross-modal representation consistency through multi-view semantic reformulation and feature compensation. Its three core components are: Multi-View Reformulation (MVR), a dual-branch prompting strategy combining key-feature guidance and diversity-aware rewriting to generate distributionally diverse yet semantically equivalent textual variants; a training-free textual feature robustness mechanism that suppresses noise via multi-view feature mean-pooling with residual connections, effectively capturing "Semantic Echoes"; and visual semantic compensation, which uses a vision-language model (VLM) to generate multi-perspective image descriptions combined with shared text reformulation to bridge the visual semantic gap.
Link: https://arxiv.org/abs/2604.18376
Authors: Chao Yuan, Yujian Zhao, Haoxuan Xu, Guanglin Niu
Affiliations: Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In text-to-image person retrieval tasks, the diversity of natural language expressions and the implicitness of visual semantics often lead to the problem of Expression Drift, where semantically equivalent texts exhibit significant feature discrepancies in the embedding space due to phrasing variations, thereby degrading the robustness of image-text alignment. This paper proposes a semantic compensation framework (MVR) driven by Large Language Models (LLMs), which enhances cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The core methodology comprises three components: Multi-View Reformulation (MVR): A dual-branch prompting strategy combines key feature guidance (extracting visually critical components via feature similarity) and diversity-aware rewriting to generate semantically equivalent yet distributionally diverse textual variants; Textual Feature Robustness Enhancement: A training-free latent space compensation mechanism suppresses noise interference through multi-view feature mean-pooling and residual connections, effectively capturing “Semantic Echoes”; Visual Semantic Compensation: a VLM generates multi-perspective image descriptions, which are further enhanced through shared text reformulation to address visual semantic gaps. Experiments demonstrate that our method improves the accuracy of the original model without any training and achieves state-of-the-art performance on three text-to-image person retrieval datasets.
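The training-free compensation step above (multi-view feature mean-pooling plus a residual connection) might look roughly like the following for plain list-based features. The blending weight alpha and the final re-normalization are assumptions of this sketch, not details given in the abstract.

```python
import math

def compensate(original, view_feats, alpha=0.5):
    # Mean-pool the embeddings of the LLM-generated paraphrases to average
    # out phrasing-specific noise, then fold the pooled view back into the
    # original text feature through a residual connection.
    dim = len(original)
    pooled = [sum(v[i] for v in view_feats) / len(view_feats) for i in range(dim)]
    fused = [original[i] + alpha * pooled[i] for i in range(dim)]
    norm = math.sqrt(sum(f * f for f in fused))
    return [f / norm for f in fused]  # unit norm for cosine-similarity retrieval
```

Paraphrase embeddings that disagree with each other cancel out in the mean, so the residual nudges the original feature only in directions the views agree on.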
[CV-20] DSA-CycleGAN: A Domain Shift Aware CycleGAN for Robust Multi-Stain Glomeruli Segmentation
【速读】:该论文旨在解决数字组织病理学中因染色变异(inter- and intra-stain variations)导致的分割模型性能下降问题,尤其是当训练数据仅来源于单一染色类型时,如何有效迁移至多染色场景。传统方法如CycleGAN虽可用于染色转换以实现跨染色标签复用,但其在处理具有“一对多”映射特性的染色对时易引入噪声,这与循环一致性损失(cycle consistency loss)存在冲突。本文提出Domain Shift Aware CycleGAN(DSA-CycleGAN),其核心创新在于显式建模域偏移(domain shift),从而减少翻译过程中的噪声生成,提升染色转换质量与下游分割任务的准确性。实验表明,DSA-CycleGAN在肾小球多染色分割任务中不仅提升了分割性能,还在生物差异显著的染色对之间表现出更强的鲁棒性与噪声抑制能力。
链接: https://arxiv.org/abs/2604.18368
作者: Zeeshan Nisar,Friedrich Feuerhake,Thomas Lampert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A key challenge in segmentation in digital histopathology is inter- and intra-stain variation, as it reduces model performance. Labelling each stain is expensive and time-consuming, so methods using stain transfer via CycleGAN have been developed for training multi-stain segmentation models using labels from a single stain. Nevertheless, CycleGAN tends to introduce noise during translation because of the one-to-many nature of some stain pairs, which conflicts with its cycle consistency loss. To address this, we propose the Domain Shift Aware CycleGAN (DSA-CycleGAN), which reduces the presence of such noise. Furthermore, we evaluate several advances from the field of machine learning aimed at resolving similar problems and compare their effectiveness against DSA-CycleGAN in the context of multi-stain glomeruli segmentation. Experiments demonstrate that DSA-CycleGAN not only improves segmentation performance in glomeruli segmentation but also outperforms other methods in reducing noise. This is particularly evident when translating between biologically distinct stains. The code is publicly available at this https URL.
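摘要中提到的循环一致性损失可用下面的极简 L1 形式示意(玩具函数,非原论文代码):当染色映射是一对多时,严格要求 x → G_AB → G_BA → x̂ 与 x 一致,会促使生成器把额外信息编码为噪声,这正是 DSA-CycleGAN 试图缓解的冲突:

```python
def cycle_consistency_l1(x, x_hat):
    """循环一致性损失(L1)示意:要求循环重建 x_hat ≈ 原图 x。"""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

# 完全重建时损失为 0;重建偏差越大,损失越大
perfect = cycle_consistency_l1([0.2, 0.5, 0.8], [0.2, 0.5, 0.8])
worst = cycle_consistency_l1([0.0, 1.0], [1.0, 0.0])
```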
[CV-21] EAST: Early Action Prediction Sampling Strategy with Token Masking ICLR2026
【速读】:该论文旨在解决早期动作预测(Early Action Prediction)中因视觉证据有限而导致的挑战,即在动作尚未完全展开时准确预测其类别。其解决方案的关键在于提出一种名为EAST的简单且高效的框架,核心创新是采用随机采样训练策略,通过随机选择一个时间步将视频帧分为已观察和未观察部分,使单一模型能够无缝适应测试时的不同观测比例;同时引入联合学习机制,同时利用当前观测特征与未来“理想”(oracle)特征进行训练,显著提升性能,甚至使仅使用编码器的模型也能达到优异效果。此外,作者还设计了一种token掩码方法以降低内存消耗并加速训练,从而实现高效率与高性能的平衡。
链接: https://arxiv.org/abs/2604.18367
作者: Iva Sović,Ivan Martinović,Marin Oršić
机构: Faculty of Electrical Engineering and Computing, University of Zagreb (电气工程与计算学院,萨格勒布大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2026
Abstract:Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify key components when training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.
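EAST 的随机采样训练策略可如下示意:随机选取一个时间步,把视频帧切分为已观察/未观察两段,使同一模型覆盖所有测试时观测比例。帧数与随机种子为演示假设:

```python
import random

def sample_observation_split(num_frames, rng):
    """随机采样时间步 t,划分已观察帧 [0, t) 与未观察帧 [t, N)(训练策略示意)。"""
    t = rng.randint(1, num_frames - 1)  # 至少观察 1 帧,至少保留 1 帧未观察
    return list(range(t)), list(range(t, num_frames))

rng = random.Random(0)
observed, unobserved = sample_observation_split(16, rng)
ratio = len(observed) / 16  # 本次采样对应的观测比例
```

每个训练样本都以不同的观测比例出现,因而单一模型无需针对特定比例单独训练。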
[CV-22] LBFTI: Layer-Based Facial Template Inversion for Identity-Preserving Fine-Grained Face Reconstruction
【速读】:该论文旨在解决人脸模板(Facial Template)在身份认证系统中因存在可被逆向重建风险而导致的隐私泄露问题。现有技术能够通过面部模板还原出高保真度的人脸图像,从而威胁用户隐私。其解决方案的关键在于提出一种分层式人脸模板逆向重建方法(Layer-Based Facial Template Inversion, LBFTI),将人脸图像分解为前景层(含眉毛、眼睛、鼻子和嘴巴)、中景层(皮肤区域)与背景层,并采用专用生成器分别建模各层,结合三阶段训练策略:首先独立优化前景与中景层生成质量,其次融合两层并注入模板信息生成完整人脸图像,最后联合微调所有模块以增强层间协同与身份一致性。该方法在机器认证性能(TAR提升25.3%)和人类感知相似性方面均优于当前最优方案。
链接: https://arxiv.org/abs/2604.18358
作者: Zixuan Shen,Zhihua Xia,Kaikai Gan,Peipeng Yu
机构: Jinan University (暨南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In face recognition systems, facial templates are widely adopted for identity authentication due to their compliance with the data minimization principle. However, facial template inversion technologies have posed a severe privacy leakage risk by enabling face reconstruction from templates. This paper proposes a Layer-Based Facial Template Inversion (LBFTI) method to reconstruct identity-preserving fine-grained face images. Our scheme decomposes face images into three layers: foreground layers (including eyebrows, eyes, nose, and mouth), midground layers (skin), and background layers (other parts). LBFTI leverages dedicated generators to produce these layers, adopting a rigorous three-stage training strategy: (1) independent refined generation of foreground and midground layers, (2) fusion of foreground and midground layers with template secondary injection to produce complete panoramic face images with background layers, and (3) joint fine-tuning of all modules to optimize inter-layer coordination and identity consistency. Experiments demonstrate that our LBFTI not only outperforms state-of-the-art methods in machine authentication performance, with a 25.3% improvement in TAR, but also achieves better similarity in human perception, as validated by both quantitative metrics and a questionnaire survey.
[CV-23] AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation CVPR2026
【速读】:该论文旨在解决视频扩散变换器(Video Diffusion Transformers, DiTs)在推理阶段因注意力机制的二次复杂度而导致的高延迟问题。现有稀疏注意力方法要么忽略语义相似性,要么无法适应不同层间标记分布的异质性,从而造成模型性能下降。其解决方案的关键在于提出一种无需训练的自适应聚类框架AdaCluster:通过角度相似性保持的聚类方法对查询向量进行高效压缩,同时设计欧氏相似性保持的聚类方法处理键向量,并结合簇数量分配、阈值自适应聚类与关键簇高效选择策略,在不显著损失生成质量的前提下实现高达4.31倍的加速效果。
链接: https://arxiv.org/abs/2604.18348
作者: Haoyue Tan,Shengnan Wang,Yulin Qiao,Juncheng Zhang,Youhui Bai,Ping Gong,Zewen Jin,Cheng Li
机构: University of Science and Technology of China (中国科学技术大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院); University of Macau (澳门大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026 poster
Abstract:Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering method to query vectors for higher compression, and designs a euclidean-similarity-preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. Experiments on CogVideoX-2B, HunyuanVideo, and Wan-2.1 on one A40 GPU demonstrate up to 1.67-4.31x speedup with negligible quality degradation.
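其中"角度相似性保持"的查询聚类可以用一个贪心余弦阈值聚类来示意(阈值与向量均为演示假设;论文中的簇数分配与关键簇选择此处从略):

```python
import numpy as np

def angle_cluster(vectors, cos_thresh=0.9):
    """贪心角度聚类示意:与某簇中心余弦相似度达到阈值则并入该簇,否则新建簇。"""
    centers, labels = [], []
    for v in vectors:
        u = v / np.linalg.norm(v)
        for i, c in enumerate(centers):
            if float(u @ c) >= cos_thresh:
                labels.append(i)
                break
        else:
            centers.append(u)               # 以首个成员的单位向量作为簇中心
            labels.append(len(centers) - 1)
    return labels

queries = [np.array([1.0, 0.0]), np.array([2.0, 0.1]), np.array([0.0, 1.0])]
labels = angle_cluster(queries)
```

方向接近的查询被压缩到同一簇,注意力只需按簇计算,从而降低总体开销。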
[CV-24] Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation
【速读】:该论文旨在解决室内机器人导航中玻璃表面导致深度传感器测量严重失真这一问题,该问题会显著影响基于深度信息的定位与建图性能。解决方案的关键在于提出一种无需训练的框架,利用深度基础模型(如Depth Anything 3)作为结构先验,并通过鲁棒的局部RANSAC对齐方法将其与原始传感器深度数据融合,从而自然规避错误玻璃区域的干扰并恢复准确的绝对度量尺度。
链接: https://arxiv.org/abs/2604.18336
作者: Jiamin Zheng,Jingwen Yu,Guangcheng Chen,Hong Zhang
机构: Southern University of Science and Technology (南方科技大学); CKS Robotics Institute (CKS机器人研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 8 figures
Abstract:Indoor robot navigation is often compromised by glass surfaces, which severely corrupt depth sensor measurements. While foundation models like Depth Anything 3 provide excellent geometric priors, they lack an absolute metric scale. We propose a training-free framework that leverages depth foundation models as a structural prior, employing a robust local RANSAC-based alignment to fuse it with raw sensor depth. This naturally avoids contamination from erroneous glass measurements and recovers an accurate metric scale. Furthermore, we introduce GlassRecon, a novel RGB-D dataset with geometrically derived ground truth for glass regions. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art baselines, especially under severe sensor depth corruption. The dataset and related code will be released at this https URL.
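摘要中"鲁棒 RANSAC 对齐"的思想可用一维 scale/shift 拟合示意:玻璃区域的错误深度作为外点被自然排除,其余点用于恢复米制尺度。以下是合成数据上的玩具实现(阈值、迭代数等均为演示假设):

```python
import random

def ransac_scale_shift(rel_depth, sensor_depth, iters=200, thresh=0.05, seed=0):
    """RANSAC 拟合 d ≈ s * rel + b,把相对深度对齐到传感器米制深度(示意)。"""
    rng = random.Random(seed)
    n = len(rel_depth)
    best_inliers, best = -1, (1.0, 0.0)
    for _ in range(iters):
        i, j = rng.sample(range(n), 2)
        if rel_depth[i] == rel_depth[j]:
            continue  # 退化样本,跳过
        s = (sensor_depth[i] - sensor_depth[j]) / (rel_depth[i] - rel_depth[j])
        b = sensor_depth[i] - s * rel_depth[i]
        inliers = sum(abs(s * r + b - d) < thresh
                      for r, d in zip(rel_depth, sensor_depth))
        if inliers > best_inliers:
            best_inliers, best = inliers, (s, b)
    return best

# 合成:真实 scale=2.0、shift=0.5;最后两个点模拟玻璃处被破坏的传感器深度
rel = [0.1 * k for k in range(20)]
met = [2.0 * r + 0.5 for r in rel]
met[-1], met[-2] = 10.0, 9.0
s, b = ransac_scale_shift(rel, met)
```

外点不会进入多数一致的拟合结果,因此恢复的 scale/shift 不受玻璃测量污染。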
[CV-25] OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在复杂真实物理场景中生成高质量以人为中心视频时面临的挑战,其核心问题在于现有数据集在全局场景与相机多样性、人与人/人与物体之间的交互建模稀疏性以及个体属性对齐不足等三个维度存在结构性缺陷。解决方案的关键在于提出 OmniHuman 数据集和 OmniHuman Benchmark(OHBench),其中 OmniHuman 是一个大规模、多场景的细粒度人类建模数据集,采用分层标注体系覆盖视频级场景、帧级交互和个体级属性,并通过全自动化的高质量数据采集与多模态标注流程实现高效构建;而 OHBench 则是一个三级评估体系,引入与人类感知高度一致的指标,从全局场景、关系交互和个体属性三个层面提供科学诊断,从而系统性填补了现有基准测试在人类中心音视频合成任务中的评估空白。
链接: https://arxiv.org/abs/2604.18326
作者: Lei Zhu,Xing Cai,Yingjie Chen,Yiheng Li,Binxin Yang,Hao Liu,Jie Chen,Chen Li,Jing LYu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 6 figures
Abstract:Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.
[CV-26] EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)自进化过程中存在的两大核心问题:一是基于伪标签(pseudo-label)的方法因模型预测漂移导致质量逐步退化;二是基于模板(template-based)的方法受限于静态变换集,难以动态调整难度与多样性。解决方案的关键在于提出EVE(Executable Visual transformation-based self-Evolution)框架,其通过完全摒弃伪标签,利用可执行的视觉变换代码持续丰富训练分布的多样性和复杂性,采用Challenger-Solver双策略架构实现闭环自进化:Challenger维护并扩展视觉变换代码示例队列,生成动态Python脚本以执行真实世界视觉变换,从而获得可验证的VQA任务真值答案;同时,一个多维奖励机制结合语义多样性与动态难度校准,驱动Challenger不断优化任务难度与多样性,防止模式坍缩,并促进两策略间的协同进化,最终实现稳定、可扩展且可验证的MLLM自进化范式。
链接: https://arxiv.org/abs/2604.18320
作者: Yongrui Heng,Chaoya Jiang,Han Yang,Shikun Zhang,Wei Ye
机构: National Engineering Research Center for Software Engineering, Peking University (北京大学软件工程国家工程研究中心); School of Control Science and Engineering, Shandong University (山东大学控制科学与工程学院); Zeekr, Geely Auto (几何汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-evolution of multimodal large language models (MLLMs) remains a critical challenge: pseudo-label-based methods suffer from progressive quality degradation as model predictions drift, while template-based methods are confined to a static set of transformations that cannot adapt in difficulty or diversity. We contend that robust, continuous self-improvement requires not only deterministic external feedback independent of the model’s internal certainty, but also a mechanism to perpetually diversify the training distribution. To this end, we introduce EVE (Executable Visual transformation-based self-Evolution), a novel framework that entirely bypasses pseudo-labels by harnessing executable visual transformations continuously enriched in both variety and complexity. EVE adopts a Challenger-Solver dual-policy architecture. The Challenger maintains and progressively expands a queue of visual transformation code examples, from which it synthesizes novel Python scripts to perform dynamic visual transformations. Executing these scripts yields VQA problems with absolute, execution-verified ground-truth answers, eliminating any reliance on model-generated supervision. A multi-dimensional reward system integrating semantic diversity and dynamic difficulty calibration steers the Challenger to enrich its code example queue while posing progressively more challenging tasks, preventing mode collapse and fostering reciprocal co-evolution between the two policies. Extensive experiments demonstrate that EVE consistently surpasses existing self-evolution methods, establishing a robust and scalable paradigm for verifiable MLLM self-evolution. The code is available at this https URL .
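"可执行视觉变换 + 执行验证真值"的核心思想可用一个玩具示例说明:对图像(此处以嵌套列表模拟)执行 k 次 90° 旋转脚本,VQA 答案直接由执行过程给出,完全绕开伪标签。变换种类与问法均为演示假设:

```python
def rotate90(img):
    """将二维网格顺时针旋转 90 度(以嵌套列表模拟图像)。"""
    return [list(row) for row in zip(*img[::-1])]

def make_verified_vqa(img, k):
    """执行 k 次旋转,生成带"执行验证真值"的 VQA 样本(思路示意)。"""
    out = img
    for _ in range(k):
        out = rotate90(out)
    question = "图像相对原图顺时针旋转了多少度?"
    answer = str((90 * k) % 360)  # 真值由执行过程决定,与模型预测无关
    return out, question, answer

img = [[1, 2], [3, 4]]
rotated, q, a = make_verified_vqa(img, 1)
```

由于答案由变换脚本的执行结果唯一确定,监督信号不随模型预测漂移而退化。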
[CV-27] Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection SIGIR2026
【速读】:该论文旨在解决开放词汇时空动作检测(Open-Vocabulary Temporal Action Detection, OV-TAD)中因动作语义标签与视频内容之间存在语义不平衡而导致的跨模态对齐偏差问题,即抽象简洁的动作标签难以准确匹配复杂丰富的视频表征,从而引入语义噪声并影响检测精度。解决方案的关键在于提出DFAlign框架,其核心创新为通过扩散去噪机制生成前景知识作为视频与文本表示之间的中间语义锚点:首先设计Semantic-Unify Conditioning (SUC)模块统一动作共享与特定语义作为扩散条件;其次引入Background-Suppress Denoising (BSD)模块逐步去除视频背景冗余以生成前景知识;最后通过Foreground-Prompt Alignment (FPA)模块将提取的前景知识注入文本表征作为提示令牌,引导模型关注动作相关片段,实现精准的跨模态对齐。
链接: https://arxiv.org/abs/2604.18313
作者: Sa Zhu,Wanqian Zhang,Lin Wang,Jinchao Zhang,Cong Wang,Bo Li
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); State Key Laboratory of Cyberspace Security Defense (网络空间安全防御国家重点实验室); Hangzhou Dianzi University (杭州电子科技大学); Control Science and Engineering, Zhejiang University (浙江大学控制科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by SIGIR 2026
Abstract:Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video contents, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge for the guidance of action-video alignment. Following the ‘conditioning, denoising and aligning’ manner, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through denoising process. This foreground knowledge serves as effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module to inject extracted foreground knowledge as prompt tokens into text representations, guiding model’s attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code repository is provided as follows: this https URL.
[CV-28] Relative State Estimation using Event-Based Propeller Sensing
【速读】:该论文旨在解决多无人机(UAV)自主集群系统中相对状态估计的准确性与实时性问题。传统基于单目帧式相机的方法在理想条件下表现良好,但在复杂视觉环境中存在延迟高、尺度模糊等问题。为此,作者提出了一种基于事件相机(event camera)的四旋翼无人机相对状态估计算法,其关键在于利用螺旋桨在事件流中的运动特征进行频率检测,并将其作为推力输入驱动运动学状态估计模块;同时结合从事件流中提取的几何结构信息(如椭圆拟合)来估计机体姿态,从而实现高精度、低延迟的相对定位。该方法在五个真实室外飞行数据集上实现了小于3%误差的螺旋桨频率估计,为多机器人系统的去中心化相对定位提供了可行方案。
链接: https://arxiv.org/abs/2604.18289
作者: Ravi Kumar Thakur,Luis Granados Segura,Jan Klivan,Radim Špetlík,Tobiáš Vinklárek,Matouš Vrba,Martin Saska
机构: Czech Technical University in Prague (捷克技术大学); Czech Science Foundation (GAČR) (捷克科学基金会)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:
Abstract:Autonomous swarms of multi-Unmanned Aerial Vehicle (UAV) systems require accurate and fast relative state estimation. Although monocular frame-based camera methods perform well in ideal conditions, they are slow, suffer from scale ambiguity, and often struggle in visually challenging conditions. The advent of event cameras addresses these challenges by providing low latency, high dynamic range, and microsecond-level temporal resolution. This paper proposes a framework for relative state estimation for quadrotors using event-based propeller sensing. The propellers in the event stream are tracked by detection to extract the regions of interest. The event streams in these regions are processed in temporal chunks to estimate per-propeller frequencies. These frequency measurements drive a kinematic state estimation module as a thrust input, while camera-derived position measurements provide the update step. Additionally, we use geometric primitives derived from event streams to estimate the orientation of the quadrotor by fitting an ellipse over a propeller and backprojecting it to recover the body-frame tilt axis. Existing event-based approaches for quadrotor state estimation use the propeller frequency only in simulated flight sequences. Our approach estimates the propeller frequency under 3% error on a test dataset of five real-world outdoor flight sequences, providing a method for decentralized relative localization for multi-robot systems using event cameras.
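由事件时间戳估计旋转频率的一种常见套路是"分箱计数 + FFT 取主频"。下面的合成示例演示该思路;采样率、事件簇形状等均为演示假设,并非论文的实际处理流程:

```python
import numpy as np

def dominant_frequency(timestamps, fs=10000.0, window=1.0):
    """事件流主频估计示意:把事件按时间分箱成计数信号,FFT 取最大幅值频率。"""
    n = int(fs * window)
    counts, _ = np.histogram(timestamps, bins=n, range=(0.0, window))
    spec = np.abs(np.fft.rfft(counts - counts.mean()))  # 去直流后取幅度谱
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return float(freqs[np.argmax(spec)])

# 合成一个 100 Hz 的周期性事件流:每个周期产生一簇 5 个事件
offsets = np.array([0.0, 0.0005, 0.001, 0.0015, 0.002])
ts = np.concatenate([k / 100.0 + offsets for k in range(100)])
est = dominant_frequency(ts)
```

1 秒窗口对应 1 Hz 的频率分辨率;每周期的事件簇使基频在谐波中占优,argmax 即落在真实转频附近。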
[CV-29] Spike-NVPT: Learning Robust Visual Prompts via Bio-Inspired Temporal Filtering and Discretization
【速读】:该论文旨在解决预训练视觉模型在使用基于提示调优(prompt tuning)方法时,因提示参数连续且密集而导致对输入噪声敏感的问题。这种高容量的提示容易过拟合与任务无关的细节,从而削弱模型鲁棒性。解决方案的关键在于提出Spike-NVPT方法,其核心创新是引入基于脉冲神经元(spiking neurons)的信号滤波层(Signal Filtering Layer),利用积分发放(Integrate-and-Fire, IF)机制积累任务相关信号并抑制瞬态噪声波动;随后通过脉冲离散单元(Spike Discretization Unit)将过滤后的信号转化为稀疏二进制提示,该过程起到强正则化作用,促使模型聚焦于最具判别力和鲁棒性的特征。值得注意的是,生成的二进制提示在部署阶段保持静态,确保推理时无额外计算开销。
链接: https://arxiv.org/abs/2604.18284
作者: Qiugang Zhan,Anning Jiang,Ran Tao,Ao Ma,Xiangyu Zhang,Xiurui Xie,Guisong Liu
机构: Southwestern University of Finance and Economics (西南财经大学); Ministry of Education (教育部); University of Electronic Science and Technology of China (电子科技大学); China Mobile Qilu Innovation Research Institute (中国移动齐鲁创新研究院); Kash Institute of Electronics and Information Industry (喀什电子与信息产业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pre-trained vision models have found widespread application across diverse domains. Prompt tuning-based methods have emerged as a parameter-efficient paradigm for adapting pre-trained vision models. While effective on standard benchmarks, the continuous and dense nature of learned prompts can lead to sensitivity against input noise, as the high-capacity prompts tend to overfit task-irrelevant details. To address this trade-off, we propose Spike-NVPT, a noise-robust visual prompt tuning method. Specifically, we design a Signal Filtering Layer based on spiking neurons, which uses the integrate-and-fire (IF) mechanism to accumulate task-relevant signals over time and filter transient noise fluctuations. A subsequent Spike Discretization Unit converts filtered signals into sparse binary prompts. This discretization acts as a strong regularizer, forcing the model to anchor decision boundaries on the most discriminative and robust features. Notably, the resulting binary prompts remain static during deployment, ensuring zero additional computational overhead during inference. Experimental results demonstrate that Spike-NVPT achieves superior robustness performance, with a maximum improvement of 11.2% over conventional methods, and retains competitive accuracy on clean datasets. To the best of our knowledge, this is the first attempt to leverage spiking neurons for fine-tuning traditional artificial neural network (ANN)-based visual models.
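积分发放(IF)机制的行为可用一个极简标量神经元示意:膜电位累积输入,达到阈值才发放稀疏二值脉冲并复位,低于阈值的瞬态波动不直接进入输出。阈值与输入序列均为演示假设:

```python
def integrate_and_fire(signal, threshold=1.0):
    """IF 神经元示意:累积任务相关信号,输出稀疏二值脉冲序列。"""
    membrane, spikes = 0.0, []
    for x in signal:
        membrane += x              # 积分:随时间累积输入
        if membrane >= threshold:  # 达阈值则发放
            spikes.append(1)
            membrane = 0.0         # 发放后复位
        else:
            spikes.append(0)
    return spikes

# 持续的小幅信号可累积到阈值触发脉冲;输出始终是 0/1 的稀疏序列
spikes = integrate_and_fire([0.4, 0.4, 0.4, 0.0, 0.3])
```

正是这种二值化输出构成了文中所述的强正则化:提示只能以稀疏脉冲的形式表达信息。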
[CV-30] LiquidTAD: An Efficient Method for Temporal Action Detection via Liquid Neural Dynamics
【速读】:该论文旨在解决当前基于Transformer的时序动作检测(Temporal Action Detection, TAD)方法在资源受限环境下部署困难的问题,其核心挑战在于Transformer架构存在二次计算复杂度和参数冗余。解决方案的关键在于提出LiquidTAD框架,通过用并行化的ActionLiquid模块替代传统的自注意力层,利用闭合形式连续时间(Closed-form Continuous-time, CfC)公式将模型重构为可并行运算的算子,同时保留了连续时间动力学的物理先验。该设计实现了O(N)线性时间复杂度,并通过学习的时间常数(τ)自适应调节时间敏感性,从而高效捕捉复杂时序依赖关系,显著提升模型效率与鲁棒性。
链接: https://arxiv.org/abs/2604.18274
作者: Zepeng Sun,Naichuan Zheng,Hailun Xia,Junjie Wu,Liwei Bao,Xiaotai Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Temporal Action Detection (TAD) in untrimmed videos is currently dominated by Transformer-based architectures. While high-performing, their quadratic computational complexity and substantial parameter redundancy limit deployment in resource-constrained environments. In this paper, we propose LiquidTAD, a novel parameter-efficient framework that replaces cumbersome self-attention layers with parallelized ActionLiquid blocks. Unlike traditional Liquid Neural Networks (LNNs) that suffer from sequential execution bottlenecks, LiquidTAD leverages a closed-form continuous-time (CfC) formulation, allowing the model to be reformulated as a parallelizable operator while preserving the intrinsic physical prior of continuous-time dynamics. This architecture captures complex temporal dependencies with O(N) linear complexity and adaptively modulates temporal sensitivity through learned time-constants ( \tau ), providing a robust mechanism for handling varying action durations. To the best of our knowledge, this work is the first to introduce a parallelized LNN-based architecture to the TAD domain. Experimental results on the THUMOS-14 dataset demonstrate that LiquidTAD achieves a highly competitive Average mAP of 69.46% with only 10.82M parameters – a 63% reduction compared to the ActionFormer baseline. Further evaluations on ActivityNet-1.3 and Ego4D benchmarks confirm that LiquidTAD achieves an optimal accuracy-efficiency trade-off and exhibits superior robustness to temporal sampling variations, advancing the Pareto frontier of modern TAD frameworks.
[CV-31] MARCO: Navigating the Unseen Space of Semantic Correspondence CVPR2026
【速读】:该论文旨在解决当前基于双编码器架构(如DINOv2与扩散模型结合)的语义对应方法在实际应用中泛化能力不足的问题,即模型在训练时所见的关键点(keypoints)与真实场景中查询的关键点不匹配时性能显著下降。其解决方案的核心在于提出MARCO模型,通过一种新颖的训练框架实现细粒度定位与语义泛化能力的协同增强:一方面采用粗到精的目标函数提升空间精度,另一方面引入自蒸馏机制将稀疏标注监督扩展至未标注区域,从而将少量关键点转化为密集且语义一致的对应关系。该方法在SPair-71k、AP-10K和PF-PASCAL等多个基准上达到新SOTA,尤其在细粒度阈值(PCK@0.01)和未见关键点/类别上的泛化性能提升显著,同时模型体积仅为扩散方法的1/3,推理速度提升10倍。
链接: https://arxiv.org/abs/2604.18267
作者: Claudia Cuttano,Gabriele Trivigno,Carlo Masone,Stefan Roth
机构: Politecnico di Torino(都灵理工大学); TU Darmstadt(达姆施塔特工业大学); hessian.AI(黑森人工智能); ELIZA(艾丽莎)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Oral. Project page: this https URL
Abstract:Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which expands sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+8.9 PCK@0.01), strongest generalization to unseen keypoints (+5.1, SPair-U) and categories (+4.7, MP-100), while remaining 3x smaller and 10x faster than diffusion-based approaches. Code is available at this https URL .
[CV-32] Geometry-Guided 3D Visual Token Pruning for Video-Language Models CVPR2026
【速读】:该论文旨在解决3D场景理解中因空间视频包含大量视觉标记(visual tokens)而导致的推理效率低下和上下文管理困难的问题。现有剪枝方法未能充分考虑空间视频的视角一致性与剩余标记的空间多样性,难以有效去除帧间冗余并保持场景完整性。其解决方案的关键在于提出Geo3DPruner框架,该框架通过几何引导的全局注意力机制建模跨帧相关性,并采用两阶段剪枝策略:首先在体素内选择多视角代表性特征(intra-voxel),其次在体素间选择空间分布均匀的子集以保留空间多样性(inter-voxel),从而在显著减少视觉标记数量的同时维持高精度的3D场景理解性能。
链接: https://arxiv.org/abs/2604.18260
作者: Han Li,Zehao Huang,Jiahui Fu,Naiyan Wang,Si Liu
机构: Beihang University (北京航空航天大学); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.
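体素内阶段"每个体素保留一个代表 token"的做法可如下示意:按 3D 坐标把 token 分入体素,簇内取相关性分数最高者,从而在大幅剪枝的同时维持空间覆盖。坐标、分数与体素尺寸均为演示假设:

```python
from collections import defaultdict

def intra_voxel_prune(points, scores, voxel_size=1.0):
    """体素内剪枝示意:每个体素只保留分数最高的一个 token。"""
    buckets = defaultdict(list)
    for p, s in zip(points, scores):
        key = tuple(int(c // voxel_size) for c in p)  # token 所在体素的索引
        buckets[key].append((s, p))
    return [max(items)[1] for items in buckets.values()]

pts = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.2), (1.5, 0.1, 0.1)]
kept = intra_voxel_prune(pts, [0.9, 0.5, 0.7])
```

同一体素内的冗余多视角 token 被压缩为一个代表;论文的体素间阶段再从这些代表中选出空间分布均匀的子集。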
[CV-33] Long-Text-to-Image Generation via Compositional Prompt Decomposition ICLR2026
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)模型在处理描述性段落类长输入时难以捕捉关键细节的问题,其根源在于训练数据中普遍存在的简短标题(caption)分布导致模型对长序列理解能力不足。解决方案的关键在于提出一种名为 Prompt Refraction for Intricate Scene Modeling (PRISM) 的组合式方法:通过轻量级模块从长提示中提取成分表征,使预训练T2I模型对每个组件独立进行噪声预测,并利用基于能量的联结策略将各组件输出融合至单一去噪步骤中,从而在不依赖微调的前提下实现对复杂场景的有效建模与高保真生成。
链接: https://arxiv.org/abs/2604.18258
作者: Jen-Yuan Huang,Tong Lin,Yilun Du
机构: Peking University (北京大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to the Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract:While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture the key details when the inputs are descriptive paragraphs. This limitation stems from the prevalence of concise captions that shape their training distributions. Existing methods attempt to bridge this gap by either fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths; or by projecting the oversize inputs into normal-prompt space and compromising fidelity. We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompts. The T2I model makes independent noise predictions for each component, and their outputs are merged into a single denoising step using energy-based conjunction. We evaluate PRISM across a wide range of model architectures, showing comparable performances to models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior generalization, outperforming baseline models by 7.4% on prompts over 500 tokens in a challenging public benchmark.
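摘要中"各组件独立预测噪声、再以能量式联结合并到单一去噪步"的做法,与组合式扩散(composable diffusion)的常见形式一致,可示意如下;权重与维度为演示假设:

```python
import numpy as np

def conjunction_step(eps_uncond, eps_components, weights):
    """能量式联结示意:eps = eps_uncond + Σ_i w_i * (eps_i - eps_uncond)。"""
    eps = eps_uncond.copy()
    for w, e in zip(weights, eps_components):
        eps += w * (e - eps_uncond)  # 每个组件贡献一个"条件方向"
    return eps

rng = np.random.default_rng(0)
eps_u = rng.normal(size=8)
comps = [rng.normal(size=8) for _ in range(3)]
merged = conjunction_step(eps_u, comps, [1.0, 1.0, 1.0])
```

当只有一个组件且权重为 1 时,该式退化为普通的条件噪声预测;多个组件则在同一步内叠加各自的语义约束。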
[CV-34] Domain-Specialized Object Detection via Model-Level Mixtures of Experts IJCNN2026
【速读】:该论文旨在解决对象检测任务中传统集成方法(ensemble)在性能提升与模型可解释性方面的局限性,尤其是在处理密集且结构化的预测输出时存在的挑战。其解决方案的关键在于提出一种基于模型层面的混合专家(Mixture-of-Experts, MoE)架构,该架构将多个基于YOLO的对象检测器作为专家(expert),每个专家在语义上互斥的数据子集上训练,并引入一个可学习的门控网络(gating network)动态调整各专家对最终检测结果的贡献权重。通过优化门控机制和融合策略(如损失平衡以防止专家坍塌),该方法不仅显著优于标准集成方法,还提供了关于专家专业化能力的可解释洞察,为对象检测任务提供了一种更具结构性和可解释性的替代集成方案。
链接: https://arxiv.org/abs/2604.18256
作者: Svetlana Pavlitska,Malte Stüven,Beyza Keskin,J. Marius Zöllner
机构: 1. University of Stuttgart (斯图加特大学); 2. German Research Center for Artificial Intelligence (德国人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at IJCNN 2026
Abstract:Mixture-of-Experts (MoE) models provide a structured approach to combining specialized neural networks and offer greater interpretability than conventional ensembles. While MoEs have been successfully applied to image classification and semantic segmentation, their use in object detection remains limited due to challenges in merging dense and structured predictions. In this work, we investigate model-level mixtures of object detectors and analyze their suitability for improving performance and interpretability in object detection. We propose an MoE architecture that combines YOLO-based detectors trained on semantically disjoint data subsets, with a learned gating network that dynamically weights expert contributions. We study different strategies for fusing detection outputs and for training the gating mechanism, including balancing losses to prevent expert collapse. Experiments on the BDD100K dataset demonstrate that the proposed MoE consistently outperforms standard ensemble approaches and provides insights into expert specialization across domains, highlighting model-level MoEs as a viable alternative to traditional ensembling for object detection. Our code is available at this https URL.
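门控网络对专家贡献进行动态加权的最小形式可示意如下(softmax 门控 + 对检测置信度的加权融合;logits 与分数均为演示假设,论文中的检测输出融合策略更复杂):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def moe_fuse(gate_logits, expert_scores):
    """门控加权融合示意:门控权重决定各专家检测置信度的贡献。"""
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))

# 两个专家对同一候选框的置信度;门控更信任专家 0
fused = moe_fuse([2.0, 0.0], [0.9, 0.3])
```

门控 logits 由输入图像决定,因此融合结果会随场景向更擅长该领域的专家倾斜,这也是其可解释性的来源。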
[CV-35] Style-Based Neural Architectures for Real-Time Weather Classification
【速读】:该论文旨在解决从图像中实时分类天气状况(晴天、雨天、雪天、雾天)的问题,核心挑战在于如何有效提取图像中的风格化特征(stylistic elements),这些特征通常具有细微且高频率的特性,传统方法难以捕捉。解决方案的关键在于提出三种神经网络架构:其中“Multi-PatchGAN”通过多尺度patch判别机制增强对局部风格信息的感知能力;“Truncated ResNet50”通过进化算法确定最优层数截断策略,保留前九层以提取高频率细节特征;而“Truncated ResNet50 with Gram Matrix and Attention”则创新性地引入Gram矩阵与注意力机制,在训练过程中自动加权各层风格表达的重要性,从而显著提升模型对风格特征的判别力和泛化性能。这三项改进共同实现了优于现有技术的分类准确率,并展现出在其他基于外观的分类任务(如动物识别、医学影像疾病检测等)中的广泛适用性。
链接: https://arxiv.org/abs/2604.18251
作者: Hamed Ouattara,Pascal Houssam Salmane,Pierre Duthon,Frédéric Bernardin,Omar Ait Aider
机构: CEREMA(法国国家环境与城市规划研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注: 9 pages, 21 figures
Abstract:In this paper, we present three neural network architectures designed for real-time classification of weather conditions (sunny, rain, snow, fog) from images. These models, inspired by recent advances in style transfer, aim to capture the stylistic elements present in images. One model, called “Multi-PatchGAN”, is based on PatchGANs used in well-known architectures such as Pix2Pix and CycleGAN, but here adapted with multiple patch sizes for detection tasks. The second model, “Truncated ResNet50”, is a simplified version of ResNet50 retaining only its first nine layers. This truncation, determined by an evolutionary algorithm, facilitates the extraction of high-frequency features essential for capturing subtle stylistic details. Finally, we propose “Truncated ResNet50 with Gram Matrix and Attention”, which computes Gram matrices for each layer during training and automatically weights them via an attention mechanism, thus optimizing the extraction of the most relevant stylistic expressions for classification. These last two models outperform the state of the art and demonstrate remarkable generalization capability on several public databases. Although developed for weather detection, these architectures are also suitable for other appearance-based classification tasks, such as animal species recognition, texture classification, disease detection in medical imaging, or industrial defect identification.
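文中反复出现的 Gram 矩阵风格表征,其计算方式如下:对形状为 (C, H, W) 的特征图,展平空间维后求通道间内积,得到与空间位置无关的通道相关性矩阵。特征维度为演示假设:

```python
import numpy as np

def gram_matrix(feat):
    """风格表征示意:feat 形状 (C, H, W),返回 (C, C) 的通道相关性矩阵。"""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)   # 展平空间维
    return f @ f.T / (h * w)     # 通道间内积,按空间尺寸归一化

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 8, 8))
g = gram_matrix(feat)
```

Gram 矩阵丢弃空间排布、保留纹理统计,这正是它适合表达雨、雪、雾等"风格"信息的原因;论文进一步用注意力机制对各层的 Gram 矩阵加权。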
[CV-36] Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning MICCAI2026
【速读】:该论文旨在解决基于CT影像的生存预测中因人工解读导致的信息损失及临床决策支持不足的问题。其关键解决方案是提出一种视觉-语言框架,通过大规模开源CT图像与放射科报告的配对数据进行视觉指令微调(visual instruction tuning),使模型学习到具有临床意义的视觉-文本表征;在此基础上引入生存预测头(survival prediction head),从而在整合CT影像与临床数据的同时生成可解释的语言响应,显著提升生存预测性能,尤其在临床数据预测能力较弱时表现更优。
链接: https://arxiv.org/abs/2604.18250
作者: Xixi Liu,Jorge Lazo,Andreas Hallqvist,Mikael Johansson,Åse Johnsson,Jonas S Andersson,Ella Äng Eklund,Patrik Sund,Nasser Hosseini,Jennifer Alvén,Ida Häggström
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to MICCAI 2026
Abstract:Accurate prognostication and risk estimation are essential for guiding clinical decision-making and optimizing patient management. While radiologist-assessed features from CT scans provide valuable indicators of disease severity and outcomes, interpreting such images requires expert knowledge, and translating rich visual information into textual summaries inevitably leads to information loss. In this work, we propose a vision-language framework for 3D CT image understanding that leverages large-scale open-sourced CT images paired with radiology reports through visual instruction tuning. This pre-training enables the model to learn clinically meaningful visual-textual representations, which can then be adapted to downstream survival prediction tasks. By incorporating a survival prediction head on top of the pre-trained model, our approach improves survival prediction from CT images and clinical data while generating clinically meaningful language responses to predefined questions. Experimental results demonstrate that our method outperforms baseline methods in survival prediction, particularly, when clinical data alone is less predictive. The code will be released upon acceptance.
[CV-37] Is SAM3 ready for pathology segmentation?
【速读】:该论文旨在解决数字病理图像分割中传统方法因标注成本高和泛化能力差而面临的挑战,特别是探索生成式 AI(Generative AI)模型 Segment Anything Model 3 (SAM3) 在病理图像中的分割能力边界。其解决方案的关键在于提出了一套系统化的评估协议,通过零样本、少样本和监督学习等多种标注设置,结合不同提示策略(prompting strategies),对 SAM3 在 NuInsSeg、PanNuke 和 GlaS 等病理数据集上的表现进行结构化分析,从而揭示其在核级与组织级分割任务中的性能潜力与局限性,并强调了领域适配的必要性。
链接: https://arxiv.org/abs/2604.18225
作者: Qiuyu Kong,Shakiba Sharifi,Zanxi Ruan,Yiming Wang,Marco Cristani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Is Segment Anything Model 3 (SAM3) capable of segmenting any pathology image? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. With this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings, including zero-shot, few-shot, and supervised, with varying prompting strategies. Our extensive evaluation on pathological datasets including NuInsSeg, PanNuke and GlaS reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) SAM3 is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3’s boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.
[CV-38] Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation
【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)中指令与观测动态耦合问题:传统模型将指令编码为静态全局表示,无法随代理的视觉场景和空间上下文变化而调整指令语义,导致导航性能受限。解决方案的关键在于提出“指令即状态”(Instruction-as-State)建模范式——将指令理解视为一个由感知状态(perceptual state)条件驱动的、逐token更新的动态变量。为此,作者设计了粗粒度到细粒度的S-EGIU框架:首先激活与当前观测语义对齐的指令片段,再通过观察引导的token锚定与上下文建模实现语义细化,从而在导航过程中持续更新与感知状态耦合的指令状态,显著提升多基准测试下的导航准确率与效率。
链接: https://arxiv.org/abs/2604.18223
作者: Zhen Liu,Yuhan Liu,Jinjun Wang,Jianyi Liu,Wei Song,Jingwen Fu
机构: Xi’an Jiaotong University (西安交通大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of instruction shifts as the agent’s field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent’s perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine framework for state-conditioned segment activation and token-level semantic refinement. At the coarse level, S-EGIU activates the instruction segment whose semantics align with the current observation. At the fine level, it refines the activated segment through observation-guided token grounding and contextual modeling, sharpening its internal semantics under the current observation. Together, these stages maintain an instruction state that is continuously updated according to the agent’s perceptual state during navigation. S-EGIU delivers strong performance on several key metrics, including a +2.68% SPL gain on REVERIE Test Unseen, and demonstrates consistent efficiency gains across multiple VLN benchmarks, underscoring the value of dynamic instruction–perception entanglement.
[CV-39] Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
【速读】:该论文旨在解决长时视频生成中时空一致性难以保持的问题,特别是现有方法将记忆建模与生成过程耦合,导致在场景重复访问时内容不一致,且在探索新区域时生成能力受限。其解决方案的关键在于提出一种解耦框架,将记忆条件控制与生成模块分离:通过轻量级独立的记忆分支学习精确的空间一致性;引入混合记忆表示以融合时序与空间线索,并利用帧级交叉注意力机制确保每帧仅依赖最相关的历史信息进行条件控制;同时设计相机感知门控机制,在生成新场景时仅在存在有意义历史参考时激活记忆条件,从而显著降低训练成本并提升空间一致性与新场景生成能力。
链接: https://arxiv.org/abs/2604.18215
作者: Yanjun Guo,Zhengqiang Zhang,Pengfei Wang,Xinyue Liang,Zhiyuan Ma,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, with supplementary material
Abstract:Spatially consistent long-horizon video generation aims to maintain temporal and spatial consistency along predefined camera trajectories. Existing methods mostly entangle memory modeling with video generation, leading to inconsistent content during scene revisits and diminished generative capacity when exploring novel regions, even trained on extensive annotated data. To address these limitations, we propose a decoupled framework that separates memory conditioning from generation. Our approach significantly reduces training costs while simultaneously enhancing spatial consistency and preserving the generative capacity for novel scene exploration. Specifically, we employ a lightweight, independent memory branch to learn precise spatial consistency from historical observation. We first introduce a hybrid memory representation to capture complementary temporal and spatial cues from generated frames, then leverage a per-frame cross-attention mechanism to ensure each frame is conditioned exclusively on the most spatially relevant historical information, which is injected into the generative model to ensure spatial consistency. When generating new scenes, a camera-aware gating mechanism is proposed to mediate the interaction between memory and generation modules, enabling memory conditioning only when meaningful historical references exist. Compared with the existing method, our method is highly data-efficient, yet the experiments demonstrate that our approach achieves state-of-the-art performance in terms of both visual quality and spatial consistency.
[CV-40] Towards Symmetry-sensitive Pose Estimation: A Rotation Representation for Symmetric Object Classes
【速读】:该论文旨在解决对称物体在6D姿态估计中因固有方向歧义(orientation ambiguity)导致深度学习网络训练困难的问题。现有方法通常依赖特定损失函数、网络结构设计或对称不变评价指标,而本文提出的关键解决方案是通过改进三角恒等式,将物体形状的对称度数融入旋转的数值表示中,构建出名为SARR(Symmetry-Aware Rotation Representation)的新表示方法。该方法能够为对称物体生成唯一且连续的规范姿态(canonic pose),从而使得标准卷积神经网络(CNN)可以直接用于3D方向估计,并在使用对称敏感的余弦距离(ARC)评估时显著优于当前最优方法;此外,SARR无需3D模型输入,仅需深度图或无纹理RGB/灰度图像即可实现高性能,且在推理阶段即使缺乏物体对称性先验知识仍表现优异。
链接: https://arxiv.org/abs/2604.18208
作者: Andreas Kriegler,Csaba Beleznai,Margrit Gelautz
机构: TU Wien (维也纳工业大学); AIT Austrian Institute of Technology (奥地利技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Geometric Topology (math.GT)
备注: Published Open-Access in IJCV, see this https URL . 28 pages, 6 figures, 9 tables, 1 algorithm
Abstract:Symmetric objects are common in daily life and industry, yet their inherent orientation ambiguities that impede the training of deep learning networks for pose estimation are rarely discussed in the literature. To cope with these ambiguities, existing solutions typically require the design of specific loss functions and network architectures or resort to symmetry-invariant evaluation metrics. In contrast, we focus on the numeric representation of the rotation itself, modifying trigonometric identities with the degrees of symmetry derived from the objects’ shapes. We use our representation, SARR, to obtain canonic (symmetry-resolved) poses for the symmetric objects in two popular 6D pose estimation datasets, T-LESS and ITODD, where SARR is unique and continuous w.r.t. the visual appearance. This allows us to use a standard CNN for 3D orientation estimation whose performance is evaluated with the symmetry-sensitive cosine distance AR_C. Our networks outperform the state of the art using AR_C and achieve satisfactory performance when using conventional symmetry-invariant measures. Our method does not require any 3D models but only depth, or, as part of an additional experiment, texture-less RGB/grayscale images as input. We also show that networks trained on SARR outperform the same networks trained on rotation matrices, Euler angles, quaternions, standard trigonometrics or the recently popular 6d representation – even in inference scenarios where no prior knowledge of the objects’ symmetry properties is available. Code and a visualization toolkit are available at this https URL .
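The core idea behind a symmetry-aware angle representation can be illustrated for the simplest case of N-fold in-plane symmetry: mapping the angle θ to (cos Nθ, sin Nθ) makes all N symmetry-equivalent poses map to the same point while remaining continuous in appearance. This is a generic sketch of the principle, not SARR's exact formulation (which handles full 3D rotations); function names are illustrative.

```python
import math

def symmetry_aware_rep(theta, n_sym):
    """Map an in-plane rotation angle to a representation that is identical
    for all n_sym symmetry-equivalent poses and continuous in appearance."""
    return (math.cos(n_sym * theta), math.sin(n_sym * theta))

def canonical_angle(rep, n_sym):
    """Recover the canonical (symmetry-resolved) angle in [0, 2*pi/n_sym)."""
    c, s = rep
    return (math.atan2(s, c) % (2 * math.pi)) / n_sym

# a 4-fold symmetric object looks identical every 90 degrees:
r1 = symmetry_aware_rep(0.3, 4)
r2 = symmetry_aware_rep(0.3 + math.pi / 2, 4)
print(all(abs(a - b) < 1e-9 for a, b in zip(r1, r2)))  # True
```

Because the representation is single-valued over symmetry-equivalent poses, a standard regression CNN no longer receives contradictory targets for visually identical inputs, which is the ambiguity the paper sets out to remove.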
[CV-41] A Comparative Evaluation of Geometric Accuracy in NeRF and Gaussian Splatting
【速读】:该论文旨在解决当前神经渲染(Neural Rendering)方法在评估过程中过度依赖视觉质量指标而忽视表面几何精度的问题,尤其在机器人领域中,精确的几何信息对抓取和操作任务至关重要。其解决方案的关键在于提出了一套专注于几何准确性的评估流程,并构建了一个包含19个多样化场景的基准测试集,从而实现对重建方法在表面和形状保真度方面的系统性量化评估,弥补了传统视觉指标的不足。
链接: https://arxiv.org/abs/2604.18205
作者: Mikolaj Zielinski,Eryk Vykysaly,Bartlomiej Biesiada,Jan Baturo,Mateusz Capala,Dominik Belter
机构: Poznan University of Technology (波兹南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Recent advances in neural rendering have introduced numerous 3D scene representations. Although standard computer vision metrics evaluate the visual quality of generated images, they often overlook the fidelity of surface geometry. This limitation is particularly critical in robotics, where accurate geometry is essential for tasks such as grasping and object manipulation. In this paper, we present an evaluation pipeline for neural rendering methods that focuses on geometric accuracy, along with a benchmark comprising 19 diverse scenes. Our approach enables a systematic assessment of reconstruction methods in terms of surface and shape fidelity, complementing traditional visual metrics.
[CV-42] DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery ICLR2026
【速读】:该论文旨在解决遥感影像中目标定位(object grounding)的准确性问题,尤其是在复杂场景下如何实现鲁棒且自适应的目标边界框(bounding box)生成。其解决方案的关键在于提出了一种混合流水线,将基于扩散模型(diffusion models)提供的定位线索与先进的分割模型(如RemoteSAM和SAM3)相结合,通过融合生成式模型的语义感知能力与基础分割模型的高精度边界框预测能力,显著提升了定位性能,在Acc@0.5指标上较现有最优方法提高了超过14%。
链接: https://arxiv.org/abs/2604.18201
作者: Geet Sethi,Panav Shah,Ashutosh Gandhe,Soumitra Darshan Nayak
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICLR 2026 ML4RS Workshop
Abstract:Diffusion models have emerged as powerful tools for a wide range of vision tasks, including text-guided image generation and editing. In this work, we explore their potential for object grounding in remote sensing imagery. We propose a hybrid pipeline that integrates diffusion-based localization cues with state-of-the-art segmentation models such as RemoteSAM and SAM3 to obtain more accurate bounding boxes. By leveraging the complementary strengths of generative diffusion models and foundational segmentation models, our approach enables robust and adaptive object localization across complex scenes. Experiments demonstrate that our pipeline significantly improves localization performance, achieving over a 14% increase in Acc@0.5 compared to existing state-of-the-art methods.
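Acc@0.5, the metric the paper reports, is the standard IoU-thresholded grounding accuracy: a prediction counts as correct when its box overlaps the ground truth with IoU ≥ 0.5. A minimal sketch of how it is computed (boxes as (x1, y1, x2, y2) tuples, one prediction per ground truth, an assumption for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def acc_at_05(preds, gts):
    """Fraction of predictions whose IoU with the ground truth is >= 0.5."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(preds, gts))
    return hits / len(gts)

print(acc_at_05([(0, 0, 10, 10), (0, 0, 10, 10)],
                [(0, 0, 10, 10), (20, 20, 30, 30)]))  # 0.5
```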
[CV-43] Attraction, Repulsion and Friction: Introducing DMF, a Friction-Augmented Drifting Model
【速读】:该论文旨在解决生成式 AI(Generative AI)中基于漂移场(drift field)的模型在推理阶段避免常微分方程(ODE)积分的同时,如何保证分布匹配的理论严谨性和实际性能的问题。核心挑战在于:原始漂移模型存在局部排斥区域且未证明漂移场为零(Vp,q≡0)是否足以确保学习分布 q 等于目标分布 p。解决方案的关键在于提出带摩擦项的漂移模型(DMF),通过推导代理模型的收缩阈值并引入线性调度的摩擦系数,实现有限时间内的误差轨迹上界;同时在高斯核假设下严格证明了漂移场平衡点的可识别性——即漂移场在任意开集上消失可推出 q=p,从而闭合了前人工作的逆命题。实验表明,DMF 在训练计算量降低16倍的情况下,性能优于或等同于最优流匹配(Optimal Flow Matching)在FFHQ成人到儿童图像转换任务中的表现。
链接: https://arxiv.org/abs/2604.18194
作者: Arkadii Kazanskii,Tatiana Petrova,Konstantin Bagrianskii,Aleksandr Puzikov,Radu State
机构: SEDAN, SnT, University of Luxembourg (SEDAN, SnT, 卢森堡大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 2 figures, 2 tables
Abstract:Drifting Models [Deng et al., 2026] train a one-step generator by evolving samples under a kernel-based drift field, avoiding ODE integration at inference. The original analysis leaves two questions open. The drift-field iteration admits a locally repulsive regime in a two-particle surrogate, and vanishing of the drift ($V_{p,q} \equiv 0$) is not known to force the learned distribution $q$ to match the target $p$. We derive a contraction threshold for the surrogate and show that a linearly-scheduled friction coefficient gives a finite-horizon bound on the error trajectory. Under a Gaussian kernel we prove that the drift-field equilibrium is identifiable: vanishing of $V_{p,q}$ on any open set forces $q = p$, closing the converse of Proposition 3.1 of Deng et al. Our friction-augmented model, DMF (Drifting Model with Friction), matches or exceeds Optimal Flow Matching on FFHQ adult-to-child domain translation at 16x lower training compute.
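The effect of a linearly-scheduled friction coefficient can be illustrated on a toy one-particle version of the surrogate dynamics. Everything below is an illustrative stand-in, assuming a simple attractive drift and a semi-implicit damped update, not the paper's exact iteration:

```python
def friction_schedule(t, T, mu0=1.0):
    """Linearly decaying friction coefficient over a finite horizon T."""
    return mu0 * (1.0 - t / T)

def drift_step(x, v, drift, mu, dt=0.1):
    """One friction-damped update: velocity follows the drift field but is
    damped by mu, keeping the toy system in a contractive regime."""
    v = (1.0 - mu * dt) * v + dt * drift(x)
    return x + dt * v, v

# toy drift pulling a particle toward the target at 0:
drift = lambda z: -z
x, v = 1.0, 0.0
for t in range(50):
    x, v = drift_step(x, v, drift, friction_schedule(t, 50))
print(abs(x) < 1.0)  # the friction-damped trajectory contracts toward the target
```

Without the damping term, the same toy update can overshoot and oscillate, which mirrors the locally repulsive regime the paper identifies in the two-particle surrogate.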
[CV-44] CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition
【速读】:该论文旨在解决连续手语识别(Continuous Sign Language Recognition, CSLR)在多视角场景下鲁棒性不足的问题,尤其针对真实世界中因视角变化导致的识别性能下降。其核心解决方案是提出一种基于规范视图(canonical view)引导的多视角CSLR框架——CanonSLR,关键在于引入“前视图锚定的师生学习策略”,即利用前视图训练的教师网络为学生网络提供结构化的时序监督信号;同时结合序列级软目标蒸馏(Sequence-Level Soft-Target Distillation)以减少跨视角语义差异,并通过时间运动关系增强(Temporal Motion Relational Enhancement)显式建模高层视觉特征中的运动感知时序关系,从而提升动态表征稳定性并抑制视角敏感的外观干扰。
链接: https://arxiv.org/abs/2604.18184
作者: Xu Wang,Shengeng Tang,Wan Jiang,Yaxiong Wang,Lechao Cheng,Richang Hong
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continuous Sign Language Recognition (CSLR) has achieved remarkable progress in recent years; however, most existing methods are developed under single-view settings and thus remain insufficiently robust to viewpoint variations in real-world scenarios. To address this limitation, we propose CanonSLR, a canonical-view guided framework for multi-view CSLR. Specifically, we introduce a frontal-view-anchored teacher-student learning strategy, in which a teacher network trained on frontal-view data provides canonical temporal supervision for a student network trained on all viewpoints. To further reduce cross-view semantic discrepancy, we propose Sequence-Level Soft-Target Distillation, which transfers structured temporal knowledge from the frontal view to non-frontal samples, thereby alleviating gloss boundary ambiguity and category confusion caused by occlusion and projection variation. In addition, we introduce Temporal Motion Relational Enhancement to explicitly model motion-aware temporal relations in high-level visual features, strengthening stable dynamic representations while suppressing viewpoint-sensitive appearance disturbances. To support multi-view CSLR research, we further develop a universal multi-view sign language data construction pipeline that transforms original single-view RGB videos into semantically consistent, temporally coherent, and viewpoint-controllable multi-view sign language videos. Based on this pipeline, we extend PHOENIX-2014T and CSL-Daily into two seven-view benchmarks, namely PT14-MV and CSL-MV, providing a new experimental foundation for multi-view CSLR. Extensive experiments on PT14-MV and CSL-MV demonstrate that CanonSLR consistently outperforms existing approaches under multi-view settings and exhibits stronger robustness, especially on challenging non-frontal views.
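Sequence-level soft-target distillation of the kind described above generally amounts to matching temperature-softened per-step teacher distributions with a KL term. A minimal numpy sketch; the temperature value and the exact KL form are generic assumptions for illustration, not necessarily the paper's loss:

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_distillation(student_logits, teacher_logits, tau=2.0):
    """Mean KL(teacher || student) over a sequence of per-step gloss
    distributions, softened by temperature tau."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
t = rng.standard_normal((5, 10))       # 5 time steps, 10 gloss classes
print(soft_target_distillation(t, t))  # identical logits -> 0.0 loss
```

In the teacher-student setup above, the teacher logits would come from the frontal-view network and the student logits from the all-view network, so the loss transfers canonical-view temporal structure to non-frontal samples.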
[CV-45] Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation CVPR2026
【速读】:该论文旨在解决如何将MeanFlow(一种少步生成方法)从仅支持固定类别标签的图像生成扩展至支持灵活文本条件输入的问题,从而实现更丰富的文本条件图像合成。其核心挑战在于,相较于离散且易区分的类别特征,文本条件对模型的理解能力要求更高,而MeanFlow本身仅有极少数优化步骤(如一步),导致文本特征需具备高度判别性才能有效驱动生成过程。解决方案的关键在于:首先通过细致分析揭示了文本特征判别力不足是性能不佳的根本原因;其次,利用经过验证具有强语义区分能力的大语言模型(LLM)级文本编码器,并针对性地调整MeanFlow的生成流程以适配该编码器,最终实现了首个高效、高质量的文本条件MeanFlow图像生成方法。
链接: https://arxiv.org/abs/2604.18168
作者: Chenxi Zhao,Chen Zhu,Xiaokun Feng,Aiming Hao,Jiashu Zhu,Jiachen Lei,Jiahong Wu,Xiangxiang Chu,Jufeng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026
Abstract:Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model’s understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at this https URL.
[CV-46] Embedding Arithmetic: A Lightweight Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models
【速读】:该论文旨在解决现代文本到图像(Text-to-Image, T2I)生成模型在推理阶段放大的社会偏见问题,同时保持提示语义和视觉上下文(如背景、布局与风格)的一致性,从而实现公平性与图像保真度之间的有效平衡。其解决方案的关键在于提出一种无需微调模型权重、修改提示或数据集的推理时干预方法,基于嵌入空间算术(Embedding Arithmetic)分析偏见在潜在空间中的结构,并通过调整条件嵌入空间中的特定向量来纠正偏见,而不会破坏原始语义信息。该方法在FLUX 1.0-Dev和Stable Diffusion 3.5-Large等主流模型上验证了其有效性,尤其通过引入概念一致性评分(Concept Coherence Score, CCS)这一更鲁棒的评估指标,证明其在提升多样性的同时显著维持高概念保真度,为生成式AI的公平性治理提供了可解释且可控的几何路径。
链接: https://arxiv.org/abs/2604.18167
作者: Venkatesh Thirugnana Sambandham,Torsten Schön
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: A demo notebook with basic implementations can be found at \url{ this https URL }
Abstract:Modern text-to-image (T2I) models amplify harmful societal biases, challenging their ethical deployment. We introduce an inference-time method that reliably mitigates social bias while keeping prompt semantics and visual context (background, layout, and style) intact. This ensures context persistency and provides a controllable parameter to adjust mitigation strength, giving practitioners fine-grained control over fairness-coherence trade-offs. Using Embedding Arithmetic, we analyze how bias is structured in the embedding space and correct it without altering model weights, prompts, or datasets. Experiments on FLUX 1.0-Dev and Stable Diffusion 3.5-Large show that the conditional embedding space forms a complex, entangled manifold rather than a grid of disentangled concepts. To rigorously assess semantic preservation beyond the circularity and bias limitations of CLIP scores, we propose the Concept Coherence Score (CCS). Evaluated against this robust metric, our lightweight, tuning-free method significantly outperforms existing baselines in improving diversity while maintaining high concept coherence, effectively resolving the critical fairness-coherence trade-off. By characterizing how models represent social concepts, we establish geometric understanding of latent space as a principled path toward more transparent, controllable, and fair image generation.
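Embedding arithmetic of this kind can be sketched as shifting a prompt embedding along an estimated bias direction with a controllable strength α, leaving the orthogonal (semantic) component untouched. Estimating the direction as a difference of two attribute-prompt embeddings is an assumption for illustration; the paper's analysis of the entangled manifold is more involved.

```python
import numpy as np

def bias_direction(emb_a, emb_b):
    """Unit vector from concept A to concept B in embedding space
    (e.g. embeddings of two attribute prompts)."""
    d = emb_b - emb_a
    return d / np.linalg.norm(d)

def steer(embedding, direction, alpha):
    """Shift a prompt embedding along the bias direction with controllable
    strength alpha; the component orthogonal to the direction is preserved."""
    return embedding + alpha * direction

rng = np.random.default_rng(0)
e = rng.standard_normal(8)
d = bias_direction(rng.standard_normal(8), rng.standard_normal(8))
shifted = steer(e, d, 0.5)
# only the component along d changes:
print(np.allclose(shifted - e, 0.5 * d))  # True
```

The scalar α plays the role of the paper's mitigation-strength parameter: α = 0 reproduces the original embedding, and larger |α| trades coherence for diversity.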
[CV-47] AI-based Waste Mapping for Addressing Climate-Exacerbated Flood Risk
【速读】:该论文旨在解决非洲快速扩张城市中因垃圾管理不善导致排水系统堵塞、进而加剧城市内涝(urban flooding)的问题。其解决方案的关键在于提出了一种基于人工智能(AI)的高分辨率城市垃圾制图工作流程,利用公开获取的航空影像和街景图像自动识别市政固体废物的空间分布,并结合本地合作伙伴进行文化与情境相关的数据标注,从而精准识别出垃圾堆积热点区域(如非正式住区),为城市规划、气候适应和可持续废物管理提供可操作的决策依据。
链接: https://arxiv.org/abs/2604.18151
作者: Steffen Knoblauch,Levi Szamek,Iddy Chazua,Benedcto Adamu,Innocent Maholi,Alexander Zipf
机构: Heidelberg University (海德堡大学); OpenMap Development Tanzania (OpenMap开发坦桑尼亚)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:Urban flooding is a growing climate change-related hazard in rapidly expanding African cities, where inadequate waste management often blocks drainage systems and amplifies flood risks. This study introduces an AI-powered urban waste mapping workflow that leverages openly available aerial and street-view imagery to detect municipal solid waste at high resolution. Applied in Dar es Salaam, Tanzania, our approach reveals spatial waste patterns linked to informal settlements and socio-economic factors. Waste accumulation in waterways was found to be up to three times higher than in adjacent urban areas, highlighting critical hotspots for climate-exacerbated flooding. Unlike traditional manual mapping methods, this scalable AI approach allows city-wide monitoring and prioritization of interventions. Crucially, our collaboration with local partners ensured culturally and contextually relevant data labeling, reflecting real-world reuse practices for solid waste. The results offer actionable insights for urban planning, climate adaptation, and sustainable waste management in flood-prone urban areas.
[CV-48] Attention-ResUNet for Automated Fetal Head Segmentation
【速读】:该论文旨在解决超声图像中胎儿头部自动分割的难题,尤其针对低对比度、噪声干扰以及复杂解剖边界等挑战性问题。其解决方案的关键在于提出一种新型网络架构——Attention-ResUNet,该架构通过在解码器四个层级集成注意力门(attention gates)来聚焦于解剖学相关区域并抑制背景噪声,同时结合残差连接(residual connections)以促进梯度流动和特征复用,从而显著提升分割精度与模型可解释性。
链接: https://arxiv.org/abs/2604.18148
作者: Ammar Bhilwarawala,Mainak Bandyopadhyay
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted and Presented at ANTIC 2025, IIITM Gwalior (5th International Conference on Advanced Network Technologies and Intelligent Computing) on 23rd December 2025. Presented with the best paper award in Image Processing Track
Abstract:Automated fetal head segmentation in ultrasound images is critical for accurate biometric measurements in prenatal care. While existing deep learning approaches have achieved reasonable performance, they struggle with issues like low contrast, noise, and complex anatomical boundaries which are inherent to ultrasound imaging. This paper presents Attention-ResUNet, a novel architecture that synergistically combines residual learning with multi-scale attention mechanisms to achieve enhanced fetal head segmentation. Our approach integrates attention gates at four decoder levels to focus selectively on anatomically relevant regions while suppressing background noise, complemented by residual connections that facilitate gradient flow and feature reuse. Extensive evaluation on the HC18 Challenge dataset (n = 200) demonstrates that Attention-ResUNet achieves superior performance with a mean Dice score of 99.30 +/- 0.14% against similar architectures. It significantly outperforms five baseline architectures including ResUNet (99.26%), Attention U-Net (98.79%), Swin U-Net (98.60%), Standard U-Net (98.58%), and U-Net++ (97.46%). Statistical analysis confirms highly significant improvements (p < 0.001) with effect sizes that range from 0.230 to 13.159 (Cohen’s d). Saliency map analysis reveals that our architecture produces highly concentrated, anatomically consistent activation patterns, demonstrating an enhanced interpretability which is crucial for clinical deployment. The proposed method establishes a new state-of-the-art performance for automated fetal head segmentation whilst maintaining computational efficiency with 14.7M parameters and a 45 GFLOPs inference cost. Code repository: this https URL
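The attention gates placed at the decoder levels follow the additive-attention pattern popularised by Attention U-Net: a mask in (0, 1) is computed from the skip features and a gating signal, then used to re-weight the skip features. A minimal numpy sketch with the 1x1 convolutions modelled as channel-mixing matrices; shapes and names are illustrative, not the paper's code:

```python
import numpy as np

def attention_gate(x, g, w_x, w_g, psi):
    """Additive attention gate: skip features x (CxHxW) are re-weighted by a
    spatial mask computed from x and the gating signal g."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    relu = lambda z: np.maximum(z, 0.0)
    # 1x1 "convolutions" as channel-mixing matrices
    q = relu(np.einsum('oc,chw->ohw', w_x, x) + np.einsum('oc,chw->ohw', w_g, g))
    alpha = sigmoid(np.einsum('oc,chw->ohw', psi, q))  # 1xHxW mask in (0, 1)
    return x * alpha

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
g = rng.standard_normal((4, 8, 8))
out = attention_gate(x, g, rng.standard_normal((4, 4)),
                     rng.standard_normal((4, 4)), rng.standard_normal((1, 4)))
print(out.shape)  # (4, 8, 8)
```

Because the mask is bounded in (0, 1), the gate can only attenuate skip features, which is what lets it suppress background noise without amplifying it.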
[CV-49] Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework ACL2026
【速读】:该论文旨在解决3D PET/CT医学影像报告自动生成中因高维数据特性与标注数据稀缺(尤其是低资源语言)导致的临床可靠性不足问题,同时克服现有黑箱模型忽略放射科医生从局部区域(Region of Interest, RoI)分析得出诊断结论的临床流程。其解决方案的关键在于提出首个面向低资源语言的大型3D PET/CT细粒度RoI标注数据集VietPET-RoI(含600例样本和1960个手动标注RoI),并设计HiRRA框架——该框架通过图结构关系模块模拟专业放射科医生的诊断流程,显式建模RoI属性间的依赖关系,实现从全局模式匹配向局部临床发现的转变;此外,引入基于大语言模型(LLM)提取的RoI Coverage和RoI Quality Index两项新临床评估指标,有效量化RoI定位准确性和属性描述保真度,实验表明该方法在BLEU和ROUGE-L上优于现有模型19.7%和4.7%,并在临床指标上提升45.8%,显著增强生成报告的临床可信度与减少幻觉。
链接: https://arxiv.org/abs/2604.18145
作者: Cong Huy Nguyen,Son Dinh Nguyen,Guanlin Li,Tuan Dung Nguyen,Aditya Narayan Sankaran,Mai Huy Thong,Thanh Trung Nguyen,Mai Hong Son,Reza Farahbakhsh,Phi Le Nguyen,Noel Crespi
机构: AI4LIFE, Hanoi University of Science and Technology, Vietnam; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France; 108 Military Central Hospital, Vietnam
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages; Accepted to appear in ACL 2026
Abstract:Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.
[CV-50] Soft Label Pruning and Quantization for Large-Scale Dataset Distillation
【速读】:该论文旨在解决大规模数据集蒸馏(Dataset Distillation)中软标签(soft labels)存储开销过大的问题,尤其在ImageNet-1K和ImageNet-21K上,软标签体积可达压缩图像的30–40倍和200倍,严重违背了数据压缩的目标。作者识别出两个根本原因:一是合成图像类内相似度过高导致需要大量增强以提升多样性;二是训练过程中监督信号种类不足,在高压缩率下造成性能下降。解决方案的关键在于提出Label Pruning and Quantization for Large-scale Distillation (LPQLD),通过两类机制优化:(1) 增强图像多样性——采用类别级分批(class-wise batching)与批量归一化监督(batch-normalization supervision)进行图像合成;(2) 提升监督多样性——引入动态知识重用的标签剪枝(Label Pruning with Dynamic Knowledge Reuse)以提高每增强样本的标签多样性,以及校准师生对齐的标签量化(Label Quantization with Calibrated Student-Teacher Alignment)以增强每张图像的增强多样性。最终在保持高压缩比的同时显著提升准确率,并大幅降低软标签存储需求(ImageNet-1K减少78倍,ImageNet-21K减少500倍)。
链接: https://arxiv.org/abs/2604.18135
作者: Xiao Lingao,Yang He
机构: CFAR, Agency for Science, Technology and Research, Singapore (新加坡科技研究局); IHPC, Agency for Science, Technology and Research, Singapore (新加坡科技研究局); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40x larger on ImageNet-1K and 200x larger on ImageNet-21K than the condensed images, undermining the goal of dataset compression. We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and batch-normalization supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to improve label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to improve augmentation-per-image diversity. Our approach reduces soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K while improving accuracy by up to 7.2% and 2.8%, respectively. Extensive experiments validate the superiority of LPQLD across different network architectures and dataset distillation methods. Code is available at this https URL.
[CV-51] Can LLM-Generated Text Empower Surgical Vision-Language Pre-training? CVPR
【速读】:该论文旨在解决多模态医学视觉预训练中因依赖昂贵专家文本标注而导致的可扩展性瓶颈问题。现有基于对比学习的视觉-语言预训练方法在使用噪声文本(如大语言模型生成的叙述)时,可能引入幻觉等错误,从而破坏医疗先验知识的可靠性。解决方案的关键在于提出SurgLIME框架:它采用LoRA适配的双编码器结构以保留视觉基础模型的医学先验,并引入自动化置信度估计机制,在对比对齐过程中动态降低不确定文本的权重,从而实现鲁棒的跨模态对齐与性能保持。
链接: https://arxiv.org/abs/2604.18134
作者: Chengan Che,Chao Wang,Jiayuan Huang,Xinyue Chen,Luis C. Garcia-Peraza-Herrera
机构: King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPRW 2026 (AI4RWC Oral presentation)
Abstract:Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce LIME, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could potentially lead to catastrophically degraded pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose SurgLIME, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments using noisy narratives. SurgLIME preserves foundational medical priors using a LoRA-adapted dual-encoder architecture and introduces an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment. Evaluations on the AutoLaparo and Cholec80 benchmarks show that SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear probing performance of the visual foundation model. Dataset, code, and models are publicly available at this https URL.
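Confidence-weighted contrastive alignment of the kind described can be sketched as an InfoNCE (CLIP-style) loss whose per-pair terms are re-weighted by an estimated caption confidence, so uncertain narratives contribute less to the gradient. The weighting scheme and temperature below are assumptions for illustration, not SurgLIME's exact formulation:

```python
import numpy as np

def weighted_clip_loss(img, txt, conf, tau=0.07):
    """Image-to-text InfoNCE over matched embedding pairs, with per-pair
    confidence weights down-weighting uncertain (possibly hallucinated) captions."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -np.diag(log_p)                 # per-pair cross-entropy
    w = np.asarray(conf, dtype=float)
    return float((w * nll).sum() / w.sum())

rng = np.random.default_rng(0)
img = rng.standard_normal((4, 16))
txt = img + 0.1 * rng.standard_normal((4, 16))  # well-aligned pairs
print(weighted_clip_loss(img, txt, [1, 1, 1, 1]) >= 0.0)  # True
```

Setting a pair's confidence near zero effectively removes its caption from the alignment objective, which is how unreliable LLM-generated narratives can be tolerated without corrupting the learned priors.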
[CV-52] Chatting about Conditional Trajectory Prediction
【速读】:该论文旨在解决人机交互系统中轨迹预测的准确性问题,特别是现有方法通常忽略自车(ego agent)自身运动状态,仅基于静态信息建模周围代理(surrounding agents)之间的社会交互,导致预测结果不够精确且难以适配动态路径规划需求。其解决方案的关键在于提出一种跨时间域意图-交互方法(Cross time domain Intention-interactive method for conditional Trajectory prediction, CiT),通过联合分析不同时间域内的行为意图,实现跨时域的信息互补与融合:自车在自身时间域中的意图可借助其他时间域的社会交互信息进行修正,从而获得更精准的意图表征;同时,CiT 与机器人运动规划和控制模块紧密集成,可根据自车潜在运动生成所有周围代理的多组可选轨迹预测结果,显著提升预测性能并推动系统向安全、协同的决策演化。
链接: https://arxiv.org/abs/2604.18126
作者: Yuxiang Zhao,Wei Huang,Haipeng Zeng,Huan Zhao,Yujie Song
机构: Sun Yat-sen University (中山大学); Alibaba Group (阿里巴巴集团)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human behavior has the nature of mutual dependencies, which requires human-robot interactive systems to predict surrounding agents' trajectories by modeling complex social interactions, avoiding collisions and executing safe path planning. While there exist many trajectory prediction methods, most of them do not incorporate the ego agent's own motion and only model interactions based on static information. We are inspired by humans' theory of mind during trajectory selection and propose a Cross time domain intention-interactive method for conditional Trajectory prediction (CiT). Our proposed CiT conducts joint analysis of behavior intentions over time, and achieves information complementarity and integration across different time domains. The intention in its own time domain can be corrected by the social interaction information from the other time domain to obtain a more precise intention representation. In addition, CiT is designed to closely integrate with robotic motion planning and control modules, capable of generating a set of optional trajectory prediction results for all surrounding agents based on potential motions of the ego agent. Extensive experiments demonstrate that the proposed CiT significantly outperforms the existing methods, achieving state-of-the-art performance in the benchmarks.
[CV-53] Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
【速读】:该论文旨在解决视觉-语言-动作模型(Vision-Language-Action models, VLAs)在序列决策任务中对环境微小变化(如物体姿态的细微改变)敏感、表现脆弱的问题。作者指出,这种脆弱性源于轨迹过拟合(trajectory overfitting),即VLAs过度关注动作与实体之间的伪相关性,并复现记忆中的动作模式。解决方案的关键在于提出一种无需验证器的测试时自适应框架——扰动学习与延迟反馈(Perturbation learning with Delayed Feedback, PDF),其核心机制包括:基于不确定性的数据增强和动作投票以缓解伪相关性;自适应调度器动态分配增强预算以权衡性能与效率;以及一个轻量级扰动模块,通过延迟反馈回溯性地调整动作logits,修正模型过自信问题。实验表明,PDF在LIBERO和Atari基准上均显著提升任务成功率,为多模态决策智能体的可靠测试时自适应提供了实用路径。
链接: https://arxiv.org/abs/2604.18107
作者: Zehua Zang,Xi Wang,Fuchun Sun,Xiao Xu,Lixiang Lium,Jiahuan Zhou,Jiangmeng Li
机构: Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所); University of Chinese Academy of Sciences (中国科学院大学); Tsinghua University (清华大学); Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所); National Defense University (国防大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures, 5 tables
Abstract:Vision-Language-Action models (VLAs) achieve remarkable performance in sequential decision-making but remain fragile to subtle environmental shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to the spurious correlation between actions and entities, then reproduce memorized action patterns. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates the spurious correlation through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting the overconfidence issue. Experiments on LIBERO (+7.4% success rate) and Atari (+10.3 human-normalized score) demonstrate consistent gains of PDF in task success over vanilla VLA and VLA with test-time adaptation, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents. The code is available at this https URL.
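论文提出用"数据增强 + 动作投票"来抵消对伪相关特征的过拟合。下面给出一个纯 Python 玩具示意(策略、观测与扰动分布均为虚构,仅演示投票机制本身,与论文的 VLA 流水线无关):

```python
import random
from collections import Counter

def vote_action(policy, obs, augment, n_votes, seed=0):
    """对观测施加 n_votes 次随机扰动,收集各次预测动作并取多数票。"""
    rng = random.Random(seed)
    votes = [policy(augment(obs, rng)) for _ in range(n_votes)]
    return Counter(votes).most_common(1)[0][0]

def brittle_policy(obs):
    # 玩具策略:过拟合到 obs[3] 的一个狭窄区间,触发"记忆动作"
    return "memorized" if 0.9 < obs[3] < 1.0 else "robust"

def jitter(obs, rng):
    return [x + rng.uniform(-0.2, 0.2) for x in obs]

obs = [0.2, 0.1, 0.4, 0.95, 0.3]   # 伪相关特征恰好落在狭窄触发区间内
single = brittle_policy(obs)        # 单次推理命中伪相关 -> "memorized"
voted = vote_action(brittle_policy, obs, jitter, n_votes=101)
```

扰动后多数样本落出狭窄触发区间,投票结果回到稳健动作;这就是用增强投票稀释"动作-实体"伪相关的直观效果。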
[CV-54] Decision-Aware Attention Propagation for Vision Transformer Explainability
【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 模型预测过程难以解释的问题,特别是现有基于注意力机制的解释方法因依赖原始注意力权重而缺乏类别判别力,以及梯度类定位方法未能充分利用 Transformer 层级注意力传播机制的局限性。解决方案的关键在于提出决策感知的注意力传播(Decision-Aware Attention Propagation, DAP)方法:通过梯度定位估计 token 重要性,并将其作为决策相关先验注入逐层注意力滚播(layer-wise attention rollout)过程中,从而同时建模注意力的结构传播路径与对最终决策最相关的证据,生成更具类别敏感性、紧凑性和忠实性的归因图(attribution maps)。
链接: https://arxiv.org/abs/2604.18094
作者: Sehyeong Jo,Gangjae Jang,Haesol Park
机构: Aarhus University (奥胡斯大学); University of Colorado Boulder (科罗拉多大学博尔德分校); KIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 4 figures
Abstract:Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet their prediction process remains difficult to interpret because information is propagated through complex interactions across layers and attention heads. Existing attention-based explanation methods provide an intuitive way to trace information flow. However, they rely mainly on raw attention weights, which do not explicitly reflect the final decision and often lead to explanations with limited class discriminability. In contrast, gradient-based localization methods are more effective at highlighting class-specific evidence, but they do not fully exploit the hierarchical attention propagation mechanism of transformers. To address this limitation, we propose Decision-Aware Attention Propagation (DAP), an attribution method that injects decision-relevant priors into transformer attention propagation. By estimating token importance through gradient-based localization and integrating it into layer-wise attention rollout, the method captures both the structural flow of attention and the evidence most relevant to the final prediction. Consequently, DAP produces attribution maps that are more class-sensitive, compact, and faithful than those generated by conventional attention-based methods. Extensive experiments across Vision Transformer variants of different model scales show that DAP consistently outperforms existing baselines in both quantitative metrics and qualitative visualizations, indicating that decision-aware propagation is an effective direction for improving ViT interpretability.
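逐层注意力滚播(attention rollout)是 DAP 所依赖的基础算法:带残差项的注意力矩阵逐层左乘累积。下面是一个纯 Python 极简示意,其中"逐 token 重要性"用一个给定向量代替论文中的梯度先验(假设值,非原实现):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def normalize_rows(A):
    return [[v / max(sum(row), 1e-8) for v in row] for row in A]

def rollout(attn_layers, importance=None):
    """带残差项的注意力滚播;可选地用逐 token 重要性(此处为假设的
    梯度先验)对每层注意力的列重新加权,再逐层左乘累积。"""
    n = len(attn_layers[0])
    R = [[float(i == j) for j in range(n)] for i in range(n)]   # 单位阵
    for A in attn_layers:
        if importance is not None:
            A = [[A[i][j] * importance[j] for j in range(n)] for i in range(n)]
        A = normalize_rows([[A[i][j] + float(i == j) for j in range(n)]
                            for i in range(n)])   # 残差连接 + 行归一化
        R = matmul(A, R)
    return R

uniform = [[1 / 3] * 3 for _ in range(3)]          # 两层均匀注意力
layers = [uniform, uniform]
R_plain = rollout(layers)
R_prior = rollout(layers, importance=[1.0, 0.1, 1.0])   # 压低 token 1
```

注入先验后,第 0 个 token(可视作 CLS)对被降权 token 的归因变小,而各行仍保持归一化;这正是"在滚播中调节注意力亲和矩阵"的最小版本。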
[CV-55] Autonomous Unmanned Aircraft Systems for Enhanced Search and Rescue of Drowning Swimmers: Image-Based Localization and Mission Simulation
【速读】:该论文旨在解决水上救援中响应时间长、定位困难及救援人员运输效率低等挑战,尤其针对无监督且范围广阔的游泳区域。其核心解决方案是引入一种“无人机箱式系统”(Unmanned Aircraft System, UAS),由部署在泳区附近的专用机库中的多架无人飞行器(Unmanned Aerial Vehicles, UAVs)组成,能够在紧急情况下自动执行搜索与救援(Search and Rescue, SR)任务,通过图像识别技术精准定位溺水者并投掷救生浮具。关键技术在于利用YOLO系列目标检测模型实现对溺水者的自动识别,并结合离散事件仿真(Discrete-Event Simulation, DES)优化UAS配置以最小化响应时间,实验表明即使小型UAS系统也能将响应时间缩短至传统标准救援操作(Standard Rescue Operation, SRO)的五分之一。
链接: https://arxiv.org/abs/2604.18088
作者: Sascha Emanuel Zell,Toni Schneidereit,Armin Fügenschuh,Michael Breuß
机构: Brandenburg University of Technology Cottbus–Senftenberg (勃兰登堡工业大学科特布斯-森滕贝格)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: Submitted to “Applied Intelligence”
Abstract:Drowning is an omnipresent risk associated with any activity on or in the water, and rescuing a drowning person is particularly challenging because of the time pressure, making a short response time important. Further complicating water rescue are unsupervised and extensive swimming areas, precise localization of the target, and the transport of rescue personnel. Technical innovations can provide a remedy: We propose an Unmanned Aircraft System (UAS), also known as a drone-in-a-box system, consisting of a fleet of Unmanned Aerial Vehicles (UAVs) allocated to purpose-built hangars near swimming areas. In an emergency, the UAS can be deployed in addition to Standard Rescue Operation (SRO) equipment to locate the distressed person early by performing a fully automated Search and Rescue (SR) operation and dropping a flotation device. In this paper, we address automatically locating distressed swimmers using the image-based object detection architecture You Only Look Once (YOLO). We present a dataset created for this application and outline the training process. We evaluate the performance of YOLO versions 3, 5, and 8 and architecture sizes (nano, extra-large) using Mean Average Precision (mAP) metrics mAP@.5 and mAP@.5:.95. Furthermore, we present two Discrete-Event Simulation (DES) approaches to simulate response times of SRO and UAS-based water rescue. This enables estimation of time savings relative to SRO when selecting the UAS configuration (type, number, and location of UAVs and hangars). Computational experiments for a test area in the Lusatian Lake District, Germany, show that UAS assistance shortens response time. Even a small UAS with two hangars, each containing one UAV, reduces response time by a factor of five compared to SRO.
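论文用离散事件仿真比较标准救援(SRO)与无人机系统(UAS)的响应时间。下面是一个极简的解析替身(机库坐标、速度、准备时间均为虚构数值,仅示意"最近机库派遣可数倍缩短响应时间"的量级关系,并非论文的 DES 模型):

```python
import math

def response_time(origin, incident, speed, prep):
    """响应时间 = 出动准备时间 + 直线距离 / 速度(极简近似)。"""
    return prep + math.hypot(incident[0] - origin[0],
                             incident[1] - origin[1]) / speed

def uas_response(hangars, incident, uav_speed, prep):
    """从最近机库派出无人机(论文中 DES 仿真的一个极简替身)。"""
    return min(response_time(h, incident, uav_speed, prep) for h in hangars)

incident = (1000.0, 800.0)    # 遇险位置(米,假设)
# 救生艇:岸边出发,5 m/s,准备 120 s;无人机:两座机库,15 m/s,准备 30 s
t_sro = response_time((0.0, 0.0), incident, speed=5.0, prep=120.0)
t_uas = uas_response([(0.0, 0.0), (1500.0, 0.0)], incident,
                     uav_speed=15.0, prep=30.0)
```

在这组虚构参数下 UAS 响应时间约为 SRO 的四分之一;论文中"缩短至五分之一"的结论来自其真实 DES 实验,这里只演示计算结构。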
[CV-56] Class-specific diffusion models improve military object detection in a low-data domain
【速读】:该论文旨在解决军事目标检测在低数据条件下性能受限的问题,即如何利用有限的真实样本提升检测模型的泛化能力。解决方案的关键在于使用基于扩散机制的生成式AI(Generative AI)技术,通过仅用8或24张真实图像/类进行LoRA微调,构建类别特定的扩散模型以生成合成训练数据;进一步结合ControlNet结构引导(如Canny边缘图条件控制),实现对生成图像视角和姿态的显式控制,从而增强模型在极端数据稀缺场景下的检测精度(mAP₅₀最高提升8.0%)。该方法无需额外真实数据,直接利用原始有限样本驱动生成模型,显著优于传统模拟流水线。
链接: https://arxiv.org/abs/2604.18076
作者: Ella P. Fokkinga,Jan Erik van Woerden,Thijs A. Eker,Sebastiaan P. Snel,Elfi I.S. Hofmeijer,Klamer Schutte,Friso G. Heslinga
机构: TNO - Intelligent Imaging (TNO-智能成像)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to SPIE Defense + Security
Abstract:Diffusion-based image synthesis has emerged as a promising source of synthetic training data for AI-based object detection and classification. In this work, we investigate whether images generated with diffusion can improve military vehicle detection under low-data conditions. We fine-tuned the text-to-image diffusion model FLUX.1 [dev] using LoRA with only 8 or 24 real images per class across 15 vehicle categories, resulting in class-specific diffusion models, which were used to generate new samples from automatically generated text prompts. The same real images were used to fine-tune the RF-DETR detector for a 15-class object detection task. Synthetic datasets generated by the diffusion models were then used to further improve detector performance. Importantly, no additional real data was required, as the generative models leveraged the same limited training samples. FLUX-generated images improved detection performance, particularly in the low-data regime (up to +8.0% mAP_50 with 8 real samples). To address the limited geometric control of text prompt-based diffusion, we additionally generated structurally guided synthetic data using ControlNet with Canny edge-map conditioning, yielding a FLUX-ControlNet (FLUX-CN) dataset with explicit control over viewpoint and pose. Structural guidance further enhanced performance when data is scarce (+4.1% mAP_50 with 8 real samples), but no additional benefit was observed when more real data is available. This study demonstrates that object-specific diffusion models are effective for improving military object detection in a low-data domain, and that structural guidance is most beneficial when real data is highly limited. These results highlight generative image data as an alternative to traditional simulation pipelines for the training of military AI systems.
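论文以 LoRA 微调扩散模型。LoRA 的前向计算是标准公式 y = W x + (alpha/rank)·B(A x):基座权重 W 冻结,只训练低秩增量 B@A。下面用纯 Python 小矩阵示意(矩阵取值为演示用假设值):

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16.0, rank=2):
    """y = W x + (alpha / rank) * B (A x):冻结基座权重 W,
    仅训练低秩增量 B @ A(B 常零初始化,使微调从恒等增量起步)。"""
    delta = matvec(B, matvec(A, x))
    s = alpha / rank
    return [b + s * d for b, d in zip(matvec(W, x), delta)]

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # 冻结的基座权重(2x3)
A = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]    # rank=2 的下投影
B0 = [[0.0, 0.0], [0.0, 0.0]]             # 零初始化:输出等于基座
B1 = [[1.0, 0.0], [0.0, 1.0]]             # 训练后的(假设)上投影
x = [1.0, 2.0, 3.0]
y_init = lora_forward(W, A, B0, x)
y_tuned = lora_forward(W, A, B1, x)
```

零初始化时 y_init 与基座输出完全一致,这正是 LoRA 能在仅 8 张真实图像上安全微调大模型的结构性原因:增量从零开始、参数量只有 rank×(din+dout)。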
[CV-57] Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting CVPR2026
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在域-类增量学习(domain-class incremental learning)场景下的灾难性遗忘问题,其核心挑战在于如何高效且灵活地适应新任务,同时保留对旧知识的表征能力。现有方法多采用参数高效策略(如前缀调优(prefix-tuning)或适配器(adapters)),通过向输入token注入特定任务的加性向量来实现模型微调,但这些方法通常对所有前缀向量进行权重归一化处理,忽略了不同输入token对调整强度的需求差异。本文提出的动态前缀加权(Dynamic Prefix Weighting, DPW)框架的关键创新在于:1)引入门控模块(gating module),依据输入token的重要性动态调节每个前缀向量的权重;2)设计残差加权机制,将适配器输出权重作为前缀调优权重的残差项,从而仅在必要时激活适配器,避免冗余更新。此方案显著提升了模型在增量学习中的稳定性和性能表现。
链接: https://arxiv.org/abs/2604.18075
作者: Hyeonseo Jang,Hyuk Kwon,Kibok Lee
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026; revised text and figures for improved readability
Abstract:We investigate recently introduced domain-class incremental learning scenarios for vision-language models (VLMs). Recent works address this challenge using parameter-efficient methods, such as prefix-tuning or adapters, which facilitate model adaptation to downstream tasks by incorporating task-specific information into input tokens through additive vectors. However, previous approaches often normalize the weights of these vectors, disregarding the fact that different input tokens require different degrees of adjustment. To overcome this issue, we propose Dynamic Prefix Weighting (DPW), a framework that dynamically assigns weights to prefixes, complemented by adapters. DPW consists of 1) a gating module that adjusts the weights of each prefix based on the importance of the corresponding input token, and 2) a weighting mechanism that derives adapter output weights as a residual of prefix-tuning weights, ensuring that adapters are utilized only when necessary. Experimental results demonstrate that our method achieves state-of-the-art performance in domain-class incremental learning scenarios for VLMs. The code is available at: this https URL.
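DPW 的要点是:前缀权重随 token 重要性动态变化,适配器权重取前缀权重的残差。论文未给出门控的具体形式,下面用一个假设的线性打分 + sigmoid 门控做纯 Python 示意(函数形式与数值均为虚构):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpw_combine(token, prefix, adapter_out, gate_w):
    """逐 token 的加性调优:门控得分决定前缀向量权重 w_p,
    适配器权重取 1 - w_p(前缀权重的残差),仅在必要时启用适配器。"""
    score = sum(g * t for g, t in zip(gate_w, token))   # token 重要性得分
    w_p = sigmoid(score)
    w_a = 1.0 - w_p
    return [t + w_p * p + w_a * a
            for t, p, a in zip(token, prefix, adapter_out)]

token, prefix, adapter = [10.0, 0.0], [1.0, 1.0], [5.0, 5.0]
out_hi = dpw_combine(token, prefix, adapter, gate_w=[1.0, 1.0])    # w_p ≈ 1
out_lo = dpw_combine(token, prefix, adapter, gate_w=[-1.0, -1.0])  # w_p ≈ 0
```

门控得分高时输出几乎只含前缀增量(≈ token + prefix),得分低时几乎只含适配器增量(≈ token + adapter),体现"不同 token 需要不同调整强度"这一动机。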
[CV-58] INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval AAAI2026
【速读】:该论文旨在解决组成图像检索(Composed Image Retrieval, CIR)中因标注错误导致的噪声三元组(Noisy Triplet Correspondence, NTC)问题,尤其关注两类噪声:跨模态对应噪声(cross-modal correspondence noise)和模态内固有噪声(modality-inherent noise)。前者源于不同模态间的错配,后者则由模态内部背景干扰或与粗粒度修改标注无关的视觉因素引起,而后者常被忽视。解决方案的关键在于提出Invariance and DiscrimiNaTion-awarE Noise neTwork (INTENT),其核心包括两个模块:一是视觉不变性组合(Visual Invariant Composition),通过快速傅里叶变换(Fast Fourier Transform, FFT)在视觉侧实施因果干预,生成干预后的组合特征,从而抑制模态内固有噪声;二是双目标判别学习(Bi-Objective Discriminative Learning),结合正负样本协同优化,构建可扩展的决策边界并依据样本忠诚度动态调整判断,增强跨模态对应关系的鲁棒性。
链接: https://arxiv.org/abs/2604.18051
作者: Zhiwei Chen,Yupeng Hu,Zhiheng Fu,Zixu Li,Jiale Huang,Qinlei Huang,Yinwei Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026
Abstract:Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that enables retrieving target images based on multimodal queries consisting of reference images and modification texts. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. However, in real-world scenarios, due to high triplet annotation costs, CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. To address this issue, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. However, modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle the above issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), comprising two components: Visual Invariant Composition and Bi-Objective Discriminative Learning, specifically designed to handle the two-aspect noise. The former applies causal intervention on the visual side via Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter adopts collaborative optimization with both positive and negative samples, and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
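论文在视觉侧用 FFT 做因果干预,具体实现未公开;频域干预的一种常见做法是"保留相位、替换幅度"(相位承载结构、幅度承载低层外观统计)。下面用 1-D 信号和手写 DFT 给出通用技术的纯 Python 示意(非论文原实现):

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def swap_amplitude(content, style):
    """保留 content 的相位谱(承载结构),换上 style 的幅度谱
    (承载低层外观统计)——频域外观干预的常见做法。"""
    C, S = dft(content), dft(style)
    mixed = [abs(s) * cmath.exp(1j * cmath.phase(c))
             for c, s in zip(C, S)]
    return idft(mixed)

x = [0.0, 1.0, 0.0, -1.0, 0.5, 0.0, -0.5, 0.0]
y = swap_amplitude(x, x)   # 自身幅度 + 自身相位 => 应还原自身
```

自交换还原自身是该变换的一致性检查;实际干预时 style 取另一样本的幅度谱,即可在不破坏结构的前提下扰动外观因素。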
[CV-59] GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting
【速读】:该论文旨在解决连续时空视频超分辨率(Continuous Spatio-Temporal Video Super-Resolution, C-STVSR)中因依赖密集像素级网格查询而导致计算复杂度随插值帧数线性增长、推理效率低下的问题。现有基于隐式神经表示(Implicit Neural Representations, INR)的方法虽能实现任意尺度的时空增强,但其本质上的密集采样机制严重限制了实际应用中的速度表现。论文提出GS-STVSR框架,其核心创新在于引入2D高斯点绘(2D Gaussian Splatting, 2D-GS)机制,通过连续运动建模驱动高斯核的时空演化,从而完全规避了传统方法中的密集网格查询过程;关键设计包括:基于协方差参数的时间稳定性进行轻量中间拟合、基于光流引导的运动模块实现任意时间步的高斯位置与颜色推断、协方差重采样对齐模块防止协方差漂移,以及自适应偏移窗口以处理大范围运动。实验表明,该方案在多个基准数据集上达到最优质量,并在常规时域倍率(×2–×8)下保持近乎恒定的推理时间,在极端尺度(×32)下实现超过3倍的速度提升,显著增强了C-STVSR的实际可部署性。
链接: https://arxiv.org/abs/2604.18047
作者: Mingyu Shi,Xin Di,Long Peng,Boxiang Cao,Anran Wu,Zhanfeng Feng,Jiaming Guo,Renjing Pei,Xueyang Fu,Yang Cao,Zhengjun Zha
机构: University of Science and Technology of China (中国科学技术大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continuous Spatio-Temporal Video Super-Resolution (C-STVSR) aims to simultaneously enhance the spatial resolution and frame rate of videos by arbitrary scale factors, offering greater flexibility than fixed-scale methods that are constrained by predefined upsampling ratios. In recent years, methods based on Implicit Neural Representations (INR) have made significant progress in C-STVSR by learning continuous mappings from spatio-temporal coordinates to pixel values. However, these methods fundamentally rely on dense pixel-wise grid queries, causing computational cost to scale linearly with the number of interpolated frames and severely limiting inference efficiency. We propose GS-STVSR, an ultra-efficient C-STVSR framework based on 2D Gaussian Splatting (2D-GS) that drives the spatiotemporal evolution of Gaussian kernels through continuous motion modeling, bypassing dense grid queries entirely. We exploit the strong temporal stability of covariance parameters for lightweight intermediate fitting, design an optical flow-guided motion module to derive Gaussian position and color at arbitrary time steps, introduce a covariance resampling alignment module to prevent covariance drift, and propose an adaptive offset window for large-scale motion. Extensive experiments on Vid4, GoPro, and Adobe240 show that GS-STVSR achieves state-of-the-art quality across all benchmarks. Moreover, its inference time remains nearly constant at conventional temporal scales (×2–×8) and delivers over a 3× speedup at the extreme ×32 scale, demonstrating strong practical applicability.
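"用连续运动驱动高斯核、协方差保持稳定"的思路可以用一个极小的 2D 高斯加性渲染示意(去掉了 2D-GS 真实光栅化中的 alpha 混合与各向异性协方差,数值均为演示用假设):

```python
import math

def splat(gaussians, H, W):
    """把若干 2D 高斯核 (mu_x, mu_y, sigma, color) 加性渲染到 H x W 网格
    (2D-GS 光栅化的一个去掉 alpha 混合的极简版)。"""
    img = [[0.0] * W for _ in range(H)]
    for mx, my, sigma, color in gaussians:
        for y in range(H):
            for x in range(W):
                d2 = (x - mx) ** 2 + (y - my) ** 2
                img[y][x] += color * math.exp(-d2 / (2 * sigma ** 2))
    return img

def advect(gaussians, flow, dt):
    """沿光流 (vx, vy) 平移高斯中心到任意时间步,协方差(这里的 sigma)
    保持不变——对应"协方差时间上稳定、位置随运动演化"的设定。"""
    return [(mx + flow[0] * dt, my + flow[1] * dt, s, c)
            for mx, my, s, c in gaussians]

g0 = [(2.0, 2.0, 1.0, 1.0)]                     # t=0 时的单个高斯
g_half = advect(g0, flow=(4.0, 0.0), dt=0.5)    # 插值到中间时刻 t=0.5
frame = splat(g_half, H=5, W=8)
peak = max((frame[y][x], x, y) for y in range(5) for x in range(8))
```

任意 dt 都只需平移中心再渲染一次,代价与插值帧数无关,这就是相对逐像素网格查询的效率来源。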
[CV-60] HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval AAAI2026
【速读】:该论文旨在解决组成图像检索(Composed Image Retrieval, CIR)任务中因标注成本高和主观性强导致的噪声三元组对应(Noise Triplet Correspondence, NTC)问题,该问题严重影响模型在实际场景下的鲁棒性和性能。解决方案的关键在于提出了一种名为cHrono-synergiA roBust progressIve learning framework for composed image reTrieval (HABIT) 的框架,其核心创新包括:互知识估计模块(Mutual Knowledge Estimation Module),通过计算组合特征与目标图像之间互信息的转移率来量化样本纯净度,从而识别符合语义修改意图的干净样本;以及双一致性渐进学习模块(Dual-consistency Progressive Learning Module),通过历史与当前模型间的协同机制模拟人类习惯形成过程,保留良好习惯并校正不良习惯,实现对噪声数据的渐进式鲁棒适应。
链接: https://arxiv.org/abs/2604.18037
作者: Zixu Li,Yupeng Hu,Zhiwei Chen,Shiqi Zhang,Qinlei Huang,Zhiheng Fu,Yinwei Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026
Abstract:Composed Image Retrieval (CIR) is a flexible image retrieval paradigm that enables users to accurately locate the target image through a multimodal query composed of a reference image and modification text. Although this task has demonstrated promising applications in personalized search and recommendation systems, it encounters a severe challenge in practical scenarios known as the Noise Triplet Correspondence (NTC) problem. This issue primarily arises from the high cost and subjectivity involved in annotating triplet data. To address this problem, we identify two central challenges: the precise estimation of composed semantic discrepancy and the insufficient progressive adaptation to modification discrepancy. To tackle these challenges, we propose a cHrono-synergiA roBust progressIve learning framework for composed image reTrieval (HABIT), which consists of two core modules. First, the Mutual Knowledge Estimation Module quantifies sample cleanliness by calculating the Transition Rate of mutual information between the composed feature and the target image, thereby effectively identifying clean samples that align with the intended modification semantics. Second, the Dual-consistency Progressive Learning Module introduces a collaborative mechanism between the historical and current models, simulating human habit formation to retain good habits and calibrate bad habits, ultimately enabling robust learning under the presence of NTC. Extensive experiments conducted on two standard CIR datasets demonstrate that HABIT significantly outperforms most methods under various noise ratios, exhibiting superior robustness and retrieval performance. Codes are available at this https URL
[CV-61] CFSR: Geometry-Conditioned Shadow Removal via Physical Disentanglement
【速读】:该论文旨在解决传统阴影去除网络将图像修复视为无约束映射所带来的问题,即缺乏物理可解释性,难以在局部纹理恢复与全局光照一致性之间取得平衡。其解决方案的关键在于提出一种多模态先验驱动的框架CFSR,该框架将阴影去除重构为一个物理约束的恢复过程:首先通过自定义HVI颜色空间抑制阴影噪声并融合RGB数据与估计深度先验;其次引入几何-语义双显式引导注意力机制,利用DINO特征和3D表面法向量直接调节注意力亲和矩阵,从而结构化地施加物理光照约束;最后借助冻结的CLIP编码器注入整体先验,并通过频率协同重建模块(FCRM)解耦解码过程,实现高频边缘锐化与低频全局光照恢复的协同优化。
链接: https://arxiv.org/abs/2604.18032
作者: Pan Wang,Yihao Hu,Xiujin Liu,Hang Wang
机构: University of Science and Technology of China (中国科学技术大学); Ant Group (蚂蚁集团); University of Michigan - Ann Arbor (密歇根大学安娜堡分校); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional shadow removal networks often treat image restoration as an unconstrained mapping, lacking the physical interpretability required to balance localized texture recovery with global illumination consistency. To address this, we propose CFSR, a multi-modal prior-driven framework that reframes shadow removal as a physics-constrained restoration process. By seamlessly integrating 3D geometric cues with large-scale foundation model semantics, CFSR effectively bridges the 2D-3D domain gap. Specifically, we first map observations into a custom HVI color space to suppress shadow-induced noise and robustly fuse RGB data with estimated depth priors. At its core, our Geometric Semantic Dual Explicit Guided Attention mechanism utilizes DINO features and 3D surface normals to directly modulate the attention affinity matrix, structurally enforcing physical lighting constraints. To recover severely degraded regions, we inject holistic priors via a frozen CLIP encoder. Finally, our Frequency Collaborative Reconstruction Module (FCRM) achieves an optimal synthesis by decoupling the decoding process. Conditioned on geometric priors, FCRM seamlessly harmonizes the reconstruction of sharp high-frequency occlusion boundaries with the restoration of low-frequency global illumination. Extensive experiments demonstrate that CFSR achieves state-of-the-art performance across multiple challenging benchmarks.
[CV-62] Multi-View Hierarchical Graph Neural Network for Sketch-Based 3D Shape Retrieval
【速读】:该论文旨在解决草图驱动的三维形状检索(Sketch-based 3D Shape Retrieval, SBSR)中的两个核心问题:一是现有方法对多视角特征采用简化的聚合策略,忽视了视图间的几何关系与多层次细节,导致三维表示能力较弱;二是传统SBSR方法受限于可见类别限制,在零样本(zero-shot)场景下性能不佳。解决方案的关键在于提出多视图分层图神经网络(Multi-View Hierarchical Graph Neural Network, MV-HGNN),通过构建视图级图结构并利用局部图卷积和全局注意力机制捕捉相邻几何依赖与跨视图信息传递,同时引入视图选择器实现分层图粗化,逐步扩大感受野并减少冗余视图干扰,从而获得更具判别性的层次化三维表示;此外,借助CLIP文本嵌入作为语义原型,将草图与三维特征投影至共享语义空间,实现类别无关对齐并缓解过拟合已见类别问题,最终在类别级与零样本两种设置下均显著优于当前最优方法。
链接: https://arxiv.org/abs/2604.18019
作者: Hang Cheng,Muyan He,Mingyu Fan,Chengfeng Xie,Xi Cheng,Long Zeng
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sketch-based 3D shape retrieval (SBSR) aims to retrieve 3D shapes that are consistent with the category of the input hand-drawn sketch. The core challenge of this task lies in two aspects: existing methods typically employ simplified aggregation strategies for independently encoded 3D multi-view features, which ignore the geometric relationships between views and multi-level details, resulting in weak 3D representation. Simultaneously, traditional SBSR methods are constrained by visible category limitations, leading to poor performance in zero-shot scenarios. To address these challenges, we propose Multi-View Hierarchical Graph Neural Network (MV-HGNN), a novel framework for SBSR. Specifically, we construct a view-level graph and capture adjacent geometric dependencies and cross-view message passing via local graph convolution and global attention. A view selector is further introduced to perform hierarchical graph coarsening, enabling a progressively larger receptive field for graph convolution and mitigating the interference of redundant views, which leads to a more discriminative hierarchical 3D representation. To enable category-agnostic alignment and mitigate overfitting to seen classes, we leverage CLIP text embeddings as semantic prototypes and project both sketch and 3D features into a shared semantic space. We use a two-stage training strategy for category-level retrieval and a one-stage strategy for zero-shot retrieval under the same model architecture. Under both category-level and zero-shot settings, extensive experiments on two public benchmarks demonstrate that MV-HGNN outperforms state-of-the-art methods.
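"以 CLIP 文本嵌入为语义原型做零样本对齐"在推理端归结为一次余弦相似度最近邻匹配。下面是该通用做法的纯 Python 示意(原型向量为虚构的二维占位,真实 CLIP 嵌入为高维向量):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def zero_shot_classify(feature, prototypes):
    """零样本对齐:把投影后的(草图或 3D)特征与各类别文本原型做
    余弦相似度,取最相似的类别——无需见过该类的任何训练样本。"""
    return max(prototypes, key=lambda name: cosine(feature, prototypes[name]))

# 假设的 CLIP 文本原型("a photo of a {class}" 编码后的占位向量)
prototypes = {"chair": [1.0, 0.0], "car": [0.0, 1.0]}
pred = zero_shot_classify([0.9, 0.2], prototypes)
```

由于原型来自文本编码器而非训练集,新增类别只需加一条文本原型即可检索,这正是缓解"可见类别限制"的机制。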
[CV-63] Trustworthy Endoscopic Super-Resolution
【速读】:该论文旨在解决超分辨率(Super-resolution, SR)模型在微创手术和诊断视频增强中因引入幻觉结构(hallucinated structures)和放大噪声而导致的可靠性问题,尤其在安全关键场景下限制了其应用。解决方案的关键在于提出一种轻量级、低延迟的误差预测网络,通过分析中间特征表示来估计像素级重建误差,并基于共形风险控制原理构建“共形失败掩膜”(Conformal Failure Masks, CFM),从而实现对不可信重建区域的定位与量化,提供理论保障以控制容许误差上限和误覆盖率,且不依赖具体SR模型,具备良好的通用性和实用性。
链接: https://arxiv.org/abs/2604.18001
作者: Julio Silva-Rodríguez,Ender Konukoglu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Super-resolution (SR) models are attracting growing interest for enhancing minimally invasive surgery and diagnostic videos under hardware constraints. However, valid concerns remain regarding the introduction of hallucinated structures and amplified noise, limiting their reliability in safety-critical settings. We propose a direct and practical framework to make SR systems more trustworthy by identifying where reconstructions are likely to fail. Our approach integrates a lightweight error-prediction network that operates on intermediate representations to estimate pixel-wise reconstruction error. The module is computationally efficient and low-latency, making it suitable for real-time deployment. We convert these predictions into operational failure decisions by constructing Conformal Failure Masks (CFM), which localize regions where the SR output should not be trusted. Built on conformal risk control principles, our method provides theoretical guarantees for controlling both the tolerated error limit and the miscoverage in detected failures. We evaluate our approach on image and video SR, demonstrating its effectiveness in detecting unreliable reconstructions in endoscopic and robotic surgery settings. To our knowledge, this is the first study to provide a model-agnostic, theoretically grounded approach to improving the safety of real-time endoscopic image SR.
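共形失败掩膜建立在共形风险控制之上;其最基础的构件是分裂共形分位数阈值。下面给出该通用构件的纯 Python 示意(得分数值为虚构,论文中针对容许误差与误覆盖率的风险控制目标更复杂):

```python
import math

def conformal_threshold(cal_scores, alpha):
    """分裂共形分位数:取 n 个校准得分的 ceil((n+1)(1-alpha))/n 经验
    分位数;在可交换性假设下,新样本得分以 >= 1 - alpha 的概率
    不超过该阈值。"""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def failure_mask(pred_errors, tau):
    """预测重建误差超过阈值 tau 的像素被标记为不可信(失败掩膜)。"""
    return [e > tau for e in pred_errors]

# 100 个校准像素的(假设)误差得分,目标误覆盖率 alpha = 0.1
tau = conformal_threshold(list(range(1, 101)), alpha=0.1)
mask = failure_mask([0.5, 95.0, 42.0], tau)
```

校准阶段只需排序取分位数,部署阶段只需一次逐像素比较,因此与轻量误差预测网络组合后仍可满足实时性要求。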
[CV-64] Identifying Ethical Biases in Action Recognition Models
【速读】:该论文旨在解决人类动作识别(Human Action Recognition, HAR)模型在不同人群中的公平性问题,尤其是当视觉身份特征(如肤色)变化时模型预测是否存在系统性偏差。以往研究多集中于静态图像或姿态估计,忽略了动作的时间一致性;而本文通过使用BEDLAM仿真平台生成可控的合成视频数据,实现了对单一属性(如肤色)的精确干预,同时保持动作内容完全一致,从而能够隔离并量化肤色对模型输出的影响。其解决方案的关键在于利用合成数据与受控实验设计,首次在动态视频场景中系统性地审计HAR模型的偏见,揭示了即使运动模式相同,某些模型仍可能因肤色差异产生显著错误预测,为开发更透明、可问责的HAR系统提供了可操作的评估框架。
链接: https://arxiv.org/abs/2604.17971
作者: Ana Baltaretu,Pascal Benschop,Jan van Gemert
机构: Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human Action Recognition (HAR) models are increasingly deployed in high-stakes environments, yet their fairness across different human appearances has not been analyzed. We introduce a framework for auditing bias in HAR models using synthetic video data, generated with full control over visual identity attributes such as skin color. Unlike prior work that focuses on static images or pose estimation, our approach preserves temporal consistency, allowing us to isolate and test how changes to a single attribute affect model predictions. Through controlled interventions using the BEDLAM simulation platform, we show whether some popular HAR models exhibit statistically significant biases on the skin color even when the motion remains identical. Our results highlight how models may encode unwanted visual associations, and we provide evidence of systematic errors across groups. This work contributes a framework for auditing HAR models and supports the development of more transparent, accountable systems in light of upcoming regulatory standards.
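审计"不同肤色组间预测是否存在统计显著差异"时,一个标准工具是合并方差的双比例 z 检验。下面是纯 Python 示意(各组正确数/样本数均为虚构数值,论文的具体检验方法以原文为准):

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """H0: 两组准确率相等 的合并方差双比例 z 检验统计量。"""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 假设数值:两种肤色渲染下、动作完全相同的识别正确数 / 样本数
z_biased = two_proportion_z(180, 200, 150, 200)   # 90% vs 75%
z_equal = two_proportion_z(150, 200, 150, 200)    # 无差异
```

|z| > 1.96 即在 5% 水平上拒绝"组间准确率相等";由于合成数据保证运动完全一致,显著差异只能归因于肤色这一被干预的属性。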
[CV-65] E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes
【速读】:该论文旨在解决现有具身视觉搜索(Embodied Visual Search)基准在评估代理在真实三维环境中进行细粒度视角依赖性感知能力方面的不足。当前主流基准如EQA主要依赖静态观察或受限的自我中心运动,无法有效衡量在无约束5自由度(5-DoF)视角控制下产生的关键现象,例如因垂直视角变化导致的可见性改变、容器内部内容的揭示以及仅从特定角度可辨识的对象属性。为应对这一问题,作者提出E3VS-Bench,其核心创新在于利用3D高斯泼溅(3D Gaussian Splatting)技术构建高保真度的3D场景,实现逼真的多视角渲染并保留细微视觉细节(如小尺寸文字和微弱特征),从而设计出必须通过跨视角主动检查才能解答的问题。该方案使模型需在5-DoF范围内规划连续视角变换以获取任务所需证据,显著提升了对具身智能体主动感知与连贯视角规划能力的评测精度。
链接: https://arxiv.org/abs/2604.17969
作者: Koya Sakamoto,Taiki Miyanishi,Daichi Azuma,Shuhei Kurita,Shu Morikuni,Naoya Chiba,Motoaki Kawanabe,Yusuke Iwasawa,Yutaka Matsuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion, and thus do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search where agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators, thereby allowing the construction of questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning specifically under full 5-DoF viewpoint changes.
[CV-66] MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene CVPR2026
【速读】:该论文旨在解决生成式神经辐射场(GeNeRF)在存在瞬态干扰物(transient distractors)时,由于跨视图结构一致性被破坏而导致监督信号 corrupted 与重建质量下降的问题。现有方法依赖于每场景优化并基于单视图重建误差估计不确定性,这在 GeNeRF 中不可靠,常将静态结构不一致误判为干扰物。其解决方案的关键在于提出 MU-GeNeRF 框架,通过分解干扰感知能力为两个互补的不确定性分量:源视图不确定性(Source-view Uncertainty),用于捕捉因视角变化或动态因素引起的源视图间结构差异;目标视图不确定性(Target-view Uncertainty),用于检测目标图像中由瞬态干扰引发的观测异常。这两个不确定性通过异方差重建损失(heteroscedastic reconstruction loss)融合,引导模型自适应调节监督强度,从而实现更鲁棒的干扰抑制与几何一致性保持。
链接: https://arxiv.org/abs/2604.17965
作者: Wenjie Mu,Zhan Li,Chuanzhou Su,Xuanyi Shen,Ziniu Liu,Fan Lu,Yujian Mo,Junqiao Zhao,Tiantian Feng,Chen Ye,Guang Chen
机构: Tongji University (同济大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Generalizable Neural Radiance Fields (GeNeRFs) enable high-quality scene reconstruction from sparse views and can generalize to unseen scenes. However, in real-world settings, transient distractors break cross-view structural consistency, corrupting supervision and degrading reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and estimate uncertainty from per-view reconstruction errors, which are not reliable for GeNeRFs and often misjudge inconsistent static structures as distractors. To this end, we propose MU-GeNeRF, a Multi-view Uncertainty-guided distractor-aware GeNeRF framework designed to alleviate GeNeRF’s robust modeling challenges in the presence of transient distractions. We decompose distractor awareness into two complementary uncertainty components: Source-view Uncertainty, which captures structural discrepancies across source views caused by viewpoint changes or dynamic factors; and Target-view Uncertainty, which detects observation anomalies in the target image induced by transient distractors. These two uncertainties address distinct error sources and are combined through a heteroscedastic reconstruction loss, which guides the model to adaptively modulate supervision, enabling more robust distractor suppression and geometric consistency. Extensive experiments show that our method not only surpasses existing GeNeRFs but also achieves performance comparable to scene-specific distractor-free NeRFs.
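摘要中的异方差重建损失(heteroscedastic reconstruction loss)有一个常见的标准形式:重建误差按预测的不确定性 σ 加权,并加 log σ 项防止 σ 发散。以下为该通用形式的纯 Python 最小示意(逐像素用列表表示;这只是示意性草图,论文的具体实现以原文为准):

```python
import math

def heteroscedastic_loss(pred, target, sigma):
    """逐像素异方差重建损失: (pred-target)^2 / (2*sigma^2) + log(sigma)。
    sigma 越大, 该像素重建误差项的权重越小; log(sigma) 项防止 sigma 无限增大。"""
    total = 0.0
    for p, t, s in zip(pred, target, sigma):
        total += (p - t) ** 2 / (2 * s ** 2) + math.log(s)
    return total / len(pred)

# 假设第二个像素被判为瞬态干扰(高不确定性), 其重建误差项的权重被自动下调
clean = heteroscedastic_loss([0.5, 0.9], [0.5, 0.1], [1.0, 1.0])
down = heteroscedastic_loss([0.5, 0.9], [0.5, 0.1], [1.0, 4.0])
```

这正是"自适应调节监督强度"的含义:不确定性大的区域对损失的梯度贡献更小。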
[CV-67] DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection
【速读】:该论文旨在解决传统差异人脸防伪检测(D-MAD)系统在高安全性场景下误检率过高、泛化能力不足的问题。现有方法多依赖于人脸识别嵌入或手工设计特征差异,难以有效捕捉可疑伪造图像与真实活体图像之间的细微差别。其解决方案的关键在于提出一种参数高效的D-MAD框架——DifFoundMAD,该框架利用视觉基础模型(Vision Foundation Models, VFM)的强大泛化能力,将原始图像差异建模从传统特征空间迁移至由VFM提取的嵌入空间中;同时通过轻量级微调与类别平衡优化策略,在仅更新少量参数的前提下保留基础模型丰富的表征先验,从而显著提升检测精度。实验表明,在边境控制等严格安全场景下,错误率由6.16%降至2.17%,验证了该方法的有效性。
链接: https://arxiv.org/abs/2604.17961
作者: Lazaro J. Gonzalez-Soler,André Dörsch,Christian Rathgeb,Christoph Busch
机构: da/sec - Biometrics and Security Research Group (da/sec - 生物识别与安全研究组); Darmstadt, Germany (达姆施塔特, 德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this work, we introduce DifFoundMAD, a parameter-efficient D-MAD framework that exploits the generalisation capabilities of vision foundation models (FM) to capture discrepancies between suspected morphs and live capture images. In contrast to conventional D-MAD systems that rely on face recognition embeddings or handcrafted feature differences, DifFoundMAD follows the standard differential paradigm while replacing the underlying representation space with embeddings extracted from FMs. By combining lightweight finetuning with class-balanced optimisation, the proposed method updates only a small subset of parameters while preserving the rich representational priors of the underlying FMs. Extensive cross-database evaluations on standard D-MAD benchmarks demonstrate that DifFoundMAD achieves consistent improvements over state-of-the-art systems, particularly at the strict security levels required in operational deployments such as border control: The error rates reported in the current state-of-the-art were reduced from 6.16% to 2.17% for high-security levels using DifFoundMAD.
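摘要所述的差分范式(以可疑图与活体采集图的嵌入之差作为特征,再过轻量分类头)可最小化地示意如下;其中嵌入维度与权重数值均为虚构,真实系统中的嵌入来自视觉基础模型:

```python
def diff_embedding(suspected, live):
    """差分范式: 以两张图嵌入向量之差作为 D-MAD 特征。"""
    return [a - b for a, b in zip(suspected, live)]

def linear_score(feat, weights, bias=0.0):
    """示意性的轻量线性头: 分数越大越可能为 morph。权重数值纯属虚构。"""
    return sum(f * w for f, w in zip(feat, weights)) + bias

# 虚构的二维嵌入, 仅演示数据流
feat = diff_embedding([0.9, 0.1], [0.2, 0.4])
score = linear_score(feat, [1.0, 1.0])
```

DifFoundMAD 的要点在于把这里的嵌入空间换成基础模型的表征空间,并只微调少量参数。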
[CV-68] Chatting about Upper-Body Expressive Human Pose and Shape Estimation
【速读】:该论文旨在解决当前生成式人体姿态与形状估计(Expressive Human Pose and Shape Estimation, EHPS)方法在面部和手部区域参数估计不准确以及在真实复杂场景(wild images)中泛化能力有限的问题。解决方案的关键在于提出一种新颖的一阶段协同交叉依赖Transformer框架——CoEvoer,其通过显式的特征级交互机制,在人体上半身各部位之间建立双向信息传递:躯干等大尺度且易估计区域提供全局语义和位置先验以指导面部与手部的精细估计;而面部与手部的局部细节则反向校准相邻区域,实现多部位间的相互增强与联合优化,从而有效捕捉面部、手部与躯干之间的强耦合关系与语义依赖性。
链接: https://arxiv.org/abs/2604.17959
作者: Yuxiang Zhao,Wei Huang,Yujie Song,Liu Wang,Huan Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Expressive Human Pose and Shape Estimation (EHPS) plays a crucial role in various AR/VR applications and has witnessed significant progress in recent years. However, current state-of-the-art methods still struggle with accurate parameter estimation for facial and hand regions and exhibit limited generalization to wild images. To address these challenges, we present CoEvoer, a novel one-stage synergistic cross-dependency transformer framework tailored for upper-body EHPS. CoEvoer enables explicit feature-level interaction across different body parts, allowing for mutual enhancement through contextual information exchange. Specifically, larger and more easily estimated regions such as the torso provide global semantics and positional priors to guide the estimation of finer, more complex regions like the face and hands. Conversely, the localized details captured in facial and hand regions help refine and calibrate adjacent body parts. To the best of our knowledge, CoEvoer is the first framework designed specifically for upper-body EHPS, with the goal of capturing the strong coupling and semantic dependencies among the face, hands, and torso through joint parameter regression. Extensive experiments demonstrate that CoEvoer achieves state-of-the-art performance on upper-body benchmarks and exhibits strong generalization capability even on unseen wild images.
[CV-69] ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection
【速读】:该论文旨在解决工业异常检测中深度学习模型作为“黑箱”导致难以提供物理意义明确的缺陷证据的问题。其核心解决方案是提出ZSG-IAD(Zero-Shot Grounded Industrial Anomaly Detection),一个基于多模态视觉-语言的零样本异常检测框架。关键创新在于引入语言引导的两跳定位模块:首先通过与异常相关的句子筛选出从多模态特征中提炼出的类证据潜在槽位,获得粗粒度空间支持;随后,这些槽位通过通道-空间门控机制调制特征图,并结合轻量解码器生成像素级异常掩膜。此外,为提升可靠性,进一步采用可执行规则的GRPO(Executable-Rule GRPO)优化策略,以可验证奖励促进结构化输出、异常区域一致性及推理-结论连贯性,从而实现更透明且物理可解释的异常检测结果。
链接: https://arxiv.org/abs/2604.17949
作者: Qiuhui Chen,Jiaxiang Song,Shuai Tan,Weimin Zhong
机构: East China University of Science and Technology (华东理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion coherence. Experiments across multiple industrial anomaly benchmarks show strong zero-shot performance and more transparent, physically grounded explanations than prior methods. We will release code and annotations to support future research on trustworthy industrial anomaly detection systems.
[CV-70] Brain-Inspired Capture: Evidence-Driven Neuromimetic Perceptual Simulation for Visual Decoding
【速读】:该论文旨在解决脑-机接口(BCI)与计算神经科学中视觉解码的核心挑战,即神经生理信号与视觉模态之间存在的系统性与随机性差异问题,且现有方法普遍忽视了人类视觉系统(HVS)的内在计算机制。其解决方案的关键在于提出一种类脑感知模拟范式——Brain-Inspired Capture (BI-Cap),通过构建包含四个生物合理动态与静态变换的神经模仿流水线,并引入基于互信息(MI)引导的动态模糊调节机制来模拟自适应视觉处理;同时,为缓解神经活动固有的非平稳性,设计了一种证据驱动的潜在空间表示,显式建模不确定性以确保鲁棒的神经嵌入表示。
链接: https://arxiv.org/abs/2604.17927
作者: Feixue Shao,Guangze Shi,Xueyu Liu,Yongfei Wu,Mingqiang Wei,Jianan Zhang,Jianbo Lu,Guiying Yan,Weihua Yang
机构: Taiyuan University of Technology (太原理工大学); National Human Genetics Resource Center (国家人类遗传资源中心); Academy of Mathematics and Systems Science, Chinese Academy of Sciences (中国科学院数学与系统科学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual decoding of neurophysiological signals is a critical challenge for brain-computer interfaces (BCIs) and computational neuroscience. However, current approaches are often constrained by the systematic and stochastic gaps between neural and visual modalities, largely neglecting the intrinsic computational mechanisms of the Human Visual System (HVS). To address this, we propose Brain-Inspired Capture (BI-Cap), a neuromimetic perceptual simulation paradigm that aligns these modalities by emulating HVS processing. Specifically, we construct a neuromimetic pipeline comprising four biologically plausible dynamic and static transformations, coupled with Mutual Information (MI)-guided dynamic blur regulation to simulate adaptive visual processing. Furthermore, to mitigate the inherent non-stationarity of neural activity, we introduce an evidence-driven latent space representation. This formulation explicitly models uncertainty, thereby ensuring robust neural embeddings. Extensive evaluations on zero-shot brain-to-image retrieval across two public benchmarks demonstrate that BI-Cap substantially outperforms state-of-the-art methods, achieving relative gains of 9.2% and 8.0%, respectively. We have released the source code on GitHub through the link this https URL.
[CV-71] Prompting Foundation Models for Zero-Shot Ship Instance Segmentation in SAR Imagery
【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像中舰船实例分割缺乏像素级标注的问题,从而推动深度学习在SAR海洋监视中的应用。解决方案的关键在于利用通用视觉基础模型(如Segment Anything Model 2, SAM2)与一个在公开SAR数据集上训练的YOLOv11检测器相结合:YOLOv11通过边界框定位舰船,作为提示(prompt)引导SAM2生成实例掩码,而无需任何掩码标注。该方法不依赖微调或适配器,仅依靠SAR训练检测器提供的空间约束即可有效规范基础模型预测,部分缓解光学-SAR域差异,实现零样本舰船实例分割,且在SSDD基准上达到89%的全监督基线性能(平均IoU为0.637),展现出可扩展、标注高效的SAR图像理解路径。
链接: https://arxiv.org/abs/2604.17920
作者: Islam Mansour,Francescopaolo Sica,Michael Schmitt
机构: University of the Bundeswehr Munich (联邦国防大学慕尼黑分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages
Abstract:Synthetic Aperture Radar (SAR) plays a critical role in maritime surveillance, yet deep learning for SAR analysis is limited by the lack of pixel-level annotations. This paper explores how general-purpose vision foundation models can enable zero-shot ship instance segmentation in SAR imagery, eliminating the need for pixel-level supervision. A YOLOv11-based detector trained on open SAR datasets localizes ships via bounding boxes, which then prompt the Segment Anything Model 2 (SAM2) to produce instance masks without any mask annotations. Unlike prior SAM-based SAR approaches that rely on fine tuning or adapters, our method demonstrates that spatial constraints from a SAR-trained detector alone can effectively regularize foundation model predictions. This design partially mitigates the optical-SAR domain gap and enables downstream applications such as vessel classification, size estimation, and wake analysis. Experiments on the SSDD benchmark achieve a mean IoU of 0.637 (89% of a fully supervised baseline) with an overall ship detection rate of 89.2%, confirming a scalable, annotation-efficient pathway toward foundation-model-driven SAR image understanding.
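摘要以 mean IoU 评估零样本分割质量。IoU 本身的计算很简单,下面给出二值掩码 IoU 的自包含示意(掩码用展平的 0/1 列表表示):

```python
def mask_iou(a, b):
    """二值掩码的 IoU: 交集像素数 / 并集像素数; 两掩码全空时约定为 1.0。"""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

# 2x2 掩码展平后计算: 交集 1 像素, 并集 3 像素
iou = mask_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

在数据集上对逐实例 IoU 取均值即为摘要中报告的 mean IoU。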
[CV-72] OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
【速读】:该论文旨在解决端到端自动驾驶系统中多任务学习与异构解码行为(如自回归文本生成、并行目标检测和轨迹回归)难以统一建模的问题。现有方法通常采用分离或级联的解码器结构,导致架构碎片化且骨干网络复用受限。其解决方案的关键在于:基于预训练视觉-语言模型(Vision-Language Model, VLM),设计一个统一的Transformer解码器架构,将视觉特征、结构化查询token(如轨迹查询)与文本生成任务整合在同一因果解码框架内,利用VLM原始注意力机制实现结构化输出对视觉上下文的自然条件依赖,并通过共享注意力骨干网络实现跨异构任务的稳定联合优化,从而在保持多模态生成能力的同时显著提升推理效率(约降低40%延迟)。
链接: https://arxiv.org/abs/2604.17915
作者: Yiwei Zhang,Xuesong Chen,Jin Gao,Hanshi Wang,Fudong Ge,Weiming Hu,Shaoshuai Shi,Zhipeng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) excel at autoregressive text generation, yet end-to-end autonomous driving requires multi-task learning with structured outputs and heterogeneous decoding behaviors, such as autoregressive language generation, parallel object detection and trajectory regression. To accommodate these differences, existing systems typically introduce separate or cascaded decoders, resulting in architectural fragmentation and limited backbone reuse. In this work, we present a unified autonomous driving framework built upon a pretrained VLM, where heterogeneous decoding behaviors are reconciled within a single transformer decoder. We demonstrate that pretrained VLM attention exhibits strong transferability beyond pure language modeling. By organizing visual and structured query tokens within a single causal decoder, structured queries can naturally condition on visual context through the original attention mechanism. Textual and structured outputs share a common attention backbone, enabling stable joint optimization across heterogeneous tasks. Trajectory planning is realized within the same causal LLM decoder by introducing structured trajectory queries. This unified formulation enables planning to share the pretrained attention backbone with images and perception tokens. Extensive experiments on end-to-end autonomous driving benchmarks demonstrate state-of-the-art performance, including 0.28 L2 and 0.18 collision rate on nuScenes open-loop evaluation and competitive results (86.8 PDMS) on NAVSIM closed-loop evaluation. The full model preserves multi-modal generation capability, while an efficient inference mode achieves approximately 40% lower latency. Code and models are available at this https URL
[CV-73] Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors
【速读】:该论文旨在解决基于骨架的动作识别中,现有自监督对比学习方法依赖二元对比目标而忽略人体运动内在连续性的问题,导致特征簇碎片化和类别边界僵硬。其解决方案的关键在于提出一种基于过渡锚点的对比学习框架(TranCLR),核心创新包括:1)动作过渡锚点构建(Action Transitional Anchor Construction, ATAC),显式建模过渡状态的几何结构以增强对运动连续性的感知;2)多级几何流形校准机制(Multi-Level Geometric Manifold Calibration, MGMC),在多个连续性层级上自适应校准动作流形,从而获得更平滑且更具判别力的表示空间。
链接: https://arxiv.org/abs/2604.17914
作者: Yingjie Feng,Yi Wang,Jiaze Wang,Anfeng Liu,Zhuotao Tian
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳); Central South University (中南大学); FitX Technology (Hong Kong) Limited (FitX科技(香港)有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-supervised contrastive learning has emerged as a powerful paradigm for skeleton-based action recognition by enforcing consistency in the embedding space. However, existing methods rely on binary contrastive objectives that overlook the intrinsic continuity of human motion, resulting in fragmented feature clusters and rigid class boundaries. To address these limitations, we propose TranCLR, a Transitional anchor-based Contrastive Learning framework that captures the continuous geometry of the action space. Specifically, the proposed Action Transitional Anchor Construction (ATAC) explicitly models the geometric structure of transitional states to enhance the model’s perception of motion continuity. Building upon these anchors, a Multi-Level Geometric Manifold Calibration (MGMC) mechanism is introduced to adaptively calibrate the action manifold across multiple levels of continuity, yielding a smoother and more discriminative representation space. Extensive experiments on the NTU RGB+D, NTU RGB+D 120 and PKU-MMD datasets demonstrate that TranCLR achieves superior accuracy and calibration performance, effectively learning continuous and uncertainty-aware skeleton representations. The code is available at this https URL.
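速读中提到的"过渡锚点"思想,可用两个动作嵌入之间的线性插值作最简示意(ATAC 的实际构造以论文为准,此处仅演示"连续过渡状态"在嵌入空间中的几何含义):

```python
def transitional_anchor(emb_a, emb_b, alpha=0.5):
    """在动作 A 与动作 B 的嵌入之间线性插值, 得到一个过渡锚点。
    alpha=0 时为 emb_a, alpha=1 时为 emb_b, 中间值刻画连续过渡。"""
    return [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]

# 虚构的二维嵌入: 过渡锚点位于两个动作簇的连线上
anchor = transitional_anchor([0.0, 0.0], [1.0, 1.0], 0.5)
```

二元对比目标只区分"同/异",而这样的锚点让损失函数可以感知动作空间的连续几何。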
[CV-74] MEDN: Motion-Emotion Feature Decoupling Network for Micro-Expression Recognition
【速读】:该论文旨在解决微表情识别(Micro-expression Recognition, MER)中因动作单元(Action Units, AUs)与情绪类别之间缺乏严格映射关系而导致的误识别问题,尤其针对那些具有相同AUs但表达相反情绪的微表情样本难以区分的挑战。其解决方案的关键在于提出一种运动-情绪特征解耦网络(Motion Emotion Feature Decoupling Network, MEDN),通过双分支结构分别提取显式运动特征和隐式情绪特征:在运动分支中引入AU检测任务并采用正交损失以降低运动与情绪特征的耦合;在情绪分支中设计稀疏情感视觉Transformer(Sparse Emotion Vision Transformer, SEVit),利用多尺度稀疏化空间token来突出局部时序变化;最后通过协同融合模块(Collaborative Fusion Module, CoFM)自适应地融合解耦后的特征,从而提升模型对微表情的识别准确性和泛化能力。
链接: https://arxiv.org/abs/2604.17899
作者: Chenxing Hu,Kun Xie,Qiguang Miao,Ruyi Liu,Quan Wang,Zongkai Yang
机构: Xidian University (西安电子科技大学); Central China Normal University (华中师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 7 tabels
Abstract:Unlike macro-expression, micro-expression does not follow a strictly consistent mapping rule between emotions and Action Units (AUs). As a result, some micro-expressions share identical AUs yet represent completely opposite emotional categories, making them highly visually similar. Existing microexpression recognition (MER) methods mostly rely on explicit facial motion cues (e.g., optical flow, frame differences, AU features) while ignoring implicit emotion information. To tackle this issue, this paper presents a Motion Emotion Feature Decoupling Network (MEDN) for MER. We design a dual-branch framework to separately extract motion and emotion features. In the motion branch, an AU-detection task restricts features to the explicit motion domain, and orthogonal loss is adopted to reduce motion emotion feature coupling. For implicit emotion modeling, we propose a Sparse Emotion Vision Transformer (SEVit) that sparsifies spatial tokens to highlight local temporal variations with multi-scale sparsity rates. A Collaborative Fusion Module (CoFM) is further developed to fuse disentangled motion and emotion features adaptively. Extensive experiments on three benchmark datasets validate that MEDN effectively decouples motion and emotion features and achieves superior recognition performance, offering a new perspective for enhancing recognition accuracy and generalization.
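摘要中用于降低运动-情绪特征耦合的正交损失,常见做法是惩罚两组特征的余弦相似度;以下为该通用形式的纯 Python 示意(非论文原实现):

```python
import math

def cosine(u, v):
    """两向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def orthogonal_loss(motion_feat, emotion_feat):
    """以余弦相似度的平方作为正交约束: 为 0 时两组特征完全正交(解耦)。"""
    return cosine(motion_feat, emotion_feat) ** 2
```

训练中最小化该项,会把显式运动特征与隐式情绪特征推向相互正交的子空间。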
[CV-75] ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval AAAI2026
【速读】:该论文旨在解决**组合视频检索(Composed Video Retrieval, CVR)**任务中因视频与文本模态间信息密度差异导致的特征融合偏差问题,具体表现为:传统方法在融合参考视频与修改文本时倾向于过度依赖视频模态,从而削弱了文本对目标视频检索的关键引导作用。这一问题源于三个核心挑战:模态贡献纠缠、组合特征显式优化不足以及检索不确定性。解决方案的关键在于提出首个基于方向性校准机制的框架 ReTrack,其核心创新为通过“语义贡献解耦”、“组合几何校准”和“可靠证据驱动对齐”三大模块,显式估计并校准各模态在组合特征中的方向性偏置,进而利用双向证据增强从组合特征到目标视频的相似度计算可靠性,显著提升了多模态查询理解能力与检索性能。
链接: https://arxiv.org/abs/2604.17898
作者: Zixu Li,Yupeng Hu,Zhiwei Chen,Qinlei Huang,Guozhi Qiu,Zhiheng Fu,Meng Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026
Abstract:With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a novel paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes a multi-modal query consisting of a reference video and a piece of modification text as input. The modification text conveys the user’s intended alterations to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there exists a substantial discrepancy in information density between video and text modalities. Traditional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation is significant due to the presence of three core challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRiven dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features. It consists of three key modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality to calibrate the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence that drives reliable composed-to-target similarity estimation. Moreover, ReTrack exhibits strong generalization to the Composed Image Retrieval (CIR) task, achieving SOTA performance across three benchmark datasets in both CVR and CIR scenarios. Codes are available at this https URL
[CV-76] AeroRAG : Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在航空场景下视觉问答(Visual Question Answering, VQA)任务中表现不佳的问题,特别是当关键证据由小物体、显式数量、粗略位置及对象间关系等结构化语义信息承载时,传统密集视觉token表示难以与这些语义对齐。解决方案的关键在于提出AeroRAG框架——一种基于场景图引导的检索增强生成方法,通过将输入图像转化为包含对象类别、数量、空间位置和语义关系的结构化视觉知识,并检索与查询相关的语义片段以构建紧凑提示(prompt),从而在感知与语言推理之间引入更明确的中间接口,实现更可靠且可部署的视觉推理。
链接: https://arxiv.org/abs/2604.17889
作者: Junxiao Xue,Quan Deng,Tingqi Hu,Meicong Si,Xinyi Yin,Yunyun Shi,Xuecheng Wu
机构: Zhejiang Lab (浙江实验室); University of Chinese Academy of Sciences (中国科学院大学); Zhengzhou University (郑州大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite recent progress in multimodal large language models (MLLMs), reliable visual question answering in aerial scenes remains challenging. In such scenes, task-critical evidence is often carried by small objects, explicit quantities, coarse locations, and inter-object relations, whereas conventional dense visual-token representations are not well aligned with these structured semantics. To address this interface mismatch, we propose AeroRAG, a scene-graph-guided multimodal retrieval-augmented generation framework for visual question answering. The framework first converts an input image into structured visual knowledge, including object categories, quantities, spatial locations, and semantic relations, and then retrieves query-relevant semantic chunks to construct compact prompts for a text-based large language model. Rather than relying on direct reasoning over dense visual tokens, our method introduces a more explicit intermediate interface between perception and language reasoning. Experiments on the AUG aerial dataset and the general-domain VG-150 benchmark show consistent improvements over six strong MLLM baselines, with the largest gains observed in dense aerial scenes and relation-sensitive reasoning. We further evaluate the framework on VQAv2 to verify that the proposed interface remains compatible with standard visual reasoning settings. These results suggest that structured retrieval is a practical design direction for deployment-oriented and grounded visual reasoning systems.
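AeroRAG 先把图像转为结构化语义片段,再检索与查询相关的片段构造 prompt。下面用"关键词重叠"这一玩具打分方式示意检索这一步(论文的实际检索与打分方式以原文为准,示例数据均为虚构):

```python
def retrieve_chunks(scene_graph, query, top_k=2):
    """按与 query 的词重叠数对语义片段排序, 取 top_k 用于构造 prompt。
    scene_graph 为描述物体/数量/位置/关系的字符串片段列表(示意数据)。"""
    q = set(query.lower().split())
    scored = sorted(scene_graph,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:top_k]

graph = [
    "car count 3 at left region",
    "person near car",
    "building at top right",
]
# 与查询相关的片段被选入 prompt, 无关片段被过滤掉
chunks = retrieve_chunks(graph, "how many car in the image")
```

将检索到的片段拼进文本 prompt,即可由纯文本 LLM 完成后续推理,这正是"结构化中间接口"的含义。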
[CV-77] ST-π: Structured SpatioTemporal VLA for Robotic Manipulation
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在精细时空操作任务中面临的挑战,特别是如何显式建模多步骤行为的时空边界与因果关系。现有方法通常将时空知识隐式嵌入视觉和动作表示中,并直接进行跨模态映射以预测步骤级动作,导致难以处理具有明确时空边界的复杂序列行为。其解决方案的关键在于提出ST-π模型,该模型包含两个核心设计:一是构建结构化的时空VLM(Vision-Language-Model),将4D观测和任务指令编码至潜在空间并输入大语言模型(LLM),生成因果有序的块级动作提示(包括子任务、空间定位和时间定位);二是设计结构化的时空动作专家模块,基于块级提示采用双生成器引导机制,联合建模空间依赖性和时间因果性,从而精准预测步骤级动作参数。这一框架实现了全局时空行为规划与局部时空控制的协同优化,显著提升了机器人在复杂精细操作中的表现。
链接: https://arxiv.org/abs/2604.17880
作者: Chuanhao Ma,Hanyu Zhou,Shihan Peng,Yan Li,Tao Gu,Luxin Yan
机构: Huazhong University of Science and Technology (华中科技大学); National University of Singapore (新加坡国立大学); Macquarie University (麦考瑞大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Typically, existing methods mainly embed spatiotemporal knowledge into visual and action representations, and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST- \pi , a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces, and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding and temporal grounding. 2) Spatiotemporal action expert. Conditioned on chunk-level action prompts, we design a structured dual-generator guidance to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we propose a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments have been conducted to demonstrate the effectiveness of our model. Our code link: this https URL.
[CV-78] Exploring Boundary-Aware Spatial-Frequency Fusion for Camouflaged Object Detection
【速读】:该论文旨在解决伪装目标检测(Camouflaged Object Detection, COD)中因目标与背景高度相似而导致的检测困难问题,尤其针对现有方法仅依赖空间域的边缘提取和局部像素信息、忽视频域中相位谱信息重要性的局限。其解决方案的关键在于提出一种基于边界感知的频域与空域融合网络(BASFNet),通过引入基于相位谱的频域增强边缘探索模块(FEEM)和空间核心分割模块(SCSM),协同捕捉伪装目标的边界与结构特征,并借助空间-频域融合交互模块(SFFIM)实现高效特征融合;同时采用边界感知训练策略进一步优化边界检测性能,从而显著提升COD任务的准确性。
链接: https://arxiv.org/abs/2604.17879
作者: Song Yu,Yang Hu,Haokang Ding,Zhifang Liao,Yucheng Song
机构: Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Camouflaged Object Detection is challenging due to the high degree of similarity between camouflaged objects and their surrounding backgrounds. Current COD methods mainly rely on edge extraction in the spatial domain and local pixel-level information, neglecting the importance of global structural features. Additionally, they fail to effectively leverage the importance of phase spectrum information within frequency domain features. To this end, we propose a COD framework BASFNet based on boundary-aware frequency domain and spatial domain fusion. The method uses dual guided integration of frequency domain and spatial domain features. A phase-spectrum-based frequency-enhanced edge exploration module (FEEM) and a spatial core segmentation module (SCSM) are introduced to jointly capture the boundary and object features of camouflaged objects. These features are then effectively integrated through a spatial-frequency fusion interaction module (SFFIM). Furthermore, the boundary detection is further optimized through a boundary-aware training strategy. BASFNet outperforms existing state-of-the-art methods on three benchmark datasets, validating the effectiveness of the fusion of frequency and spatial domain information in COD tasks.
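摘要强调相位谱信息的重要性:经典结论是信号/图像的相位谱携带结构与位置信息。下面用纯 Python 的一维 DFT 演示"仅保留相位、幅度归一"的重建(与论文 FEEM 的具体设计无关,仅直观说明相位谱为何有助于边缘探测):

```python
import cmath

def dft(x):
    """朴素一维离散傅里叶变换(O(n^2), 仅作演示)。"""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """逆变换, 取实部(输入为实信号时虚部应接近 0)。"""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def phase_only(x):
    """仅保留相位谱(各频率幅度归一为 1)后逆变换: 结果主要反映结构/位置信息。"""
    X = dft(x)
    unit = [z / abs(z) if abs(z) > 1e-12 else 0j for z in X]
    return idft(unit)

# 阶跃信号: 相位重建保留了"跳变发生在哪"的信息
sig = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
resp = phase_only(sig)
```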
[CV-79] Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models
【速读】:该论文旨在解决视频大语言模型(Video Large Language Models, Vid-LLMs)在对话交互中对否定型煤气灯效应(negation-based gaslighting)的脆弱性问题,即模型在面对误导性用户反馈时,会放弃原本基于视觉证据的正确判断,并生成不支持的时空解释以合理化错误修正。解决方案的关键在于提出一种基于否定的煤气灯效应评估框架,并构建了GasVideo-1000基准数据集,该数据集通过明确的视觉锚定和时间推理需求系统性地探测此类“时空谄媚”(spatiotemporal sycophancy)行为,从而揭示当前Vid-LLMs在对抗性对话反馈下缺乏维持 grounded spatiotemporal belief 的鲁棒机制。
链接: https://arxiv.org/abs/2604.17873
作者: Ziyao Tang,Pengkun Jiao,Bin Zhu,Huiyan Qi,Jingjing Chen,Yu-Gang Jiang
机构: Fudan University (复旦大学); Shanghai Key Laboratory of Multimodal Embodied AI (上海市多模态具身智能重点实验室); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Large Language Models (Vid-LLMs) have demonstrated remarkable performance in video understanding tasks, yet their robustness under conversational interaction remains largely underexplored. In this paper, we identify spatiotemporal sycophancy, a failure mode in which Vid-LLMs retract initially correct, visually grounded judgments and conform to misleading user feedback under negation-based gaslighting. Rather than merely changing their answers, the models often fabricate unsupported temporal or spatial explanations to justify incorrect revisions. To systematically investigate this phenomenon, we propose a negation-based gaslighting evaluation framework and introduce GasVideo-1000, a curated benchmark designed to probe spatiotemporal sycophancy with clear visual grounding and temporal reasoning requirements. We evaluate a broad range of state-of-the-art open-source and proprietary Vid-LLMs across diverse video understanding tasks. Extensive experiments reveal that vulnerability to negation-based gaslighting is pervasive and severe, even among models with strong baseline performance. While prompt-level grounding constraints can partially mitigate this behavior, they do not reliably prevent hallucinated justifications or belief reversal. Our results indicate that current Vid-LLMs lack robust mechanisms for maintaining grounded spatiotemporal beliefs under adversarial conversational feedback.
[CV-80] Sharpening Lightweight Models for Generalized Polyp Segmentation: A Boundary Guided Distillation from Foundation Models
【速读】:该论文旨在解决结肠息肉(polyp)自动分割任务中轻量级模型语义与结构信息捕捉不足、而大型视觉基础模型(Vision Foundation Models, VFMs)因领域差异大、边界敏感性弱及计算成本高导致迁移效果差的问题。其核心解决方案是提出 LiteBounD 框架,关键在于:(i) 设计双路径蒸馏机制以解耦语义与边界感知表征;(ii) 引入频域感知对齐策略,分别监督低频全局语义和高频边界细节;(iii) 构建边界感知解码器,融合多尺度编码特征与蒸馏得到的丰富边界信息,实现精确分割。该方法在多个公开数据集上显著优于轻量级基线模型,并达到与先进方法相当的性能,同时保持临床实时应用所需的高效性。
链接: https://arxiv.org/abs/2604.17865
作者: Shivanshu Agnihotri,Snehashis Majhi,Deepak Ranjan Nayak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated polyp segmentation is critical for early colorectal cancer detection and its prevention, yet remains challenging due to weak boundaries, large appearance variations, and limited annotated data. Lightweight segmentation models such as U-Net, U-Net++, and PraNet offer practical efficiency for clinical deployment but struggle to capture the rich semantic and structural cues required for accurate delineation of complex polyp regions. In contrast, large Vision Foundation Models (VFMs), including SAM, OneFormer, Mask2Former, and DINOv2, exhibit strong generalization but transfer poorly to polyp segmentation due to domain mismatch, insufficient boundary sensitivity, and high computational cost. To bridge this gap, we propose LiteBounD, a Lightweight Boundary-guided Distillation framework that transfers complementary semantic and structural priors from multiple VFMs into compact segmentation backbones. LiteBounD introduces (i) a dual-path distillation mechanism that disentangles semantic and boundary-aware representations, (ii) a frequency-aware alignment strategy that supervises low-frequency global semantics and high-frequency boundary details separately, and (iii) a boundary-aware decoder that fuses multi-scale encoder features with distilled semantically rich boundary information for precise segmentation. Extensive experiments on both seen (Kvasir-SEG, CVC-ClinicDB) and unseen (ColonDB, CVC-300, ETIS) datasets demonstrate that LiteBounD consistently outperforms its lightweight baselines by a significant margin and achieves performance competitive with state-of-the-art methods, while maintaining the efficiency required for real-time clinical use. Our code is available at this https URL.
[CV-81] PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation ICPR2026
【速读】:该论文旨在解决浮游生物(plankton)图像实例分割中的两大难题:一是像素级标注数据稀缺,二是传统基于卷积神经网络(CNN)的方法难以区分浮游生物与碎片及重叠个体。其解决方案的关键在于两个创新:首先,提出一种生成伪群体图像(Pseudo Community Images, PCI)的合成策略,通过将单个浮游生物图像叠加到多样背景(包括生成式AI创建的背景)上来扩充训练数据;其次,采用基于视觉Transformer(Vision Transformer, ViT)骨干网络和Mask2Former解码器的分割模型,并利用掩码自动编码器(Masked Autoencoder, MAE)对未标注的单个浮游生物图像进行自监督预训练,从而增强模型对遮挡和杂乱背景的鲁棒性。实验证明,该方法在真实数据集上显著优于Mask R-CNN等传统方法,尤其在高碎片密度环境中表现优异,且大幅减少对人工标注的依赖。
链接: https://arxiv.org/abs/2604.17856
作者: Masaharu Miyazaki,Yurie Otake,Koichi Ito,Wataru Makino,Jotaro Urabe,Takafumi Aoki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICPR2026
Abstract:Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial, however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton against occlusion and debris, we employ a Masked Autoencoder (MAE) for self-supervised pre-training on unlabeled individual images. Experimental results on real-world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R-CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE-based architecture enable high-precision segmentation while requiring fewer manual annotations for individual plankton images.
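A minimal sketch of the Pseudo Community Image (PCI) idea: paste individual crops onto a background at random positions and record a per-instance mask for each paste. Treating nonzero pixels as foreground, ignoring occlusion between instances, and the uniform placement are simplifications of the paper's pipeline.

```python
import numpy as np

def make_pseudo_community(background, crops, rng):
    """Compose a labeled pseudo community image (toy version): each crop
    is pasted at a random location and its foreground pixels become one
    instance mask. Later pastes may visually occlude earlier ones."""
    canvas = background.copy()
    H, W = canvas.shape
    masks = []
    for crop in crops:
        h, w = crop.shape
        y = rng.integers(0, H - h + 1)
        x = rng.integers(0, W - w + 1)
        fg = crop > 0                       # nonzero pixels = foreground
        canvas[y:y + h, x:x + w][fg] = crop[fg]
        mask = np.zeros((H, W), dtype=bool)
        mask[y:y + h, x:x + w] = fg
        masks.append(mask)
    return canvas, masks
```

The same loop works with generative-model backgrounds in place of the blank canvas.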
[CV-82] UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement
【速读】:该论文旨在解决基于扩散模型(diffusion models)的风格迁移(style transfer)中普遍存在的内容-风格纠缠(content-style entanglement)问题,该问题会导致参考内容泄露(reference-content leakage)和生成不稳定。其核心解决方案是提出一个统一框架UniCSG,关键在于分阶段训练策略:首先在潜在空间进行语义解耦训练,通过低频预处理与条件扰动(conditioning corruption)促进内容与风格的分离;其次在潜在空间引入频域感知的细节重建阶段,利用多尺度频率监督优化细节还原;此外还结合像素空间奖励学习(reward learning),使潜在空间目标与解码后的感知质量对齐,从而显著提升内容忠实度、风格一致性及鲁棒性。
链接: https://arxiv.org/abs/2604.17850
作者: Jingwei Yang,Ruoxi Wu,Wei Shen,Meng Li,Yulong Liu,Huimin She,Lunxi Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Style transfer must match a target style while preserving content semantics. DiT-based diffusion models often suffer from content-style entanglement, leading to reference-content leakage and unstable generation. We present UniCSG, a unified framework for content-constrained, style-driven generation in both text-guided and reference-guided settings. UniCSG employs staged training: (i) a latent-space semantic disentanglement stage that combines low-frequency preprocessing with conditioning corruption to encourage content-style separation, and (ii) a latent-space frequency-aware detail reconstruction stage that refines details via multi-scale frequency supervision. We further incorporate pixel-space reward learning to align latent objectives with perceptual quality after decoding. Experiments demonstrate improved content faithfulness, style alignment, and robustness in both settings.
[CV-83] AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis
【速读】:该论文旨在解决儿童脊柱畸形评估中因MRI缺乏自动化、高分辨率3D骨性结构重建技术而难以替代CT的问题,从而避免离子辐射暴露。其关键解决方案是构建一个基于生成式对抗网络(GAN)与U-Net的AI框架:首先利用GAN将历史低剂量CT图像转换为MRI-like图像以扩充数据集,再结合已有的标注胸椎MRI数据训练分割模型,最终实现从MRI单模态图像中全自动、高精度(Dice分数88%)地完成胸腰段(T1-L5)骨骼分割与3D重建,处理时间从约1小时缩短至1分钟以内,同时保留青少年特发性脊柱侧弯(AIS)特有的形态特征,为临床评估、术前规划和导航提供无辐射的3D成像支持。
链接: https://arxiv.org/abs/2604.17846
作者: Nathasha Naranpanawa,Maree T. Izatt,Robert D. Labrom,Geoffrey N. Askin,J. Paige Little
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Presented at 2026 Spine Society of Australia 37th Annual Scientific Meeting
Abstract:MRI is preferred over CT in paediatric imaging because it avoids ionising radiation, but its use in spine deformity assessment is largely limited by the lack of automated, high-resolution 3D bony reconstruction, which continues to rely on CT. MRI-based 3D reconstruction remains impractical due to manual workflows and the scarcity of labelled full-spine datasets. This study introduces an AI framework that enables fully automated thoracolumbar spine (T1-L5) segmentation and 3D reconstruction from MRI alone. Historical low-dose CT scans from adolescent idiopathic scoliosis (AIS) patients were converted into MRI-like images using a GAN and combined with existing labelled thoracic MRI data to train a U-Net-based model. The resulting algorithm accurately generated continuous thoracolumbar 3D reconstructions, improved segmentation accuracy (88% Dice score), and reduced processing time from approximately 1 hour to under one minute, while preserving AIS-specific deformity features. This approach enables radiation-free 3D deformity assessment from MRI, supporting clinical evaluation, surgical planning, and navigation in paediatric spine care.
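The 88% segmentation accuracy above is reported as a Dice score, which is the standard overlap metric between a predicted and a reference binary mask:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks:
    2 * |pred ∩ target| / (|pred| + |target|), with eps for stability."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```

Per-vertebra Dice scores are typically averaged to give a single figure such as the 88% quoted here.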
[CV-84] PCM-NeRF: Probabilistic Camera Modeling for Neural Radiance Fields under Pose Uncertainty CVPR
【速读】:该论文旨在解决神经表面重建方法在相机位姿(camera poses)存在误差时导致重建结果失真或不完整的问题,尤其针对Structure-from-Motion (SfM)系统输出的位姿估计不准确的情况。其关键解决方案是提出PCM-NeRF框架,通过引入每张图像对应的可学习不确定性(per-camera learnable uncertainty),将相机位姿建模为具有可学习均值和方差的概率分布,并基于SfM对应点质量进行初始化。通过设计不确定性正则化损失函数,使学习到的方差与视图置信度耦合,进而动态调节各相机位姿的学习速率:不确定性高的相机获得更小的梯度更新幅度,从而避免不良初始位姿对整体重建造成干扰。该机制无需修改渲染流程,计算开销极低,且在包含严重位姿异常值的复杂几何场景中显著优于现有最优方法。
链接: https://arxiv.org/abs/2604.17831
作者: Shravan Venkatraman,Rakesh Raj Madavan,Pavan Kumar Sathya Venkatesh
机构: Mohamed bin Zayed University of AI (穆罕默德·本·扎耶德人工智能大学); University of Amsterdam (阿姆斯特丹大学); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: CVPR-W 2026 (GenRec3D)
Abstract:Neural surface reconstruction methods typically treat camera poses as fixed values, assuming perfect accuracy from Structure-from-Motion (SfM) systems. This assumption breaks down with imperfect pose estimates, leading to distorted or incomplete reconstructions. We present PCM-NeRF, a probabilistic framework that augments neural surface reconstruction with per-camera learnable uncertainty, built on top of SG-NeRF. Rather than treating all cameras equally throughout optimization, we represent each pose as a distribution with a learnable mean and variance, initialized from SfM correspondence quality. An uncertainty regularization loss couples the learned variance to view confidence, and the resulting uncertainty directly modulates the effective pose learning rate: uncertain cameras receive damped gradient updates, preventing poorly initialized views from corrupting the reconstruction. This lightweight mechanism requires no changes to the rendering pipeline and adds negligible overhead. Experiments on challenging scenes with severe pose outliers demonstrate that PCM-NeRF consistently outperforms state-of-the-art methods in both Chamfer Distance and F-Score, particularly for geometrically complex structures, without requiring foreground masks.
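The uncertainty-modulated pose learning rate can be sketched as below. The abstract only states that uncertain cameras receive damped gradient updates; the specific damping function (dividing by one plus the learned variance) is an assumed form for illustration.

```python
import numpy as np

def damped_pose_update(pose, grad, base_lr, log_var):
    """One gradient step on a camera pose whose magnitude shrinks as the
    learned per-camera uncertainty grows (assumed damping: 1/(1+var))."""
    var = np.exp(log_var)              # learnable log-variance per camera
    lr = base_lr / (1.0 + var)         # uncertain camera -> smaller step
    return pose - lr * grad
```

With this form, a well-initialized camera (very negative `log_var`) keeps nearly the full learning rate, while a poorly initialized one barely moves the shared reconstruction.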
[CV-85] GR4CIL: Gap-compensated Routing for CLIP-based Class Incremental Learning
【速读】:该论文旨在解决基于CLIP(Contrastive Language-Image Pre-trained)模型的类别增量学习(Class-Incremental Learning, CIL)中两个核心问题:一是共享参数适应导致的旧知识漂移(old-knowledge drift),二是任务特定知识组织引发的跨任务响应校准不足,从而难以实现可靠的特征路由。解决方案的关键在于提出GR4CIL框架,通过引入任务判别机制与知识路由策略,在保持增量稳定的共享文本语义空间的同时,保留任务特定的视觉知识;此外,设计正交补偿机制以缓解模态间隙引起的偏差,增强任务内判别能力,并扩大真实任务与竞争任务之间的得分差距,从而实现更可靠的任务感知知识路由,同时维持零样本泛化能力。
链接: https://arxiv.org/abs/2604.17822
作者: Tianqi Wang,Jingcai Guo
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Class-Incremental Learning (CIL) aims to continuously acquire new categories while preserving previously learned knowledge. Recently, Contrastive Language-Image Pre-trained (CLIP) models have shown strong potential for CIL due to their powerful generalization ability. However, existing methods still face two key challenges: shared-parameter adaptation tends to cause old-knowledge drift, and task-specific knowledge organization often leads to poorly calibrated cross-task responses, making reliable routing difficult. To address these issues, we propose GR4CIL, a framework combining task discrimination and knowledge routing for CLIP-based CIL. GR4CIL preserves task-specific visual knowledge while maintaining an incrementally stable shared textual semantic space, thereby reducing interference across tasks. Moreover, we introduce an orthogonal compensation mechanism to mitigate modality-gap-induced bias, enhance within-task discrimination, and enlarge the score margin between the ground-truth task and competing tasks. As a result, GR4CIL enables more reliable task-aware routing over learned knowledge while retaining the zero-shot generalization capability. Experiments on multiple benchmarks show that GR4CIL consistently outperforms strong baselines.
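One common way to realize an orthogonal compensation is to project features onto the complement of an estimated modality-gap direction; this sketch is an illustrative interpretation of the mechanism named in the abstract, not the paper's exact formulation.

```python
import numpy as np

def orthogonal_compensation(feature, direction, eps=1e-8):
    """Remove the component of `feature` along a modality-gap `direction`,
    keeping only the part orthogonal to it."""
    d = direction / (np.linalg.norm(direction) + eps)
    return feature - np.dot(feature, d) * d
```

After the projection, score margins between tasks are computed on features that no longer vary along the gap direction.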
[CV-86] AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion CVPR2026
【速读】:该论文旨在解决从互联网视频中重建三维人体运动(3D human motion)及人-物交互(Human-Object Interaction, HOI)的难题,尤其针对动态摄像机视角下难以获得全局一致性的3D姿态以及现有动作捕捉(Motion Capture, MoCap)数据集覆盖不足的罕见动作类型。解决方案的关键在于提出一个两阶段框架:第一阶段利用2D扩散模型生成各领域的多视角2D运动数据,通过提取互联网视频中的2D关键点来引入MoCap数据中罕见的人体动作;第二阶段训练一个相机条件约束的多视角2D运动扩散模型,基于上述合成数据恢复世界空间中的3D人体运动与3D人-物交互,从而实现更真实、连贯的三维行为建模。
链接: https://arxiv.org/abs/2604.17818
作者: Hongjie Li,Heng Yu,Jiaman Li,Hong-Xing Yu,Ehsan Adeli,C. Karen Liu,Jiajun Wu
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026. Project website: this https URL The first two authors contribute equally
Abstract:Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.
[CV-87] Re2MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement
【速读】:该论文旨在解决文本到动作(Text-to-Motion, T2M)生成模型在面对与训练文本分布差异较大的描述时性能显著下降的问题,即开放词汇(open-vocabulary)场景下的泛化能力不足。解决方案的关键在于提出一种名为Re²MoGen的推理与精炼框架,其核心创新包括:首先利用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)增强大型语言模型(Large Language Model, LLM)的推理能力,生成语义合理的关键帧;其次通过人体姿态模型作为先验知识优化全身体态,并基于规划的关键帧监督微调预训练的动作生成器,实现时空补全;最后采用物理感知奖励函数进行后训练强化学习(Reinforcement Learning, RL),以消除LLM规划动作中的物理不合理性,从而提升生成动作的物理合理性与语义一致性。
链接: https://arxiv.org/abs/2604.17807
作者: Jiakun Zheng,Ting Xiao,Shiqin Cao,Xinran Li,Zhe Wang,Chenjia Bai
机构: East China University of Science and Technology (华东理工大学); Hong Kong University of Science and Technology (香港科技大学); Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Text-to-motion (T2M) generation aims to control the behavior of a target character via textual descriptions. Leveraging text-motion paired datasets, existing T2M models have achieved impressive performance in generating high-quality motions within the distribution of their training data. However, their performance deteriorates notably when the motion descriptions differ significantly from the training texts. To address this issue, we propose Re²MoGen, a Reasoning and Refinement open-vocabulary Motion Generation framework that leverages enhanced Large Language Model (LLM) reasoning to generate an initial motion planning and then refine its physical plausibility via reinforcement learning (RL) post-training. Specifically, Re²MoGen consists of three stages: We first employ Monte Carlo tree search to enhance the LLM’s reasoning ability in generating reasonable keyframes of the motion based on text prompts, specifying only the root and several key joints’ positions to ease the reasoning process. Then, we apply a human pose model as a prior to optimize the full-body poses based on the planned keyframes and use the resulting incomplete motion to supervise fine-tuning a pre-trained motion generator via a dynamic temporal matching objective, enabling spatiotemporal completion. Finally, we use post-training with physics-aware reward to refine motion quality to eliminate physical implausibility in LLM-planned motions. Extensive experiments demonstrate that our framework can generate semantically consistent and physically plausible motions and achieve state-of-the-art performance in open-vocabulary motion generation.
[CV-88] View-Consistent 3D Scene Editing via Dual-Path Structural Correspondence and Semantic Continuity
【速读】:该论文旨在解决文本驱动的3D场景编辑中多视角不一致性的问题(multi-view inconsistency),这是现有基于渲染-编辑-优化(render-edit-optimize)范式方法的主要瓶颈。其解决方案的关键在于从分布建模的角度重新定义3D编辑任务,提出一种显式引入跨视图依赖关系的视图一致3D编辑框架;进一步设计双路径一致性机制,分别通过投影引导的结构指导(projection-guided structural guidance)和补丁级语义传播(patch-level semantic propagation)来分别建模结构对应性和语义连续性,从而实现更鲁棒且精确的跨视角编辑效果。
链接: https://arxiv.org/abs/2604.17801
作者: Pufan Li,Bi’an Du,Shenghe Zheng,Junyi Yao,Wei Hu
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所); Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (香港科技大学计算机科学与工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. 11 pages, 7 figures
Abstract:Text-driven 3D scene editing has recently attracted increasing attention. Most existing methods follow a render-edit-optimize pipeline, where multi-view images are rendered from a 3D scene, edited with 2D image editors, and then used to optimize the underlying 3D representation. However, cross-view inconsistency remains a major bottleneck. Although recent methods introduce geometric cues, cross-view interactions, or video priors to mitigate this issue, they still largely rely on inference-time synchronization and thus remain limited in robustness. In this work, we recast multi-view consistent 3D editing from a distributional perspective: 3D scene editing essentially requires joint distribution modeling across views. Building on this insight, we propose a view-consistent 3D editing framework that explicitly introduces cross-view dependencies into the editing process. Furthermore, motivated by the observation that structural correspondence and semantic continuity rely on different cross-view cues, we introduce a dual-path consistency mechanism consisting of projection-guided structural guidance and patch-level semantic propagation for effective cross-view editing. Further, we construct a paired multi-view editing dataset that provides reliable supervision for learning cross-view consistency in edited scenes. Extensive experiments demonstrate that our method achieves superior editing performance with precise and consistent views for complex scenes.
[CV-89] ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在复杂、长程操作任务中因缺乏显式推理过程而导致的可解释性差与泛化能力不足的问题。现有VLA模型通常直接学习输入观测与动作之间的映射关系,忽略了中间逻辑步骤,限制了其在高阶任务中的适应性。解决方案的关键在于提出ReFineVLA框架,通过引入由专家教师模型生成的推理理由(reasoning rationales)来增强机器人数据集,并在此基础上对预训练VLA模型进行微调,从而引导模型显式学习动作背后的逻辑推理机制,同时保持原有泛化能力。实验表明,该方法显著提升了模型在SimperEnv仿真环境下的WidowX和Google Robot任务上的成功率,实现了视觉-语言与动作域之间更一致的对齐与理解。
链接: https://arxiv.org/abs/2604.17800
作者: Tuan Van Vo,Tan Q. Nguyen,Khang Nguyen,Nhat Xuan Tran,Duy H. M. Nguyen,An T. Le,Ngo Anh Vien,Minh Nhat Vu
机构: VinRobotics (VinRobotics); University of Texas at Arlington (德克萨斯大学阿灵顿分校); Max Planck Research School for Intelligent Systems (IMPRS-IS) (马克斯·普朗克智能系统研究所); Intelligent Autonomous Systems Lab, TU Darmstadt (达姆施塔特工业大学智能自主系统实验室); Automation Control Institute, TU Wien (维也纳工业大学自动化控制研究所); Austrian Institute of Technology (AIT) (奥地利技术研究院); Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) (穆罕默德·本·扎耶德人工智能大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2505.19080
Abstract:Vision-Language-Action (VLA) models have gained much attention from the research community thanks to their strength in translating multimodal observations with linguistic instructions into desired robotic actions. Despite their advancements, VLAs often overlook explicit reasoning and learn functional input-action mappings while omitting crucial logical steps, a limitation especially pronounced in interpretability and generalization for complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided rationales. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. Then, we fine-tune pre-trained VLAs on the reasoning-enriched datasets with ReFineVLA, while maintaining the underlying generalization abilities and boosting reasoning capabilities. We also conduct attention map visualization to analyze the alignment among visual observations, linguistic prompts, and to-be-executed actions of ReFineVLA, reflecting the model's ability to focus on relevant tasks and actions. Through this analysis, we find that ReFineVLA-trained models exhibit a meaningful agreement between vision-language and action domains, highlighting the enhanced multimodal understanding and generalization. Evaluated across a suite of simulated manipulation benchmarks on SimplerEnv with both WidowX and Google Robot tasks, ReFineVLA achieves state-of-the-art performance, improving success rate over the second-best method on both the WidowX benchmark and Google Robot tasks.
[CV-90] Weakly-Supervised Referring Video Object Segmentation through Text Supervision CVPR2026
【速读】:该论文旨在解决引用视频目标分割(Referring Video Object Segmentation, RVOS)任务中对昂贵像素级掩码标注的依赖问题。传统方法多采用监督学习,需大量精细标注,而现有弱监督方案虽使用边界框或点标注,仍存在劳动密集的问题。本文提出一种全新的仅依赖文本表达进行训练的弱监督RVOS方法(WSRVOS),其关键在于:1)设计对比性文本增强策略,利用多模态大语言模型生成正负样本表达以构建对比学习信号;2)通过双向视觉-语言特征选择与交互实现细粒度跨模态对齐;3)引入实例感知的表达分类机制优化模型区分能力;4)提出正预测融合策略生成高质量伪掩码作为额外监督信号;5)设计时序片段排序约束,确保相邻帧预测掩码重叠遵循特定顺序,从而提升时序一致性。该方法显著降低了标注成本并提升了分割性能。
链接: https://arxiv.org/abs/2604.17797
作者: Miaojing Shi,Jun Huang,Zijie Yue,Hanli Wang
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026 Findings
Abstract:Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred by a text expression. Conventional approaches are mostly supervised learning, requiring expensive pixel-level mask annotations. To tackle it, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are however still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the model with only text expressions. Given an input video and the referring expression, we first design a contrastive referring expression augmentation scheme that leverages the captioning capabilities of a multimodal large language model to generate both positive and negative expressions. We extract visual and linguistic features from the input video and generated expressions, then perform bi-directional vision-language feature selection and interaction to enable fine-grained multimodal alignment. Next, we propose an instance-aware expression classification scheme to optimize the model in distinguishing positive from negative expressions. Also, we introduce a positive-prediction fusion strategy to generate high-quality pseudo-masks, which serve as additional supervision to the model. Last, we design a temporal segment ranking constraint such that the overlaps between mask predictions of temporally neighboring frames are required to conform to specific orders. Extensive experiments on four publicly available RVOS datasets, including A2D Sentences, J-HMDB Sentences, Ref-YouTube-VOS, and Ref-DAVIS17, demonstrate the superiority of our method. Code is available at this https URL.
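The temporal segment ranking constraint can be approximated as a hinge loss over mask overlaps: the abstract only says adjacent-frame overlaps must conform to specific orders, so the concrete ordering used here (adjacent overlap should exceed the two-step-apart overlap by a margin) is an assumed instantiation.

```python
import numpy as np

def mask_iou(a, b, eps=1e-7):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return (inter + eps) / (union + eps)

def temporal_ranking_loss(masks, margin=0.1):
    """Hinge penalty whenever the overlap with the next frame does not
    exceed the overlap with the frame after it by at least `margin`."""
    loss = 0.0
    for t in range(len(masks) - 2):
        near = mask_iou(masks[t], masks[t + 1])
        far = mask_iou(masks[t], masks[t + 2])
        loss += max(0.0, margin - (near - far))
    return loss
```

A motion-consistent track (high adjacent overlap, lower distant overlap) incurs zero loss; a temporally flickering or static-background prediction is penalized.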
[CV-91] Subject-Aware Multi-Granularity Alignment for Zero-Shot EEG-to-Image Retrieval
【速读】:该论文旨在解决零样本脑电图到图像检索(zero-shot EEG-to-image retrieval)中的鲁棒性问题,即如何从脑电图(EEG)信号中准确解码感知的视觉内容,尤其是在跨被试场景下仍保持高精度。现有方法通常依赖单一固定视觉目标或不变于个体的目标构建方式,忽视了视觉诱发EEG信号在多表征尺度上的信息保留特性以及不同个体间最优视觉粒度存在差异的关键事实。解决方案的核心在于提出一种主体感知的多粒度对齐(subject-aware multi-granularity alignment, SAMGA)框架:首先通过自适应聚合预训练视觉编码器的多个中间表示来构建主体感知的监督目标,使模型在训练阶段吸收个体差异的粒度偏差,同时保持跨个体推理的一致性;进而设计粗到细的跨模态对齐策略,在共享编码器架构下,粗粒度阶段稳定共享语义几何并减少个体引起的分布偏移,细粒度阶段进一步提升实例级检索判别能力。
链接: https://arxiv.org/abs/2604.17782
作者: Lin Jiang,Qingshan She,Jiale Xu,Haiqi Xu,Duanpo Wu,Zhenzhong Kuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Zero-shot EEG-to-image retrieval aims to decode perceived visual content from electroencephalography (EEG) by aligning neural responses with pretrained visual representations, providing a promising route toward scalable visual neural decoding and practical brain-computer interfaces. However, robust EEG-to-image retrieval remains challenging, because prior methods usually rely on either a single fixed visual target or a subject-invariant target construction scheme. Such designs overlook two important properties of visually evoked EEG signals: they preserve information across multiple representational scales, and the visual granularity best matched to EEG may vary across subjects. To address these issues, a subject-aware multi-granularity alignment (SAMGA) framework is proposed for zero-shot EEG-to-image retrieval. SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder, wherein the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting, and 34.4% Top-1 and 64.8% Top-5 accuracy in the inter-subject setting, outperforming recent state-of-the-art methods.
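The adaptive aggregation of intermediate encoder representations reduces to a learned convex combination of layer features; the softmax weighting below is an assumed concrete form of the paper's subject-aware target construction.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def aggregate_layers(layer_feats, logits):
    """Subject-aware supervision target: a softmax-weighted sum of
    intermediate vision-encoder representations. The logits would be
    per-subject learnable parameters during training."""
    w = softmax(logits)
    stacked = np.stack(layer_feats)     # (num_layers, feat_dim)
    return w @ stacked, w
```

At inference the aggregation is no longer needed: the EEG encoder itself stays subject-agnostic, which matches the "subject-agnostic inference" property described above.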
[CV-92] Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement
【速读】:该论文旨在解决三维(3D)医学图像增强(包括去噪和超分辨率)在计算上的高成本问题,尤其是在将扩散模型从二维扩展到高分辨率3D体数据时,由于长扩散轨迹导致的训练效率低下。其核心解决方案是提出一种稀疏体素空间扩散框架(sparse voxel-space diffusion framework),通过在均匀下采样的时间步上进行训练和采样,显著减少计算开销;同时,利用条件增强中输入图像中存在的强解剖先验,使密集的时间步调度冗余,从而实现仅用稀疏时间步即可稳定训练。此外,引入轻量级结构感知轨迹调制(Structure-aware Trajectory Modulation, STM)模块,根据局部解剖内容动态调整每个网络块的时间嵌入,实现结构自适应去噪,并在保持细粒度解剖细节的同时,达到最高10倍的训练加速效果。
链接: https://arxiv.org/abs/2604.17773
作者: Hongxu Jiang,Fei Li,Boxiao Yu,Ying Zhang,Kaleb Smith,Kuang Gong,Wei Shao
机构: University of Florida (佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Three-dimensional (3D) medical image enhancement, including denoising and super-resolution, is critical for clinical diagnosis in CT, PET, and MRI. Although diffusion models have shown remarkable success in 2D medical imaging, scaling them to high-resolution 3D volumes remains computationally prohibitive due to lengthy diffusion trajectories over high-dimensional volumetric data. We observe that in conditional enhancement, strong anatomical priors in the degraded input render dense noise schedules largely redundant. Leveraging this insight, we propose a sparse voxel-space diffusion framework that trains and samples on a compact set of uniformly subsampled timesteps. The network predicts clean data directly on the data manifold, supervised in velocity space for stable gradient scaling. A lightweight Structure-aware Trajectory Modulation (STM) module recalibrates time embeddings at each network block based on local anatomical content, enabling structure-adaptive denoising over the shared sparse schedule. Operating directly in voxel space, our framework preserves fine anatomical detail without lossy compression while achieving up to 10× training acceleration. Experiments on four datasets spanning CT, PET, and MRI demonstrate state-of-the-art performance on both denoising and super-resolution tasks. Our code is publicly available at: this https URL.
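Uniform timestep subsampling, the core of the sparse schedule, is simple to state concretely: keeping 100 of 1000 steps corresponds to a 10x reduction in trajectory length, in line with the reported up-to-10x training acceleration (the authors' exact schedule parameters may differ).

```python
import numpy as np

def sparse_timesteps(total_steps=1000, num_keep=100):
    """Uniformly subsample a dense diffusion schedule to a compact set of
    timesteps, keeping both endpoints of the trajectory."""
    ts = np.linspace(0, total_steps - 1, num_keep)
    return np.unique(np.round(ts).astype(int))
```

Training then draws timesteps only from this set, and sampling steps through it instead of the full dense trajectory.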
[CV-93] Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos CVPR2026
【速读】:该论文旨在解决从第一人称视角(egocentric perspective)建模物理变换过程的问题,即在简短动作指令下生成描述物体从初始状态到目标状态之间过渡的中间帧序列,这一任务被称为Egocentric Instructed Visual State Transition (EIVST)。当前生成模型面临两大挑战:一是理解初始与目标状态的视觉场景并从第一人称视角推理变换步骤;二是生成符合指令且保持物体外观一致性的连贯过渡序列。解决方案的关键在于提出EgoIn框架,其核心包括:(1) 使用基于自定义数据集微调的TransitionVLM模型推断多步变换过程,以减少幻觉信息;(2) 引入Transition Conditioning模块生成满足条件的帧序列;(3) 通过Object-aware Auxiliary Supervision机制确保物体外观在变换过程中的一致性。
链接: https://arxiv.org/abs/2604.17749
作者: Mengmeng Ge,Takashi Isobe,Xu Jia,Yanan Sun,Zetong Yang,Weinong Wang,Dong Zhou,Dong Li,Huchuan Lu,Emad Barsoum
机构: Advanced Micro Devices, Inc. (超威半导体公司); Dalian University of Technology (大连理工大学); The Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong (香港中文大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2026
Abstract:Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn’s superior performance in generating semantically meaningful and visually coherent transformation sequences.
[CV-94] Source-Free Domain Adaptation with Vision-Language Prior
【速读】:该论文旨在解决无源域适应(Source-Free Domain Adaptation, SFDA)中因依赖伪标签和/或辅助监督而导致的错误传播问题。传统方法在仅使用未标注目标域数据时,容易产生不准确的伪标签,从而限制模型性能提升。为克服此局限,作者首次探索了现成视觉-语言(Vision-Language, ViL)多模态模型(如CLIP)所蕴含的丰富且异构知识的潜力,并提出了一种新颖的DIFO++方法。其关键在于通过两个交替步骤实现任务特定的知识迁移:首先利用提示学习(prompt learning)最大化ViL模型与目标模型之间的互信息以定制化ViL模型;其次将该定制化ViL模型的知识蒸馏至目标模型,聚焦于“间隙区域”(gap region)的减少——即识别并优化特征混杂、类别模糊的区域,以增强任务相关语义表达。在此过程中,结合类别注意力机制、预测一致性约束以及参考熵最小化策略,有效提升了伪标签可靠性与语义对齐精度,显著优于现有最先进方法。
链接: https://arxiv.org/abs/2604.17748
作者: Song Tang,Yunxiang Bai,Wenxin Su,Mao Ye,Jianwei Zhang,Xiatian Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Source-Free Domain Adaptation (SFDA) seeks to adapt a source model, which is pre-trained on a supervised source domain, for a target domain, with only access to unlabeled target training data. Relying on pseudo labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we for the first time explore the potential of off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP) with rich whilst heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is not specialized for this particular task but largely generic. To make it task-specific, we propose a novel DIFO++ approach. Specifically, DIFO++ alternates between two steps during adaptation: (i) Customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner, (ii) Distilling the knowledge of this customized ViL model to the target model, centering on gap region reduction. During progressive knowledge adaptation, we first identify and focus on the gap region, where enclosed features are entangled and class-ambiguous, as it often captures richer task-specific semantics. Reliable pseudo-labels are then generated by fusing predictions from the target and ViL models, supported by a memory mechanism. Finally, gap region reduction is guided by category attention and predictive consistency for semantic alignment, complemented by referenced entropy minimization to suppress uncertainty. Extensive experiments show that DIFO++ significantly outperforms the state-of-the-art alternatives. Our code and data are available at this https URL.
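The pseudo-label fusion step can be sketched as a confidence-gated weighted average of the two models' class probabilities. The mixing weight `alpha` and threshold `tau` are illustrative hyperparameters, and the paper's supporting memory mechanism is omitted here.

```python
import numpy as np

def fuse_pseudo_labels(p_target, p_vil, alpha=0.5, tau=0.9):
    """Fuse target-model and ViL-model class probabilities; accept a
    pseudo-label only when the fused confidence clears the threshold.
    Returns (label, confidence), with label=None for rejected samples."""
    fused = alpha * p_target + (1 - alpha) * p_vil
    label = int(np.argmax(fused))
    conf = float(fused[label])
    return (label, conf) if conf >= tau else (None, conf)
```

Samples where the two models disagree yield low fused confidence and are held back from supervision, which is one way such fusion reduces the error-propagation problem noted above.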
[CV-95] IncreFA: Breaking the Static Wall of Generative Model Attribution
【速读】:该论文旨在解决生成式 AI (Generative AI) 图像归属(image attribution)在模型快速迭代背景下的适应性问题,即现有水印、分类器和逆向方法在新扩散、对抗和自回归生成模型发布后迅速失效的问题。其核心挑战并非模型识别本身,而是归属机制难以持续更新以应对新型生成模型的涌现。解决方案的关键在于提出 IncreFA 框架,将归属任务重新建模为结构化的增量学习问题,并引入两个相互增强的机制:一是层级约束(Hierarchical Constraints),通过可学习的正交先验编码生成架构的层次关系,分离家族级不变特征与模型特异性差异;二是潜在记忆库(Latent Memory Bank),通过回放紧凑的潜在样本并混合生成伪未见样本,稳定表示漂移并提升开放集感知能力。
链接: https://arxiv.org/abs/2604.17736
作者: Haotian Qin,Dongliang Chang,Yueying Gao,Lei Chen,Zhanyu Ma
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As AI generative models evolve at unprecedented speed, image attribution has become a moving target. New diffusion, adversarial and autoregressive generators appear almost monthly, making existing watermark, classifier and inversion methods obsolete upon release. The core problem lies not in model recognition, but in the inability to adapt attribution itself. We introduce IncreFA, a framework that redefines attribution as a structured incremental learning problem, allowing the system to learn continuously as new generative models emerge. IncreFA departs from conventional incremental learning by exploiting the hierarchical relationships among generative architectures and coupling them with continual adaptation. It integrates two mutually reinforcing mechanisms: (1) Hierarchical Constraints, which encode architectural hierarchies through learnable orthogonal priors to disentangle family-level invariants from model-specific idiosyncrasies; and (2) a Latent Memory Bank, which replays compact latent exemplars and mixes them to generate pseudo-unseen samples, stabilising representation drift and enhancing open-set awareness. On the newly constructed Incremental Attribution Benchmark (IABench) covering 28 generative models released between 2022 and 2025, IncreFA achieves state-of-the-art attribution accuracy and 98.93% unseen detection under a temporally ordered open-set protocol. Code will be available at this https URL.
[CV-96] Score-Based Matching with Target Guidance for Cryo-EM Denoising
【Quick Read】: This paper addresses the extremely low signal-to-noise ratio and weak particle visibility of cryo-electron microscopy (cryo-EM) micrographs acquired under low-dose imaging, which degrades downstream analysis such as particle picking, 2D classification, and 3D reconstruction. Existing denoisers are mostly trained with pixel-wise or Noise2Noise-style objectives, which improve visual quality but ignore the structural consistency that downstream tasks require. The key idea is a score-based denoising framework that learns the clean-data score function to recover particle signals while better preserving structural information; a target-guided variant further uses reference-density guidance to stabilize score learning, especially under weak and ambiguous signals. The method also suppresses structured low-frequency background noise, improving particle-background separability and, in turn, particle-picking accuracy and the structural consistency of 3D reconstructions.
Link: https://arxiv.org/abs/2604.17734
Authors: Xiaoqi Wu,Xueying Zhan,Wen Li,Junhao Wu,Xin Huang,Min Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Cryo-electron microscopy (cryo-EM) enables single-particle analysis of biological macromolecules under strict low-dose imaging conditions, but the resulting micrographs often exhibit extremely low signal-to-noise ratios and weak particle visibility. Image denoising is therefore an important preprocessing step for downstream cryo-EM analysis, including particle picking, 2D classification, and 3D reconstruction. Existing cryo-EM denoising methods are commonly trained with pixel-wise or Noise2Noise-style objectives, which can improve visual quality but do not explicitly account for structural consistency required by downstream analysis. In this work, we propose a score-based denoising framework for cryo-EM that learns the clean-data score to recover particle signals while better preserving structural information. Building on this formulation, we further introduce a target-guided variant that incorporates reference-density guidance to stabilize score learning under weak and ambiguous signal conditions. Rather than simply amplifying particle-like responses, our framework better suppresses structured low-frequency background, which improves particle–background separability for downstream analysis. Experiments on multiple cryo-EM datasets show that our score-based methods consistently improve downstream particle picking and produce more structure-consistent 3D reconstructions.
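The clean-data score objective can be illustrated with standard denoising score matching: under Gaussian corruption, the regression target of the score network is the negative noise scaled by the variance. A toy sketch under assumed names (not the paper's code):

```python
import numpy as np

def dsm_loss(score_fn, x_clean, sigma, rng):
    """Denoising score matching: perturb x with Gaussian noise and
    regress the model score onto the analytic target -noise/sigma^2."""
    noise = rng.normal(0.0, sigma, size=x_clean.shape)
    x_noisy = x_clean + noise
    target = -noise / sigma ** 2
    return float(np.mean((score_fn(x_noisy) - target) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(64,))
sigma = 0.5
# The oracle score -(y - x)/sigma^2 attains (near-)zero loss by construction.
oracle = lambda y: -(y - x) / sigma ** 2
print(dsm_loss(oracle, x, sigma, rng))  # ~0.0 up to float rounding
```

The paper's target-guided variant additionally conditions this regression on a reference density; that conditioning is omitted here.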
[CV-97] Voronoi-guided Bilateral 2D Gaussian Splatting for Arbitrary-Scale Hyperspectral Image Super-Resolution
【Quick Read】: This paper tackles the limited flexibility of existing hyperspectral image super-resolution (HSI-SR) methods at arbitrary scales, along with the spatial-modeling limitations of conventional rasterization strategies. The key is the GaussianHSI framework, whose spatial reconstruction is based on Voronoi-Guided Bilateral 2D Gaussian Splatting: it first predicts a set of Gaussian functions to represent the input, then uses Voronoi-region guidance to select the Gaussians relevant to each target pixel, and finally aggregates the selected Gaussians with reference-aware bilateral weighting that accounts for both geometric relevance and consistency with low-resolution features. A Spectral Detail Enhancement module further improves spectral reconstruction, enabling efficient, high-fidelity hyperspectral reconstruction at arbitrary scales.
Link: https://arxiv.org/abs/2604.17727
Authors: Jie Zhang,Jinkun You,Shi Chen,Yicong Zhou
Affiliations: University of Macau(澳门大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Most existing hyperspectral image super-resolution methods require modifications for different scales, limiting their flexibility in arbitrary-scale reconstruction. 2D Gaussian splatting provides a continuous representation that is compatible with arbitrary-scale super-resolution. Existing methods often rely on rasterization strategies, which may limit flexible spatial modeling. Extending them to hyperspectral image super-resolution remains challenging, as the task requires adaptive spatial reconstruction while preserving spectral fidelity. This paper proposes GaussianHSI, a Gaussian-Splatting-based framework for arbitrary-scale hyperspectral image super-resolution. We develop a Voronoi-Guided Bilateral 2D Gaussian Splatting for spatial reconstruction. After predicting a set of Gaussian functions to represent the input, it associates each target pixel with relevant Gaussian functions through Voronoi-guided selection. The target pixel is then reconstructed by aggregating the selected Gaussian functions with reference-aware bilateral weighting, which considers both geometric relevance and consistency with low-resolution features. We further introduce a Spectral Detail Enhancement module to improve spectral reconstruction. Extensive experiments on benchmark datasets demonstrate the effectiveness of GaussianHSI over state-of-the-art methods for arbitrary-scale hyperspectral image super-resolution.
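The reference-aware bilateral aggregation described above can be sketched as a weighted sum over selected 2D Gaussians, with weights combining a geometric kernel and a feature-similarity kernel. A hedged toy version (the Voronoi selection step is assumed to have already produced the candidate set; all names and shapes are illustrative):

```python
import numpy as np

def splat_pixel(q, centers, covs_inv, values, feat_q, feats, beta=1.0):
    """One target pixel = bilateral-weighted sum of selected Gaussians:
    geometric weight from the 2D Gaussian kernel, bilateral weight from
    feature consistency with the low-resolution reference."""
    d = centers - q                                         # (n, 2) offsets
    geo = np.exp(-0.5 * np.einsum('ni,nij,nj->n', d, covs_inv, d))
    bil = np.exp(-beta * np.sum((feats - feat_q) ** 2, axis=1))
    w = geo * bil
    return (w @ values) / w.sum()

# A single Gaussian centered at the query reproduces its value exactly.
out = splat_pixel(np.zeros(2), np.zeros((1, 2)), np.eye(2)[None],
                  np.array([3.0]), np.zeros(4), np.zeros((1, 4)))
print(out)  # 3.0
```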
[CV-98] GeGS-PCR: Effective and Robust 3D Point Cloud Registration with Two-Stage Color-Enhanced Geometric-3DGS Fusion
【Quick Read】: This paper addresses the degradation of geometry-only point cloud registration under low overlap and incomplete structures. The core solution is a two-stage method, GeGS-PCR, that fuses geometric, color, and Gaussian information: a dedicated color encoder first extracts multi-level geometric and color features; a Geometric-3DGS module then ensures a globally invariant geometric-color context within local neighborhoods; LORA optimization preserves the expressiveness of the 3DGS representation while fast differentiable rendering improves convergence; finally, a joint photometric loss optimizes geometric and color features together, achieving accurate registration even at extremely low overlap and clearly outperforming existing methods.
Link: https://arxiv.org/abs/2604.17721
Authors: Jiayi Tian,Haiduo Huang,Tian Xia,Wenzhe Zhao,Pengju Ren
Affiliations: Xi’an Jiaotong University (西安交通大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We address the challenge of point cloud registration using color information, where traditional methods relying solely on geometric features often struggle in low-overlap and incomplete scenarios. To overcome these limitations, we propose GeGS-PCR, a novel two-stage method that combines geometric, color, and Gaussian information for robust registration. Our approach incorporates a dedicated color encoder that enhances color features by extracting multi-level geometric and color data from the original point cloud. We introduce the Geometric-3DGS module, which encodes the local neighborhood information of colored superpoints to ensure a globally invariant geometric-color context. Leveraging LORA optimization, we maintain high performance while preserving the expressiveness of 3DGS. Additionally, fast differentiable rendering is utilized to refine the registration process, leading to improved convergence. To further enhance performance, we propose a joint photometric loss that exploits both geometric and color features. This enables strong performance in challenging conditions with extremely low point cloud overlap. We validate our method by colorizing the Kitti dataset as ColorKitti and testing on both Color3DMatch and Color3DLoMatch datasets. Our method achieves state-of-the-art performance with Registration Recall at 99.9%, Relative Rotation Error as low as 0.013, and Relative Translation Error as low as 0.024, improving precision by at least a factor of 2.
[CV-99] FlashFPS: Efficient Farthest Point Sampling for Large-Scale Point Clouds via Pruning and Caching
【Quick Read】: This paper targets the substantial inference latency that Farthest Point Sampling (FPS) introduces into point-based neural networks (PNNs), which becomes the dominant bottleneck for large-scale point clouds because FPS performs exhaustive computation repeatedly across network layers. The key is to systematically identify and eliminate three kinds of redundancy in FPS: unnecessary full-cloud computation, redundant late-stage iterations, and repeated computation caused by predictable inter-layer outputs. The proposed hardware-agnostic, plug-and-play acceleration framework FlashFPS consists of two modules: FPS-Prune reduces redundant computation via candidate pruning and iteration pruning while preserving sampling quality, and FPS-Cache removes inter-layer redundancy through a cache-and-reuse mechanism. Integrated into existing CUDA libraries and state-of-the-art PNN accelerators, it achieves a 5.16x speedup on GPU and 2.69x on PNN accelerators with negligible accuracy loss, substantially improving the efficiency and scalability of PNN inference.
Link: https://arxiv.org/abs/2604.17720
Authors: Yuzhe Fu,Hancheng Ye,Cong Guo,Junyao Zhang,Qinsi Wang,Yueqian Lin,Changchun Zhou, Hai (Helen)Li,Yiran Chen
Affiliations: Duke University (杜克大学)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to DAC’26
Abstract:Point-based Neural Networks (PNNs) have become a key approach for point cloud processing. However, a core operation in these models, Farthest Point Sampling (FPS), often introduces significant inference latency, especially for large-scale processing. Despite existing CUDA- and hardware-level optimizations, FPS remains a major bottleneck due to exhaustive computations across multiple network layers in PNNs, which hinders scalability. Through systematic analysis, we identify three substantial redundancies in FPS, including unnecessary full-cloud computations, redundant late-stage iterations, and predictable inter-layer outputs that make later FPS computations avoidable. To address these, we propose FlashFPS, a hardware-agnostic, plug-and-play framework for FPS acceleration, composed of FPS-Prune and FPS-Cache. FPS-Prune introduces candidate pruning and iteration pruning to reduce redundant computations in FPS while preserving sampling quality, and FPS-Cache eliminates layer-wise redundancy via cache-and-reuse. Integrated into existing CUDA libraries and state-of-the-art PNN accelerators, FlashFPS achieves a 5.16× speedup over the standard CUDA baseline on GPU and 2.69× on PNN accelerators, with negligible accuracy loss, enabling efficient and scalable PNN inference. Codes are released at this https URL.
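For context, the operation FlashFPS accelerates is vanilla farthest point sampling, which maintains a running minimum-distance array and greedily selects the farthest remaining point. A reference sketch of the standard algorithm (not the paper's optimized kernels):

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Standard O(n*k) FPS: each iteration updates every point's squared
    distance to the chosen set, then selects the argmax."""
    n = points.shape[0]
    chosen = [seed]
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        diff = points - points[chosen[-1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))
        chosen.append(int(np.argmax(dist)))
    return chosen

# On a unit square starting at corner 0, FPS picks the opposite corner next.
square = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(farthest_point_sampling(square, 3))  # [0, 3, 1]
```

The per-iteration full-cloud distance update is exactly the redundancy that FPS-Prune targets, and the repetition of this loop at every network layer is what FPS-Cache avoids.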
[CV-100] Dynamic Visual-semantic Alignment for Zero-shot Learning with Ambiguous Labels ICME2026
【Quick Read】: This paper addresses the fact that existing zero-shot learning (ZSL) methods usually assume clean labels, ignoring the label noise and ambiguity found in the real world, which degrades performance. The key is the Dynamic Visual-semantic Alignment (DVSA) framework: a bidirectional visual-semantic alignment module with attention mutually calibrates visual features and attribute prototypes; a contrastive optimization grounded in mutual information (MI) strengthens discriminative, semantically consistent attributes at the attribute level; and a dynamic label disambiguation mechanism iteratively corrects noisy supervision while preserving semantic consistency, narrowing the instance-label gap and improving generalization under ambiguous labels.
Link: https://arxiv.org/abs/2604.17710
Authors: Jiangnan Li,Linqing Huang,Xiaowen Yan,Min Gan,Wenpeng Lu,Jinfu Fan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2026 (IEEE International Conference on Multimedia and Expo)
Abstract:Zero-shot learning (ZSL) aims to recognize unseen classes without visual instances. However, existing methods usually assume clean labels, overlooking real-world label noise and ambiguity, which degrades performance. To bridge this gap, we propose the Dynamic Visual-semantic Alignment (DVSA), a robust ZSL framework for learning from ambiguous labels. DVSA uses a bidirectional visual-semantic alignment module with attention to mutually calibrate visual features and attribute prototypes, and a contrastive optimization grounded in Mutual Information (MI) at the attribute level to strengthen discriminative, semantically consistent attributes. In addition, a dynamic label disambiguation mechanism iteratively corrects noisy supervision while preserving semantic consistency, narrowing the instance-label gap, and improving generalization. Extensive experiments on standard benchmarks verify that DVSA achieves stronger performance under ambiguous supervision.
[CV-101] Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation
【Quick Read】: This paper addresses the tendency of current Transformer-based methods for 3D human pose estimation to overlook local skeletal relationships and the information interaction between channels: they model global spatio-temporal relations well but capture local structure poorly, limiting performance. The key is a dual-stream spatio-temporal GCN-Transformer architecture (MixTGFormer) whose core is a stack of Mixformer modules, each containing two parallel Mixformer Blocks of different modes that extract and fuse multi-dimensional skeletal information, plus a squeeze-and-excitation (SE) layer that further enriches the representation. By embedding graph convolutional networks (GCN) into the Transformer, the model exploits both local and global information, achieving state-of-the-art P1 errors of 37.6mm and 15.7mm on the Human3.6M and MPI-INF-3DHP benchmarks, respectively.
Link: https://arxiv.org/abs/2604.17688
Authors: Jiawen Duan,Jian Xiang,Zhiqiang Li,Linlin Xue,Wan Xiang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Displays, Vol. 93, 2026, Article 103429. DOI: this https URL Free access: this https URL
Abstract:3D human pose estimation is a classic and important research direction in the field of computer vision. In recent years, Transformer-based methods have made significant progress in lifting 2D to 3D human pose estimation. However, these methods primarily focus on modeling global temporal and spatial relationships, neglecting local skeletal relationships and the information interaction between different channels. Therefore, we have proposed a novel method, the Dual-stream Spatio-temporal GCN-Transformer Network (MixTGFormer). This method models the spatial and temporal relationships of human skeletons simultaneously through two parallel channels, achieving effective fusion of global and local features. The core of MixTGFormer is composed of stacked Mixformers. Specifically, the Mixformer includes the Mixformer Block and the Squeeze-and-Excitation Layer (SE Layer). It first extracts and fuses various information of human skeletons through two parallel Mixformer Blocks with different modes. Then, it further supplements the fused information through the SE Layer. The Mixformer Block integrates Graph Convolutional Networks (GCN) into the Transformer, enhancing both local and global information utilization. Additionally, we further implement its temporal and spatial forms to extract both spatial and temporal relationships. We extensively evaluated our model on two benchmark datasets (Human3.6M and MPI-INF-3DHP). The experimental results showed that, compared to other methods, our MixTGFormer achieved state-of-the-art results, with P1 errors of 37.6mm and 15.7mm on these datasets, respectively.
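The SE Layer mentioned above is the standard squeeze-and-excitation block: global-average-pool each channel, pass the result through a small bottleneck, and rescale channels with a sigmoid gate. A minimal NumPy sketch (weights and shapes are illustrative, not the paper's configuration):

```python
import numpy as np

def se_layer(x, w1, w2):
    """x: (channels, height, width). Squeeze -> bottleneck -> excite."""
    s = x.mean(axis=(1, 2))                    # squeeze: per-channel mean
    z = np.maximum(0.0, w1 @ s)                # reduce + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ z)))     # expand + sigmoid
    return x * gate[:, None, None]             # channel-wise rescale

# With zero weights the gate is sigmoid(0) = 0.5 for every channel.
x = np.ones((4, 3, 3))
out = se_layer(x, np.zeros((2, 4)), np.zeros((4, 2)))
print(out[0, 0, 0])  # 0.5
```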
[CV-102] Low Light Image Enhancement Challenge at NTIRE 2026
【Quick Read】: This paper addresses low-light image enhancement: recovering information lost to low contrast and noise under diverse, challenging conditions so as to produce clearer, more visually compelling images. The key lies in designing effective network architectures that learn representative visual cues and perform joint denoising and enhancement, thereby markedly improving image quality. Using a novel dataset and evaluating the solutions of 22 participating teams, the paper systematically surveys recent progress in the field.
Link: https://arxiv.org/abs/2604.17669
Authors: George Ciubotariu,Sharif S M A,Abdur Rehman,Fayaz Ali Dharejo,Rizwan Ali Naqvi,Marcos V. Conde,Radu Timofte,Zhi Jin,Hongjun Wu,Wenjian Zhang,Chang Ye,Xunpeng Yi,Qinglong Yan,Yibing Zhang,Nikhil Akalwadi,Varda I Pattanshetty,Varsha I Pattanshetty,Padmashree Desai,Uma Mudenagudi,Ramesh Ashok Tabib,Hao Yang,Ruikun Zhang,Liyuan Pan,Furkan Kınlı,Donghun Ryou,Inju Ha,Junoh Kang,Bohyung Han,Wei Zhou,Yuval Haitman,Ariel Lapid,Reuven Peretz,Idit Diamant,Leilei Cao,Shuo Zhang,Praful Hambarde,Prateek Shaily,Jayant Kumar,Hardik Sharma,Aashish Negi,Sachin Chaudhary,Akshay Dudhane,Amit Shukla,MoHao Wu,Lin Wang,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Raul Balmez,Alexandru Brateanu,Ciprian Orhei,Cosmin Ancuti,Codruta O. Ancuti,Bilel Benjdira,Anas M. Ali,Wadii Boulila,Kaifan Qiao,Bofei Chen,Jingyi Xu,Duo Zhang,Xin Deng,Mai Xu,Shengxi Li,Lai Jiang,Harini A,Ananya N,Lakshanya K,Ying Xu,Xinyi Zhu,Shijun Shi,Jiangning Zhang,Yong Liu,Kai Hu,Jing Xu,Xianfang Zeng,Jinao Song,Guangsheng Tang,Cheng Li,Yuqiang Yang,Ziyi Wang,Yan Chen,Long Bao,Heng Sun,Mohab Kishawy,Jun Chen,Wan-Chi Siu,Yihao Cheng,Hon Man Hammond Lee,Chun-Chuen Hui
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper presents a comprehensive review of the NTIRE 2026 Low Light Image Enhancement Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions by learning representative visual cues with the purpose of restoring information loss due to low-contrast and noisy images. A total of 195 participants registered for the first track and 153 for the second track of the competition, and 22 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in (joint denoising and) low-light image enhancement, showcasing the significant progress in the field, while leveraging samples of our novel dataset.
[CV-103] Self-Supervised Super-Resolution for Sentinel-5P Hyperspectral Images
【Quick Read】: This paper addresses the limited spatial resolution of Sentinel-5P (S5P) data, which hinders fine-scale atmospheric monitoring. Existing super-resolution (SR) methods rely on supervised learning with synthetic low-resolution (LR)/high-resolution (HR) pairs, but true HR data do not exist, limiting their applicability to real observations. The key is a self-supervised hyperspectral SR framework that requires no HR ground truth: it combines Stein's Unbiased Risk Estimator (SURE) with an equivariant imaging constraint, incorporates the S5P degradation operator and noise statistics derived from signal-to-noise ratio (SNR) metadata, and uses depthwise-separable-convolution U-Net architectures for computational efficiency and spectral fidelity. Experiments show that without HR references the method matches supervised performance, producing physically plausible results with spatial detail superior to bicubic interpolation.
Link: https://arxiv.org/abs/2604.17652
Authors: Hyam Omar Ali,Antoine Crosnier,Romain Abraham,Baptiste Combelles,Fabrice Jégou,Bruno Galerne
Affiliations: Université d’Orléans (奥尔良大学); Université de Tours (图尔大学); CNRS (法国国家科学研究中心); IDP, UMR 7013 (信息与决策研究所,UMR 7013); Faculty of Mathematical Sciences, University of Khartoum (喀土穆大学数学科学学院); ENS Lyon (里昂高等师范学校); Laboratory of Physics and Chemistry of the Environment and Space (LPC2E), CNRS UMR 7328, University of Orléans (环境与空间物理化学实验室,CNRS UMR 7328,奥尔良大学); Institut universitaire de France (法国大学研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sentinel-5P (S5P) plays a critical role in atmospheric monitoring; however, its spatial resolution limits fine-scale analysis. Existing super-resolution (SR) approaches rely on supervised learning with synthetic low-resolution (LR) data, since true high-resolution (HR) data do not exist, limiting their applicability to real observations. We propose a self-supervised hyperspectral SR framework for S5P that enables training without HR ground truth. The method combines Stein’s Unbiased Risk Estimator (SURE) with an equivariant imaging constraint, incorporating the S5P degradation operator and noise statistics derived from signal-to-noise ratio (SNR) metadata. We also introduce depthwise separable convolution U-Net architectures designed for efficiency and spectral fidelity. The framework is evaluated in two settings: (i) LR-HR, where synthetic LR data are used for direct comparison with supervised learning, and (ii) GT-SHR, where super-resolved images surpass the native spatial resolution without HR reference. Results across multiple bands show that self-supervised models achieve performance comparable to supervised methods while maintaining strong consistency. Qualitative analysis shows improved spatial detail over bicubic interpolation, and validation with EMIT data confirms that reconstructed structures are physically meaningful. Code is available at this https URL
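SURE, the estimator at the heart of the framework, scores a denoiser from noisy data alone via ||f(y) - y||^2 - n*sigma^2 + 2*sigma^2*div f(y). For a linear shrinkage denoiser f(y) = a*y the divergence is n*a, giving a closed form. A toy illustration (not the S5P pipeline, where the divergence is estimated for a deep network):

```python
import numpy as np

def sure_linear(y, a, sigma):
    """SURE for f(y) = a*y under i.i.d. Gaussian noise of std sigma:
    residual term - n*sigma^2 + 2*sigma^2 * divergence (= n*a here)."""
    n = y.size
    residual = float(np.sum((a * y - y) ** 2))
    return residual - n * sigma ** 2 + 2 * sigma ** 2 * n * a

# For the identity denoiser (a = 1), SURE reduces to n*sigma^2,
# the expected MSE of leaving the noise untouched.
y = np.arange(8.0)
print(sure_linear(y, 1.0, 0.5))  # 2.0
```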
[CV-104] Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
【Quick Read】: This paper addresses the limitation that current generative world models for autonomous driving universally adopt an ego-vehicle perspective and lack systematic modeling from the viewpoint of traffic infrastructure. The core question is how to exploit the bird's-eye, multi-sensor, persistent observation advantages of fixed roadside sensors to build spatio-temporally complementary environment simulation, improving the safety and generalization of autonomous driving systems. The key is the proposed Infrastructure-centric World Models (I-WM), realized in three phases: (I) generative scene understanding with quality-aware uncertainty; (II) physics-informed predictive dynamics with multi-agent counterfactual reasoning; and (III) collaborative V2X modeling via latent-space alignment. The paper further contributes a dual-layer architecture with an annotation-free perception engine and introduces Infrastructure VLA (I-VLA), a new paradigm unifying roadside perception, language commands, and traffic-control actions, laying theoretical foundations and a technical path toward predictive intelligent traffic infrastructure.
Link: https://arxiv.org/abs/2604.17651
Authors: Siyuan Meng,Chengbo Ai
Affiliations: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 18 pages, 7 tables, 1 figure, vision paper
Abstract:World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt an ego-vehicle perspective, leaving the infrastructure viewpoint unexplored. We argue that infrastructure-centric world models offer a fundamentally complementary capability: the bird’s-eye, multi-sensor, persistent viewpoint that roadside systems uniquely possess. Central to our thesis is a spatio-temporal complementarity: fixed roadside sensors excel at temporal depth, accumulating long-term behavioral distributions including rare safety-critical events, while vehicle-borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure-centric World Models (I-WM) in three phases: (I) generative scene understanding with quality-aware uncertainty propagation, (II) physics-informed predictive dynamics with multi-agent counterfactual reasoning, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual-layer architecture, annotation-free perception as a multi-modal data engine feeding end-to-end generative world models, with a phased sensor strategy from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I-WM relative to LeCun’s JEPA, Li Fei-Fei’s spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I-VLA) as a novel unification of roadside perception, language commands, and traffic control actions. Our vision builds upon existing multi-LiDAR pipelines and identifies open-source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.
[CV-105] BioVLM: Routing Prompts Not Parameters for Cross-Modality Generalization in Biomedical VLMs ACL
【Quick Read】: This paper addresses the performance degradation of pretrained biomedical vision-language models (VLMs) on challenging modalities, especially when inter-class margins are small, acquisition-specific variation is pronounced, supervision is few-shot, and modality priors differ substantially from the pretraining corpora. The key is the BioVLM framework, which learns a diverse prompt bank and introduces dynamic prompt selection: for each input, the most discriminative prompts are chosen via a low-entropy criterion on the predictive distribution, effectively coupling sparse few-shot evidence with the semantic priors of a large language model (LLM). High-confidence LLM-derived attributes are distilled and reinforced through strong/weak augmentation consistency for robust knowledge transfer, so at test time the model adaptively selects modality-appropriate prompts, transferring efficiently to unseen categories and domains while keeping training lightweight and inference efficient.
Link: https://arxiv.org/abs/2604.17629
Authors: Mainak Singha,Tanisha Gupta,Ankit Jha,Muhammad Haris Khan,Sayantani Ghosh,Biplab Banerjee
Affiliations: University of Trento, Italy; Carnegie Mellon University, USA; LNMIIT Jaipur, India; MBZUAI, UAE; Sunandan Divatia School of Science, Mumbai, India; IIT Bombay, India
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ACL Findings 2026
Abstract:Pretrained biomedical vision-language models (VLMs) such as BioMedCLIP perform well on average but often degrade on challenging modalities where inter-class margins are small and acquisition-specific variations are pronounced, especially under few-shot supervision and when modality priors differ from pretraining corpora substantially. We propose BioVLM, a prompt-learning framework that improves cross-domain generalization without extensive backbone fine-tuning. BioVLM learns a diverse prompt bank and introduces dynamic prompt selection: for each input, it selects the most discriminative prompts via a low-entropy criterion on the predictive distribution, effectively coupling sparse few-shot evidence with rich LLM semantic priors. To strengthen this coupling, we distill high-confidence LLM-derived attributes and enforce robust knowledge transfer through strong/weak augmentation consistency. At test time, BioVLM adapts by choosing modality-appropriate prompts, enabling transfer to unseen categories and domains, while keeping training lightweight and inference efficient. On 11 MedMNIST+ 2D datasets, BioVLM achieves new state of the art across three distinct generalization settings. Codes are available at this https URL.
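The low-entropy selection criterion can be sketched directly: score each prompt by the Shannon entropy of its softmax class distribution and keep the most confident one (function and variable names are illustrative assumptions):

```python
import numpy as np

def select_prompt(logits_per_prompt):
    """Return the index of the prompt whose predictive distribution
    has the lowest entropy, i.e. the most peaked prediction."""
    def entropy(logits):
        p = np.exp(logits - np.max(logits))
        p /= p.sum()
        return float(-np.sum(p * np.log(p + 1e-12)))
    return int(np.argmin([entropy(l) for l in logits_per_prompt]))

# A peaked prompt beats a uniform one.
print(select_prompt([np.array([10., 0., 0.]), np.zeros(3)]))  # 0
```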
[CV-106] FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation
【Quick Read】: This paper addresses low computational efficiency, high memory usage, and insufficient generation quality in video continuation. The core solution, FlowC2S, fine-tunes a pretrained text-to-video flow model to learn the vector field between the current and succeeding video chunks. Two design choices are key: (1) inherent optimal couplings, which use temporally adjacent video chunks during training as a practical proxy for true optimal couplings, yielding straighter flow paths; and (2) target inversion, which injects the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames instead of the usual noise-based generation, the model's input dimensionality is halved, and state-of-the-art FID and FVD scores are achieved with as few as five neural function evaluations.
Link: https://arxiv.org/abs/2604.17625
Authors: Hovhannes Margaryan,Quentin Bammey,Christian Sandor
Affiliations: Team ARAI, Université Paris-Saclay, CNRS, LISN, France; LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.
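The inherent-optimal-coupling idea amounts to flow matching over paired endpoints: interpolate linearly between the current and succeeding chunk and regress the vector field onto their constant displacement. A minimal sketch with hypothetical names (the paper operates on video latents, not raw arrays):

```python
import numpy as np

def coupled_flow_target(x_cur, x_next, t):
    """Straight-line path x_t between a coupled pair; the regression
    target for the vector field is the displacement x_next - x_cur."""
    x_t = (1.0 - t) * x_cur + t * x_next
    v_target = x_next - x_cur
    return x_t, v_target

x0, x1 = np.zeros(4), np.ones(4)
xt, v = coupled_flow_target(x0, x1, 0.25)
print(xt[0], v[0])  # 0.25 1.0
```

Because the coupling fixes both endpoints, the target velocity is constant along the path, which is what makes the learned flows straighter than noise-to-data flows.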
[CV-107] ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes
【Quick Read】: This paper addresses the lack of a plausible-manifold representation of joint configurations in kinematic rigs for 3D character animation: stochastic sampling or manual manipulation of raw rig parameters easily produces semantic violations (e.g., anatomical hyperextension) or geometric ones (e.g., non-physical self-intersections). The key of the proposed Video-informed Pose Spaces (ViPS) is to distill motion priors from a pretrained video diffusion model, learning a universal, asset-specific distribution of valid articulations from video data, while differentiable geometric validators applied to the skinned mesh enforce physical plausibility without manual regularizers. ViPS yields a smooth, compact, and controllable pose space supporting diverse sampling, manifold projection for inverse kinematics, and temporally coherent trajectories for keyframe animation, and it closes the loop between generative 2D video priors and structured 3D kinematic control.
Link: https://arxiv.org/abs/2604.17623
Authors: Honglin Chen,Karran Pandey,Rundi Wu,Matheus Gadelha,Yannick Hold-Geoffroy,Ayush Tewari,Niloy J. Mitra,Changxi Zheng,Paul Guerrero
Affiliations: Columbia University (哥伦比亚大学); Adobe Research (Adobe研究院); University of Toronto (多伦多大学); Google DeepMind (谷歌深度智核); University of Cambridge (剑桥大学); University College London (伦敦大学学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL
Abstract:Kinematic rigs provide a structured interface for articulating 3D meshes, but they lack an inherent representation of the plausible manifold of joint configurations for a given asset. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters often leads to semantic or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feed-forward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce artist-authored 4D datasets, ViPS transfers generative video priors into a universal distribution over a given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce asset-specific validity without requiring manual regularizers. Our model learns a smooth, compact, and controllable pose space that supports diverse sampling, manifold projection for inverse kinematics, and temporally coherent trajectories for keyframing. Furthermore, the distilled 3D pose samples serve as precise semantic proxies for guiding video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely on video priors, matches the performance of state-of-the-art methods trained on synthetic artist-created 4D data in both plausibility and diversity. Most importantly, as a universal model, ViPS demonstrates robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.
[CV-108] DGSSM: Diffusion guided state-space models for multimodal salient object detection ICPR2026
【Quick Read】: This paper addresses the tension in multimodal salient object detection (MSOD) between modeling long-range contextual dependencies and recovering fine-grained structural detail, in particular the limited boundary accuracy of convolutional, Transformer-based, and Mamba-based state-space models. The core solution, the DGSSM framework, couples the structural priors of diffusion models with multi-scale state-space encoding: an adaptive saliency prompting mechanism and an iterative Mamba diffusion refinement module realize a progressive denoising process, while a boundary-aware refinement head and a self-distillation strategy improve spatial coherence and feature stability, markedly improving boundary accuracy while keeping the model compact.
Link: https://arxiv.org/abs/2604.17585
Authors: Suklav Ghosh,Arijit Sur,Pinaki Mitra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICPR 2026. Diffusion-guided Mamba framework for multimodal salient object detection. Evaluated on 13 benchmarks (RGB, RGB-D, RGB-T)
Abstract:Salient object detection (SOD) requires modeling both long-range contextual dependencies and fine-grained structural details, which remains challenging for convolutional, transformer-based, and Mamba-based state space models. While recent Mamba-based state space approaches enable efficient global reasoning, they often struggle to recover precise object boundaries. In contrast, diffusion models capture strong structural priors through iterative denoising, but their use in discriminative dense prediction is still limited due to computational cost and integration challenges. In this work, we propose DGSSM, a diffusion-guided state space (Mamba) framework that formulates multimodal salient object detection as a progressive denoising process. The framework integrates diffusion structural priors with multi-scale state space encoding, adaptive saliency prompting, and an iterative Mamba diffusion refinement mechanism to improve boundary accuracy. A boundary-aware refinement head and self-distillation strategy further enhance spatial coherence and feature consistency. Extensive experiments on 13 public benchmarks across RGB, RGB-D, and RGB-T settings demonstrate that DGSSM consistently outperforms state-of-the-art methods across multiple evaluation metrics while maintaining a compact model size. These results suggest that diffusion-guided state space modeling is an effective and generalizable paradigm for multimodal dense prediction tasks.
[CV-109] PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation CVPR
【Quick Read】: This paper addresses the poor generalization of current multimodal large language models (MLLMs) to peripheral blood smear (PBS) pathology analysis. Existing MLLMs are trained mainly on solid-tissue whole-slide images (WSI) and are ill-suited to the cell-morphology-centric reasoning PBS requires, where diagnosis proceeds from individual cells rather than tissue architecture. The key is PBSInstr, the first vision-language dataset for PBS interpretation, comprising 353 PBS WSIs with microscopic impression paragraphs and 29k cell-level image crops annotated with cell types and morphological descriptions, plus 27k cell-image question-answer pairs and 1,286 slide-level pairs for instruction tuning. On top of it, the hematopathology-specific vision-language model PBS-VL achieves multi-level understanding at both cell and slide levels, and the newly built PBSBench benchmark verifies its clear advantage over general-purpose and existing pathology MLLMs.
Link: https://arxiv.org/abs/2604.17570
Authors: Yuanlong Wang,Weichi Chen,Adrian Rajab,Wenfang Liu,Yulan Jin,Andrew Srisuwananukorn,Ping Zhang
Affiliations: The Ohio State University (俄亥俄州立大学); The Ohio State University Wexner Medical Center (俄亥俄州立大学韦克斯纳医学中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 12 figures, Accepted by CVPR Findings 2026
Abstract:Peripheral Blood Smear (PBS) is a critical microscopic examination in hematopathology that yields whole-slide imaging (WSI). Unlike solid tissue pathology, PBS interpretation focuses on individual cell morphologies rather than tissue architecture, making it distinct in both visual characteristics and diagnostic reasoning. However, current multimodal large language models (MLLMs) for pathology are primarily developed on solid-tissue WSIs and struggle to generalize to PBS. To bridge this gap, we construct PBSInstr, the first vision-language dataset for PBS interpretation, comprising 353 PBS WSIs paired with microscopic impression paragraphs and 29k cell-level image crops annotated with cell type labels and morphological descriptions. To facilitate instruction tuning, PBSInstr further includes 27k question-answer (QA) pairs for cell crops and 1,286 QA pairs for PBS slides. Building upon PBSInstr, we develop PBS-VL, a hematopathology-tailored vision-language model for multi-level PBS interpretation at both cell and slide levels. To comprehensively evaluate PBS understanding, we construct PBSBench, a visual question answering (VQA) benchmark featuring four question categories and six PBS interpretation tasks. Experiments show that PBS-VL outperforms existing general-purpose and pathology MLLMs, underscoring the value of PBS-specific data. We release our code, datasets, and model weights to facilitate future research. Our proposed framework lays the foundation for developing practical AI assistants supporting decision-making in hematopathology.
[CV-110] Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses
【Quick Read】: This paper addresses the high cost and labor of extrinsic calibration for multi-camera 3D motion capture in sports. Traditional methods rely on dedicated calibration tools; this work proposes an efficient, tool-free self-calibration method tailored to sports with stick-like implements (e.g., golf clubs, rackets, hockey sticks). The key is to exploit two complementary cues in synchronized multi-view video: human body keypoints (of unknown metric scale) and a rigid stick of known length. A three-stage optimization pipeline jointly refines camera extrinsics, reconstructs human and stick trajectories, and recovers global scale via the stick-length constraint, achieving highly accurate extrinsic calibration.
Link: https://arxiv.org/abs/2604.17567
Authors: Fan Yang,Changsoo Jung,Ryosuke Kawamura,Hon Yung Wong
Affiliations: Fujitsu Research (富士通研究所); Colorado State University (科罗拉多州立大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Multi-camera systems are widely employed in sports to capture the 3D motion of athletes and equipment, yet calibrating their extrinsic parameters remains costly and labor-intensive. We introduce an efficient, tool-free method for multi-camera extrinsic calibration tailored to sports involving stick-like implements (e.g., golf clubs, bats, hockey sticks). Our approach jointly exploits two complementary cues from synchronized multi-camera videos: (i) human body keypoints with unknown metric scale and (ii) a rigid stick-like implement of known length. We formulate a three-stage optimization pipeline that refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale via the stick-length constraint. Our method achieves accurate extrinsic calibration without dedicated calibration tools. To benchmark this task, we present the first dataset for multi-camera self-calibration in stick-based sports, consisting of synthetic sequences across four sports categories with 3 to 10 cameras. Comprehensive experiments demonstrate that our method delivers SOTA performance, achieving low rotation and translation errors. Our project page: this https URL.
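The scale-recovery step is simple once the stick is reconstructed up to scale: the known stick length fixes the metric. A hedged sketch of that final stage in isolation (assumed names; the paper embeds this constraint in a joint optimization rather than applying it post hoc):

```python
import numpy as np

def recover_scale(tips, butts, stick_length):
    """Global metric scale for an up-to-scale reconstruction: ratio of
    the known stick length to the mean reconstructed stick length.
    tips, butts: (frames, 3) endpoints of the stick per frame."""
    lengths = np.linalg.norm(tips - butts, axis=1)
    return float(stick_length / lengths.mean())

# A reconstruction at half scale of a 1.0 m stick needs a factor of 2.0.
tips = np.array([[0.5, 0., 0.], [0., 0.5, 0.]])
butts = np.zeros((2, 3))
print(recover_scale(tips, butts, 1.0))  # 2.0
```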
[CV-111] UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
【速读】:该论文旨在解决相机可控图像编辑(camera-controllable image editing)中因几何引导碎片化和图像扩散模型基于离散视角映射所导致的几何漂移(geometric drift)与结构退化问题。现有方法通常仅在表示层注入点云等几何信息,且依赖于离散视图映射的扩散模型,难以维持连续相机运动下的跨视角几何一致性。解决方案的关键在于提出UniGeo框架,通过在三个层次统一注入几何引导:表示层采用帧解耦几何参考注入机制以提供鲁棒的跨视角几何上下文;架构层引入几何锚点注意力机制对齐多视角特征;损失函数层设计轨迹终点几何监督策略,显式强化目标视角的结构保真度。这一系统性方法显著提升了视觉质量和几何一致性表现。
链接: https://arxiv.org/abs/2604.17565
作者: Hong Jiang,Wensong Song,Zongxing Yang,Ruijie Quan,Yi Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency. 
[CV-112] Dual Strategies for Test-Time Adaptation
【速读】:该论文旨在解决传统测试时适应(Test-Time Adaptation, TTA)方法在分布偏移(distribution shift)场景下性能受限的问题,即现有方法通常仅利用少量低熵预测的测试样本进行模型更新,未能充分挖掘测试数据中蕴含的信息。其解决方案的关键在于提出DualTTA框架,通过引入一种新的可靠性判据——基于语义保持与语义改变变换下的预测稳定性度量,自适应地将测试样本划分为两类:一类是预测可靠、语义一致的样本,另一类是预测不可靠、可能包含错误或伪相关性的样本;前者通过最小化预测熵强化可靠决策,后者则通过最大化熵抑制过自信错误并消除虚假学习行为,从而实现更有效的模型更新策略。
链接: https://arxiv.org/abs/2604.17542
作者: Nam Nguyen Phuong,Duc Nguyen The Minh,Phi Le Nguyen,Ehsan Abbasnejad,Minh Hoai
机构: Hanoi University of Science and Technology (河内科技大学); Monash University (莫纳什大学); Adelaide University (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Findings of Computer Vision and Pattern Recognition 2026
Abstract:Conventional test-time adaptation (TTA) approaches typically adapt the model using only a small fraction of test samples, often those with low-entropy predictions, thereby failing to fully leverage the available information in the test distribution. This paper introduces DualTTA, a novel framework that improves performance under distribution shifts by utilizing a larger and more diverse set of test samples. DualTTA identifies two distinct groups: one where the model’s predictions are likely consistent with the underlying semantics, and another where predictions are likely incorrect. For the first group, it minimizes prediction entropy to reinforce reliable decisions; for the second, it maximizes entropy to suppress overconfident errors and unlearn spurious behavior. These groups are adaptively selected using a new reliability criterion that measures prediction stability under both semantic-preserving and semantic-altering transformations, addressing the limitations of purely entropy-based selection. We further provide theoretical analysis and empirical justification showing that our approach enables a tighter separation between reliable and unreliable samples, in the context of their suitability for adaptation, leading to provably more effective model updates.
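DualTTA 对两组样本分别最小化/最大化预测熵的双重目标,可用如下草图示意(假设性示例,非作者实现;论文中的可靠性判据由变换下的预测稳定性给出,此处以布尔标记代替):

```python
import math

def entropy(probs):
    # 预测类别分布的香农熵
    return -sum(p * math.log(p) for p in probs if p > 0)

def dual_tta_loss(probs, is_reliable):
    # 可靠样本:最小化熵以强化可靠决策(正号);
    # 不可靠样本:最大化熵以抑制过自信错误(负号)。
    h = entropy(probs)
    return h if is_reliable else -h

confident = [0.9, 0.05, 0.05]      # 低熵、置信的预测
uniform = [1 / 3, 1 / 3, 1 / 3]    # 最大熵的均匀预测
```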
[CV-113] RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)后训练过程中引发的“感知惯性”(perceptual inertia)问题,即模型在处理复杂遥感图像(Remote Sensing Imagery, RSI)时倾向于依赖局部显著特征进行快速推理,从而导致视觉证据挖掘不充分、跨任务视觉注意力难以灵活切换的问题。解决方案的关键在于提出RS-HyRe-R1——一种混合奖励框架,通过三个核心机制协同作用:(1) 空间推理激活奖励,强制结构化视觉推理过程;(2) 感知正确性奖励,提供适应不同遥感任务的几何与语义对齐质量锚点;(3) 视觉-语义路径演化奖励,惩罚重复推理并激励探索互补线索以构建更丰富的证据链,从而有效缓解感知惯性,提升模型的认知完整性与泛化能力。
链接: https://arxiv.org/abs/2604.17504
作者: Gaozhi Zhou,Hu He,Peng Shen,Jipeng Zhang,Liujue Zhang,Linrui Xu,Zeyuan Wang,Ziyu Li,Xuezhi Cui,Wang Guo,Haifeng Li
机构: Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias “perceptual inertia”. Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates “perceptual inertia”, encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at this https URL.
[CV-114] Edit Fidelity Field: Semantics-Aware Region Isolation for Training-Free Scene Text Editing
【速读】:该论文旨在解决场景文本编辑(Scene Text Editing, STE)中普遍存在但被忽视的“编辑溢出”(edit spillover)问题,即在对目标文本区域进行修改时,现有基于扩散模型的方法会无意中改变非目标区域,尤其是邻近文本区域。实验表明,当前最先进方法的溢出率高达94%,严重影响了编辑的精确性与实用性。解决方案的关键在于提出一种语义感知的连续场——编辑保真度场(Edit Fidelity Field, EFF),它不依赖于二值掩码,而是基于OCR检测到的文本区域构建四区结构:编辑核心区(完全可编辑)、过渡区(平滑衰减)、保护区(非目标文本,显式锁定)和背景区(严格保留)。EFF作为无需训练、与模型无关的后处理模块,能有效抑制编辑溢出,实验显示其将溢出率从94%降至25%,同时非目标区域保真度提升91.4 dB PSNR。
链接: https://arxiv.org/abs/2604.17500
作者: Guandong Li,Mengxia Ye
机构: iFLYTEK(科大讯飞); Aegon THTF
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Scene text editing (STE) has achieved remarkable progress in accurately rendering target text through diffusion-based methods. However, we identify a critical yet overlooked problem: edit spillover – when editing a target text region, existing methods inadvertently modify non-target regions, particularly neighboring text. Through systematic evaluation on 50 real-world scenes across four categories, we reveal that state-of-the-art diffusion editing models exhibit a spillover rate of 94%, meaning nearly all non-target text regions are altered during editing. To address this, we propose the Edit Fidelity Field (EFF), a semantics-aware continuous field that controls per-pixel editing fidelity. Unlike binary masks, EFF leverages OCR-detected text regions to construct a four-zone field: Edit Core (fully editable), Transition Zone (smooth decay), Protected Zone (non-target text, explicitly locked), and Background (strictly preserved). EFF operates as a training-free, model-agnostic post-processing module applicable to any diffusion-based STE method. We further propose per-region spillover quantification, a novel evaluation protocol that measures edit leakage at each non-target text region individually. Experiments demonstrate that EFF reduces spillover rate from 94% to 25% while improving non-target region preservation by +91.4 dB PSNR.
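EFF 的四区保真度场本质上是一个逐像素的混合权重,下面给出一个以"到编辑区距离"参数化的简化草图(区名沿用摘要,距离参数化与数值为本文假设,非论文实现):

```python
def fidelity_weight(dist_to_edit, in_protected, transition_width=8.0):
    # 1.0 = 完全可编辑,0.0 = 严格保留
    if in_protected:                      # 保护区:非目标文本,显式锁定
        return 0.0
    if dist_to_edit <= 0:                 # 编辑核心区:完全可编辑
        return 1.0
    if dist_to_edit < transition_width:   # 过渡区:线性平滑衰减
        return 1.0 - dist_to_edit / transition_width
    return 0.0                            # 背景区:严格保留

def blend(edited, original, w):
    # 按场值混合编辑结果与原图像素
    return w * edited + (1.0 - w) * original
```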
[CV-115] Coevolving Representations in Joint Image-Feature Diffusion
【速读】:该论文旨在解决现有联合图像-特征生成建模方法中,语义表示空间固定不变导致的生成性能受限问题。现有方法依赖于独立构建且训练过程中保持不变的表示空间,无法根据生成任务动态调整,从而限制了语义特征与图像潜在表示之间的互补性。解决方案的关键在于提出共进化表示扩散(Coevolving Representation Diffusion, CoReDi)框架,其中语义表示空间通过学习一个轻量级线性投影与扩散模型共同演化,该投影在训练过程中动态优化以适应生成目标。为避免退化解,CoReDi引入梯度截断(stop-gradient targets)、归一化和针对性正则化策略,确保语义空间稳定地向图像合成任务专业化演进,从而提升生成质量与收敛速度。
链接: https://arxiv.org/abs/2604.17492
作者: Theodoros Kouzelis,Spyros Gidaris,Nikos Komodakis
机构: Archimedes, Athena RC (阿基米德,雅典研究中心); University of Crete (克里特大学); valeo.ai (valeo.ai); National Technical University of Athens (雅典国立技术大学); IACM-Forth (希腊国家研究中心-弗洛斯研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Joint image-feature generative modeling has recently emerged as an effective strategy for improving diffusion training by coupling low-level VAE latents with high-level semantic features extracted from pre-trained visual encoders. However, existing approaches rely on a fixed representation space, constructed independently of the generative objective and kept unchanged during training. We argue that the representation space guiding diffusion should itself adapt to the generative task. To this end, we propose Coevolving Representation Diffusion (CoReDi), a framework in which the semantic representation space evolves during training by learning a lightweight linear projection jointly with the diffusion model. While naively optimizing this projection leads to degenerate solutions, we show that stable coevolution can be achieved through a combination of stop-gradient targets, normalization, and targeted regularization that prevents feature collapse. This formulation enables the semantic space to progressively specialize to the needs of image synthesis, improving its complementarity with image latents. We apply CoReDi to both VAE latent diffusion and pixel-space diffusion, demonstrating that adaptive semantic representations improve generative modeling across both settings. Experiments show that CoReDi achieves faster convergence and higher sample quality compared to joint diffusion models operating in fixed representation spaces.
[CV-116] AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation ICASSP2026
【速读】:该论文旨在解决高质视觉问答带定位(VQA-G)数据集手动标注成本高、难以扩展的问题,以及现有自动化方法因模型幻觉导致的数据保真度不一致和基于简单启发式规则的验证机制脆弱两大挑战。其解决方案的关键在于提出一个自增强型智能体框架AutoVQA-G,该框架通过迭代优化循环实现:首先利用链式思维(Chain-of-Thought, CoT)推理进行细粒度视觉验证以评估一致性;随后,基于失败样本的批评信息,引入记忆增强的提示优化代理(Prompt Optimization agent)逐步改进生成提示,从而提升最终生成数据的视觉定位准确性。
链接: https://arxiv.org/abs/2604.17488
作者: Rongsheng Hu,Runwei Guan,Yicheng Di,Jiayu Bao,Yuan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE ICASSP 2026. 5 pages, 5 figures. Code available at this https URL
Abstract:Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is crucial for advancing vision-language models (VLMs), but remains unscalable. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations; (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop where a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: this https URL
[CV-117] Unveiling Deepfakes: A Frequency-Aware Triple Branch Network for Deepfake Detection
【速读】:该论文旨在解决当前深度伪造(deepfake)检测方法中存在的两个关键问题:一是现有方法多集中于单一或少数特定频域特征,导致模型对特定伪造痕迹过拟合,鲁棒性不足;二是不同特征往往关注同一伪造区域,造成冗余表示,限制了模型捕捉跨模态互补信息的能力,从而削弱其泛化性能。解决方案的关键在于提出一种三分支网络结构,通过联合学习原始图像与不同频域通道重构图像的时空特征,实现多尺度频域感知;同时基于互信息理论推导出特征解耦与融合损失函数,引导模型聚焦于任务相关特征,提升特征多样性与判别力。实验表明,该方法在六个大规模基准数据集上均达到最优性能。
链接: https://arxiv.org/abs/2604.17477
作者: Qihao Shen,Jiaxing Xuan,Zhenguang Liu,Sifan Wu,Yutong Xie,Zhaoyan Ming,Yingying Jiao,Kui Ren
机构: Zhejiang University (浙江大学); State Grid Blockchain Technology (Beijing) Co., Ltd. (国家电网区块链技术(北京)有限公司); Jilin University (吉林大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新区(滨江)区块链与数据安全研究院); Hangzhou City University (杭州城市学院); Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Advanced deepfake technologies are blurring the lines between real and fake, presenting both revolutionary opportunities and alarming threats. While it unlocks novel applications in fields like entertainment and education, its malicious use has sparked urgent ethical and societal concerns ranging from identity theft to the dissemination of misinformation. To tackle these challenges, feature analysis using frequency features has emerged as a promising direction for deepfake detection. However, one aspect that has been overlooked so far is that existing methods tend to concentrate on one or a few specific frequency domains, which risks overfitting to particular artifacts and significantly undermines their robustness when facing diverse forgery patterns. Another underexplored aspect we observe is that different features often attend to the same forged region, resulting in redundant feature representations and limiting the diversity of the extracted clues. This may undermine the ability of a model to capture complementary information across different facets, thereby compromising its generalization capability to diverse manipulations. In this paper, we seek to tackle these challenges from two aspects: (1) we propose a triple-branch network that jointly captures spatial and frequency features by learning from both the original image and images reconstructed by different frequency channels, and (2) we mathematically derive feature decoupling and fusion losses grounded in mutual information theory, which enhance the model's focus on task-relevant features across the original image and the images reconstructed by different frequency channels. Extensive experiments on six large-scale benchmark datasets demonstrate that our method consistently achieves state-of-the-art performance. Our code is released at this https URL Deepfake.
[CV-118] Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
【速读】:该论文旨在解决多用户虚拟现实(Multi-user Virtual Reality)中,由于在每个头显设备上渲染大量用户化身(Avatar)而导致的计算开销过大、难以扩展的问题。解决方案的关键在于提出Privatar框架,其核心思想是利用化身重建的领域特定知识,在最小化计算成本的同时实现可证明的隐私保护。具体而言,系统层面采用水平分割(Horizontal Partitioning, HP),通过BDCT频域分解将高能量分量保留在本地,仅将低能量分量卸载至局域网内的不可信设备,从而降低信息泄露风险;隐私层面则提出分布感知最小扰动(Distribution-Aware Minimal Perturbation, DAMP),基于用户表情分布的缓慢时变特性在线追踪并动态调整噪声强度,在保障局部差分隐私的前提下显著减少对重建质量的影响。组合使用HP与DAMP,Privatar在Meta Quest Pro上实现了2.37倍并发用户数提升,同时维持了对抗经验攻击和神经网络攻击的鲁棒性,并提供形式化的隐私保证。
链接: https://arxiv.org/abs/2604.17476
作者: Jianming Tong,Hanshen Xiao,Krishna Kumar Nair,Hao Kang,Ashish Sirasao,Ziqi Zhang,G. Edward Suh,Tushar Krishna
机构: 未知
类目: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: Proceedings of the 7th Machine Learning and System Conference (MLSys)
Abstract:Multi-user virtual reality enables immersive interaction. However, rendering avatars for numerous participants on each headset incurs prohibitive computational overhead, limiting scalability. We introduce a framework, Privatar, to offload avatar reconstruction from the headset to untrusted devices within the same local network while safeguarding against adversaries capable of intercepting offloaded data. Privatar's key insight is that domain-specific knowledge of avatar reconstruction enables provably private offloading at minimal cost. (1) System level. We observe avatar reconstruction is frequency-domain decomposable via BDCT with negligible quality drop, and propose Horizontal Partitioning (HP) to keep high-energy frequency components on-device and offload only low-energy components. HP offloads local computation while reducing information leakage to low-energy subsets only. (2) Privacy level. For individually offloaded, multi-dimensional signals without aggregation, worst-case local Differential Privacy requires prohibitive noise, ruining utility. We observe that users' expression distributions change slowly over time and are trackable online, and hence propose Distribution-Aware Minimal Perturbation (DAMP). DAMP minimizes noise based on each user's expression distribution to significantly reduce its effect on utility, retaining a formal privacy guarantee. Combined, HP provides empirical privacy against expression identification attacks. DAMP further augments it to offer a formal guarantee against arbitrary adversaries. On a Meta Quest Pro, Privatar supports 2.37x more concurrent users at 6.5% higher reconstruction loss and 9% energy overhead, providing a better throughput-loss Pareto frontier over quantization, sparsity and local construction baselines. Privatar provides a provable privacy guarantee and stays robust against both empirical and NN-based attacks.
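HP 按能量划分频率分量、只将低能量部分卸载的思想,可用如下简化的一维草图示意(以系数幅值排序代替真实系统中的 BDCT 块分解,变量名与取值均为本文假设):

```python
def horizontal_partition(coeffs, keep_ratio=0.25):
    # 按能量(幅值)对频率系数降序排序
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    k = max(1, int(len(coeffs) * keep_ratio))
    on_device = set(order[:k])
    # 高能量分量留在本地,其余卸载到不可信设备
    local = {i: coeffs[i] for i in sorted(on_device)}
    offloaded = {i: coeffs[i] for i in range(len(coeffs)) if i not in on_device}
    return local, offloaded

coeffs = [10.0, -0.5, 0.2, 8.0]  # 示意性的频率分量
local, remote = horizontal_partition(coeffs, keep_ratio=0.5)
```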
[CV-119] Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
【速读】:该论文旨在解决视觉语言导航(Vision-Language Navigation, VLN)任务中长期场景下因状态漂移(State Drift)导致的导航失败问题,具体表现为两个认知缺陷:进度漂移(Progress Drift)——代理无法区分已完成与未完成的子目标;以及记忆漂移(Memory Drift)——历史表征退化,使代理失去对已访问地标(landmark)的追踪能力。解决方案的关键在于提出一种双锚定框架(Dual-Anchoring Framework),通过两个核心机制实现:一是指令进度锚定(Instruction Progress Anchoring),利用结构化文本标记监督代理明确划分已完成与剩余子目标;二是记忆地标锚定(Memory Landmark Anchoring),基于以地标为中心的世界模型(Landmark-Centric World Model)回溯预测由Segment Anything Model提取的对象中心嵌入,强制代理验证过往观测并保持对已访问地标清晰的表征。
链接: https://arxiv.org/abs/2604.17473
作者: Kangyi Wu,Pengna Li,Kailin Lyu,Lin Zhao,Qingrong He,Jinjun Wang,Jianyi Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark data for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.
[CV-120] UniMesh: Unifying 3D Mesh Understanding and Generation
【速读】:该论文旨在解决3D视觉领域中生成(3D generation)与理解(3D understanding)任务长期割裂的问题,即现有模型通常孤立地处理形状分类、分割、重建等理解任务或合成、补全、编辑等生成任务,导致架构碎片化、表示不统一,阻碍知识迁移与场景的全局建模。其解决方案的关键在于提出UniMesh框架,通过三个核心创新实现统一:(1)引入Mesh Head作为跨模型接口,连接基于扩散的图像生成与隐式形状解码器;(2)设计Chain of Mesh(CoM),以几何化的迭代推理机制支持用户驱动的语义网格编辑,形成闭环的潜在空间提示与重生成循环;(3)集成基于Actor-Evaluator-Self Reflection三元组的自反思机制,用于诊断并修正如3D描述生成等高层任务中的错误。这一统一架构显著提升了任务间的互增强能力,并在标准基准上取得竞争力表现。
链接: https://arxiv.org/abs/2604.17472
作者: Peng Huang,Yifeng Chen,Zeyu Zhang,Hao Tang
机构: Boston University (波士顿大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in 3D vision have led to specialized models for either 3D understanding (e.g., shape classification, segmentation, reconstruction) or 3D generation (e.g., synthesis, completion, and editing). However, these tasks are often tackled in isolation, resulting in fragmented architectures and representations that hinder knowledge transfer and holistic scene modeling. To address these challenges, we propose UniMesh, a unified framework that jointly learns 3D generation and understanding within a single architecture. First, we introduce a novel Mesh Head that acts as a cross-model interface, bridging diffusion-based image generation with implicit shape decoders. Second, we develop Chain of Mesh (CoM), a geometric instantiation of iterative reasoning that enables user-driven semantic mesh editing through a closed-loop latent, prompting, and re-generation cycle. Third, we incorporate a self-reflection mechanism based on an Actor-Evaluator-Self-Reflection triad to diagnose and correct failures in high-level tasks like 3D captioning. Experimental results demonstrate that UniMesh not only achieves competitive performance on standard benchmarks but also unlocks novel capabilities in iterative editing and mutual enhancement between generation and understanding. Code: this https URL. Website: this https URL.
[CV-121] From Adaptation to Generalization: Adaptive Visual Prompting for Medical Image Segmentation CVPR2026
【速读】:该论文旨在解决现有视觉提示(Visual Prompting)方法在医学领域中因采用单一固定提示(prompt)而难以应对域内和域间变异性的局限性,从而影响模型泛化能力的问题。其解决方案的关键在于提出一种自适应提示提取框架 APEX(Adaptive Prompt EXtraction),通过构建可学习的提示记忆库(prompt memory)存储多样且具有域判别性的提示表示,并利用傅里叶频谱提取的域特征进行查询,实现输入特定提示的动态检索;同时引入低频特征对比学习(Low-Frequency Feature Contrastive, LFC)机制以增强域特征的鲁棒性和判别力,确保同一域内样本聚集、不同域间样本分离,从而显著提升跨域泛化性能。
链接: https://arxiv.org/abs/2604.17455
作者: Evren Çetinkaya,Sangmin Lee,Jung Uk Kim,Hong Joo Lee,Nassir Navab
机构: Technical University of Munich (慕尼黑工业大学); Korea University (高丽大学); Kyung Hee University (庆熙大学); Seoul National University of Science and Technology (首尔科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings
Abstract:Visual prompting has emerged as a powerful method for adapting pre-trained models to new domains without updating model parameters. However, existing prompting methods typically optimize a single prompt per domain and apply it uniformly to all inputs, limiting their ability to generalize under intra and inter-domain variability, which is especially critical in the medical field. To address this, we propose APEX, an Adaptive Prompt EXtraction framework that retrieves input-specific prompts from a learnable prompt memory. The memory stores diverse, domain-discriminative prompt representations and is queried via domain features extracted from the Fourier spectrum. To learn robust and discriminative domain features, we introduce a novel Low-Frequency Feature Contrastive (LFC) learning framework that clusters representations from the same domain while separating those from different domains. Extensive experiments on two medical segmentation tasks demonstrate that APEX significantly improves generalization across both seen and unseen domains. Furthermore, it complements any existing backbones and consistently enhances performance, confirming its effectiveness as a plug-and-play prompting solution in medical fields. The code is available at this https URL
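APEX 中"以域特征查询提示记忆库"的检索步骤,可近似为一次最近邻查找,示意如下(记忆库的键、值与特征维度均为假设性示例,非论文实现):

```python
def retrieve_prompt(query_feat, memory):
    # 在记忆库的域特征键中找与查询最近的一项(平方欧氏距离)
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    key = min(memory, key=lambda k: dist2(query_feat, k))
    return memory[key]

# 假设的提示记忆库:以二维域特征为键,提示表示为值
memory = {(0.0, 1.0): "prompt_A", (1.0, 0.0): "prompt_B"}
```

论文中的域特征由傅里叶频谱的低频分量提取,并经对比学习(LFC)训练得到;此处仅演示检索机制本身。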
[CV-122] HSG: Hyperbolic Scene Graph
【速读】:该论文旨在解决现有场景图(Scene Graph)表示方法在欧几里得空间中学习嵌入时,难以显式建模对象与地点之间的层次蕴含关系(hierarchical entailment relationships),从而限制了所学表示的结构一致性问题。解决方案的关键在于将场景图嵌入从欧几里得空间迁移至双曲空间(Hyperbolic Space),利用双曲几何中距离天然编码层次结构的特性,实现对场景中语义层级关系的更准确建模。实验表明,该方法在图级别指标上显著优于传统方法,如HSG在PP IoU和Graph IoU上分别达到33.17和33.51,较最优对比方法提升8.14,验证了双曲表示学习在场景图建模中的有效性。
链接: https://arxiv.org/abs/2604.17454
作者: Liyang Wang,Zeyu Zhang,Hao Tang
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Scene graph representations enable structured visual understanding by modeling objects and their relationships, and have been widely used for multiview and 3D scene reasoning. Existing methods such as MSG learn scene graph embeddings in Euclidean space using contrastive learning and attention based association. However, Euclidean geometry does not explicitly capture hierarchical entailment relationships between places and objects, limiting the structural consistency of learned representations. To address this, we propose Hyperbolic Scene Graph (HSG), which learns scene graph embeddings in hyperbolic space where hierarchical relationships are naturally encoded through geometric distance. Our results show that HSG improves hierarchical structure quality while maintaining strong retrieval performance. The largest gains are observed in graph level metrics: HSG achieves a PP IoU of 33.17 and the highest Graph IoU of 33.51, outperforming the best AoMSG variant (25.37) by 8.14, highlighting the effectiveness of hyperbolic representation learning for scene graph modeling. Code: this https URL.
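双曲空间中"距离天然编码层次结构"这一性质,依赖于庞加莱球模型的测地距离,其标准公式可示意如下(通用公式,非论文的完整训练目标):

```python
import math

def _sqnorm(x):
    return sum(c * c for c in x)

def poincare_distance(u, v):
    # 庞加莱球模型中的测地距离(要求 ||u||, ||v|| < 1);
    # 靠近球面边界时距离迅速增大,可用于编码层次深度
    diff = _sqnorm([a - b for a, b in zip(u, v)])
    denom = (1.0 - _sqnorm(u)) * (1.0 - _sqnorm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```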
[CV-123] SegTTA: Training-Free Test-Time Augmentation for Zero-Shot Medical Imaging Segmentation
【速读】:该论文旨在解决医学图像分割中因设备差异和操作者因素导致的图像质量不一致性问题,从而提升模型在不同临床场景下的泛化能力。其解决方案的关键在于提出SegTTA框架,该框架无需重新训练模型,通过融合四种图像增强策略(伽马校正、对比度增强、高斯模糊、高斯噪声)与多个MedSAM2检查点的加权投票机制,实现对医学图像分割性能的稳定提升。实验表明,大器官受益于强度增强,小病灶则依赖噪声增强,且投票阈值可调控精度与覆盖范围之间的权衡,从而适配不同临床任务需求。
链接: https://arxiv.org/abs/2604.17451
作者: Yihong Yao,Chunlei Li,Canxuan Gang,Wenzhi Hu,Zeyu Zhang,Hao Zhang,Xiaoyan Li
机构: AI Geeks; Qingdao Municipal Hospital; University of Chinese Academy of Sciences
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Increasingly advanced data augmentation techniques have greatly aided clinical medical research, increasing data diversity and improving model generalization capabilities. Although most current basic models exhibit strong generalization abilities, image quality varies due to differences in equipment and operators. To address these challenges, we present SegTTA, a framework that improves medical image segmentation without model retraining by combining four augmentations (Gamma correction, Contrast enhancement, Gaussian blur, Gaussian noise) with weighted voting across multiple MedSAM2 checkpoints. Experiments demonstrate consistent improvements across three diverse datasets: healthy uterus segmentation, uterine myoma detection, and multi-class hepatic structure segmentation. Ablation studies reveal that large organs benefit from intensity augmentations while small lesions require noise augmentations. The voting threshold controls the coverage-precision trade-off, enabling task-specific optimization for different clinical requirements. Ultimately, on a multi-class hepatic vessel dataset, compared to MedSAM2 baselines, our method achieves an increase of 1.6 in mIoU and 1.9 in aIoU, along with a reduction of approximately 2.0 in HD95. Code will be available at this https URL.
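SegTTA 的加权投票融合步骤可用如下草图示意(权重与阈值取值为假设;实际方法在多个 MedSAM2 检查点与四种增强的预测掩码上投票):

```python
def weighted_vote(masks, weights, threshold=0.5):
    # 对多个二值分割掩码逐像素做加权多数投票
    total = sum(weights)
    fused = []
    for pixel_votes in zip(*masks):
        score = sum(w * v for w, v in zip(weights, pixel_votes)) / total
        fused.append(1 if score >= threshold else 0)
    return fused

# 三次增强下对一幅 4 像素图像的二值预测
masks = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
weights = [1.0, 1.0, 1.0]
```

提高阈值会使融合结果更保守(精度升、覆盖降),与摘要中"投票阈值控制覆盖-精度权衡"的描述一致。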
[CV-124] HyKey: Hyperspectral Keypoint Detection and Matching in Minimally Invasive Surgery
【速读】:该论文旨在解决微创手术(Minimally Invasive Surgery, MIS)中基于RGB图像的三维重建问题,特别是传统RGB关键点检测与匹配方法在缺乏纹理和复杂光照条件下的性能瓶颈。其解决方案的关键在于引入快照式高光谱成像(Snapshot Hyperspectral Imaging, HSI),并提出HyKey模型——一个融合3D-2D卷积神经网络结构的高光谱关键点检测与描述模型,通过联合提取空间-光谱特征提升关键点匹配鲁棒性。该模型利用合成同源增强和极线几何约束,在机器人采集的双相机RGB-HSI腹腔镜数据集上训练,显著优于SuperPoint和ALIKE等RGB基准方法,在注册RGB帧上的平均匹配准确率达96.62%,姿态估计10°误差下的平均精度达67.18%。
链接: https://arxiv.org/abs/2604.17446
作者: Alexander Saikia,Chiara Di Vece,Zhehua Mao,Sierra Bonilla,Chloe He,Joao Ramalhinho,Tobias Czempiel,Sophia Bano,Danail Stoyanov
机构: University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures, IPCAI/IJCARS
Abstract:Purpose: 3D reconstruction in minimally invasive surgery (MIS) enables enhanced surgical guidance through improved visualisation, tool tracking, and augmented reality. However, traditional RGB-based keypoint detection and matching pipelines struggle with surgical challenges, such as poor texture and complex illumination. We investigate whether using snapshot hyperspectral imaging (HSI) can provide improved results on keypoint detection and matching surgical scenes. Methods: We developed HyKey, a HYperspectral KEYpoint detection and description model made up of a hybrid 3D-2D convolutional neural network that jointly extracts spatial-spectral features from HSI. The model was trained using synthetic homographic augmentation and epipolar geometry constraints on a robotically-acquired dual-camera RGB-HSI laparoscopic dataset of ex-vivo organs with calibrated camera poses. We benchmarked performance against established RGB-based methods, including SuperPoint and ALIKE. Results: Our HSI-based model outperformed RGB baselines on registered RGB frames, achieving 96.62% mean matching accuracy and 67.18% mean average accuracy at 10 degree on pose estimation, demonstrating consistent improvements across multiple evaluation metrics. Conclusion: Integrating spectral information from an HSI cube offers a promising approach for robust monocular 3D reconstruction in MIS, addressing limitations of texture-poor surgical environments through enhanced spectral-spatial feature discrimination. Our model and dataset are available at this https URL
[CV-125] Attention Is not Everything: Efficient Alternatives for Vision
【速读】:该论文旨在解决当前计算机视觉领域中Transformer模型主导地位下,非Transformer方法的研究现状与潜力尚未系统梳理的问题。其解决方案的关键在于构建一个全面的分类体系,将40篇代表性非Transformer方法归纳为基于卷积(convolution-based)、基于多层感知机(MLP-based)、状态空间(state-space-based)等类别,并从效率、可扩展性、可解释性和鲁棒性四个维度进行综合评估,从而揭示这些方法的优势、挑战与未来研究机遇。
链接: https://arxiv.org/abs/2604.17439
作者: Nur Mohammad Kazi,Ibteshum Khaled,Md. Luthful Hasan Galib,Ali Faruk Shihab,Md. Rakibul Islam
机构: Ahsanullah University of Science and Technology (阿山诺拉大学科学技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, manuscript under review
Abstract:Recently, computer vision has seen advances driven mainly by Transformer-based models. However, many non-Transformer methods remain competitive with Transformer-based models. This review presents a comprehensive taxonomy of such methods, organizing them into categories such as convolution-based models, MLP-based models, state-space-based models, and more. These methods are examined in terms of their efficiency, scalability, interpretability, and robustness. A total of 40 papers were chosen for this study. The goal is to give an overview of non-Transformer methods and identify the challenges and opportunities for future computer vision research.
[CV-126] DEM Refinement and Validation on the Lunar Surface Using Shape-from-Shading with Chandrayaan-2 OHRC Imagery
【Quick Read】: This paper addresses the lack of fine topographic detail in sub-metre lunar digital elevation models (DEMs), where traditional stereo matching is fundamentally limited by its baseline length. The key to the solution is a Shape-from-Shading (SfS) framework that takes independent imagery from the Chandrayaan-2 Orbiter High Resolution Camera (OHRC) as input, using SfS not merely as a refinement tool but as a source of new topographic data, thus escaping the baseline constraints of stereo matching. Through a three-stage sweep of the smoothness weight, validated over three representative lunar sites (the Cyrillus crater, the Vikram landing region, and Mons Mouton near the south pole), the method measurably improves surface slope statistics and reveals previously unresolved fine-scale crater morphology, while also identifying large pitch-angle separation and partial footprint coverage of the shading image as factors behind spatially variable enhancement quality.
Link: https://arxiv.org/abs/2604.17436
Authors: Aaranay Aadi,Jai Gopal Singla,Nitant Dube
Affiliations: Manipal University Jaipur; Space Applications Centre (SAC), Indian Space Research Organisation (ISRO)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 6 figures
Abstract:This study presents a Shape from Shading (SfS) framework to enhance sub-metre resolution lunar digital elevation models (DEMs) using imagery from the Orbiter High Resolution Camera (OHRC) aboard Chandrayaan-2. The framework applies SfS to an independent OHRC image of the same region, enabling SfS not just as a refinement tool, but as a source of new topographic data, unconstrained by stereo baseline limitations. The method is applied across three lunar sites, including the Cyrillus crater, the Vikram landing region, and the lunar south pole (Mons Mouton), with a systematic three-stage parameter sweep on the SfS smoothness weight. Results show measurable topographic enhancement, particularly in surface slope statistics, revealing fine-scale crater morphology previously unresolved. A limiting case is also characterized, where large pitch angle separation between the shading image and stereo pair reduces SfS sensitivity, and partial footprint coverage of the shading image is identified as a factor influencing spatially variable enhancement quality.
[CV-127] Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
【Quick Read】: This paper addresses the inability of existing short-video evaluation metrics, which focus on frame-level visual quality and local temporal smoothness, to capture long-video characteristics of generated videos such as narrative richness and global causal consistency. The key to the solution is a framework that disentangles long-video properties from short-video assessment, Long-CODE (Long-Context as an Orthogonal Dimension for video Evaluation), which introduces a new metric based on shot dynamics together with a dedicated benchmark of long-range attribute corruption tests, achieving markedly higher correlation with human judgment and establishing a holistic and unbiased evaluation paradigm for video generation models.
Link: https://arxiv.org/abs/2604.17428
Authors: Zhijiang Tang,Jiaxin Qi,Bing Zhao,Jianqiang Huang
Affiliations: Hangzhou Institute for Advanced Study, UCAS; Computer Network Information Center, CAS; Department of AI Infrastructure, Bilibili Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:As video generation models achieve unprecedented capabilities, the demand for robust video evaluation metrics becomes increasingly critical. Traditional metrics are intrinsically tailored for short-video evaluation, predominantly assessing frame-level visual quality and localized temporal smoothness. However, as state-of-the-art video generation models scale to generate longer videos, these metrics fail to capture essential long-range characteristics, such as narrative richness and global causal consistency. Recognizing that short-term visual perception and long-context attributes are fundamentally orthogonal dimensions, we argue that long-video metrics should be disentangled from short-video assessments. In this paper, we focus on the rigorous justification and design of a dedicated framework for long-video evaluation. We first introduce a suite of long-video attribute corruption tests, exposing the critical limitations of existing short-video metrics stemming from their insensitivity to structural inconsistencies, such as shot-level perturbations and narrative shuffling. To bridge this gap, we design a novel long-video metric based on shot dynamics, which is highly sensitive to the long-range testing framework. Furthermore, we introduce Long-CODE (Long-Context as an Orthogonal Dimension for video Evaluation), a specialized dataset designed to benchmark long-video evaluation, with human annotations isolated specifically to genuine long-range characteristics. Extensive experiments show that our proposed metrics achieve state-of-the-art correlation with human judgments. Ultimately, our metric and benchmark seamlessly complement existing short-video standards, establishing a holistic and unbiased evaluation paradigm for video generation models.
[CV-128] Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
【Quick Read】: This paper addresses the prohibitive computational cost that multimodal large language models (MLLMs) incur when processing dense frame sequences for long-video understanding. Existing approaches rely on either a single visual-centric metric (e.g., CLIP similarity) or a static fusion of heuristic scores, a one-size-fits-all design that injects textual noise into purely visual tasks and fails on narrative-driven queries that visual features alone cannot answer. The key to the solution is Q-Gate, a training-free, plug-and-play dynamic modality-routing framework that casts keyframe selection as retrieval over three lightweight expert streams: Visual Grounding for local details, Global Matching for scene semantics, and Contextual Alignment for subtitle-driven narrative cues. A Query-Modulated Gating Mechanism leverages the in-context reasoning of an LLM to assess query intent and dynamically weight the experts, intelligently activating relevant modalities while muting irrelevant ones, which substantially improves the signal-to-noise ratio and reasoning robustness.
Link: https://arxiv.org/abs/2604.17422
Authors: Shaoguang Wang,Weiyu Guo,Ziyang Chen,Xuming Hu,Hui Xiong
Affiliations: Thrust of Artificial Intelligence, HKUST (Guangzhou); Department of CSE, HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 9 pages, 7 figures, 9 tables. Preprint
Abstract:Long video understanding remains a formidable challenge for Multimodal Large Language Models (MLLMs) due to the prohibitive computational cost of processing dense frame sequences. Prevailing solutions, which select a keyframe subset, typically rely on either a single visual-centric metric (e.g., CLIP similarity) or a static fusion of heuristic scores. This "one-size-fits-all" paradigm frequently fails: visual-only metrics are ineffective for plot-driven narrative queries, while indiscriminately incorporating textual scores introduces severe "modal noise" for purely visual tasks. To break this bottleneck, we propose Q-Gate, a plug-and-play and training-free framework that treats keyframe selection as a dynamic modality routing problem. We decouple the retrieval process into three lightweight expert streams: Visual Grounding for local details, Global Matching for scene semantics, and Contextual Alignment for subtitle-driven narratives. Crucially, Q-Gate introduces a Query-Modulated Gating Mechanism that leverages the in-context reasoning of an LLM to assess the query's intent and dynamically allocate attention weights across the experts. This mechanism intelligently activates necessary modalities while "muting" irrelevant ones, thereby maximizing the signal-to-noise ratio. Extensive experiments on LongVideoBench and Video-MME across multiple MLLM backbones demonstrate that Q-Gate substantially outperforms state-of-the-art baselines. By effectively suppressing modality-specific noise, it provides a robust, highly interpretable solution for scalable video reasoning.
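The gating step described above, query-dependent weights fusing per-frame expert scores before a top-k pick, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the expert names, the use of a plain softmax over gate logits, and the numeric values below are all assumptions.

```python
import math

def gated_keyframe_selection(expert_scores, gate_logits, k):
    """Fuse per-frame scores from several expert streams with
    query-dependent gate weights, then pick the top-k frames.

    expert_scores: dict mapping expert name -> list of per-frame scores
    gate_logits:   dict mapping expert name -> raw gate logit for this query
                   (in Q-Gate these would come from an LLM judging query
                   intent; here they are plain numbers for illustration)
    """
    # Softmax over the gate logits: irrelevant experts get near-zero weight.
    names = sorted(expert_scores)
    exps = [math.exp(gate_logits[n]) for n in names]
    z = sum(exps)
    weights = {n: e / z for n, e in zip(names, exps)}

    n_frames = len(next(iter(expert_scores.values())))
    fused = [sum(weights[n] * expert_scores[n][i] for n in names)
             for i in range(n_frames)]

    # Indices of the k highest fused scores, in descending score order.
    return sorted(range(n_frames), key=lambda i: fused[i], reverse=True)[:k]
```

For a purely visual query, a strongly negative gate logit for the contextual expert effectively "mutes" subtitle-driven scores, matching the behaviour the abstract describes.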
[CV-129] Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
【Quick Read】: This paper addresses the fragmented design of reward-based fine-tuning methods for pretrained diffusion and flow models, which arise from disparate perspectives such as Soft RL and GFlowNets yet lack a unified theoretical framework. The key to the solution is Reward Score Matching (RSM), a unifying framework that casts alignment as score matching toward a reward-guided target and shows that the main differences among methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This view clarifies the bias-variance-compute tradeoffs of existing designs, separates core optimization components from auxiliary mechanisms of little benefit, and guides simpler redesigns that improve alignment effectiveness and compute efficiency for both differentiable and black-box rewards.
Link: https://arxiv.org/abs/2604.17415
Authors: Jeongjae Lee,Jinho Chang,Jeongsol Kim,Jong Chul Ye
Affiliations: KAIST, Korea
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 42 pages, 15 figures
Abstract:Reward-based fine-tuning aims to steer a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are motivated by different perspectives such as Soft RL, GFlowNets, etc., we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching toward a reward-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias–variance–compute tradeoffs of existing designs and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler redesigns that improve alignment effectiveness and compute efficiency across representative settings with differentiable and black-box rewards. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space.
[CV-130] Speculative Decoding for Autoregressive Video Generation
【Quick Read】: This paper studies how to accelerate inference for autoregressive video diffusion models, and in particular how to adapt speculative decoding, the dominant acceleration strategy for large language models, to block-based video generation, where blocks are continuous spatiotemporal tensors without token-level distributions for exact rejection sampling. The key to SDVG is replacing token verification with an image-quality router: a 1.3B draft model proposes candidate blocks via four denoising steps, and each block is scored by ImageReward with worst-frame aggregation, taking the minimum per-frame reward so that single-frame artifacts are not masked by averaging. Blocks scoring above a threshold tau are accepted into the 14B target model's KV cache; the rest are regenerated by the target. Force-rejecting the first block anchors scene composition, and tau acts as a single knob tracing the quality-speed Pareto frontier, so the training-free system achieves up to a 2.09x speedup on MovieGenVideoBench while retaining 95.7% of the target's quality.
Link: https://arxiv.org/abs/2604.17397
Authors: Yuezhou Hu,Jintao Zhang
Affiliations: University of California, Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation–taking the minimum per-frame reward to catch single-frame artifacts that averaging would mask. Blocks scoring above a fixed threshold tau are accepted into the 14B target’s KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and tau serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with tau=-0.7, and reaches 2.09x at 95.7% quality retention–while consistently outperforming draft-only generation by over +17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
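The acceptance rule in the abstract, score each draft block by its worst frame and accept it only above a threshold tau, can be sketched as below. This is a schematic reconstruction under stated assumptions: the function names, the reward values, and the accept/reject bookkeeping are illustrative, not taken from the paper's code.

```python
def worst_frame_score(frame_rewards):
    """Score a video block by its weakest frame, so one bad frame
    cannot be hidden by averaging (worst-frame aggregation)."""
    return min(frame_rewards)

def route_blocks(blocks_frame_rewards, tau, force_reject_first=True):
    """Decide per block whether the drafter's proposal is accepted
    (worst-frame score above tau) or must be regenerated by the
    target model. Returns a list of 'accept' / 'reject' decisions."""
    decisions = []
    for i, rewards in enumerate(blocks_frame_rewards):
        if force_reject_first and i == 0:
            # Always let the target generate the first block to
            # anchor scene composition, as the paper describes.
            decisions.append("reject")
        elif worst_frame_score(rewards) > tau:
            decisions.append("accept")
        else:
            decisions.append("reject")
    return decisions
```

Raising tau rejects more draft blocks (higher quality, less speedup); lowering it accepts more (more speedup), which is the single-knob Pareto trade-off the abstract mentions.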
[CV-131] MESA: A Training-Free Multi-Exemplar Deep Framework for Restoring Ancient Inscription Textures
【Quick Read】: This paper addresses the difficulty of reading and analyzing ancient inscriptions whose regions are missing or corrupted through fragmentation, erosion, or other damage. The core solution is MESA (Multi-Exemplar, Style-Aware), which uses well-preserved exemplar inscriptions from the same monument, material, or similar letterforms to guide reconstruction of the damaged text. Its key techniques are: encoding VGG19 convolutional features as Gram matrices to capture exemplar texture, style, and stroke structure; selecting, at each network layer, the exemplar with minimal Mean-Squared Displacement (MSD) to the damaged input; deriving layer-wise weights from OCR-estimated character widths so that filters match letter geometry; and applying a training mask that restricts synthesis to the damaged regions, yielding more accurate and style-consistent restoration.
Link: https://arxiv.org/abs/2604.17390
Authors: Vasileios Toulatzis,Ioannis Fudos
Affiliations: University of Ioannina
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:
Abstract:Ancient inscriptions frequently suffer missing or corrupted regions from fragmentation, erosion, or other damage, hindering reading, and analysis. We review prior image restoration methods and their applicability to inscription image recovery, then introduce MESA (Multi-Exemplar, Style-Aware) -an image-level restoration method that uses well-preserved exemplar inscriptions (from the same epigraphic monument, material, or similar letterforms) to guide reconstruction of damaged text. MESA encodes VGG19 convolutional features as Gram matrices to capture exemplar texture, style, and stroke structure; for each neural network layer it selects the exemplar minimizing Mean-Squared Displacement (MSD) to the damaged input. Layer-wise contribution weights are derived from Optical Character Recognition-estimated character widths in the exemplar set to bias filters toward scales matching letter geometry, and a training mask preserves intact regions so synthesis is restricted to damaged areas. We also summarize prior network architectures and exemplar and single-image synthesis, inpainting, and Generative Adversarial Network (GAN) approaches, highlighting limitations that MESA addresses. Comparative experiments demonstrate the advantages of MESA. Finally, we provide a practical roadmap for choosing restoration strategies given available exemplars and metadata.
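MESA's layer-wise exemplar choice, comparing Gram matrices of the damaged input against each exemplar and keeping the one with minimal mean-squared displacement, can be sketched with toy flattened feature maps. The normalisation and the exact MSD definition used here are assumptions for illustration; the paper computes these over real VGG19 activations.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as a list of C flattened
    channel responses (each of length H*W); entry (i, j) is the
    inner product of channels i and j, normalised by H*W."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]

def pick_exemplar(damaged_feats, exemplar_feats_list):
    """Return the index of the exemplar whose Gram matrix is closest
    (minimum mean-squared displacement) to the damaged input's."""
    g_in = gram_matrix(damaged_feats)

    def msd(g):
        return sum((a - b) ** 2
                   for ra, rb in zip(g_in, g)
                   for a, b in zip(ra, rb))

    grams = [gram_matrix(f) for f in exemplar_feats_list]
    return min(range(len(grams)), key=lambda i: msd(grams[i]))
```

In MESA this selection is repeated per network layer, so different exemplars can guide texture at different scales.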
[CV-132] Deep learning based Non-Rigid Volume-to-Surface Registration for Brain Shift compensation Using Point Cloud
【Quick Read】: This paper addresses the loss of image-guidance accuracy caused by intra-operative soft-tissue deformation in neurosurgery, where brain shift makes intra-operative anatomy deviate substantially from pre-operative imaging and thus compromises navigation accuracy and surgical safety. Existing compensation methods mostly rely on intra-operative MRI, CT, or ultrasound, which disrupt the workflow and are difficult to integrate repeatedly. The key to the proposed deep-learning-based non-rigid volume-to-surface registration framework is estimating dense displacement fields from only sparse intra-operative cortical surface point clouds, with no explicit point correspondences or volumetric intra-operative imaging: multi-scale point-based feature extraction and a hierarchical deformation decoder fuse the partial intra-operative surface into the full pre-operative point-cloud domain, enabling implicit correspondence learning and fine-scale deformation recovery under limited visibility, and supporting automatic, workflow-compatible brain-shift compensation.
Link: https://arxiv.org/abs/2604.17389
Authors: Eashrat Jahan Muniya,Gernot Kronreif,Ander Biguri,Wolfgang Birkfellner,Sepideh Hatamikia
Affiliations: ACMIT (Austrian Center for Medical Innovation and Technology); University of Cambridge; Medical University of Vienna; DP-University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Soft-tissue deformation remains a major limitation in image-guided neurosurgery, where intra-operative anatomy can deviate substantially from pre-operative imaging due to brain shift, compromising navigation accuracy and surgical safety. Existing compensation methods often rely on intra-operative MRI, CT, or ultrasound, which are disruptive and difficult to integrate repeatedly into the surgical workflow. In contrast, partial 3D cortical surfaces can be reconstructed as point clouds from stereoscopic microscopes or laser range scanners (LRS), capturing only a limited portion of the exposed cortex. This makes point cloud registration a practical alternative without interrupting surgery; however, such partial and noisy observations make deformation estimation highly challenging. In this study, we propose a deep learning-based framework for non-rigid volume-to-surface registration, enabling dense displacement field estimation from sparse intra-operative surface observations without explicit point correspondences or volumetric intra-operative imaging. The network leverages multi-scale point-based feature extraction and a hierarchical deformation decoder to capture both global and local deformations. The key contribution lies in integrating partial intra-operative surface information into the full pre-operative point cloud domain, enabling implicit correspondence learning and dense deformation recovery under limited visibility. Quantitative results demonstrate accurate recovery of fine-scale deformations, achieving an Endpoint Error (EPE) of 1.13 +/- 0.75 mm and RMSE of 1.33 +/- 0.81 mm under challenging partial-surface conditions. The proposed approach supports automatic, workflow-compatible brain-shift compensation from sparse surface observations.
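The Endpoint Error (EPE) reported above is, in its usual definition, the mean Euclidean distance between predicted and ground-truth displacement vectors; a minimal sketch assuming that standard definition, which the abstract does not spell out:

```python
import math

def endpoint_error(pred_disp, gt_disp):
    """Mean Euclidean distance (in the same units as the input, e.g. mm)
    between predicted and ground-truth 3D displacement vectors."""
    errs = [math.dist(p, g) for p, g in zip(pred_disp, gt_disp)]
    return sum(errs) / len(errs)
```

An EPE of 1.13 mm then means the predicted per-point displacement vectors are off by just over a millimetre on average.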
[CV-133] SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
【Quick Read】: This paper addresses the fragile performance of multimodal large language models (MLLMs) on spatial-intelligence tasks, in particular their difficulty in preserving geometric structure on tasks that require consistent spatial state recognition. Existing MLLMs rely mainly on textual chain-of-thought reasoning, and textual representations tend to abstract away critical low-level geometric details, leading to inconsistent spatial state updates and unstable reasoning traces. The key to the solution is SpatialImaginer, a unified multimodal generation framework with a divide-and-conquer strategy: textual chain-of-thought performs high-level semantic planning, while a visual-imagination mechanism handles geometry-sensitive state transformation and consistency preservation. A difficulty-aware data engine with closed-loop verification further trains the model to invoke visual imagination selectively when stable spatial state tracking is needed, substantially improving robustness and performance on complex multi-step spatial reasoning tasks.
Link: https://arxiv.org/abs/2604.17385
Authors: Yian Li,Yang Jiao,Bin Zhu,Tianwen Qian,Shaoxiang Chen,Jingjing Chen,Yu-Gang Jiang
Affiliations: Fudan University; Singapore Management University; East China Normal University; MiniMax
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models. Despite promising performance, recent multimodal large language models (MLLMs) often exhibit fragile reasoning traces in spatial intelligence tasks that involve consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual imagination. Our framework adopts a divide-and-conquer strategy, using text chain-of-thought for high-level semantic planning and the visual imagination for geometry-sensitive state transformation and consistency preservation. To support this capability, we further introduce a difficulty-aware data engine with closed-loop verification to train the model to invoke visual imagination selectively when stable spatial state tracking is required. Extensive experiments on diverse spatial intelligence benchmarks show that SpatialImaginer achieves state-of-the-art performance and substantially improves robustness on complex multi-step spatial reasoning tasks.
[CV-134] Towards Generalizable Deepfake Image Detection with Vision Transformers ICASSP2025
【Quick Read】: This paper addresses the challenges posed by the rapid evolution of generative models and the weak generalization of existing deepfake image detectors. The key to the solution is an ensemble of fine-tuned vision transformers, built on advanced architectures such as DINOv2, AIMv2 and OpenCLIP's ViT-L/14, whose fused features improve detection robustness and generalization. On the DF-Wild dataset the method achieves 96.77% AUC and a 9% Equal Error Rate (EER), clearly surpassing the then state-of-the-art Effort algorithm, and was presented at ICASSP 2025 as the winning solution of the IEEE SP Cup 2025.
Link: https://arxiv.org/abs/2604.17376
Authors: Kaliki V Srinanda,M Manvith Prabhu,Hemanth K Mogilipalem,Jayavarapu S Abhinai,Vaibhav Santhosh,Aryan Herur,Deepu Vijayasenan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 5 pages, 9 figures, SP Cup - ICASSP 2025
Abstract:Detecting deepfake images has become challenging because of the fast evolution of modern generative models and the poor generalization capability of existing methods. In this paper, we use an ensemble of fine-tuned vision transformers, namely DINOv2, AIMv2 and OpenCLIP's ViT-L/14, to create a generalizable method to detect deepfakes. We use the DF-Wild dataset released as part of the IEEE SP Cup 2025, because it uses a challenging and diverse set of manipulations and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% and 8% in AUC and EER respectively. This was the winning solution for the SP Cup, presented at ICASSP 2025.
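The Equal Error Rate quoted above is the operating point where the false-acceptance and false-rejection rates coincide. A minimal discrete approximation, assuming the convention that a higher score means "fake" (which the abstract does not state):

```python
def equal_error_rate(genuine_scores, fake_scores):
    """Sweep thresholds over all observed scores and return the point
    where false-acceptance and false-rejection rates are closest
    (a simple discrete approximation of the EER).
    Convention assumed here: higher score = more likely fake."""
    best = None
    for t in sorted(set(genuine_scores + fake_scores)):
        # Genuine images flagged as fake at this threshold.
        far = sum(s >= t for s in genuine_scores) / len(genuine_scores)
        # Fake images passed as genuine at this threshold.
        frr = sum(s < t for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]
```

A perfectly separable detector yields an EER of 0; a 9% EER as in the paper means both error rates are about 9% at the crossover threshold.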
[CV-135] When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
【Quick Read】: This paper addresses a systematic hallucination of Vision-Language Models (VLMs) on videos containing on-screen overlay text: when the overlay text contradicts the actual visual content, VLMs prioritize the textual semantics over the true visual information, a phenomenon the authors define as Text Overlay-Induced Hallucination (TOIH). The key to the solution is Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a vision-text disentanglement framework whose core mechanism is a dual-encoder architecture with four expert modules specialized for the temporal, action, object, and spatial dimensions; the experts are pre-trained to recognize cross-modal semantic discrepancies, and an adaptive token routing strategy enables dynamic expert allocation, providing robust resistance to TOIH while preserving performance on uncontaminated videos.
Link: https://arxiv.org/abs/2604.17375
Authors: Cui Yakun,Xingqun Qi,TianTian Geng,Yuyao Zhang,Sirui Han,Yike Guo
Affiliations: The Hong Kong University of Science and Technology; University of Birmingham
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in Vision-Language Models (VLMs) have substantially enhanced their ability across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH). In this work, we propose VisualTextTrap, the first comprehensive benchmark, including large-scale human-validated samples with specifically designed evaluation metrics. In particular, we construct VisualTextTrap from widely-used public datasets using a scalable hybrid pipeline of VLMs assisted text generation and rigorous manual verification. The benchmark features 6,057 samples annotated across 88 fine-grained attributes within four dimensions, with hallucination intensity quantified on a five-level scale (L1–L5) that reflects the semantic contradiction between overlay text and visual reality. Moreover, we propose Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a novel Vision-Text Disentanglement framework that employs a dual-encoder architecture. Concretely, four dimension-specialized expert modules spanning Temporal, Action, Object, and Spatial reasoning are first pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content. We develop an Adaptive Token Routing Strategy to enable dynamic expert allocation, conferring robust resistance to TOIH while preserving performance on uncontaminated videos. Extensive experiments conducted on our VisualTextTrap benchmark verify the effectiveness of VTHM-MoE, outperforming state-of-the-art counterparts with diverse video question answering tasks.
[CV-136] Robust Diabetic Retinopathy Grading Using Dual-Resolution Attention-Based Deep Learning with Ordinal Regression
【Quick Read】: This paper addresses the performance degradation of deep learning models for automated diabetic retinopathy (DR) grading caused by differing imaging conditions across datasets. The key to the solution is a robust dual-resolution deep learning framework: two parallel EfficientNet backbones extract complementary retinal features at different spatial resolutions, a learnable attention mechanism adaptively fuses the multi-resolution features, and an ordinal regression formulation based on the cumulative link model (CORAL) explicitly models the ordered nature of DR severity grades, improving generalization across data distributions. A preprocessing strategy combining circular cropping, contrast enhancement, and histogram matching further mitigates domain discrepancies and markedly improves cross-dataset grading accuracy.
Link: https://arxiv.org/abs/2604.17341
Authors: Afshan Hashmi
Affiliations: Tuwaiq Academy, Tuwaiq Research and Development Centre
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diabetic retinopathy (DR) is a leading cause of vision impairment worldwide, and automated grading systems play a crucial role in large-scale screening programs. However, deep learning models often exhibit degraded performance when deployed across datasets acquired under different imaging conditions. This study presents a robust dual-resolution deep learning framework for DR grading that integrates attention-based feature fusion with ordinal regression to improve cross-dataset generalization. The proposed method employs two parallel EfficientNet backbones operating at different spatial resolutions to capture complementary retinal features. A learnable attention mechanism adaptively fuses multi-resolution representations, while an ordinal regression formulation based on the cumulative link model (CORAL) explicitly accounts for the ordered nature of DR severity levels. To mitigate domain discrepancies between datasets, a preprocessing strategy combining circular cropping, contrast enhancement, and histogram matching is applied. The model was trained on the APTOS 2019 dataset and evaluated on both an internal validation split and an external Messidor-2 test set. Experimental results demonstrate strong grading performance, achieving a quadratic weighted kappa (QWK) of 0.88 on the APTOS validation set and 0.68 on the unseen Messidor-2 dataset, indicating improved robustness for cross-dataset DR grading applications.
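The CORAL formulation mentioned above predicts K-1 cumulative binary probabilities P(grade > k) from one shared logit plus ordered bias terms, and the predicted grade is the number of thresholds passed. A minimal sketch of that inference step; the bias values below are made up for illustration:

```python
import math

def coral_grade(logit, biases):
    """CORAL-style ordinal prediction: one shared logit plus K-1
    ordered bias terms give K-1 binary probabilities
    P(severity > level k); the predicted grade is the number of
    thresholds passed at probability > 0.5."""
    probs = [1 / (1 + math.exp(-(logit + b))) for b in biases]
    return sum(p > 0.5 for p in probs)
```

Because all K-1 classifiers share one logit, predictions are guaranteed rank-consistent, which is the main reason CORAL suits ordered severity scales like DR grades 0-4.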
[CV-137] R-FLoRA: Residual-Statistic-Gated Low-Rank Adaptation for Single-Image Face Morphing Attack Detection
【Quick Read】: This paper addresses the difficulty of detecting face morphing attacks from a single facial image, where no trusted reference is available and attack generation methods are diverse, requiring both sensitivity to local artefacts and semantic consistency. The key to the proposed S-MAD (Single-Image Face Morphing Attack Detection) framework is combining high-frequency Laplacian residual statistics, which capture local morphing artefacts, with semantic representations from a frozen foundation-scale vision transformer backbone; residual-statistic-gated low-rank adapters (R-FLoRA) and feature-wise residual fusion (Res-FiLM) enhance sensitivity to subtle artefacts without disturbing the backbone's semantics, while a novel residual-contrastive alignment loss regularizes the fused token space to improve discrimination under unseen morphing conditions. Validated on four ICAO-compliant datasets, the method consistently surpasses nine recent state-of-the-art S-MAD algorithms in accuracy and cross-domain generalization, and with a frozen backbone and very few trainable parameters it runs fast enough for practical deployment.
Link: https://arxiv.org/abs/2604.17321
Authors: Raghavendra Ramachandra
Affiliations: Norwegian University of Science and Technology (NTNU)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Pre-Print; Accepted in IEEE Transactions on Information Forensics and Security (TIFS), 2026
Abstract:Face morphing attacks pose a substantial risk to the reliability of face recognition systems used in passport issuance, border control, and digital identity verification. Detecting morphing attacks from a single facial image remains challenging owing to the lack of a trusted reference and the diversity of attack generation methods. This paper presents a new Single-Image Face Morphing Attack Detection (S-MAD) framework that integrates high-frequency Laplacian residual statistics with representations from a frozen, foundation-scale vision transformer. The approach employs residual-statistic-gated low-rank adapters (R-FLoRA) and feature-wise residual fusion (Res-FiLM) to enhance sensitivity to local morphing artefacts while preserving the semantic context of the backbone. A novel residual-contrastive alignment loss further regularises the fused token space, improving discrimination under unseen morphing conditions. Comprehensive experiments on four ICAO-compliant datasets, encompassing seven morph generation techniques, demonstrate that the proposed method consistently surpasses nine recent state-of-the-art S-MAD algorithms in detection accuracy and cross-domain (or dataset) generalisation. With a frozen backbone and minimal trainable parameters, the model achieves real-time efficiency and interpretability, making it suitable for real-life scenarios in biometric verification systems.
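A high-frequency Laplacian residual of the kind the statistics above are gated on can be obtained with a discrete 4-neighbour Laplacian. This sketch assumes that standard kernel; the paper does not specify its exact filter, so treat this as an approximation of the idea rather than the method itself.

```python
def laplacian_residual(img):
    """High-frequency residual via a discrete 4-neighbour Laplacian.
    img is a 2D list of grayscale values; border pixels are left at
    zero since the kernel does not fit there."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (img[y - 1][x] + img[y + 1][x]
                         + img[y][x - 1] + img[y][x + 1]
                         - 4 * img[y][x])
    return out
```

Smooth regions give residuals near zero, while blending seams and other morphing artefacts show up as strong local responses, which is what the residual statistics summarize.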
[CV-138] Towards Joint Quantization and Token Pruning of Vision-Language Models
【Quick Read】: This paper addresses the high cost of deploying Vision-Language Models (VLMs) under aggressive low-bit inference, where prefill is dominated by the long visual-token prefix and autoregressive decoding by the growing key-value (KV) cache. Token pruning and low-bit quantization are complementary, but naive stage-wise combinations are often brittle because quantization calibration and pruning execution are mismatched. The key to the solution is a collaborative quantization-and-pruning framework, QUOTA (Quantization Unified Offline Token Allocator), which converts low-bit calibration signals into a layer-wise token allocation schedule materialized as a pruning recipe. Token importance is evaluated under deployed W4A4 (4-bit weight, 4-bit activation) operators by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling stable budgeted top-k selection; experiments show notably better robustness than stage-wise baselines under the same low-bit regime, retaining 95.65% average performance with only 30% of visual tokens kept.
Link: https://arxiv.org/abs/2604.17320
Authors: Xinqing Li,Xin He,Xindong Zhang,Ming-Ming Cheng,Lei Zhang,Yun Liu
Affiliations: Nankai University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-k selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65% average retention while retaining only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise combinations. The code will be released.
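The budgeted top-k selection described above can be sketched as a per-layer importance scoring followed by a keep-ratio cut. The linear combination and its weights below are illustrative placeholders; the paper does not disclose its exact scoring formula here.

```python
def budgeted_token_selection(magnitude, attention, risk, keep_ratio,
                             alpha=1.0, beta=1.0, gamma=1.0):
    """Combine activation magnitude, attention cues, and a low-bit risk
    signal into one importance score per visual token, then keep the
    top keep_ratio fraction (budgeted top-k). The weights alpha, beta,
    gamma are illustrative, not QUOTA's actual combination."""
    n = len(magnitude)
    # Higher magnitude/attention raise importance; higher low-bit
    # risk (tokens that quantize poorly) lowers it.
    scores = [alpha * m + beta * a - gamma * r
              for m, a, r in zip(magnitude, attention, risk)]
    k = max(1, int(n * keep_ratio))
    kept = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(kept)  # preserve original token order
```

Running this per layer with layer-specific keep ratios yields the kind of layer-wise allocation schedule QUOTA materializes offline as a pruning recipe.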
[CV-139] When Background Matters: Breaking Medical Vision Language Models by Transferable Attack ACL
【Quick Read】: This paper addresses the largely unexplored vulnerability of Vision-Language Models (VLMs) to adversarial attacks in clinical diagnostics: existing attacks either pursue secondary goals such as model stealing, or transfer from natural images and leave visible perturbations that clinicians easily notice. The key to the solution is MedFocusLeak, a highly transferable black-box multimodal attack that injects coordinated perturbations into non-diagnostic regions and uses an attention distraction mechanism to steer the model's focus away from pathological areas, inducing plausible yet incorrect diagnoses while keeping the perturbations imperceptible. Experiments across six medical imaging modalities show state-of-the-art attack performance, revealing a significant weakness in the reasoning capabilities of modern clinical VLMs.
Link: https://arxiv.org/abs/2604.17318
Authors: Akash Ghosh,Subhadip Baidya,Sriparna Saha,Xiuying Chen
Affiliations: Indian Institute of Technology Patna; Indian Institute of Technology Kanpur; MBZUAI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL Main 2026
Abstract:Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, yet their robustness to adversarial attacks remains largely unexplored, posing serious risks. Existing medical attacks focus on secondary objectives such as model stealing or adversarial fine-tuning, while transferable attacks from natural images introduce visible distortions that clinicians can easily detect. To address this, we propose MedFocusLeak, a highly transferable black-box multimodal attack that induces incorrect yet clinically plausible diagnoses while keeping perturbations imperceptible. The method injects coordinated perturbations into non-diagnostic background regions and employs an attention distraction mechanism to shift the model’s focus away from pathological areas. Extensive evaluations across six medical imaging modalities show that MedFocusLeak achieves state-of-the-art performance, generating misleading yet realistic diagnostic outputs across diverse VLMs. We further introduce a unified evaluation framework with novel metrics that jointly capture attack success and image fidelity, revealing a critical weakness in the reasoning capabilities of modern clinical VLMs.
[CV-140] Generalizable Face Forgery Detection via Separable Prompt Learning
【Quick Read】: This paper addresses the over-reliance of existing CLIP-based face forgery detectors on the visual encoder while neglecting the information in the text modality. The key to the solution is a Separable Prompt Learning strategy (SePL) that disentangles forgery-relevant and forgery-irrelevant information in images via two types of prompt learning, with the forgery-specific prompts enhancing detection; a cross-modality alignment strategy and dedicated optimization objectives are further designed so that the text modality can effectively instruct deepfake detection. Experiments show strong generalization under both cross-dataset and cross-method evaluation.
Link: https://arxiv.org/abs/2604.17307
Authors: Enrui Yang,Yuezun Li
Affiliations: Ocean University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Detecting face forgeries using CLIP has recently emerged as a promising and increasingly popular research direction. Owing to its rich visual knowledge acquired through large-scale pretraining, most existing methods typically rely on the visual encoder of CLIP, while paying limited attention to the text modality. Given the instructive nature of the text modality, we posit that it can be leveraged to instruct Deepfake detection with meticulous design. Accordingly, we shift the focus from the visual modality to the text modality and propose a new Separable Prompt Learning strategy (SePL) that enables CLIP to serve as an effective face forgery detector. The core idea of SePL is to disentangle forgery-specific and forgery-irrelevant information in images via two types of prompt learning, with the former enhancing detection. To achieve this disentanglement, we describe a cross-modality alignment strategy and a set of dedicated objectives. Extensive experiments demonstrate that, with this simple adaptation, our method achieves competitive and even superior performance compared to other methods under both cross-dataset and cross-method evaluation, highlighting its strong generalizability. The codes have been released at this https URL
[CV-141] The First Challenge on Mobile Real-World Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview
【速读】:该论文针对移动设备上真实场景图像超分辨率(Real-World Image Super-Resolution, RISR)问题展开研究,旨在从由未知退化过程生成的低分辨率(Low-Resolution, LR)图像中恢复高分辨率(High-Resolution, HR)图像,且模型需满足移动端部署的效率要求。解决方案的关键在于设计高效且性能优异的网络架构,在保证图像质量的同时实现快速推理,其评估指标为图像质量评估(Image Quality Assessment, IQA)得分与加速比(Speedup Ratio)的加权组合。通过NTIRE 2026挑战赛,共吸引108名注册参与者,最终16支团队提交有效方案,推动了移动端RISR技术的发展并揭示了该领域的最新趋势。
链接: https://arxiv.org/abs/2604.17306
作者: Jiatong Li,Zheng Chen,Kai Liu,Jingkai Wang,Zihan Zhou,Xiaoyang Liu,Libo Zhu,Jue Gong,Radu Timofte,Yulun Zhang,Congyu Wang,Zihao Wang,Ke Wu,Xinzhe Zhu,Fengkai Zhang,Zhongbao Yang,Long Sun,Jiangxin Dong,Jinshan Pan,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Renyuan Situ,Yixin Yang,Zhaorun Zhou,Junyang Chen,Yuqi Li,Chuanguang Yang,Weilun Feng,Chuanyue Yan,Yuedong Tan,Yingli Tian,Zhenzhong Chen,Tongqi Guo,Ruhan Liu,Sangzi Shi,Huazhang Deng,Jie Yang,Wenzhuo Ma,Yuantong Zhang,Daiqin Yang,Tianrun Chen,Deyi Ji,Yuxiao Jiang,Qi Zhu,Lanyun Zhu,Yuwen Pan,Runze Tian,Mingyu Shi,Zhanfeng Feng,Yuanfei Bao,Jiaming Guo,Renjing Pei,Xin Di,Long Peng,Linfeng Jiang,Xueyang Fu,Yang Cao,Zhengjun Zha,Choulhyouc Lee,Shyang-En Weng,Yi-Cheng Liao,Jorge Tyrakowski,Yu-Syuan Xu,Wei-Chen Chiu,Ching-Chun Huang,Yoonjin Im,Jihye Park,Hyungju Chun,Hyunhee Park,MinKyu Park,Xiaoxuan Yu,Jianxing Zhang,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Watchara Ruangsang,Supavadee Aramvith,JiaHao Deng,Wei Zhou,Hongyu Huang,Shaohui Lin,Zihan Wang,Yilin Chen,Yunchen Li,Junbo Qiao,Wei Li,Jiao Xie,Gaoqi He,Wenxi Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NTIRE 2026 webpage: this https URL . Code: this https URL
Abstract:This paper provides a review of the NTIRE 2026 challenge on mobile real-world image super-resolution, highlighting the proposed solutions and the resulting outcomes. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through unknown degradations with a ×4 scaling factor while ensuring the models remain executable on mobile devices. The objective is to develop effective and efficient network designs or solutions that achieve state-of-the-art real-world image super-resolution performance. The track of the challenge evaluates performance using a weighted combination of image quality assessment (IQA) score and speedup ratios. The competition attracted 108 registrants, with 16 teams achieving a valid score in the final ranking. This collaborative effort advances the performance of mobile real-world image super-resolution while offering an in-depth overview of the latest trends in the field.
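该挑战赛的排名指标是 IQA 分数与加速比的加权组合。下面给出一个示意性的打分函数草图(权重与对数压缩等细节均为假设,并非官方评分实现):

```python
import math

def challenge_score(iqa, speedup, w_iqa=0.7, w_speed=0.3):
    """IQA 分数与加速比的加权组合(权重与对数压缩均为假设,非官方评分公式)。

    iqa:     图像质量评估分数,假设已归一化到 [0, 1]
    speedup: 相对参考模型的推理加速比,1.0 表示与参考同速
    """
    speed_term = math.log2(max(speedup, 1.0))   # 对数压缩,避免极端加速比主导总分
    return w_iqa * iqa + w_speed * speed_term

slow = challenge_score(iqa=0.80, speedup=1.0)
fast = challenge_score(iqa=0.80, speedup=4.0)   # 质量相同但更快,得分应更高
```

这种组合指标鼓励参赛方案在画质与移动端推理效率之间做显式权衡。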
[CV-142] Frequency-guided Multi-level Reasoning for Scene Graph Generation in Video ICASSP2026
【速读】:该论文旨在解决视频场景图生成(Video Scene Graph Generation)中长期尾分布关系建模不足的问题,即现有方法在识别低频关系时性能显著下降,导致整体推理鲁棒性受限。解决方案的关键在于提出频率引导的多级关系推理模型(FReMuRe),其核心创新包括:1)引入关系特定分支以缓解梯度冲突,实现更均衡且面向长尾类别的学习;2)设计频率感知的双分支谓词嵌入网络,分别建模高频与低频关系,并通过门控融合机制提升尾部类别召回率;3)提出两种可互换的关系分类头——贝叶斯头用于不确定性估计,高斯混合模型头则增强类内多样性,从而全面提升模型对长尾关系的识别能力与泛化性能。
链接: https://arxiv.org/abs/2604.17298
作者: Chenxing Li,Yiping Duan,Xiaoming Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures, 2 tables, ICASSP 2026
Abstract:Video Scene Graph Generation aims to obtain structured semantic representations of objects and their relationships in videos for high-level understanding. However, existing methods still have limitations in handling long-tail distributions. This paper proposes the Frequency-guided Relational Multi-level Reasoning (FReMuRe) model, which enhances the modeling ability of long-tail relationships from a mechanism perspective. We introduce relation-specific branches to deal with gradient conflicts, yielding more balanced and tail-aware learning. We also design a frequency-aware dual-branch predicate embedding network to model high-frequency and low-frequency relationships separately and improve the recall rate of tail classes through gated fusion. Meanwhile, we propose two types of interchangeable relation classification heads: a Bayesian Head for uncertainty estimation and a new Gaussian Mixture Model Head to enhance intra-class diversity. Experimental results show that FReMuRe significantly improves the recall rate of long-tail relationships and overall reasoning robustness on the Action Genome dataset.
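FReMuRe 通过门控融合合并高频与低频两条谓词分支的表示。下面用 NumPy 给出门控融合这一通用机制的最小草图(维度与参数初始化均为假设,非论文原实现):

```python
import numpy as np

def gated_fusion(h_high, h_low, W_g, b_g):
    """门控融合:g = sigmoid([h_hf; h_lf] W + b),输出 g*h_hf + (1-g)*h_lf,
    让网络按样本动态决定高频/低频分支的贡献比例。"""
    concat = np.concatenate([h_high, h_low], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(concat @ W_g + b_g)))   # 逐维门控系数,取值 (0, 1)
    return g * h_high + (1.0 - g) * h_low

rng = np.random.default_rng(0)
d = 8
h_hf, h_lf = rng.normal(size=d), rng.normal(size=d)    # 高频/低频分支特征(模拟)
W, b = rng.normal(size=(2 * d, d)) * 0.1, np.zeros(d)  # 可学习参数(随机示意)
fused = gated_fusion(h_hf, h_lf, W, b)
```

由于门控系数逐维落在 (0, 1) 内,融合结果在每一维上都是两分支特征的凸组合。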
[CV-143] Spectral Forensics of Diffusion Attention Graphs for Copy-Move Forgery Detection NEURIPS
【速读】:该论文旨在解决图像中复制-移动伪造(copy-move forgery)检测问题,即通过复制图像中的某一部分区域来隐藏或伪造内容,从而破坏视觉媒体的真实性。解决方案的关键在于提出一种无需训练的框架GraphSpecForge,其核心思想是利用预训练Stable Diffusion U-Net中的自注意力图(self-attention graph)的谱结构变化来识别伪造痕迹:复制操作会导致注意力图中近似子图的重复,进而引发归一化图拉普拉斯矩阵(normalized graph Laplacian)的谱分布发生可测量的再分配。作者基于扰动理论形式化了这一关联,并采用Wasserstein距离构建图像级异常检测器,将每张图像的拉普拉斯谱与真实参考分布进行比较,从而实现高精度、无监督的伪造检测。
链接: https://arxiv.org/abs/2604.17287
作者: H. M. Shadman Tabib,Tasriad Ahmed Tias,Nafis Tahmid
机构: Bangladesh University of Engineering and Technology (BUET)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint before NeurIPS main track submission
Abstract:Copy-move forgery, where a region within an image is duplicated to hide or fabricate content, remains a persistent threat to visual media integrity. We introduce GraphSpecForge, a training-free framework that detects copy-move forgery by analysing the spectral structure of attention graphs from a pretrained Stable Diffusion U-Net. Our central insight is that copy-move manipulation induces approximate subgraph duplication in the self-attention graph, leading to measurable spectral redistribution in the normalized graph Laplacian. We formalise this link with perturbation-based arguments and build an image-level anomaly detector using Wasserstein distances between per-image Laplacian spectra and an authentic reference distribution. We evaluate GraphSpecForge on four copy-move benchmarks without forgery-specific retraining. On RecodAI-LUC (5,128 images), our best configuration achieves AUROC = 0.606 (95% CI: 0.580-0.638; permutation p = 0.005), and the normalized Laplacian outperforms raw attention spectra by +0.057 AUROC. On MICC-F220, CoMoFoD, and COVERAGE, the same pipeline attains AUROCs of 0.752, 0.774, and 0.673, respectively; on CoMoFoD it also reaches AUPRC = 0.833, balanced accuracy = 0.712, MCC = 0.499, and TPR@1%FPR = 32.5%. Additional ablation and falsification experiments confirm the signal’s specificity and sensitivity to manipulation strength, while null-graph controls rule out trivial-statistic explanations.
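GraphSpecForge 的核心计算是:由注意力图构造归一化图拉普拉斯,取其特征值谱,再与真实参考谱计算一维 Wasserstein 距离。下面是这一流程的最小 NumPy 草图(注意力图用随机对称矩阵模拟,复制子块以近似"子图重复",均为假设性设置,非论文原代码):

```python
import numpy as np

def normalized_laplacian_spectrum(A):
    """由对称非负注意力图 A 计算归一化图拉普拉斯 L = I - D^{-1/2} A D^{-1/2} 的特征值谱。"""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.sort(np.linalg.eigvalsh(L))

def wasserstein_1d(u, v):
    """等长样本的一维 Wasserstein-1 距离:排序后逐点绝对差的均值。"""
    return float(np.mean(np.abs(np.sort(u) - np.sort(v))))

rng = np.random.default_rng(0)
n = 32
A = rng.random((n, n)); A = (A + A.T) / 2        # 模拟真实图像的对称注意力图
B = A.copy(); B[:8, :8] = B[8:16, 8:16]          # 模拟复制粘贴诱导的近似子图重复
spec_real = normalized_laplacian_spectrum(A)
spec_fake = normalized_laplacian_spectrum(B)
score = wasserstein_1d(spec_real, spec_fake)     # 偏离参考谱的程度,作为异常分数
```

归一化拉普拉斯的特征值落在 [0, 2] 区间内,谱的再分布正是论文所利用的篡改信号。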
[CV-144] Depth Adaptive Efficient Visual Autoregressive Modeling CVPR2026
【速读】:该论文旨在解决视觉自回归(Visual Autoregressive, VAR)模型在生成高分辨率图像时计算效率低下的问题,即对每个位置均采用固定计算深度,导致资源浪费。现有方法通过频域图进行硬剪枝(hard-pruning)加速推理,但受限于二值化剪枝策略,难以提升生成质量,即使频率估计更优亦然。解决方案的关键在于提出一种从“剪枝整个token”到“按token自适应分配计算深度”的范式转变:引入无需训练的DepthVAR框架,其核心是基于循环旋转调度的自适应深度调度器(adaptive depth scheduler),实现非静态、均衡的细化过程,并结合层优先掩码机制动态选择性应用Transformer块,最终通过融合不同深度的特征编码确保每个token的影响与其处理深度成比例,从而在显著加速(2.3×–3.1×)的同时保持高质量输出。
链接: https://arxiv.org/abs/2604.17286
作者: Chunliang Li,Tianze Cao,Sanyuan Zhao
机构: Beijing Institute of Technology (北京理工大学); Beijing Institute of Technology, Zhuhai (北京理工大学珠海校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 Findings
Abstract:Visual Autoregressive (VAR) modeling inefficiently applies a fixed computational depth to each position when generating high-resolution images. While existing methods accelerate inference by pruning tokens using frequency maps, their binary hard-pruning approach is fundamentally limited and fails to improve quality even with better frequency estimation. Observing that VAR models possess significant depth redundancy, we propose a paradigm shift from pruning entire tokens to adaptively allocating per-token computational depth. To this end, we introduce DepthVAR, a training-free framework that dynamically allocates computation. It integrates an adaptive depth scheduler, which assigns computational depth via a cyclic rotated schedule for balanced, non-static refinement, with a dynamic inference process that translates these depths into layer-major masks, selectively applies transformer blocks, and blends the resulting codes to ensure each token’s influence is proportional to its processing depth. Extensive experiments show that DepthVAR achieves 2.3×-3.1× acceleration with minimal quality loss, offering a competitive compute-performance trade-off compared to existing hard-pruning approaches. Code is available at this https URL
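DepthVAR 的关键是用循环旋转调度为每个 token 分配计算深度,再转为 layer-major 掩码以选择性应用 Transformer 块。下面用 NumPy 给出该调度思想的最小示意(深度候选值与旋转步长均为假设):

```python
import numpy as np

def cyclic_depth_schedule(num_tokens, depth_choices, step):
    """循环旋转地把候选深度分配给各 token;step 控制旋转偏移,使细化过程非静态且均衡。"""
    idx = (np.arange(num_tokens) + step) % len(depth_choices)
    return np.asarray(depth_choices)[idx]

def layer_major_mask(token_depths, num_layers):
    """转为 [num_layers, num_tokens] 布尔掩码:mask[l, t] 为 True 表示第 l 层处理 token t。"""
    layers = np.arange(num_layers)[:, None]
    return layers < token_depths[None, :]

depths = cyclic_depth_schedule(num_tokens=6, depth_choices=[4, 8, 12], step=1)
mask = layer_major_mask(depths, num_layers=12)
```

每个 token 被处理的层数恰好等于其分配到的深度,浅层 token 即可跳过后续 Transformer 块。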
[CV-145] PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction
【速读】:该论文旨在解决农业场景中害虫识别与管理的难题,特别是由于害虫种类繁多、形态特征复杂多样,现有技术难以有效建模其细粒度视觉特征和高层语义信息,从而限制了实际应用效果。解决方案的关键在于提出一种融合视觉-语言的协同框架PestVL-Net,其核心创新包括:(1)视觉路径采用基于循环加权键值(Recurrent Weighted Key Value, RWKV)架构,并引入显著性引导的自适应窗口划分机制,以精准捕捉害虫的细粒度视觉特征;(2)语言模块利用多模态大语言模型(Multimodal Large Language Models, MLLMs)先验知识,结合农业专家经验并通过多模态思维链(Chain-of-Thought, CoT)推理生成精确的害虫语义描述;最终通过深度融合视觉与文本表征实现细粒度多模态害虫学习,显著提升在真实农业环境中的识别性能。
链接: https://arxiv.org/abs/2604.17278
作者: Xueheng Li,Tao Hu,Ke Cao,Runsheng Qi,Huixin Zhang,Rui Li,Jie Zhang,Chengjun Xie
机构: Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences (中国科学院合肥物质科学研究院智能机械研究所); University of Science and Technology of China (中国科学技术大学); Zhongke Hefei Institute of Technology Innovation Engineering (中科合肥技术工程院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures
Abstract:Effective pest recognition and management are crucial for sustainable agricultural development. However, collecting pest data in real scenarios is often challenging. Compared to other domains, pests exhibit a wide variety of species with complex and diverse morphological characteristics. Existing techniques struggle to effectively model the key visual and high-level semantic features of pests in a fine-grained manner. These limitations hinder the practical application of such methods in real agricultural scenarios. To address these critical challenges, we present a synergistic approach that integrates PestVL-Net, a novel vision-language framework, with two multi-species pest datasets to facilitate fine-grained pest learning. The visual pathway of PestVL-Net utilizes the Recurrent Weighted Key Value (RWKV) architecture, incorporating a saliency-guided adaptive window partitioning scheme to effectively model the fine-grained visual characteristics of pests. Concurrently, the linguistic component generates precise pest semantic descriptions by leveraging Multimodal Large Language Models (MLLMs) priors, critically informed by agricultural expert knowledge and structured via multimodal Chain-of-Thought (CoT) reasoning. The deep fusion of these complementary visual and textual representations enables fine-grained multimodal pest learning. Extensive experimental evaluations on multiple pest datasets validate the superior performance of PestVL-Net, highlighting its potential for effective real-world pest management.
[CV-146] Instinct vs. Reflection: Unifying Token and Verbalized Confidence in Multimodal Large Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在实际部署中缺乏可靠置信度估计的问题。现有方法主要针对纯文本大语言模型(Text-only Large Language Models, LLMs),常依赖计算成本较高的自一致性采样,难以直接适用于多模态场景。论文发现MLLMs存在显著的“直觉-反思错位”现象:模型在token级别隐式支持与其显式自我评估的置信度不一致。为解决此问题,作者提出一种单调置信度融合框架(monotone confidence fusion framework),通过整合双通道信号(token级支持与语义自信度)及跨通道一致性来估计预测正确性;随后引入保持顺序的均值对齐步骤(order-preserving mean alignment),校正全局偏差,在维持选择性预测中风险-覆盖率权衡的同时提升校准性能。实验表明,该方法在多个开源与闭源MLLMs上均能获得更可靠的置信度估计,并改善校准效果与失败预测能力。
链接: https://arxiv.org/abs/2604.17274
作者: Yunkai Dang,Yifan Jiang,Yizhu Jiang,Anqi Chen,Wenbin Li,Yang Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in various perception and reasoning tasks. Despite this success, ensuring their reliability in practical deployment necessitates robust confidence estimation. Prior works have predominantly focused on text-only LLMs, often relying on computationally expensive self-consistency sampling. In this paper, we extend this to multimodal settings and conduct a comprehensive evaluation of MLLMs’ response confidence estimation. Our analysis reveals a significant instinct-reflection misalignment: the model’s implicit token-level support frequently diverges from its verbal self-assessment confidence. To address this misalignment, we propose a monotone confidence fusion framework to merge dual-channel signals and cross-channel consistency to estimate correctness. Subsequently, an order-preserving mean alignment step is applied to correct global bias, which improves calibration while preserving the risk-coverage trade-off for selective prediction. Experiments on diverse open-source and closed-source MLLMs show that our method consistently yields more reliable confidence estimates and improves both calibration and failure prediction. Code will be available at this https URL.
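论文将 token 级置信度与口头自评置信度做单调融合,再用保序均值对齐校正全局偏差。下面给出一个假设性的最小实现:以几何加权平均作为单调融合,以整体平移作为保序对齐(论文未完全给出具体形式,此处仅作示意):

```python
import numpy as np

def monotone_fuse(token_conf, verbal_conf, w=0.5):
    """单调融合:几何加权平均,任一通道置信度上升时融合值不下降(w 为假设权重)。"""
    return token_conf ** w * verbal_conf ** (1.0 - w)

def order_preserving_mean_align(conf, target_mean):
    """保序均值对齐:对全部分数加同一常数并裁剪到 [0, 1],
    使均值对齐到 target_mean(如验证集准确率),样本间排序保持不变。"""
    return np.clip(conf + (target_mean - conf.mean()), 0.0, 1.0)

token = np.array([0.9, 0.6, 0.3])     # token 级隐式支持度(示例数值)
verbal = np.array([0.8, 0.7, 0.5])    # 口头自评置信度(示例数值)
fused = monotone_fuse(token, verbal)
aligned = order_preserving_mean_align(fused, target_mean=0.5)
```

由于对齐只做整体平移,样本间排序不变,因此选择性预测中的风险-覆盖率权衡得以保留。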
[CV-147] Fractal Characterization of Low-Correlation Signals in AI-Generated Image Detection
【速读】:该论文旨在解决当前深度伪造(Deepfake)检测方法在开放世界场景下鲁棒性不足的问题。其解决方案的关键在于从信号层面出发,识别合成图像与真实照片之间的内在差异:研究发现低相关性信号(low-correlation signals)是区分AI生成图像与真实图像的显著特征;在此基础上,作者提出一种基于分形理论(fractal theory)量化此类信号的方法,通过分析低相关性信号的分形特性,有效捕捉生成过程中隐含的细微统计异常,从而实现更鲁棒且高效的深度伪造检测。该方法不仅适用于人脸图像,还可推广至所有AI生成图像的检测任务。
链接: https://arxiv.org/abs/2604.17268
作者: Wenwei Xie,Jie Yin,Lu Ma,Xuansong Zhang,Wenjing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:AI-generated imagery has reached near-photorealistic fidelity, yet this technology poses significant threats to information security and societal trust. Existing deepfake detection methods often exhibit limited robustness in open-world scenarios. To address this limitation, this paper investigates intrinsic discrepancies between synthetic and authentic images from a signal-level perspective. Our analysis reveals that low-correlation signals serve as distinctive markers for differentiating AI-generated imagery from real photographs. Building on this insight, we introduce a novel method for quantifying these signals based on fractal theory. By analyzing the fractal characteristics of low-correlation signals, our method effectively captures the subtle statistical anomalies inherent to the synthesis process. Extensive experimental results demonstrate the method’s robustness and superior detection performance. This work emphasizes the need to shift research focus to a new signal-level direction for deepfake detection. Theoretically, this proposed approach is not limited to face image identification but can be applied to all AI-generated image detection tasks. This study provides a new research direction for deepfake detection.
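论文基于分形理论量化低相关信号的特性。经典的盒计数(box-counting)维数是此类分形特征最常见的估计方式之一,下面给出其最小 NumPy 草图(并非论文原方法,仅示意分形维数如何计算):

```python
import numpy as np

def box_counting_dimension(mask, box_sizes=(1, 2, 4, 8)):
    """盒计数分形维数:统计各盒尺寸下含前景的盒子数,
    对 log(计数) 与 log(1/尺寸) 做线性拟合,斜率即维数估计。"""
    n = mask.shape[0]
    counts = []
    for s in box_sizes:
        m = n - n % s                                  # 裁掉无法整除的边缘
        blocks = mask[:m, :m].reshape(m // s, s, m // s, s)
        counts.append(blocks.any(axis=(1, 3)).sum())
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes, dtype=float)),
                          np.log(np.array(counts, dtype=float)), 1)
    return slope

# 填满的二维区域,其盒计数维数应接近 2
dim_full = box_counting_dimension(np.ones((64, 64), dtype=bool))
```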
[CV-148] RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
【速读】:该论文旨在解决当前遥感多模态大语言模型(Remote Sensing Multimodal Large Language Model, RS MLLM)在真实场景下缺乏鲁棒性的问题,即模型在面对输入图像和文本中的噪声(如云雾遮挡、口语化或模糊的指令)时,其视觉-语义推理能力显著下降,导致部署时性能退化。解决方案的关键在于提出 RemoteShield,一种通过语义等价簇(semantic equivalence cluster)进行偏好学习(preference learning)的训练机制:每个干净样本与其对应的扰动变体组成一个簇,在同一簇内比较模型对干净与扰动输入的响应,引导模型偏好稳定输出而非受扰动影响的错误响应,从而增强跨条件一致性并提升对多模态扰动的鲁棒性。
链接: https://arxiv.org/abs/2604.17243
作者: Rui Min,Liang Yao,Shiyu Miao,Shengxiang Xu,Yuxuan Liu,Chuanyi Zhang,Shimin Di,Fan Liu
机构: Hohai University (河海大学); Nanjing University (南京大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A robust Multimodal Large Language Model (MLLM) for Earth Observation should maintain consistent interpretation and reasoning under realistic input variations. However, current Remote Sensing MLLMs fail to meet this requirement. Trained on carefully curated clean datasets, they learn brittle mappings that do not generalize to noisy conditions in operational Earth Observation. Consequently, their performance degrades when confronted with imperfect inputs in deployment. To quantify this vulnerability, we construct a realistic set of multimodal perturbations, including visual degradations such as cloud and fog cover, together with diverse human-centric textual variations ranging from colloquialisms to vague or omitted instructions. Empirical evaluations show that these perturbations significantly impair the visual-semantic reasoning capabilities of leading RS foundation models. To address this limitation, we introduce RemoteShield, a robust Remote Sensing MLLM trained to maintain consistent outputs across realistic input variations. During training, each clean sample is paired with its image-text perturbed variants to form a semantic equivalence cluster. Rather than directly fitting noisy samples, RemoteShield is optimized through preference learning over clean and perturbed conditions within the same cluster. By comparing model responses to clean and corrupted inputs, the model is encouraged to favor stable responses over perturbation-induced failures. This cross-condition alignment helps the model focus on underlying task semantics despite visual degradations and textual noise. Experiments on three Earth Observation tasks show that RemoteShield consistently delivers stronger robustness and cross-condition consistency than representative baselines under realistic multimodal perturbations.
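RemoteShield 在同一语义等价簇内,通过偏好学习让模型偏向干净条件下的稳定响应而非受扰动诱导的失败响应。这类偏好优化常用 DPO 风格的成对损失实现,下面给出其通用草图(论文未给出具体损失形式,此处为假设性实现):

```python
import math

def preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO 风格成对偏好损失:-log sigmoid(beta * (Δ_policy - Δ_ref)),
    Δ 为"偏好响应"与"被拒响应"的对数概率之差。"""
    margin = beta * ((logp_chosen - logp_rejected) - (ref_chosen - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 策略模型把更高概率给予稳定响应时,损失更小
worse = preference_loss(-10.0, -9.0, -10.0, -10.0)   # 策略反而偏好受扰动诱导的响应
better = preference_loss(-8.0, -12.0, -10.0, -10.0)  # 策略偏好干净条件下的稳定响应
```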
[CV-149] Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM
【速读】:该论文旨在解决个性化图像美学评估(Personalized Image Aesthetics Assessment, PIAA)中的零样本(zero-shot)问题,即在缺乏用户历史评分数据的情况下如何建模个体用户的审美偏好。现有方法依赖于用户的历史评分进行个性化建模,难以应对无历史数据的场景。其解决方案的关键在于引入用户画像(user profile)作为上下文信号,并提出一种基于画像的个性化范式——P-MLLM(Profile-aware Multimodal Large Language Model)。该模型通过在冻结的大语言模型(frozen LLM)中嵌入选择性融合模块(selective fusion modules),在画像条件推理过程中动态地、可控地将视觉信息融入模型的隐藏状态,从而实现对视觉内容的画像感知式整合,使模型能够在无历史评分的情况下仍保持良好的个性化预测性能。
链接: https://arxiv.org/abs/2604.17233
作者: Chun Wang,Chenfeng Wei,Chenyang Liu,Weihong Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized image aesthetics assessment (PIAA) aims to predict an individual user’s subjective rating of an image, which requires modeling user-specific aesthetic preferences. Existing methods rely on historical user ratings for this modeling and therefore struggle when such data are unavailable. We address this zero-shot setting by using user profiles as contextual signals for personalization and adopting a profile-based personalization paradigm. We introduce P-MLLM, a profile-aware multimodal LLM that augments a frozen LLM with selective fusion modules for controlled visual integration. These modules selectively integrate visual information into the model’s evolving hidden states during profile-conditioned reasoning, allowing visual information to be incorporated in a profile-aware manner. Experiments on recent PIAA benchmarks show that P-MLLM achieves competitive zero-shot performance and remains effective even with coarse profile information, highlighting the potential of profile-based personalization for zero-shot PIAA.
[CV-150] Fringe Projection Based Vision Pipeline for Autonomous Hard Drive Disassembly
【速读】:该论文旨在解决废旧硬盘驱动器(Hard Disk Drives, HDDs)自动化拆解过程中存在的感知难题,特别是缺乏鲁棒的三维(3D)传感、场景理解能力不足以及紧固件定位精度低的问题。其核心解决方案是提出一种自主视觉感知流水线,通过结构光投影轮廓测量(Fringe Projection Profilometry, FPP)实现高精度3D传感,并在FPP失效时选择性触发深度补全模块;同时,将同一FPP相机-投影仪系统用于实例分割网络以实现像素级对齐的语义与几何信息融合,避免了RGB-D系统中常见的配准误差。该方案还优化了深度补全与实例分割网络的推理效率,在保证高精度(实例分割box mAP@50为0.960,mask mAP@50为0.957;深度补全RMSE为2.317 mm)的同时实现了低延迟(12.86 ms)和高吞吐量(77.7 FPS),并采用仿真到真实(sim-to-real)迁移学习增强数据多样性,从而为下游机器人拆解任务提供高质量的空间与语义感知输入。
链接: https://arxiv.org/abs/2604.17231
作者: Badrinath Balasubramaniam,Vignesh Suresh,Benjamin Metcalf,Beiwen Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 20 pages, 11 figures
Abstract:Unrecovered e-waste represents a significant economic loss. Hard disk drives (HDDs) comprise a valuable e-waste stream necessitating robotic disassembly. Automating the disassembly of HDDs requires holistic 3D sensing, scene understanding, and fastener localization; however, current methods are fragmented, lack robust 3D sensing, and lack fastener localization. We propose an autonomous vision pipeline which performs 3D sensing using a Fringe Projection Profilometry (FPP) module, with selective triggering of a depth completion module where FPP fails, and integrates this module with a lightweight, real-time instance segmentation network for scene understanding and critical component localization. By utilizing the same FPP camera-projector system for both our depth sensing and component localization modules, our depth maps and derived 3D geometry are inherently pixel-wise aligned with the segmentation masks without registration, providing an advantage over RGB-D perception systems common in industrial sensing. We optimize both our trained depth completion and instance segmentation networks for deployment-oriented inference. The proposed system achieves a box mAP@50 of 0.960 and mask mAP@50 of 0.957 for instance segmentation, while the selected depth completion configuration with the Depth Anything V2 Base backbone achieves an RMSE of 2.317 mm and MAE of 1.836 mm; the Platter Facing learned inference stack achieved a combined latency of 12.86 ms and a throughput of 77.7 Frames Per Second (FPS) on the evaluation workstation. Finally, we adopt a sim-to-real transfer learning approach to augment our physical dataset. The proposed perception pipeline provides both high-fidelity semantic and spatial data which can be valuable for downstream robotic disassembly. The synthetic dataset developed for HDD instance segmentation will be made publicly available.
[CV-151] Region-Affinity Attention for Whole-Slide Breast Cancer Classification in Deep Ultraviolet Imaging
【速读】:该论文旨在解决深紫外荧光成像(Deep Ultraviolet, DUV)获取的全切片图像(Whole-Slide Images, WSIs)在乳腺癌分类任务中,现有基于patch的深度学习方法因破坏空间上下文信息且预处理开销大而难以临床应用的问题。其关键解决方案是提出一种新型区域亲和注意力机制(Region-Affinity Attention),该机制无需将WSI分割为patch即可直接处理整张切片,通过建模局部邻域距离构建完整的亲和矩阵,动态突出诊断相关区域,并引入对比损失增强特征判别性,从而在保留空间完整性的同时显著提升分类性能,在136例DUV-WSI数据集上达到92.67%准确率和95.97% AUC。
链接: https://arxiv.org/abs/2604.17222
作者: Nagur Shareef Shaik,Teja Krishna Cherukuri,Dong Hye Ye
机构: Georgia State University (佐治亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Accepted at the IEEE Engineering in Medicine and Biology Society Annual International Conference (Proceedings of the 48th International Conference), 2026
Abstract:Breast cancer diagnosis demands rapid and precise tools, yet traditional histopathological methods often fall short in intra-operative settings. Deep Ultraviolet (DUV) fluorescence imaging emerges as a transformative approach, offering high-contrast, label-free visualization of whole-slide images (WSIs) with unprecedented detail, surpassing conventional hematoxylin and eosin (H&E) staining in speed and resolution. However, existing deep learning methods for breast cancer classification, predominantly patch-based, fragment spatial context and incur significant preprocessing overhead, limiting their clinical utility. Moreover, standard attention mechanisms, such as Spatial, Squeeze-and-Excitation, Global Context and Guided Context Gating, fail to fully exploit the rich, multi-scale regional relationships inherent in DUV-WSI data, often prioritizing generic feature recalibration over diagnostic specificity. This study introduces a novel Region-Affinity Attention mechanism tailored for DUV-WSI breast cancer classification, processing entire slides without patching to preserve spatial integrity. By modeling local neighbor distances and constructing a full affinity matrix, our method dynamically highlights diagnostically relevant regions, augmented by a contrastive loss to enhance feature discriminability. Evaluated on a dataset of 136 DUV-WSI samples, our approach achieves an accuracy of 92.67 ± 0.73% and an AUC of 95.97%, outperforming existing attention methods.
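Region-Affinity Attention 通过局部邻域距离构建完整亲和矩阵,并据此动态突出诊断相关区域。下面用 NumPy 给出这一思想的最小示意(高斯核亲和与按行归一化为假设性实现,非论文原结构):

```python
import numpy as np

def region_affinity_attention(features, tau=1.0):
    """features: [N, d] 区域特征。
    亲和矩阵 A_ij = exp(-||f_i - f_j||^2 / tau),按行归一化后作为注意力权重,
    返回各区域对全图区域加权聚合的表示与注意力矩阵。"""
    diff = features[:, None, :] - features[None, :, :]
    dist2 = (diff ** 2).sum(-1)                 # N x N 成对平方距离
    A = np.exp(-dist2 / tau)                    # 完整亲和矩阵
    attn = A / A.sum(axis=1, keepdims=True)     # 行归一化
    return attn @ features, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))                 # 5 个区域、4 维特征(模拟)
out, attn = region_affinity_attention(feats)
```

这种全量亲和建模避免了 patch 划分,使注意力可以覆盖整张切片的区域间关系。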
[CV-152] Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中存在的“文本捷径学习”(text shortcut learning)问题,即模型过度依赖文本描述而忽视视觉信息,导致跨模态理解能力受限。解决方案的关键在于提出一种对抗性评估框架,通过在保持图像不变的情况下引入语义冲突的文本扰动(如形状替换、颜色替换、位置替换和随机文本),量化模型对视觉证据的依赖程度;同时设计了一种基于LoRA的优化策略,整合硬负样本挖掘、标签平滑、分层学习率、余弦重启、课程学习和数据增强等技术,显著降低了模型在对抗场景下的准确率下降幅度(Drop从27.5%降至9.8%),并在保持高正常准确率(97%)的同时增强了视觉特征关注与跨模态对齐。
链接: https://arxiv.org/abs/2604.17217
作者: Lijie Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence – a phenomenon termed "text shortcut learning." We propose an adversarial evaluation framework that quantifies this cross-modal dependency by measuring accuracy degradation (Drop) when semantically conflicting text is paired with unchanged images. Four adversarial strategies – shape_swap, color_swap, position_swap, and random_text – are applied to a controlled geometric-shapes dataset (n = 1,000). We compare three configurations: Baseline CLIP (ViT-B/32), LoRA fine-tuning, and LoRA Optimized (integrating Hard Negative Mining, Label Smoothing, layer-wise learning rates, Cosine Restarts, curriculum learning, and data augmentation). The optimized model reduces average Drop from 27.5% to 9.8% (64.4% relative improvement, p < 0.001) while maintaining 97% normal accuracy. Attention visualization and embedding-space analysis confirm that the optimized model attends more to visual features and achieves tighter cross-modal alignment.
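论文用"语义冲突文本 + 不变图像"下的准确率下降量 Drop 来量化跨模态依赖。下面给出 Drop 指标与一个简化 color_swap 扰动的示意实现(替换词表为假设,数值取自摘要中报告的结果):

```python
# 假设性的颜色替换词表,仅用于示意 color_swap 扰动
COLOR_SWAP = {"red": "blue", "blue": "green", "green": "red"}

def color_swap(caption):
    """把文本描述中的颜色词替换为冲突颜色,而图像保持不变。"""
    return " ".join(COLOR_SWAP.get(w, w) for w in caption.split())

def drop_metric(acc_clean, acc_adversarial):
    """Drop = 正常准确率 - 对抗文本条件下的准确率;越小说明越依赖视觉证据。"""
    return acc_clean - acc_adversarial

perturbed = color_swap("a red circle above a blue square")
drop_base = drop_metric(0.97, 0.695)   # 对应摘要中基线约 27.5% 的 Drop
drop_opt = drop_metric(0.97, 0.872)    # 对应优化后约 9.8% 的 Drop
```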
[CV-153] EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents
【速读】:该论文旨在解决生成式 AI (Generative AI) 中语音驱动头像(speech-driven talking-head)在实时性、统一听-说行为以及高保真视觉质量方面的挑战。现有方法依赖双流音频处理,引入了对话方前瞻依赖,不适用于因果用户与大语言模型(LLM)的交互场景。本文提出 EmbodiedHead 框架,其核心创新在于采用单流接口结合显式的逐帧听-说状态条件控制与流式音频调度器(Streaming Audio Scheduler),有效抑制倾听时的虚假嘴部动作并实现自然的轮替对话;同时引入两阶段训练策略——系数空间预训练与图像域联合微调,显著缩小运动级监督与渲染质量之间的差距,从而在保持实时生成的同时提升视觉质量和动作一致性。
链接: https://arxiv.org/abs/2604.17211
作者: Yu Zhang,Kaiyuan Shen,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages
Abstract:We present EmbodiedHead, a speech-driven talking-head framework that equips LLMs with real-time visual avatars for conversation. A practical embodied avatar must achieve real-time generation, unified listening-speaking behavior, and high rendered visual quality simultaneously. Our framework couples the first Rectified-Flow Diffusion Transformer (DiT) for this task with a differentiable renderer, enabling diverse, high-fidelity generation in as few as four sampling steps. Prior listening-speaking methods rely on dual-stream audio, introducing an interlocutor look-ahead dependency incompatible with causal user–LLM interaction. We instead adopt a single-stream interface with explicit per-frame listening-speaking state conditioning and a Streaming Audio Scheduler, suppressing spurious mouth motion during listening while enabling seamless turn-taking. A two-stage training scheme of coefficient-space pretraining and joint image-domain refinement further closes the gap between motion-level supervision and rendered quality. Extensive experiments demonstrate state-of-the-art visual quality and motion fidelity in both speaking and listening scenarios.
[CV-154] DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation
【速读】:该论文旨在解决在医疗影像领域(特别是视网膜图像)中,由于数据稀缺导致大型视觉语言模型(Large Vision-Language Models, LVLMs)容易过拟合且难以捕捉细微但关键病灶的问题。其解决方案的关键在于提出了一种名为DREAM(Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion)的新框架,该框架通过两阶段融合机制实现:首先由Abstractor模块将图像特征与眼科专家标注的临床关键词映射到共享空间以增强病理相关性;其次Adaptor模块采用可学习参数动态调整多模态信息权重,生成统一表示;同时引入对比对齐(Contrastive Alignment)模块确保融合表征与真实医学报告语义一致,从而在有限数据下实现高保真医学报告生成,并在DeepEyeNet和ROCO数据集上取得显著性能提升。
链接: https://arxiv.org/abs/2604.17209
作者: Nagur Shareef Shaik,Teja Krishna Cherukuri,Dong Hye Ye
机构: Georgia State University (佐治亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Accepted at the IEEE Engineering in Medicine and Biology Society Annual International Conference (Proceedings of the 48th International Conference), 2026
Abstract:Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, leading to models that overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a novel framework for high-fidelity medical report generation that excels even with limited data. DREAM employs a unique two-stage fusion mechanism that intelligently integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enhancing visual data with pathology-relevant insights. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting the importance of each modality using learnable parameters to create a unified representation. To ensure the model’s outputs are semantically grounded in clinical reality, a Contrastive Alignment module aligns these fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state-of-the-art on the DeepEyeNet benchmark, achieving a BLEU-4 score of 0.241, and further demonstrates strong generalization to the ROCO dataset.
[CV-155] CDSA-Net:Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography
【速读】:该论文旨在解决冠状动脉数字减影血管造影(Digital Subtraction Angiography, DSA)成像中因生理运动导致的图像质量问题,尤其是现有深度学习方法存在的两个关键临床不可接受缺陷:持续存在的边界伪影和原始组织灰度保真度的丧失。解决方案的核心在于提出一种名为CDSA-Net的新框架,其关键创新包括:(i) 分层几何先验引导机制(Hierarchical Geometric Prior Guidance, HGPG),嵌入于冠状动脉结构提取网络(Coronary Structure Extraction Network, CSENet)中,通过集成几何先验(Integrated Geometric Prior, IGP)、门控空间调制(Gated Spatial Modulation, GSM)与中心线感知拓扑损失(Centerline-Aware Topology, CAT)监督,实现血管结构连续性的精准保持;(ii) 自适应噪声模块(Adaptive Noise Module, ANM),部署于冠状动脉背景恢复网络(Coronary Background Restoration Network, CBResNet)中,能够建模临床X射线噪声的随机特性,弥合域间差异,从而实现无缝背景强度估计并彻底消除边界伪影。最终通过从原始 angiogram 中减去恢复的背景获得高质量减影图像,在血管强度相关性和感知质量方面显著优于当前最优方法,并在形态学评估效率和血流动力学评价速度上分别提升25.6%和42.9%,同时保持与原始angiogram一致的诊断准确性。
链接: https://arxiv.org/abs/2604.17208
作者: Si Li,Chen-Kai Hu,Zhenhuan Lyu,Yuanqing He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, forcing reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produced images with two critical clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity that undermined diagnostic confidence. We propose a novel framework, termed CDSA-Net, that for the first time explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA-Net introduces two core innovations: (i) A hierarchical geometric prior guidance (HGPG) mechanism, embedded in our coronary structure extraction network (CSENet). It synergistically combines integrated geometric prior (IGP) with gated spatial modulation (GSM) and centerline-aware topology (CAT) loss supervision, ensuring structural continuity. (ii) An adaptive noise module (ANM) within our coronary background restoration network (CBResNet). Unlike standard restoration, ANM uniquely models the stochastic nature of clinical X-ray noise, bridging the domain gap to enable seamless background intensity estimation and the complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, it significantly outperformed state-of-the-art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while maintaining diagnostic results consistent with raw angiograms. The project code is available at this https URL.
[CV-156] SciDraw-6K: A Multilingual Scientific Illustration Dataset Generated by Google Gemini CIDR
【Quick Read】: This paper addresses the lack of a high-quality, multilingually captioned dataset for scientific illustration, which is needed to support multilingual text-to-image research, domain-adapted diffusion fine-tuning, and prompt engineering for scientific visualization. The key of the solution is the SciDraw-6K dataset: 6,291 scientific illustrations synthesized by Google Gemini image-generation models, each paired with text prompts in eleven languages (including Chinese, Japanese, Korean, German, French, and Spanish), covering biomedical, chemistry, materials, electronics, environmental science, AI systems, physics, and other domains, and focused on genres specific to scientific illustration such as schematic diagrams, mechanism figures, table-of-contents graphics, and conceptual posters, thereby providing a dedicated, structured, and diverse training and evaluation resource for scientific visualization tasks.
Link: https://arxiv.org/abs/2604.17206
Authors: Davie Chen
Affiliations: University of Arts in Poznań
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures. Dataset: this https URL . Code: this https URL
Abstract:We present SciDraw-6K, a curated dataset of 6,291 scientific illustrations synthesized by Google Gemini image-generation models, each paired with prompts in eleven languages (English, Simplified Chinese, Traditional Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Italian, and Russian). Images span eight broad scientific categories – biomedical, chemistry, materials, electronics, environment, AI systems, physics, and a long “other” tail – and are produced primarily by the gemini-2.5-flash-image and gemini-3-pro-image-preview model families. In contrast to general-purpose text-to-image corpora that dominate the literature, SciDraw-6K is purpose-built for the scientific illustration genre: schematic diagrams, mechanism figures, table-of-contents graphics, and conceptual posters. We describe the construction pipeline, report dataset statistics, and document its use as the substrate of this http URL, a public scientific drawing service. The dataset is released to support multilingual text-to-image research, domain-adapted diffusion fine-tuning, and prompt-engineering studies for scientific visualization. Dataset: this https URL Code: this https URL
[CV-157] DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior CVPR2026
【Quick Read】: This paper addresses the weaknesses of existing storyboard-synthesis methods built on text-to-image diffusion models, namely poor long-range temporal coherence, unstable character identity, and weak narrative continuity, which limit natural and controllable visual storytelling. The key of the proposed DreamShot framework is to exploit the strong spatio-temporal priors of video diffusion models for controllable multi-shot generation, and to introduce a multi-reference role-conditioning module together with a Role-Attention Consistency Loss that explicitly constrains attention alignment between reference images and generated characters, strengthening identity consistency and narrative coherence.
Link: https://arxiv.org/abs/2604.17195
Authors: Junjia Huang, Binbin Yang, Pengxiang Yan, Jiyang Liu, Bin Xia, Zhao Wang, Yitong Wang, Liang Lin, Guanbin Li
Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; ByteDance Intelligent Creation; Guangdong Key Laboratory of Big Data Analysis and Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2026 as a Highlight paper
Abstract:Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a video generative model based storyboard framework that fully exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatial-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, explicitly constraining attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable video model-driven visual storytelling.
[CV-158] LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation CVPR2026
【Quick Read】: This paper tackles the insufficient navigation accuracy of current Aerial Vision-and-Language Navigation (Aerial VLN), caused by shallow instruction understanding and high computational cost, and in particular the over-reliance of existing methods on landmark descriptions while ignoring directional cues, a key source of spatial context. The key of the proposed LookasideVLN paradigm is an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relations, combined with a lightweight Spatial Landmark Knowledge Base (SLKB) for efficient memory retrieval and a Lookaside MLLM navigation agent that fuses user instructions, visual observations, and directional information from the ELG for path planning. Even with a single-level lookahead it significantly outperforms the state-of-the-art CityNavAgent, showing that exploiting directional cues is an efficient and powerful strategy for Aerial VLN.
Link: https://arxiv.org/abs/2604.17190
Authors: Yuwei Ning, Ganlong Zhao, Yipeng Qin, Si Liu, Yang Liu, Liang Lin, Guanbin Li
Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; The Chinese University of Hong Kong; Centre for Perceptual and Interactive Intelligence; Cardiff University; Beihang University; Guangdong Key Laboratory of Big Data Analysis and Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments. While recent advances have achieved progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions, overlooking directional cues, a key source of spatial context in human navigation. In this work, we propose LookasideVLN, a new paradigm that exploits directional cues in natural language to achieve both more accurate spatial reasoning and greater computational efficiency. LookasideVLN comprises three core components: (1) an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relationships, (2) a Spatial Landmark Knowledge Base (SLKB) that provides lightweight memory retrieval from prior navigation experiences, and (3) a Lookaside MLLM Navigation Agent that aligns multimodal information from user instructions, visual observations, and landmark-direction information from ELG for path planning. Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with a single-level lookahead, demonstrating that leveraging directional cues is a powerful yet efficient strategy for Aerial VLN.
[CV-159] PPEDCRF: Dynamic-CRF-Guided Selective Perturbation for Background-Based Location Privacy in Video Sequences
【Quick Read】: This paper addresses background-based location-privacy leakage in video frames: even after GPS metadata are stripped, an attacker can geolocate a frame by matching its background visual cues against geo-tagged reference imagery. The key of the proposed PPEDCRF, a calibrated selective perturbation framework, is to identify location-sensitive regions with a dynamic conditional random field (DCRF), adaptively rescale perturbation strength with a normalized control penalty (NCP), and inject Gaussian noise only inside the inferred sensitive regions under a differential-privacy (DP)-style calibration rule, perturbing the background locally instead of adding global noise. The method preserves high image quality (36.14 dB PSNR) while markedly lowering retrieval accuracy (from 0.667 to 0.361±0.127), outperforming conventional global Gaussian noise.
Link: https://arxiv.org/abs/2604.17163
Authors: Bo Ma, Weiqi Yan, Jinsong Wu
Affiliations: Resideo Technologies Inc.; Auckland University of Technology; Guilin University of Electronic Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We propose PPEDCRF, a calibrated selective perturbation framework that protects background-based location privacy in released video frames against gallery-based retrieval attackers. Even after GPS metadata are stripped, an adversary can geolocate a frame by matching its background visual cues to geo-tagged reference imagery; PPEDCRF mitigates this threat by estimating location-sensitive background regions with a dynamic conditional random field (DCRF), rescaling perturbation strength with a normalized control penalty (NCP), and injecting Gaussian noise only inside the inferred regions via a DP-style calibration rule. On a controlled paired-scene retrieval benchmark with eight attacker backbones and three noise seeds, PPEDCRF reduces ResNet18 Top-1 retrieval accuracy from 0.667 to 0.361±0.127 at σ₀=8 while preserving 36.14 dB PSNR (an ≈6 dB quality advantage over global Gaussian noise). Transfer across the eight-backbone seed-averaged benchmark is broadly supportive (23 of 24 backbone-gallery cells show negative Δ), while appendix-scale confirmation identifies MixVPR as a remaining adverse-transfer exception. Matched-operating-point analysis shows that PPEDCRF and global Gaussian noise converge in Top-1 privacy at equal utility, so the practical benefit is spatially concentrated perturbation that preserves higher visual quality at any given noise scale rather than stronger matched-utility privacy. Code: this https URL
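The core injection rule, Gaussian noise confined to the inferred sensitive region, can be sketched in a few lines. The mask, noise scale, and function shape below are illustrative assumptions standing in for the DCRF output and calibration rule, not the authors' implementation:

```python
import numpy as np

def selective_perturb(image, mask, sigma, seed=None):
    # Add zero-mean Gaussian noise with scale sigma only where mask == 1
    # (the inferred location-sensitive region); all other pixels are
    # returned untouched, so visual quality outside the region is preserved.
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=image.shape)
    return image + noise * mask

# Toy frame: sensitive region is the top-left 2x2 block (hypothetical mask).
img = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
out = selective_perturb(img, mask, sigma=8.0, seed=0)
```

Only the masked block is perturbed; the rest of the frame keeps its original intensities, which is the spatially concentrated behavior the abstract contrasts with global Gaussian noise.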
[CV-160] Instant Colorization of Gaussian Splats
【Quick Read】: This paper addresses how to efficiently map 2D image information (such as color, neural features, or semantic segmentation masks) back onto an existing Gaussian-splatting scene. This "opposite" mapping direction matters for applications such as scene relighting, stylization, and 3D semantic segmentation, but faces challenges including view-dependent colorization and occlusion handling. The key of the solution is to use the normal equation to solve a visibility-weighted least-squares problem for every Gaussian, yielding stable and efficient parameter updates that can be implemented with existing differentiable rasterizers, with up to an order-of-magnitude speedup over gradient-descent-based baselines across several tasks.
Link: https://arxiv.org/abs/2604.17155
Authors: Daniel Lieber, Alexander Mock, Nils Wandel
Affiliations: University of Osnabrück
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Abstract:Gaussian Splatting has recently become one of the most popular frameworks for photorealistic 3D scene reconstruction and rendering. While current rasterizers allow for efficient mappings of 3D Gaussian splats onto 2D camera views, this work focuses on mapping 2D image information (e.g. color, neural features or segmentation masks) efficiently back onto an existing scene of Gaussian splats. This ‘opposite’ direction enables applications ranging from scene relighting and stylization to 3D semantic segmentation, but also introduces challenges, such as view-dependent colorization and occlusion handling. Our approach tackles these challenges using the normal equation to solve a visibility-weighted least squares problem for every Gaussian and can be implemented efficiently with existing differentiable rasterizers. We demonstrate the effectiveness of our approach on scene relighting, feature enrichment and 3D semantic segmentation tasks, achieving up to an order of magnitude speedup compared to gradient descent-based baselines.
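The per-Gaussian solve the abstract describes can be pictured with a tiny NumPy sketch of a visibility-weighted normal equation. The feature matrix, targets, and visibility weights below are hypothetical stand-ins (a constant-color model per Gaussian), not the paper's actual formulation:

```python
import numpy as np

def fit_gaussian_color(features, targets, visibility):
    # Visibility-weighted least squares via the normal equation:
    #   (A^T W A) x = A^T W y, with W = diag(visibility weights).
    A = np.asarray(features, dtype=float)    # (n_obs, n_coeff)
    y = np.asarray(targets, dtype=float)     # (n_obs,)
    w = np.asarray(visibility, dtype=float)  # (n_obs,) per-view visibility
    AtW = A.T * w                            # broadcasts weights over columns
    return np.linalg.solve(AtW @ A, AtW @ y)

# One Gaussian with a constant color model (single column of ones):
# the occluded third observation gets zero weight and is ignored.
A = np.ones((3, 1))
y = np.array([0.2, 0.8, 0.5])
w = np.array([1.0, 1.0, 0.0])
c = fit_gaussian_color(A, y, w)  # weighted average of the visible views
```

For this degenerate constant model the solve reduces to a visibility-weighted mean, which is why fully occluded views cannot corrupt the fitted color.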
[CV-161] ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation
【Quick Read】: This paper addresses the lack of multimodal control in generative driving-scenario synthesis: how to generate diverse, realistic, and precisely controlled 3D scenario rollouts, covering dynamic traffic actors, road structure, and camera views, conditioned on a text prompt or an input image. The key of ScenarioControl is a vectorized latent space that jointly represents road structure and dynamic agents, together with a cross-global control mechanism that combines cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism and temporal consistency, and supporting long-horizon scenario rollouts from the perspectives of different actors.
Link: https://arxiv.org/abs/2604.17147
Authors: Lili Gao, Yanbo Xu, William Koch, Samuele Ruffino, Luke Rowe, Behdad Chalaki, Dmitriy Rivkin, Julian Ost, Roger Girgis, Mario Bijelic, Felix Heide
Affiliations: Torc Robotics; Princeton University; Mila
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:We introduce ScenarioControl, the first vision-language control mechanism for learned driving scenario generation. Given a text prompt or an input image, ScenarioControl synthesizes diverse, realistic 3D scenario rollouts - including map, 3D boxes of reactive actors over time, pedestrians, driving infrastructure, and ego camera observations. The method generates scenes in a vectorized latent space that represents road structure and dynamic agents jointly. To connect multimodal control with sparse vectorized scene elements, we propose a cross-global control mechanism that integrates cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism. The method produces temporally consistent scenario rollouts from the perspectives of different actors in the scene, supporting long-horizon continuation of driving scenarios. To facilitate training and evaluation, we release a dataset with text annotations aligned to vectorized map structures. Extensive experiments validate that the control adherence and fidelity of ScenarioControl compare favorably to all tested methods across all experiments. Project webpage: this https URL
[CV-162] OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
【Quick Read】: This paper addresses three challenges of building high-precision vectorized maps from multiple cooperating vehicles: high computational cost, viewpoint redundancy, and noise from pose errors and occlusion artifacts. Existing methods rely mainly on single ego-vehicle trajectories and suffer from viewpoint insufficiency, while naively fusing multi-vehicle views supplies complementary spatial information but introduces the problems above. The key of the solution is to reformulate multi-vehicle mapping as select-then-fuse: an Optimal Vehicle Selection (OVS) module uses uncertainty guidance to pick a small set of helper vehicles that maximally reduce ego-centric uncertainty in occluded regions, easing computation and redundancy; Cross-Vehicle Attention (CVA) then performs robust pose-tolerant alignment, and a Semantic-aware Noise Filter (SNF) suppresses artifacts before high-quality BEV-level fusion, yielding more complete and topologically faithful maps from substantially fewer views.
Link: https://arxiv.org/abs/2604.17135
Authors: Zedong Dan, Zijie Wang, Wei Zhang, Xiangru Lin, Weiming Zhang, Xiao Tan, Jingdong Wang, Liang Lin, Guanbin Li
Affiliations: Sun Yat-sen University; Zhongguancun Academy; Baidu Inc.; Shenzhen Loop Area Institute; Guangdong Key Laboratory of Big Data Analysis and Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Offline vectorized maps constitute critical infrastructure for high-precision autonomous driving and mapping services. Existing approaches rely predominantly on single ego-vehicle trajectories, which fundamentally suffer from viewpoint insufficiency: while memory-based methods extend observation time by aggregating ego-trajectory frames, they lack the spatial diversity needed to reveal occluded regions. Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts. We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. An Optimal Vehicle Selection (OVS) module strategically identifies a compact subset of helpers that maximally reduce ego-centric uncertainty in occluded regions, addressing computation and redundancy challenges. Cross-Vehicle Attention (CVA) and Semantic-aware Noise Filter (SNF) then perform pose-tolerant alignment and artifact suppression before BEV-level fusion, addressing the noise challenge. This targeted pipeline yields more complete and topologically faithful maps with substantially fewer views than indiscriminate aggregation. On nuScenes and Argoverse2, OptiMVMap improves MapTRv2 by +10.5 mAP and +9.3 mAP, respectively, and surpasses memory-augmented baselines MVMap and HRMapNet by +6.2 mAP and +3.8 mAP on nuScenes. These results demonstrate that uncertainty-guided selection of helper vehicles is essential for efficient and accurate multi-vehicle vectorized mapping. The code is released at this https URL. 
[CV-163] Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection
【Quick Read】: This paper examines the inconsistency of vision-language models in open-vocabulary object grounding under semantically equivalent descriptions, where phrasings such as "a person" and "a human" may select different object instances in the same image. The key of the study is a controlled pipeline (DETR proposals filtered by CLIP language-conditioned selection) evaluated systematically on 263 COCO val2017 images, showing that this instability is structured rather than random and that text-embedding distance explains only 34% of grounding disagreement (r = -0.58), which implicates the argmax selection mechanism itself rather than text-level representation differences alone.
Link: https://arxiv.org/abs/2604.17126
Authors: Dawar Jyoti Deka, Amit Sethi, Syed Mohammad Ali
Affiliations: Indian Institute of Technology Bombay
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 9 figures, 1 table. Accepted at ICCAI 2026 (The 12th International Conference on Computing and Artificial Intelligence), Okinawa, Japan, April 24-27, 2026
Abstract:Vision-language models enable open-vocabulary object grounding through natural language queries, under the implicit assumption that semantically equivalent descriptions yield consistent outputs. We examine this assumption using a controlled pipeline combining DETR for object proposals with CLIP for language-conditioned selection on 263 COCO val2017 images. We find that overlapping prompts such as “a person,” “a human,” and “a pedestrian” frequently select different instances, with mean instability of 2.11 distinct selections across six prompts. PCA analysis shows this variability is structured and directional, not random. Prompt ensembling does not improve quality and often shifts selections toward generic regions. We further show that text embedding proximity explains only 34% of grounding disagreement (r = -0.58), confirming that instability arises from the argmax selection mechanism rather than text-level distances alone.
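The argmax selection mechanism the paper implicates can be illustrated with synthetic embeddings. The 2-D vectors below are made-up numbers, chosen so that two nearly identical "prompt" embeddings still flip the argmax between two proposals that score almost equally:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_box(prompt_emb, box_embs):
    # CLIP-style grounding: pick the proposal with highest cosine similarity.
    scores = [cosine_sim(prompt_emb, b) for b in box_embs]
    return int(np.argmax(scores))

# Two proposal embeddings and two near-duplicate prompt embeddings
# (hypothetical numbers, not real CLIP features).
boxes = [np.array([1.0, 0.0]), np.array([0.8, 0.6])]
p1 = np.array([0.95, 0.31])
p2 = np.array([0.93, 0.37])
i1 = select_box(p1, boxes)  # picks proposal 0
i2 = select_box(p2, boxes)  # picks proposal 1 despite p2 ≈ p1
```

Even though the two prompts are over 0.99 cosine-similar, the hard argmax flips between proposals, which mirrors the paper's finding that embedding proximity alone cannot explain grounding disagreement.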
[CV-164] Multimodal Fusion of Histopathology Images and Electronic Health Records for Early Breast Cancer Diagnosis
【Quick Read】: This paper addresses the limited predictive performance of breast-cancer diagnosis when each data modality is used in isolation, in particular when pathology images and clinical electronic health record (EHR) information are not adequately fused. The key of the solution is a systematic multimodal framework that integrates patch-level histopathology features from the BreCaHAD dataset with structured clinical data from MIMIC-IV via intermediate fusion, concatenating latent representations from ResNet-18 (image modality) and XGBoost (tabular modality). It achieves near-perfect three-class patch classification (AUC 1.000) and a macro-average AUC of 0.997 for the overall multimodal model, with the largest gains on the class-imbalanced mitosis category (AUC 0.994); Grad-CAM and SHAP interpretability analyses confirm that the model's decisions align with clinical and pathological criteria.
Link: https://arxiv.org/abs/2604.17122
Authors: Aditya Shribhagwan Khandelwal, Mohammad Samar Ansari, Asra Aslam
Affiliations: University of Sheffield; University of Chester
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Breast cancer is a leading cause of cancer-related mortality worldwide, and timely accurate diagnosis is critical to improving survival outcomes. While convolutional neural networks (CNNs) have demonstrated strong performance on histopathology image classification, and machine learning models on structured electronic health records (EHR) have shown utility for clinical risk stratification, most existing work treats these modalities in isolation. This paper presents a systematic multimodal framework that integrates patch-level histopathology features from the BreCaHAD dataset with structured clinical data from MIMIC-IV. We train and evaluate unimodal image models (a simple CNN baseline and ResNet-18 with transfer learning), unimodal tabular models (XGBoost and a multilayer perceptron), and an intermediate-fusion model that concatenates latent representations from both modalities. ResNet-18 achieves near-perfect accuracy (1.000) and AUC (1.000) on three-class patch-level classification, while XGBoost achieves 98% accuracy on the EHR prediction task. The intermediate fusion model yields a macro-average AUC of 0.997, outperforming all unimodal baselines and delivering the largest improvements on the diagnostically critical but class-imbalanced mitosis category (AUC 0.994). Grad-CAM and SHAP interpretability analyses validate that model decisions align with established pathological and clinical criteria. Our results demonstrate that multimodal integration delivers meaningful improvements in both predictive performance and clinical transparency.
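Intermediate fusion as described above reduces to concatenating per-sample latent vectors from the two modality branches before a shared classification head. The latent sizes below are arbitrary illustrative choices, not the paper's configuration:

```python
import numpy as np

def intermediate_fusion(image_latent, tabular_latent):
    # Concatenate per-sample latent representations from the image branch
    # and the tabular branch along the feature axis; a classifier head
    # would then operate on the fused vector.
    return np.concatenate([image_latent, tabular_latent], axis=-1)

# Hypothetical batch of 4 samples: 16-dim image latents, 8-dim EHR latents.
img_latent = np.zeros((4, 16))
tab_latent = np.ones((4, 8))
fused = intermediate_fusion(img_latent, tab_latent)  # shape (4, 24)
```

Fusing at the latent level (rather than at the input or decision level) lets each branch stay specialized while the head learns cross-modal interactions.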
[CV-165] Inference-Time Temporal Probability Smoothing for Stable Video Segmentation with SAM2 under Weak Prompts
【Quick Read】: This paper addresses the temporal instability of SAM2-based interactive video segmentation under weak user supervision (e.g., sparse point prompts on a single frame), including flickering boundaries, object dropout, and inconsistent object extents across frames, which undermine reliability in downstream video understanding and control tasks. The key of the solution is an inference-time temporal probability smoothing method that operates directly on per-frame segmentation probability maps: optical-flow-guided motion alignment, pixel-wise uncertainty estimated from segmentation entropy, and forward-backward flow-consistency constraints are used to adaptively blend current-frame predictions with motion-aligned historical estimates, markedly improving temporal consistency without modifying the model or retraining, while preserving spatial accuracy.
Link: https://arxiv.org/abs/2604.17115
Authors: Dawar Jyoti Deka
Affiliations: Indian Institute of Technology Bombay
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Interactive video segmentation models such as SAM2 have demonstrated strong generalization across diverse visual domains. However, under weak user supervision, for example, when sparse point prompts are provided on a single frame, their predictions often suffer from temporal instability, including flickering boundaries, object dropout, and inconsistent object extents across frames. These issues limit their reliability in downstream video understanding and control applications. In this paper, we propose an inference-time temporal probability smoothing method that improves the temporal stability of SAM2-based video segmentation without retraining or architectural modification. Our approach operates directly on per-frame segmentation probability maps and leverages optical-flow-based motion warping together with pixel-wise uncertainty estimates derived from segmentation entropy and forward-backward flow consistency. These signals are used to adaptively blend current-frame predictions with motion-aligned historical estimates, yielding temporally coherent segmentation outputs under weak prompts. We evaluate the proposed method on four diverse video sequences using a comprehensive set of frame-wise and temporal stability metrics, including motion-compensated IoU, boundary consistency, object persistence, and area volatility. Experimental results demonstrate consistent improvements in temporal stability over vanilla SAM2 inference while preserving spatial accuracy. The proposed framework is lightweight, model-agnostic, and well-suited for real-time, interactive video segmentation.
[CV-166] From Clinical Intent to Clinical Model: An Autonomous Coding-Agent Framework for Clinician-driven AI Development
【Quick Read】: This paper targets the inefficiency and requirement misalignment of the clinician-plus-AI-team collaboration paradigm in clinical AI development, where clinicians must communicate repeatedly with specialized AI teams before their needs become executable models, a process that is time-consuming and prone to misunderstanding across expertise boundaries. The key of the solution is to use autonomous coding agents so that clinicians can develop clinical AI models independently through natural-language interaction alone, cutting communication overhead and dependence on dedicated AI developers. Experiments show the system builds effective models from clinician instructions across a range of clinical tasks; in a debiased pneumothorax classification task it substantially reduces reliance on confounders such as chest drains, demonstrating the feasibility and promise of this approach for clinician-driven AI development.
Link: https://arxiv.org/abs/2604.17110
Authors: Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Affiliations: University Hospital Aachen; TU Dresden; TUD Dresden University of Technology; National Center for Tumor Diseases (NCT)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL
Abstract:Clinical AI development has traditionally followed a collaborative paradigm that depends on close interaction between clinicians and specialized AI teams. This paradigm imposes a practical challenge: clinicians must repeatedly communicate and refine their requirements with AI developers before those requirements can be translated into executable model development. This iterative process is time-consuming, and even after repeated discussion, misalignment may still exist because the two sides do not fully share each other’s expertise. However, autonomous coding agents may change this paradigm, raising the possibility that clinicians could develop clinical AI models independently through natural-language interaction alone. In this study, we present such an autonomous prototype for clinician-driven clinical AI development. We evaluated the system on five clinical tasks spanning dermoscopic lesion classification, melanoma-versus-nevus triage, wrist-fracture detection (including a weakly supervised variant with only 5% bounding-box annotations), and debiased pneumothorax classification on chest radiographs. Across these settings, the system consistently developed models from clinician requests and achieved promising performance. Notably, in a debiased pneumothorax classification task on chest radiographs, where chest drains can act as a major confounder, the system successfully mitigated shortcut learning and nearly halved the model’s reliance on chest drains. These findings provide proof of concept that autonomous coding agents may help shift clinical AI development toward a more clinician-driven paradigm, reducing the communication overhead and dependence on specialized AI developers. Although further validation and robustness assessment are needed, this study suggests a promising path toward making clinical AI development more accessible.
[CV-167] Hybrid Multi-Dimensional MRI Prostate Cancer Detection via Hadamard Network-Based Bias Correction and Residual Networks
【Quick Read】: This paper addresses the urgent need for accurate, automated AI-based detection in prostate cancer (PCa) diagnosis, in particular the intensity inhomogeneity (bias field) that contaminates quantitative tissue-composition maps obtained from multi-dimensional magnetic resonance imaging (MRI). The key of the solution is a two-stage AI framework, Hadamard-Bias Network plus ResNet18 (HBR-Net-18): the first stage corrects bias fields in six parametric HM-MRI maps with a Hadamard U-Net-based algorithm; the second stage classifies overlapping 11x11 patches with a ResNet-18 model, fusing 2D intra-slice and 3D adjacent-slice spatial information to strengthen spatial consistency, significantly improving sensitivity and specificity over conventional radiomics approaches and baseline convolutional neural network (CNN) models.
Link: https://arxiv.org/abs/2604.17107
Authors: Emadeldeen Hamdan, Gorkem Durak, Muhammed Enes Tasci, Abel Lorente Campos, Aritrick Chatterjee, Roger Engelmann, Gregory Karczma, Aytekin Oto, Ahmet Enis Cetin, Ulas Bagci
Affiliations: University of Illinois Chicago; Northwestern University; University of Chicago
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This paper is accepted at the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)
Abstract:Magnetic Resonance Imaging (MRI) is vital for prostate cancer (PCa) diagnosis. While advanced techniques such as Hybrid Multi-dimensional MRI (HM-MRI) have enhanced diagnostic capabilities, the significant need remains for robust, automated Artificial Intelligence (AI)-based detection methods. In this study, we combine quantitative HM-MRI of tissue composition with an AI-based neural network. We propose the Hadamard-Bias Network plus ResNet18 (HBR-Net-18), a two-stage AI framework for PCa detection. In the first stage, a Hadamard U-Net-based algorithm suppresses intensity inhomogeneities (bias fields) across six parametric HM-MRI maps generated via a Physics-Informed Autoencoder (PIA). In the second stage, a Residual Network (ResNet-18) performs patch-level classification. The framework utilizes overlapping 11-by-11 patches, incorporating both 2D intra-slice and 3D inter-slice (adjacent-slice) information to improve spatial consistency. Our experimental results demonstrate that HBR-Net-18 achieves balanced sensitivity and specificity, significantly outperforming conventional radiomics-based approaches and baseline CNN models, highlighting its potential for clinical deployment.
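Extracting overlapping 11-by-11 patches for patch-level classification can be sketched directly. The stride and toy image size below are illustrative assumptions (a stride smaller than the patch size is what makes the patches overlap):

```python
import numpy as np

def extract_patches(image, size=11, stride=1):
    # Slide a size x size window over the image; stride < size yields
    # overlapping patches, as used for dense patch-level classification.
    H, W = image.shape
    patches = [image[r:r + size, c:c + size]
               for r in range(0, H - size + 1, stride)
               for c in range(0, W - size + 1, stride)]
    return np.stack(patches)

# Toy 13x13 map: (13 - 11 + 1)^2 = 9 overlapping 11x11 patches.
img = np.arange(169, dtype=float).reshape(13, 13)
patches = extract_patches(img, size=11, stride=1)
```

Each patch would then be fed to the classifier independently, with the overlap providing the spatial redundancy that smooths per-pixel predictions.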
[CV-168] Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition
【Quick Read】: This paper addresses the long-standing separation between human action recognition and motion generation, which lack a unified modeling framework; in particular, the dependence of motion generation on semantic understanding has been under-exploited. The key of the solution is CoAMD (Coordinates-based Autoregressive Motion Diffusion), an autoregressive motion diffusion model built on skeleton coordinates, in which a Multi-modal Action Recognizer (MAR) provides gradient-based semantic-consistency guidance for coarse-to-fine motion synthesis. Within one framework the method covers four tasks, skeleton-based action recognition, text-to-motion generation, text-motion retrieval, and motion editing, achieving clear performance gains and demonstrating its versatility.
Link: https://arxiv.org/abs/2604.17090
Authors: Jidong Kuang, Hongsong Wang, Jie Gui
Affiliations: Southeast University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Human action recognition and motion generation are two active research problems in human-centric computer vision, both aiming to align motion with textual semantics. However, most existing works study these two problems separately, without uncovering the links between them, namely that motion generation requires semantic comprehension. This work investigates unified action recognition and motion generation by leveraging skeleton coordinates for both motion understanding and generation. We propose Coordinates-based Autoregressive Motion Diffusion (CoAMD), which synthesizes motion in a coarse-to-fine manner. As a core component of CoAMD, we design a Multi-modal Action Recognizer (MAR) that provides gradient-based semantic guidance for motion generation. Furthermore, we establish a rigorous benchmark by evaluating baselines on absolute coordinates. Our model can be applied to four important tasks, including skeleton-based action recognition, text-to-motion generation, text-motion retrieval, and motion editing. Extensive experiments on 13 benchmarks across these tasks demonstrate that our approach achieves state-of-the-art performance, highlighting its effectiveness and versatility for human motion modeling. Code is available at this https URL.
[CV-169] EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling CVPR2026
【Quick Read】: This paper addresses the inference inefficiency of multimodal large language models (MLLMs) caused by the large number of visual tokens in high-resolution or multi-image scenarios. The key of the proposed EvoComp framework is a lightweight encoder-only transformer compressor that selects the most informative, non-redundant visual tokens by jointly considering visual and textual context; supervision comes from an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss while enforcing semantic diversity via vocabulary-based token grouping, with training driven by a GHM loss combined with a cosine-similarity regularizer, striking a balance between aggressive compression and accuracy preservation.
Link: https://arxiv.org/abs/2604.17087
Authors: Jiafei Song, Fengwei Zhou, Jin Qu, Wenjin Jason Li, Tong Wu, Gengjian Xue, Zhikang Zhao, Daomin Wei, Yichao Lu, Bailin Na
Affiliations: OPPO CTG
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by CVPR 2026
Abstract:Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM’s output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.
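The evolutionary labeling idea, searching for a token subset that minimizes a black-box loss, can be caricatured as a simple mutate-and-keep-if-better loop. The toy loss below stands in for the MLLM's output loss, and this sketch omits the paper's vocabulary-grouping diversity constraint entirely:

```python
import random

def evolve_token_subset(n_tokens, k, loss_fn, generations=200, seed=0):
    # Toy evolutionary search: start from a random size-k subset of token
    # indices, mutate one index per generation, and keep the mutation
    # whenever it lowers the (black-box) loss.
    rng = random.Random(seed)
    best = rng.sample(range(n_tokens), k)
    best_loss = loss_fn(best)
    for _ in range(generations):
        cand = best[:]
        cand[rng.randrange(k)] = rng.randrange(n_tokens)
        if len(set(cand)) == k:          # keep indices distinct
            cand_loss = loss_fn(cand)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
    return sorted(best), best_loss

# Stand-in objective: prefer low-index tokens (a real system would score
# subsets by the MLLM's output loss on the compressed input).
best, loss = evolve_token_subset(n_tokens=10, k=3, loss_fn=sum)
```

The loop only ever accepts strictly improving mutations, so the returned loss is never worse than the initial random subset's.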
[CV-170] D-Prism: Differentiable Primitives for Structured Dynamic Modeling CVPR2026
【Quick Read】: This paper tackles the difficulty of jointly modeling geometry and rigid motion for structured dynamic objects such as multi-part assemblies and jointed mechanisms. Existing approaches such as deformable meshes or 3D Gaussian Splatting (3DGS) rely on unstructured representations and struggle to model geometry and articulated motion together, while primitive-based methods excel on static scenes but their dynamic potential has been unexplored. The key of the proposed D-Prism framework is to extend differentiable primitives to the dynamic domain for the first time: 3DGS is bound to primitive surfaces to combine their respective strengths in appearance and geometry, a deformation network controls primitive motion for accurate motion tracking, and an adaptive control strategy dynamically adjusts the primitive count to better match the object's true spatial footprint.
Link: https://arxiv.org/abs/2604.17082
Authors: Xingyuan Yu, Yijin Li, Chong Zeng, Yuhang Ming, Hujun Bao, Guofeng Zhang
Affiliations: Zhejiang University; Stanford University; Hangzhou Dianzi University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026. Project page: this https URL
Abstract:Capturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain. Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object’s movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive counts, better matching objects’ true spatial footprint. Experiments confirm that our method excels at structured dynamic modeling, providing both structured geometry and precise motion tracking.
[CV-171] Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment
【速读】:该论文旨在解决当前AI生成内容视频质量评估(AIGC-VQA)方法中忽视视频间潜在关联性的问题,即现有方法通常独立分析每段视频,未能利用不同视频之间的语义关系来提升评估准确性。其解决方案的关键在于提出一种参考感知的视频质量评估框架(RefVQA),将AIGC-VQA建模为一个参考感知的评价问题,通过构建以查询视频为中心的参考图(query-centered reference graph)组织语义相关样本,并基于图结构从参考节点到查询节点进行差异聚合,从而融合内在视频特征与跨视频比较信息,更贴合人类感知机制。实验表明,该方法在多个质量维度上优于现有最先进方法,且具备良好的跨数据集泛化能力。
链接: https://arxiv.org/abs/2604.17074
作者: Minghao Zou,Gen Liu,Guanghui Yue,Baoquan Zhao,Zhihua Wang,Paul L. Rosin,Hantao Liu,Wei Zhou
机构: Cardiff University (卡迪夫大学); Shandong University of Science and Technology (山东科技大学); Shenzhen University (深圳大学); Sun Yat-sen University (中山大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of generative models has led to a growing volume of AI-generated videos, making the automatic quality assessment of such videos increasingly important. Existing AI-generated content video quality assessment (AIGC-VQA) methods typically estimate visual quality by analyzing each video independently, ignoring potential relationships among videos. In this work, we revisit AIGC-VQA from an inter-video perspective and formulate it as a reference-aware evaluation problem. Through this formulation, quality assessment is guided not only by intrinsic video characteristics but also by comparisons with related videos, which is more consistent with human perception. To validate its effectiveness, we propose Reference-aware Video Quality Assessment (RefVQA), which utilizes a query-centered reference graph to organize semantically related samples and performs graph-guided difference aggregation from the reference nodes to the query node. Experiments on existing datasets demonstrate that our proposed RefVQA outperforms state-of-the-art methods across multiple quality dimensions, with strong generalization ability validated by cross-dataset evaluation. These results highlight the effectiveness of the proposed reference-based formulation and suggest its potential to advance AIGC-VQA.
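「以查询为中心的参考图 + 差异聚合」的直觉可以用一个极简的数值示意来表达:按特征相似度给参考视频加权,再把质量分数差异聚合回查询节点。以下实现纯属示意,特征、分数与 `alpha` 均为虚构,论文中的聚合发生在学习到的图特征上:

```python
import math

def refine_with_references(query_feat, base_score, references, alpha=0.5):
    """references: 语义相关视频的 (特征向量, 质量分数) 列表,
    作为论文中"以查询为中心的参考图"的玩具替身。"""
    if not references:
        return base_score
    # 边权重:对查询-参考特征相似度做 softmax
    sims = [sum(q * r for q, r in zip(query_feat, feat)) for feat, _ in references]
    exps = [math.exp(s) for s in sims]
    weights = [e / sum(exps) for e in exps]
    # 图引导的差异聚合:把内在分数向最相似参考节点的分数方向修正
    correction = sum(w * (score - base_score)
                     for w, (_, score) in zip(weights, references))
    return base_score + alpha * correction

refs = [([1.0, 0.0], 4.0),   # 高度相似的参考视频,评分 4.0
        ([0.0, 1.0], 1.0)]   # 不相似的参考视频,评分 1.0
score = refine_with_references([1.0, 0.0], 3.0, refs)
```

相似参考的权重更大,因此修正方向主要由它决定:查询分数被从 3.0 轻微上调。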
[CV-172] NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge Report CVPR2026
【速读】:该论文旨在解决近岸危险流——破浪流(rip current)在图像中的自动识别与分割问题,以提升海滩安全。破浪流因其视觉特征在不同海滩、视角和海况下变化显著,导致传统方法难以准确识别。解决方案的关键在于构建一个涵盖10余个国家、4种摄像头视角及多样化环境条件的大规模数据集,并基于RipVIS基准进行检测与分割双任务评估;参赛方法普遍采用预训练模型结合强数据增强和后处理设计,表明当前通用视觉模型的进步对破浪流理解具有显著促进作用,但针对其独特视觉结构的定制化方法仍有较大提升空间。
链接: https://arxiv.org/abs/2604.17070
作者: Andrei Dumitriu,Aakash Ralhan,Florin Miron,Florin Tatui,Radu Tudor Ionescu,Radu Timofte,Abdullah Naeem,Anav Katwal,Ayon Dey,Md Tamjidul Hoque,Asuka Shin,Hiroto Shirono,Kosuke Shigematsu,Gaurav Mahesh,Anjana Nanditha,Jiji CV,Akbarali Vakhitov,Sang-Chul Lee,Xinger Li,Chun’an Yu,Junhao Chen,Yang Yang,Gundluri Yuvateja Reddy,Harshitha Palaram,Gejalakshmi N,Jeevitha S,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi,Amitabh Tripathi,Modugumudi Mahesh,Santosh Kumar Vipparthi,Subrahmanyam Murala
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Challenge report paper from NTIRE Workshop at CVPR 2026
Abstract:This report presents the NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge, which targets automatic rip current understanding in images. Rip currents are hazardous nearshore flows that cause many beach-related fatalities worldwide, yet remain difficult to identify because their visual appearance varies substantially across beaches, viewpoints, and sea states. To advance research on this safety-critical problem, the challenge builds on the RipVIS benchmark, evaluating both detection and segmentation. The dataset is diverse, sourced from more than 10 countries, with 4 camera orientations and diverse beach and sea conditions. This report describes the dataset, challenge protocol, evaluation methodology, final results, and summarizes the main insights from the submitted methods. The challenge attracted 159 registered participants and produced 9 valid test submissions across the two tasks. Final rankings are based on a composite score that combines F_1[50], F_2[50], F_1[40:95], and F_2[40:95]. Most participant solutions relied on pretrained models, combined with strong augmentation and post-processing design. These results suggest that rip current understanding benefits strongly from progress in robust general-purpose vision models, while leaving ample room for future methods tailored to the unique visual structure of rip currents.
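报告中的综合得分由四个 F 指标组合而成(F1 与 F2 分别在 IoU=0.50 和 IoU 0.40:0.95 平均口径下计算)。下面按「等权平均」给出一个示意计算;具体加权方式以官方评测代码为准,此处的等权与示例数值均为假设:

```python
def f_beta(precision, recall, beta):
    # F-beta:beta>1 时召回率权重更高(F2 更看重不漏检)
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def composite_score(metrics):
    """metrics: 按阈值给出的 (precision, recall) 字典。
    '50' 表示 IoU 0.50;'40:95' 表示在 IoU 0.40..0.95 上的平均。
    此处对 F1[50]、F2[50]、F1[40:95]、F2[40:95] 取等权平均(假设)。"""
    p50, r50 = metrics["50"]
    pavg, ravg = metrics["40:95"]
    parts = [f_beta(p50, r50, 1), f_beta(p50, r50, 2),
             f_beta(pavg, ravg, 1), f_beta(pavg, ravg, 2)]
    return sum(parts) / len(parts)

score = composite_score({"50": (0.8, 0.6), "40:95": (0.7, 0.5)})
```

对安全关键的离岸流预警,纳入 F2 意味着漏检(低召回)比误报受到更重的惩罚。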
[CV-173] BasketHAR: A Multimodal Dataset for Human Activity Recognition and Sport Analysis in Basketball Training Scenarios
【速读】:该论文旨在解决现有人体活动识别(Human Activity Recognition, HAR)数据集多局限于基础动作(如行走、站立等),难以满足专业场景(如篮球训练分析)需求的问题。其解决方案的关键在于构建了一个名为BasketHAR的新型多模态HAR数据集,专为篮球训练设计,包含惯性测量单元(inertial measurement units, IMUs)采集的加速度、角速度、磁场强度、心率和皮肤温度等生理与运动信号,并同步记录视频数据,从而实现对专业级篮球动作的全面表征。此外,研究还提供了一种基线多模态对齐方法用于性能评估,验证了该数据集在复杂活动识别任务中的适用性与潜力。
链接: https://arxiv.org/abs/2604.17065
作者: Xian Gao,Haoyue Zhang,Zongyun Zhang,Jiacheng Ruan,Ting Liu,Yuzhuo Fu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 7 figures
Abstract:Human Activity Recognition (HAR) involves the automatic identification of user activities and has gained significant research interest due to its broad applicability. Most HAR systems rely on supervised learning, which necessitates large, diverse, and well-annotated datasets. However, existing datasets predominantly focus on basic activities such as walking, standing, and stair navigation, limiting their utility in specialized contexts like sports performance analysis. To address this gap, we present BasketHAR, a novel multimodal HAR dataset tailored for basketball training, encompassing a diverse set of professional-level actions. BasketHAR includes comprehensive motion data from inertial measurement units (accelerometers and gyroscopes), angular velocity, magnetic field, heart rate, skin temperature, and synchronized video recordings. We also provide a baseline multimodal alignment method to benchmark performance. Experimental results underscore the dataset’s complexity and suitability for advanced HAR tasks. Furthermore, we highlight its potential applications in the analysis of basketball training sessions and in the generation of specialized performance reports, representing a valuable resource for future research in HAR and sports analytics. The dataset is publicly accessible at this https URL, licensed under Apache License 2.0.
[CV-174] Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition ICASSP2026
【速读】:该论文旨在解决零样本动作识别(zero-shot action recognition)中因“已见类别”与“未见类别”之间语义鸿沟(semantic gap)导致的性能瓶颈问题。其解决方案的关键在于:提出一种基于CLIP(Contrastive Language–Image Pretraining)的改进框架,通过引入解耦嵌入(disentangled embeddings)和语义引导交互机制,实现更精准的动作表征。具体而言,运动分离模块(Motion Separation Module, MSM)将运动敏感特征与全局静态特征解耦,而运动聚合块(Motion Aggregation Block, MAB)则利用门控交叉注意力机制细化运动表示,避免冗余信息重新耦合;同时,通过投影嵌入与正向文本提示对齐,并借助负向提示显式建模“非类别”语义,增强视频特征与文本表示之间的语义一致性,从而显著提升模型在未见类别上的泛化能力。
链接: https://arxiv.org/abs/2604.17062
作者: Yiming Wang,Frederick W. B. Li,Jingyun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures, accepted by ICASSP 2026
Abstract:Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model “non-class” semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse and fine-grained datasets.
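「正提示对齐 + 负提示建模非类别语义」可以简化成如下打分规则:每个类别的得分等于视频特征与正提示的相似度减去与该类负提示(non-class)的相似度。示例中的文本嵌入均为虚构的二维向量,仅作演示:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(video_feat, pos_prompts, neg_prompts):
    """pos_prompts / neg_prompts: {类别名: 文本嵌入}。
    得分 = 与正提示的对齐度 - 与"非类别"负提示的对齐度
    (论文负提示语义的玩具版本;嵌入为虚构)。"""
    scores = {c: cosine(video_feat, pos_prompts[c])
                 - cosine(video_feat, neg_prompts[c])
              for c in pos_prompts}
    return max(scores, key=scores.get), scores

pos = {"jump": [1.0, 0.0], "run": [0.0, 1.0]}
neg = {"jump": [-1.0, 0.0], "run": [0.0, -1.0]}
pred, scores = zero_shot_classify([0.9, 0.2], pos, neg)
```

负提示项使得「明确不像某类」的视频在该类上被额外压低,从而拉开与未见类别间的判别边界。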
[CV-175] mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval WACV2026
【速读】:该论文旨在解决现有方法在处理可缩放矢量图形(SVG)时普遍存在的问题:即多数方法将SVG栅格化并丢弃其符号结构信息,同时当前强大的文本嵌入技术难以自然扩展到视觉或结构化模态。为实现对结构感知的多模态检索,作者提出一种无需训练、受指令引导的多模态嵌入框架,利用多模态大语言模型(MLLM)将文本、位图图像和SVG代码映射到对齐的嵌入空间。该方案的关键在于两个核心组件:(1) 多模态显式单字限制(mEOL),通过指令引导MLLM将任意多模态输入压缩为一个词 token,其隐藏状态作为紧凑语义嵌入;(2) 语义SVG重写模块,基于渲染图像进行视觉推理,为SVG元素分配有意义的标识符并简化嵌套结构,从而揭示原始代码中隐藏的几何与关系线索。此方法无需学习投影头或对比训练,仅靠提示控制即可实现结构敏感的多模态对齐。
链接: https://arxiv.org/abs/2604.17054
作者: Kyeong Seon Kim,Baek Seong-Eun,Lee Jung-Mok,Tae-Hyun Oh
机构: KAIST; POSTECH
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Round 1 early acceptance to WACV 2026, Project page: this https URL
Abstract:Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encode rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding. (2) A semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: this https URL
[CV-176] OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning CVPR2026
【速读】:该论文旨在解决流式视频推理(streaming video reasoning)中长期记忆管理的难题,即在历史数据无限增长但有效证据稀缺的场景下,如何高效地定位和利用关键信息,避免因盲目扩大记忆容量导致冗余干扰或过度压缩造成重要信号丢失。其解决方案的关键在于提出OASIS框架,通过结构化、按需检索的方式组织流式历史为分层事件,并采用“短上下文推理优先 + 不确定性触发语义驱动检索”的两阶段机制,使检索基于高层意图而非嵌入相似性,从而显著提升检索准确性与低噪声特性;该方法无需训练、可插拔,适用于不同流式多模态大模型(streaming MLLM)骨干网络,在控制token消耗和请求延迟的同时实现长时程准确性和组合推理能力的显著提升。
链接: https://arxiv.org/abs/2604.17052
作者: Zhijia Liang,Jiaming Li,Weikai Chen,Yanhao Zhang,Haonan Lu,Guanbin Li
机构: Sun Yat-sen University (中山大学); OPPO AI Center (OPPO人工智能研究中心); Shenzhen Loop Area Institute (深圳 loop 区域研究院); Guangdong Key Laboratory of Big Data Analysis and Processing (广东省大数据分析与处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Streaming video reasoning requires models to operate in a setting where history grows without bound while meaningful evidence remains scarce. In such a landscape, relevant signal is like an oasis-small, critical, and easily lost in a desert of redundancy. Enlarging memory only widens the desert; aggressive compression dries up the oasis. The real difficulty lies in discovering where to look, not how much to remember. We therefore introduce OASIS, a novel framework for streaming video reasoning that tackles this challenge through structured, on-demand retrieval. It organizes streaming history into hierarchical events and performs reasoning as controlled refinement-short-context inference first, followed by semantically grounded retrieval only when uncertainty arises. As the retrieval is driven by high-level intent rather than embedding similarity, the retrieved memory is substantially more accurate and less noisy. Additionally, the mechanism is plug-and-play, training-free, and readily attaches to different streaming MLLM backbones. Experiments across multiple benchmarks and backbones show that OASIS achieves strong gains in long-horizon accuracy and compositional reasoning with bounded token cost and low request delay. Code is available at this https URL.
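OASIS「先短上下文推理、不确定时再按需检索」的两段式控制流可以抽象成如下示意。其中 `answer_fn`、置信度阈值与基于词重叠的检索排序都是虚构的简化:论文中的检索由高层意图驱动,作用于分层事件记忆,而非简单的词面匹配:

```python
def answer_stream_query(question, recent_events, event_memory,
                        answer_fn, threshold=0.7):
    """两阶段控制流(OASIS 的玩具替身):
    1) 只用近期事件做短上下文推理;
    2) 若模型置信度低,再从事件记忆中检索相关事件后重答。
    answer_fn(question, events) -> (answer, confidence) 代表流式 MLLM。"""
    answer, conf = answer_fn(question, recent_events)
    if conf >= threshold:
        return answer, "short-context"
    # 按需检索:这里用问句词重叠对记忆中的事件排序(仅为示意)
    q_words = set(question.lower().split())
    ranked = sorted(event_memory,
                    key=lambda ev: len(q_words & set(ev.lower().split())),
                    reverse=True)
    answer, conf = answer_fn(question, recent_events + ranked[:2])
    return answer, "retrieval"

def toy_model(question, events):
    # 虚构模型:只有当事件提到被询问的对象时才高置信
    hits = [e for e in events if "red car" in e]
    return (hits[0] if hits else "unknown", 0.9 if hits else 0.2)

memory = ["a red car parks near the gate", "a dog crosses the street"]
ans, mode = answer_stream_query("where did the red car go?",
                                ["pedestrians walk by"], memory, toy_model)
```

只有在不确定时才触发检索,这正是摘要中「控制 token 开销与请求延迟」的来源。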
[CV-177] A Real-Time Bike-Pedestrian Safety System with Wide-Angle Perception and Evaluation Testbed for Urban Intersections
【速读】:该论文旨在解决城市交叉路口中骑行者与行人碰撞事故频发的问题,特别是针对未配备专用设备的路权使用者缺乏实时预警系统的现状。解决方案的关键在于构建一个基于单个边缘计算设备(搭载广角鱼眼相机)的碰撞预警系统,其核心创新包括:1)提出一种针对超广角鱼眼镜头的标定流程,通过透视重映射和直接束调整克服角点检测失败与优化器发散问题;2)结合鱼眼感知的目标检测与基于预计算查找表的闭式地面投影方法,实现高精度的物体定位;3)引入设计阶段的合规性仿真测试,涵盖24种脚本化危险场景、随机尺寸感知的检测失败模型及延迟扫描,验证一阶运动学预测器在真实摄像头延迟下仍能保持平均预警时间超过分心行人反应时间;4)将决策层形式化为可分离、可审计的测试平台,包含明确的部署门禁、可申诉(contestability)机制和残余风险登记表,从而保障系统安全性和可解释性。实测结果显示,在鱼眼定位误差下的合规测试中,该方案达到93.3%灵敏度和92.3%特异性,平均预警时间为3.3秒。
链接: https://arxiv.org/abs/2604.17046
作者: Mehmet Kerem Turkcan
机构: Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Collisions between cyclists and pedestrians at urban intersections remain a persistent source of injuries, yet few systems attempt real-time warnings to unequipped road users using commodity hardware. We present a prototype collision warning system that runs on a single edge device with a wide-angle fisheye camera, producing audible and visual alerts at 30 fps. The system makes four contributions. First, we develop a calibration pipeline for ultra-wide fisheye lenses that overcomes corner-detection failure and optimizer divergence through perspective remapping and direct bundle adjustment. Second, we combine fisheye-aware object detection with a closed-form ground-plane projection via a precomputed lookup table. Third, we introduce a design-time conformance simulation with 24 scripted hazard scenarios, stochastic size-aware detection failures, and a latency sweep showing that a first-order kinematic predictor maintains the mean warning budget above the distracted-pedestrian reaction time across realistic camera latencies. Fourth, we formalize the decision layer as a separable, auditable testbench with explicit deployment gates, contestability mechanisms, and a residual risk register. Under conformance testing with fisheye localization error, the selected pipeline configuration achieves 93.3% sensitivity and 92.3% specificity, with a mean warning budget of 3.3 s. The system design was informed by community-aided design workshops. Code and replication scripts are available at this https URL.
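摘要中的两个关键构件(预计算查找表的地面投影、一阶常速度运动学预测)可以用下面的玩具代码示意。`toy_projection` 是虚构的平面映射而非真实鱼眼标定,1 m 的碰撞半径也是本文假设:

```python
def build_lut(width, height, pixel_to_ground):
    """离线预计算每个像素的地面平面投影,运行时每帧只做 O(1) 查表
    (对应论文的"闭式投影 + 查找表";投影函数此处为虚构)。"""
    return [[pixel_to_ground(u, v) for u in range(width)]
            for v in range(height)]

def toy_projection(u, v):
    # 虚构标定:图像 v 轴映射到纵深,u 轴映射到横向偏移,单位米
    return (u * 0.1 - 3.2, (63 - v) * 0.2 + 1.0)

def time_to_collision(p_cyclist, v_cyclist, p_pedestrian,
                      horizon=5.0, dt=0.1):
    """一阶(常速度)预测器:向前步进骑行者位置,
    返回首次进入行人 1 m 范围的时刻;视界内无碰撞则返回 None。"""
    x, y = p_cyclist
    t = 0.0
    while t <= horizon:
        dx, dy = x - p_pedestrian[0], y - p_pedestrian[1]
        if (dx * dx + dy * dy) ** 0.5 < 1.0:
            return t
        x += v_cyclist[0] * dt
        y += v_cyclist[1] * dt
        t += dt
    return None

lut = build_lut(64, 64, toy_projection)
cyclist = lut[10][32]   # 像素 (u=32, v=10) 处检测到的骑行者的地面坐标
ttc = time_to_collision(cyclist, (0.0, -2.0), p_pedestrian=(0.0, 2.0))
```

查表把逐帧的投影开销压缩为一次内存访问,这对在边缘设备上维持 30 fps 至关重要。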
[CV-178] SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models CVPR2026
【速读】:该论文旨在解决大规模视觉语言模型(Large Vision-Language Models, LVLMs)在公开可用背景下面临的未经授权复用和知识产权侵权问题。现有所有权验证方法通常依赖语义异常查询或分布外响应作为指纹,但这些方法易被攻击者检测并移除。为应对这一挑战,作者提出一种非侵入式框架SIF(Semantically In-Distribution Fingerprints),其核心创新在于引入语义对齐的指纹蒸馏(Semantic-Aligned Fingerprint Distillation, SAFD)与鲁棒指纹优化(Robust-Fingerprint Optimization, RFO)。SAFD将文本水印信号迁移至视觉模态,生成语义一致且带有指纹的响应;RFO通过模拟最坏情况下的表示扰动,增强指纹对微调、量化等模型修改的鲁棒性,从而实现隐蔽性强、抗干扰能力高的版权保护方案。
链接: https://arxiv.org/abs/2604.17041
作者: Yifei Zhao,Qian Lou,Mengxin Zheng
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026
Abstract:The public accessibility of large vision-language models (LVLMs) raises serious concerns about unauthorized model reuse and intellectual property infringement. Existing ownership verification methods often rely on semantically abnormal queries or out-of-distribution responses as fingerprints, which can be easily detected and removed by adversaries. We expose this vulnerability through a Semantic Divergence Attack (SDA), which identifies and filters fingerprint queries by measuring semantic divergence between a suspect model and a reference model, showing that existing fingerprints are not semantic-preserving and are therefore easy to detect and bypass. To address these limitations, we propose SIF (Semantically In-Distribution Fingerprints), a non-intrusive ownership verification framework that requires no parameter modification. SIF introduces Semantic-Aligned Fingerprint Distillation (SAFD), which transfers text watermarking signals into the visual modality to produce semantically coherent yet fingerprinted responses. In addition, Robust-Fingerprint Optimization (RFO) enhances robustness by simulating worst-case representation perturbations, making the fingerprints resilient to model modifications such as fine-tuning and quantization. Extensive experiments on LLaVA-1.5 and Qwen2.5-VL demonstrate that SIF achieves strong stealthiness and robustness, providing a practical solution for LVLM copyright protection. Code is available at this https URL
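论文揭示的语义发散攻击(SDA)核心是比较可疑模型与干净参考模型在同一查询上的响应语义差异:差异过大的查询即疑似指纹查询。下面用「1 - 余弦相似度」给出一个最小示意,响应嵌入为虚构数值,阈值 `tau` 为假设:

```python
import math

def divergence(emb_a, emb_b):
    # 语义发散度:1 - 余弦相似度
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    na = math.sqrt(sum(a * a for a in emb_a))
    nb = math.sqrt(sum(b * b for b in emb_b))
    return 1.0 - dot / (na * nb)

def flag_fingerprint_queries(suspect_responses, reference_responses, tau=0.5):
    """玩具版语义发散攻击:可疑模型响应与参考模型响应语义偏离
    超过 tau 的查询被标记(随后可被攻击者过滤掉)。
    输入为逐查询的响应嵌入对,嵌入来源此处不作规定。"""
    return [i for i, (s, r) in
            enumerate(zip(suspect_responses, reference_responses))
            if divergence(s, r) > tau]

suspect   = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
reference = [[1.0, 0.1], [1.0, 0.0], [0.7, 0.6]]
flagged = flag_fingerprint_queries(suspect, reference)
```

SIF 的设计目标正是让指纹响应在这种度量下与正常响应不可区分(语义内分布),使上述过滤失效。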
[CV-179] Conditional Evidence Reconstruction and Decomposition for Interpretable Multimodal Diagnosis
【速读】:该论文旨在解决多模态诊断模型在实际应用中因模态缺失而导致的性能下降与可解释性不足的问题。具体而言,现有方法依赖于群体层面或静态先验,难以捕捉个体特异性的跨模态依赖关系,且缺乏对决策依据的清晰解析。其解决方案的关键在于提出条件证据重构与分解(Conditional Evidence Reconstruction and Decomposition, CERD)框架:首先基于每个受试者的已观测模态条件重建缺失模态表示,随后通过logit级归因将诊断证据分解为共享的跨模态一致性证据与模态特异性线索,从而实现鲁棒且结构化的可解释诊断支持。
链接: https://arxiv.org/abs/2604.17030
作者: Shaowen Wan,Yanjun Lv,Lu Zhang,Dajiang Zhu,Bharat Biswal,Tianming Liu,Xiaobo Li,Lin Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Neurobiological and neurodegenerative diseases are inherently multifactorial, arising from coupled influences spanning genetic susceptibility, brain alterations, and environmental and behavioral factors. Multimodal modeling has therefore been increasingly adopted for disease diagnosis by integrating complementary evidence across data sources. However, in both large-scale cohorts and real-world clinical workflows, modality coverage is often incomplete, making many multimodal models brittle when one or more modalities are unavailable. Existing approaches to incomplete multimodal diagnosis typically rely on group-wise or static priors, which may fail to capture subject-specific cross-modal dependencies; moreover, many models provide limited interpretability into which evidence sources drive the final decision. To address these limitations, we propose Conditional Evidence Reconstruction and Decomposition (CERD), a framework for interpretable multimodal diagnosis with incomplete modalities. CERD first reconstructs missing modality representations conditioned on each subject’s observed inputs, then decomposes diagnostic evidence into shared cross-modal corroboration and modality-specific cues via logit-level attribution. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) demonstrate that CERD outperforms competitive baselines under incomplete-modality settings while producing structured and clinically aligned evidence attributions for trustworthy decision support.
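CERD 在 logit 层面把诊断证据分解为「共享跨模态佐证 + 模态特异线索」。下面给出一个可加性归因的玩具版本:把各模态贡献中的最小值视为所有模态共同支持的共享部分,余量视为模态特异线索。这只是示意,并非论文的精确分解方案:

```python
def decompose_evidence(modality_logits):
    """modality_logits: {模态: 对诊断 logit 的贡献}(虚构数值)。
    返回 (total, shared, specific):
    shared 取所有模态都达到的最小正贡献乘以模态数,
    specific 为各模态超出该共同水平的余量。"""
    total = sum(modality_logits.values())
    shared_per_mod = min(modality_logits.values())
    shared = shared_per_mod * len(modality_logits)
    specific = {m: v - shared_per_mod for m, v in modality_logits.items()}
    return total, shared, specific

total, shared, specific = decompose_evidence(
    {"MRI": 1.2, "PET": 0.8, "genetics": 0.9})
```

分解满足恒等式 total = shared + Σ specific,因此每一分 logit 都能追溯到共享佐证或某个具体模态。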
[CV-180] IMA-MoE: An Interpretable Modality-Aware Mixture-of-Experts Framework for Characterizing the Neurobiological Signatures of Binge Eating Disorder
【速读】:该论文旨在解决当前暴食症(Binge Eating Disorder, BED)诊断仍主要依赖症状标准而缺乏生物机制支撑的问题,从而限制了早期识别和基于生物学的干预策略发展。其解决方案的关键在于提出一种可解释的模态感知混合专家模型(Interpretable Modality-Aware Mixture-of-Experts, IMA-MoE),该架构能够统一建模多模态数据(包括神经影像、行为、激素及人口统计学指标),通过将每类测量编码为独立token来灵活捕捉跨模态依赖关系并保留模态特异性特征,并引入token重要性机制以量化各变量对预测结果的贡献,从而实现更准确且具备生物学意义的BED表征与个性化干预基础。
链接: https://arxiv.org/abs/2604.17028
作者: Lin Zhao,Qiaohui Gao,Elizabeth Martin,Kurt P. Schulz,Tom Hildebrandt,Robyn Sysko,Tianming Liu,Xiaobo Li
机构: New Jersey Institute of Technology (新泽西理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Binge eating disorder (BED) is the most prevalent eating disorder. However, current diagnostic frameworks remain largely grounded in symptom-based criteria rather than underlying biological mechanisms, thereby limiting early detection and the development of biologically-informed interventions. Emerging studies have begun to investigate the neurobiological signatures of BED, yet their findings are often difficult to generalize due to the reliance on hypothesis-driven parametric models, single-modality analyses, and limited data diversity. Therefore, there is a critical need for advanced data-driven frameworks capable of modeling multimodal data to uncover generalizable and biologically meaningful signatures of BED. In this study, we propose the Interpretable Modality-Aware Mixture-of-Experts (IMA-MoE), a novel architecture designed to integrate heterogeneous neuroimaging, behavioral, hormonal, and demographic measures within a unified predictive framework. By encoding each measure as a distinct token, IMA-MoE enables flexible modeling of cross-modal dependencies while preserving modality-specific characteristics. We further introduce a token-importance mechanism to enhance interpretability by quantifying the contribution of each measure to model predictions. Evaluated on the large-scale Adolescent Brain Cognitive Development (ABCD) dataset, IMA-MoE demonstrates superior performance in differentiating BED from healthy controls compared with baseline methods, while revealing sex-specific predictive patterns, with hormonal measures contributing more prominently to prediction in females. Collectively, these findings highlight the promise of interpretable, data-driven multimodal modeling in advancing biologically-informed characterization of BED and facilitating more precise and personalized interventions in neuropsychiatric disorders.
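token 重要性机制最简单的口径是:对门控网络分配给各 token(即各项测量指标)的路由权重做样本平均并归一化。以下数值纯属虚构,仅演示这一统计方式,并非论文的确切实现:

```python
def token_importance(gate_weights):
    """gate_weights[样本][token] = MoE 门控分配给该测量 token 的路由权重
    (虚构数值;论文中每项临床/影像测量编码为一个 token)。
    重要性 = 逐 token 的样本均值,再归一化使总和为 1。"""
    n_tokens = len(gate_weights[0])
    means = [sum(sample[t] for sample in gate_weights) / len(gate_weights)
             for t in range(n_tokens)]
    z = sum(means)
    return [m / z for m in means]

# token 顺序(假设):[影像, 激素, 行为]
weights = [[0.5, 0.3, 0.2],
           [0.4, 0.5, 0.1],
           [0.6, 0.3, 0.1]]
importance = token_importance(weights)
```

按性别分组统计这类重要性,即可复现摘要中「激素指标在女性中贡献更突出」一类的分析。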
[CV-181] CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras
【速读】:该论文旨在解决基于多视角图像的查询式3D目标检测方法在高效利用动态多尺度信息方面的挑战,尤其是对象特征与查询几何关系学习不足,以及直接探索多尺度时空特征导致计算开销过大的问题。解决方案的关键在于提出一种新颖的稀疏查询框架CAM3DNet,其核心创新包括三个模块:复合查询(Composite Query, CQ)通过多尺度投影策略将2D查询映射至3D空间;自适应自注意力(Adaptive Self-Attention, ASA)模块学习时空多尺度查询间的交互关系;多尺度混合采样(Multi-Scale Hybrid Sampling, MSHS)模块结合可变形注意力机制,融合多尺度查询、金字塔特征图和2D相机先验知识进行高效采样。该架构以FPN为编码器,YOLOX与DepthNet生成CQ,并通过重复使用ASA和MSHS作为解码器提取检测特征,显著提升了检测性能并降低了冗余计算。
链接: https://arxiv.org/abs/2604.17024
作者: Mingxi Pang,Dingheng Wang,Zekun Li,Zhenping Sun,Bo Wang,Zhihang Wang,Zhao-Xu Yang
机构: National University of Defense Technology (国防科技大学); Northwest Institute of Mechanical Electrical Engineering (西北机电工程研究所); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Query-based 3D object detection methods using multi-view images often struggle to efficiently leverage dynamic multi-scale information: the relationship between object features and the geometry of the queries is not sufficiently learned, and directly exploring multi-scale spatiotemporal features incurs excessive cost. To address these challenges, we propose CAM3DNet, a novel sparse query-based framework which combines three new modules: composite query (CQ), adaptive self-attention (ASA), and multi-scale hybrid sampling (MSHS). First, the core idea of the CQ module is a multi-scale projection strategy that transforms 2D queries into 3D space. Second, the ASA module learns the interactions between the spatiotemporal multi-scale queries. Third, the MSHS module uses the deformable attention mechanism to sample multi-scale object information by considering multi-scale queries, pyramid feature maps, and 2D-camera prior knowledge. The entire model employs a backbone network and a feature pyramid network (FPN) as the encoder, introduces a YOLOX and a DepthNet as an ROI head to produce CQ, and repeatedly applies ASA and MSHS as the decoder to obtain detection features. Extensive experiments on the nuScenes, Waymo, and Argoverse benchmark datasets demonstrate the effectiveness of CAM3DNet, which outperforms most existing camera-based 3D object detection methods. Besides, we conduct comprehensive ablation studies to examine the individual effects of CQ, ASA, and MSHS, as well as their space and computation costs.
[CV-182] LIVE: Leveraging Image Manipulation Priors for Instruction-based Video Editing
【速读】:该论文旨在解决视频编辑数据集在规模、质量和任务多样性方面受限于高标注成本的问题,尤其针对依赖视频生成模型或人工标注时难以获取高质量大规模数据的瓶颈。其解决方案的关键在于提出LIVE框架,通过联合训练策略融合大规模高质量图像编辑数据与视频数据,并引入帧级token噪声策略以缓解静态图像与动态视频之间的域差异——该策略将特定帧的潜在表示视为推理token,利用预训练视频生成模型实现合理的时序变换;同时结合公开数据集清洗和自动化数据流水线,采用两阶段训练策略逐步提升视频编辑能力,从而显著增强模型在复杂视频编辑任务中的表现。
链接: https://arxiv.org/abs/2604.17021
作者: Weicheng Wang,Zhicheng Zhang,Zhongqi Zhang,Juncheng Zhou,Yongjie Zhu,Wenyu Qin,Meng Wang,Pengfei Wan,Jufeng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video editing aims to modify input videos according to user intent. Recently, end-to-end training methods have garnered widespread attention, constructing paired video editing data through video generation or editing models. However, compared to image editing, the high annotation costs of video data severely constrain the scale, quality, and task diversity of video editing datasets when relying on video generative models or manual annotation. To bridge this gap, we propose LIVE, a joint training framework that leverages large-scale, high-quality image editing data alongside video datasets to bolster editing capabilities. To mitigate the domain discrepancy between static images and dynamic videos, we introduce a frame-wise token noise strategy, which treats the latents of specific frames as reasoning tokens, leveraging large pretrained video generative models to create plausible temporal transformations. Moreover, through cleaning public datasets and constructing an automated data pipeline, we adopt a two-stage training strategy to anneal video editing capabilities. Furthermore, we curate a comprehensive evaluation benchmark encompassing over 60 challenging tasks that are prevalent in image editing but scarce in existing video datasets. Extensive comparative and ablation experiments demonstrate that our method achieves state-of-the-art performance. The source code will be publicly available.
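「帧级 token 噪声」策略的骨架如下:只对被指定为「推理 token」的帧潜变量加噪,其余帧保持干净,由预训练视频模型去补全合理的时序变换。示例在普通列表上模拟;真实潜变量位于视频 VAE 空间,噪声调度也远比此复杂:

```python
import random

def frame_wise_token_noise(latents, noisy_frames, sigma=1.0, seed=0):
    """latents: 逐帧潜变量向量列表(虚构数值)。
    noisy_frames 中列出的帧被替换为加噪版本(模型需要"推理"补全的帧),
    其余帧保持干净。"""
    rng = random.Random(seed)
    out = []
    for t, z in enumerate(latents):
        if t in noisy_frames:
            out.append([zi + rng.gauss(0.0, sigma) for zi in z])
        else:
            out.append(list(z))
    return out

clean = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
noised = frame_wise_token_noise(clean, noisy_frames={1})
```

将静态图像编辑样本按此方式「铺展」到带噪的多帧序列中,正是摘要中缓解图像/视频域差异的思路。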
[CV-183] Towards Universal Skeleton-Based Action Recognition
【速读】:该论文旨在解决异构骨架数据下的开放词汇动作识别问题,即在人类与机器人交互场景中,如何有效处理来自不同来源和结构的骨架数据(如人体与类人机器人),并实现对开放词汇表中动作类别的准确识别。其解决方案的关键在于提出一种基于Transformer的模型,包含三个核心组件:统一骨架表示、用于骨架的运动编码器以及多粒度运动-文本对齐机制。其中,运动编码器将多模态骨架嵌入输入双流Transformer架构以学习时空动作表征,并通过多粒度对比学习(全局实例对齐、流特异性对齐和细粒度对齐)将其映射到语义空间并与文本嵌入对齐,从而提升模型在异构骨架数据上的泛化能力和开放词汇识别性能。
链接: https://arxiv.org/abs/2604.17013
作者: Jidong Kuang,Hongsong Wang,Jie Gui
机构: Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the development of robotics, skeleton-based action recognition has become increasingly important, as human-robot interaction requires understanding the actions of humans and humanoid robots. Due to different sources of human skeletons and structures of humanoid robots, skeleton data naturally exhibit heterogeneity. However, previous works overlook the data heterogeneity of skeletons and solely construct models using homogeneous skeletons. Moreover, open-vocabulary action recognition is also essential for real-world applications. To this end, this work studies the challenging problem of heterogeneous skeleton-based action recognition with open vocabularies. We construct a large-scale Heterogeneous Open-Vocabulary (HOV) Skeleton dataset by integrating and refining multiple representative large-scale skeleton-based action datasets. To address universal skeleton-based action recognition, we propose a Transformer-based model that comprises three key components: unified skeleton representation, motion encoder for skeletons, and multi-grained motion-text alignment. The motion encoder feeds multi-modal skeleton embeddings into a two-stream Transformer-based encoder to learn spatio-temporal action representations, which are then mapped to a semantic space to align with text embeddings. Multi-grained motion-text alignment incorporates contrastive learning at three levels: global instance alignment, stream-specific alignment, and fine-grained alignment. Extensive experiments on popular benchmarks with heterogeneous skeleton data demonstrate both the effectiveness and the generalization ability of the proposed method. Code is available at this https URL.
[CV-184] MobileAgeNet: Lightweight Facial Age Estimation for Mobile Deployment
【速读】:该论文旨在解决移动设备上人脸年龄估计(facial age estimation)模型在预测精度、推理延迟和模型体积之间的平衡问题。其关键解决方案是提出一种轻量级的回归框架 MobileAgeNet,该框架基于预训练的 MobileNetV3-Large 主干网络并结合紧凑的回归头,在 UTKFace 数据集上实现 4.65 年的平均绝对误差(MAE),同时保持 14.4 毫秒的平均推理延迟。此外,作者采用有界年龄回归与两阶段微调策略以提升训练稳定性和泛化能力,并构建从 PyTorch 训练到 ONNX 导出再到 TensorFlow Lite 转换的完整部署流水线,确保在实际移动端环境下预测行为无显著退化,从而为移动端人脸年龄估计提供了一个可复现、高效且实用的基准方案。
链接: https://arxiv.org/abs/2604.17007
作者: Arun Kumar,Aswathy Baiju,Radu Timofte,Dmitry Ignatov
机构: University of Würzburg (维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 Pages including references, 3 figures
Abstract:Mobile deployment of facial age estimation requires models that balance predictive accuracy with low latency and compact size. In this work, we present MobileAgeNet, a lightweight age-regression framework that achieves an MAE of 4.65 years on the UTKFace held-out test set while maintaining efficient on-device inference with an average latency of 14.4 ms measured using the AI Benchmark application. The model is built on a pretrained MobileNetV3-Large backbone combined with a compact regression head, enabling real-time prediction on mobile devices. The training and evaluation pipeline is integrated into the NN LEMUR Dataset framework, supporting reproducible experimentation, structured hyperparameter optimization, and consistent evaluation. We employ bounded age regression together with a two-stage fine-tuning strategy to improve training stability and generalization. Experimental results show that MobileAgeNet achieves competitive accuracy with 3.23M parameters, and that the deployment pipeline, from PyTorch training through ONNX export to TensorFlow Lite conversion, preserves predictive behavior without measurable degradation under practical on-device conditions. Overall, this work provides a practical, deployment-ready baseline for mobile-oriented facial age estimation.
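摘要提到的「有界年龄回归」通常可以通过 sigmoid 压缩再缩放到合法年龄区间来实现,使预测值永远不会越界。以下是一个假设性的实现草图,区间 [0, 116] 为本文虚构的假设,论文并未给出具体上下界:

```python
import math

def bounded_age(logit, age_min=0.0, age_max=116.0):
    """有界年龄回归头:把网络原始输出经 sigmoid 压到 (0, 1),
    再线性缩放到 [age_min, age_max]。区间端点为假设值。"""
    s = 1.0 / (1.0 + math.exp(-logit))
    return age_min + s * (age_max - age_min)

ages = [bounded_age(x) for x in (-10.0, 0.0, 10.0)]
```

相比直接回归年龄值,这种参数化让训练目标天然落在有效范围内,常被认为有助于训练稳定性。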
[CV-185] TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation
【速读】:该论文旨在解决现有音乐驱动舞蹈生成方法在语义可控性方面的不足,即难以通过自然语言描述有效引导特定动作。其核心问题是缺乏大规模的音乐-文本-动作三元组标注数据集,导致无法对文本条件下的舞蹈生成进行有效的监督学习。解决方案的关键在于提出TeMuDance框架,采用以运动为中心的桥接范式(motion-centred bridging paradigm),利用运动作为共享语义锚点,在统一嵌入空间中对不相关的音乐-舞蹈和文本-运动数据集进行对齐,从而实现跨模态检索缺失模态并用于端到端训练;同时引入轻量级文本控制分支,在冻结的音乐到舞蹈扩散主干网络基础上,保留节奏保真度的同时实现细粒度语义引导,并通过双流微调策略与置信度过滤机制抑制检索监督信号中的噪声。
链接: https://arxiv.org/abs/2604.17005
作者: Xinran Liu,Diptesh Kanojia,Wenwu Wang,Zhenhua Feng
机构: University of Surrey (萨里大学); Jiangnan University (江南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:
Abstract:Existing music-driven dance generation approaches have achieved strong realism and effective audio-motion alignment. However, they generally lack semantic controllability, making it difficult to guide specific movements through natural language descriptions. This limitation primarily stems from the absence of large-scale datasets that jointly align music, text, and motion for supervised learning of text-conditioned control. To address this challenge, we propose TeMuDance, a framework that enables text-based control for music-conditioned dance generation without requiring any manually annotated music-text-motion triplet dataset. TeMuDance introduces a motion-centred bridging paradigm that leverages motion as a shared semantic anchor to align disjoint music-dance and text-motion datasets within a unified embedding space, enabling cross-modal retrieval of missing modalities for end-to-end training. A lightweight text control branch is then trained on top of a frozen music-to-dance diffusion backbone, preserving rhythmic fidelity while enabling fine-grained semantic guidance. To further suppress noise inherent in the retrieved supervision, we design a dual-stream fine-tuning strategy with confidence-based filtering. We also propose a novel task-aligned metric that quantifies whether textual prompts induce the intended kinematic attributes under music conditioning. Extensive experiments demonstrate that TeMuDance achieves competitive dance quality while substantially improving text-conditioned control over existing methods.
[CV-186] Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling
【速读】:该论文旨在解决张量补全(Tensor Completion, TCAS)中因任意采样导致的恢复难题,特别是针对传统卷积核范数最小化(Convolution Nuclear Norm Minimization, CNNM)方法在优化过程中需多次执行奇异值分解(Singular Value Decomposition, SVD),从而造成计算开销大且难以并行化的问题。其解决方案的关键在于从卷积特征向量的角度重构CNNM的优化目标,并引入预学习的、跨张量共享的卷积特征向量作为先验知识,提出诱导式卷积核范数最小化(Inductive Convolution Nuclear Norm Minimization, ICNNM)方法,该方法可跳过SVD步骤,显著降低计算时间,同时借助额外先验信息提升恢复性能。
链接: https://arxiv.org/abs/2604.17001
作者: Wei Li,Yuyang Li,Kaile Du,Yi Yu,Guangcan Liu
机构: Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11
Abstract:The recently established Convolution Nuclear Norm Minimization (CNNM) addresses the problem of \textittensor completion with arbitrary sampling (TCAS), which involves restoring a tensor from a subset of its entries sampled in an arbitrary manner. Despite its promising performance, the optimization procedure of CNNM needs performing Singular Value Decomposition (SVD) multiple times, which is computationally expensive and hard to parallelize. To address the issue, we reformulate the optimization objective of CNNM from the perspective of convolution eigenvectors. By introducing pre-learned convolution eigenvectors which are shared among different tensors, we propose a novel method called Inductive Convolution Nuclear Norm Minimization (ICNNM), which bypasses the SVD step so as to decrease significantly the computational time. In addition, due to the extra prior knowledge encoded in the pre-learned convolution eigenvectors, ICNNM also outperforms CNNM in terms of recovery performance. Extensive experiments on video completion, prediction and frame interpolation verify the superiority of ICNNM over CNNM and several other competing methods.
[CV-187] Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
【速读】:该论文旨在解决当前具身智能(Embodied AI)在真实世界部署中,视觉-语言导航(Vision-and-Language Navigation, VLN)任务从单纯可达性向社会合规性演进时所面临的“目标驱动陷阱”问题——即现有代理过度关注物理路径可行性(“我能去吗?”),而忽视语义规则约束(“我可以去吗?”),导致对细微监管限制的忽略。解决方案的关键在于提出Rule-VLN基准和Semantic Navigation Rectification Module(SNRM):Rule-VLN是首个大规模城市级规则合规导航基准,包含29k节点、8k受限节点及177类监管规则;SNRM则是一个通用的零样本模块,通过粗粒度到细粒度的视觉语言模型(VLM)感知框架与认知不确定性(epistemic)心理地图结合,实现动态绕行规划,显著提升代理的安全意识与导航性能,实验表明其可使碰撞率(CVR)降低19.26%,总完成率(TC)提升5.97%。
链接: https://arxiv.org/abs/2604.16993
作者: Jiawen Wen,Penglei Sun,Wenjie Zhang,Suixuan Qiu,Weisheng Xu,Xiaofei Yang,Xiaowen Chu
机构: Hong Kong University of Science and Technology (Guangzhou); Beijing Normal University; Guangzhou University
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:As embodied AI transitions to real-world deployment, the success of the Vision-and-Language Navigation (VLN) task tends to evolve from mere reachability to social compliance. However, current agents suffer from a “goal-driven trap”, prioritizing physical geometry (“can I go?”) over semantic rules (“may I go?”), frequently overlooking subtle regulatory constraints. To bridge this gap, we establish Rule-VLN, the first large-scale urban benchmark for rule-compliant navigation. Spanning a massive 29k-node environment, it injects 177 diverse regulatory categories into 8k constrained nodes across four curriculum levels, challenging agents with fine-grained visual and behavioral constraints. We further propose the Semantic Navigation Rectification Module (SNRM), a universal, zero-shot module designed to equip pre-trained agents with safety awareness. SNRM integrates a coarse-to-fine visual perception VLM framework with an epistemic mental map for dynamic detour planning. Experiments demonstrate that while Rule-VLN challenges state-of-the-art models, SNRM significantly restores navigation capabilities, reducing CVR by 19.26% and boosting TC by 5.97%.
[CV-188] DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection
【速读】:该论文旨在解决视频生成技术快速演进背景下,传统媒体鉴伪方法在面对未见过的生成架构时泛化能力不足的问题。其解决方案的关键在于提出一种无需训练的框架DVAR(Debate-based Video Authenticity Reasoning),将视频真实性检测重构为多智能体辩论式的结构化推理过程:通过生成假设代理(Generative Hypothesis Agent)与自然机制代理(Natural Mechanism Agent)之间的迭代交叉质询,使双方在异常证据面前捍卫各自解释,从而推动逻辑收敛;同时引入最小描述长度(Minimum Description Length, MDL)框架应用奥卡姆剃刀原则量化每条推理路径的“解释成本”,并融合动态知识库GenVideoKB提供生成边界与失效模式的高层推理启发式信息,实现对视频真实性的可解释、高泛化能力评估。
链接: https://arxiv.org/abs/2604.16987
作者: Hongyuan Qi,Feifei Shao,Ming Li,Hehe Fan,Jun Xiao
机构: Zhejiang University (浙江大学); Guangming Lab (光明实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages
Abstract:The rapid evolution of video generation technologies poses a significant challenge to media forensics, as conventional detection methods often fail to generalize beyond their training distributions. To address this, we propose DVAR (Debate-based Video Authenticity Reasoning), a training-free framework that reformulates video detection as a structured multi-agent forensic reasoning process. Moving beyond the paradigm of pattern matching, DVAR orchestrates a competition between a Generative Hypothesis Agent and a Natural Mechanism Agent. Through iterative rounds of cross-examination, these agents defend their respective explanations against abnormal evidence, driving a logical convergence where the truth emerges from rigorous stress-testing. To adjudicate these conflicting claims, we apply Occam’s Razor through the Minimum Description Length (MDL) framework, defining an Explanatory Cost to quantify the “logical burden” of each reasoning path. Furthermore, we integrate GenVideoKB, a dynamic knowledge repository that provides high-level reasoning heuristics on generative boundaries and failure modes. Extensive experiments demonstrate that DVAR achieves competitive performance against supervised state-of-the-art methods while exhibiting superior generalization to unseen generative architectures. By transforming detection into a transparent debate, DVAR provides explicit, interpretable reasoning traces for robust video authenticity assessment.
[CV-189] Adverse-to-the-eXtreme Panoptic Segmentation: URVIS 2026 Study and Benchmark
【速读】:该论文旨在解决在恶劣至极端天气条件下实现鲁棒的多模态全景分割(panoptic segmentation)问题,这是自动驾驶和环境感知系统中的一项关键挑战。解决方案的关键在于构建并使用MUSES数据集,该数据集是一个多传感器基准,涵盖RGB相机、激光雷达(LiDAR)、雷达和事件相机(event camera)等模态数据,从而支持跨天气条件的公平评估;同时引入加权全景质量(Weighted Panoptic Quality, wPQ)作为官方评价指标,确保在不同天气场景下模型性能的可比性与公正性。
链接: https://arxiv.org/abs/2604.16984
作者: Yiting Wang,Nolwenn Peyratout,Tim Brodermann,Jiahui Wang,Yusi Cao,Michele Cazzola,Elie Tarassov,Takuya Kobayashi,Abderrahim Kasmi,Guillaume Allibert,Cédric Demonceaux,Valentina Donzella,Kurt Debattista,Radu Timofte,Zongwei Wu,Christos Sakaridis
机构: ETH Zurich (苏黎世联邦理工学院); University of Geneva (日内瓦大学); EPFL (洛桑联邦理工学院); University of Oxford (牛津大学); University of Trento (特伦托大学); University of Bologna (博洛尼亚大学); University of Genoa (热那亚大学); University of Edinburgh (爱丁堡大学); University of Southern California (南加州大学); National University of Singapore (新加坡国立大学); University of Tokyo (东京大学); KAIST (韩国科学技术院); CNRS (法国国家科学研究中心); TU Darmstadt (达姆施塔特工业大学); Université de Lille (里尔大学); ETH Zurich (苏黎世联邦理工学院); University of Cambridge (剑桥大学); Technical University of Munich (慕尼黑工业大学); University of Stuttgart (斯图加特大学); University of Liège (列日大学); University of Rome Tor Vergata (罗马托尔韦加塔大学); University of Pisa (比萨大学); University of Nantes (南特大学); University of Bordeaux (波尔多大学); University of Toulouse (图卢兹大学); University of Strasbourg (斯特拉斯堡大学); University of Lyon (里昂大学); University of Grenoble Alpes (格勒诺布尔-阿尔卑斯大学); University of Paris-Saclay (巴黎-萨克雷大学); University of Paris (巴黎大学); University of Aix-Marseille (马赛大学); University of Montpellier (蒙彼利埃大学); University of Nancy (南锡大学); University of Lorraine (洛林大学); University of Reims Champagne-Ardenne (香槟-阿登大学); University of Rouen Normandy (鲁昂诺曼底大学); University of Caen Normandy (卡昂诺曼底大学); University of Le Havre (勒阿弗尔大学); University of Angers (昂热大学); University of Rennes (雷恩大学); University of Nîmes (尼姆大学); University of Perpignan (佩皮尼昂大学); University of Toulon (土伦大学); University of Avignon (阿维尼翁大学); University of Valenciennes (瓦朗谢讷大学); University of Versailles Saint-Quentin-en-Yvelines (凡尔赛圣 Quentin en Yvelines大学); University of Orléans (奥尔良大学); University of Limoges (利摩日大学); University of Poitiers (普瓦捷大学); University of La Rochelle (拉罗谢尔大学); University of Angers (昂热大学); University of Bretagne Sud (布列塔尼南大学); University of Bretagne Occidentale (布列塔尼西部大学); University of Brittany (布列塔尼大学); University of Corsica (科西嘉大学); University of Alsace (阿尔萨斯大学); University of Franche-Comté (勃艮第-弗朗什-孔泰大学); University of Picardie Jules Verne (皮卡第朱尔斯·凡尔纳大学); University of Haute-Normandie (上诺曼底大学); 
University of Normandy (诺曼底大学); University of Upper Normandy (上诺曼底大学); University of Lower Normandy (下诺曼底大学); University of Brittany (布列塔尼大学); University of Brittany South (布列塔尼南大学); University of Brittany West (布列塔尼西部大学); University of Brittany North (布列塔尼北部大学); University of Brittany Central (布列塔尼中部大学); University of Brittany East (布列塔尼东部大学); University of Brittany North-East (布列塔尼东北部大学); University of Brittany South-West (布列塔尼西南部大学); University of Brittany North-West (布列塔尼西北部大学); University of Brittany East-Central (布列塔尼东中部大学); University of Brittany West-Central (布列塔尼西中部大学); University of Brittany South-Central (布列塔尼南中部大学); University of Brittany North-Central (布列塔尼北中部大学); University of Brittany East-West (布列塔尼东西部大学); University of Brittany South-East (布列塔尼南东部大学); University of Brittany North-East (布列塔尼东北部大学); University of Brittany West-East (布列塔尼西东部大学); University of Brittany North-West (布列塔尼西北部大学); University of Brittany South-West (布列塔尼西南部大学); University of Brittany East-North (布列塔尼东北部大学); University of Brittany West-South (布列塔尼西南部大学); University of Brittany North-South (布列塔尼南北大学); University of Brittany East-West-North (布列塔尼东西部北部大学); University of Brittany West-East-South (布列塔尼西东部南部大学); University of Brittany North-East-South (布列塔尼东北部南部大学); University of Brittany South-East-North (布列塔尼南东部北部大学); University of Brittany East-West-Central (布列塔尼东西部中部大学); University of Brittany West-East-Central (布列塔尼西东部中部大学); University of Brittany North-East-Central (布列塔尼东北部中部大学); University of Brittany South-East-Central (布列塔尼南东部中部大学); University of Brittany East-West-North-South (布列塔尼东西部南北大学); University of Brittany West-East-North-South (布列塔尼西东部南北大学); University of Brittany North-East-North-South (布列塔尼东北部南北大学); University of Brittany South-East-North-South (布列塔尼南东部南北大学); University of Brittany East-West-South-North (布列塔尼东西部南部北部大学); University of Brittany West-East-South-North (布列塔尼西东部南部北部大学); University of Brittany North-East-South-North (布列塔尼东北部南部北部大学); University of 
Brittany South-East-North-South (布列塔尼南东部南北大学); University of Brittany East-West-North-South-Central (布列塔尼东西部南北中部大学); University of Brittany West-East-North-South-Central (布列塔尼西东部南北中部大学); University of Brittany North-East-North-South-Central (布列塔尼东北部南北中部大学); University of Brittany South-East-North-South-Central (布列塔尼南东部南北中部大学); University of Brittany East-West-South-North-Central (布列塔尼东西部南部北部中部大学); University of Brittany West-East-South-North-Central (布列塔尼西东部南部北部中部大学); University of Brittany North-East-South-North-Central (布列塔尼东北部南部北部中部大学); University of Brittany South-East-North-South-Central (布列塔尼南东部南北中部大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents the report of the URVIS 2026 challenge on adverse-to-extreme panoptic segmentation. As the first challenge of its kind, it attracted 17 registered participants and 47 submissions, with 4 teams reaching the final phase. The challenge is based on the MUSES dataset, a multi-sensor benchmark for panoptic segmentation in adverse-to-extreme weather, including RGB frame camera, LiDAR, radar, and event camera data. Weighted Panoptic Quality (wPQ) is designed and adopted as the official ranking metric for fair evaluation across weather conditions. In this report, we summarise the challenge setting and benchmark results, analyse the performance of the submitted methods, and discuss current progress and remaining challenges for robust multimodal panoptic segmentation. Link: this https URL
[CV-190] UGD: An Unsupervised Geometric Distance for Evaluating Real-world Noisy Point Cloud Denoising
【速读】:该论文旨在解决真实场景中点云去噪(Point Cloud Denoising)方法缺乏有效定量评估指标的问题。传统评价指标依赖于监督方式,需同时获取去噪后的点云与对应的真值干净点云来计算几何距离,但在实际应用中真值往往不可得。其解决方案的关键在于提出一种仅基于噪声点云即可计算的无监督几何距离(Unsupervised Geometric Distance, UGD),通过从一组干净点云中学习一个局部补丁级的先验模型(即纯真高斯混合模型,Pristine Gaussian Mixture Model, GMM),将其作为参考基准,进而量化去噪后点云在补丁空间中的几何变化程度。该方法利用自监督训练框架对补丁特征提取网络进行优化,包含成对质量排序、失真分类和失真分布预测三个任务,从而实现无需真值即可对点云去噪效果进行可靠评估。
链接: https://arxiv.org/abs/2604.16976
作者: Zhiyong Su,Jincan Wu,Yonghui Liu,Zheng Li,Weiqing Li
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: to be published in IEEE Transactions on Visualization and Computer Graphics
Abstract:Point cloud denoising is a fundamental and crucial challenge in real-world point cloud applications. Existing quantitative evaluation metrics for point cloud denoising methods are implemented in a supervised manner, which requires both the denoised point cloud and the corresponding ground-truth clean point cloud to compute a representative geometric distance. This requirement is highly problematic in real-world scenarios, where ground-truth clean point clouds are often unavailable. In this paper, we propose a simple yet effective unsupervised geometric distance (UGD) for real-world noisy point cloud denoising, calculated solely from noisy point clouds. The core idea of UGD is to learn a patch-wise prior model from a set of clean point clouds and then employ this prior model as the ground-truth to quantify the degradation by measuring the geometric variations of the denoised point cloud. To this end, we first learn a pristine Gaussian Mixture Model (GMM) with extracted patch-wise quality-aware features from a set of pristine clean point clouds by a patch-wise feature extraction network, which serves as the ground-truth for the quantitative evaluation. Then, the UGD is defined as the weighted sum of distances between each patch of the denoised point cloud and the learned pristine GMM model in the patch space. To train the employed patch-wise feature extraction network, we propose a self-supervised training framework through multi-task learning, which includes pair-wise quality ranking, distortion classification, and distortion distribution prediction. Quantitative experiments with synthetic noise confirm that the proposed UGD achieves comparable performance to supervised full-reference metrics. Moreover, experimental results on real-world data demonstrate that the proposed UGD enables unsupervised evaluation of point cloud denoising methods based exclusively on noisy point clouds.
[CV-191] Hyperspectral Unmixing Hierarchies
【速读】:该论文旨在解决高光谱图像解混(hyperspectral unmixing)中的三大挑战:光谱变异(spectral variability)导致的解混性能下降、端元(endmember)数量难以准确确定,以及随着端元数量增加导致的端元清晰度降低。其解决方案的关键在于提出一种基于层次结构的解混方法——二进制线性解混触觉层次结构(Binary Linear Unmixing Tactile Hierarchies, BLUTHs),通过在深度非负矩阵分解(Deep Nonnegative Matrix Factorization)中引入层次丰度和约束,实现对不同光谱对比度的端元分层提取。BLUTHs 的拓扑结构可通过稀疏调制(sparsity modulation)自适应调整以匹配具体场景,从而有效缓解光谱变异影响并提升丰度估计精度,在实验室场景中显著优于现有最优算法,同时在遥感与海洋颜色解混任务中保持竞争力。
链接: https://arxiv.org/abs/2604.16969
作者: Joseph L. Garrett,P. S. Vishnu,Pauliina Salmi,Daniela Lupu,Nitesh Kumar Singh,Ion Necoara,Tor Arne Johansen
机构: Norwegian University of Science and Technology (挪威科技大学); University Politehnica Bucharest (布加勒斯特理工大学); UPES (印度UPES大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Main text and supplemental
Abstract:Unmixing reveals the spatial distribution and spectral details of different constituents, called endmembers, in a hyperspectral image. Because unmixing has limited ground truth requirements, can accommodate mixed pixels, and is closely tied to light propagation, it is a uniquely powerful tool for analyzing hyperspectral images. However, spectral variability inhibits unmixing performance, the proper way to determine the number of endmembers is ambiguous, and the clarity of the endmembers degrades as more are included. Hierarchical structure is a possible solution to all three problems. Here, hierarchical unmixing is defined by imposing a hierarchical abundance sum constraint on Deep Nonnegative Matrix Factorization. Binary Linear Unmixing Tactile Hierarchies (BLUTHs) solve the hierarchical unmixing problem with a simple network architecture. Sparsity modulation unmixing growth tailors the topology of a BLUTH to each scene. The structure imposed by BLUTHs allows endmembers with varying levels of spectral contrast to be revealed, mitigating the challenge of spectral variability. The performance of BLUTHs exceeds state-of-the-art unmixing algorithms on laboratory scenes, particularly with regard to abundance estimation, while their performance remains competitive on remote sensing scenes. In addition, ocean color unmixing by BLUTHs is demonstrated on hyperspectral scenes from the HYPSO and PACE satellites. Comments: Main text and supplemental Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV) Cite as: arXiv:2604.16969 [cs.CV] (or arXiv:2604.16969v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2604.16969 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-192] Hyperbolic Enhanced Representation Learning for Incomplete Multi-view Clustering
【速读】:该论文旨在解决不完整多视图聚类(Incomplete Multi-View Clustering, IMVC)中因视图缺失导致的表征学习困难与鲁棒性不足问题,尤其针对传统基于欧氏空间的方法在处理具有内在层次结构的真实数据时存在的几何失配问题,即语义模糊现象——表征会向空间上邻近但语义不同的邻居漂移。其解决方案的关键在于提出一种超球面增强表示学习框架(HERL),该框架在庞加莱球(Poincaré ball)中构建结构感知的潜在空间:通过双约束超球面对比机制优化角距离损失以保持语义一致性、距离损失以强化层次紧凑性;并引入超球面原型头(hyperbolic prototype head)通过对齐跨视图的层次感知原型分布来修正全局结构漂移,从而解耦细粒度语义关联、锐化聚类边界,并施加几何约束以修正数据恢复过程。
链接: https://arxiv.org/abs/2604.16959
作者: Tianyi Chen,Haobo Wang,Kai Tang,Gengyu Lyu,Tianlei Hu,Gang Chen,Hong Ma,Meixiang Xiang
机构: Zhejiang University (浙江大学); Beijing Jiaotong University (北京交通大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Incomplete Multi-View Clustering (IMVC) faces the challenge of learning discriminative representations from fragmentary observations while maintaining robustness against missing views. However, prevalent Euclidean-based methods suffer from a geometric mismatch when modeling real-world data with intrinsic hierarchies, leading to semantic blurring where representations drift towards spatially proximal but semantically distinct neighbors. To bridge this gap, we propose HERL, a Hyperbolic Enhanced Representation Learning framework for IMVC. Operating within the Poincaré ball, HERL constructs a structure-aware latent space to enhance representation learning. Specifically, we design a dual-constraint hyperbolic contrastive mechanism optimizing: an angular-based loss to preserve semantic identity via directional alignment, and a distance-based loss to enforce hierarchical compactness. Furthermore, a hyperbolic prototype head is introduced to rectify global structural drift by aligning cross-view hierarchy-aware prototype distributions. Consequently, HERL disentangles fine-grained semantic correlations to sharpen cluster boundaries and imposes geometric constraints to rectify the data recovery process. Extensive experimental results demonstrate that HERL consistently outperforms state-of-the-art approaches.
[CV-193] Self-Reasoning Agent ic Framework for Narrative Product Grid-Collage Generation
【速读】:该论文旨在解决现有图像生成方法在产品摄影中缺乏结构化叙事规划与跨面板协调能力的问题,导致视觉叙事薄弱和画面不连贯。其核心解决方案是提出一种自推理代理框架(self-reasoning agentic framework),首先构建产品叙事框架(Product Narrative Framework),显式表达产品的身份、使用场景及环境,并将其转化为共享视觉风格的互补网格;随后通过约束感知提示(constraint-aware prompts)驱动生成模型联合合成统一的多格拼贴图像,而非独立生成各面板;同时引入内容有效性与摄影质量的双重评估机制,失败时进行归因分析并实施针对性优化,实现迭代式自我修正与持续改进。
链接: https://arxiv.org/abs/2604.16958
作者: Minyan Luo,Yuxin Zhang,Yifei Li,Xincan Wang,Fuzhang Wu,Tong-Yee Lee,Oliver Deussen,Weiming Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Narrative-driven product photography has become a prevalent paradigm in modern marketing, as coherent visual storytelling helps convey product value and establishes emotional engagement with consumers. However, existing image generation methods do not support structured narrative planning or cross-panel coordination, often resulting in weak storytelling and visual incoherence. In practice, narrative product photography is commonly presented as multi-grid collages, where multiple views or scenes jointly communicate a product narrative. To ensure visual consistency across grids and aesthetic harmony of the overall composition, we generate the collage as a single unified image rather than composing independently synthesized panels. We propose a self-reasoning agentic framework for narrative product grid collage generation. Given a product packshot and its name, the system first constructs a Product Narrative Framework that explicitly represents the product’s identity, usage context, and situational environment, and translates it into complementary grids governed by a shared visual style. Constraint-aware prompts are then compiled and fed to a generation model that synthesizes the collage jointly. The generated output is evaluated on both content validity and photography quality, with explicit gates determining whether to proceed or refine. When evaluation fails, the system performs failure attribution and applies targeted refinement, enabling progressive improvement through iterative self-reflection. Experiments demonstrate that our framework consistently improves aesthetic quality, narrative richness, and visual coherence, compared to direct prompting baselines.
[CV-194] raining-inference input alignment outweighs framework choice in longitudinal retinal image prediction
【速读】:该论文旨在解决基于纵向眼底成像对渐进性黄斑疾病未来视网膜形态进行定量预测的问题,以支持临床决策,替代当前依赖定性比较或标量进展评分的手段。其关键解决方案在于:通过系统对比五种共享同一架构与训练数据集的条件生成配置(包括标准条件扩散、推理对齐的随机训练和确定性回归),发现将训练与推理输入分布对齐可显著提升性能(delta-SSIM +0.082,SSIM +0.086,p < 0.001),而具体框架选择对主要指标无显著影响;进一步任务熵与后验集中度分析表明,跨访视变化中可预测成分远小于时间不变的采集变异性,因此随机采样难以发挥优势。基于此机制理解,作者提出TRU(Temporal Retinal U-Net),一种具有连续时间差分条件输入和多尺度历史聚合的确定性直接回归模型,在三个不同成像平台共28,902只眼的数据上验证了其优越性,且在可用历史长度增加时性能优势持续增强。
链接: https://arxiv.org/abs/2604.16955
作者: Liyin Chen,Nazlee Zebardast,Mengyu Wang,Tobias Elze,Jason I. Comander
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Quantitative prediction of future retinal appearance from longitudinal imaging would support clinical decisions in progressive macular disease that currently rely on qualitative comparison or scalar progression scores. Recent methods have moved toward increasing generative complexity, but whether this complexity is necessary for slowly progressing retinal disease is unclear. We tested this through a controlled comparison of five conditioning configurations sharing one architecture and training dataset, spanning standard conditional diffusion, inference-aligned stochastic training, and deterministic regression. In our evaluation, aligning the training and inference input distributions produced large gains (delta-SSIM +0.082, SSIM +0.086, both p 0.001), while the choice among aligned frameworks did not significantly affect any primary metric. Task-entropy and posterior-concentration analyses, replicated on two fundus autofluorescence (FAF) platforms, provided a mechanistic account: the predictable component of inter-visit change is small relative to time-invariant acquisition variability, leaving stochastic sampling with little width to exploit. Guided by these findings, we developed TRU (Temporal Retinal U-Net), a deterministic direct-regression model with continuous time-delta conditioning and multi-scale history aggregation. We evaluated TRU on 28,902 eyes across three imaging platforms: a mixed-disease Optos FAF cohort (9,942 eyes), zero-shot transfer to Stargardt macular dystrophy on Optos (288 eyes) and Heidelberg Spectralis (125 eyes), and a boundary evaluation on Cirrus en-face fundus images from a glaucoma cohort (18,547 eyes). TRU matched or exceeded delta-SSIM, SSIM, and PSNR in every FAF cohort against three state-of-the-art benchmarks, and its advantage grew monotonically with available history length.
[CV-195] SM-Pose: Topology-Aware Learning with Semantic Mamba for Category-Level Object Pose Estimation
【速读】:该论文旨在解决类别级物体位姿估计(category-level object pose estimation)中对未见实例泛化能力不足的问题,现有方法因依赖简单的特征提取与聚合机制,难以捕捉类别共享的拓扑结构并进行语义关键点建模。解决方案的关键在于提出TSM-Pose框架:其一,设计拓扑提取器(Topology Extractor)以捕获点云的全局拓扑表示,并将其融合进局部几何特征中,从而构建鲁棒的类别级结构表征;其二,引入基于Mamba的全局语义聚合器(Mamba-based Global Semantic Aggregator),通过注入语义先验增强关键点表达能力,并利用多个TwinMamba模块建模长程依赖关系,实现更有效的全局特征聚合。
链接: https://arxiv.org/abs/2604.16954
作者: Jinshuo Liu,Bingtao Ma,Junlin Su,Guanyuan Pan,Beining Wu,Cheng Yang,Jiaxuan Lu,Chenggang Yan,Shuai Wang
机构: Hangzhou Dianzi University (杭州电子科技大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Category-level object pose estimation is fundamental for embodied intelligence, yet achieving robust generalization to unseen instances remains challenging. However, existing methods mainly rely on simple feature extraction and aggregation, which struggle to capture category-shared topological structures and conduct semantic keypoint modeling, limiting their generalization. To address these, we propose a \textbfTopology-Aware Learning with \textbfSemantic \textbfMamba for Category-Level \textbfPose Estimation framework (TSM-Pose). Specifically, we introduce a Topology Extractor to capture the global topological representation of the point cloud, which is integrated into local geometry features and enables robust category-level structural representation. Simultaneously, we propose a Mamba-based Global Semantic Aggregator that injects semantics priors into keypoints to enhance their expressiveness and leverages multiple TwinMamba blocks to model long-range dependencies for more effective global feature aggregation. Extensive experiments on three benchmark datasets (REAL275, CAMERA25, and HouseCat6D) demonstrate that TSM-Pose outperforms existing state-of-the-art methods.
[CV-196] Better with Less: Tackling Heterogeneous Multi-Modal Image Joint Pretraining via Conditioned and Degraded Masked Autoencoder
【速读】:该论文旨在解决多模态视觉中跨异构模态(尤其是高分辨率光学图像与合成孔径雷达SAR图像)表征学习的难题,核心挑战在于“异质性-分辨率悖论”:随着空间分辨率提升,雷达复杂的几何结构与光学纹理之间的物理差异显著放大,导致传统基于刚性对齐的方法在高分辨率场景下引发特征抑制或污染,进而造成表征退化和负迁移。解决方案的关键在于提出一种“更少对齐、更好协同”的新范式——通过三项核心技术实现:1)光学锚定知识蒸馏(OKD),隐式地将SAR斑点噪声映射至纯语义流形;2)条件对比学习(CCL),利用梯度缓冲机制对齐共享共识的同时保留差异化的物理特征;3)跨模态退化重建(CDR),主动剥离非同源光谱伪特征,以捕捉真正的结构不变性。该方法在100万样本预训练下展现出卓越的数据效率,并在多种单模态与双模态下游任务中达到新的SOTA性能。
链接: https://arxiv.org/abs/2604.16952
作者: Bowen Peng,Yongxiang Liu,Jie Zhou,Xiaodong Chen,Tianpeng Liu,Xiaogang Yu,Li Liu
机构: National University of Defense Technology (国防科技大学); Beijing Institute of Remote Sensing Information (北京遥感信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning robust representations across extremely heterogeneous modalities remains a fundamental challenge in multi-modal vision. As a critical and profound instantiation of this challenge, high-resolution (HR) joint optical and synthetic aperture radar (SAR) pretraining seeks modality synergy to mutually enhance single-source representations; its potential is severely hindered by the Heterogeneity-Resolution Paradox: finer spatial scales drastically amplify the physical divergence between complex radar geometries and non-homologous optical textures. Consequently, migrating medium-resolution-oriented rigid alignment paradigms to HR scenarios triggers either severe feature suppression to force equivalence, or feature contamination driven by extreme epistemic uncertainty. Both extremes inevitably culminate in profound representation degradation and negative transfer. To overcome this bottleneck, we propose CoDe-MAE, pioneering a \textitbetter synergy with less alignment philosophy. First, Optical-anchored Knowledge Distillation (OKD) implicitly regularizes SAR’s speckle noise by mapping it into a pure semantic manifold. Building on this, Conditioned Contrastive Learning (CCL) utilizes a gradient buffering mechanism to align shared consensus while safely preserving divergent physical signatures. Concurrently, Cross-Modal Degraded Reconstruction (CDR) deliberately strips non-homologous spectral pseudo-features, truncating the inherently ill-posed mapping to capture true structural invariants. Extensive analyses validate our theoretical claims. Pretrained on 1M samples, CoDe-MAE demonstrates remarkable data efficiency, successfully preventing representation degradation and establishing new state-of-the-art performance across diverse single- and bi-modal downstream tasks, substantially outperforming foundation models scaled on vastly larger datasets.
[CV-197] Adaptive receptive field-based spatial-frequency feature reconstruction network for few-shot fine-grained image classification
【速读】:该论文旨在解决少样本细粒度图像分类(Few-Shot Fine-Grained Image Classification, FSFGIC)中特征重建技术面临的挑战,即如何为不同类别的输入图像自适应地选择感受野(receptive field)大小,以提取更有效的空间与频域特征描述符。其解决方案的关键在于提出一种基于自适应感受野的空间-频率特征重建网络(Adaptive Receptive Field-based Spatial-Frequency Feature Reconstruction Network, ARF-SFR-Net),该网络能够动态调整感受野尺寸以获取空间和频域特征,并实现高效融合用于特征重建与分类任务,同时可无缝嵌入到典型的元训练机制中进行端到端训练。
链接: https://arxiv.org/abs/2604.16936
作者: Linyue Zhang,Wenyi Zeng,Zicheng Pan,Yongsheng Gao,Changming Sun,Jun Hu,Lixian Liu,Weichuan Zhang,Tuo Wang
机构: School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an, Shaanxi Province, China; School of Electrical Engineering and Computer Science, the University of Queensland, QLD, Australia; Institute for Integrated and Intelligent Systems, Griffith University, QLD, Australia; CSIRO Data61, PO Box 76, Epping, NSW 1710, Australia; Chengdu University of Technology, Chengdu, Sichuan Province, China; Xidian University, Xi’an, Shaanxi Province, China; The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, Shaanxi Province, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Feature reconstruction techniques are widely applied for few-shot fine-grained image classification (FSFGIC). Our research indicates that one of the main challenges facing existing feature-based FSFGIC methods is how to choose the size of the receptive field to extract feature descriptors (including spatial and frequency feature descriptors) from different category input images, thereby better performing the FSFGIC tasks. To address this, an adaptive receptive field-based spatial-frequency feature reconstruction network (ARF-SFR-Net) is proposed. The designed ARF-SFR-Net has the capability to adaptively determine receptive field sizes for obtaining spatial and frequency features, and effectively fuse them for reconstruction and FSFGIC tasks. The designed ARF-SFR-Net can be easily embedded into a given episodic training mechanism for end-to-end training from scratch. Extensive experiments on multiple FSFGIC benchmarks demonstrate the effectiveness and superiority of the proposed ARF-SFR-Net over state-of-the-art approaches. The code is available at: this https URL.
[CV-198] CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering
【速读】:该论文旨在解决视觉问答(Visual Question Answering, VQA)任务中基于混合专家(Mixture-of-Experts, MoE)模型因路由机制不稳定或过度稳定而导致的专家选择不一致或灵活性不足的问题。解决方案的关键在于提出概念引导路由框架(Concept-Guided Routing framework, CoGR-MoE),该框架在训练阶段利用答案选项的语义信息指导专家选择,并通过选项特征重新加权所选专家,生成每个候选选项的判别性表示;进一步地,这些选项级表示用于选项比较并借助对比学习进行优化,从而提升模型在多类VQA任务中的性能与鲁棒性。
链接: https://arxiv.org/abs/2604.16930
作者: Xiyin Zeng,Yi Lu,Hao Wang
机构: Hong Kong University of Science and Technology (Guangzhou)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visual Question Answering (VQA) requires models to identify the correct answer options based on both visual and textual evidence. Recent Mixture-of-Experts (MoE) methods improve option reasoning by grouping similar concepts or routing based on examples. However, unstable routing can lead to inconsistent expert selection in the same question type, while overly stable routing may reduce flexibility. To address this, we propose Concept-Guided Routing framework (CoGR-MoE), which incorporates semantics of the answer options to guide expert selection in the training phase. Next, option features are used to reweight the selected experts, producing discriminative representations for each candidate option. These option-level representations are further used for option comparison and optimized via contrastive learning. The experimental results indicate that CoGR-MoE delivers strong performance across multiple VQA tasks, demonstrating the effectiveness of our approach.
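下面用纯 Python 勾勒“以答案选项语义引导专家选择”的一种可能形式(仅为示意:路由打分、top-k 选择与权重归一化方式均为假设性简化,并非 CoGR-MoE 的原始实现):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concept_guided_route(token, concept, expert_keys, k=2, alpha=1.0):
    """路由得分 = 视觉 token 与专家键的相似度 + alpha * 概念(选项语义)与专家键的相似度;
    从中选出 top-k 专家并归一化权重。alpha 为示意性的概念引导强度超参数。"""
    scores = [dot(token, key) + alpha * dot(concept, key) for key in expert_keys]
    chosen = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    total = sum(scores[i] for i in chosen) or 1.0  # 假定得分为正;为零时退化处理
    weights = {i: scores[i] / total for i in chosen}
    return chosen, weights
```

同一问题类型的不同样本若携带相同的选项概念向量,路由得分中的概念项保持一致,这正是“概念引导使专家选择更稳定”的简化体现。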
[CV-199] Rethinking Cross-Dose PET Denoising: Mitigating Averaging Effects via Residual Noise Learning
【速读】:该论文旨在解决低剂量正电子发射断层成像(Low-dose PET, LDPET)中模型跨剂量泛化能力差的问题。传统方法如“一刀切”式模型(one-size-for-all)因在不同噪声水平下学习平均表示,导致性能下降;而单独训练各剂量的U-Net模型则缺乏灵活性。其核心解决方案是提出一种统一的残差噪声学习框架(unified residual noise learning framework),该框架直接从低剂量PET图像中估计噪声分布,而非预测全剂量图像,从而避免了对异质噪声分布的平均化建模,有效缓解了性能退化问题,并显著提升了跨剂量条件下的去噪效果与泛化能力。
链接: https://arxiv.org/abs/2604.16925
作者: Yichao Liu,Zongru Shao,Yueyang Teng,Junwen Guo
机构: IWR, Heidelberg University (海德堡大学); Silicon Austria Labs; College of Medicine and Biological Information Engineering, Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University (东北大学); Department of Epidemiology Global Health, Umeå University (于默奥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-dose denoising for low-dose positron emission tomography (LDPET) has been proposed to address the limited generalization of models trained at a single noise level. In practice, neural networks trained on a specific dose level often fail to generalize to other dose conditions due to variations in noise magnitude and statistical properties. Conventional “one-size-for-all” models attempt to handle this variability but tend to learn averaged representations across noise levels, resulting in degraded performance. In this work, we analyze this limitation and show that standard training formulations implicitly optimize an expectation over heterogeneous noise distributions. To this end, we propose a unified residual noise learning framework that estimates noise directly from low-dose PET images rather than predicting full-dose images. Experiments on large-scale multi-dose PET datasets from two medical centers demonstrate that the proposed method outperforms the “one-size-for-all” model, individual dose-specific U-Net models, and dose-conditioned approaches, achieving improved denoising performance. These results indicate that residual noise learning effectively mitigates the averaging effect and enhances generalization for cross-dose PET denoising.
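残差噪声学习的推理形式可以用一个数值玩具例子说明:网络预测噪声 n_hat,再由 x_low − n_hat 复原,而非直接预测全剂量图像。下例用“真值噪声的有偏估计”代替训练好的网络,仅展示这一核心思想,并非论文实现:

```python
import random

def denoise_by_residual(x_low, predict_noise):
    """残差式去噪:先估计噪声 n_hat,再从低剂量输入中减去。"""
    n_hat = predict_noise(x_low)
    return [a - b for a, b in zip(x_low, n_hat)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# 玩具数据:x_low = x_full + noise
rng = random.Random(0)
x_full = [float(i % 5) for i in range(200)]
noise = [rng.gauss(0.0, 0.3) for _ in x_full]
x_low = [a + b for a, b in zip(x_full, noise)]

# 用真值噪声的 90% 作为“网络预测”的替身(示意)
x_rec = denoise_by_residual(x_low, lambda _: [0.9 * n for n in noise])
```

只要噪声估计大体准确,残差形式就能把重建误差压到原噪声能量的一小部分;跨剂量场景下,网络学习的目标也随之从“平均化的干净图像”变为各剂量自身的噪声分布。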
[CV-200] Noise-Adaptive Diffusion Sampling for Inverse Problems Without Task-Specific Tuning ICLR2026
【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在求解逆问题(Inverse Problems, IPs)时面临的两大挑战:一是优化方法虽利用DM作为强大先验可快速求解,但易陷入局部极小值且对噪声过拟合;二是贝叶斯方法中强制测量一致性与去噪过程结合时会导致流形不可行性问题。其解决方案的关键在于提出一种基于噪声空间的哈密顿蒙特卡洛采样方法(Noise-space Hamiltonian Monte Carlo, N-HMC),将推理完全置于初始噪声空间中,通过将反向扩散视为从初始噪声到干净图像的确定性映射,确保提案始终位于学习到的数据流形上,从而实现对解空间的全面探索并规避局部最优。进一步地,作者还提出了噪声自适应变体(Noise-adaptive N-HMC, NA-NHMC),有效应对未知噪声类型和强度的逆问题场景,在多个线性和非线性逆问题上均展现出优越重建质量与鲁棒性。
链接: https://arxiv.org/abs/2604.16919
作者: Yingzhi Xia,Setthakorn Tanomkiattikun,Liangli Zhen,Zaiwang Gu
机构: Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore; Johns Hopkins University
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026
Abstract:Diffusion models (DMs) have recently shown remarkable performance on inverse problems (IPs). Optimization-based methods can fast solve IPs using DMs as powerful regularizers, but they are susceptible to local minima and noise overfitting. Although DMs can provide strong priors for Bayesian approaches, enforcing measurement consistency during the denoising process leads to manifold infeasibility issues. We propose Noise-space Hamiltonian Monte Carlo (N-HMC), a posterior sampling method that treats reverse diffusion as a deterministic mapping from initial noise to clean images. N-HMC enables comprehensive exploration of the solution space, avoiding local optima. By moving inference entirely into the initial-noise space, N-HMC keeps proposals on the learned data manifold. We provide a comprehensive theoretical analysis of our approach and extend the framework to a noise-adaptive variant (NA-NHMC) that effectively handles IPs with unknown noise type and level. Extensive experiments across four linear and three nonlinear inverse problems demonstrate that NA-NHMC achieves superior reconstruction quality with robust performance across different hyperparameters and initializations, significantly outperforming recent state-of-the-art methods. The code is available at this https URL.
[CV-201] KIRA: Knowledge-Intensive Image Retrieval and Reasoning Architecture for Specialized Visual Domains
【速读】:该论文旨在解决视觉检索增强生成(Visual Retrieval-Augmented Generation, Visual RAG)在专业领域中面临的五大核心挑战:跨模态鸿沟问题(图像查询与文本知识库之间的语义不匹配)、构建语义一致的视觉知识库、多跳推理能力不足、生成答案缺乏视觉证据支撑,以及对罕见视觉概念的适应性差。其解决方案的关键在于提出KIRA(Knowledge Intensive Image Retrieval and Reasoning Architecture),一个五阶段统一框架,通过以下创新实现突破:(1) 基于DINO区域检测的分层语义切片用于多粒度知识库构建;(2) 少样本适配的域自适应对比编码器提升罕见视觉概念识别能力;(3) 双路径跨模态检索结合Chain-of-Thought查询扩展增强检索精度;(4) Chain-of-Retrieval机制支持时序和多视角下的多跳视觉推理;(5) 证据条件生成结合事后幻觉验证确保答案忠实于视觉证据。该方法在医学X光、电路图、卫星影像和组织病理学四个专业领域均取得显著性能提升,平均检索精度达0.97,接地得分1.0,领域正确性0.707,并揭示各模块的作用边界与精度-多样性权衡关系。
链接: https://arxiv.org/abs/2604.16915
作者: Parthaw Goswami,Jaynto Goswami Deep
机构: University of Missouri (密苏里大学); SAP Prague (SAP布拉格)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Retrieval-augmented generation (RAG) has transformed text-based question answering, yet its extension to visual domains remains hindered by fundamental challenges: bridging the modality gap between image queries and text-heavy knowledge bases, constructing semantically meaningful visual knowledge bases, performing multi-hop reasoning over retrieved images, and verifying that generated answers are faithfully grounded in visual evidence. We present KIRA (Knowledge Intensive Image Retrieval and Reasoning Architecture), a unified five-stage framework that addresses ten core problems in visual RAG for specialized domains. KIRA introduces: (1) hierarchical semantic chunking with DINO-based region detection for multi-granularity knowledge base construction, (2) domain-adaptive contrastive encoders with few-shot adaptation for rare visual concepts, (3) dual-path cross-modal retrieval with chain-of-thought query expansion, (4) chain-of-retrieval for multi-hop visual reasoning with temporal and multi-view support, and (5) evidence-conditioned grounded generation with post-hoc hallucination verification. We also propose DOMAINVQAR, a benchmark suite that evaluates visual RAG along three axes (retrieval precision, reasoning faithfulness, and domain correctness), going beyond standard recall metrics. Experiments across four specialized domains (medical X-ray, circuit diagrams, satellite imagery, and histopathology) with a progressive six-variant ablation demonstrate that KIRA achieves 0.97 retrieval precision, 1.0 grounding scores, and 0.707 domain correctness averaged across domains, while the ablation reveals actionable insights about when each component helps and when components introduce precision-diversity tradeoffs that must be managed. Code will be released upon acceptance.
[CV-202] Unified Ultrasound Intelligence Toward an End-to-End Agentic System
【速读】:该论文旨在解决临床超声分析中模型在异质器官、视角和设备间泛化能力弱,且难以支持可解释的工作流程级分析的问题。现有方法通常依赖于任务特定的适应策略,而联合学习易受跨任务干扰导致不稳定,难以实现端到端的临床工作流输出。解决方案的关键在于提出USTri三阶段智能流水线:第一阶段训练一个通用的USGen模型以学习鲁棒的跨设备与协议迁移先验;第二阶段冻结USGen并微调数据集特异性头部(dataset-specific heads)以提升任务对齐性能同时保留共享超声知识;第三阶段引入USAgent,通过模拟临床医生工作流程协调多个USpec专家进行多步骤推理并生成结构化报告,从而实现高效、可解释的全流程分析。
链接: https://arxiv.org/abs/2604.16914
作者: Chen Ma,Yunshu Li,Junhu Fu,Shuyu Liang,Yuanyuan Wang,Yi Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by ISBI2026. 5 pages, 2 figures
Abstract:Clinical ultrasound analysis demands models that generalize across heterogeneous organs, views, and devices, while supporting interpretable workflow-level analysis. Existing methods often rely on task-wise adaptation, and joint learning may be unstable due to cross-task interference, making it hard to deliver workflow-level outputs in practice. To address these challenges, we present USTri, a tri-stage ultrasound intelligence pipeline for unified multi-organ, multi-task analysis. Stage I trains a universal generalist USGen on different domains to learn broad, transferable priors that are robust to device and protocol variability. To better handle domain shifts and reach task-aligned performance while preserving ultrasound shared knowledge, Stage II builds USpec by keeping USGen frozen and finetuning dataset-specific heads. Stage III introduces USAgent, which mimics clinician workflows by orchestrating USpec specialists for multi-step inference and deterministic structured reports. On the FMC_UIA validation set, our model achieves the best overall performance across 4 task types and 27 datasets, outperforming state-of-the-art methods. Moreover, qualitative results show that USAgent produces clinically structured reports with high accuracy and interpretability. Our study suggests a scalable path to ultrasound intelligence that generalizes across heterogeneous ultrasound tasks and supports consistent end-to-end clinical workflows.
[CV-203] LAGS: Low-Altitude Gaussian Splatting with Groupwise Heterogeneous Graph Learning
【速读】:该论文旨在解决低空高斯点绘(Low-altitude Gaussian splatting, LAGS)场景重建中资源分配效率低的问题,尤其针对现有方案未考虑不同视角带来的图像多样性而导致通信资源利用不充分的缺陷。解决方案的关键在于提出一种分组异构图神经网络(Groupwise Heterogeneous Graph Neural Network, GW-HGNN),其核心思想是将LAGS的重建损失与通信约束映射为图学习中的代价函数,并通过双层消息传递机制显式建模不同图像组对重建过程的非均匀贡献,从而自动平衡数据保真度与传输成本。
链接: https://arxiv.org/abs/2604.16910
作者: Yikun Wang,Yujie Wan,Wei Zuo,Shuai Wang,Yik-Chung Wu,Chengzhong Xu,Huseyin Arslan
机构: The University of Hong Kong (香港大学); Chinese Academy of Sciences (中国科学院); Southern University of Science and Technology (南方科技大学); University of Macau (澳门大学); Istanbul Medipol University (伊斯坦布尔Medipol大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 5 pages, 8 figures
Abstract:Low-altitude Gaussian splatting (LAGS) facilitates 3D scene reconstruction by aggregating aerial images from distributed drones. However, as LAGS prioritizes maximizing reconstruction quality over communication throughput, existing low-altitude resource allocation schemes become inefficient. This inefficiency stems from their failure to account for image diversity introduced by varying viewpoints. To fill this gap, we propose a groupwise heterogeneous graph neural network (GW-HGNN) for LAGS resource allocation. GW-HGNN explicitly models the non-uniform contribution of different image groups to the reconstruction process, thus automatically balancing data fidelity and transmission cost. The key insight of GW-HGNN is to transform LAGS losses and communication constraints into graph learning costs for dual-level message passing. Experiments on real-world LAGS datasets demonstrate that GW-HGNN significantly outperforms state-of-the-art benchmarks across key rendering metrics, including PSNR, SSIM, and LPIPS. Furthermore, GW-HGNN reduces computational latency by approximately 100x compared to the widely-used MOSEK solver, achieving millisecond-level inference suitable for real-time deployment.
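GW-HGNN 的“双层消息传递 + 组级贡献权重”可以用如下极简的两级聚合来示意(组划分、权重来源与节点更新规则均为假设性简化;真实模型中组权重由图学习得到,而非外部给定):

```python
def group_message_pass(node_feats, groups, group_w):
    """两级消息传递:先在组内求均值(层内聚合),再按各图像组的贡献权重
    聚合为全局消息(层间聚合),最后用全局消息更新每个节点(此处取简单平均)。"""
    dim = len(node_feats[0])
    group_mean = {g: [sum(node_feats[i][d] for i in idx) / len(idx)
                      for d in range(dim)]
                  for g, idx in groups.items()}
    z = sum(group_w[g] for g in groups)
    global_msg = [sum(group_w[g] * group_mean[g][d] for g in groups) / z
                  for d in range(dim)]
    return [[0.5 * (f[d] + global_msg[d]) for d in range(dim)] for f in node_feats]
```

权重较大的图像组(例如视角更关键的一组无人机图像)对全局消息的影响更大,对应“不同图像组对重建贡献非均匀”的建模意图。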
[CV-204] Physics-Informed Tracking (PIT)
【速读】:该论文旨在解决视频中单粒子轨迹追踪的准确性与物理一致性问题,尤其在缺乏标注数据的情况下实现高精度跟踪。其核心挑战在于如何在复杂背景噪声中稳定定位粒子,并确保预测轨迹符合已知物理动力学规律。解决方案的关键是提出物理信息引导的追踪框架(Physics-Informed Tracking, PIT),其中嵌入可微分的物理模块约束多帧地标点(landmark)形成满足物理规律的轨迹;同时设计了物理信息地标损失函数(Physics-Informed Landmark Loss, PILL),通过对比预测轨迹与地标点来强制物理一致性,无需标签即可训练;进一步引入监督变体PILLS,利用仿真提供的真值位置、速度和碰撞信息实现端到端反向传播优化。此外,采用带分裂瓶颈结构的自动编码器分离追踪相关特征(热力图地标)与背景噪声,从而提升鲁棒性和精度。
链接: https://arxiv.org/abs/2604.16895
作者: Emil Hovad,Allan Peter Engsig-Karup
机构: Technical University of Denmark (丹麦技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 3 figures, 11 tables
Abstract:We propose Physics-Informed Tracking (PIT), a video-based framework for tracking a single particle from video, where a neural network autoencoder localizes a particle as a heatmap peak (landmark) and a differentiable physics module embedded in the autoencoder constrains several landmarks over time (a trajectory) to satisfy known dynamics. The novel Physics-Informed Landmark Loss (PILL) compares this predicted trajectory back against the landmarks, enforcing physical consistency without labels. Its supervised variant (PILLS) instead compares the prediction against ground-truth position, velocity, and bounce from simulation, enabling end-to-end backpropagation. To support supervised and unsupervised learning, we use an autoencoder with a split bottleneck that separates A) tracking-related structure via landmark heatmaps from B) background noise and subsequent image reconstruction. We evaluate a replicated 2^6 factorial design (n = 4 replicates, 64 configurations), showing that PILLS consistently achieves sub-pixel tracking accuracy for the bilinear and physics-refined decoder outputs under both clean and noisy conditions.
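PILL“用物理轨迹约束逐帧地标”的思想可用一个一维弹跳模型示意:物理前向模型生成轨迹,损失即轨迹与地标序列的均方误差。重力、恢复系数等参数为示意取值,且这里用显式欧拉模拟代替论文中的可微物理模块:

```python
def simulate_bounce(y0, v0, g=9.8, dt=0.05, steps=40, restitution=0.8):
    """自由落体 + 地面反弹的简单物理前向模型(可微物理模块的数值替身)。"""
    ys, y, v = [], y0, v0
    for _ in range(steps):
        v -= g * dt
        y += v * dt
        if y < 0.0:                 # 触地:位置镜像反射,速度按恢复系数衰减反向
            y = -y
            v = -restitution * v
        ys.append(y)
    return ys

def pill_loss(landmarks, y0, v0, **kw):
    """Physics-Informed Landmark Loss 的示意形式:物理轨迹与逐帧地标的均方误差。"""
    traj = simulate_bounce(y0, v0, **kw)
    return sum((a - b) ** 2 for a, b in zip(traj, landmarks)) / len(landmarks)
```

当地标序列与某组物理参数下的轨迹完全一致时损失为零;初始状态或动力学参数偏离时损失上升,从而在无标签条件下提供物理一致性监督信号。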
[CV-205] EasyVideoR1: Easier RL for Video Understanding
【速读】:该论文旨在解决将强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)扩展至视频理解任务时面临的三大核心挑战:一是视频任务类型多样导致的奖励设计复杂性;二是高维视觉输入重复解码与预处理带来的计算开销;三是超参数敏感性引发的评估不可复现问题。其解决方案的关键在于提出一个名为EasyVideoR1的完整且高效的强化学习框架,通过五大创新实现突破:(1) 基于离线预处理与张量缓存的全流程视频RL训练流水线,显著减少冗余解码,提升1.47倍吞吐量;(2) 覆盖11类视频与图像任务的统一、模块化奖励系统;(3) 混合离线-在线数据训练范式,融合高质量轨迹与策略探索以增强复杂任务学习;(4) 图像-视频联合训练机制,支持独立配置像素预算实现模态互促;(5) 异步多基准评估框架,覆盖22个主流视频理解基准并实现与官方结果高度一致的复现精度。
链接: https://arxiv.org/abs/2604.16893
作者: Chuanyu Qin,Chenxu Yang,Qingyi Si,Naibin Gu,Dingyu Yao,Zheng Lin,Peng Fu,Nan Duan,Jiaqi Wang
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); JD.com (京东)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47x throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
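其中“离线预处理 + 张量缓存避免重复解码”的机制可用一个最小缓存类示意(decode_fn 代表昂贵的视频解码与预处理;类名与接口为假设,并非 EasyVideoR1 的真实 API):

```python
class FrameCache:
    """同一视频只解码一次:首次调用 decode_fn 并缓存结果,之后直接命中缓存。"""

    def __init__(self, decode_fn):
        self.decode_fn = decode_fn
        self.cache = {}
        self.decode_calls = 0   # 统计真实解码次数,便于观察节省效果

    def get(self, video_id):
        if video_id not in self.cache:
            self.decode_calls += 1
            self.cache[video_id] = self.decode_fn(video_id)
        return self.cache[video_id]
```

在 RL 训练中,同一视频会在多次 rollout 与多个 epoch 中被反复访问,缓存命中率越高,省下的解码开销越接近论文报告的吞吐量提升量级。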
[CV-206] CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization CVPR
【速读】:该论文旨在解决领域泛化(Domain Generalization, DG)中因图像风格差异导致模型过拟合于域特定外观特征而非类别语义的问题。现有基于文本表示的多模态方法虽能提供稳定的域不变锚点,但依赖余弦相似度的对比对齐仍存在模态间隙(modality gap),即图像与文本嵌入在几何空间上分离,尽管语义上对应。其解决方案的关键在于提出CrossFlowDG框架,通过无噪声的跨模态流匹配(cross-modal flow matching)在联合欧氏潜在空间中学习连续变换,显式地将受域偏倚影响的图像嵌入迁移至正确类别的域不变文本嵌入,从而弥合模态间隙并提升泛化性能。
链接: https://arxiv.org/abs/2604.16892
作者: Antonios Kritikos,Nikolaos Spanos,Athanasios Voulodimos
机构: National Technical University of Athens (国家技术大学雅典)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPRW 2026 (DG-EBF Workshop)
Abstract:Domain generalization (DG) aims to maintain performance under domain shift, which in computer vision appears primarily as stylistic variations that cause models to overfit to domain-specific appearance cues rather than class semantics. To overcome this, recent methods use textual representations as stable, domain-invariant anchors. However, multimodal approaches that rely on cosine similarity-based contrastive alignment leave a modality gap where image and text embeddings remain geometrically separated despite semantic correspondence. We propose CrossFlowDG, a novel DG framework that addresses this residual gap using noise-free, cross-modal flow matching. By learning a continuous transformation in the joint Euclidean latent space, our framework explicitly transports domain-biased image embeddings toward domain-invariant text embeddings of the correct class. Using the efficient VMamba image encoder and CLIP’s text encoder, CrossFlowDG is tested against four common DG benchmarks, and achieves competitive performance on several benchmarks and state-of-the-art on TerraIncognita. Code is available at: this https URL
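“把图像嵌入沿速度场连续搬运到文本嵌入”这一步,可用线性路径的理想速度场 v(x, t) = (x1 − x)/(1 − t) 做欧拉积分来示意(真实方法中速度场由网络从成对嵌入学习得到,此处直接用闭式解替代):

```python
def transport(x, x1, steps=50):
    """沿 v(x, t) = (x1 - x) / (1 - t) 做欧拉积分,t 从 0 走到 1。
    该速度场是点对点线性概率路径的流匹配目标;积分终点恰为 x1。"""
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = [(b - a) / (1.0 - t) for a, b in zip(x, x1)]
        x = [a + dt * vi for a, vi in zip(x, v)]
    return x
```

与余弦相似度式的对比对齐不同,这种显式搬运直接消除两个嵌入之间的几何距离,对应论文“弥合模态间隙”的出发点。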
[CV-207] Bias-constrained multimodal intelligence for equitable and reliable clinical AI
【速读】:该论文旨在解决医疗影像与临床文本融合中普遍存在的偏倚问题,如疾病流行率失衡、解剖区域分布偏斜、成像协议异质性及人群人口统计学差异等,这些问题严重影响了视觉-语言(Vision-Language)系统在真实临床场景中的公平性和可靠性。其解决方案的关键在于提出 BiasCareVL 框架,该框架将偏倚控制直接嵌入模型设计阶段,而非作为事后修正手段;通过引入自适应不确定性建模机制,并结合可选的人机协同精调(human-in-the-loop refinement),有效抑制主导数据模式的影响,促进在分布不平衡下的公平推理能力。该方法在涵盖15种成像模态的344万样本上训练,支持多种临床任务(如视觉问答、疾病分类、分割和报告生成),并在8个公开基准测试中显著优于20种前沿方法,尤其在多类皮肤病变诊断和小肿瘤分割等挑战性任务中表现突出。
链接: https://arxiv.org/abs/2604.16884
作者: Cheng Li,Weijian Huang,Jiarun Liu,Hao Yang,Qi Yang,Song Wu,Ye Li,Hairong Zheng,Shanshan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The integration of medical imaging and clinical text has enabled the emergence of generalist artificial intelligence (AI) systems for healthcare. However, pervasive biases, such as imbalanced disease prevalence, skewed anatomical region distributions, heterogeneous imaging protocols, and demographic disparities, pose significant challenges to the fairness and reliability of vision-language systems in real-world clinical settings. Here we present BiasCareVL, a bias-aware multimodal learning framework that introduces bias control directly into model design, rather than treating it as a post hoc correction. BiasCareVL incorporates adaptive uncertainty modeling with optional human-in-the-loop refinement to regulate the influence of dominant data patterns and to promote equitable reasoning under distributional imbalance. Trained on 3.44 million samples spanning over 15 imaging modalities, the framework supports diverse clinical tasks, including visual question answering, disease classification, segmentation, and report generation within a unified representation space. Across eight public benchmarks covering dermatology, oncology, radiology, and pathology, BiasCareVL consistently outperforms 20 state-of-the-art methods, with pronounced gains in clinically challenging scenarios, including over 10% accuracy improvement in multi-class skin lesion diagnosis and more than 20% Dice improvement in small tumor segmentation. Furthermore, BiasCareVL achieves diagnostic performance exceeding human accuracy with substantially reduced time requirements when evaluated with board-certified radiologists. By open-sourcing BiasCareVL, we aim to promote a transparent, reproducible, and equitable future for AI in healthcare, paving the way for general-purpose, trustworthy, and clinically reliable AI systems.
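摘要未给出“自适应不确定性建模”的具体公式;这里仅以最常见的逆频率类权重作为缓解分布不平衡的通用示意(并非 BiasCareVL 的原始机制,函数名与归一化方式均为假设):

```python
import math
from collections import Counter

def inverse_freq_weights(labels):
    """按类频率的倒数给各类分配损失权重,并归一化到均值为 1(常见做法)。"""
    counts = Counter(labels)
    raw = {c: 1.0 / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

def weighted_nll(probs, labels, weights):
    """带类权重的负对数似然:稀有类样本的梯度贡献被相应放大。"""
    return sum(-weights[y] * math.log(p) for p, y in zip(probs, labels)) / len(labels)
```

这类重加权只是抑制“主导数据模式”的最朴素手段;论文进一步结合不确定性估计与人机协同来决定加权强度。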
[CV-208] Adaptive Forensic Feature Refinement via Intrinsic Importance Perception
【速读】:该论文旨在解决合成图像检测(SID)中跨分布泛化能力不足的问题,即模型在面对未知生成源时的性能下降。现有基于视觉基础模型(VFM)的方法在适配策略上较为粗粒度,通常直接使用最终层表示或简单融合多层特征,缺乏对可迁移伪造线索最优表征层次的显式建模;同时,直接微调VFM虽能增强任务适应性,但可能破坏支持开放集泛化的跨模态预训练结构。为缓解这一任务特异性与结构保持之间的矛盾,论文提出I2P框架,其核心在于将VFM适配重构为联合优化问题:一方面自适应识别最利于SID的判别性表征层,另一方面在低敏感度参数子空间内约束任务驱动的参数更新,从而在提升任务特异性的同时最大限度保留预训练表示的可迁移结构。
链接: https://arxiv.org/abs/2604.16879
作者: Jiazhen Yang,Junjun Zheng,Kejia Chen,Xiangheng Kong,Jie Lei,Zunlei Feng,Bingde Hu,Yang Gao
机构: Zhejiang University(浙江大学); Zhejiang University of Technology(浙江工业大学); Alibaba-inc(阿里巴巴)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid development of generative models and multimodal content editing technologies, the key challenge faced by synthetic image detection (SID) lies in cross-distribution generalization to unknown generation sources. In recent years, visual foundation models (VFM), which acquire rich visual priors through large scale image-text alignment pretraining, have become a promising technical route for improving the generalization ability of SID. However, existing VFM-based methods remain relatively coarse-grained in their adaptation strategies. They typically either directly use the final layer representations of VFM or simply fuse multi layer features, lacking explicit modeling of the optimal representational hierarchy for transferable forgery cues. Meanwhile, although directly fine-tuning VFM can enhance task adaptation, it may also damage the cross-modal pretrained structure that supports open-set generalization. To address this task specific tension, we reformulate VFM adaptation for SID as a joint optimization problem: it is necessary both to identify the critical representational layer that is more suitable for carrying forgery discriminative information and to constrain the disturbance caused by task knowledge injection to the pretrained structure. Based on this, we propose I2P, an SID framework centered on intrinsic importance perception. I2P first adaptively identifies the critical layer representations that are most discriminative for SID, and then constrains task-driven parameter updates within a low sensitivity parameter subspace, thereby improving task specificity while preserving the transferable structure of pretrained representations as much as possible.
[CV-209] CCAR: Intrinsic Robustness as an Emergent Geometric Property
【速读】:该论文旨在解决标准监督学习在优化预测准确性时忽视特征空间内部几何结构的问题,导致学习到的表示存在纠缠(entangled)和脆弱性(brittle)的缺陷。解决方案的关键在于提出类条件激活正则化(Class-Conditional Activation Regularization, CCAR),通过引入软归纳偏置(soft inductive bias)强制特征空间呈现块对角结构,使不同类别的特征能量被限制在正交子空间中,从而构建内在的几何骨架,实现对噪声和对抗扰动的自然过滤。理论分析进一步表明,该结构约束等价于最大化Fisher判别比(Fisher Discriminant Ratio),建立了几何解耦与算法稳定性的形式关联。
链接: https://arxiv.org/abs/2604.16861
作者: Akash Samanta,Manish Pratap Singh,Debasis Chaudhuri
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Standard supervised learning optimizes for predictive accuracy but remains agnostic to the internal geometry of learned features, often yielding representations that are entangled and brittle. We propose Class-Conditional Activation Regularization (CCAR) to explicitly engineer the feature space, imposing a block-diagonal structure via a soft inductive bias. By shaping the latent representation to confine class energy to orthogonal subspaces, we create an intrinsic geometric scaffold that naturally filters noise and adversarial perturbations. We provide theoretical analysis linking this structural constraint to the maximization of the Fisher Discriminant Ratio, establishing a formal connection between geometric disentanglement and algorithmic stability. Empirically, this approach demonstrates that robustness is an emergent property of a well-engineered feature space, significantly outperforming baselines on label noise and input corruption benchmarks.
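“将各类特征能量限制在正交子空间、诱导块对角结构”可以用一个朴素的惩罚项来示意(这里把特征维度平均切分给各类别;相对于论文中的软归纳偏置,这是一个强简化的硬划分版本):

```python
def ccar_penalty(features, labels, n_classes):
    """类条件激活正则的示意:惩罚落在“非本类子空间”维度上的激活能量。
    惩罚为零当且仅当每个样本的激活完全集中在其类别对应的维度块内。"""
    dim = len(features[0])
    block = dim // n_classes          # 假设维度可被类别数整除
    penalty = 0.0
    for f, y in zip(features, labels):
        lo, hi = y * block, (y + 1) * block
        penalty += sum(v * v for d, v in enumerate(f) if not (lo <= d < hi))
    return penalty / len(features)
```

训练时将该项加到分类损失上,即可把特征空间推向块对角几何;论文进一步论证这种结构约束与最大化 Fisher 判别比等价。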
[CV-210] Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
【速读】:该论文旨在解决当前基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的图像质量评估(Image Quality Assessment, IQA)方法在指导生成模型和图像修复时存在的局限性,即这些方法通常采用单一视角、纯语言输出的范式,缺乏对质量退化位置与原因的显式定位与可解释推理,导致其反馈不够可靠,难以支持闭环优化。解决方案的关键在于提出Q-DeepSight框架,该框架通过模拟人类“边看边思考”的认知过程,引入交错式多模态思维链(interleaved Multimodal Chain-of-Thought, iMCoT),结合工具增强的证据获取机制(如裁剪放大),实现对图像中质量下降区域及其成因的精准诊断;同时设计感知课程奖励(Perceptual Curriculum Reward, PCR)与证据梯度过滤(Evidence Gradient Filtering, EGF)两项技术,以缓解强化学习中的奖励稀疏性和提升视觉引导推理的信用分配精度,从而显著提升IQA模型的可解释性与实用性,并成功应用于无需训练的感知生成框架(Perceptual-in-Generation, PiG)中,实现从评估到迭代优化的闭环控制。
链接: https://arxiv.org/abs/2604.16858
作者: Xudong Li,Jiaxi Tan,Ziyin Zhou,Yan Zhong,Zihao Huang,Jingyuan Zheng,Yan Zhang,Xiawu Zheng,Rongrong Ji
机构: Xiamen University (厦门大学); Peking University (北京大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image Quality Assessment (IQA) models are increasingly deployed as perceptual critics to guide generative models and image restoration. This role demands not only accurate scores but also actionable, localized feedback. However, current MLLM-based methods adopt a single-look, language-only paradigm, which departs from human evidence-seeking judgment and yields weakly grounded rationales, limiting their reliability for in-the-loop refinement. We propose Q-DeepSight, a think-with-image framework that emulates this human-like process. It performs interleaved Multimodal Chain-of-Thought (iMCoT) with tool-augmented evidence acquisition (e.g., crop-and-zoom) to explicitly determine where quality degrades and why. To train these long iMCoT trajectories via reinforcement learning, we introduce two techniques: Perceptual Curriculum Reward (PCR) to mitigate reward sparsity and Evidence Gradient Filtering (EGF) to improve credit assignment for visually-grounded reasoning. Q-DeepSight achieves state-of-the-art performance across diverse benchmarks, including natural, restored, and AI-generated content. Furthermore, we demonstrate its practical value with Perceptual-in-Generation (PiG), a training-free framework where Q-DeepSight’s diagnoses guide iterative image enhancement, effectively closing the loop between assessment and refinement.
[CV-211] When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization
【速读】:该论文旨在解决基于Transformer的伪装目标检测(Camouflaged Object Detection, COD)模型在低比特量化(如W4A4)下性能显著下降的问题。研究发现,COD任务存在一种特定于任务的“悬崖效应”:背景令牌(background tokens)呈重尾分布,主导共享激活范围,导致量化步长增大,使本应保留的弱但结构化的边界线索被压缩至零桶(zero bin),从而破坏检测精度。解决方案的关键在于识别并突破这一“令牌局部瓶颈”——通过引入COD-TDQ方法,采用两个耦合步骤实现:1)直接求和令牌分组(Direct-Sum Token-Group, DSTG)为每个令牌组分配独立缩放因子,抑制跨令牌范围主导;2)双约束范围投影(Dual-Constraint Range Projection, DCRP)对每个令牌组的裁剪范围进行优化,同时控制步长与分布比(step-to-dispersion ratio)和零桶质量(zero-bin mass)。该方法在四个COD基准和两种基线模型上均显著优于现有无微调量化方法,平均Sα分数提升超过0.12。
链接: https://arxiv.org/abs/2604.16855
作者: Tianqi Li,Wenyu Fang,Xin He,Xue Geng,Xu Cheng,Yun Liu
机构: Nankai University (南开大学); Tianjin University of Technology (天津理工大学); Institute for Infocomm Research, A*STAR (新加坡资讯通信研究院); Academy for Advanced Interdisciplinary Studies, Nankai University (南开大学先进交叉学科研究院); Nankai International Advanced Research Institute, Shenzhen Futian (南开大学深圳福田国际研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Camouflaged object detection (COD) segments objects that intentionally blend with the background, so predictions depend on subtle texture and boundary cues. COD is often needed under tight on-device memory and latency budgets, making low-bit inference highly desirable. However, COD is unusually hard to quantize aggressively. We study post-training W4A4 quantization of Transformer-based COD and find a task-specific cliff: heavy-tailed background tokens dominate a shared activation range, inflating the step size and pushing weak-but-structured boundary cues into the zero bin. This exposes a token-local bottleneck – remove cross-token range domination and bound the zero-bin mass under 4-bit activations. To address this, we introduce COD-TDQ, a COD-aware Token-group Dual-constraint activation Quantization method. COD-TDQ addresses this token-local bottleneck with two coupled steps: Direct-Sum Token-Group (DSTG) assigns token-group scales to suppress cross-token range domination, and Dual-Constraint Range Projection (DCRP) projects each token-group clip range to keep the step-to-dispersion ratio and the zero-bin mass bounded. Across four COD benchmarks and two baseline models (CFRN and ESCNet), COD-TDQ consistently achieves an Sα score more than 0.12 higher than that of the state-of-the-art quantization method without retraining. The code will be released.
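“背景重尾令牌主导共享量化范围、把微弱边界线索压进零桶”的现象,可以用 4-bit 对称量化直接复现;为令牌组分配独立缩放因子(DSTG 思想的简化版)即可保住弱信号。数值为构造的玩具数据,校准方式(absmax)也只是常见做法之一:

```python
def absmax_scale(vals, bits=4):
    """absmax 校准:用最大绝对值确定量化缩放因子(即步长)。"""
    qmax = 2 ** (bits - 1) - 1
    m = max(abs(v) for v in vals)
    return m / qmax if m > 0 else 1.0

def quantize(vals, scale, bits=4):
    """对称均匀量化再反量化(4 bit 时整数级别约为 [-8, 7])。"""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in vals]

# 重尾背景令牌 vs 微弱但有结构的边界令牌
background = [12.0, -10.0, 9.5, 8.0, -7.5]
boundary = [0.3, -0.25, 0.2]

shared = absmax_scale(background + boundary)   # 共享缩放:被背景幅值主导
grouped = absmax_scale(boundary)               # 令牌组独立缩放
```

共享缩放下步长约为 12/7 ≈ 1.7,所有边界激活都被舍入到零桶;分组缩放则把边界组的动态范围用满,弱线索得以保留。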
[CV-212] CATP: Confidence-Aware Token Pruning for Camouflaged Object Detection
【速读】:该论文旨在解决基于Transformer的伪装目标检测(Camouflaged Object Detection, COD)模型在实际部署中因计算开销过大而导致效率低下的问题。其解决方案的关键在于提出一种分层的置信度感知令牌剪枝框架(Hierarchical Confidence-Aware Token Pruning, CATP),通过层级化识别并剔除背景和目标内部中易于区分的令牌,将计算资源集中于关键边界区域;同时引入双路径特征补偿机制,从被剪枝的令牌中聚合上下文信息以弥补信息损失,从而在显著降低计算复杂度的同时保持高检测精度。
链接: https://arxiv.org/abs/2604.16854
作者: Yuhan Gao,Shuhao Kang,Xin He,Bing Li,Xu Cheng,Yun Liu
机构: Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Camouflaged Object Detection (COD) aims to segment targets that share extreme textural and structural similarities with their complex environments. Leveraging their capacity for long-range dependency modeling, Transformer-based detectors have become the mainstream approach and achieve state-of-the-art (SoTA) accuracy, yet their substantial computational overhead severely limits practical deployment. To address this, we propose a hierarchical Confidence-Aware Token Pruning framework (CATP) tailored for COD. Our approach hierarchically identifies and discards easily distinguishable tokens from both background and object interiors, focusing computations on critical boundary tokens. To compensate for information loss from pruning, we introduce a dual-path feature compensation mechanism that aggregates contextual knowledge from pruned tokens into enriched features. Extensive experiments on multiple COD benchmarks demonstrate that our method significantly reduces computational complexity while maintaining high accuracy, offering a promising research direction for the efficient deployment of COD models in real-world scenarios. The code will be released.
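“剪掉高置信的背景/目标内部令牌、保留边界附近的不确定令牌,并把被剪令牌聚合为补偿向量”的流程可简化如下(阈值 tau 与均值聚合均为示意性选择,真实方法的双路径补偿更复杂):

```python
def prune_tokens(tokens, fg_probs, tau=0.3):
    """置信度感知剪枝:|p - 0.5| >= tau 的令牌(明确属于背景或目标内部)被剪掉;
    被剪令牌的均值作为补偿上下文向量追加到保留序列末尾,以缓解信息损失。"""
    kept, pruned = [], []
    for t, p in zip(tokens, fg_probs):
        (kept if abs(p - 0.5) < tau else pruned).append(t)
    if pruned:
        dim = len(pruned[0])
        kept.append([sum(t[d] for t in pruned) / len(pruned) for d in range(dim)])
    return kept
```

后续 Transformer 层只需处理保留下来的少量边界令牌(外加一个补偿向量),这正是计算量随剪枝率下降的来源。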
[CV-213] Applications of deep generative models to DNA reaction kinetics and to cryogenic electron microscopy
【速读】:该论文旨在解决两个关键生物学问题:一是DNA反应动力学模拟结果的可解释性不足,二是冷冻电镜(cryo-EM)密度图解读与蛋白质结构建模中的精度和效率问题。针对第一个问题,作者提出ViDa框架,其核心是将生物物理先验知识融入变分自编码器(Variational Autoencoders, VAEs)和几何散射变换(Geometric Scattering Transforms)中,生成符合生物物理规律的低维嵌入空间,从而可视化DNA杂交及引物介导的链置换反应,并识别反应路径;针对第二个问题,解决方案的关键在于引入生成式模型——Struc2mapGAN通过生成对抗网络合成高保真实验类cryo-EM密度图,而CryoSAMU则利用结构感知的多模态U-Net架构,结合蛋白质语言模型提取的结构嵌入与密度特征,通过交叉注意力机制提升中分辨率cryo-EM图的质量,实现更精确的蛋白质结构建模。
链接: https://arxiv.org/abs/2604.16851
作者: Chenwei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
备注: PhD Thesis
Abstract:This dissertation explores how deep generative models can advance the analysis of challenging biological problems by integrating domain knowledge with deep learning. It focuses on two areas: DNA reaction kinetics and cryogenic electron microscopy (cryo-EM). In the first part, we present ViDa, a biophysics-informed framework leveraging variational autoencoders (VAEs) and geometric scattering transforms to generate biophysically-plausible embeddings of DNA reaction kinetics simulations. These embeddings are reduced to a two-dimensional space to visualize DNA hybridization and toehold-mediated strand displacement reactions. ViDa preserves structure and clusters trajectory ensembles into reaction pathways, making simulation results more interpretable and revealing new mechanistic insights. In the second part, we address key challenges in cryo-EM density map interpretation and protein structure modeling. We provide a comprehensive review and benchmarking of deep learning methods for atomic model building, with improved evaluation metrics and practical guidance. We then present Struc2mapGAN, a generative adversarial network that synthesizes high-fidelity experimental-like cryo-EM density maps from protein structures. Finally, we present CryoSAMU, a structure-aware multimodal U-Net that enhances intermediate-resolution cryo-EM maps by integrating density features with structural embeddings from protein language models via cross-attention. Overall, these contributions demonstrate the potential of deep generative models to interpret DNA reaction mechanisms and advance cryo-EM density map analysis and protein structure modeling.
[CV-214] TowerDataset: A Heterogeneous Benchmark for Transmission Corridor Segmentation with a Global-Local Fusion Framework
【速读】:该论文旨在解决输电走廊点云细粒度语义分割中存在的现实数据稀缺性问题,以及在长距离、异构场景中建模全局结构与局部几何细节的挑战。现有公开数据集通常仅提供粗粒度类别或短片段场景,忽视了长程结构依赖、严重长尾分布及关键部件间的细微差异,导致当前方法难以在真实巡检条件下评估其性能。解决方案的关键在于提出TowerDataset这一异构基准数据集(包含661个真实场景、约24.66亿个点),并设计一种全局-局部融合框架:通过无裁剪训练和原型对比学习捕捉全场景拓扑关系,同时利用块级局部分支保留精细几何结构,并通过几何验证机制融合与优化预测结果,从而有效整合互补的全局与局部信息,提升对稀有和易混淆组件的识别能力。
链接: https://arxiv.org/abs/2604.16848
作者: Xu Cui,Xinyan Liu,Chen Yang,Zhaobo Qi,Beichen Zang,Weigang Zhang,Antoni B. Chan
机构: Harbin Institute of Technology (Weihai)(哈尔滨工业大学(威海) ); University of the Chinese Academy of Sciences(中国科学院大学); City University of Hong Kong(香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-grained semantic segmentation of transmission-corridor point clouds is fundamental for intelligent power-line inspection. However, current progress is limited by realistic data scarcity and the difficulty of modeling global corridor structure and local geometric details in long, heterogeneous scenes. Existing public datasets usually provide only a few coarse categories or short cropped scenes which overlook long-range structural dependencies, severe long-tail distributions, and subtle distinctions among safety-critical components. As a result, current methods are difficult to evaluate under realistic inspection settings, and their ability to preserve and integrate complementary global and local cues remains unclear. To address the above challenges, we introduce TowerDataset, a heterogeneous benchmark for transmission-corridor segmentation. TowerDataset contains 661 real-world scenes and about 2.466 billion points. It preserves long corridor extents, defines a fine-grained 22-class taxonomy, and provides standardized splits and evaluation protocols. In addition, we present a global-local fusion framework which preserves and fuses whole-scene and local-detail information. A whole-scene branch with NoCrop training and prototypical contrastive learning captures long-range topology and contextual dependencies. A block-wise local branch retains fine geometric structures. Both predictions are then fused and refined by geometric validation. This design allows the model to exploit both global relationships and local shape details when recognizing rare and confusing components. Experiments on TowerDataset and two public benchmarks demonstrate the challenge of the proposed benchmark and the robustness of our framework in real, complex, and heterogeneous transmission-corridor scenes. The dataset will be released soon at this https URL. 

[CV-215] When Earth Foundation Models Meet Diffusion: An Application to Land Surface Temperature Super-Resolution
【速读】:该论文旨在解决地表温度(Land Surface Temperature, LST)超分辨率重建问题,尤其是在极端空间降尺度(32×)条件下,如何从低分辨率热红外观测中恢复高分辨率细节结构。其核心挑战在于粗分辨率观测严重欠定细尺度信息,导致传统方法难以实现高质量重建。解决方案的关键在于提出Earth Foundation Model-guided Diffusion (EFDiff) 框架,利用Prithvi-EO-2.0地球基础模型(Earth Foundation Model, EFM)将高分辨率多光谱反射率编码为地理空间嵌入(geospatial embeddings),并通过交叉注意力机制(cross-attention)将其注入去噪网络,从而引导从高度退化的观测中进行精细尺度重建。实验表明,该方法在全局多样数据集上显著优于基线模型,且交叉注意力条件化比通道拼接更有效,具备良好的泛化能力。
链接: https://arxiv.org/abs/2604.16841
作者: Yiheng Chen,Zihui Ma,Peishi Jiang,Yilong Dai,Qikai Hu,Xinyue Ye,Lingyao Li,Rita Sousa,Runlong Yu
机构: University of Alabama (阿拉巴马大学); Emory University (埃默里大学); University of Michigan (密歇根大学); University of South Florida (南佛罗里达大学); New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Land surface temperature (LST) super-resolution is important for environmental monitoring. However, it remains challenging as coarse thermal observations severely underdetermine fine-scale structure. In this paper, we propose Earth Foundation Model-guided Diffusion (EFDiff), a novel framework for super-resolution under extreme spatial degradation. EFDiff uses the Prithvi-EO-2.0 Earth foundation model to encode high-resolution multispectral reflectance into geospatial embeddings, which are injected into the denoising network via cross-attention to guide fine-scale reconstruction from highly degraded observations. We study two variants, EFDiff-ε and EFDiff-x₀, which offer complementary trade-offs between perceptual realism and pixel-level fidelity. We evaluate EFDiff under an extreme 32× scale gap using a globally diverse benchmark comprising 242,416 co-registered Landsat thermal-reflectance patches. Results show that EFDiff consistently outperforms baseline methods and that cross-attention conditioning by EFM is more effective than HLS channel concatenation. Although we present EFDiff in the context of LST super-resolution, the framework is broadly applicable to remote sensing problems in which pretrained geospatial representations can guide generative reconstruction.
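交叉注意力条件化的核心操作可用如下示意代码说明(仅为单头、无训练细节的简化草图,张量维度均为假设,并非 EFDiff 的实际实现):

```python
import numpy as np

def cross_attention(x, cond, Wq, Wk, Wv):
    """单头交叉注意力:以去噪特征 x 为 query、条件嵌入 cond 为 key/value,
    以残差方式把条件信息注入 x。"""
    q, k, v = x @ Wq, cond @ Wk, cond @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # 数值稳定的 softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return x + attn @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))       # 去噪网络中间特征(形状为假设)
cond = rng.normal(size=(4, 32))     # EFM 输出的地理空间嵌入(形状为假设)
Wq, Wk, Wv = rng.normal(size=(16, 16)), rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
out = cross_attention(x, cond, Wq, Wk, Wv)
```

与通道拼接相比,这种注入方式让每个空间位置自适应地选取与其最相关的条件 token。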
[CV-216] Lorentz Framework for Semantic Segmentation
【速读】:该论文旨在解决传统语义分割方法在建模层次结构时的效率与不确定性量化不足的问题,尤其是基于庞加莱球模型(Poincaré ball model)的超参数优化不稳定、计算复杂度高且难以与现有欧几里得架构兼容的局限性。其解决方案的关键在于提出一种适用于超曲面 Lorentz 模型的通用语义分割框架,该框架无需依赖黎曼优化器即可实现稳定高效的优化,并通过结合文本嵌入与视觉线索引导像素级表示在 Lorentz 空间中的层次化建模,从而自然地提供不确定度估计、置信图、边界细化、基于文本的检索及零样本泛化能力。此外,作者还引入了 Lorentz 锥嵌入中的新型不确定性和置信度指标,并通过梯度分析揭示了 Lorentz 优化的内在机制,实验验证了该方法在多个主流数据集上的有效性与普适性。
链接: https://arxiv.org/abs/2604.16836
作者: Zahid Hasan,Masud Ahmed,Nirmalya Roy
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Semantic segmentation in hyperbolic space enables compact modeling of hierarchical structure while providing inherent uncertainty quantification. Prior approaches predominantly rely on the Poincaré ball model, which suffers from numerical instability, optimization, and computational challenges. We propose a novel, tractable, architecture-agnostic semantic segmentation framework (pixel-wise and mask classification) in the hyperbolic Lorentz model. We employ text embeddings with semantic and visual cues to guide hierarchical pixel-level representations in Lorentz space. This enables stable and efficient optimization without requiring a Riemannian optimizer, and easily integrates with existing Euclidean architectures. Beyond segmentation, our approach yields free uncertainty estimation, confidence map, boundary delineation, hierarchical and text-based retrieval, and zero-shot performance, reaching generalized flatter minima. We introduce a novel uncertainty and confidence indicator in Lorentz cone embeddings. Further, we provide analytical and empirical insights into Lorentz optimization via gradient analysis. Extensive experiments on ADE20K, COCO-Stuff-164k, Pascal-VOC, and Cityscapes, utilizing state-of-the-art per-pixel classification models (DeepLabV3 and SegFormer) and mask classification models (mask2former and maskformer), validate the effectiveness and generality of our approach. Our results demonstrate the potential of hyperbolic Lorentz embeddings for robust and uncertainty-aware semantic segmentation. Code is available at this https URL.
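洛伦兹模型的基本运算(洛伦兹内积、原点指数映射与测地距离)可写成如下示意(这些是标准的双曲几何定义,与论文的具体网络实现无关):

```python
import numpy as np

def lorentz_inner(u, v):
    """洛伦兹内积 <u, v>_L = -u0*v0 + sum_i ui*vi。"""
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(-1)

def expmap0(x):
    """原点指数映射:把欧氏(切空间)向量提升到单位曲率双曲面上。"""
    r = np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12
    return np.concatenate([np.cosh(r), np.sinh(r) * x / r], axis=-1)

def lorentz_dist(u, v):
    """双曲面上的测地距离 d(u, v) = arccosh(-<u, v>_L)。"""
    return np.arccosh(np.clip(-lorentz_inner(u, v), 1.0, None))

p = expmap0(np.array([0.3, -0.1]))
o = expmap0(np.array([0.0, 0.0]))   # 双曲面"原点" (1, 0, 0)
```

双曲面上任意点满足约束 <p, p>_L = -1,这也是该模型数值上比庞加莱球更稳定的原因之一。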
[CV-217] Hierarchical Vision Transformer Enhanced by Graph Convolutional Network for Image Classification
【速读】:该论文旨在解决Vision Transformer (ViT) 在图像分类任务中面临的三个关键问题:(1) 图像块(patch)尺寸选择对预测精度影响显著,如何合理选取或融合不同尺度的patch;(2) 传统一维位置嵌入(1D position embeddings)难以准确捕捉patch在二维空间中的结构信息;(3) 图卷积网络(GCN)虽能建模局部连接关系,但缺乏全局图结构建模能力,而ViT可捕获全局关系却无法有效建模局部结构。解决方案的关键在于提出一种分层增强型视觉Transformer(GCN-HViT),其核心创新包括:设计分层ViT架构以在多层级上同时建模局部与全局patch交互关系,并引入GCN作为局部特征提取器,将每个patch的局部表示作为二维位置嵌入(2D position embeddings)输入至ViT,从而显式建模patch间的局部结构信息并提升整体表征能力。
链接: https://arxiv.org/abs/2604.16823
作者: Haibin Jiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision Transformer (ViT) has brought new breakthroughs to the field of image classification by introducing the self-attention mechanism, and Graph Convolutional Networks (GCN) have been proposed and successfully applied in data representation and analysis. However, there are key challenges which limit their further development: (1) The patch size selected by ViT is crucial for accurate predictions, which raises a natural question: how to select the size of patches properly, or how to comprehensively combine small patches and larger patches; (2) While spatial structure information is important in vision tasks, 1D position embeddings fail to capture the spatial structure of patches accurately; (3) GCN can capture the local connectivity relationships between image nodes, but it lacks the ability to capture global graph structural information. On the contrary, the self-attention mechanism of ViT can draw global relations between image patches, but it is unable to model the local structure of images. To overcome such limitations, we propose the Hierarchical Vision Transformer Enhanced by Graph Convolutional Network (GCN-HViT) for image classification. Specifically, the Hierarchical ViT we designed can model patch-wise information interactions on a global scale within each level and model hierarchical relationships between small patches and large patches across multiple levels. In addition, the proposed GCN method functions as a local feature extractor to obtain the local representation of each image patch, which serves as a 2D position embedding of each patch in 2D space. Meanwhile, it models patch-wise information interactions on a local scale within each level. Extensive experiments on 3 real-world datasets demonstrate that GCN-HViT achieves state-of-the-art performance.
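将 GCN 输出作为 patch 的 2D 位置嵌入,其单层传播可示意如下(图的构建方式与各维度均为假设,非 GCN-HViT 的实际实现):

```python
import numpy as np

def gcn_layer(A, X, W):
    """单层 GCN:H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)。
    输出逐 patch 的局部结构表示,可作为 2D 位置嵌入与 ViT 的 token 相加。"""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0)

# 4 个 patch 排成 2x2 网格,按上下/左右 4 邻接建图(示意输入)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # patch 特征
W = rng.normal(size=(8, 8))
pos_emb = gcn_layer(A, X, W)
```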
[CV-218] Modeling Biomechanical Constraint Violations for Language-Agnostic Lip-Sync Deepfake Detection
【速读】:该论文旨在解决当前语音同步深度伪造检测方法在跨语言场景下泛化能力不足的问题,其根本原因在于现有方法依赖像素级伪影或音频-视觉对应关系,而这些特征编码的是数据相关的模式而非普适的物理规律。解决方案的关键在于识别并利用一个更基础的生理学原理:生成式模型未强制执行真实口腔运动的生物力学约束,导致伪造视频中唇部时间动态方差显著升高,即所谓的“时间唇部抖动”(temporal lip jitter)——这一信号在不同说话者语言、种族和录制条件下均具有一致性。作者通过 BioLip 框架实现了该原理,该框架基于 MediaPipe 提取的 64 个口周关键点坐标进行轻量级分析,从而有效区分真实与伪造唇部运动。
链接: https://arxiv.org/abs/2604.16808
作者: Hao Chen,Junnan Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures. Keywords: deepfake detection, lip-sync forgery, biomechanical constraints, temporal kinematics, cross-lingual generalization, privacy-preserving detection, geometric features
Abstract:Current lip-sync deepfake detectors rely on pixel-level artifacts or audio-visual correspondence, failing to generalize across languages because these cues encode data-dependent patterns rather than universal physical laws. We identify a more fundamental principle: generative models do not enforce the biomechanical constraints of authentic orofacial articulation, producing measurably elevated temporal lip variance – a signal we term temporal lip jitter – that is empirically consistent across the speaker’s language, ethnicity, and recording conditions. We instantiate this principle through BioLip, a lightweight framework operating on 64 perioral landmark coordinates extracted by MediaPipe.
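"时间唇部抖动"这一信号本质上是关键点帧间位移的方差,可用如下示意代码计算(判决阈值与归一化方式为假设,非 BioLip 的实际实现):

```python
import numpy as np

def temporal_lip_jitter(landmarks):
    """landmarks: (T, 64, 2),逐帧的 64 个口周关键点坐标。
    返回帧间位移幅度的方差均值,作为 jitter 得分(越高越可疑)。"""
    disp = np.linalg.norm(np.diff(landmarks, axis=0), axis=-1)  # (T-1, 64)
    return float(disp.var(axis=0).mean())

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 50)
# 64 个关键点沿平滑闭合曲线运动(模拟真实发音的连贯形变)
smooth = np.stack([np.stack([np.sin(t) + 0.01 * i, np.cos(t)], -1) for i in range(64)], 1)
jittery = smooth + rng.normal(scale=0.05, size=smooth.shape)    # 叠加高频抖动(模拟伪造)
score_real = temporal_lip_jitter(smooth)
score_fake = temporal_lip_jitter(jittery)
```

由于该统计量只依赖几何坐标而非像素纹理,它天然与语言、肤色和录制条件无关。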
[CV-219] Channel Attention-Guided Cross-Modal Knowledge Distillation for Referring Image Segmentation
【速读】:该论文旨在解决参考图像分割(Referring Image Segmentation, RIS)任务中,现有基于大规模视觉-语言编码模型的方法因参数量过大而难以在计算资源受限场景下部署的问题。其解决方案的关键在于提出一种通道注意力引导的跨模态知识蒸馏方法,通过迁移教师网络中视觉与语言之间的高阶细粒度关联以及各通道所表征语义组件间的关联,使学生网络在不引入额外推理参数的前提下,既学习到教师的知识,又保留部分独立学习能力,从而缓解学习偏差的传递并显著提升学生模型性能。
链接: https://arxiv.org/abs/2604.16806
作者: Chen Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures
Abstract:Referring image segmentation (RIS) requires accurate segmentation of target regions in images according to language descriptions, which is a cross-modal task integrating vision and language. Existing RIS methods typically employ large-scale vision and language encoding models to improve performance, but their enormous parameter size severely restricts deployment in scenarios with limited computing resources. To solve this problem, this paper proposes a channel attention-guided cross-modal knowledge distillation method, which transfers the high-order fine-grained correlations between vision and language learned by the teacher network, as well as the correlations between semantic components represented by each channel, to the student network. Compared with the traditional pixel-wise relational distillation, this method not only enables the student to learn the knowledge of the teacher, but also retains part of its independent learning ability, alleviating the transfer of learning bias. Experimental results on two public datasets show that the proposed distillation method does not introduce additional parameters during inference and can achieve significant performance improvement for the student model.
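通道注意力引导的蒸馏损失可示意为:以教师的通道注意力分布为软目标,用 KL 散度约束学生(具体损失形式为假设的简化,非论文原始公式):

```python
import numpy as np

def channel_attention(feat):
    """feat: (C, H, W) -> 基于全局平均池化 + softmax 的通道注意力分布。"""
    pooled = feat.mean(axis=(1, 2))
    e = np.exp(pooled - pooled.max())
    return e / e.sum()

def ca_distill_loss(f_teacher, f_student):
    """以教师通道注意力为软目标,用 KL 散度约束学生的通道分布。"""
    p, q = channel_attention(f_teacher), channel_attention(f_student)
    return float((p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum())

rng = np.random.default_rng(0)
ft = rng.normal(size=(16, 8, 8))    # 教师特征(维度为假设)
fs = rng.normal(size=(16, 8, 8))    # 学生特征
loss = ca_distill_loss(ft, fs)
```

这类损失只在训练时使用,推理阶段学生网络不引入任何额外参数,与摘要所述一致。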
[CV-220] Frequency-Decomposed INR for NIR-Assisted Low-Light RGB Image Denoising
【速读】:该论文旨在解决低光照条件下可见光图像中严重噪声和高频结构退化的问题。其解决方案的关键在于提出了一种基于频域解耦隐式神经表示(Frequency Decoupled Implicit Neural Representation, FDINR)的近红外(NIR)辅助低光图像恢复方法。该方法利用RGB与NIR跨模态频率相关性的统计先验——即低频RGB信号更可靠,而高频NIR信号具有更高相关性——通过多尺度小波变换显式分离图像频率成分,并构建双分支隐式神经表示框架;在此基础上设计了跨模态差异化频率监督机制:由低光RGB引导低频亮度与颜色重建,利用高信噪比(high-SNR)NIR信号约束高频纹理细节生成,从而实现频域互补优势;同时引入基于不确定性的自适应加权损失函数,自动平衡不同频率任务的贡献,有效缓解传统方法因空间域刚性融合导致的颜色失真与伪影问题。
链接: https://arxiv.org/abs/2604.16800
作者: Ligen Shi,Zengyu Pang,Chang Liu,Shuchen Sun,Jun Qiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures
Abstract:Addressing the issues of severe noise and high frequency structural degradation in visible images under low-light conditions, this paper proposes a Near Infrared (NIR) aided low light image restoration method based on Frequency Decoupled Implicit Neural Representation (FDINR). Based on the statistical prior of RGB-NIR cross-modal frequency correlations, specifically that low-frequency RGB signals are more reliable, whereas high frequency NIR signals exhibit higher correlation, we explicitly decompose images into distinct frequency components via multi-scale wavelet transforms and construct a dual-branch implicit neural representation framework. Within this framework, we design a cross modal differentiated frequency supervision mechanism, leveraging low light RGB to guide the reconstruction of low frequency luminance and color, and utilizing high-SNR NIR signals to constrain the generation of high frequency texture details, thereby achieving complementary advantages in the frequency domain. Furthermore, an uncertainty-based adaptive weighting loss function is introduced to automatically balance the contributions of different frequency tasks, solving the problems of color distortion and artifacts caused by rigid fusion in the spatial domain common in traditional methods. Experimental results demonstrate that FDINR not only effectively restores image luminance consistency and structural details but also, benefitting from its implicit continuous representation, outperforms existing methods in arbitrary-resolution reconstruction tasks, significantly enhancing the reliability of low light perception.
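多尺度小波分解中最简单的单级 2D Haar 变换可写作如下(仅演示低/高频子带的显式分离;滤波器与归一化为常见约定,非 FDINR 的实际实现):

```python
import numpy as np

def haar_dwt2(img):
    """单级 2D Haar 小波:返回低频 LL 与高频 (LH, HL, HH) 子带。"""
    a = (img[0::2, :] + img[1::2, :]) / 2    # 行方向低通
    d = (img[0::2, :] - img[1::2, :]) / 2    # 行方向高通
    LL = (a[:, 0::2] + a[:, 1::2]) / 2
    LH = (a[:, 0::2] - a[:, 1::2]) / 2
    HL = (d[:, 0::2] + d[:, 1::2]) / 2
    HH = (d[:, 0::2] - d[:, 1::2]) / 2
    return LL, (LH, HL, HH)

img = np.arange(64, dtype=float).reshape(8, 8)
LL, (LH, HL, HH) = haar_dwt2(img)
# 低频子带 LL 由低光 RGB 监督亮度/颜色;高频子带由高信噪比 NIR 约束纹理
```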
[CV-221] Generative Semantic Communication via Alternating Dual-Domain Posterior Sampling
【速读】:该论文旨在解决生成式语义通信(Generative Semantic Communication, SemCom)中因采用最大后验估计(Maximum a Posteriori, MAP)导致的数据分布无法保持、从而限制感知质量的问题。现有基于扩散模型的接收端方法在单域引导下存在局限性:潜空间引导对信道噪声敏感,图像空间引导则继承解码器偏差;而直接联合使用两域会导致伪后验过自信。论文的关键解决方案是将语义解码建模为贝叶斯逆问题,并证明后验采样可最优地保持数据分布;在此基础上提出交替双域后验采样(Alternating Dual-Domain Posterior Sampling, ADDPS),通过在采样过程中交替施加潜空间与图像空间的一致性约束,分解联合后验采样为更易处理的子问题,避免梯度冲突并保留双域优势,显著提升感知质量。
链接: https://arxiv.org/abs/2604.16796
作者: Shunpu Tang,Qianqian Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP)
备注:
Abstract:Generative semantic communication (SemCom) harnesses pretrained generative priors to improve the perceptual quality of wireless image transmission. Existing generative SemCom receivers, however, rely on maximum a posteriori (MAP) estimation, which fundamentally cannot preserve the data distribution and thus limits achievable perceptual quality. Moreover, current diffusion-based approaches using single-domain guidance face significant limitations: latent-domain guidance is sensitive to channel noise, while image-domain guidance inherits decoder bias. Simply combining both domains simultaneously yields an overconfident pseudo-posterior. In this paper, we formulate semantic decoding as a Bayesian inverse problem and prove that posterior sampling achieves optimal perceptual quality by preserving the data distribution. Building on this insight, we propose alternating dual-domain posterior sampling (ADDPS), a diffusion-based SemCom receiver that alternately enforces latent-domain and image-domain consistency during the sampling process. This alternating strategy decomposes joint posterior sampling into simpler subproblems, avoiding gradient conflicts while retaining the complementary strengths of both domains. Experiments on FFHQ demonstrate that the proposed ADDPS achieves superior perceptual quality compared with existing methods.
[CV-222] Improving Radio Interferometry Imaging by Explicitly Modeling Cross-Domain Consistency in Reconstruction
【速读】:该论文旨在解决射电天文成像中因观测数据稀疏性导致的图像伪影问题,现有方法通常局限于单一域(脏图或可见度)的重建,忽视了可见度域与图像域之间的互补信息和跨域一致性建模,从而限制了成像质量的提升。其解决方案的关键在于提出CDCRec方法,通过设计一种分层多任务、多阶段的框架,显式建模两个域间的交叉一致性,并引入自监督的互补建模策略,以增强跨域相关性的提取能力,从而在受限源域数据下更有效地恢复密集信息,显著提升干涉测量数据的重建性能。
链接: https://arxiv.org/abs/2604.16794
作者: Kai Cheng,Ruoqi Wang,Qiong Luo
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Radio astronomy plays a crucial role in understanding the universe, particularly within the realm of non-thermal astrophysics. Images of celestial objects are derived from the signals (called visibility) measured by radio telescopes. Such imaging results, called dirty images, contain artifacts due to factors such as sparsity and therefore require reconstruction to improve imaging quality. Existing methods typically restrict reconstruction to a unimodal domain, either to the dirty image after imaging or to the sparse visibility prior to imaging. Focusing solely on each unimodal reconstruction results in the loss of complementary in-context information in either the visibility or image domain, leading to an incomplete modeling of mutual dependency and consistency. To address these challenges, we propose CDCRec, a multimodal radio interferometric data reconstruction method that explicitly models cross-domain consistency. We design a hierarchical multi-task and multi-stage framework to enhance the exploration of interplays between domains during reconstruction. Our experimental results demonstrate that CDCRec improves imaging performance through enhanced cross-domain correlation extraction. In particular, our self-supervised complementary modeling strategy is better than current methods at interferometric domain translations that rely heavily on recovering dense information from constrained source-domain data.
[CV-223] Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度物体识别任务中表现不足,而CLIP类模型虽擅长细粒度识别但缺乏对通用物体类别覆盖的问题。解决方案的关键在于提出HyMOR(Hybrid Multi-granularity open-ended Object Recognition)框架,其核心是融合MLLM与CLIP模型:MLLM负责开放域、粗粒度的物体识别,CLIP则专注于特定领域(如动植物)的细粒度识别,从而实现跨语义粒度的精准物体理解,显著提升了多模态内容生成和交互式学习场景中的感知准确性。
链接: https://arxiv.org/abs/2604.16785
作者: Hanling Yi,Feng Lin,Mao Luo,Yifan Yang,Xiaotian Yu,Rong Xiao
机构: Intellifusion Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% while improving general object recognition by 2.5% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
[CV-224] EdgeVTP: Exploration of Latency-efficient Trajectory Prediction for Edge-based Embedded Vision Applications
【速读】:该论文旨在解决高速公路场景下车辆轨迹预测(Vehicle Trajectory Prediction, VTP)在路侧边缘设备部署时面临的实时性与准确性权衡问题,尤其关注端到端延迟的确定性和可预测性。解决方案的关键在于提出EdgeVTP框架,其核心创新包括:1)采用交互感知的图建模结合轻量级Transformer骨干网络,实现高效特征提取;2)设计单次曲线解码器(one-shot curve decoder),将未来轨迹表示为以最新观测位置为锚点的紧凑曲线参数,而非逐点自回归预测,显著降低解码开销并生成平滑轨迹;3)通过引入具有硬邻居上限的局部图(locality graph)显式限制交互复杂度,确保在高密度交通场景下运行时延可预测。该方案在三个高速公路基准数据集和两个Jetson类边缘平台上验证了其在保证最低实测端到端延迟的同时,达到或接近当前最优(SotA)的预测精度。
链接: https://arxiv.org/abs/2604.16783
作者: Seungjin Kim,Reza Jafarpourmarzouni,Christopher Neff,Hamed Tabkhi,Vinit Katariya
机构: University of Wyoming (怀俄明大学); University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校); North Carolina AT State University (北卡罗来纳农业技术州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vehicle trajectory prediction is central to highway perception, but deployment on roadside edge devices necessitates bounded, deterministic end-to-end latency. We present EdgeVTP, an embedded-first trajectory predictor that combines interaction-aware graph modeling with a lightweight transformer backbone and a one-shot curve decoder. By predicting future motion as compact curve parameters (anchored at the last observed position) rather than horizon-scaled autoregressive waypoints, EdgeVTP reduces decoding overhead while producing smooth trajectories. To keep runtime predictable in crowded scenes, we explicitly bound interaction complexity via a locality graph with a hard neighbor cap. Across three highway benchmarks and two Jetson-class platforms, EdgeVTP achieves the lowest measured end-to-end latency under a protocol that includes graph construction and post-processing, while attaining state-of-the-art (SotA) prediction accuracy on two of the three datasets and competitive error on other benchmarks. Our code is available at this https URL.
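"单次曲线解码"的思想可示意为:网络一次性输出以最后观测位置为锚点的多项式系数,再在整个预测时域上解析展开(曲线族与系数含义均为假设,非 EdgeVTP 的实际参数化):

```python
import numpy as np

def decode_curve(anchor, coeffs, horizon, dt=0.1):
    """anchor: 最后观测位置 (x, y);coeffs: (K, 2),网络一次性预测的曲线系数。
    轨迹 p(t) = anchor + sum_k coeffs[k] * t^(k+1),t=0 时恰好经过锚点。"""
    t = (np.arange(1, horizon + 1) * dt)[:, None]               # (H, 1)
    powers = t ** np.arange(1, coeffs.shape[0] + 1)[None, :]    # (H, K)
    return anchor[None, :] + powers @ coeffs                    # (H, 2)

anchor = np.array([10.0, 2.0])
coeffs = np.array([[5.0, 0.0],    # 一次项(近似速度)
                   [0.5, 0.1]])   # 二次项(近似加速度)
traj = decode_curve(anchor, coeffs, horizon=25)
```

与逐点自回归相比,解码开销不随预测时域线性增长,且多项式形式天然保证轨迹平滑。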
[CV-225] FairNVT: Improving Fairness via Noise Injection in Vision Transformers ICLR2026
【速读】:该论文旨在解决预训练Transformer编码器在下游任务中存在敏感属性偏见的问题,即模型在表示学习和预测层面均可能引入不公平性,从而影响公平决策。解决方案的关键在于提出一种轻量级去偏框架FairNVT,其核心机制包括:通过轻量适配器(adapter)学习任务相关与敏感属性嵌入;对敏感嵌入施加校准高斯噪声后与任务表示融合;结合正交约束与公平性正则化,协同降低嵌入中敏感属性泄露并促进下游预测的公平性。该方法在不牺牲任务准确率的前提下,显著提升了代表性公平性和预测公平性指标。
链接: https://arxiv.org/abs/2604.16780
作者: Qiaoyue Tang,Sepidehsadat Hosseini,Mengyao Zhai,Thibaut Durand,Greg Mori
机构: University of British Columbia (不列颠哥伦比亚大学); RBC Borealis (加拿大皇家银行博瑞斯); Simon Fraser University (西蒙菲莎大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICLR 2026 Algorithmic Fairness Across Alignment Procedures and Agentic Systems (AFAA) Workshop
Abstract:This paper presents FairNVT, a lightweight debiasing framework for pretrained transformer-based encoders that improves both representation and prediction level fairness while preserving task accuracy. Unlike many existing debiasing approaches that address these notions separately, we argue they are inherently connected: suppressing sensitive information at the representation level can facilitate fairer predictions. Our approach learns task-relevant and sensitive embeddings via lightweight adapters, applies calibrated Gaussian noise to the sensitive embedding, and fuses it with the task representation. Together with orthogonality constraints and fairness regularization, these components jointly reduce sensitive-attribute leakage in the learned embeddings and encourage fairer downstream predictions. The framework is compatible with a wide range of pretrained transformer encoders. Across three datasets spanning vision and language, FairNVT reduces sensitive-attribute attacker accuracy, improves demographic-parity and equalized-odds metrics, and maintains high task performance.
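向敏感嵌入注入校准高斯噪声并与任务嵌入融合、同时施加正交约束,可示意如下(噪声标定方式与融合形式为假设的简化,非 FairNVT 的实际实现):

```python
import numpy as np

def fair_fuse(task_emb, sens_emb, sigma=1.0, rng=None):
    """对敏感嵌入注入标定方差 sigma 的高斯噪声后与任务嵌入融合;
    同时返回正交约束项(两嵌入内积的平方)作为正则。"""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy_sens = sens_emb + rng.normal(scale=sigma, size=sens_emb.shape)
    fused = task_emb + noisy_sens
    ortho_penalty = float((task_emb @ sens_emb) ** 2)
    return fused, ortho_penalty

task = np.array([1.0, 0.0, 0.0, 0.0])
sens = np.array([0.0, 1.0, 0.0, 0.0])   # 与任务嵌入正交时正则为 0
fused, pen = fair_fuse(task, sens, sigma=0.5)
```

噪声压低了融合表示中敏感属性方向的信噪比,正交项则抑制任务表示对敏感信息的泄露。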
[CV-226] Frozen Vision Transformers for Dense Prediction on Small Datasets: A Case Study in Arrow Localization
【速读】:该论文旨在解决在小样本条件下对室内射箭靶面箭孔进行自动检测、定位与评分的问题,其核心挑战在于如何在仅有48张标注图像(共5,084个箭孔)的情况下实现高精度的密集预测。解决方案的关键在于:首先通过基于颜色的正则化阶段将透视畸变图像映射到标准化坐标系;其次利用冻结的自监督视觉Transformer(DINOv3 ViT-L/16)结合AnyUp引导特征上采样,从32×32的patch token中恢复亚毫米级空间精度;最后采用轻量级CenterNet风格检测头预测箭心热图。整个系统仅训练3.8M参数(占总参数308M的约1.2%),并在交叉验证中达到F1分数0.893±0.011和定位误差1.41±0.06 mm,优于以往需要大量标注数据的全监督方法,表明冻结基础模型结合最小任务适配是小数据场景下密集预测的有效范式。
链接: https://arxiv.org/abs/2604.16758
作者: Maxwell Shepherd
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We present a system for automated detection, localization, and scoring of arrow punctures on 40 cm indoor archery target faces, trained on only 48 annotated photographs (5,084 punctures). Our pipeline combines three components: a color-based canonical rectification stage that maps perspective-distorted photographs into a standardized coordinate system where pixel distances correspond to known physical measurements; a frozen self-supervised vision transformer (DINOv3 ViT-L/16) paired with AnyUp guided feature upsampling to recover sub-millimeter spatial precision from 32×32 patch tokens; and lightweight CenterNet-style detection heads for arrow-center heatmap prediction. Only 3.8M of 308M total parameters are trainable. Across three cross-validation folds, we achieve a mean F1 score of 0.893 ± 0.011 and a mean localization error of 1.41 ± 0.06 mm, comparable to or better than prior fully-supervised approaches that require substantially more training data. An ablation study shows that the CenterNet offset regression head, typically essential for sub-pixel refinement, provides negligible detection improvement while degrading localization in our setting. This suggests that guided feature upsampling already resolves the spatial precision lost through patch tokenization. On downstream archery metrics, the system recovers per-image average arrow scores with a median error of 1.8% and group centroid positions to within a median of 4.00 mm. These results demonstrate that frozen foundation models with minimal task-specific adaptation offer a practical paradigm for dense prediction in small-data regimes.
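由箭孔定位结果到环数评分的换算可示意如下(按 40 cm 靶面、10 个等宽同心环的常见规则;压线判定等细节为简化假设,非论文的评分实现):

```python
import numpy as np

def score_arrows(centers_mm, face_diameter_mm=400.0, rings=10):
    """按 40 cm 靶面、10 个等宽同心环评分:每环宽 20 mm,由内向外从 10 分递减,
    超出靶面记 0 分。centers_mm 为以靶心为原点的箭孔坐标(毫米)。"""
    ring_width = face_diameter_mm / (2 * rings)       # 20 mm
    r = np.linalg.norm(centers_mm, axis=-1)
    scores = rings - np.floor(r / ring_width)
    return np.clip(scores, 0, rings).astype(int)

centers = np.array([[3.0, 4.0],        # r = 5 mm   -> 10 环
                    [0.0, 25.0],       # r = 25 mm  -> 9 环
                    [150.0, 150.0]])   # 脱靶       -> 0
scores = score_arrows(centers)
```

正因为评分只依赖毫米级坐标,上游 1.41 mm 的定位精度才直接决定了评分误差的上限。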
[CV-227] TriTS: Time Series Forecasting from a Multimodal Perspective CVPR2026
【速读】:该论文旨在解决长期时间序列预测(Long-term Time Series Forecasting, LTSF)中因真实信号包含高度纠缠的时序动态而导致的传统一维(1D)建模方法难以充分捕捉复杂依赖关系的问题。其核心解决方案是提出TriTS框架,通过跨模态解耦策略将1D时间序列投影至正交的时间、频率与二维视觉(2D-vision)空间,从而打破单一维度表示瓶颈;关键创新在于引入周期感知重排(Period-Aware Reshaping)和视觉Mamba(Visual Mamba),在保持线性计算复杂度的同时建模跨周期依赖关系,并结合多分辨率小波混合(Multi-Resolution Wavelet Mixing, MR-WM)模块对非平稳信号进行趋势与噪声成分的显式分离,实现精细的时间-频率定位,最终通过动态融合三者互补表征提升模型适应性和预测性能。
链接: https://arxiv.org/abs/2604.16748
作者: Xiang Ao
机构: Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures. Accepted by the A2A-MML Workshop in conjunction with CVPR 2026
Abstract:Time series forecasting plays a pivotal role in critical sectors such as finance, energy, transportation, and meteorology. However, Long-term Time Series Forecasting (LTSF) remains a significant challenge because real-world signals contain highly entangled temporal dynamics that are difficult to fully capture from a purely 1D perspective. To break this representation bottleneck, we propose TriTS, a novel cross-modal disentanglement framework that projects 1D time series into orthogonal time, frequency, and 2D-vision spaces. To seamlessly bridge the 1D-to-2D modality gap without the prohibitive O(N^2) computational overhead of Vision Transformers (ViTs), we introduce a Period-Aware Reshaping strategy and incorporate Visual Mamba (Vim). This approach efficiently models cross-period dependencies as global visual textures while maintaining linear computational complexity. Complementing this, we design a Multi-Resolution Wavelet Mixing (MR-WM) module for the frequency modality, which explicitly decouples non-stationary signals into trend and noise components to achieve fine-grained time-frequency localization. Finally, a streaming linear branch is retained in the time domain to anchor numerical stability. By dynamically fusing these three complementary representations, TriTS effectively adapts to diverse data contexts. Extensive experiments across multiple benchmark datasets demonstrate that TriTS achieves state-of-the-art (SOTA) performance, fundamentally outperforming existing vision-based forecasters by drastically reducing both parameter count and inference latency.
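"周期感知重排"的核心是用 FFT 估计主周期,再把 1D 序列折叠为 2D 表示交给视觉骨干处理,可示意如下(周期估计方式为这类方法的常见做法,非 TriTS 的确切实现):

```python
import numpy as np

def dominant_period(x):
    """用 FFT 振幅谱(忽略直流分量)估计序列主周期。"""
    amp = np.abs(np.fft.rfft(x))
    freq = int(np.argmax(amp[1:]) + 1)
    return max(1, len(x) // freq)

def period_aware_reshape(x):
    """按主周期 p 折叠为 (周期数, 周期内相位) 的 2D 表示:
    列方向是周期内变化,行方向是跨周期依赖,可交给视觉骨干建模。"""
    p = dominant_period(x)
    n = (len(x) // p) * p
    return x[:n].reshape(-1, p), p

t = np.arange(96, dtype=float)
x = np.sin(2 * np.pi * t / 24)   # 主周期为 24 的模拟序列
img2d, p = period_aware_reshape(x)
```

折叠后跨周期依赖变成了 2D"纹理"的列间关系,线性复杂度的 Mamba 类骨干即可沿行方向扫描建模。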
[CV-228] Incoherent Deformation Not Capacity: Diagnosing and Mitigating Overfitting in Dynamic Gaussian Splatting
【速读】:该论文旨在解决动态3D高斯溅射(Dynamic 3D Gaussian Splatting, Dynamic 3DGS)在单目视频重建中训练视图与测试视图之间性能差距显著的问题,即过拟合现象。实验表明,在D-NeRF基准上平均训练-测试PSNR差距达6.18 dB,个别场景甚至高达11 dB。研究发现:过拟合主要由两个因素驱动——高斯点数量的过度增长(split操作导致)和变形场不一致性(deformation coherence缺失)。关键解决方案是引入弹性能量正则化(Elastic Energy Regularization, EER),通过约束每个高斯点的局部平滑性来提升变形场的一致性,从而有效降低PSNR差距(减少40.8%),并配合GAD(损失率感知的密度阈值)和PTDrop(抖动加权的高斯丢弃)进一步优化,最终实现57%的gap缩减。该方法不仅适用于原始架构,也在不同变形结构(Deformable-3DGS)和真实单目视频(HyperNeRF)中验证了泛化能力,说明过拟合本质源于变形不一致而非单纯参数量增加。
链接: https://arxiv.org/abs/2604.16747
作者: Ahmad Droby
机构: Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, 2 tables
Abstract:Dynamic 3D Gaussian Splatting methods achieve strong training-view PSNR on monocular video but generalize poorly: on the D-NeRF benchmark we measure an average train-test PSNR gap of 6.18 dB, rising to 11 dB on individual scenes. We report two findings that together account for most of that gap. Finding 1 (the role of splitting). A systematic ablation of the Adaptive Density Control pipeline (split, clone, prune, frequency, threshold, schedule) shows that splitting is responsible for over 80% of the gap: disabling split collapses the cloud from 44K to 3K Gaussians and the gap from 6.18 dB to 1.15 dB. Across all threshold-varying ablations, gap is log-linear in count (r = 0.995, bootstrap 95% CI [0.99, 1.00]), which suggests a capacity-based explanation. Finding 2 (the role of deformation coherence). We show that the capacity explanation is incomplete. A local-smoothness penalty on the per-Gaussian deformation field – Elastic Energy Regularization (EER) – reduces the gap by 40.8% while growing the cloud by 85%. Measuring per-Gaussian strain directly on trained checkpoints, EER reduces mean strain by 99.72% (median 99.80%) across all 8 scenes; on 8/8 scenes the median Gaussian under EER is less strained than the 1st-percentile (best-behaved) Gaussian under baseline. Alongside EER, we evaluate two further regularizers: GAD, a loss-rate-aware densification threshold, and PTDrop, a jitter-weighted Gaussian dropout. GAD+EER reduces the gap by 48%; adding PTDrop and a soft growth cap reaches 57%. We confirm that coherence generalizes to (a) a different deformation architecture (Deformable-3DGS, +40.6% gap reduction at re-tuned lambda), and (b) real monocular video (4 HyperNeRF scenes, reducing the mean PSNR gap by 14.9% at the same lambda as D-NeRF, with near-zero quality cost). The overfitting in dynamic 3DGS is driven by incoherent deformation, not parameter count. 
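弹性能量正则化(EER)的核心是约束相邻高斯点的变形保持局部一致。下面用 numpy 给出该思想的一个最小示意(对每个点惩罚其变形量与 k 近邻平均变形量的偏差);这只是"变形一致性"正则项形态的草图,并非论文的具体公式:

```python
import numpy as np

def elastic_energy(positions, deformations, k=4):
    # 对每个点, 惩罚其变形量与 k 近邻平均变形量的偏差,
    # 从而鼓励变形场局部平滑(示意 EER 思想, 非论文原始实现)
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                  # 排除自身
    nn = np.argsort(d2, axis=1)[:, :k]            # k 近邻索引
    resid = deformations - deformations[nn].mean(axis=1)
    return float((resid ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
pos = rng.normal(size=(32, 3))
rigid = np.tile([0.1, 0.2, 0.3], (32, 1))         # 刚性平移: 完全一致的变形场
incoherent = rng.normal(size=(32, 3))             # 随机变形: 不一致
```

刚性平移的变形场能量为零,而随机(不一致)变形被显著惩罚,这与论文"过拟合源于变形不一致"的诊断相呼应。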
[CV-229] Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
【速读】:该论文旨在解决免训练(training-free)的视觉Transformer(Vision Transformer, ViT)token压缩方法在高压缩比下出现的"悬崖式"性能崩溃问题(cliff-like collapse)。现有方法如ToMe、ToFu、PiToMe和MCTF虽采用不同的评分机制,但在高压缩率时均表现出相似的性能骤降现象。作者通过构建诊断框架,识别出两个关键成因:其一为层间压缩固有的信号无关误差放大器,导致帕累托曲线呈凸形且临界压缩率 $ r_\text{crit} \propto 1/L $;其二为对成对相似性信号(pairwise similarity signals)的共同依赖,其排序一致性从浅层到深层由 ρ_s=0.88 降至 0.27。成对排序本身具有高不稳定性(O(N_p²) 联合扰动),相较之下单值信号(unary signals)更为稳定(仅 O(N_p) 扰动,符合中心极限定理)。基于此诊断,提出CATIS方案,其核心设计原则为:利用单值信号提升触发阈值以减少误删,并引入分层筛选(triage)机制抑制增益,从而在ViT-Large模型上实现63%计算量减少的同时保持96.9%的原始准确率(ImageNet-1K),显著优于所有基线方法(仅43–65%)。
链接: https://arxiv.org/abs/2604.16745
作者: Yang Shanglin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency ρ_s and off-diagonal correlation ρ_off, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and r_crit ∝ 1/L; and (2) shared reliance on pairwise similarity signals whose ranking consistency degrades from ρ_s = 0.88 to 0.27 in deep layers. Pairwise rankings are inherently unstable (O(N_p²) joint perturbations) while unary signals enjoy greater stability (O(N_p) perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K where all baselines collapse to 43–65%.
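论文关于成对信号脆弱性的论证可以用一个小实验直观化:扰动单个 token 时,成对相似度矩阵中约 2N−1 个条目失效,而单值评分只有 1 个条目受影响。以下示意代码中的玩具特征与扰动方式均为假设的简化形式:

```python
import numpy as np

n, d = 16, 8
tokens = 0.1 + 0.1 * np.arange(n * d, dtype=float).reshape(n, d)  # 全正的玩具特征

sim = tokens @ tokens.T                       # 成对评分矩阵: O(n^2) 个条目
unary = np.linalg.norm(tokens, axis=1)        # 单值评分: O(n) 个条目

perturbed = tokens.copy()
perturbed[3] += 1.0                           # 仅扰动第 3 个 token

sim_changed = int(np.sum(np.abs(sim - perturbed @ perturbed.T) > 1e-9))
unary_changed = int(np.sum(np.abs(unary - np.linalg.norm(perturbed, axis=1)) > 1e-9))
# 成对评分中第 3 行与第 3 列 (共 2n-1 个条目) 全部失效, 单值评分仅 1 个条目变化
```

这正对应论文中 O(N_p²) 联合扰动与 O(N_p) 扰动的对比:基于成对分数的排序远比基于单值分数的排序脆弱。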
[CV-230] Automated Palynological Analysis System: Integrating Deep Metric Learning and U2-Net Detection in H∞ bright field microscopy
【速读】:该论文旨在解决传统蜜粉学(melissopalynology)分析过程耗时且主观性强的问题,通常每份样品需耗时4–6小时。其解决方案的关键在于构建一个自动化、高通量的显微成像系统,集成H∞鲁棒机械控制与先进的深度学习流水线,实现对智利比奥比奥地区花粉颗粒的精确计数、分类及形态学分析。系统采用U^2-Net进行显著目标检测,并基于DINOv2视觉Transformer骨干网络通过深度度量学习(Deep Metric Learning)实现分类;同时引入梯度加权注意力机制(Gradient-Weighted Attention),提供人类可解释的纹理和诊断特征标注,最终达到95.8%的分类召回率,并实现相比人工专家分析6倍的处理速度提升。
链接: https://arxiv.org/abs/2604.16743
作者: J. Staforelli-Vivanco,R. Jofré,B. Muñoz,V. Salamanca,P. Coelho,I. Sanhueza,L. Viafora,C. Toro,J. Troncoso,M. Rondanelli-Reyes,I. Lamas
机构: Universidad de Concepción (康塞普西翁大学); Universidad San Sebastián (圣塞巴斯蒂安大学); Universidad Andres Bello (安德烈斯贝洛大学); Universidad Arturo Prat (阿图罗普拉特大学); University of Concepción (康塞普西翁大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注: 14 pages, 16 figures
Abstract:Traditional melissopalynology is a time-consuming and subjective process, often taking 4-6 hours per sample. We present an automated, high-throughput microscopy system that integrates H∞ robust mechanical control with advanced deep learning pipelines for the precise counting, classification, and morphological analysis of pollen grains from the Biobío region in south-central Chile. Our system employs U²-Net for salient object detection and a DINOv2 Vision Transformer backbone trained via Deep Metric Learning for classification. By integrating Gradient-Weighted Attention, the model provides human-interpretable texture and diagnostic feature annotations. The system achieves a 95.8% classification recall and a 6x processing speedup compared to manual expert analysis.
[CV-231] Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines ACL2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中因存储大量视觉标记(vision tokens)于键值缓存(Key-Value Cache, KV Cache)而导致的内存消耗瓶颈问题。现有方法通常仅在所有输入处理完毕后才进行缓存压缩,导致预填充阶段(prefill stage)峰值内存占用过高。其解决方案的关键在于发现MLLM中存在内在结构规律和表征冗余,并提出一种顺序输入压缩机制(sequential input-compression mechanism),在预填充阶段即对KV缓存进行结构感知的压缩,从而在保持生成性能几乎不变的前提下显著降低峰值内存使用,实现更高效、实用的多模态推理。
链接: https://arxiv.org/abs/2604.16734
作者: Junwan Kim,Hyunkyung Bae
机构: New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ACL 2026
Abstract:Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale to richer visual representations, inference increasingly relies on storing large numbers of vision tokens in the key-value (KV) cache, making memory consumption a central bottleneck. Existing methods address this issue by identifying redundancy in vision tokens and compressing the cache, but such compression is typically applied only after all inputs are processed, resulting in high peak memory usage during the prefill stage. In this work, we show that MLLMs exhibit inherent structural regularities and representational redundancy that can be exploited to control memory growth throughout inference. Based on this insight, we propose a sequential input-compression mechanism that enforces a fixed memory budget by performing structure-aware key-value cache compression during the prefill process. This approach substantially reduces peak memory usage while maintaining generative performance with only minimal degradation, enabling more practical and memory-efficient multimodal inference.
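顺序输入压缩的要点是在预填充过程中逐块压缩 KV 缓存,而不是等全部输入处理完再压缩,从而压低峰值内存。下面用 numpy 给出一个示意(以行向量代表缓存条目、以范数代表重要性评分,均为假设的简化,非论文实现):

```python
import numpy as np

def prefill_with_budget(chunks, budget, score):
    # 每追加一块输入就把缓存压回 budget 条, 并记录全程峰值条目数
    cache = np.empty((0, chunks[0].shape[1]))
    peak = 0
    for c in chunks:
        cache = np.vstack([cache, c])
        peak = max(peak, len(cache))
        if len(cache) > budget:
            keep = np.argsort(score(cache))[-budget:]   # 保留评分最高的条目
            cache = cache[np.sort(keep)]                # 维持时间顺序
    return cache, peak

rng = np.random.default_rng(0)
chunks = [rng.normal(size=(50, 4)) for _ in range(4)]   # 共 200 个视觉 token
importance = lambda kv: np.linalg.norm(kv, axis=1)      # 假设的重要性评分
cache, peak = prefill_with_budget(chunks, budget=60, score=importance)
# 峰值 110 条, 远低于"处理完再压缩"的 200 条
```

对比之下,若在全部输入处理完后才压缩,峰值将达到完整序列长度 200;逐块压缩把峰值限制在 budget 加一块输入的规模。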
[CV-232] Active World-Model with 4D-informed Retrieval for Exploration and Awareness ICLR2026
【速读】:该论文旨在解决在大型动态环境中实现物理感知(Physical Awareness)的决策难题,特别是面对部分可观测性(Partial Observability)时,如何有效提升感知决策的质量。传统强化学习(Reinforcement Learning, RL)在全可观测场景中表现优异,但在部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)中仍面临现实探索成本高、仿真到现实(Sim-to-Real)迁移存在未观测视角等挑战。解决方案的关键在于提出AW4RE(Active World-model with 4D-informed Retrieval for Exploration),这是一种以感知为中心的生成式世界模型,通过结合4D信息引导的证据检索(4D-informed Evidence Retrieval)、动作条件几何支撑(Action-conditioned Geometric Support)与时间一致性(Temporal Coherence),以及条件生成补全(Conditional Generative Completion),实现对查询传感动作下观测过程的建模,从而提供一个传感器原生的替代环境用于感知查询探索,显著提升了在极端视角变化、时间间隙和稀疏几何支持下的预测准确性与一致性。
链接: https://arxiv.org/abs/2604.16733
作者: Elaheh Vaezpour,Amirhosein Javadi,Tara Javidi
机构: KavAI; UCSD
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, submitted to ICLR 2026 2nd Workshop on World Models
Abstract:Physical awareness, especially in a large and dynamic environment, is shaped by sensing decisions that determine observability across space, time, and scale, while observations impact the quality of sensing decisions. This loopy information structure makes physical awareness a fundamentally challenging decision problem with partial observations. While in the past decade we have witnessed the unprecedented success of reinforcement learning (RL) in problems with full observability, decision problems with partial observation, such as POMDPs, remain largely open: real-world explorations are excessively costly, while sim-to-real pipelines suffer from unobserved viewpoints. We introduce AW4RE (Active World-model with 4D-informed Retrieval for Exploration), an awareness-centric generative world model that provides a sensor-native surrogate environment for exploring sensing queries. Conditioned on a queried sensing action, AW4RE estimates the action-conditioned observation process. This is done by combining 4D-informed evidence retrieval, action-conditioned geometric support with temporal coherence, and conditional generative completion. Experiments demonstrate that AW4RE produces more grounded and consistent predictions than geometry-aware generative baselines under extreme viewpoint shifts, temporal gaps, and sparse geometric support.
[CV-233] Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在医学影像分析中缺乏原生三维空间推理能力的问题,尤其是在处理CT或MRI等体积医学图像时难以直接进行精准的结构识别与定量分析。其解决方案的关键在于提出一种无需训练的代理型人工智能(Agentic AI)框架,通过调用外部专用工具(如图像预处理、病灶分割和体积测量工具)来实现端到端的自动化脑部MRI分析流程,从而绕过对模型本身进行3D感知能力重构的需求。该方法利用多智能体协作机制模拟放射科专家分工,显著提升了复杂多步骤任务(如纵向疗效评估)的执行准确性与鲁棒性。
链接: https://arxiv.org/abs/2604.16729
作者: Ayhan Can Erdur,Daniel Scholz,Jiazhen Pan,Benedikt Wiestler,Daniel Rueckert,Jan C. Peeken
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent “domain-expert” collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve highly complex neuro-radiological image analysis tasks through tool use without the need for training or fine-tuning.
[CV-234] DocV2: Leveraging Self-Supervision and Open-Set Detection for Improving Pattern Spotting in Historical Documents
【速读】:该论文旨在解决历史文献中图形模式识别与文档检索效率低下的问题,当前最优方法在小尺寸非正方形查询上的精度仅为0.427,且处理时间过长(如DocExplore数据集需达7秒)。其解决方案的关键在于提出一种新型模型,采用改进的编码器iDoc(基于自监督训练)和开放集检测器,显著提升搜索速度(加速10倍)并实现小非正方形查询的新SOTA精度,该性能提升得益于非极大值抑制(non-maximum suppression, NMS)策略有效减少误检。
链接: https://arxiv.org/abs/2604.16726
作者: Jose M. Saavedra,Crhistopher Stears,Marcelo Pizarro,Cristóbal Loyola,Luis Aros
机构: University of Chile (智利大学); Pontifical Catholic University of Chile (智利天主教大学); University of Santiago, Chile (智利圣地亚哥大学); Universidad Técnica Federico Santa María (智利联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Considering the imminent massification of digital books, it has become critical to facilitate searching collections through graphical patterns. Current strategies for document retrieval and pattern spotting in historical documents still need to be improved. State-of-the-art strategies achieve an overall precision of 0.494 for pattern spotting, where the precision for small non-square queries reaches 0.427. In addition, the processing time is excessive, requiring up to 7 seconds for searching in the DocExplore dataset due to a dense-based strategy used by SOTA models. Therefore, we propose a new model based on a better encoder (iDoc), trained under a self-supervised strategy, and an open-set detector to accelerate searching. Our model achieves competitive results with state-of-the-art pattern spotting and document retrieval, improving speed by 10x. Furthermore, our model reaches a new SOTA precision on the small non-square queries. Unlike the previous version, this model leverages non-maximum suppression to reduce false positives.
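文中用于减少误检的非极大值抑制(NMS)是标准算法:按得分从高到低贪心保留候选框,剔除与已保留框 IoU 过高的重叠框。下面是一个最小 numpy 实现示意(非论文原始代码):

```python
import numpy as np

def iou(box, boxes):
    # 单个框与一组框的交并比, 框格式为 (x1, y1, x2, y2)
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # 贪心 NMS: 依得分从高到低保留, 剔除与已保留框 IoU 超过阈值的候选
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        order = order[1:][iou(boxes[i], boxes[order[1:]]) <= thresh]
    return keep

boxes = np.array([[0., 0., 10., 10.],
                  [1., 1., 10., 10.],
                  [20., 20., 30., 30.]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, thresh=0.5)   # 与第 0 号框高度重叠的第 1 号框被剔除
```

在模式检索场景中,同一图案往往被多个相近候选框重复命中,NMS 正是以这种方式压低误检。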
[CV-235] LOD-Net: Locality-Aware 3D Object Detection Using Multi-Scale Transformer Network
【速读】:该论文旨在解决点云数据中3D目标检测因稀疏性和缺乏全局结构而导致的性能瓶颈问题。其解决方案的关键在于将多尺度注意力(Multi-Scale Attention, MSA)机制嵌入到3DETR架构中,通过引入上采样操作生成高分辨率特征图,从而增强网络对局部几何细节和全局上下文信息的捕捉能力,尤其提升了小尺寸及语义相关物体的检测精度。
链接: https://arxiv.org/abs/2604.16696
作者: Mustaqeem Khan,Aidana Nurakhmetova,Wail Gueaieb,Abdulmotaleb El Saddik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:3D object detection in point cloud data remains a challenging task due to the sparsity and lack of global structure inherent in the input. In this work, we propose a novel Multi-Scale Attention (MSA) mechanism integrated into the 3DETR architecture to better capture both local geometry and global context. Our method introduces an upsampling operation that generates high-resolution feature maps, enabling the network to better detect smaller and semantically related objects. Experiments conducted on the ScanNetv2 dataset demonstrate that our 3DETR + MSA model improves detection performance, achieving a gain of almost 1% in mAP@25 and 4.78% in mAP@50 over the baseline. While applying MSA to the 3DETR-m variant shows limited improvement, our analysis reveals the importance of adapting the upsampling strategy for lightweight models. These results highlight the effectiveness of combining hierarchical feature extraction with attention mechanisms in enhancing 3D scene understanding.
[CV-236] Rewind-IL: Online Failure Detection and State Respawning for Imitation Learning
【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中基于示范的机器人操作策略在长时程任务中因执行偏差导致的部署失败问题,尤其是当策略偏离演示流形(demonstration manifold)后无法自我恢复的问题。现有运行时监控方法要么依赖失败数据、在正常特征漂移下误触发,要么仅检测失败而缺乏恢复机制。其解决方案的关键在于提出了一种无需训练的在线保护框架 Rewind-IL:该框架结合基于时间跨块差异估计(Temporal Inter-chunk Discrepancy Estimate, TIDE)的零样本失败检测器(经分割 conformal prediction 校准)与状态重置机制,通过语义验证的安全中间状态(recovery checkpoint)实现策略执行回溯与重启,从而提升模仿学习策略在真实世界和仿真环境中的可靠性。
链接: https://arxiv.org/abs/2604.16683
作者: Gehan Zheng,Sanjay Seenivasan,Matthew Johnson-Roberson,Weiming Zhi
机构: Vanderbilt University (范德堡大学); University of Waterloo (滑铁卢大学); The University of Sydney (悉尼大学); Australian Centre for Robotics (澳大利亚机器人中心)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 8 figures, 6 tables. Project page at this https URL
Abstract:Imitation learning has enabled robots to acquire complex visuomotor manipulation skills from demonstrations, but deployment failures remain a major obstacle, especially for long-horizon action-chunked policies. Once execution drifts off the demonstration manifold, these policies often continue producing locally plausible actions without recovering from the failure. Existing runtime monitors either require failure data, over-trigger under benign feature drift, or stop at failure detection without providing a recovery mechanism. We present Rewind-IL, a training-free online safeguard framework for generative action-chunked imitation policies. Rewind-IL combines a zero-shot failure detector based on Temporal Inter-chunk Discrepancy Estimate (TIDE), calibrated with split conformal prediction, with a state-respawning mechanism that returns the robot to a semantically verified safe intermediate state. Offline, a vision-language model identifies recovery checkpoints in demonstrations, and the frozen policy encoder is used to construct a compact checkpoint feature database. Online, Rewind-IL monitors self-consistency in overlapping action chunks, tracks similarity to the checkpoint library, and, upon failure, rewinds execution to the latest verified safe state before restarting inference from a clean policy state. Experiments on real-world and simulated long-horizon manipulation tasks, including transfer to flow-matching action-chunked policies, demonstrate that policy-internal consistency coupled with semantically grounded respawning offers a practical route to improved reliability in imitation learning. Supplemental materials are available at this https URL
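TIDE 检测器的阈值由分割式 conformal prediction(split conformal prediction)校准得到,其校准步骤可示意如下。校准分数与 α 的取值均为假设;代码给出的是标准 split conformal 分位数做法,并非论文原始实现:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    # split conformal 校准: 取校准分数的 ⌈(n+1)(1-α)⌉/n 分位数作为阈值,
    # 使正常样本分数超过阈值的概率不超过 α(标准做法, 非论文原始代码)
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q, 1.0), method="higher"))

rng = np.random.default_rng(0)
cal = rng.uniform(size=99)        # 假设: 来自正常执行轨迹的不一致性分数
tau = conformal_threshold(cal, alpha=0.1)

def flag(score):
    # 在线监控: 新的分数超过阈值即判为疑似失败
    return score > tau
```

(n+1) 的有限样本修正保证了无需任何失败数据即可控制误报率,这正是"零样本失败检测"所依赖的性质。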
[CV-237] C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion CVPR2026
【速读】:该论文旨在解决当前基于学习的3D点云配准方法在跨传感器模态、采样差异和环境变化下泛化能力不足的问题。解决方案的关键在于提出一种无需训练(training-free)的框架C-GenReg,其核心创新是通过世界基础模型(World Foundation Model)将几何点云转换为多视图一致的RGB表示,并借助预训练的注册导向视觉基础模型(Vision Foundation Model, VFM)在图像域中提取密集对应关系,再将其映射回3D空间;同时引入“匹配-融合”(Match-then-Fuse)的概率冷融合机制,联合生成RGB分支与原始几何分支的独立对应后验分布,从而保留各模态的归纳偏置并提供校准置信度,实现零样本(zero-shot)且即插即用的鲁棒配准性能。
链接: https://arxiv.org/abs/2604.16680
作者: Yuval Haitman,Amit Efraim,Joseph M. Francos
机构: Ben-Gurion University (本-古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:We introduce C-GenReg, a training-free framework for 3D point cloud registration that leverages the complementary strengths of world-scale generative priors and registration-oriented Vision Foundation Models (VFMs). Current learning-based 3D point cloud registration methods struggle to generalize across sensing modalities, sampling differences, and environments. Hence, C-GenReg augments the geometric point cloud registration branch by transferring the matching problem into an auxiliary image domain, where VFMs excel, using a World Foundation Model to synthesize multi-view-consistent RGB representations from the input geometry. This generative transfer preserves spatial coherence across source and target views without any fine-tuning. From these generated views, a VFM pretrained for finding dense correspondences extracts matches. The resulting pixel correspondences are lifted back to 3D via the original depth maps. To further enhance robustness, we introduce a “Match-then-Fuse” probabilistic cold-fusion scheme that combines two independent correspondence posteriors, that of the generated-RGB branch with that of the raw geometric branch. This principled fusion preserves each modality's inductive bias and provides calibrated confidence without any additional learning. C-GenReg is zero-shot and plug-and-play: all modules are pretrained and operate without fine-tuning. Extensive experiments on indoor (3DMatch, ScanNet) and outdoor (Waymo) benchmarks demonstrate strong zero-shot performance and superior cross-domain generalization. For the first time, we demonstrate a generative registration framework that operates successfully on real outdoor LiDAR data, where no imagery data is available.
[CV-238] Appearance-free Action Recognition: Zero-shot Generalization in Humans and a Two-Pathway Model
【速读】:该论文旨在解决人类和视觉模型在零样本(zero-shot)条件下对真实世界动作视频进行外观无关(appearance-free)变换时的泛化能力问题,即在缺乏静态形状线索的情况下,是否仍能有效识别动作。其解决方案的关键在于提出一种双通路3D卷积神经网络(3D CNN)架构,融合RGB(形式)流与光流(运动)流,并引入受格式塔共命运分组(Gestalt common-fate grouping)启发的相干门控机制(coherence-gating mechanism),从而增强模型对运动信息的利用能力。实验表明,该模型在两种外观无关视频数据集上均表现出良好泛化性能,且运动通路对零样本泛化至关重要,而形式通路提升自然视频上的识别准确率,显著缩小了模型与人类表现之间的差距。
链接: https://arxiv.org/abs/2604.16675
作者: Prerana Kumar(1 and 2),Martin A. Giese(1) ((1) Hertie Institute, University of Tuebingen, (2) IMPRS-IS)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Action recognition is a fundamental ability for social species. Yet, its underlying computations are not well understood. Classical psychophysical studies using simplified stimuli have shown that humans can perceive body motion even under degradation of relevant shape cues. Recent work using real-world action videos and their appearance-free counterparts (that preserve motion but lack static shape cues) included explicit training of humans and models on the appearance-free videos. Whether humans and vision models generalize in a zero-shot manner to appearance-free transformations of real-world action videos is not yet known. To measure this generalization in humans, we conducted a laboratory-based psychophysics experiment. 22 participants were trained to recognize five action categories using naturalistic videos (UCF5 dataset), and tested zero-shot on two types of appearance-free transformations: (i) dense-noise motion videos from an existing dataset (AFD5) and (ii) random-dot appearance-free videos. We find that participants recognize actions in both types of appearance-free videos well above chance, albeit with reduced accuracy compared to naturalistic videos. To model this behavior, we developed a two-pathway 3D CNN-based model combining an RGB (form) stream and an optical flow (motion) stream, including a coherence-gating mechanism inspired by Gestalt common-fate grouping. Our model generalizes to both appearance-free datasets and outperforms contemporary video classification models, narrowing the gap to human performance. We find that the motion pathway is critical for generalization to appearance-free videos, while the form pathway improves performance on naturalistic videos. Our findings highlight the importance of motion-based representations for generalization to appearance-free videos, and support the use of multi-stream architectures to model video-based action recognition.
[CV-239] A Benchmark Study of Segmentation Models and Adaptation Strategies for Landslide Detection from Satellite Imagery
【速读】:该论文旨在解决高分辨率卫星影像中滑坡检测的准确性与模型效率问题,尤其关注现代分割架构(如卷积神经网络CNN、基于Transformer的模型及大预训练基础模型)以及微调策略在该任务中的相对有效性。其解决方案的关键在于:首先,通过统一的训练和评估协议系统性地比较不同模型架构在GDCLD数据集上的表现,发现基于Transformer的模型具有较强的分割性能;其次,引入参数高效微调方法(如LoRA和AdaLoRA),在保持与全量微调相当精度的前提下,将可训练参数减少高达95%,显著提升了模型部署效率。
链接: https://arxiv.org/abs/2604.16663
作者: Md Kowsher,Weiwei Zhan,Chen Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Landslide detection from high-resolution satellite imagery is a critical task for disaster response and risk assessment, yet the relative effectiveness of modern segmentation architectures and fine-tuning strategies for this problem remains insufficiently understood. In this work, we present a systematic benchmarking study of convolutional neural networks, transformer-based segmentation models, and large pre-trained foundation models for landslide detection. Using the Globally Distributed Coseismic Landslide Dataset (GDCLD), we evaluate representative CNN- and transformer-based segmentation models alongside large pretrained foundation models under consistent training and evaluation protocols. In addition, we compare full fine-tuning with parameter-efficient fine-tuning methods, including LoRA and AdaLoRA, to assess their performance-efficiency trade-offs. Experimental results show that transformer-based models achieve strong segmentation performance, while parameter-efficient fine-tuning reduces trainable parameters by up to 95% with comparable accuracy to full fine-tuning. We further analyze generalization under distribution shift by comparing validation and held-out test performance.
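LoRA 之所以能把可训练参数减少 95% 左右,来自其低秩更新结构:冻结原权重 W,仅训练低秩增量 ΔW = B·A。下面的小例子计算两种方式的可训练参数对比(维度与秩均为示意值,并非论文配置):

```python
def lora_param_counts(d_in, d_out, r):
    # 全量微调需训练整块权重 W (d_out x d_in);
    # LoRA 冻结 W, 仅训练低秩增量 ΔW = B @ A (B: d_out x r, A: r x d_in)
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return full, lora

full, lora = lora_param_counts(1024, 1024, r=8)   # 维度与秩为示意值
reduction = 1 - lora / full                        # 该配置下可训练参数减少约 98%
```

只要 r 远小于 min(d_in, d_out),r·(d_in + d_out) 就远小于 d_in·d_out,这解释了摘要中"减少高达 95% 且精度相当"的参数效率来源。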
[CV-240] Tri-Modal Fusion Transformers for UAV-based Object Detection
【速读】:该论文旨在解决无人机(UAV)目标检测在复杂环境下的鲁棒性问题,特别是光照变化、运动模糊和场景动态等因素导致RGB图像特征退化的问题。为提升检测性能,作者提出了一种三模态融合框架,整合RGB、热成像(Long-Wave Infrared, LWIR)与事件相机(Event Camera)数据,利用双流分层视觉Transformer进行多模态特征提取,并设计了两种关键模块:模态感知门控交换(Modality-Aware Gated Exchange, MAGE)模块用于跨传感器通道与空间维度的门控融合;双向标记交换(Bidirectional Token Exchange, BiTE)模块实现token级双向注意力机制并结合深度可分离卷积优化特征表示。该方案通过一个包含10,489帧同步标注数据的新型无人机数据集进行系统验证,实验表明三模态融合显著优于双模态基线,且融合深度对性能影响显著,轻量化的CSSA变体可在极低计算开销下恢复大部分增益,从而首次建立了面向无人机场景的三模态目标检测基准与可扩展骨干网络架构。
链接: https://arxiv.org/abs/2604.16630
作者: Craig Iaboni,Pramod Abichandani
机构: New Jersey Institute of Technology (新泽西理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures
Abstract:Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise-pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector. We introduce a 10,489-frame UAV dataset with synchronized and pre-aligned RGB-thermal-event streams and 24,223 annotated vehicles across day and night flights. Through 61 controlled ablations, we evaluate fusion placement, mechanism (baseline MAGE+BiTE, CSSA, GAFF), modality subsets, and backbone capacity. Tri-modal fusion improves over all dual-modal baselines, with fusion depth having a significant effect and a lightweight CSSA variant recovering most of the benefit at minimal cost. This work provides the first systematic benchmark and modular backbone for tri-modal UAV-based object detection.
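MAGE 的通道门控交换思想可用一个极简示意说明:用一种模态的全局通道统计经 sigmoid 生成门控,逐通道重加权另一模态的特征。真实模块中门控由网络学习得到,此处只是无学习参数的玩具版本,并非论文实现:

```python
import numpy as np

def channel_gate(feat_a, feat_b):
    # feat_*: (C, H, W); 用 B 模态的全局平均池化统计生成 sigmoid 门控,
    # 逐通道重加权 A 模态特征(MAGE 思想的玩具示意, 门控实际应由网络学习)
    stats = feat_b.mean(axis=(1, 2))            # 全局平均池化 -> (C,)
    gate = 1.0 / (1.0 + np.exp(-stats))         # sigmoid, 取值 (0, 1)
    return feat_a * gate[:, None, None]

rgb = np.ones((8, 4, 4))
thermal = np.zeros((8, 4, 4))
fused = channel_gate(rgb, thermal)              # 零统计 -> 门控恒为 0.5
```

门控值始终落在 (0, 1),因此一种模态只会按比例放大或抑制另一种模态的通道,而不会直接覆盖其特征,这是门控融合相对于直接相加或拼接的关键差异。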
[CV-241] Amortized Inverse Kinematics via Graph Attention for Real-Time Human Avatar Animation
【速读】:该论文旨在解决从稀疏的3D关节位置中恢复完整骨骼朝向(即关节旋转)的问题,这在动画、机器人学和生物力学中是逆运动学(Inverse Kinematics, IK)的核心任务。由于仅凭关节位置无法唯一确定绕骨轴的扭转(twist),传统IK求解器依赖迭代优化,存在计算慢且对噪声敏感的问题。解决方案的关键在于提出IK-GAT——一种轻量级图注意力网络(Graph Attention Network),通过在骨骼父子结构图上进行消息传递来利用运动学约束;其创新性地采用以静息姿态骨帧为锚点的骨对齐世界坐标系表示法,显式建模扭转轴,并将旋转参数化为连续6D形式,结合SO(3)流形上的测地线损失与可选的前向运动学一致性正则项进行训练,从而实现单次前向传播即可输出可直接驱动绑定角色的局部旋转,具备高效率(CPU上>650 FPS)与鲁棒性。
链接: https://arxiv.org/abs/2604.16629
作者: Muhammad Saif Ullah Khan,Chen-Yu Wang,Tim Prokosch,Michael Lorenz,Bertram Taetz,Didier Stricker
机构: German Research Center for Artificial Intelligence (DFKI); RPTU Kaiserslautern-Landau; International University of Applied Sciences
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Inverse kinematics (IK) is a core operation in animation, robotics, and biomechanics: given Cartesian constraints, recover joint rotations under a known kinematic tree. In many real-time human avatar pipelines, the available signal per frame is a sparse set of tracked 3D joint positions, whereas animation systems require joint orientations to drive skinning. Recovering full orientations from positions is underconstrained, most notably because twist about bone axes is ambiguous, and classical IK solvers typically rely on iterative optimization that can be slow and sensitive to noisy inputs. We introduce IK-GAT, a lightweight graph-attention network that reconstructs full-body joint orientations from 3D joint positions in a single forward pass. The model performs message passing over the skeletal parent-child graph to exploit kinematic structure during rotation inference. To simplify learning, IK-GAT predicts rotations in a bone-aligned world-frame representation anchored to rest-pose bone frames. This parameterization makes the twist axis explicit and is exactly invertible to standard parent-relative local rotations given the kinematic tree and rest pose. The network uses a continuous 6D rotation representation and is trained with a geodesic loss on SO(3) together with an optional forward-kinematics consistency regularizer. IK-GAT produces animation-ready local rotations that can directly drive a rigged avatar or be converted to pose parameters of SMPL-like body models for real-time and online applications. With 374K parameters and over 650 FPS on CPU, IK-GAT outperforms VPoser-based per-frame iterative optimization without warm-start at significantly lower cost, and is robust to initial pose and input noise.
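摘要中提到的连续 6D 旋转表示与 SO(3) 测地线损失是文献中的标准构造(源自 Zhou et al. 的连续旋转参数化),可用 numpy 示意如下(非论文原始实现):

```python
import numpy as np

def rot6d_to_matrix(x6):
    # 6D 表示 -> 旋转矩阵: 对预测的两个 3 维向量做 Gram-Schmidt 正交化,
    # 第三列由叉积给出, 保证输出落在 SO(3) 上
    a, b = x6[:3], x6[3:]
    c1 = a / np.linalg.norm(a)
    b = b - (c1 @ b) * c1
    c2 = b / np.linalg.norm(b)
    c3 = np.cross(c1, c2)
    return np.stack([c1, c2, c3], axis=1)

def geodesic_loss(R1, R2):
    # SO(3) 上的测地距离: 相对旋转 R1^T R2 的旋转角
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

R_id = rot6d_to_matrix(np.array([1., 0., 0., 0., 1., 0.]))   # 恰为单位阵
```

相比四元数或欧拉角,6D 表示在欧氏空间中连续、无奇异点,配合测地线损失可直接度量预测与真值旋转之间的角度误差。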
[CV-242] AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
【速读】:该论文旨在解决多模态推理模型在音频-视觉(audio-visual)场景中因高质量标注数据稀缺而导致的训练困难问题。其核心挑战在于如何有效利用单模态教师模型生成高质量的跨模态推理轨迹,从而提升目标模型在复杂多模态任务中的表现。解决方案的关键在于提出AVRT框架:首先通过分别针对视觉和音频的专用推理模型生成独立的单模态推理轨迹,再使用大语言模型(LLM)作为融合器将二者合并为多模态推理轨迹;随后采用监督微调(SFT)冷启动策略对目标模型进行初步适配,再进入强化学习阶段进行大规模数据训练,从而实现跨模态能力迁移并显著提升性能。
链接: https://arxiv.org/abs/2604.16617
作者: Edson Araujo,Saurabhchand Bhati,M. Jehanzeb Mirza,Brian Kingsbury,Samuel Thomas,Rogerio Feris,James R. Glass,Hilde Kuehne
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:Recent advances in reasoning models have shown remarkable progress in text-based domains, but transferring those capabilities to multimodal settings, e.g., to allow reasoning over audio-visual data, still remains a challenge, in part because of the limited availability of high-quality reasoning data in targeted multimodal combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning traces from single-modality teacher models. We generate independent vision- and audio-reasoning traces via models specialized to reason over their respective modalities and merge the resulting traces with an LLM merger model. The resulting multimodal traces are used in a supervised fine-tuning (SFT) cold start to adapt the target model to audio-visual reasoning traces first, before training it in a second reinforcement learning stage on larger-scale data. Evaluated on seven audio-visual and audio benchmarks, our 3B and 7B parameter models achieve state-of-the-art results among models of comparable size including OmniBench and DailyOmni for audio-visual and MMAR for audio-only reasoning, showing that cross-modal training also transfers to single-modality tasks and establishing a new training pipeline for multimodal reasoning models.
[CV-243] IncepDeHazeGAN: Novel Satellite Image Dehazing ACCV2024
【速读】:该论文旨在解决单图像去雾(single-image dehazing)问题,即从受雾霾影响的遥感图像中恢复清晰、高质量的图像。其解决方案的关键在于提出了一种名为IncepDeHazeGAN的新型生成对抗网络(Generative Adversarial Network, GAN),该网络结合了Inception模块与多层特征融合机制:Inception模块实现多尺度特征提取,而多层特征融合设计则通过在不同卷积层间多次融合特征,高效复用和增强特征表示能力,从而提升去雾性能。实验表明,该方法在多个数据集上达到了当前最优效果。
链接: https://arxiv.org/abs/2604.16609
作者: Tejeswar Pokuri,Shivarth Rai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CV4DC Workshop, ACCV 2024
Abstract:Dehazing is a technique in computer vision for enhancing the visual quality of images captured in cloudy or foggy conditions. Dehazing helps to recover clear, high-quality images from haze-affected remote sensing data. In this study, we introduce IncepDeHazeGAN, a novel Generative Adversarial Network (GAN) that combines an Inception block with multi-layer feature fusion for the task of single-image dehazing. Utilizing the Inception block allows for multi-scale feature extraction. On the other hand, the multi-layer feature fusion design achieves efficient reuse of features as the features extracted at different convolution layers are fused several times. The Grad-CAM XAI technique has been applied to our network, highlighting the regions focused on by the network for dehazing and its adaptation to different haze conditions. Experiments demonstrate that our network achieves state-of-the-art results on several datasets.
[CV-244] Human Cognition in Machines: A Unified Perspective of World Models
【速读】:该论文旨在解决当前世界模型(World Models)研究中缺乏统一认知架构理论(Cognitive Architecture Theory, CAT)框架的问题,以准确评估各类模型是否真正具备类人认知能力。其解决方案的关键在于提出一个概念上统一的框架,全面整合CAT所定义的所有认知功能(包括记忆、感知、语言、推理、想象、动机和元认知),并据此识别出当前研究中的关键空白——尤其是内在动机(intrinsic motivation)和元认知(meta-cognition)的严重不足。作者进一步基于主动推理(active inference)和全局工作空间理论(global workspace theory)提出具体改进方向,并引入“知识型世界模型”(Epistemic World Models)这一新类别,用于支持科学发现任务的结构化知识操作,从而为未来研究提供更清晰的路径指引。
链接: https://arxiv.org/abs/2604.16592
作者: Timothy Rupprecht,Pu Zhao,Amir Taherin,Arash Akbari,Arman Akbari,Yumei He,Sean Duffy,Juyi Lin,Yixiao Chen,Rahul Chowdhury,Enfu Nan,Yixin Shen,Yifan Cao,Haochen Zeng,Weiwei Chen,Geng Yuan,Jennifer Dy,Sarah Ostadabbas,Silvia Zhang,David Kaeli,Edmund Yeh,Yanzhi Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:
Abstract:This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost “human-like” cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles in Cognitive Architecture Theory (CAT). We present a conceptual unified framework for world models that fully incorporates all the cognitive functions associated with CAT (i.e. memory, perception, language, reasoning, imagining, motivation, and meta-cognition) and identify gaps in the research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and meta-cognition remain drastically under-researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.
[CV-245] MambaKick: Early Penalty Direction Prediction from HAR Embeddings
【速读】:该论文旨在解决足球比赛中罚球点球(penalty kick)时守门员在极短时间内预测射门方向的问题,以提升其反应效率。解决方案的关键在于提出了一种基于学习的框架 MambaKick,它利用预训练的人体动作识别(Human Action Recognition, HAR)模型提取的时空嵌入特征,并结合轻量级的状态空间模型(Mamba)进行高效的时间序列聚合,从而实现对踢球者意图的低延迟预测。该方法不依赖显式的运动学重建或手工设计的生物力学特征,而是复用可迁移的时空表示,并引入简单上下文元数据(如场地侧和惯用脚)作为补充线索,有效降低了真实场景视频中的歧义性。实验表明,该方法在三分类和二分类任务中分别达到最高53.1%和64.5%的准确率,验证了结合预训练HAR表示与高效状态空间建模在实际体育视频中意图预测的有效性。
链接: https://arxiv.org/abs/2604.16588
作者: Henry O. Velesaca,David Freire-Obregon,Abel Reyes-Angulo,Steven Araujo,Angel Sappa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Penalty kicks in soccer are decided under extreme time constraints, where goalkeepers benefit from anticipating shot direction from the kicker's motion before or around ball contact. In this paper, MambaKick is presented as a learning-based framework for penalty direction prediction that leverages pretrained human action recognition (HAR) embeddings extracted from contact-centered short video segments and combines them with a lightweight temporal predictor. Rather than relying on explicit kinematic reconstruction or handcrafted biomechanical features, the approach reuses transferable spatiotemporal representations and utilizes selective state-space models (Mamba) for efficient sequence aggregation. Simple contextual metadata (e.g., field side and footedness) are also considered as complementary cues that may reduce ambiguity in real-world footage. Across a range of HAR backbones, MambaKick consistently improves or matches strong embedding baselines, achieving up to 53.1% accuracy for three classes and 64.5% for two classes under the proposed methodology. Overall, the results indicate that combining pretrained HAR representations with efficient state-space temporal modeling is a practical direction for low-latency intention prediction in real-world sports video. The code will be available at GitHub: this https URL
[CV-246] Real-Time Visual Attribution Streaming in Thinking Model
【速读】:该论文旨在解决多模态思维模型(multimodal thinking models)在生成代码或解答图像中的数学问题时,如何实现对视觉证据的实时、可信归因(visual attribution)问题。现有方法中,忠实的因果分析需依赖昂贵的重复反向传播或扰动操作,而注意力图(attention maps)虽能快速提供归因信息,却缺乏因果有效性。解决方案的关键在于提出一种“摊销化”(amortized)方法,通过从注意力特征中学习直接估计语义区域的因果效应,从而在保持与穷举因果方法相当的忠实性的同时,实现视觉归因的流式输出(visual attribution streaming),使用户能在模型推理过程中即时观察到支撑依据,而非事后回溯。
链接: https://arxiv.org/abs/2604.16587
作者: Seil Kang,Woojung Han,Junhyeok Kim,Jinyeong Kim,Youngeun Kim,Seong Jae Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access, they lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.
[CV-247] Camo-M3FD: A New Benchmark Dataset for Cross-Spectral Camouflaged Pedestrian Detection
【速读】:该论文旨在解决伪装行人检测(Camouflaged Pedestrian Detection)在安全关键场景下的识别难题,尤其针对行人与背景高度相似时的检测失效问题。现有研究多聚焦于生物体伪装检测,缺乏对人类目标在复杂环境中因视觉融合导致难以识别的系统性评估。解决方案的关键在于构建首个跨谱(可见光-热红外)伪装行人检测基准数据集 Camo-M3FD,其核心创新包括:基于量化指标筛选高前景-背景相似度的图像对、提供像素级标注掩膜,并建立标准化评估框架。实验表明,热红外信号虽能提供可靠定位线索,但仅靠单一模态仍无法准确还原结构细节,因此多模态融合策略是提升检测鲁棒性的必要手段。
链接: https://arxiv.org/abs/2604.16582
作者: Henry O. Velesaca,Andrea Mero,Guillermo A. Castillo,Angel D. Sappa
机构: ESPOL Polytechnic University (ESPOL理工学院); University of Granada (格拉纳达大学); Universitat Autònoma de Barcelona (巴塞罗那自治大学); Università della Svizzera italiana (意大利瑞士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Pedestrian detection is fundamental to autonomous driving, robotics, and surveillance. Despite progress in deep learning, reliable identification remains challenging due to occlusions, cluttered backgrounds, and degraded visibility. While multispectral detection, which combines visible and thermal sensors, mitigates poor visibility, the challenge of camouflaged pedestrians remains largely unexplored. Existing Camouflaged Object Detection (COD) benchmarks focus on biological species, leaving a gap in safety-critical human detection where targets blend into their surroundings. To address this, we introduce Camo-M3FD (derived from the M3FD dataset), a novel benchmark for cross-spectral camouflaged pedestrian detection, consisting of registered visible-thermal image pairs. The dataset is curated using quantitative metrics to ensure high foreground-background similarity. We provide high-quality pixel-level masks and establish a standardized evaluation framework using state-of-the-art COD models. Our results demonstrate that while thermal signals provide indispensable localization cues, multispectral fusion is essential for refining structural details. Camo-M3FD serves as a foundational resource for developing robust and safety-critical detection systems. The dataset is available on GitHub: this https URL
[CV-248] Multilevel neural networks with dual-stage feature fusion for human activity recognition
【速读】:该论文旨在解决人类活动识别(Human Activity Recognition, HAR)中模型性能受限于单一网络结构局限性的问题,通过构建多层级融合架构以充分利用不同神经网络(如卷积神经网络 CNN、长短期记忆网络 LSTM 及其混合结构)的互补优势。解决方案的关键在于提出一种两层级网络框架,并引入双阶段特征融合机制:第一阶段为中间融合(intermediate fusion),将第一层与第二层网络的特征进行整合;第二阶段为晚期融合(late fusion),对第一层输出结果进行聚合。实验表明,同时采用中间融合与晚期融合的配置显著优于仅依赖晚期融合的模型,从而验证了该多级融合策略在提升 HAR 准确性方面的有效性。
链接: https://arxiv.org/abs/2604.16577
作者: Abeer FathAllah Brery,Ascensión Gallardo-Antolín,Israel Gonzalez-Carrasco,Mahmoud Fakhry
机构: Universidad Carlos III de Madrid (卡洛斯三世大学); Universidad Francisco de Vitoria (弗朗西斯科·德·维多利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Human activity recognition (HAR) refers to the process of identifying human actions and activities using data collected from sensors. Neural networks, such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, convolutional LSTM, and their hybrid combinations, have demonstrated exceptional performance in various research domains. Developing a multilevel individual or hybrid model for HAR involves strategically integrating multiple networks to capitalize on their complementary strengths. The structural arrangement of these components is a critical factor influencing the overall performance. This study explores a novel framework of a two-level network architecture with dual-stage feature fusion: late fusion, which combines the outputs from the first network level, and intermediate fusion, which integrates the features from both the first and second levels. We evaluated 15 different network architectures of CNNs, LSTMs, and convolutional LSTMs, incorporating late fusion with and without intermediate fusion, to identify the optimal configuration. Experimental evaluation on two public benchmark datasets demonstrates that architectures incorporating both late and intermediate fusion achieve higher accuracy than those relying on late fusion alone. Moreover, the optimal configuration outperforms baseline models, thereby validating its effectiveness for HAR.
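文中的晚期融合与中间融合对应两种常见操作:对各分支输出的类别概率取平均,以及在送入分类器之前拼接不同层级的特征。以下为示意性的NumPy草图(实际模型为CNN/LSTM网络,此处函数名仅为说明所设,并非论文代码):

```python
import numpy as np

def softmax(logits):
    """数值稳定的softmax。"""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(branch_logits):
    """晚期融合:对第一层级各分支的类别概率取平均。"""
    return np.mean([softmax(l) for l in branch_logits], axis=0)

def intermediate_fusion(level1_feats, level2_feats):
    """中间融合:在最终分类器之前拼接第一、第二层级的特征。"""
    return np.concatenate([level1_feats, level2_feats], axis=-1)
```

论文的结论是,同时使用这两种融合(而非仅晚期融合)能带来更高的识别准确率。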
[CV-249] Classification of systolic murmurs in heart sounds using multiresolution complex Gabor dictionary and vision transformer
【速读】:该论文旨在解决心音中收缩期杂音(systolic murmurs)自动分类的难题,以提高心脏疾病诊断的准确性与效率。其核心问题是:如何从具有高度变异性(如强度、频率和质地差异)的心音信号中提取稳定且具判别力的时频特征,并实现高精度分类。解决方案的关键在于两方面:一是采用基于多分辨率复数Gabor基函数(multiresolution complex Gabor basis functions, GBFs)的冗余字典,结合复杂正交匹配追踪(complex orthogonal matching pursuit)进行特征提取,通过共享字典学习机制使同一记录中多个杂音段落对应相同的基函数,从而生成一致的时频特征矩阵;二是设计一种融合视觉Transformer(vision transformer)架构的分类模型,利用卷积神经网络对不同分辨率的特征矩阵进行patch tokenization后,整合所有嵌入token并输入多头注意力机制与残差连接结构,实现多尺度特征的有效融合与判别。这一方法显著提升了杂音分类准确率,在CirCor DigiScope数据集上达到95.96%的分类精度。
链接: https://arxiv.org/abs/2604.16563
作者: Mahmoud Fakhry,Abeer FathAllah Brery
机构: CEIEC, Universidad Francisco de Vitoria (CEIEC,弗朗西斯科·德·维多利亚大学); Departamento de Informática, Universidad Carlos III de Madrid (计算机系,卡洛斯三世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Systolic murmurs are extra heart sounds that occur during the contraction phase of the cardiac cycle, often indicating heart abnormalities caused by turbulent blood flow. Their intensity, pitch, and quality vary, requiring precise identification for the accurate diagnosis of cardiac disorders. This study presents an automatic classification system for systolic murmurs using a feature extraction module, followed by a classification model. The feature extraction module employs complex orthogonal matching pursuit to project single or multiple murmur segments onto a redundant dictionary composed of multiresolution complex Gabor basis functions (GBFs). The resulting projection weights are split and reshaped into variable-resolution time–frequency feature matrices. Processing multiple segments of a single recording using a shared dictionary mitigates murmur variability. This is achieved by learning the weights for each segment while enforcing that they correspond to the same set of basis functions in the dictionary, promoting consistent time–frequency feature matrices. The classification model is built on a vision transformer to process multiple input matrices of different resolutions by passing each through a convolutional neural network for patch tokenization. All embedding tokens are then concatenated to form a matrix and forwarded to an encoder layer that includes multihead attention, residual connections, and a convolutional network with a kernel size of one. This integration of multiresolution feature extraction with transformer-based feature classification enhances the accuracy and reliability of heart murmur identification. An experimental analysis of four types of systolic murmurs from the CirCor DigiScope dataset demonstrates the effectiveness of the system, achieving a classification accuracy of 95.96%.
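特征提取模块的核心是在冗余Gabor字典上做正交匹配追踪(OMP)。以下为简化的实值NumPy示意(论文使用的是复数Gabor基函数与复数OMP;此处仅演示"多分辨率字典构造 + 贪心选择原子 + 支撑集上最小二乘重拟合"的基本流程,函数名为说明所设):

```python
import numpy as np

def gabor_atom(n, center, freq, width):
    """实值Gabor原子:高斯包络乘以余弦载波,归一化为单位范数。"""
    t = np.arange(n)
    g = np.exp(-0.5 * ((t - center) / width) ** 2) * np.cos(2 * np.pi * freq * t)
    return g / np.linalg.norm(g)

def gabor_dictionary(n, centers, freqs, widths):
    """冗余多分辨率字典:每个 (center, freq, width) 组合对应一列原子。"""
    atoms = [gabor_atom(n, c, f, w) for c in centers for f in freqs for w in widths]
    return np.stack(atoms, axis=1)

def omp(D, x, n_atoms):
    """贪心选择与残差最相关的原子,并在当前支撑集上做最小二乘重拟合。"""
    residual, support = x.astype(float).copy(), []
    for _ in range(n_atoms):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    return support, coeffs
```

OMP求得的投影权重即论文中被拆分、重排为变分辨率时频特征矩阵的原始系数。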
[CV-250] See Through the Noise: Improving Domain Generalization in Gaze Estimation CVPR2026
【速读】:该论文旨在解决标签噪声(label noise)对视觉注视估计(gaze estimation)模型泛化性能的负面影响问题。由于真实场景中精确标注注视方向存在困难,训练数据常包含噪声标签,而现有方法往往忽视这一因素,导致跨域迁移能力下降。其解决方案的关键在于提出一种名为 See-Through-Noise (SeeTN) 的新框架,通过构建基于原型的语义嵌入空间(semantic embedding space),保持注视特征与连续标签之间的拓扑结构一致性;进而利用特征-标签亲和性一致性度量识别噪声样本,并在语义流形上引入亲和力正则化项,将清洁样本中的注视相关信息迁移到噪声样本中,从而提升模型对标签噪声的鲁棒性并增强域不变的注视关系建模能力。
链接: https://arxiv.org/abs/2604.16562
作者: Yanming Peng,Shijing Wang,Yaping Huang,Yi Tian
机构: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University (北京交通大学交通数据挖掘与具身智能重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
Abstract:Generalizable gaze estimation methods have garnered increasing attention due to their critical importance in real-world applications and have achieved significant progress. However, they often overlook the effect of label noise, arising from the inherent difficulty of acquiring precise gaze annotations, on model generalization performance. In this paper, we are the first to comprehensively investigate the negative effects of label noise on generalization in gaze estimation. Further, we propose a novel solution, called See-Through-Noise (SeeTN) framework, which improves generalization from a novel perspective of mitigating label noise. Specifically, we propose to construct a semantic embedding space via a prototype-based transformation to preserve a consistent topological structure between gaze features and continuous labels. We then measure feature-label affinity consistency to distinguish noisy from clean samples, and introduce a novel affinity regularization in the semantic manifold to transfer gaze-related information from clean to noisy samples. Our proposed SeeTN promotes semantic structure alignment and enforces domain-invariant gaze relationships, thereby enhancing robustness against label noise. Extensive experiments demonstrate that our SeeTN effectively mitigates the adverse impact of source-domain noise, leading to superior cross-domain generalization without compromising the source-domain accuracy, and highlight the importance of explicitly handling noise in generalized gaze estimation.
[CV-251] LLM as a Tool Not an Agent: Code-Mined Tree Transformations for Neural Architecture Search
【速读】:该论文旨在解决传统神经架构搜索(Neural Architecture Search, NAS)方法依赖人工设计搜索空间导致探索受限,以及基于大语言模型(Large Language Models, LLMs)的代理式方法因生成复杂且有效架构能力不足、易受训练数据偏见影响而难以实现稳定且开放式的模型进化问题。其解决方案的关键在于提出 LLMasTool——一种分层树状结构的 NAS 框架,通过自动从任意源代码中提取可复用模块并将完整架构表示为层次化树结构,从而利用可靠的树变换进行演化;同时结合贝叶斯建模引导的粗粒度规划与 LLM 对剩余自由度的精细化决策,既保障了架构的可执行性与多样性,又突破了 LLM 固有模式偏见,实现了更高效、稳定的开放探索。
链接: https://arxiv.org/abs/2604.16555
作者: Masakazu Yoshimura,Zitang Sun,Yuiko Sakuma,Junji Otsuka,Atsushi Irie,Takeshi Ohashi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 72 pages
Abstract:Neural Architecture Search (NAS) aims to automatically discover high-performing deep neural network (DNN) architectures. However, conventional algorithm-driven NAS relies on carefully hand-crafted search spaces to ensure executability, which restricts open-ended exploration. Recent coding-based agentic approaches using large language models (LLMs) reduce manual design, but current LLMs struggle to reliably generate complex, valid architectures, and their proposals are often biased toward a narrow set of patterns observed in their training data. To bridge reliable algorithmic search with powerful LLM assistance, we propose LLMasTool, a hierarchical tree-based NAS framework for stable and open-ended model evolution. Our method automatically extracts reusable modules from arbitrary source code and represents full architectures as hierarchical trees, enabling evolution through reliable tree transformations rather than code generation. At each evolution step, coarse-level planning is governed by a diversity-guided algorithm that leverages Bayesian modeling to improve exploration efficiency, while the LLM resolves the remaining degrees of freedom to ensure a meaningful evolutionary trajectory and an executable generated architecture. With this formulation, instead of fully agentic LLM approaches, our method explores diverse directions beyond the inherent biases in the LLM. Our method improves over existing NAS methods by 0.69, 1.83, and 2.68 points on CIFAR-10, CIFAR-100, and ImageNet16-120, demonstrating its effectiveness.
[CV-252] PA-TCNet: Pathology-Aware Temporal Calibration with Physiology-Guided Target Refinement for Cross-Subject Motor Imagery EEG Decoding in Stroke Patients
【速读】:该论文旨在解决卒中患者跨被试运动想象(Motor Imagery, MI)脑-机接口(Brain-Computer Interface, BCI)解码中的泛化难题,该问题主要由病灶相关的异常时间动态特性及显著的个体间异质性引发,现有自适应方法易受病理慢波活动和不稳定的目标域伪标签干扰。解决方案的关键在于提出PA-TCNet框架,其核心创新为两个协同模块:一是病理感知的节奏状态Mamba(Pathology-aware Rhythmic State Mamba, PRSM)模块,通过将EEG时空特征分解为缓慢变化的节奏背景与快速瞬态扰动,并将融合的病理背景注入选择性状态传播过程,以更有效地捕捉异常时间动态;二是生理引导的目标校准(Physiology-Guided Target Calibration, PGTC)模块,利用源域感觉运动区感兴趣区域模板施加生理一致性约束并动态优化目标域伪标签,从而提升自适应可靠性。该方法在两个独立卒中EEG数据集上实现了优于现有最先进基线的跨被试解码性能,验证了联合建模病理时间动态与生理约束伪监督的有效性。
链接: https://arxiv.org/abs/2604.16554
作者: Xiangkai Wang,Yun Zhao,Dongyi He,Qingling Xia,Gen Li,Nizhuan Wang,Ningxiao Peng,Bin Jiang
机构: Chongqing University of Technology (重庆理工大学); Chongqing Polytechnic University of Electronic Technology (重庆电子工程职业学院); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Stroke patient cross-subject electroencephalography (EEG) decoding of motor imagery (MI) brain-computer interface (BCI) is essential for motor rehabilitation, yet lesion-related abnormal temporal dynamics and pronounced inter-patient heterogeneity often undermine generalization. Existing adaptation methods are easily misled by pathological slow-wave activity and unstable target-domain pseudo-labels. To address this challenge, we propose PA-TCNet, a pathology-aware temporal calibration framework with physiology-guided target refinement for stroke motor imagery decoding. PA-TCNet integrates two coordinated components. The Pathology-aware Rhythmic State Mamba (PRSM) module decomposes EEG spatiotemporal features into slowly varying rhythmic context and fast transient perturbations, injecting the fused pathological context into selective state propagation to more effectively capture abnormal temporal dynamics. The Physiology-Guided Target Calibration (PGTC) module constructs source-domain sensorimotor region-of-interest templates, imposing physiological consistency constraints and dynamically refining target-domain pseudo-labels, thereby improving adaptation reliability. Leave-one-subject-out experiments on two independent stroke EEG datasets, XW-Stroke and 2019-Stroke, yielded mean accuracies of 66.56% and 72.75%, respectively, outperforming state-of-the-art baselines. These results indicate that jointly modeling pathological temporal dynamics and physiology-constrained pseudo-supervision can provide more robust cross-subject initialization for personalized post-stroke MI-BCI rehabilitation. The implemented code is available at this https URL.
[CV-253] Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
【速读】:该论文旨在解决当前文本到场景生成方法中普遍存在的两个关键问题:一是现有方法通常仅生成场景布局或物体之一,难以同时实现高质量的布局与物体生成;二是生成结果常与输入文本描述在形状、外观及空间排列等复杂语义上存在不一致。为应对上述挑战,作者提出了一种新的顺序式文本到场景生成范式,其核心创新在于设计了一个统一的3D自回归扩散模型(3D-ARD+),该模型通过两阶段生成机制实现高保真场景构建:第一阶段基于自回归方式生成粗粒度场景空间中的3D潜在表示,条件为已知文本指令和已合成的3D场景;第二阶段则聚焦于细粒度对象空间,生成可解码为精确几何与外观的3D潜在表示。此方案有效提升了生成场景对复杂文本指令的空间与语义一致性。
链接: https://arxiv.org/abs/2604.16552
作者: Zhenggang Tang,Yuehao Wang,Yuchen Fan,Jun-Kun Chen,Yu-Ying Yeh,Kihyuk Sohn,Zhangyang Wang,Qixing Huang,Alexander Schwing,Rakesh Ranjan,Dilin Wang,Zhicheng Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent text-to-scene generation approaches largely reduced the manual efforts required to create 3D scenes. However, their focus is either to generate a scene layout or to generate objects, and few generate both. The generated scene layout is often simple even with LLM’s help. Moreover, the generated scene is often inconsistent with the text input that contains non-trivial descriptions of the shape, appearance, and spatial arrangement of the objects. We present a new paradigm of sequential text-to-scene generation and propose a novel generative model for interactive scene creation. At the core is a 3D Autoregressive Diffusion model 3D-ARD+, which unifies the autoregressive generation over a multimodal token sequence and diffusion generation of next-object 3D latents. To generate the next object, the model uses one autoregressive step to generate the coarse-grained 3D latents in the scene space, conditioned on both the text instructions seen so far and the already synthesized 3D scene. It then uses a second step to generate the 3D latents in the smaller object space, which can be decoded into fine-grained object geometry and appearance. We curate a large dataset of 230K indoor scenes with paired text instructions for training. We evaluate the 7B 3D-ARD+ model on challenging scenes and showcase that it can generate and place objects following non-trivial spatial layout and semantics prescribed by the text instructions.
[CV-254] A B-Spline Function Based 3D Point Cloud Unwrapping Scheme for 3D Fingerprint Recognition and Identification
【速读】:该论文旨在解决三维(3D)指纹识别中因拓扑高度变化导致的脊线与谷线提取困难,以及由于注册(registration)不一致限制采集过程的问题。其关键解决方案是采用B样条曲线拟合对3D点云指纹进行展开(unwrapping),以缓解高度差异并降低注册依赖性;随后将展开后的点云映射为灰度图像,从而兼容传统二维(2D)指纹识别方法,实现高效且高精度的识别性能。
链接: https://arxiv.org/abs/2604.16546
作者: Mohammad Mogharen Askarin,Jiankun Hu,Min Wang,Xuefei Yin,Xiuping Jia
机构: UNSW Canberra (澳大利亚国防大学坎培拉分校); University of Canberra (澳大利亚首都大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Three-dimensional (3D) fingerprint recognition and identification offer several advantages over traditional two-dimensional (2D) recognition systems. The contactless nature of 3D fingerprints enhances hygiene and security, reducing the risk of contamination and spoofing. In addition to surface ridge and valley patterns, 3D fingerprints capture depth, curvature, and shape information, enabling the development of more precise and robust authentication systems. Despite recent advancements, significant challenges remain. The topological height of fingerprint pixels complicates the extraction of ridge and valley patterns. Furthermore, registration issues limit the acquisition process, requiring consistent direction and orientation across all samples. To address these challenges, this paper introduces a method that unwraps 3D fingerprints, represented as 3D point clouds, using B-spline curve fitting to mitigate height variation and reduce registration limitations. The unwrapped point cloud is then converted into a grayscale image by mapping the relative heights of the points. This grayscale image is subsequently used for recognition through conventional 2D fingerprint identification methods. The proposed approach demonstrated superior performance in 3D fingerprint recognition, achieving Equal Error Rates (EERs) of 0.2072%, 0.26%, and 0.22% across three experiments, outperforming existing methods. Additionally, the method surpassed the 3D fingerprint flattening technique in both recognition and identification during cross-session experiments, achieving an EER of 1.50% when fingerprints with varying registrations were included.
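展开(unwrapping)的思路是:先用B样条拟合出平滑的手指曲面基线,再保留残差高度,即脊谷细节。以下用一维的纯NumPy Cox-de Boor基函数作玩具示意(论文处理的是3D点云与曲面拟合,此处仅演示"拟合基线、提取残差高度"这一步,函数名为说明所设):

```python
import numpy as np

def cox_de_boor(i, p, t, knots):
    """Cox-de Boor递推:第 i 个 p 次B样条基函数在各点 t 处的取值。"""
    if p == 0:
        return ((knots[i] <= t) & (t < knots[i + 1])).astype(float)
    out = np.zeros_like(t, dtype=float)
    d1 = knots[i + p] - knots[i]
    if d1 > 0:
        out += (t - knots[i]) / d1 * cox_de_boor(i, p - 1, t, knots)
    d2 = knots[i + p + 1] - knots[i + 1]
    if d2 > 0:
        out += (knots[i + p + 1] - t) / d2 * cox_de_boor(i + 1, p - 1, t, knots)
    return out

def spline_residual_heights(t, z, degree=3):
    """用钳制(clamped)三次B样条基线拟合高度 z,返回去除基线后的残差。"""
    knots = np.concatenate([np.zeros(degree), np.linspace(0, 1, 5), np.ones(degree)])
    n_basis = len(knots) - degree - 1
    B = np.stack([cox_de_boor(i, degree, t, knots) for i in range(n_basis)], axis=1)
    coeffs, *_ = np.linalg.lstsq(B, z, rcond=None)
    return z - B @ coeffs  # 去除手指整体曲率后剩下的脊谷细节
```

残差高度归一化后即可映射为灰度图像,送入常规2D指纹识别流程。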
[CV-255] BOOKAGENT: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration ACL2026
【速读】:该论文旨在解决生成式AI(Generative AI)在创作图文并茂的故事书时面临的两大核心问题:一是现有方法多采用分阶段处理策略,导致跨模态对齐(multi-modal grounding)不充分;二是缺乏针对儿童内容的安全约束机制,尤其在叙事规划与序列级多模态验证中未体现特定安全要求。其解决方案的关键在于提出BookAgent框架,通过多智能体协作实现从用户草稿到完整故事书的端到端合成,包含联合规划、脚本生成、图像绘制及全局不一致性修复四个环节;同时引入动态页面级对齐校准和基于“验证-修正”机制的时序一致性校验(character identity and storytelling logic),从而确保叙事连贯性、视觉一致性与儿童安全合规性。
链接: https://arxiv.org/abs/2604.16541
作者: Bo Gao,Chang Liu,Yuyang Miao,Siyuan Ma,Ser-Nam Lim
机构: Carnegie Mellon University (卡内基梅隆大学); University of Science and Technology of China (中国科学技术大学); Imperial College London (帝国理工学院); Nanyang Technological University (南洋理工大学); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, Accepted by ACL 2026
Abstract:Recent advancements in Large Generative Models (LGMs) have revolutionized multi-modal generation. However, generating illustrated storybooks remains an open challenge, where prior works mainly decompose this task into separate stages, and thus, holistic multi-modal grounding remains limited. Besides, while safety alignment is studied for text- or image-only generation, existing works rarely integrate child-specific safety constraints into narrative planning and sequence-level multi-modal verification. To address these limitations, we propose BookAgent, a safety-aware multi-agent collaboration framework designed for high-quality, safety-aware visual narratives. Different from prior story visualization models that assume a fixed storyline sequence, BookAgent targets end-to-end storybook synthesis from a user draft by jointly planning, scripting, illustrating, and globally repairing inconsistencies. To ensure precise multi-modal grounding, BookAgent dynamically calibrates page-level alignment between textual scripts and visual layouts. Furthermore, BookAgent calibrates holistic consistency from the temporal dimension, by verifying-then-rectifying global inconsistencies in character identity and storytelling logic. Extensive experiments demonstrate that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, offering a robust paradigm for reliable agents in complex multi-modal creation. The implementation will be publicly released at this https URL.
[CV-256] PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems CVPR2026
【速读】:该论文旨在解决3D重建系统中输入视图的投毒攻击问题,特别是针对现有方法仅对整个重建流程进行梯度反向传播而忽视特定模块潜在漏洞的局限性。其核心问题是:结构从运动(Structure-from-Motion, SfM)初始化阶段作为众多主流3D重建系统的核心几何模块,可能成为实现跨系统迁移性投毒攻击的关键薄弱点。解决方案的关键在于提出PoInit-of-View方法,通过优化对抗扰动以在对应3D点的投影上引入跨视图梯度不一致性,从而破坏关键点检测与特征匹配,进而干扰SfM中的位姿估计和三角测量,最终导致渲染质量显著下降。理论分析进一步将跨视图不一致性与对应关系坍缩(correspondence collapse)相联系,实验表明该方法在黑盒迁移场景下(如3DGS到NeRF)相比单视图基线在PSNR上提升25.1%,SSIM提升16.5%。
链接: https://arxiv.org/abs/2604.16540
作者: Weijie Wang,Songlong Xing,Zhengyu Zhao,Nicu Sebe,Bruno Lepri
机构: University of Trento, Italy; Fondazione Bruno Kessler, Italy; Xi’an Jiaotong University, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2026
Abstract:Poisoning input views of 3D reconstruction systems has been recently studied. However, we identify that existing studies simply backpropagate adversarial gradients through the 3D reconstruction pipeline as a whole, without uncovering the new vulnerability rooted in specific modules of the 3D reconstruction pipeline. In this paper, we argue that the structure-from-motion (SfM) initialization, as the geometric core of many widely used reconstruction systems, can be targeted to achieve transferable poisoning effects across diverse 3D reconstruction systems. To this end, we propose PoInit-of-View, which optimizes adversarial perturbations to intentionally introduce cross-view gradient inconsistencies at projections of corresponding 3D points. These inconsistencies disrupt keypoint detection and feature matching, thereby corrupting pose estimation and triangulation within SfM, eventually resulting in low-quality rendered views. We also provide a theoretical analysis that connects cross-view inconsistency to correspondence collapse. Experimental results demonstrate the effectiveness of our PoInit-of-View on diverse 3D reconstruction systems and datasets, surpassing the single-view baseline by 25.1% in PSNR and 16.5% in SSIM in black-box transfer settings, such as 3DGS to NeRF.
[CV-257] Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models
【速读】:该论文旨在解决当前医疗人工智能(AI)系统在面对对抗性扰动时,仅依赖攻击成功率(Attack Success Rate, ASR)这一单一指标评估鲁棒性的局限性问题。ASR作为二值化指标,无法反映扰动强度、图像感知质量及跨架构迁移能力等关键因素,从而导致对模型真实抗攻击能力的误判。解决方案的关键在于构建一个多维评估框架,通过系统性地测量七种攻击方法在四个医学图像数据集上对七种模型(包括CNN与Vision Transformer)的攻击效果,同时量化ASR、峰值信噪比(PSNR)、结构相似性指数(SSIM)和L₂扰动幅度四项指标,揭示了感知质量与扰动强度之间强相关而与ASR弱相关的普遍规律,从而证明仅用ASR无法全面刻画医疗AI系统的对抗风险,必须引入多指标协同评估体系以实现更可靠的鲁棒性分析。
链接: https://arxiv.org/abs/2604.16532
作者: Emily Curl,Kofi Ampomah,Md Erfan,Sayanton Dibbo
机构: The University of Alabama (阿拉巴马大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:While deep learning systems are becoming increasingly prevalent in medical image analysis, their vulnerabilities to adversarial perturbations raise serious concerns for clinical deployment. These vulnerability evaluations largely rely on Attack Success Rate (ASR), a binary metric that indicates solely whether an attack is successful. However, the ASR metric does not account for other factors, such as perturbation strength, perceptual image quality, and cross-architecture attack transferability, and therefore, the interpretation is incomplete. This gap requires consideration, as complex, large-scale deep learning systems, including Vision Transformers (ViTs), are increasingly challenging the dominance of Convolutional Neural Networks (CNNs). These architectures learn differently, and it is unclear whether a single metric, e.g., ASR, can effectively capture adversarial behavior. To address this, we perform a systematic empirical study on four medical image datasets: PathMNIST, DermaMNIST, RetinaMNIST, and CheXpert. We evaluate seven models (VGG-16, ResNet-50, DenseNet-121, Inception-v3, DeiT, Swin Transformer, and ViT-B/16) against seven attack methods at five perturbation budgets, measuring ASR, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and L_2 perturbation magnitude. Our findings show a consistent pattern: perceptual and distortion metrics are strongly associated with one another and exhibit minimal correlation with ASR. This applies to both CNNs and ViTs. The results demonstrate that ASR alone is an inadequate indicator of adversarial robustness and transferability. Consequently, we argue that a thorough assessment of adversarial risk in medical AI necessitates multi-metric frameworks that encompass not only the attack efficacy but also its methodology and associated overheads.
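文中同时报告的PSNR、SSIM与L₂扰动幅度均可按定义直接计算。以下为NumPy示意(注意:SSIM此处取全图单窗口的简化版本,论文与常用工具库中的标准SSIM使用滑动高斯窗口,数值会有差异):

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """峰值信噪比(dB);假设 x 与 y 不完全相同(否则MSE为零)。"""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def global_ssim(x, y, data_range=1.0):
    """全图单窗口SSIM(标准SSIM在滑动高斯窗口上逐块计算后取均值)。"""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)
    return num / den

def l2_perturbation(x, y):
    """对抗扰动的欧氏范数(L2幅度)。"""
    return float(np.linalg.norm((x - y).ravel()))
```

论文的核心观察正是:这类感知/失真度量彼此强相关,却与攻击成功率(ASR)几乎不相关,因此需要多指标联合评估。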
[CV-258] Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF
【速读】:该论文旨在解决辅助生殖技术中胚胎选择环节的局限性问题,特别是当前基于人工智能(AI)的解决方案在临床应用中面临的适应性差、依赖延时成像(time-lapse)培养箱以及缺乏可解释性等挑战。其关键解决方案是构建一个由专家标注的胚胎图像与自然语言描述相结合的数据集,其中包含胚胎细胞周期、发育阶段及形态特征等结构化信息;该数据集可用于微调先进的视觉-语言基础模型,使其具备高精度的胚胎描述能力,并进一步通过自动提取文献中的科学证据来支持基于循证医学的决策过程,从而实现透明、可解释且以患者为中心的自动化胚胎评估体系。
链接: https://arxiv.org/abs/2604.16528
作者: Nicklas Neu,Thomas Ebner,Jasmin Primus,Bernhard Schenkenfelder,Raphael Zefferer,Mathias Brunbauer,Florian Kromp
机构: Software Competence Center Hagenberg (软件能力中心哈根贝格); Kepler Universitätsklinikum (开普勒大学医院); Wunschkind Klinik Dr. Brunbauer (布伦鲍尔博士愿望儿童诊所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures, in submission to Nature Scientific Data
Abstract:Embryo selection is one of multiple crucial steps in in-vitro fertilization, commonly based on morphological assessment by clinical embryologists. Although artificial intelligence methods have demonstrated their potential to support embryo selection through automated ranking or grading, the overall impact of AI-based solutions is still limited. This is mainly due to the required adaptation of automated solutions to custom clinical data, reliance on time-lapse incubators and a lack of interpretability to understand AI reasoning. The modern, informed patient is questioning expert decisions, particularly if the treatment is not successful. Thus, evidence-based decision justification in tasks like embryo selection would support transparent decision making and respectful patient communication. To support this aim, we hereby present an expert-annotated dataset consisting of embryo images and corresponding morphological descriptions in natural language. The descriptions contain relevant information on embryonic cell cycle, developmental stage and morphological features. This dataset enables the finetuning of modern foundational vision-language models to learn and improve over time with high accuracy. Predicted embryo descriptions can then be leveraged to automatically extract scientific evidence from literature, facilitating well-informed, evidence-based decision-making and transparent communication with patients. Our proposed dataset supports research in language-based, interpretable, and transparent automated embryo assessment and has the potential to enhance the decision-making process and improve patient outcomes significantly over time.
[CV-259] Privacy-Preserving Semantic Segmentation without Key Management
【速读】:该论文旨在解决隐私保护下的语义分割问题,即在不泄露客户端图像数据的前提下实现模型的训练与推理。其核心挑战在于如何在加密状态下保持模型性能,避免因加密导致的精度下降。解决方案的关键在于提出一种基于独立密钥的隐私保护语义分割方法:模型创建者和每个客户端使用本地生成的密钥对图像进行加密,训练和推理均在加密图像上执行;同时,通过在训练阶段引入图像加密机制(而不仅限于测试阶段),有效缓解了因加密带来的性能退化问题,从而在保证隐私的同时维持模型准确性。实验在Cityscapes数据集上使用基于视觉Transformer(Vision Transformer)的SETR模型验证了该方法的有效性。
链接: https://arxiv.org/abs/2604.16523
作者: Mare Hirose,Shoko Imaizumi,Hitoshi Kiya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 2 pages, 3 figures, 2 tables, Accepted to ICCE-TW 2026
Abstract:This paper proposes a novel privacy-preserving semantic segmentation method that can use independent keys for each client and image. In the proposed method, the model creator and each client encrypt images using locally generated keys, and model training and inference are conducted on the encrypted images. To mitigate performance degradation, an image encryption method is applied to model training in addition to the generation of test images. In experiments, the effectiveness of the proposed method is confirmed on the Cityscapes dataset under the use of a vision transformer-based model, called SETR.
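论文的核心思路是"各方用本地密钥加密图像,训练与推理均在密文上进行"。这一类可学习图像加密的一个常见简化形式是按密钥对图像块做可逆置乱,下面用纯 Python 做概念示意(并非论文采用的具体加密变换,加解密双方只需共享密钥):

```python
import random

def _block_order(n_blocks, key):
    """由密钥唯一确定的块置换顺序。"""
    order = list(range(n_blocks))
    random.Random(key).shuffle(order)
    return order

def encrypt_blocks(pixels, block_size, key):
    """将展平像素按块切分后,依密钥置乱拼接。"""
    blocks = [pixels[i:i + block_size] for i in range(0, len(pixels), block_size)]
    order = _block_order(len(blocks), key)
    return [p for idx in order for p in blocks[idx]]

def decrypt_blocks(cipher, block_size, key):
    """用同一密钥重建置换顺序并还原原图。"""
    blocks = [cipher[i:i + block_size] for i in range(0, len(cipher), block_size)]
    order = _block_order(len(blocks), key)
    restored = [None] * len(blocks)
    for pos, idx in enumerate(order):
        restored[idx] = blocks[pos]
    return [p for b in restored for p in b]
```

论文的要点在于:训练阶段也使用这类加密图像,而非仅在测试时加密,从而缓解性能退化。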
[CV-260] Fast Online 3D Multi-Camera Multi-Object Tracking and Pose Estimation
【速读】:该论文旨在解决多摄像头环境下3D多目标跟踪与姿态估计的实时性与鲁棒性问题,尤其在缺乏昂贵3D标注数据或复杂深度学习模型的情况下实现高效计算。其解决方案的关键在于提出一种基于贝叶斯最优多目标跟踪滤波器的快速在线算法,仅依赖2D边界框和姿态检测结果,无需3D训练数据或高计算成本的模型,从而在保持高精度的同时显著提升运行效率,并具备在摄像头间歇性断开或重新连接时仍能稳定工作的能力。
链接: https://arxiv.org/abs/2604.16522
作者: Linh Van Ma,Tran Thien Dat Nguyen,Moongu Jeon
机构: GIST(韩国科学技术院); Curtin University(柯廷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper proposes a fast and online method for jointly performing 3D multi-object tracking and pose estimation using multiple monocular cameras. Our algorithm requires only 2D bounding box and pose detections, eliminating the need for costly 3D training data or computationally expensive deep learning models. Our solution is an efficient implementation of a Bayes-optimal multi-object tracking filter, enhancing computational efficiency while maintaining accuracy. We demonstrate that our algorithm is significantly faster than state-of-the-art methods without compromising accuracy, using only publicly available pre-trained 2D detection models. We also illustrate the robust performance of our algorithm in scenarios where multiple cameras are intermittently disconnected or reconnected during operation.
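论文实现的是贝叶斯最优多目标跟踪滤波器的高效版本。退化到单目标、一维的情形,其"预测—更新"循环就是经典卡尔曼滤波,下面给出示意(观测 z 可理解为由 2D 检测框反推出的位置,参数均为假设值):

```python
def kf_predict(x, v, p, q=1.0, dt=1.0):
    """常速运动模型预测:状态为 (位置 x, 速度 v),p 为位置方差,q 为过程噪声。"""
    return x + v * dt, v, p + q

def kf_update(x, v, p, z, r=1.0):
    """用观测 z(观测噪声方差 r)做贝叶斯更新。"""
    k = p / (p + r)              # 卡尔曼增益
    x_new = x + k * (z - x)
    return x_new, v, (1 - k) * p
```

多目标、多相机情形还需处理观测与航迹的关联,以及相机断连时跳过更新步、仅做预测,这正是论文强调的鲁棒性来源之一。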
[CV-261] Operationalizing Fairness in Text-to-Image Models: A Survey of Bias Fairness Audits and Mitigation Strategies ICLR2026
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型中因社会刻板印象而导致的公平性问题,其核心挑战在于当前研究对“偏见”(bias)与“公平性”(fairness)等概念缺乏清晰区分和可操作定义。解决方案的关键在于提出一个系统性的分类框架,将现有研究按偏见类型与公平性理念进行归类,并通过区分“目标公平性”(target fairness,即理想输出标准)与“阈值公平性”(threshold fairness,即具备可执行决策规则的标准)来厘清理论与实践之间的差距;同时,论文进一步提出一种以目标为导向的公平性操作化新框架,推动从描述性指标向严谨的目标驱动测试演进,从而提升生成式AI开发的问责性与可解释性。
链接: https://arxiv.org/abs/2604.16516
作者: Megan Smith,Venkatesh Thirugnana Sambandham,Florian Richter,Laura Crompton,Matthias Uhl,Torsten Schön
机构: AImotion Bavaria, Technische Hochschule Ingolstadt, Ingolstadt, Germany; School of Transformation and Sustainability, Catholic University of Eichstätt-Ingolstadt, Eichstätt, Germany; Chair of Economic and Social Ethics, University of Hohenheim, Stuttgart, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: ICLR 2026 Algorithmic Fairness Across Alignment Procedures and Agentic Systems (AFAA) Workshop, reviews can be found at: this https URL
Abstract:Text-to-Image (T2I) generation models have been widely adopted across various industries, yet are criticized for frequently exhibiting societal stereotypes. While a growing body of research has emerged to evaluate and mitigate these biases, the field at present contends with conceptual ambiguity, for example terms like “bias” and “fairness” are not always clearly distinguished and often lack clear operational definitions. This paper provides a comprehensive systematic review of T2I fairness literature, organizing existing work into a taxonomy of bias types and fairness notions. We critically assess the gap between “target fairness” (normative ideals in T2I outputs) and “threshold fairness” (normative standards with actionable decision rules). Furthermore, we survey the landscape of mitigation strategies, ranging from prompt engineering to diffusion process manipulation. We conclude by proposing a new framework for operationalizing fairness that moves beyond descriptive metrics towards rigorous, target-based testing, offering an approach for more accountable generative AI development.
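论文区分"目标公平性"(理想标准)与"阈值公平性"(附带可执行判定规则)。以审计生成图像中某属性在各群体间的出现比例为例,阈值公平性可以写成一条显式规则(示意代码,eps 的取值为假设):

```python
def parity_gap(outputs_by_group):
    """各群体正例比例的最大差值(人口统计学平价差)。
    outputs_by_group: {群体名: [0/1 标记列表]}。"""
    rates = [sum(g) / len(g) for g in outputs_by_group.values()]
    return max(rates) - min(rates)

def threshold_fair(outputs_by_group, eps=0.1):
    """阈值公平性:差值不超过 eps 即视为通过审计,构成可执行的决策规则。"""
    return parity_gap(outputs_by_group) <= eps
```

仅报告 parity_gap 属于描述性指标;加上 eps 判定才对应论文所说的可操作(threshold)公平性。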
[CV-262] Penny Wise Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在基于截图的价格约束场景中因视觉主导幻觉(Visual Dominance Hallucination, VDH)导致的决策失准问题,即微小且不可察觉的视觉扰动可覆盖文本价格信息,引发非理性交易行为。解决方案的关键在于提出PriceBlind框架,其核心创新是通过CLIP类编码器中的模态差距,设计语义解耦损失(Semantic-Decoupling Loss),使图像嵌入对齐低成本、高价值锚点,同时保持像素级保真度,从而实现隐蔽的白盒对抗攻击,在E-ShopBench上达到约80%的攻击成功率(ASR)。
链接: https://arxiv.org/abs/2604.16515
作者: Jiachen Qian,Zhaolu Kang
机构: City University of Hong Kong (香港城市大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 15 pages, 4 figures, 13 tables
Abstract:The rapid proliferation of Multimodal Large Language Models (MLLMs) has enabled mobile agents to execute high-stakes financial transactions, but their adversarial robustness remains underexplored. We identify Visual Dominance Hallucination (VDH), where imperceptible visual cues can override textual price evidence in screenshot-based, price-constrained settings and lead agents to irrational decisions. We propose PriceBlind, a stealthy white-box adversarial attack framework for controlled screenshot-based evaluation. PriceBlind exploits the modality gap in CLIP-based encoders via a Semantic-Decoupling Loss that aligns the image embedding with low-cost, value-associated anchors while preserving pixel-level fidelity. On E-ShopBench, PriceBlind achieves around 80% ASR in white-box evaluation; under a simplified single-turn coordinate-selection protocol, Ensemble-DI-FGSM transfers with roughly 35-41% ASR across GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet. We also show that robust encoders and Verify-then-Act defenses reduce ASR substantially, though with some clean-accuracy trade-off.
[CV-263] BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
【速读】:该论文旨在解决自回归视觉语言模型(Autoregressive Vision-Language Models, VLMs)在推理阶段因逐标记解码导致的效率瓶颈问题,同时克服将预训练自回归VLM直接转换为大块扩散视觉语言模型(Diffusion VLM, dVLM)时出现的性能显著下降难题。其解决方案的关键在于提出BARD框架,该框架通过两阶段策略实现高效且高质量的迁移:首先采用渐进式监督块合并(progressive supervised block merging),逐步增大解码块大小以提升并行性;其次引入分阶段内部蒸馏(stage-wise intra-dVLM distillation),从固定小块扩散锚模型中蒸馏知识,恢复因大块解码而损失的性能。此外,混合噪声调度器和内存友好的训练机制进一步提升了模型鲁棒性和长序列训练效率,实验证明该方法可在极少量数据下实现强多模态能力迁移,并在同等规模开源dVLM中达到最新SOTA性能,同时获得最高达3倍的解码吞吐量加速。
链接: https://arxiv.org/abs/2604.16514
作者: Baoyou Chen,Hanchen Xia,Peng Tu,Haojun Shi,Shan Mu,Weihao Yuan,Siyu Zhu
机构: Shanghai Academy of AI for Science(上海人工智能科学研究院); Fudan University (复旦大学); Nanjing University (南京大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with ≤ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3× decoding throughput speedup compared to the source model.
[CV-264] SynthPID: PID digitization from Topology-Preserving Synthetic Data
【速读】:该论文旨在解决过程工业中管道仪表流程图(Piping and Instrumentation Diagrams, PIDs)自动化数字化为结构化工艺图时面临的严重数据稀缺问题,即公开基准数据集仅包含12张标注图像,且现有合成数据生成方法因符号随机分布导致生成的工艺图与真实场景差异大,致使模型在边缘检测任务上的准确率仅为约33%。解决方案的关键在于提出SynthPID合成数据集,其管道拓扑结构直接从真实PID中提取种子,确保生成图像具有真实的物理和逻辑结构;同时采用基于patch的Relationformer模型进行高分辨率图纸处理,使得仅用SynthPID训练即可在PID2Graph OPEN100测试集上达到63.8±3.1%的边mAP,逼近使用真实数据训练的性能上限(相差仅8个百分点),验证了高质量合成数据对模型性能提升的核心作用。
链接: https://arxiv.org/abs/2604.16513
作者: Suraj Prasad,Pinak Mahapatra
机构: Indian Institute Of Technology Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Automating the digitization of Piping and Instrumentation Diagrams (PIDs) into structured process graphs would unlock significant value in plant operations, yet progress is bottlenecked by a fundamental data problem: engineering drawings are proprietary, and the entire community shares a single public benchmark of just 12 annotated images. Prior attempts at synthetic augmentation have fallen short because template-based generators scatter symbols at random, producing graphs that bear little resemblance to real process plants and, accordingly, yield only approximately 33% edge detection accuracy under synth-only training. We argue the failure is structural rather than visual and address it by introducing SynthPID, a corpus of 665 synthetic PIDs whose pipe topology is seeded directly from real drawings. Paired with a patch-based Relationformer adapted for high-resolution diagrams, a model trained on SynthPID alone achieves 63.8 ± 3.1% edge mAP on PID2Graph OPEN100 without seeing a single real PID during training, closing within 8 pp of the real-data oracle. These gains hold up under a controlled comparison against the template-based regime, confirming that generation quality drives performance rather than model choice. A scaling study reveals that gains flatten beyond roughly 400 synthetic images, pointing to seed diversity as the binding constraint.
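论文以边 mAP 评估工艺图结构的还原质量。一个更简化的代理指标是无向边集合上的查准率/查全率/F1,可用几行代码实现(仅为示意,并非 PID2Graph 的官方评测脚本):

```python
def edge_prf(pred_edges, gt_edges):
    """无向边集合的查准率/查全率/F1;边用节点对表示,方向无关。"""
    pred = {frozenset(e) for e in pred_edges}
    gt = {frozenset(e) for e in gt_edges}
    tp = len(pred & gt)  # 同时出现在预测与真值中的边
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```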
[CV-265] Medial Axis Aware Learning of Signed Distance Functions
【速读】:该论文旨在解决从给定点云中计算高精度全局带符号距离函数(Signed Distance Function, SDF)的问题,尤其关注在复杂几何结构下保持SDF的全局一致性与局部精度。解决方案的关键在于提出一种新的变分方法,通过显式建模SDF梯度的跳跃集(即中轴线,medial axis),并采用高阶变分泛函强制梯度方向远离该不连续集时呈现线性增长特性;同时以等距方程(eikonal equation)和SDF零水平集作为约束条件,并利用Ambrosio-Tortorelli型相场近似来实现数值可计算性,其中相场函数隐式描述中轴线。该方法结合神经网络对SDF与相场函数的逼近,在无向点云表示的表面上实现了近场与全局范围内的高精度重建。
链接: https://arxiv.org/abs/2604.16512
作者: Samuel Weidemaier,Christoph Norden-Smoch,Martin Rumpf
机构: Institute for Numerical Simulation, University of Bonn(波恩大学数值模拟研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:
Abstract:We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method’s accuracy both in the near field and globally. Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method.
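论文把等距方程 |∇d| = 1 作为约束,而中轴线正是距离场梯度发生跳跃的位置。对点云的精确(无符号)距离场,可以用中心差分在远离中轴线处数值验证这一性质(纯 Python 二维示意):

```python
import math

def dist_to_cloud(x, y, cloud):
    """到点云的最近距离,即 SDF 的绝对值。"""
    return min(math.hypot(x - px, y - py) for px, py in cloud)

def eikonal_residual(x, y, cloud, h=1e-4):
    """数值梯度模长与 1 的偏差;精确距离场在可微处应满足 |∇d| = 1。"""
    gx = (dist_to_cloud(x + h, y, cloud) - dist_to_cloud(x - h, y, cloud)) / (2 * h)
    gy = (dist_to_cloud(x, y + h, cloud) - dist_to_cloud(x, y - h, cloud)) / (2 * h)
    return abs(math.hypot(gx, gy) - 1.0)
```

在两点等距的中轴线上,该数值梯度会出现跳变、残差不再趋近于零,这正是论文用高阶变分项与相场函数显式建模的不连续集。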
[CV-266] Predicting Blastocyst Formation in IVF: Integrating DINOv2 and Attention-Based LSTM on Time-Lapse Embryo Images
【速读】:该论文旨在解决体外受精(IVF)过程中胚胎移植前最优胚胎选择的难题,尤其针对因依赖人工分析大量延时成像(time-lapse)数据而导致的效率低和准确性差的问题。其核心挑战在于如何仅基于有限的每日图像预测胚胎是否能发育为囊胚(blastocyst),尤其是在许多诊所缺乏完整延时成像系统、视频数据不全的情况下。解决方案的关键在于提出一种新型混合模型,结合基于 Transformer 的视觉模型 DINOv2 与改进的长短期记忆(LSTM)网络(引入多头注意力机制),其中 DINOv2 从胚胎图像中提取高维特征,LSTM 则利用这些特征对胚胎发育时序动态进行建模并输出最终预测结果。该方法在704个真实胚胎视频数据集上达到96.4%的准确率,且对缺失帧具有鲁棒性,显著优于现有方法。
链接: https://arxiv.org/abs/2604.16505
作者: Zahra Asghari Varzaneh,Niclas Wölner-Hanssen,Reza Khoshkangini,Thomas Ebner,Magnus Johnsson
机构: Malmö University (马尔默大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The selection of the optimal embryo for transfer is a critical yet challenging step in in vitro fertilization (IVF), primarily due to its reliance on the manual inspection of extensive time-lapse imaging data. A key obstacle in this process is predicting blastocyst formation from the limited number of daily images available. Many clinics also lack complete time-lapse systems, so full videos are often unavailable. In this study, we aimed to predict which embryos will develop into blastocysts using limited daily images from time-lapse recordings. We propose a novel hybrid model that combines DINOv2, a transformer-based vision model, with an enhanced long short-term memory (LSTM) network featuring a multi-head attention layer. DINOv2 extracts meaningful features from embryo images, and the LSTM model then uses these features to analyze embryo development over time and generate final predictions. We tested our model on a real dataset of 704 embryo videos. The model achieved 96.4% accuracy, surpassing existing methods. It also performs well with missing frames, making it valuable for many IVF laboratories with limited imaging systems. Our approach can assist embryologists in selecting better embryos more efficiently and with greater confidence.
[CV-267] From Handwriting to Structured Data: Benchmarking AI Digitisation of Handwritten Forms
【速读】:该论文旨在解决结构化手写文档(如医疗表单)人工数字化效率低、成本高的问题。解决方案的关键在于评估17种前沿多模态大语言模型(Multimodal Large Language Models, MLLMs)在包含日期、印刷文本、手写内容及高变异性挑战的真实世界场景下的性能表现,发现最新版本的Google和OpenAI模型可达到约85%准确率与近90%加权F1分数;其中,GPT 5.4在噪声日期提取中表现最优,Claude Sonnet 4.6在格式化字段上平均性能最佳,Gemini 3.1则整体表现最强,尤其在自由文本错误率(WER = 0.50,CER = 0.31)和离散分类指标上领先;进一步研究表明,提示优化(prompt optimisation)能显著提升宏观精度、召回率和F1分数(>60%),但对加权指标改善有限(仅~2–5%),表明任务特定微调与提示工程对实现复杂手写流程自动化具有关键意义。
链接: https://arxiv.org/abs/2604.16504
作者: Nicholas Pather,Joshua Fouché,Sitwala Mundia,Karl-Günter Technau,Thokozile Malaba,Alex Welte,Ushma Mehta,Bruce A. Bassett
机构: CSAM, University of the Witwatersrand (CSAM, 威特沃特斯兰德大学); Grai Labs (Grai 实验室); Faculty of Health Sciences, University of the Witwatersrand (健康科学学院,威特沃特斯兰德大学); Empilweni Services and Research Unit, Department of Paediatrics and Child Health, University of the Witwatersrand, Johannesburg (埃米利维尼服务与研究中心,儿科与儿童健康系,威特沃特斯兰德大学,约翰内斯堡); Division of Epidemiology and Biostatistics, School of Public Health, Faculty of Health Sciences, University of Cape Town, South Africa (流行病学与生物统计学系,公共卫生学院,健康科学学院,开普敦大学,南非); Discipline of Public Health, School of Medicine, University of KwaZulu-Natal (公共卫生学科,医学院,夸祖鲁-纳塔尔大学); Division of Clinical Pharmacology, Department of Medicine, University of Cape Town (临床药理学系,医学系,开普敦大学); WITS MIND Institute and CSAM, University of the Witwatersrand, University of Cape Town, Grai Labs, Cape Town, South Africa
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 Pages, 5 Figures
Abstract:Manual digitisation of structured handwritten documents is slow and costly. We benchmark 17 leading frontier multi-modal large language models and open-source models against a very challenging real-world medical form that mixes dates; structured, printed text; hand-written responses and significant variability challenges. None of the smaller or older models perform well but the latest Google and OpenAI models reach accuracies around 85% with weighted F1 scores ≈ 90% across the discrete or predefined fields despite the very challenging nature of the responses. Clear task specific strengths emerge: GPT 5.4 excels in noisy date extraction as well as reliability with the lowest hallucination rate (6%). Claude Sonnet 4.6 had the best average performance across formatted fields (dates and numerical values), while Gemini 3.1 delivered the best overall performance, with the lowest free text error rates (WER = 0.50 and CER = 0.31) and the strongest results across discrete classification metrics. We further show that prompt optimisation dramatically improves macro precision, recall and F1 by over 60%, but has little impact on weighted metrics (only ~2-5% improvement). These results provide evidence that the rapid improvements of multimodal large language models offer a compelling pathway toward fully automated digitisation of complex handwritten workflows that is particularly relevant in low- and middle-income countries.
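摘要以 WER 与 CER 评估自由文本字段。两者都建立在编辑距离之上,可用几行动态规划实现(示意代码,实际评测常用 jiwer 等现成库):

```python
def levenshtein(a, b):
    """经典动态规划编辑距离(插入/删除/替换各计 1)。"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # 删除
                           cur[j - 1] + 1,       # 插入
                           prev[j - 1] + (ca != cb)))  # 替换或匹配
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """字符错误率 = 字符级编辑距离 / 参考文本长度。"""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """词错误率:在词序列上做同样的计算。"""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```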
[CV-268] Motif-Video 2B: Technical Report
【速读】:该论文旨在解决视频生成模型在资源受限条件下(如少于1000万视频片段和少于10万H200 GPU小时)仍能实现高质量文本到视频生成的问题。传统方法依赖大规模数据和参数量来提升性能,但本文指出模型架构的设计方式比单纯增加规模更为关键。其解决方案的核心在于通过架构专业化来解耦视频生成中的多个复杂任务:即分离提示对齐(prompt alignment)、时序一致性(temporal consistency)和细节恢复(fine-detail recovery)的处理路径,而非仅依靠模型规模扩展。具体而言,Motif-Video 2B采用三段式骨干网络结构,并引入共享交叉注意力机制(Shared Cross-Attention)以增强长序列控制能力,同时结合动态令牌路由与早期特征对齐策略优化训练效率。实验表明,该设计在VBench上达到83.76%得分,优于参数量大7倍的Wan2.1 14B模型,验证了架构优化与高效训练配方协同作用可显著缩小甚至超越大型模型的质量差距。
链接: https://arxiv.org/abs/2604.16503
作者: Junghwan Lim,Wai Ting Cheung,Minsu Ha,Beomgyu Kim,Taewhan Kim,Haesol Lee,Dongpin Oh,Jeesoo Lee,Taehyun Kim,Minjae Kim,Sungmin Lee,Hyeyeon Cho,Dahye Choi,Jaeheui Her,Jaeyeon Huh,Hanbin Jung,Changjin Kang,Dongseok Kim,Jangwoong Kim,Youngrok Kim,Hyukjin Kweon,Hongjoo Lee,Jeongdoo Lee,Junhyeok Lee,Eunhwan Park,Yeongjae Park,Bokki Ryu,Dongjoo Weon
机构: Motif Technologies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
[CV-269] opology-Aware Layer Pruning for Large Vision-Language Models ACL2026
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在资源受限场景下部署时面临的计算和内存开销过大的问题。现有层剪枝方法通常依赖局部相似性度量或静态代理信号,无法捕捉模型深度上表示的全局动态演化特性,常导致关键过渡层被误删。其解决方案的关键在于提出一种拓扑感知的层剪枝框架:将每一层隐藏状态表示为点云,并利用单纯复形(simplicial complexes)建模其演化过程;通过Zigzag持久同调(zigzag persistent homology)量化层间拓扑一致性,从而实现保留重要表征跃迁的自适应剪枝。
链接: https://arxiv.org/abs/2604.16502
作者: Pengcheng Zheng,Chaoning Zhang,Ya Wen,Wang Liu,Qigan Sun,Jiarong Mo,Jiaquan Zhang,Jewon Lee,Tae-Ho Kim,Kuien Liu,Tianyu Li,Caiyan Qin,Yang Yang
机构: University of Electronic Science and Technology of China(电子科技大学); Kyung Hee University(中央大学); Nota Inc.(Nota公司); Institute of Software Chinese Academy of Sciences(中国科学院软件研究所); Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACL 2026 (Main Conference)
Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning, while recent extensions that incorporate visual inputs enable them to process multimodal information. Despite these advances, Large Vision-Language Models (LVLMs) incur substantial computational and memory costs, hindering deployment in resource-constrained scenarios. Existing layer pruning methods typically rely on local similarity metrics or static proxy signals, failing to capture the global and dynamic evolution of representations across model depth, which often leads to the removal of transition-critical layers. To address this limitation, we propose a topology-aware layer pruning framework for LVLMs. Specifically, we represent layer-wise hidden states as point clouds and model their evolution using simplicial complexes. By leveraging zigzag persistent homology, we quantify inter-layer topological consistency and enable adaptive pruning that preserves critical representational transitions. Extensive experiments on diverse multimodal benchmarks demonstrate that the proposed framework consistently outperforms existing pruning methods across a wide range of sparsity ratios. Our code is available at this https URL.
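论文批评的"局部相似性"剪枝基线,其典型做法是删除与相邻层表示最相似(最冗余)的层。下面用余弦相似度给出该基线的示意实现,以便理解论文改进的出发点(注意:这不是论文提出的拓扑方法,zigzag 持续同调的完整实现通常需借助专门的 TDA 库):

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def prune_by_similarity(layer_states, keep_ratio):
    """局部相似性剪枝基线:与前一层余弦相似度最高的层视为最冗余,优先删除。
    layer_states[i] 为第 i 层的(已池化)隐藏向量;返回保留的层索引。"""
    n = len(layer_states)
    redundancy = [cosine(layer_states[i - 1], layer_states[i]) for i in range(1, n)]
    n_drop = n - max(1, int(round(n * keep_ratio)))
    # 相似度从高到低排序,删除前 n_drop 个层
    drop = sorted(range(1, n), key=lambda i: -redundancy[i - 1])[:n_drop]
    return [i for i in range(n) if i not in drop]
```

这种局部度量只看相邻两层,无法识别论文所说的"关键过渡层",这正是其用全局拓扑一致性替代的原因。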
[CV-270] Semantically Stable Image Composition Analysisvia Saliency and Gradient Vector Flow Fusion ICPR2026
【速读】:该论文旨在解决摄影构图(photographic composition)的可靠计算评估问题,其核心挑战在于如何提取对空间布局具有判别力但又对语义内容鲁棒的特征。解决方案的关键在于提出一种基于视觉注意力流动假设的低层表示方法——VFCNet,该模型通过融合显著性与边缘信息构建梯度向量场(gradient vector flow, GVF),并采用双流GVF表示、注意力机制融合以及DINOv3骨干网络提取多尺度流特征,从而有效建模构图中的几何结构与视觉注意分布关系。实验表明,该方法在PICD基准上达到当前最优性能(CDA-1: 0.683, CDA-2: 0.629),且简单使用自监督DINOv3特征即可超越复杂设计的构图专用模型。
链接: https://arxiv.org/abs/2604.16500
作者: Armin Dadras,Robert Sablatnig,Franziska Proksa,Markus Seidl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICPR 2026
Abstract:The reliable computational assessment of photographic composition requires features that are discriminative of spatial layout yet robust to semantic content. This paper proposes a low-level representation grounded in the assumption that composition can be understood as the flow of visual attention across geometric structure. We introduce VFCNet, which fuses saliency and edge information into a gradient vector flow (GVF) field. The model computes dual-stream GVF representations, integrates them via attention, and extracts multi-scale flow features with a DINOv3 backbone. VFCNet achieves state-of-the-art performance on the PICD benchmark (CDA-1: 0.683, CDA-2: 0.629), improving by 33.1% and 36.1% over the previous best method. We also show that a simple classifier on self-supervised DINOv3 features substantially outperforms more sophisticated, composition-specialized models. Code is available at this https URL
[CV-271] HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models
【速读】:该论文旨在解决视觉-语言预训练模型上的黑盒对抗攻击问题,即在仅能访问模型输出结果的情况下,同时对文本和图像进行扰动以生成高质量的对抗样本。现有方法要么依赖复杂的迭代交叉搜索策略导致查询开销大,要么仅关注降低正样本对的相似度而忽略负样本对的隐式削弱,从而影响攻击效果。其解决方案的关键在于提出一种名为HQA-VLAttack的简单但高效的框架,包含文本和图像两个攻击阶段:文本扰动利用对抗词向量(counter-fitting word vector)构建语义一致的替代词集;图像扰动则先通过层重要性引导策略初始化对抗样本,再引入对比学习优化图像扰动,使正样本对相似度下降、负样本对相似度上升,从而提升模型检索负例的概率,显著提高攻击成功率。
链接: https://arxiv.org/abs/2604.16499
作者: Han Liu,Jiaqi Li,Zhi Xu,Xiaotong Zhang,Xiaoming Xu,Fenglong Ma,Yuanman Li,Hong Yu
机构: Dalian University of Technology (大连理工大学); The Pennsylvania State University (宾夕法尼亚州立大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to be considered simultaneously, and only the predicted results are accessible. Research on this problem is in its infancy, and only a handful of methods are available. Nevertheless, existing methods either rely on a complex iterative cross-search strategy, which inevitably consumes numerous queries, or only consider reducing the similarity of positive image-text pairs but ignore that of negative ones, which will also be implicitly diminished, thus inevitably affecting the attack performance. To alleviate the above issues, we propose a simple yet effective framework to generate high-quality adversarial examples on vision-language pre-trained models, named HQA-VLAttack, which consists of text and image attack stages. For text perturbation generation, it leverages the counter-fitting word vector to generate the substitute word set, thus guaranteeing the semantic consistency between the substitute word and the original word. For image perturbation generation, it first initializes the image adversarial example via the layer-importance guided strategy, and then utilizes contrastive learning to optimize the image adversarial perturbation, which ensures that the similarity of positive image-text pairs is decreased while that of negative image-text pairs is increased. In this way, the optimized adversarial images and texts are more likely to retrieve negative examples, thereby enhancing the attack success rate. Experimental results on three benchmark datasets demonstrate that HQA-VLAttack significantly outperforms strong baselines in terms of attack success rate.
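HQA-VLAttack 在图像扰动优化中同时"压低正样本对相似度、抬高负样本对相似度"。这一对比式目标可以抽象为如下损失(仅为概念示意,并非论文的精确损失形式):

```python
import math

def cosine_sim(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def contrastive_attack_loss(img_emb, pos_text_emb, neg_text_embs):
    """取值越小表示攻击越接近目标:
    与配对(正)文本相似度低,与某个负例文本相似度高,
    从而使检索更可能返回负例。"""
    pos = cosine_sim(img_emb, pos_text_emb)
    neg = max(cosine_sim(img_emb, n) for n in neg_text_embs)
    return pos - neg
```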
[CV-272] LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching Inference
【速读】:该论文旨在解决生成式 AI (Generative AI) 中基于 Transformer 的 Flow Matching 模型在图像生成任务中推理成本过高的问题。现有方法因将整个 Transformer 视为单一单元进行缓存决策,未能利用不同层组间速度动态的异质性(即浅层稳定、深层变化剧烈),导致缓存效率低下。解决方案的关键在于提出 LayerCache——一种分层感知的缓存框架,通过将 Transformer 分为多个层组并独立决定每组在每个去噪步骤中的缓存策略,结合基于层组稳定性的自适应 JVP span(Jacobian-Vector Product span)选择机制,在保证估计精度的同时最大化计算节省,并以贪心预算分配算法优化三维度调度问题(时间步、层组、JVP span),最终在质量-速度帕累托前沿上显著优于现有方法。
链接: https://arxiv.org/abs/2604.16492
作者: Guandong Li
机构: iFLYTEK(科大讯飞)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Flow Matching models achieve state-of-the-art image generation quality but incur substantial inference cost due to iterative denoising through large Transformer networks. We observe that different layer groups within a Transformer exhibit markedly heterogeneous velocity dynamics: shallow layers are highly stable and amenable to aggressive caching, while deep layers undergo large velocity changes that demand full computation. Existing caching methods, however, treat the entire Transformer as a monolithic unit, applying a single caching decision per timestep and thus failing to exploit this heterogeneity. Based on this finding, we propose LayerCache, a layer-aware caching framework that partitions the Transformer into layer groups and makes independent, per-group caching decisions at each denoising step. LayerCache introduces an adaptive JVP span K selection mechanism that leverages per-group stability measurements to balance estimation accuracy and computational savings. We formulate a three-dimensional scheduling problem over timesteps, layer groups, and JVP span, and solve it with a greedy budget allocation algorithm. On Qwen-Image (1024x1024, 50 steps), LayerCache achieves PSNR 37.46 dB (+5.38 dB over MeanCache), SSIM 0.9834, and LPIPS 0.0178 (a 70% reduction over MeanCache) at 1.37x speedup, dominating all prior caching methods on the quality-speed Pareto frontier.
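LayerCache 将"时间步 × 层组 × JVP span"的三维调度问题用贪心预算分配求解。其骨架可抽象为经典的"收益/代价比"贪心(示意代码;实际的 savings 与 errors 来自逐层组的稳定性测量,此处为假设数值):

```python
def greedy_cache_schedule(savings, errors, budget):
    """贪心预算分配:候选项 c 可理解为 (时间步, 层组, span) 组合,
    savings[c] 为缓存该候选节省的计算量,errors[c] 为其引入的估计误差;
    按单位误差换得的节省从高到低挑选,直到误差预算用尽。"""
    order = sorted(savings, key=lambda c: savings[c] / errors[c], reverse=True)
    chosen, used = [], 0.0
    for c in order:
        if used + errors[c] <= budget:
            chosen.append(c)
            used += errors[c]
    return chosen, used
```

浅层组稳定(误差小、可多省),深层组变化剧烈(误差大),贪心会自然把缓存预算倾斜给浅层,这与摘要中的层间异质性观察一致。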
[CV-273] A Lightweight Transformer for Pain Recognition from Brain Activity
【速读】:该论文旨在解决疼痛自动化评估中多模态信号融合与计算效率之间的矛盾问题,尤其是在功能近红外光谱(functional near-infrared spectroscopy, fNIRS)数据处理中如何有效建模不同时间尺度和空间特征的同时保持轻量化。其解决方案的关键在于提出一种基于统一标记化机制的轻量级Transformer架构,通过token-mixing策略将异构的fNIRS输入(如原始波形和功率谱密度)映射到共享潜在表示空间,从而在不引入模态特异性适配或增加模型复杂度的前提下,联合建模空间、时间和时频特征,并借助结构化的分割方案控制局部聚合与全局交互的粒度,最终实现高精度且适合GPU/CPU实时推理的疼痛识别性能。
链接: https://arxiv.org/abs/2604.16491
作者: Stefanos Gkikas,Christian Arzate Cruz,Yu Fang,Lu Cao,Muhammad Umar Khan,Thomas Kassiotis,Giorgos Giannakakis,Raul Fernandez Rojas,Randy Gomez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.
[CV-274] An Uncertainty-Aware Loss Function Incorporating Fuzzy Logic: Application to MRI Brain Image Segmentation
【速读】:该论文旨在解决脑部磁共振成像(MRI)图像中组织分割的不确定性问题,尤其是在像素级分类时因边界模糊或噪声导致的标签不明确性。其解决方案的关键在于提出了一种融合模糊逻辑(fuzzy logic)的新颖损失函数,该函数整合了经典的类别交叉熵(categorical cross-entropy, CCE)与基于模糊熵的度量,从而在训练过程中显式建模和处理像素分类中的不确定性,提升模型对复杂医学图像的分割精度与预测可靠性。
链接: https://arxiv.org/abs/2604.16490
作者: Hanuman Verma,Akshansh Gupta,Pranabesh Maji,Saurav Mandal,Vijay Kumar Pandey
机构: Bareilly College (巴瑞利学院); MJP Rohilkhand University (MJP罗希尔坎德大学); CSIR-Central Electronics Engineering Research Institute (印度科学与工业研究委员会-中央电子工程研究所); CICMR, Regional Medical Research Center (区域医学研究中心); Department of Statistics (统计系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 09 pages, 07 Figures
Abstract:Accurate brain image segmentation, particularly for distinguishing various tissues from magnetic resonance imaging (MRI) images, plays a pivotal role in diagnosing neurological disease and in medical image computing. In deep learning approaches, loss functions are crucial for optimizing the model. In this study, we introduce a novel loss function integrating fuzzy logic to address uncertainty in segmenting brain images into various tissues. It integrates the well-known categorical cross-entropy (CCE) loss function and fuzzy entropy based on fuzzy logic. By employing fuzzy logic, this loss function accounts for the inherent uncertainties in pixel classifications. The proposed loss function has been evaluated on two publicly available benchmark datasets, IBSR and OASIS, using two widely recognised architectures, U-Net and U-Net++. Experimental results demonstrate that the model trained with the proposed loss function outperforms the CCE optimisation function in terms of various performance metrics. Additionally, it effectively enhances segmentation performance while handling meaningful uncertainty during training. The findings suggest that this approach not only improves segmentation outcomes but also contributes to the reliability of model predictions.
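A minimal sketch of the idea of combining CCE with a fuzzy-entropy term, for one pixel. This is an assumed form for illustration only: the weighting `lam` and the per-pixel averaging are our choices, not the authors' exact formulation.

```python
import math

def fuzzy_entropy(p, eps=1e-12):
    """De Luca-Termini style fuzzy entropy of a membership value in [0, 1]:
    maximal at p = 0.5 (ambiguous), zero at p = 0 or 1 (confident)."""
    p = min(max(p, eps), 1 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def fuzzy_cce_loss(probs, target, lam=0.1):
    """probs: predicted class probabilities for one pixel; target: class index.
    lam (assumed hyperparameter) weighs the uncertainty term against CCE."""
    cce = -math.log(max(probs[target], 1e-12))
    fe = sum(fuzzy_entropy(p) for p in probs) / len(probs)
    return cce + lam * fe

confident = fuzzy_cce_loss([0.98, 0.01, 0.01], target=0)
ambiguous = fuzzy_cce_loss([0.50, 0.30, 0.20], target=0)
```

The extra term penalizes ambiguous pixel memberships on top of the usual classification loss, so a correct-but-unsure prediction still incurs a cost.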
[CV-275] Geometry-Aware CLIP Retrieval via Local Cross-Modal Alignment and Steering
【速读】:该论文旨在解决CLIP(Contrastive Language–Image Pretraining)在跨模态检索中因局部几何不一致导致的检索失败问题,例如邻近类别(如五边形与六边形)排序错误,从而产生模糊且缺乏控制的结果集。传统方法主要通过点级相关性优化或微调来缓解此问题,而本文提出将检索任务重新定义为邻域对齐问题,其关键在于:(1) 利用匈牙利匹配进行邻域级重排序,以奖励结构一致性;(2) 引入查询条件下的局部引导机制,通过对比邻域方向重塑检索结果的局部结构。这两项技术均作用于局部邻域,分别实现对齐增强与结构控制,显著提升属性绑定和组合检索任务的性能,且无需重新训练即可在推理阶段提升检索质量与可控性。
链接: https://arxiv.org/abs/2604.16487
作者: Nirmalendu Prakash,Narmeen Fatimah Oozeer,Xin Su,Phillip Howard,Shaan Shah,Zoe Wanying He,Shuang Wu,Shivam Raval,Roy Ka-Wei Lee,Meenakshi Khosla,Amir Abdullah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:CLIP retrieval is typically framed as a pointwise similarity problem in a shared embedding space. While CLIP achieves strong global cross-modal alignment, many retrieval failures arise from local geometric inconsistencies: nearby items are incorrectly ordered, leading to systematic confusions (e.g., pentagon vs. hexagon) and producing diffuse, weakly controlled result sets. Prior work largely optimizes for pointwise relevance or relies on finetuning to mitigate these problems. We instead view retrieval as a problem of neighborhood alignment. Our work introduces (1) neighborhood-level re-ranking via Hungarian matching, which rewards structural consistency; (2) query-conditioned local steering, where directions derived from contrastive neighborhoods around the query reshape retrieval. We show that these techniques improve retrieval performance on attribute-binding and compositional retrieval tasks. Together, these methods operate on local neighborhoods but serve different roles: re-ranking rewards alignment, whereas local steering controls neighborhood structure. This shows that retrieval quality and controllability depend critically on local structure, which can be exploited at inference time without retraining.
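The neighborhood-level re-ranking idea can be sketched as follows. This is an illustrative toy, not the paper's method: the Hungarian matching is solved by brute force over tiny neighborhoods, and the Euclidean distance and scoring are assumptions.

```python
from itertools import permutations

def assignment_cost(query_nbrs, cand_nbrs):
    """Optimal one-to-one matching cost between two equal-size neighborhoods
    (lists of embedding vectors); brute-force Hungarian for small n."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(query_nbrs)
    return min(sum(dist(query_nbrs[i], cand_nbrs[perm[i]]) for i in range(n))
               for perm in permutations(range(n)))

def rerank(query_nbrs, candidates):
    """candidates: {name: neighborhood}. Lower matching cost ranks higher,
    rewarding candidates whose local structure aligns with the query's."""
    return sorted(candidates,
                  key=lambda c: assignment_cost(query_nbrs, candidates[c]))

query = [[0.0, 0.0], [1.0, 0.0]]
candidates = {
    "aligned":    [[1.0, 0.1], [0.1, 0.0]],   # same local structure, permuted
    "misaligned": [[5.0, 5.0], [6.0, 5.0]],
}
order = rerank(query, candidates)
```

Because matching is over whole neighborhoods rather than single points, a candidate is rewarded for structural consistency even when its items arrive in a different order.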
[CV-276] Aletheia: Physics-Conditioned Localized Artifact Attention (PhyLAA-X) for End-to-End Generalizable and Robust Deepfake Video Detection
【速读】:该论文旨在解决当前深度伪造检测模型在跨生成器迁移(cross-generator shifts)、高压缩和对抗攻击下性能显著下降的问题。其核心挑战在于现有方法未能将语义伪影学习与物理不变性(physical invariants)有效耦合,如光流不连续性、镜面反射不一致性和心率调制的反射信号(rPPG)。解决方案的关键是提出PhyLAA-X,一种物理条件化的局部伪影注意力机制扩展,通过端到端可微分的方式注入三个物理衍生特征体积——光流旋度(optical-flow curl)、镜面反射偏度(specular-reflectance skewness)和空间上采样的rPPG功率谱——直接嵌入LAA-X注意力计算中,利用交叉注意力门控和共振一致性损失强制网络学习语义不一致与物理违反共现的篡改边界,这些区域对生成模型而言难以稳定复制。
链接: https://arxiv.org/abs/2604.16486
作者: Devendra Ghori
机构: Aletheia Project ( Aletheia 项目)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code: this https URL (MIT license). Dataset notes: see Data and Code Availability section
Abstract:State-of-the-art deepfake detectors achieve near-perfect in-domain accuracy yet degrade under cross-generator shifts, heavy compression, and adversarial perturbations. The core limitation remains the decoupling of semantic artifact learning from physical invariants: optical-flow discontinuities, specular-reflection inconsistencies, and cardiac-modulated reflectance (rPPG) are treated either as post-hoc features or ignored. We introduce PhyLAA-X, a novel physics-conditioned extension of Localized Artifact Attention (LAA-X). PhyLAA-X injects three end-to-end differentiable physics-derived feature volumes - optical-flow curl, specular-reflectance skewness, and spatially-upsampled rPPG power spectra - directly into the LAA-X attention computation via cross-attention gating and a resonance consistency loss. This forces the network to learn manipulation boundaries where semantic inconsistencies and physical violations co-occur - regions inherently harder for generative models to replicate consistently. PhyLAA-X is embedded across an efficient spatiotemporal ensemble (EfficientNet-B4+BiLSTM, ResNeXt-101+Transformer, Xception+causal Conv1D) with uncertainty-aware adaptive weighting. On FaceForensics++ (c23), Aletheia reaches 97.2% accuracy / 0.992 AUC-ROC; on Celeb-DF v2, 94.9% / 0.981; on DFDC, 90.8% / 0.966 - outperforming the strongest published baseline (LAA-Net [1]) by 4.1-7.3% in cross-generator settings and maintaining 79.4% accuracy under epsilon = 0.02 PGD-10 attacks. Single-backbone ablations confirm PhyLAA-X alone delivers a 4.2% cross-dataset AUC gain. The full production system is open-sourced at this https URL (v1.2, April 2026) with pretrained weights, the adversarial corpus (referred to as ADC-2026 in this work), and complete reproducibility artifacts.
[CV-277] Saccade Attention Networks: Using Transfer Learning of Attention to Reduce Network Sizes
【速读】:该论文旨在解决Transformer网络在处理长序列时因注意力矩阵的二次复杂度而导致的计算效率低下问题。其解决方案的关键在于引入一种称为“Saccade Attention Network”的新型架构,该网络通过学习从大型预训练模型中识别出关键特征区域(类比人类视觉中的saccades),从而对输入图像进行预处理,仅保留被关注的关键特征作为后续处理的输入序列。这种方法显著减少了输入序列长度,使计算量降低约80%,同时保持与原始模型相当的性能表现。
链接: https://arxiv.org/abs/2604.16485
作者: Marc Estafanous(1 and 2) ((1) Johns Hopkins University, (2) Neurobaby Corporation)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 5 figures, 2 tables
Abstract:One of the limitations of transformer networks is the sequence length, due to the quadratic cost of the attention matrix. Classical self-attention spans the entire sequence, yet the attention actually used is sparse. Humans use a form of sparse attention called saccades when analyzing an image or scene; focusing on key features greatly reduces computation time. By training a network (the Saccade Attention Network) to learn where to attend from a large pre-trained model, we can pre-process images and greatly reduce network size, shrinking the input sequence to just the key features being attended to. Our results indicate that computation can be reduced by close to 80% while producing similar results.
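The saccade-style token reduction can be sketched in a few lines. All names and the scoring are illustrative assumptions; the paper trains a network to predict where to attend, whereas here the teacher's attention scores are simply given.

```python
# Hypothetical sketch: keep only the top-k tokens a large teacher model
# attends to, then run downstream attention on the reduced sequence.

def saccade_select(tokens, teacher_scores, k):
    """Keep the k most-attended tokens, preserving their original order."""
    top = sorted(range(len(tokens)),
                 key=lambda i: teacher_scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

def attention_flops(n):
    return n * n  # quadratic attention cost, up to constants

tokens = ["patch0", "patch1", "patch2", "patch3", "patch4",
          "patch5", "patch6", "patch7", "patch8", "patch9"]
scores = [0.9, 0.1, 0.05, 0.8, 0.02, 0.03, 0.7, 0.01, 0.04, 0.6]
kept = saccade_select(tokens, scores, k=4)
savings = 1 - attention_flops(len(kept)) / attention_flops(len(tokens))
```

Keeping 4 of 10 tokens shrinks the quadratic attention cost to 16%, an 84% saving, consistent in spirit with the ~80% reduction the abstract reports.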
[CV-278] DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
【速读】:该论文旨在解决生成式世界-动作模型(World-Action Models)在机器人操作任务中部署时面临的三大瓶颈问题:冗余的像素级重建、内存随时间呈线性增长(O(T))以及推理延迟高。其核心解决方案是提出因果潜在世界模型(Causal Latent World Model, CLWM),通过使用DINOv3特征作为生成目标,将交互语义从视觉噪声中解耦,从而提升域泛化能力;引入双状态测试时训练(Dual-State Test-Time Training, TTT)记忆机制以实现严格O(1)内存占用,支持长程任务;并设计推测异步推理(Speculative Asynchronous Inference, SAI)策略,在物理执行过程中掩蔽部分扩散去噪过程,降低约50%的阻塞延迟;此外,通过EmbodiChain在线框架注入物理驱动轨迹流,建立效率定律,显著提升策略鲁棒性与可扩展性。
链接: https://arxiv.org/abs/2604.16484
作者: Yueci Deng,Guiliang Liu,Kui Jia
机构: DexForce AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying generative World-Action Models for manipulation is severely bottlenecked by redundant pixel-level reconstruction, \mathcal{O}(T) memory scaling, and sequential inference latency. We introduce the Causal Latent World Model (CLWM), which employs DINOv3 features as generative targets to disentangle interaction semantics from visual noise, yielding highly robust domain generalization. To overcome memory scaling, CLWM features a Dual-State Test-Time Training (TTT) Memory that guarantees a strict \mathcal{O}(1) footprint for long-horizon tasks. To overcome deployment latency, we propose Speculative Asynchronous Inference (SAI) to mask partial diffusion denoising behind physical execution, cutting blocking latency by about 50%. To scale robust policies, we present EmbodiChain, an online framework that establishes the Efficiency Law by injecting an infinite flow of physics-grounded trajectories during training. Extensive experiments validate that CLWM achieves state-of-the-art performance in complex dual-arm simulation and unprecedented zero-shot sim-to-real transfer on physical robots, outperforming baselines explicitly finetuned on real-world data.
[CV-279] Dynamic Eraser for Guided Concept Erasure in Diffusion Models
【速读】:该论文旨在解决文本到图像(Text-To-Image, T2I)扩散模型中概念擦除(concept erasure)的安全性问题,即如何在不引入语义漂移或表示崩溃的前提下,有效且可控地移除特定敏感概念。现有方法在推理阶段存在两大局限:特征修正类方法易导致过度修正,而基于token的干预则难以实现语义粒度控制与上下文一致性。为此,作者提出轻量级、无需训练的动态语义引导(Dynamic Semantic Steering, DSS)框架,其核心创新在于:1)敏感语义边界建模(Sensitive Semantic Boundary Modeling, SSBM),自动识别安全的语义锚点;2)敏感语义引导(Sensitive Semantic Guidance, SSG),利用交叉注意力特征进行精准检测,并通过一个良好定义的目标函数推导出闭式解来执行修正,从而在最大程度抑制敏感内容的同时保留良性语义。该方案实现了平均91.0%的概念擦除率,显著优于当前最优方法(提升至85.9%),且对输出保真度影响最小。
链接: https://arxiv.org/abs/2604.16483
作者: Qinghui Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 26 pages,21 figures
Abstract:Concept erasure in Text-To-Image (T2I) diffusion models is vital for safe content generation, but existing inference-time methods face significant limitations. Feature-correction approaches often cause uncontrolled over-correction, while token-level interventions struggle with semantic granularity and context. Moreover, both types of methods are prone to severe semantic drift or even complete representation collapse. To address these challenges, we present Dynamic Semantic Steering (DSS), a lightweight, training-free framework for interpretable and controllable concept erasure. DSS introduces: 1) Sensitive Semantic Boundary Modeling (SSBM) to automate the discovery of safe semantic anchors, and 2) Sensitive Semantic Guidance (SSG), which leverages cross-attention features for precise detection and performs correction via a closed-form solution derived from a well-posed objective. This ensures optimal suppression of sensitive content while preserving benign semantics. DSS achieves an average erasure rate of 91.0%, significantly outperforming SOTA methods (from 18.6% to 85.9%) with minimal impact on output fidelity.
[CV-280] A Survey of Spatial Memory Representations for Efficient Robot Navigation CVPR2026
【速读】:该论文旨在解决视觉导航机器人在大规模环境中运行时空间记忆(spatial memory)无限制增长导致计算资源耗尽的问题,尤其针对嵌入式平台(如8–16GB共享内存、30W功耗限制)无法通过增加硬件来缓解内存压力的场景。其核心解决方案是提出一个标准化评估协议,包含内存增长速率、查询延迟、内存完备性曲线和吞吐量退化等指标,并引入关键指标 α = Mₚₑₐₖ / Mₘₐₚ(峰值运行内存与保存地图大小之比),量化了实际部署成本与文献中报告的地图尺寸之间的差距。研究表明,内存架构设计而非方法范式(如Occupancy Grid或Neural Implicit Representations)才是决定部署可行性的关键因素,且不同方法在各自最优区间内表现各异,从而推动基于 α 值的预算分配算法,使开发者可在实现前预判目标硬件上的可行性。
链接: https://arxiv.org/abs/2604.16482
作者: Ma. Madecheen S. Pangaliman,Steven S. Sison,Erwin P. Quilloy,Rowel Atienza
机构: University of the Philippines Diliman (菲律宾大学迪里曼分校); University of Santo Tomas (圣托马斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at the Women in Computer Vision (WiCV) Workshop at CVPR 2026
Abstract:As vision-based robots navigate larger environments, their spatial memory grows without bound, eventually exhausting computational resources, particularly on embedded platforms (8-16GB shared memory, 30W) where adding hardware is not an option. This survey examines the spatial memory efficiency problem across 88 references spanning 52 systems (1989-2025), from occupancy grids to neural implicit representations. We introduce \alpha = M_{\text{peak}} / M_{\text{map}}, the ratio of peak runtime memory (the total RAM or GPU memory consumed during operation) to saved map size (the persistent checkpoint written to disk), exposing the gap between published map sizes and actual deployment cost. Independent profiling on an NVIDIA A100 GPU reveals that \alpha spans two orders of magnitude within neural methods alone, ranging from 2.3 (Point-SLAM) to 215 (NICE-SLAM, whose 47 MB map requires 10 GB at runtime), showing that memory architecture, not paradigm label, determines deployment feasibility. We propose a standardized evaluation protocol comprising memory growth rate, query latency, memory-completeness curves, and throughput degradation, none of which current benchmarks capture. Through a Pareto frontier analysis with explicit benchmark separation, we show that no single paradigm dominates within its evaluation regime: 3DGS methods achieve the best absolute accuracy at 90-254 MB map size on Replica, while scene graphs provide semantic abstraction at predictable cost. We provide the first independently measured \alpha reference values and an \alpha-aware budgeting algorithm enabling practitioners to assess deployment feasibility on target hardware prior to implementation.
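The alpha metric and an alpha-aware feasibility check can be sketched directly from the numbers in the abstract. The metric is the survey's; the budgeting helper below is an illustrative assumption of how such a check might look.

```python
def alpha(peak_runtime_mb, saved_map_mb):
    """alpha = M_peak / M_map: peak runtime memory over saved map size."""
    return peak_runtime_mb / saved_map_mb

def fits_on_device(saved_map_mb, method_alpha, device_budget_mb):
    """Estimate peak runtime memory from the map size and a known alpha,
    then compare it against the device's memory budget."""
    return saved_map_mb * method_alpha <= device_budget_mb

# Figures reported in the abstract: NICE-SLAM's 47 MB map needs ~10 GB at runtime.
nice_slam_alpha = alpha(10_000, 47)   # roughly 215
point_slam_alpha = 2.3                # reported for Point-SLAM

# On an 8 GB embedded platform, a 100 MB map is feasible only for low-alpha methods.
ok_point_slam = fits_on_device(100, point_slam_alpha, 8_000)
ok_nice_slam = fits_on_device(100, nice_slam_alpha, 8_000)
```

This makes the survey's point concrete: two methods with identical published map sizes can differ by two orders of magnitude in actual deployment cost.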
[CV-281] Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
【速读】:该论文旨在解决大规模文本到图像(Text-to-Image, T2I)扩散模型中存在的安全风险问题,即模型可能生成受版权保护或其他不 Desired 内容,而现有概念擦除(Concept Erasure)方法在可扩展性、精确性和鲁棒性之间难以平衡,仅能处理数百个概念。其解决方案的关键在于提出一种名为“擦除数千个概念”(Erasing Thousands of Concepts, ETC)的可扩展框架:首先通过学生t分布混合模型(Student’s t-distribution Mixture Model, tMM)建模低秩概念分布,并利用仿射最优传输实现对目标概念的精准擦除,同时通过锚定目标概念分布边界而不依赖预定义锚点来保留其他概念;其次引入基于专家混合(Mixture-of-Experts, MoE)结构的MoEraser模块,在注入噪声并微调以恢复嵌入的基础上,实现对目标嵌入的移除和锚点嵌入的保留,从而增强对白盒攻击(如模块移除)的鲁棒性。该方案在跨域、多扩散模型上验证了对超2000个概念的高精度擦除能力,显著提升了概念擦除任务的规模与实用性。
链接: https://arxiv.org/abs/2604.16481
作者: Hoigi Seo,Byung Hyun Lee,Jaehyun Cho,Sungjin Lim,Se Young Chun
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale text-to-image (T2I) diffusion models deliver remarkable visual fidelity but pose safety risks due to their capacity to reproduce undesirable content, such as copyrighted material. Concept erasure has emerged as a mitigation strategy, yet existing approaches struggle to balance scalability, precision, and robustness, which restricts their applicability to erasing only a few hundred concepts. To address these limitations, we present Erasing Thousands of Concepts (ETC), a scalable framework capable of erasing thousands of concepts while preserving generation quality. Our method first models low-rank concept distributions via a Student’s t-distribution Mixture Model (tMM). It enables pin-point erasure of target concepts via affine optimal transport while preserving others by anchoring the boundaries of target concept distributions without pre-defined anchor concepts. We then train a Mixture-of-Experts (MoE)-based module, termed MoEraser, which removes target embeddings while preserving the anchor embeddings. By injecting noise into the text embedding projector and fine-tuning MoEraser for recovery, our framework achieves robustness to white-box attacks such as module removal. Extensive experiments on over 2,000 concepts across heterogeneous domains and diffusion models demonstrate state-of-the-art scalability and precision in large-scale concept erasure.
[CV-282] Positioning radiata pine branches requiring pruning by drone stereo vision
【速读】:该论文旨在解决林业中自动化修剪作业的挑战,具体聚焦于利用无人机搭载的双目视觉系统实现辐射松(Radiata Pine)枝条的检测与定位。其解决方案的关键在于构建一个两阶段的处理流程:首先通过YOLOv8、YOLOv9和Mask R-CNN等模型在自建71对立体图像数据集上进行枝条分割,以精确提取目标区域;其次采用多种深度估计方法(包括传统SGBM与WLS滤波及多款基于深度学习的算法如PSMNet、ACVNet等)生成稠密视差图,并结合中心点三角测量算法与中位数绝对偏差(MAD)异常值剔除策略,计算枝条相对于无人机的距离。实验表明,深度学习方法生成的视差图在1–2米距离下具有更高的几何一致性,验证了低成本双目视觉方案在林木自动定位中的可行性。
链接: https://arxiv.org/abs/2604.16480
作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a stereo-vision-based system mounted on a drone for detecting and localising radiata pine branches to support autonomous pruning. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, YOLOv8, YOLOv9, and Mask R-CNN variants are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera. For depth estimation, both a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated. A centroid-based triangulation algorithm with MAD outlier rejection is proposed to compute branch distance from the segmentation mask and disparity map. Qualitative evaluation at distances of 1-2 m indicates that the deep learning-based disparity maps produce more coherent depth estimates than SGBM, demonstrating the feasibility of low-cost stereo vision for automated branch positioning in forestry.
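The centroid triangulation with MAD outlier rejection described above can be sketched as follows. The focal length, baseline, and rejection threshold are illustrative assumptions, not the paper's calibration.

```python
from statistics import median

def mad_filter(values, k=3.0):
    """Keep values within k median-absolute-deviations of the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9
    return [v for v in values if abs(v - med) <= k * mad]

def branch_distance(disparities_px, focal_px=700.0, baseline_m=0.063):
    """Robust branch distance (metres): filter disparity samples inside the
    segmentation mask by MAD, then depth = focal * baseline / disparity."""
    kept = mad_filter(disparities_px)
    return focal_px * baseline_m / median(kept)

# With f = 700 px and B = 63 mm, a 44.1 px disparity corresponds to ~1 m.
disps = [44.0, 44.2, 44.1, 43.9, 44.3, 5.0]   # 5.0 px is a background outlier
d = branch_distance(disps)
```

Without the MAD step, the background pixel leaking into the mask would drag the median disparity down and inflate the estimated distance; with it, the outlier is discarded before triangulation.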
[CV-283] Latent-Compressed Variational Autoencoder for Video Diffusion Models CVPR2026
【速读】:该论文旨在解决视频变分自编码器(Video Variational Autoencoders, Video VAEs)在潜在扩散模型(Latent Diffusion Models, LDMs)中因潜空间通道数过多而导致的收敛困难与生成性能下降问题,尽管此时重建质量仍保持较高水平。解决方案的关键在于提出一种潜压缩方法,通过移除视频潜在表示中的高频成分来实现压缩,而非直接减少通道数量,从而避免了重建保真度的损失,同时在相同压缩比下显著提升了视频重建质量。
链接: https://arxiv.org/abs/2604.16479
作者: Jiarui Guan,Wenshuai Zhao,Zhengtao Zou,Juho Kannala,Arno Solin
机构: Aalto University (阿尔托大学); ELLIS Institute Finland (芬兰ELLIS研究所); University of Oulu (奥卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 findings
Abstract:Video variational autoencoders (VAEs) used in latent diffusion models typically require a sufficiently large number of latent channels to ensure high-quality video reconstruction. However, recent studies have revealed that an excessive number of latent channels can impede the convergence of latent diffusion models and deteriorate their generative performance, even when reconstruction quality remains high. We propose a latent compression method that removes high-frequency components in video latent representations rather than directly reducing the number of channels, which often compromises reconstruction fidelity. Experimental results demonstrate that the proposed method achieves superior video reconstruction quality compared to strong baselines while maintaining the same overall compression ratio.
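One plausible way to realize "removing high-frequency components instead of reducing channels" is a spatial FFT low-pass on the latent. This is our assumed sketch for illustration, not the paper's method; the cutoff shape and `keep_frac` are arbitrary choices.

```python
import numpy as np

def lowpass_latent(latent, keep_frac=0.5):
    """latent: (C, T, H, W). Zero out spatial frequencies beyond keep_frac of
    the Nyquist radius; the channel count C is left unchanged."""
    C, T, H, W = latent.shape
    spec = np.fft.fftshift(np.fft.fft2(latent, axes=(-2, -1)), axes=(-2, -1))
    yy, xx = np.mgrid[-H // 2:(H + 1) // 2, -W // 2:(W + 1) // 2]
    radius = np.sqrt((yy / (H / 2)) ** 2 + (xx / (W / 2)) ** 2)
    spec *= (radius <= keep_frac)          # low-pass mask, DC at the center
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(-2, -1)), axes=(-2, -1))
    return out.real

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 16, 16))    # toy video latent
z_low = lowpass_latent(z)
```

The compressed latent keeps the same shape (and thus the same channel count seen by the diffusion model) while carrying strictly less spectral energy, which is the property the paper exploits.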
[CV-284] From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration ACL2026
【速读】:该论文旨在解决高分辨率多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理阶段因视觉标记(visual tokens)爆炸式增长而导致的计算成本过高问题。现有加速策略如标记剪枝或层稀疏化存在严重的“骨干依赖性”(backbone dependency),即在Vicuna或Mistral等架构上表现良好,但在Qwen等架构上会导致性能显著下降。解决方案的关键在于通过截断矩阵熵(truncated matrix entropy)揭示了一个通用的三阶段推理生命周期,将视觉冗余解耦为普遍存在的内在视觉冗余(Intrinsic Visual Redundancy, IVR)和与架构相关的二次饱和冗余(Secondary Saturation Redundancy, SSR)。基于此洞察,作者提出HalfV框架:首先采用统一剪枝策略消除IVR,再根据具体表现自适应处理SSR,从而在多种骨干网络上实现更优的效率-性能权衡。
链接: https://arxiv.org/abs/2604.16462
作者: Jiaqi Shi,Yuechan Li,Xulong Zhang,Xiaoyang Qu,Jianzong Wang
机构: University of Science and Technology of China (中国科学技术大学); Wuhan University (武汉大学); Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China (平安科技(深圳)有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 14 figures, plus appendix, accepted at ACL 2026
Abstract:High-resolution Multimodal Large Language Models (MLLMs) face prohibitive computational costs during inference due to the explosion of visual tokens. Existing acceleration strategies, such as token pruning or layer sparsity, suffer from severe “backbone dependency”, performing well on Vicuna or Mistral architectures (e.g., LLaVA) but causing significant performance degradation when transferred to architectures like Qwen. To address this, we leverage truncated matrix entropy to uncover a universal three-stage inference lifecycle, decoupling visual redundancy into universal Intrinsic Visual Redundancy (IVR) and architecture-dependent Secondary Saturation Redundancy (SSR). Guided by this insight, we propose HalfV, a framework that first mitigates IVR via a unified pruning strategy and then adaptively handles SSR based on its specific manifestation. Experiments demonstrate that HalfV achieves superior efficiency-performance trade-offs across diverse backbones. Notably, on Qwen2.5-VL, it retains 96.8% performance at a 4.1\times FLOPs speedup, significantly outperforming state-of-the-art baselines. Our code is available at this https URL.
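A truncated matrix entropy over visual-token features can be sketched as below. This is our assumed formulation (Shannon entropy of the top-k normalized singular values); the paper's exact definition may differ, and all names are illustrative.

```python
import numpy as np

def truncated_matrix_entropy(tokens, k=8, eps=1e-12):
    """tokens: (N, D) visual-token features. Low entropy means a few dominant
    directions, i.e. high redundancy among the tokens."""
    s = np.linalg.svd(tokens, compute_uv=False)[:k]   # top-k singular values
    p = s / (s.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

rng = np.random.default_rng(0)
diverse = rng.standard_normal((64, 32))               # tokens span many directions
base = rng.standard_normal((1, 32))
redundant = np.repeat(base, 64, axis=0) + 0.01 * rng.standard_normal((64, 32))

h_diverse = truncated_matrix_entropy(diverse)
h_redundant = truncated_matrix_entropy(redundant)
```

Near-duplicate tokens collapse onto one dominant singular direction and score near-zero entropy, giving a scalar signal of redundancy that a pruning schedule could track across layers.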
[CV-285] A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions
【速读】:该论文旨在解决光学音乐识别(Optical Music Recognition, OMR)中从印刷或手写乐谱图像到可编辑符号表示的自动转换问题,其核心挑战在于准确捕捉音乐符号的细粒度特征与全局谱线结构,并建模符号间的时序依赖关系。解决方案的关键在于提出一种端到端的OMR框架,结合ResNet-v2风格的残差瓶颈卷积模块与多尺度空洞卷积以提取兼具局部细节和全局结构的特征表示,再通过双向门控循环单元(Bidirectional Gated Recurrent Unit, BiGRU)网络建模符号序列的时序依赖性;模型采用连接时序分类(Connectionist Temporal Classification, CTC)损失函数进行训练,无需显式对齐标注即可实现高效、高精度的符号识别。
链接: https://arxiv.org/abs/2604.16446
作者: Junwen Ma,Huhu Xue,Xingyuan Zhao,Weicheng Fu
机构: Tianshui Normal University (天水师范学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 2 figs, and 13 tables
Abstract:Optical Music Recognition (OMR) aims to convert printed or handwritten music score images into editable symbolic representations. This paper presents an end-to-end OMR framework that combines residual bottleneck convolutions with bidirectional gated recurrent unit (BiGRU)-based sequence modeling. A convolutional neural network with ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is used to extract features that encode both fine-grained symbol details and global staff-line structures. The extracted feature sequences are then fed into a BiGRU network to model temporal dependencies among musical symbols. The model is trained using the Connectionist Temporal Classification loss, enabling end-to-end prediction without explicit alignment annotations. Experimental results on the Camera-PrIMuS and PrIMuS datasets demonstrate the effectiveness of the proposed framework. On Camera-PrIMuS, the proposed method achieves a sequence error rate (SeER) of 7.52% and a symbol error rate (SyER) of 0.45%, with pitch, type, and note accuracies of 99.33%, 99.60%, and 99.28%, respectively. The average training time is 1.74 s per epoch, demonstrating high computational efficiency while maintaining strong recognition performance. On PrIMuS, the method achieves a SeER of 8.11% and a SyER of 0.49%, with pitch, type, and note accuracies of 99.27%, 99.58%, and 99.21%, respectively. A fine-grained error analysis further confirms the effectiveness of the proposed model.
[CV-286] (Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models
【速读】:该论文旨在解决基于机器学习(ML)的天气预测中两个主要的谱降质问题:一是确定性训练对抗集合均值导致的谱偏差,二是压缩编码引发的信息瓶颈效应。其解决方案的关键在于提出Mosaic模型,该模型通过学习到的功能扰动生成集合成员,并采用块稀疏注意力(block-sparse attention)机制在原始分辨率网格上运行,该机制通过共享空间相邻查询的键和值,在线性计算成本下捕捉长程依赖关系,从而实现高保真、校准良好的集合预报。
链接: https://arxiv.org/abs/2604.16429
作者: Maksim Zhdanov,Ana Lucic,Max Welling,Jan-Willem van de Meent
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:We introduce Mosaic, a probabilistic weather forecasting model that addresses two principal sources of spectral degradation in ML-based weather prediction: (1) deterministic training against ensemble means and (2) compressive encoding creating an information bottleneck. Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on 6 times finer data on headline upper-air variables and achieves state-of-the-art results among 1.5° models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12 seconds on a single H100 GPU.
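The key-and-value sharing across spatially adjacent queries can be sketched with a toy block-local attention. This is an illustrative assumption of one such scheme, not Mosaic's implementation: block and window sizes are arbitrary, and the real mechanism is hardware-aligned in ways this sketch ignores.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4):
    """q, k, v: (N, D). Queries in the same block share one K/V window (their
    own block plus its neighbours), so cost grows linearly in N, not as N^2."""
    N, D = q.shape
    out = np.zeros_like(v)
    for start in range(0, N, block):
        qs = q[start:start + block]
        lo, hi = max(0, start - block), min(N, start + 2 * block)
        ks, vs = k[lo:hi], v[lo:hi]            # K/V shared by the whole block
        logits = qs @ ks.T / np.sqrt(D)
        w = np.exp(logits - logits.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)     # row-wise softmax
        out[start:start + block] = w @ vs
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
y = block_sparse_attention(q, k, v)
```

Each block performs one small dense attention against a fixed-size shared window, so total work is (N/block) small matmuls rather than one N-by-N attention matrix.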
[CV-287] ICAT: Incident-Case-Grounded Adaptive Testing for Physical-Risk Prediction in Embodied World Models
【速读】:该论文旨在解决视频生成式世界模型(video-generative world models)在模拟物理风险和严重后果时存在的可靠性不足问题,即这些模型常忽视或弱化危险提示信息与极端后果,从而导致规划和训练过程中产生不安全偏好。解决方案的关键在于提出ICAT(Incident-Constrained Action Testing),其通过构建结构化的风险记忆库,并基于真实事故报告和安全手册进行检索与组合,以因果链和严重程度标签约束风险案例的生成,从而提升模型对危险机制、触发条件及后果严重性的准确建模能力。
链接: https://arxiv.org/abs/2604.16405
作者: Zhenglin Lai,Sirui Huang,Yuteng Li,Changxin Huang,Jianqiang Li,Bingzhe Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Video-generative world models are increasingly used as neural simulators for embodied planning and policy learning, yet their ability to predict physical risk and severe consequences is rarely examined. We find that these models often downplay or omit key danger cues and severe outcomes for hazardous actions, which can induce unsafe preferences during planning and training on imagined rollouts. We propose ICAT, which grounds testing in real incident reports and safety manuals by building structured risk memories and retrieving/composing them to constrain the generation of risk cases with causal chains and severity labels. Experiments on an ICAT-based benchmark show that mainstream world models frequently miss mechanisms and triggering conditions and miscalibrate severity, falling short of the reliability required for safety-critical embodied deployment.
[CV-288] Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining ICLR2026
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在构建通用机器人时面临的两大核心问题:一是2D图像预测与3D动作预测之间的错位(misalignment),二是视觉与动作耦合训练方式限制了模型从大规模无动作标注的网络视频数据中学习的能力。解决方案的关键在于提出一种名为DeFI的新框架,通过解耦视觉前向动力学(General Forward Dynamics Model, GFDM)和逆动力学(General Inverse Dynamics Model, GIDM)的预训练过程,分别利用不同来源的数据进行独立训练——GFDM基于多样化的真人与机器人视频进行未来帧生成预训练,GIDM则通过自监督学习从未标注的视频片段中推断潜在动作。最终二者集成于统一架构中,实现端到端微调,从而在多个基准测试(如CALVIN ABC-D、SimplerEnv)和真实场景部署中均显著优于现有方法。
链接: https://arxiv.org/abs/2604.16391
作者: Wenyao Zhang,Bozhou Zhang,Zekun Qi,Wenjun Zeng,Xin Jin,Li Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学); Eastern Institute of Technology, Ningbo (宁波东方理工大学); Shanghai Innovation Institute (上海创新研究院); Tsinghua University (清华大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026
Abstract:Vision-language-action (VLA) models have shown great potential in building generalist robots, but still face a dilemma-misalignment of 2D image forecasting and 3D action prediction. Besides, such a vision-action entangled training manner limits model learning from large-scale, action-free web video data. To address these issues, we propose DeFI, a novel framework that Decouples visual Forward and Inverse dynamics pretraining to exploit respective data sources, wherein video generation and action prediction are disentangled. We introduce the General Forward Dynamics Model (GFDM), pretrained on diverse human and robot videos for future prediction, and the General Inverse Dynamics Model (GIDM), trained via self-supervised learning to infer latent actions from unlabeled video transitions. These models are then integrated into a unified architecture for end-to-end finetuning on downstream tasks. In this manner, GFDM and GIDM first shine separately and then cooperate for mutual benefit. Extensive experiments on CALVIN ABC-D and SimplerEnv demonstrate state-of-the-art performance, with DeFI achieving an average task length of 4.51 for CALVIN, 51.2% success rate on SimplerEnv-Fractal benchmark and 81.3% success rate in real-world deployment, significantly outperforming prior methods.
[CV-289] Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
【速读】:该论文旨在解决传统基于快速扩展随机树(Rapidly-exploring Random Trees, RRT)的运动规划方法在实际应用中面临的局限性:即现有方法依赖于显式的数值关节角度作为目标配置,而许多视觉导向的任务(如图像或演示视频驱动的场景)无法提供精确的目标配置。为应对这一挑战,论文提出视觉-RRT(visual-RRT, vRRT),其核心创新在于将可微分机器人渲染(differentiable robot rendering)带来的梯度优化能力与RRT的采样探索机制相融合,从而实现从视觉观测到动作空间的有效映射。关键解决方案包括:(i) 基于前沿的探索-利用策略,自适应地优先搜索视觉上更具潜力的区域;(ii) 惯性梯度树扩展机制,通过跨分支继承优化状态实现梯度利用的一致性和高效性。实验表明,vRRT在Franka、UR5e和Fetch等多种机械臂平台上均能有效实现模拟与真实环境中的视觉目标规划,显著缩小了采样式规划与以视觉为中心的机器人应用之间的差距。
链接: https://arxiv.org/abs/2604.16388
作者: Sebin Lee,Jumin Lee,Taeyeon Kim,Younju Na,Woobin Im,Sung-Eui Yoon
机构: KAIST(韩国科学技术院); Samsung Electronics(三星电子)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Rapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (i) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (ii) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consistent gradient exploitation. Extensive experiments across various robot manipulators including Franka, UR5e, and Fetch demonstrate that vRRT achieves effective visual-goal planning in both simulated and real-world settings, bridging the gap between sampling-based planning and vision-centric robot applications. Our code is available at this https URL.
[CV-290] CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models CVPR2026
【速读】:该论文旨在解决商业文本到图像生成模型(text-to-image models)在通过API部署时难以进行知识产权归属追踪的问题,尤其针对缺乏预部署水印或内部模型访问权限的黑盒场景。解决方案的关键在于提出一种名为组合语义指纹(Compositional Semantic Fingerprinting, CSF)的新方法,其核心思想是将模型视为语义类别生成器,并通过组合式模糊提示(compositional underspecified prompts)探测模型输出特征,从而识别模型是否源自受保护的训练谱系。该方法利用攻击者难以预判和抑制所有潜在指纹组合的不对称优势,在仅需查询访问的前提下实现高置信度的模型溯源,且在多个主流模型家族及其微调变体中验证了其有效性与鲁棒性。
链接: https://arxiv.org/abs/2604.16363
作者: Junhoo Lee,Mijin Koo,Nojun Kwak
机构: Seoul National University (首尔国立大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Text-to-image models are commercially valuable assets often distributed under restrictive licenses, but such licenses are enforceable only when violations can be detected. Existing methods require pre-deployment watermarking or internal model access, which are unavailable in commercial API deployments. We present Compositional Semantic Fingerprinting (CSF), the first black-box method for attributing fine-tuned text-to-image models to protected lineages using only query access. CSF treats models as semantic category generators and probes them with compositional underspecified prompts that remain rare under fine-tuning. This gives IP owners an asymmetric advantage: new prompt compositions can be generated after deployment, while attackers must anticipate and suppress a much broader space of fingerprints. Across 6 model families (FLUX, Kandinsky, SD1.5/2.1/3.0/XL) and 13 fine-tuned variants, our Bayesian attribution framework enables controlled-risk lineage decisions, with all variants satisfying the dominance criterion.
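其贝叶斯归因可用对数几率(log-odds)累积来直观理解:每条组合式探测提示的匹配结果,按似然比更新“目标模型属于受保护谱系”这一假设的后验。以下为极简示意,两个似然值均为假设数值,并非论文框架本身:

```python
import math

def attribute(prior, probe_outcomes, p_match_lineage=0.9, p_match_indep=0.2):
    # hypothetical per-probe likelihoods: P(match | same lineage) and
    # P(match | independent model); accumulate log-odds over probes
    log_odds = math.log(prior / (1 - prior))
    for matched in probe_outcomes:
        if matched:
            log_odds += math.log(p_match_lineage / p_match_indep)
        else:
            log_odds += math.log((1 - p_match_lineage) / (1 - p_match_indep))
    return 1 / (1 + math.exp(-log_odds))  # posterior P(same lineage)
```

可以看到,连续若干次匹配(或不匹配)会把后验迅速推向 1(或 0),这正是多探测提示带来的不对称优势。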
[CV-291] SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
【速读】:该论文旨在解决真实场景中因数据稀缺和弱监督导致机器学习模型性能受限的问题,尤其是在乳腺X线摄影(mammography)等医学影像分析任务中,传统多实例学习(Multiple Instance Learning, MIL)虽具优势,但现有方法在增强MIL数据表示时仍存在局限性——主要表现为仅在实例层面进行操作,无法捕捉袋内(bag-level)实例间的依赖关系。解决方案的关键在于提出SetFlow,一种直接在表示空间中建模整个MIL袋(即集合)的生成架构,其结合流匹配(flow matching)范式与受Set Transformer启发的设计,能够处理置换不变输入并捕获袋内实例间的交互关系;同时,模型通过类别标签和输入尺度进行条件控制,生成语义一致且结构合理的表示集合,从而有效提升下游任务性能,并在仅使用合成数据训练时亦表现优异,验证了表示空间生成建模在数据稀缺和隐私敏感场景下的有效性。
链接: https://arxiv.org/abs/2604.16362
作者: Nikola Jovišić,Milica Škipina,Vanja Švenda
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures, 4 tables
Abstract:Data scarcity and weak supervision continue to limit the performance of machine learning models in many real-world applications, such as mammography, where Multiple Instance Learning (MIL) often offers the best formulation. While recent foundation models provide strong semantic representations out of the box, effective augmentation of such representations of MIL data remains limited, as existing methods operate at the instance level and fail to capture intra-bag dependencies. In this work, we introduce SetFlow, a generative architecture that models entire MIL bags (i.e., sets) directly in the representation space. Our approach leverages the flow matching paradigm combined with a Set Transformer-inspired design, enabling it to handle permutation-invariant inputs while capturing interactions between instances within each bag. The model is conditioned on both class labels and input scale, allowing it to generate coherent and semantically consistent sets of representations. We evaluate SetFlow on a large-scale mammography benchmark using a state-of-the-art MIL-PF classification pipeline. The generated samples are shown to closely match the original data distribution and even improve downstream performance when used for augmentation. Furthermore, training on synthetic data alone shows competitive results, demonstrating the effectiveness of representation-space generative modeling for data-scarce and privacy-sensitive tasks.
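流匹配范式的训练目标可以用几行代码说明:在线性插值路径 x_t=(1-t)x0+t·x1 上,目标速度恒为 x1-x0;袋级表示则需满足置换不变性。此处用均值池化代替 Set Transformer 式的注意力池化,属简化假设:

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    # linear interpolant x_t = (1-t) x0 + t x1; target velocity v* = x1 - x0
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0

def set_encoder(bag, W):
    # permutation-invariant toy bag encoder: per-instance projection + mean pool
    return np.tanh(bag @ W).mean(axis=0)
```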
[CV-292] A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction
【速读】:该论文旨在解决密集预测任务中物体尺度变化带来的挑战,尤其是现有特征金字塔网络(Feature Pyramid Network, FPN)在捕捉判别性特征和识别小目标方面存在的固有设计缺陷。其解决方案的关键在于提出一种渐近内容感知金字塔注意力网络(Asymptotic Content-Aware Pyramid Attention Network, A3-FPN),通过渐近解耦框架与内容感知注意力模块增强多尺度特征表示:首先采用水平扩展的列式网络实现渐近全局特征交互并解耦每一层级与所有层次表示;其次在特征融合阶段引入邻近层级的补充内容以生成位置感知偏移与权重,进行上下文感知重采样,并学习深层上下文重加权以提升类别内相似性;最后在特征重组阶段强化尺度内判别特征学习,并基于特征图的信息含量与空间变化重新组装冗余特征,从而显著提升模型对多尺度目标的识别能力。
链接: https://arxiv.org/abs/2604.10210
作者: Meng’en Qin,Yu Song,Quanling Zhao,Xiaodong Yang,Yingtao Che,Xiaohui Yang
机构: Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms(河南省人工智能理论与算法工程研究中心); Shenzhen University of Advanced Technology(深圳理工大学); The Hong Kong Polytechnic University(香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at this https URL.
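“生成位置感知偏移与权重、进行上下文感知重采样”的基本机制可用如下最近邻取整版本示意。论文实际采用可学习的偏移与权重并配合插值,此处仅为概念演示:

```python
import numpy as np

def content_aware_resample(feat, offsets, weights):
    # feat: (H, W) feature map; offsets: (H, W, 2) position-wise (dy, dx)
    # sampling shifts; weights: (H, W) per-position reweighting.
    # Nearest-neighbour rounding keeps the sketch short.
    H, W = feat.shape
    out = np.zeros_like(feat)
    for y in range(H):
        for x in range(W):
            sy = int(np.clip(round(y + offsets[y, x, 0]), 0, H - 1))
            sx = int(np.clip(round(x + offsets[y, x, 1]), 0, W - 1))
            out[y, x] = weights[y, x] * feat[sy, sx]
    return out
```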
[CV-293] Optimally Bridging Semantics and Data: Generative Semantic Communication via Schrödinger Bridge
【速读】:该论文旨在解决生成式语义通信(Generative Semantic Communication, GSC)中因依赖从高斯分布到图像分布的长且间接最优传输路径所导致的严重幻觉问题及高计算开销。现有方法受限于必须通过语义空间映射至高斯分布再生成图像的流程,难以保证重建质量与效率。其解决方案的关键在于提出基于薛定谔桥(Schrödinger Bridge, SB)的通用框架——SBGSC,该框架突破了传统高斯先验限制,实现语义到图像的直接生成;进一步地,在此框架下设计了扩散薛定谔桥生成式语义通信(Diffusion SB-based GSC, DSBGSC),通过重构扩散模型中的非线性漂移项以实现最优分布传输,显著降低幻觉并减少计算复杂度;同时引入自一致性目标函数,引导模型学习指向图像的非线性速度场,跳过马尔可夫噪声预测过程,从而将采样步数大幅缩减,推理速度提升超过8倍。
链接: https://arxiv.org/abs/2604.17802
作者: Dahua Gao,Ruichao Liu,Minxi Yang,Shuai Ma,Youlong Wu,Guangming Shi
机构: Xidian University (西安电子科技大学); Pazhou Lab (琶洲实验室); Peng Cheng Laboratory (鹏城实验室); ShanghaiTech University (上海科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 10 figures, under review
Abstract:Generative Semantic Communication (GSC) is a promising solution for image transmission over narrow-band and high-noise channels. However, existing GSC methods rely on long, indirect transport trajectories from a Gaussian to an image distribution guided by semantics, causing severe hallucination and high computational cost. To address this, we propose a general framework named Schrödinger Bridge-based GSC (SBGSC). By leveraging the Schrödinger Bridge (SB) to construct optimal transport trajectories between arbitrary distributions, SBGSC breaks Gaussian limitations and enables direct generative decoding from semantics to images. Within this framework, we design Diffusion SB-based GSC (DSBGSC). DSBGSC reconstructs the nonlinear drift term of diffusion models using Schrödinger potentials, achieving direct optimal distribution transport to reduce hallucinations and computational overhead. To further accelerate generation, we propose a self-consistency-based objective guiding the model to learn a nonlinear velocity field pointing directly toward the image, bypassing Markovian noise prediction to significantly reduce sampling steps. Simulation results demonstrate that DSBGSC outperforms state-of-the-art GSC methods, improving FID by at least 38% and SSIM by 49.3%, while accelerating inference speed by over 8 times.
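当两端各为一个确定样本、参考过程取布朗运动时,薛定谔桥退化为布朗桥,其漂移项 (x1-x_t)/(1-t) 直接指向终点,体现了“任意分布间直接传输”的直觉。以下数值示意只演示该漂移结构,并非 DSBGSC 的实现:

```python
import numpy as np

def brownian_bridge(x0, x1, steps, sigma, rng):
    # diffusion pinned at both ends: drift (x1 - x_t) / (1 - t) pulls the
    # sample toward x1, so the path ends near x1 regardless of the noise
    xs = [np.asarray(x0, dtype=float)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = xs[-1]
        drift = (np.asarray(x1) - x) / (1 - t)
        xs.append(x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape))
    return xs
```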
[CV-294] VIDS: A Verified Imaging Dataset Standard for Medical AI
【速读】:该论文旨在解决医学影像人工智能(Medical Imaging AI)开发中缺乏统一、可机器验证的数据集标准问题,特别是现有标准如DICOM和BIDS未涵盖数据标注的溯源性(annotation provenance)、质量文档化以及机器学习就绪性(ML readiness)等关键维度。其解决方案的核心是提出VIDS(Verified Imaging Dataset Standard),一个包含文件夹结构、命名规范、标注溯源模式、质量文档及21条机器可执行验证规则的开放规范,支持以NIfTI为工作格式并保留DICOM元数据以确保可追溯性,同时兼容主流深度学习框架(如nnU-Net、MONAI、COCO等)且不丢失标注溯源信息。通过在LIDC-IDRI、BraTS等四大公共数据集上进行基准测试,发现当前广泛使用的数据集仅满足20–39%的VIDS维度,凸显了标注溯源与质量文档的系统性缺失;进一步发布了首个完全符合VIDS标准的LIDC-Hybrid-100参考数据集,验证了该标准的有效性和实用性。
链接: https://arxiv.org/abs/2604.17525
作者: Joan S. Muthu,John Shalen
机构: Princeton Medical Systems (普林斯顿医学系统公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures, 5 tables
Abstract:Medical imaging AI development is fundamentally dependent on annotated datasets, yet no existing standard provides machine-enforceable validation across dataset structure, annotation provenance, quality documentation, and ML readiness within a single framework. DICOM standardizes image acquisition, storage, and communication at the individual study level. BIDS organizes neuroimaging research datasets with consistent naming conventions. Neither addresses the curation layer, viz., who annotated what, when, with what tool, and to what quality standard. This paper presents VIDS (Verified Imaging Dataset Standard), an open specification that defines folder layout, file naming, annotation provenance schemas, quality documentation, and 21 machine-enforceable validation rules across two compliance profiles. VIDS uses NIfTI as a canonical working format while preserving full DICOM metadata in sidecars for traceability, and supports export to any downstream ML framework (nnU-Net, MONAI, COCO, flat NIfTI) without loss of provenance. Twenty-two compliance dimensions are defined and four major public datasets – LIDC-IDRI, BraTS, CheXpert, and the Medical Segmentation Decathlon – are benchmarked against these dimensions. Even widely used datasets satisfy only 20–39% of these dimensions, with provenance and quality documentation as the largest systematic gaps. LIDC-Hybrid-100 is released as a 100-subject VIDS-compliant reference CT dataset with consensus segmentation masks from four radiologist annotations (mean pairwise Dice 0.7765), validating 21/21 on the Full compliance profile. VIDS is fully open source: the specification is CC BY 4.0, all tools are Apache 2.0, the reference validator is available on PyPI (pip install vids-validator), and LIDC-Hybrid-100 is published on Zenodo (this https URL). 
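机器可执行校验规则的思路可用一个玩具校验器示意;下面的字段名与规则均为假设,真实 VIDS 规范定义了 21 条规则与两个合规档次,应以官方的 vids-validator 为准:

```python
def validate_dataset(manifest):
    # toy machine-enforceable checks in the spirit of VIDS: top-level
    # structure plus per-annotation provenance (who / what tool / when)
    errors = []
    for key in ("subjects", "annotations", "license"):
        if key not in manifest:
            errors.append(f"missing top-level field: {key}")
    for ann in manifest.get("annotations", []):
        for field in ("annotator", "tool", "date"):
            if field not in ann:
                errors.append(f"annotation missing provenance field: {field}")
    return errors
```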
[CV-295] Learned Nonlocal Feature Matching and Filtering for RAW Image Denoising
【速读】:该论文旨在解决当前基于深度学习的RAW图像去噪方法中存在两大问题:一是现代去噪架构通常忽视了经典去噪技术(如基于自相似性的非局部方法)所积累的先验知识,导致模型结构复杂、参数量大;二是现有方法在实际部署时难以适应资源受限设备及多样化的噪声特性。其解决方案的关键在于提出一种面向RAW-to-RAW去噪的新型神经网络架构,将经典非局部块(nonlocal block)的可解释结构嵌入到全可学习网络中,通过一个创新的非局部模块实现邻域匹配、协同滤波与聚合的完整流程,该模块基于多尺度特征表示操作,仅需少量邻居即可高效扩展感受野,在保持高重建质量的同时显著降低参数量,并通过在真实RAW数据与合成噪声联合训练并引入噪声水平图(noise level map)作为条件输入,实现了对未见传感器设备的泛化能力。
链接: https://arxiv.org/abs/2604.17453
作者: Marco Sánchez-Beeckman,Antoni Buades(IAC3 & Departament de Ciències Matemàtiques i Informàtica, Universitat de les Illes Balears)
机构: University of the Balearic Islands (巴利阿里群岛大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures
Abstract:Being one of the oldest and most basic problems in image processing, image denoising has seen a resurgence spurred by rapid advances in deep learning. Yet, most modern denoising architectures make limited use of the technical knowledge acquired from researching the classical denoisers that came before the mainstream use of neural networks, instead relying on depth and large parameter counts. This poses a challenge not only for understanding the properties of such networks, but also for deploying them on real devices which may present resource constraints and diverse noise profiles. Tackling both issues, we propose an architecture dedicated to RAW-to-RAW denoising that incorporates the interpretable structure of classical self-similarity-based denoisers into a fully learnable neural network. Our design centers on a novel nonlocal block that parallels the established pipeline of neighbor matching, collaborative filtering and aggregation popularized by nonlocal patch-based methods, operating on learned multiscale feature representations. This built-in nonlocality efficiently expands the receptive field, so that a single block per scale with a moderate number of neighbors suffices to obtain high-quality results. Training the network on a curated dataset with clean real RAW data and modeled synthetic noise while conditioning it on a noise level map yields a sensor-agnostic denoiser that generalizes effectively to unseen devices. Both quantitative and visual results on benchmarks and in-the-wild photographs position our method as a practical and interpretable solution for real-world RAW denoising, achieving results competitive with state-of-the-art convolutional and transformer-based denoisers while using significantly fewer parameters. The code is available at this https URL.
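经典非局部方法“邻域匹配、协同滤波、聚合”的流水线可用单像素版的非局部均值示意;论文将同样的结构搬到可学习的多尺度特征空间,此处的像素域实现仅为概念演示:

```python
import numpy as np

def nonlocal_denoise_pixel(img, y, x, patch=1, k=5, h=0.1):
    # match the reference patch against all candidate patches, keep the
    # k nearest neighbours, and aggregate their centre pixels with
    # Gaussian similarity weights
    H, W = img.shape
    ref = img[y - patch:y + patch + 1, x - patch:x + patch + 1]
    cands = []
    for cy in range(patch, H - patch):
        for cx in range(patch, W - patch):
            p = img[cy - patch:cy + patch + 1, cx - patch:cx + patch + 1]
            cands.append((float(((ref - p) ** 2).mean()), img[cy, cx]))
    cands.sort(key=lambda c: c[0])
    d, v = zip(*cands[:k])
    w = np.exp(-np.asarray(d) / h ** 2)
    return float((w * np.asarray(v)).sum() / w.sum())
```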
[CV-296] Chaos-Enhanced Prototypical Networks for Few-Shot Medical Image Classification
【速读】:该论文旨在解决在肿瘤影像诊断中因标注数据稀缺导致的少样本学习(Few-Shot Learning, FSL)性能受限问题,特别是标准原型网络(Prototypical Networks)在脑肿瘤图像中因形态学噪声和类内高方差引发的“原型不稳定性”问题。其解决方案的关键在于将非线性逻辑混沌模块(Logistic Chaos Module)集成到微调后的ResNet-18主干网络中,构建出混沌增强原型网络(Chaos-Enhanced ProtoNet, CE-ProtoNet),通过利用逻辑混沌映射的确定性遍历性,在 episodic 训练过程中向支持集特征注入受控扰动,从而对嵌入空间进行“压力测试”,使模型收敛于对噪声鲁棒的表示,且无需增加计算开销。实验表明,15%的混沌扰动强度可有效稳定高维聚类并降低类间离散度,最终在4分类5样本任务上达到84.52%的最高测试准确率,显著优于标准原型网络。
链接: https://arxiv.org/abs/2604.17300
作者: Chinhtakuntla Meghan Sai,Murarisetty V Sai Kartheek,Sita Devi Bharatula,Karthik Seemakurthy
机构: Amrita Vishwa Vidyapeetham (阿姆里塔世界大学); Hydronium Energies (氢氧能源公司)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The scarcity of labeled clinical data in oncology makes Few-Shot Learning (FSL) a critical framework for Computer Aided Diagnostics, but we observed that standard Prototypical Networks often struggle with the “prototype instability” caused by morphological noise and high intra-class variance in brain tumor scans. Our work attempts to minimize this by integrating a non-linear Logistic Chaos Module into a fine-tuned ResNet-18 backbone, creating the Chaos-Enhanced ProtoNet (CE-ProtoNet). Using the deterministic ergodicity of the logistic chaos map, we inject controlled perturbations into support features during episodic training, essentially “stress testing” the embedding space. This process makes the model converge on noise-invariant representations without increasing computational overhead. Testing this on a 4-way 5-shot brain tumor classification task, we found that a 15% chaotic injection level worked efficiently to stabilize high-dimensional clusters and reduce class dispersion. Our method achieved a peak test accuracy of 84.52%, outperforming standard ProtoNet. Our results suggest the idea of using chaotic perturbation as an efficient, low-overhead regularization tool for data-scarce regimes.
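逻辑混沌映射 x_{n+1} = r·x_n(1-x_n) 是确定性且可复现的扰动源。下面按论文给出的 15% 强度向支持集特征注入扰动并计算类原型;r 与初值为假设参数:

```python
import numpy as np

def logistic_sequence(x0, n, r=3.99):
    # iterate the logistic map; for r near 4 and x0 in (0, 1) the
    # sequence stays in (0, 1) and behaves chaotically
    xs, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)
        xs.append(x)
    return np.array(xs)

def perturb_support(features, level=0.15, x0=0.7):
    # inject chaotic perturbations, centred from (0, 1) to (-0.5, 0.5)
    noise = logistic_sequence(x0, features.size).reshape(features.shape) - 0.5
    return features + level * noise

def prototypes(support, labels):
    # class prototypes = per-class mean of (possibly perturbed) features
    return {c: support[labels == c].mean(axis=0) for c in np.unique(labels)}
```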
[CV-297] A Two-Stage Deep Learning Framework for Segmentation of Ten Gastrointestinal Organs from Coronal MR Enterography
【速读】:该论文旨在解决磁共振肠造影(MRE)中胃肠道(GI)器官分割的准确性问题,以支持炎症性肠病(IBD)的精准诊断。由于解剖结构变异大、类别不平衡及组织对比度低等因素,现有自动化方法难以可靠分割多个GI器官,尤其是阑尾等小目标。其解决方案的关键在于提出一种双阶段深度学习框架:第一阶段采用DenseNet201-UNet++生成粗略感兴趣区域(ROI)掩膜,实现器官定位;第二阶段基于器官特异性图像块训练DenseNet121-SelfONN-UNet模型,并结合数据增强、归一化、五折交叉验证和类别加权策略缓解严重类别不平衡问题(特别是阑尾),从而显著提升分割精度。该方法在mDSC达88.99%、mIoU为84.76%的基础上,实现了从粗到细、器官感知的高精度分割,展现出良好的临床转化潜力。
链接: https://arxiv.org/abs/2604.17118
作者: Ashiqur Rahman,Md. Abu Sayed,Md Sharjis Ibne Wadud,Md. Abu Asad Al-Hafiz,Adam Mushtak,Muhammad E. H. Chowdhury
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate segmentation of gastrointestinal (GI) organs in magnetic resonance enterography (MRE) is critical for diagnosing inflammatory bowel disease (IBD). However, anatomical variability, class imbalance, and low tissue contrast hinder reliable automation. This study proposes a dual-stage deep learning framework for organ-specific segmentation of GI structures from coronal MRE images to address these challenges. A publicly available MRE dataset of 3,195 coronal T2-weighted HASTE slices from 114 IBD patients was used. Initially, a DenseNet201-UNet++ model generated coarse masks for ROI extraction. A DenseNet121-SelfONN-UNet model was then trained on organ-specific patches. Extensive data augmentation, normalization, five-fold cross-validation, and class-specific weighting were applied to mitigate severe class imbalance, particularly for the appendix. The initial stage achieved strong organ localization but underperformed for the appendix; class weighting improved its DSC from 6.76% to 85.76%. The second-stage DenseNet121-SelfONN-UNet significantly enhanced segmentation across all GI structures, with notable DSC gains (cecum +23.62%, sigmoid +18.57%, rectum +17.99%, small intestine +16.06%). Overall, the framework achieved mDSC of 88.99%, mIoU of 84.76%, and mHD95 of 6.94 mm, outperforming all baselines. This framework demonstrates the effectiveness of a coarse-to-fine, organ-aware segmentation strategy for intestinal MRE. Despite higher computational cost, it shows strong potential for clinical translation and enables anatomically informed diagnostic tools in gastroenterology. 
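针对阑尾等小目标的严重类别不平衡,常见的类别加权做法之一是逆频率权重(此处归一化到均值为 1)。论文未给出其加权公式,以下仅为一种示意:

```python
import numpy as np

def class_weights(pixel_counts):
    # inverse-frequency class weights, normalised to mean 1, so rare
    # classes (e.g. the appendix) receive proportionally larger loss weight
    counts = np.asarray(pixel_counts, dtype=float)
    w = counts.sum() / (len(counts) * counts)
    return w / w.mean()
```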
[CV-298] Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach
【速读】:该论文旨在解决乳腺癌热成像图像分析中经典深度学习方法在复杂热模式分类任务上面临的性能瓶颈问题。其解决方案的关键在于提出一种混合量子神经网络(Hybrid Quantum Neural Network, HQNN)架构,该架构将量子计算原理与经典卷积神经网络相结合:量子部分采用4比特参数化量子电路(含强纠缠层)实现量子感知特征编码,经典部分则引入多头注意力机制进行特征融合与识别,从而在保持近中期量子设备计算可行性的同时,显著提升了模型的收敛速度和特征表示能力,为量子优势在医学图像分类中的实现提供了实证支持。
链接: https://arxiv.org/abs/2604.16953
作者: Riza Alaudin Syah,Irwan Alnarus Kautsar,Gunawan Witjaksono,Haza Nuzly bin Abdull Hamed
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in: 2025 IEEE International Biomedical Instrumentation and Technology Conference (IBITeC)
Abstract:Breast cancer diagnosis through thermographic image analysis remains a critical challenge in medical AI, with classical deep learning approaches facing limitations in complex thermal pattern classification tasks. This paper presents a novel Hybrid Quantum Neural Network (HQNN) architecture that integrates quantum computing principles with classical convolutional neural networks for enhanced breast cancer classification. Our approach employs parameterized quantum circuits with multi-head attention mechanisms for quantum-aware feature encoding, coupled with classical convolutional layers for comprehensive pattern recognition. The quantum component utilizes a 4-qubit variational circuit with strongly entangling layers, while the classical component incorporates advanced attention mechanisms for feature fusion. Experimental validation on breast cancer thermographic data demonstrates substantial performance improvements over state-of-the-art classical architectures, with the quantum-enhanced approach exhibiting superior convergence dynamics and enhanced feature representation capabilities. Our findings provide evidence for quantum advantage in medical image classification through classical simulation, establishing a framework for quantum-classical hybrid systems in healthcare applications. The methodology addresses key challenges in quantum machine learning deployment while maintaining computational feasibility on near-term quantum devices.
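参数化量子电路的经典模拟只需少量线性代数。下例是一个两比特变分电路:对每个比特施加 RY 旋转后用 CNOT 纠缠,输出第一个比特的 ⟨Z⟩ 期望;这只是变分电路的入门示意,并非论文的 4 比特强纠缠层结构:

```python
import numpy as np

def ry(theta):
    # single-qubit RY rotation matrix
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# CNOT with qubit 0 as control, in the |q0 q1> basis ordering
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def circuit_expectation(thetas):
    # start in |00>, rotate each qubit, entangle, and measure <Z> on qubit 0
    state = np.zeros(4); state[0] = 1.0
    state = CNOT @ (np.kron(ry(thetas[0]), ry(thetas[1])) @ state)
    Z0 = np.kron(np.diag([1.0, -1.0]), np.eye(2))
    return float(state @ (Z0 @ state))
```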
[CV-299] Structured 3D-SVD: A Practical Framework for the Compression and Reconstruction of Biological Volumetric Images
【速读】:该论文旨在解决生物体积数据(biological volumetric data)在重建、压缩与分析中的效率与精度问题。传统方法如Tucker分解和CANDECOMP/PARAFAC分解(CPD)存在计算复杂度高或重建质量不足的局限。其解决方案的关键在于提出结构化三维奇异值分解(Structured 3D-SVD),该方法借鉴矩阵奇异值分解(SVD)的逻辑,将三阶体积数据在空间域中表示,并通过有序的准奇异系数实现渐进式重建。该框架在保证重建质量接近Tucker分解的同时显著缩短计算时间,并优于CPD在准确性和运行效率上的表现,同时证明低截断等级即可保留主要体积结构,高截断等级则提供更精细的重构细节。
链接: https://arxiv.org/abs/2604.16947
作者: Mario Aragonés Lozano,Oscar Romero,Antonio León
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 19 pages, 4 figures, 6 tables
Abstract:This work introduces Structured 3D-SVD as a practical framework for the reconstruction, compression, and analysis of biological volumetric data. Inspired by the logic of matrix singular value decomposition (SVD), the proposed approach represents third-order volumetric data in the spatial domain and supports progressive reconstruction through ordered quasi-singular coefficients. The experimental evaluation was carried out on two biological volumetric datasets: one full-volume scan of a fish and another of a brain. The results show that Structured 3D-SVD achieves reconstruction quality close to that of Tucker decomposition while requiring shorter computation times and outperforms canonical polyadic decomposition (CPD) in both accuracy and runtime. In addition, a progressive reconstruction analysis shows that relatively low truncation levels are sufficient to preserve the main volumetric structures, while higher truncation levels lead to more detailed reconstructions.
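“把矩阵 SVD 的逻辑推广到三阶数据”的最直接做法是模展开后截断 SVD 再折回。下面给出模-1 方向的通用示意(Structured 3D-SVD 的具体构造以原文为准):

```python
import numpy as np

def truncated_mode1(tensor, rank):
    # unfold along mode 1, keep the leading `rank` singular components,
    # then fold back to the original 3rd-order shape
    I, J, K = tensor.shape
    M = tensor.reshape(I, J * K)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return approx.reshape(I, J, K)
```

截断等级越高,重建越精细;对秩为 1 的张量,rank=1 即可精确重建。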
[CV-300] AstroSURE: Learning to Remove Noise from Astronomical Images Without Ground Truth Data
【速读】:该论文旨在解决天文成像中因光子计数低导致的图像噪声问题,尤其是在缺乏干净真实图像(ground-truth)情况下如何有效进行去噪并提升弱源检测能力。其解决方案的关键在于采用无需干净标签的无监督深度学习去噪方法,包括Noise2Noise、Stein’s Unbiased Risk Estimator(SURE)以及基于盲区(blind-spot)的设计,并通过合成数据与哈勃空间望远镜(Hubble Space Telescope, HST)和加拿大-法国-夏威夷望远镜(Canada-France-Hawaii Telescope, CFHT)的真实观测数据进行验证。结果表明,这些方法在经过领域一致初始化后能显著改善HST数据中的弱源可探测性,但向CFHT数据迁移效果有限,凸显了模型在不同仪器或观测域间适应时对域相似性的依赖性。
链接: https://arxiv.org/abs/2604.16793
作者: Omid Vaheb,Sebastien Fabbro,Stark Draper
机构: University of Toronto (多伦多大学); National Research Council Herzberg Astronomy and Astrophysics Research Centre (加拿大国家研究委员会赫兹伯格天文学与天体物理研究中心); University of British Columbia (不列颠哥伦比亚大学)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In astronomical imaging, the low photon count of exposures necessitates extensive post-processing steps, including contamination removal and denoising. This paper evaluates deep-learning denoising methods that can be trained without clean ground-truth images and assesses their utility for detection-oriented analysis of astronomical data. We adapt and compare Noise2Noise, Stein’s Unbiased Risk Estimator, and blind-spot-based methods using synthetic data and real observations from the Hubble Space Telescope (HST) and the Canada-France-Hawaii Telescope (CFHT). Performance is evaluated using object-detection metrics, including correct detection rate and false alarm rate, together with image-based metrics and pixel-distribution diagnostics. The results show that these methods can improve faint-source detectability relative to the original noisy images, with encouraging gains on HST data after domain-consistent initialization, while transfer to CFHT data is more limited, highlighting the importance of instrument/domain similarity for unsupervised adaptation.
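SURE 的关键性质是:仅凭含噪观测即可无偏估计去噪器的均方误差,无需干净真值。以软阈值去噪器、已知高斯噪声方差 σ² 为例,其闭式 SURE 如下(经典公式示意,并非论文网络的完整训练目标):

```python
import numpy as np

def sure_soft_threshold(y, t, sigma):
    # Stein's unbiased risk estimate of the MSE of soft-thresholding at
    # level t, evaluated from the noisy observation y alone
    n = y.size
    data_term = np.sum(np.minimum(np.abs(y), t) ** 2)
    df_term = 2 * sigma ** 2 * np.sum(np.abs(y) <= t)
    return n * sigma ** 2 + data_term - df_term
```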
[CV-301] A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction
【速读】:该论文旨在解决现有脑龄(brain age)量化方法在年龄范围受限和仅依赖单模态MRI数据方面的局限性,从而无法全面捕捉人类生命周期中脑形态学与白质组织结构的协同变化。其解决方案的关键在于提出一种多模态脑龄框架,采用两阶段架构:第一阶段对各模态独立处理后通过晚期融合(late fusion)将受试者分类至六个发育阶段,第二阶段在预测的发育阶段内进行年龄估计;该设计实现了跨生命周期的统一脑成熟度评估,有效整合了宏观与微观结构信息。
链接: https://arxiv.org/abs/2604.16655
作者: Dingyi Zhang,Ruiying Liu,Yun Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The accurate quantification of brain age from MRI has emerged as an important biomarker of brain health. However, existing approaches are often restricted to narrow age ranges and single-modality MRI data, limiting their capacity to capture the coordinated macro- and microstructural changes that unfold across the human lifespan. To address these limitations, we developed a multi-modal brain age framework to characterize the integrated evolution of brain morphology and white matter organization. Our model adopts a two-stage architecture, where modalities are processed independently and integrated via late fusion in both stages: first to classify each subject into one of six developmental stages, and then to estimate age within the predicted stage. This design enables a unified and lifespan-spanning assessment of brain maturity across diverse developmental periods.
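“先分期、再期内回归”的两阶段路由可如下示意;分期器与各期回归器以任意可调用对象代替,两阶段各自的多模态晚期融合从略:

```python
def two_stage_age(features, stage_classifier, stage_regressors):
    # stage 1: route the subject to one of the developmental stages;
    # stage 2: the stage-specific regressor estimates age within it
    stage = stage_classifier(features)
    return stage, stage_regressors[stage](features)
```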
[CV-302] Deep Hierarchical Knowledge Loss for Fault Intensity Diagnosis KDD2026 KDD
【速读】:该论文旨在解决故障强度诊断(Fault Intensity Diagnosis, FID)中因忽略目标类别间依赖关系而导致实际部署受限的问题。其核心解决方案是提出一种基于深度层次知识损失(Deep Hierarchical Knowledge Loss, DHK)的通用框架,通过构建层次化树结构损失(Hierarchical Tree Loss)实现类别间的层次一致性表示与预测;关键创新在于引入基于树结构的正负样本约束、自适应权重机制以及分组树三元组损失(Group Tree Triplet Loss),结合层次动态边界距离建模,显著提升了对细微故障的识别能力。
链接: https://arxiv.org/abs/2604.16459
作者: Yu Sha,Shuiping Gou,Bo Liu,Haofan Lu,Ningtao Liu,Jiahui Fu,Horst Stoecker,Domagoj Vnucec,Nadine Wetzstein,Andreas Widl,Kai Zhou
机构: The Chinese University of Hong Kong, Shenzhen; Xidian University; Luoyang Institute of Science and Technology; Frankfurt Institute for Advanced Studies; Goethe Universität Frankfurt; GSI Helmholtzzentrum für Schwerionenforschung GmbH; SAMSON AG
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
备注: The paper has been accepted by Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD 2026)
Abstract:Fault intensity diagnosis (FID) plays a pivotal role in intelligent manufacturing while neglecting dependencies among target classes hinders its practical deployment. This paper introduces a novel and general framework with deep hierarchical knowledge loss (DHK) to achieve hierarchical consistent representation and prediction. We develop a novel hierarchical tree loss to enable a holistic mapping of same-attribute classes, leveraging tree-based positive and negative hierarchical knowledge constraints. We further design a focal hierarchical tree loss to enhance its extensibility and devise two adaptive weighting schemes based on tree height. In addition, we propose a group tree triplet loss with hierarchical dynamic margin by incorporating hierarchical group concepts and tree distance to model boundary structural knowledge across classes. Jointly, the two losses significantly improve the recognition of subtle faults. Extensive experiments are performed on four real-world datasets from various industrial domains (three cavitation datasets from SAMSON AG and one publicly available dataset) for FID, all showing superior results and outperforming recent state-of-the-art FID methods.
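层次树损失依赖的类间树距离(两叶节点经最近公共祖先的边数)可如下计算;类用从根到叶的标签路径表示(示意实现):

```python
def tree_distance(path_a, path_b):
    # number of edges between two leaves via their lowest common ancestor;
    # each class is given as its root-to-leaf label path
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)
```

同属一个故障类型、仅强度不同的类距离更小,因而可作为三元组损失的层次动态边界。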
[CV-303] SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment
【速读】:该论文旨在解决神经退行性疾病(如肌萎缩侧索硬化症,ALS)早期诊断与进展监测中缺乏客观、非侵入性生物标志物的问题,尤其是利用语音信号作为潜在的量化指标来实现自动化识别和预测。其解决方案的关键在于构建一个由临床专家标注的高质量验证数据集,并基于此发起“神经退行性疾病语音分析”(Speech Analysis for Neurodegenerative Diseases, SAND)挑战赛,从而推动生成式AI等先进机器学习技术在语音特征提取与模式识别中的应用,提升ALS早期检测的准确性与可扩展性。
链接: https://arxiv.org/abs/2604.16445
作者: Giovanna Sannino,Ivanoe De Falco,Nadia Brancati,Laura Verde,Maria Frucci,Daniel Riccio,Vincenzo Bevilacqua,Antonio Di Marino,Lucia Aruta,Valentina Virginia Iuzzolino,Gianmaria Senerchia,Myriam Spisto,Raffaele Dubbioso
机构: National Research Council of Italy (CNR), Institute for High-Performance Computing and Networking (ICAR); University of Campania “Luigi Vanvitelli”; University of Naples “Federico II”; University of Naples “Federico II”; University of Naples “Federico II”; University of Campania “Luigi Vanvitelli”
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Recent advances in Artificial Intelligence (AI) and the exploration of noninvasive, objective biomarkers, such as speech signals, have encouraged the development of algorithms to support the early diagnosis of neurodegenerative diseases, including Amyotrophic Lateral Sclerosis (ALS). Voice changes in subjects suffering from ALS typically manifest as progressive dysarthria, which is a prominent neurodegenerative symptom because it affects patients as the disease progresses. Since voice signals are complex data, the development and use of advanced AI techniques are fundamental to extracting distinctive patterns from them. Validating AI algorithms for ALS diagnosis and monitoring using voice signals is challenging, particularly due to the lack of annotated reference datasets. In this work, we present the outcome of a collaboration between a multidisciplinary team of clinicians and Machine Learning experts to create both a clinically annotated validation dataset and the “Speech Analysis for Neurodegenerative Diseases” (SAND) challenge based on it. Specifically, by analyzing voice disorders, the SAND challenge provides an opportunity to develop, test, and evaluate AI models for the automatic early identification and prediction of ALS disease progression.
人工智能
[AI-0] Bounded Ratio Reinforcement Learning
【速读】:该论文旨在解决近端策略优化(Proximal Policy Optimization, PPO)算法中存在理论基础与实际应用之间不一致的问题,即PPO所采用的启发式裁剪目标函数与其源自信任区域方法(Trust Region Methods)的理论根基之间缺乏严谨联系。解决方案的关键在于提出一种全新的有界比率强化学习(Bounded Ratio Reinforcement Learning, BRRL)框架,该框架通过构建一个带正则化和约束的策略优化问题,并推导出其解析最优解,从而确保策略在迭代过程中实现单调性能提升。进一步地,为处理参数化策略类,作者设计了有界策略优化(Bounded Policy Optimization, BPO)算法,该算法最小化优势加权的策略与BRRL最优解之间的散度,并建立了基于BPO损失函数的策略期望性能下界。这一理论框架不仅解释了PPO的成功机制,还连接了信任区域策略优化与交叉熵法(Cross-Entropy Method, CEM),并在多个复杂环境(如MuJoCo、Atari及IsaacLab)和大语言模型(Large Language Models, LLMs)微调任务中验证了BPO及其扩展版本Group-relative BPO(GBPO)在稳定性和最终性能上优于PPO和GRPO。
链接: https://arxiv.org/abs/2604.18578
作者: Yunke Ao,Le Chen,Bruce D. Lee,Assefa S. Wahd,Aline Czarnobai,Philipp Fürnstahl,Bernhard Schölkopf,Andreas Krause
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 9 figures
Abstract:Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
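作为参照,PPO 的启发式裁剪目标可写成如下单样本形式,BRRL 正是为这一启发式提供了理论解释;ε=0.2 为常用默认值:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # PPO's clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```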
[AI-1] Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
【Quick Read】: This paper targets binary forecasting, where model performance is limited by how evidence is processed, by suboptimal aggregation strategies, and by insufficient calibration. The proposed agentic system, BLF (Bayesian Linguistic Forecaster), rests on three key ideas: (1) a Bayesian linguistic belief state, a semi-structured representation combining numerical probability estimates with natural-language evidence summaries that the LLM updates at each step of an iterative tool-use loop, avoiding the context bloat caused by appending all retrieved evidence; (2) hierarchical multi-trial aggregation, which runs K independent trials and combines them via logit-space shrinkage with a data-dependent prior; and (3) hierarchical calibration, Platt scaling with a hierarchical prior that avoids over-shrinking extreme predictions from sources with skewed base rates. Experiments show these designs jointly let BLF outperform several leading methods on the ForecastBench benchmark, with ablations confirming each component's significant contribution.
Link: https://arxiv.org/abs/2604.18576
Authors: Kevin Murphy
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running K independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust back-testing framework with a leakage rate below 1.5%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.
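Ideas (2) and (3) above, logit-space shrinkage aggregation and Platt scaling, can be sketched in a few lines. This is an illustrative reconstruction rather than BLF's actual code; the `shrink` weight, prior, and Platt parameters are placeholders:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate_trials(probs, prior_p=0.5, shrink=0.3):
    """Average K trial probabilities in logit space, shrunk toward a prior."""
    mean_logit = sum(logit(p) for p in probs) / len(probs)
    pooled = (1.0 - shrink) * mean_logit + shrink * logit(prior_p)
    return sigmoid(pooled)

def platt(p, a=1.0, b=0.0):
    """Platt scaling: a learned linear map applied in logit space."""
    return sigmoid(a * logit(p) + b)
```

With `shrink=0`, this reduces to plain logit-space averaging; BLF's hierarchical versions additionally learn the prior and calibration parameters per source.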
[AI-2] When Can LLMs Learn to Reason with Weak Supervision?
【Quick Read】: This paper investigates, within the Reinforcement Learning with Verifiable Rewards (RLVR) framework, whether RLVR can still generalize under weak supervision (scarce data, noisy rewards, and self-supervised proxy rewards) as building high-quality reward signals becomes harder for increasingly capable models. The key finding is that reward saturation dynamics during training govern generalization: models that sustain a prolonged pre-saturation phase, in which training reward and downstream performance rise together, generalize, while models that saturate rapidly memorize rather than learn. The authors further identify reasoning faithfulness, the degree to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, whereas output diversity is uninformative. Based on this, they show that supervised fine-tuning (SFT) on explicit reasoning traces is necessary for generalization under weak supervision, with continual pre-training on domain data amplifying the effect; combined, these interventions enable Llama3.2-3B-Base to generalize in all three weak-supervision settings.
Link: https://arxiv.org/abs/2604.18574
Authors: Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, Pavel Izmailov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
[AI-3] Symbolic Synthesis for LTLf Obligations
【Quick Read】: This paper addresses automated synthesis for obligation properties over infinite traces, the second level of Manna and Pnueli's temporal hierarchy, formed as positive Boolean combinations of safety and guarantee (co-safety) properties. Classical LTL over infinite traces suffers from high complexity and inefficient algorithms. The key to the solution is translating obligation properties in LTLfp (the extension of LTLf to infinite traces) directly into symbolically represented Deterministic Weak Automata (DWA), constructed from the symbolic Deterministic Finite Automata (DFA) of the underlying LTLf properties. DWA inherit many of the DFA's attractive algorithmic features, such as Boolean closure and polynomial-time minimization, and synthesis based on them becomes solvable in linear time once the DWA is constructed, making synthesis for obligation properties nearly as effective as LTLf synthesis.
Link: https://arxiv.org/abs/2604.18532
Authors: Giuseppe De Giacomo, Christian Hagemeier, Daniel Hausmann, Nir Piterman
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments:
Abstract:We study synthesis for obligation properties expressed in LTLfp, the extension of LTLf to infinite traces. Obligation properties are positive Boolean combinations of safety and guarantee (co-safety) properties and form the second level of the temporal hierarchy of Manna and Pnueli. Although obligation properties are expressed over infinite traces, they retain most of the simplicity of LTLf. In particular, we show that they admit a translation into symbolically represented deterministic weak automata (DWA) obtained directly from the symbolic deterministic finite automata (DFA) for the underlying LTLf properties on trace prefixes. DWA inherit many of the attractive algorithmic features of DFA, including Boolean closure and polynomial-time minimization. Moreover, we show that synthesis for LTLfp obligation properties is theoretically highly efficient - solvable in linear time once the DWA is constructed. We investigate several symbolic algorithms for solving DWA games that arise in the synthesis of obligation properties and evaluate their effectiveness experimentally. Overall, the results indicate that synthesis for LTLfp obligation properties can be performed with virtually the same effectiveness as LTLf synthesis.
[AI-4] OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
【Quick Read】: This paper addresses the difficulty Large Language Models (LLMs) face in reinforcement learning when exploring novel trajectories beyond their initial latent space, a limitation that constrains performance on complex reasoning tasks. The key to the OGER framework is unifying offline teacher guidance with online reinforcement learning through a specialized reward-modeling lens: it employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Experiments show significant gains over competitive baselines on mathematical and general reasoning benchmarks, with robust generalization to out-of-domain tasks.
Link: https://arxiv.org/abs/2604.18530
Authors: Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model’s inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model’s own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER significantly outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at this https URL.
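The entropy-aware auxiliary reward described above can be sketched abstractly. The combination rule and the weight `lam` below are illustrative assumptions, not OGER's exact formulation:

```python
import math

def token_entropy(probs):
    """Shannon entropy of one token's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def exploration_reward(correct, step_token_probs, lam=0.1):
    """Verifiable reward plus an entropy-based exploration bonus:
    a binary correctness term, augmented by the mean per-step entropy."""
    base = 1.0 if correct else 0.0
    n = max(len(step_token_probs), 1)
    bonus = lam * sum(token_entropy(p) for p in step_token_probs) / n
    return base + bonus
```

The bonus keeps a nonzero gradient signal even on incorrect rollouts, which is the usual motivation for entropy-driven exploration terms.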
[AI-5] IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
【Quick Read】: This paper addresses the lack of standardized benchmark datasets for infectious disease outbreak forecasting and the limited understanding of model performance on novel outbreaks with little historical data. The key to the solution is the IDOBE dataset, a curated collection of epidemiological time series covering over 10,000 outbreaks (13 diseases, with outcomes such as cases and hospitalizations) drawn from U.S. states and global locations over more than a century. Standardized outbreak segments are generated via derivative-based segmentation, and information-theoretic and distributional measures quantify the dataset's epidemiological diversity. Multi-horizon (1- to 4-week-ahead) short-term forecasting experiments with 11 baseline models are evaluated using point metrics (NMSE, MAPE) and probabilistic scoring rules such as NWIS. Results show that MLP-based methods are the most robust, while statistical models hold a slight edge in the pre-peak phase, providing a reproducible benchmark for standardized evaluation of outbreak forecasting methods.
Link: https://arxiv.org/abs/2604.18521
Authors: Aniruddha Adiga, Jingyuan Chou, Anshul Chiranth, Bryan Lewis, Ana I. Bento, Shaun Truelove, Geoffrey Fox, Madhav Marathe, Harry Hochheiser, Srini Venkatramanan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Populations and Evolution (q-bio.PE)
Comments: 11 pages, 6 figures
Abstract:Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real-time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding on performance of these methods for novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles from multiple data repositories spanning over a century of surveillance and across U.S. states and global locations. We perform derivative-based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information-theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi-horizon short-term forecasting (1- to 4-week-ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as Normalized Weighted Interval Score (NWIS) to quantify the performance. We find that MLP-based methods have the most robust performance, with statistical methods having a slight edge during the pre-peak phase. IDOBE dataset along with baselines are released publicly on this https URL to enable standardized, reproducible benchmarking of outbreak forecasting methods.
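Among the probabilistic metrics mentioned, the weighted interval score builds on the standard interval score. The sketch below implements the unnormalized WIS with the common alpha/2 weighting convention; NWIS additionally normalizes, which is omitted here:

```python
def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) prediction interval:
    width plus a penalty of 2/alpha per unit the observation falls outside."""
    score = upper - lower
    if y < lower:
        score += (2.0 / alpha) * (lower - y)
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)
    return score

def weighted_interval_score(median, intervals, y):
    """WIS over K intervals given as (alpha, lower, upper) tuples:
    (1/(K + 1/2)) * (|y - median|/2 + sum of (alpha/2) * interval scores)."""
    k = len(intervals)
    total = 0.5 * abs(y - median)
    for alpha, lo, hi in intervals:
        total += (alpha / 2.0) * interval_score(lo, hi, y, alpha)
    return total / (k + 0.5)
```

When the observation falls inside every interval, the WIS reduces to a weighted average of interval widths, rewarding sharp yet calibrated forecasts.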
[AI-6] LLM Safety From Within: Detecting Harmful Content with Internal Representations
【Quick Read】: This paper addresses a limitation of current guard models for Generative AI safety: state-of-the-art guard models rely solely on terminal-layer LLM representations, overlooking safety-relevant features distributed across internal layers. The key to the solution, SIREN, is a lightweight guard model that identifies safety-sensitive "safety neurons" via linear probing and fuses these internal features with an adaptive layer-weighted strategy, building a harmful-content detector without modifying the underlying LLM. This substantially improves detection accuracy and generalization while sharply reducing trainable parameters and inference latency.
Link: https://arxiv.org/abs/2604.18519
Authors: Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang, Linfeng Du, Haolun Wu, Ashton Anderson
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 17 pages, 10 figures, 6 tables
Abstract:Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
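The adaptive layer-weighting idea, combining per-layer linear-probe scores with learned weights, might look like the following sketch. The function name and the softmax weighting are assumptions for illustration, not SIREN's published architecture:

```python
import numpy as np

def layer_weighted_score(layer_features, probes, layer_logits):
    """Combine per-layer linear-probe scores using softmax layer weights.
    layer_features: one feature vector per layer; probes: one weight vector
    per layer; layer_logits: learnable per-layer importance scores."""
    w = np.exp(layer_logits - layer_logits.max())  # stable softmax
    w /= w.sum()
    scores = np.array([feat @ probe for feat, probe in zip(layer_features, probes)])
    return float(w @ scores)
```

Because the probes and layer weights are linear on frozen activations, such a detector adds only a tiny trainable head on top of the base LLM, consistent with the 250x parameter reduction the abstract reports.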
[AI-7] Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD
【Quick Read】: This paper tackles the high cost of Computational Fluid Dynamics (CFD) in race-car aerodynamic design, which limits design-space exploration, particularly the inability to model the complex geometries and highly loaded components of industrial motorsport. The key to the solution is a new graph-based neural-operator surrogate, the Gauge-Invariant Spectral Transformer (GIST), whose spectral embeddings encode mesh connectivity to enhance predictions on tightly packed, complex geometries, while guaranteeing discretization invariance and scaling linearly with mesh size. GIST achieves state-of-the-art accuracy on both public benchmarks and a newly built high-fidelity RANS dataset of an LMP2-class race car, providing a first validation of surrogate-based interactive design-space exploration in industrial motorsport workflows.
Link: https://arxiv.org/abs/2604.18491
Authors: Nicholas Thumiger, Andrea Bartezzaghi, Mattia Rigotti, Cezary Skura, Thomas Frick, Elisa Serioli, Fabrizio Arbucci, A. Cristiano I. Malossi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures
Abstract:Computational Fluid Dynamics (CFD) is central to race-car aerodynamic development, yet its cost – tens of thousands of core-hours per high-fidelity evaluation – severely limits the design space exploration feasible within realistic budgets. AI-based surrogate models promise to alleviate this bottleneck, but progress has been constrained by the limited complexity of public datasets, which are dominated by smoothed passenger-car shapes that fail to exercise surrogates on the thin, complex, highly loaded components governing motorsport performance. This work presents three primary contributions. First, we introduce a high-fidelity RANS dataset built on a parametric LMP2-class CAD model and spanning six operating conditions (map points) covering straight-line and cornering regimes, generated and validated by aerodynamics experts at Dallara to preserve features relevant to industrial motorsport. Second, we present the Gauge-Invariant Spectral Transformer (GIST), a graph-based neural operator whose spectral embeddings encode mesh connectivity to enhance predictions on tightly packed, complex geometries. GIST guarantees discretization invariance and scales linearly with mesh size, achieving state-of-the-art accuracy on both public benchmarks and the proposed race-car dataset. Third, we demonstrate that GIST achieves a level of predictive accuracy suitable for early-stage aerodynamic design, providing a first validation of the concept of interactive design-space exploration – where engineers query a surrogate in place of the CFD solver – within industrial motorsport workflows.
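Spectral embeddings that encode mesh connectivity are, in spirit, low eigenvectors of the graph Laplacian of the mesh. The sketch below shows that basic construction on a toy graph; it is only a loose analogy to GIST's gauge-invariant formulation:

```python
import numpy as np

def spectral_embedding(adj, k):
    """First k non-trivial Laplacian eigenvectors as node embeddings.
    adj: dense symmetric adjacency matrix of the mesh graph."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                       # combinatorial graph Laplacian
    vals, vecs = np.linalg.eigh(lap)      # eigenvalues in ascending order
    return vecs[:, 1:k + 1]               # skip the constant eigenvector
```

These embeddings depend only on connectivity, not on any particular node ordering or discretization density, which is one route to the discretization invariance the abstract claims.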
[AI-8] A Generalized Synthetic Control Method for Baseline Estimation in Demand Response Services
【Quick Read】: This paper addresses baseline estimation for Demand Response (DR) settlement in electricity markets, where existing machine learning methods remain limited in predictive performance and methods from causal inference and counterfactual prediction are underused. The key to the solution is a Generalized Synthetic Control Method that extends the classical, static Synthetic Control Method (SCM) into a dynamic counterfactual prediction framework by augmenting the donor representation with exogenous features, lagged treated load, and selected lagged donor signals, thereby capturing autoregressive dependence, delayed response patterns, and error-correction effects and significantly improving accuracy and robustness in limited-data settings.
Link: https://arxiv.org/abs/2604.18469
Authors: Jonas Sievers, Mardavij Roozbehani
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Baseline estimation is critical to Demand Response (DR) settlement in electricity markets, yet existing machine learning methods remain limited in predictive performance, while methodologies from causal inference and counterfactual prediction are still underutilized in this domain. We introduce a Generalized Synthetic Control Method that builds on the classical Synthetic Control Method (SCM) from econometrics. While SCM provides a powerful framework for counterfactual estimation, classical SCM remains a static estimator: it fits the treated unit as a combination of contemporaneous donor units and therefore ignores predictable temporal structure in the residual error. We develop a generalized SCM framework that transforms baseline estimation into a dynamic counterfactual prediction problem by augmenting the donor representation with exogenous features, lagged treated load, and selected lagged donor signals. This enriched representation allows the estimator to capture autoregressive dependence, delayed donor-response patterns, and error-correction effects beyond the scope of standard SCM. The framework further accommodates nonlinear predictors when linear weighting is inadequate, with the greatest benefit arising in limited-data settings. Experiments on the Ausgrid smart-meter dataset show consistent improvements over classical SCM and strong benchmark methods, with the dominant performance gains driven by dynamic augmentation.
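The dynamic augmentation can be illustrated with an ordinary least-squares stand-in: regress the treated unit on contemporaneous donors plus its own lag. The real method also adds exogenous features and supports constrained or nonlinear weighting; the helper below is a hypothetical simplification:

```python
import numpy as np

def fit_augmented_scm(donors, treated, lags=1):
    """Fit treated load on contemporaneous donor units plus `lags` of its own
    past values (an unconstrained linear stand-in for dynamic augmentation).
    donors: (T, d) array; treated: length-T array. Returns coefficients."""
    T = len(treated)
    lagged = [treated[lags - l - 1:T - l - 1] for l in range(lags)]
    X = np.column_stack([donors[lags:]] + lagged)
    y = treated[lags:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

Classical SCM corresponds to dropping the lagged columns and constraining donor weights to the simplex; the augmented design lets the estimator absorb predictable residual structure.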
[AI-9] An Integrated Deep-Learning Framework for Peptide-Protein Interaction Prediction and Target-Conditioned Peptide Generation with ConGA-PepPI and TC-PepGen
【Quick Read】: This paper addresses the inefficiency of experimentally characterizing peptide-protein interactions (PepPIs) at screening scale, and the insufficient integration of candidate prioritization, residue-level interpretation, and target-conditioned peptide generation in existing methods. The key to the solution is an integrated framework with two core modules: a partner-aware prediction and localization model (ConGA-PepPI) that uses asymmetric encoding, bidirectional cross-attention, and progressive transfer from pair prediction to binding-site localization; and a target-conditioned generative model (TC-PepGen) that preserves target information throughout autoregressive decoding via layerwise conditioning. Together they integrate interaction prediction with targeted peptide generation end to end, markedly improving screening efficiency and generation quality.
Link: https://arxiv.org/abs/2604.18467
Authors: Chupei Tang, Junxiao Kong, Moyu Tang, Di Wang, Jixiu Zhai, Ronghao Xie, Shangkun Sima, Tianchi Lu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Motivation: Peptide-protein interactions (PepPIs) are central to cellular regulation and peptide therapeutics, but experimental characterization remains too slow for large-scale screening. Existing methods usually emphasize either interaction prediction or peptide generation, leaving candidate prioritization, residue-level interpretation, and target-conditioned expansion insufficiently integrated. Results: We present an integrated framework for early-stage peptide screening that combines a partner-aware prediction and localization model (ConGA-PepPI) with a target-conditioned generative model (TC-PepGen). ConGA-PepPI uses asymmetric encoding, bidirectional cross-attention, and progressive transfer from pair prediction to binding-site localization, while TC-PepGen preserves target information throughout autoregressive decoding via layerwise conditioning. In five-fold cross-validation, ConGA-PepPI achieved 0.839 accuracy and 0.921 AUROC, with binding-site AUPR values of 0.601 on the protein side and 0.950 on the peptide side, and remained competitive on external benchmarks. Under a controlled length-conditioned benchmark, 40.39% of TC-PepGen peptides exceeded native templates in AlphaFold 3 ipTM, and unconstrained generation retained evidence of target-conditioned signal.
[AI-10] Using large language models for embodied planning introduces systematic safety risks
【Quick Read】: This paper examines the safety of Large Language Models (LLMs) used as planners in robotic systems, asking whether strong planning ability implies safe planning. The key to the solution is DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Evaluating 23 models, the study finds that even near-perfect planning (the best model fails to produce a valid plan on only 0.4% of tasks) can coexist with dangerous plans on up to 28.3% of tasks; across scale, planning ability improves substantially (0.4%-99.3%) while safety awareness stays relatively flat (38%-57%). The two capacities combine multiplicatively: larger models complete more tasks safely mainly through better planning, not better danger avoidance. This exposes a central challenge for deploying LLM planners: as planning ability saturates, improving safety awareness becomes the key frontier.
Link: https://arxiv.org/abs/2604.18463
Authors: Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu, Marco Hutter, Manling Li, Fan Shi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL
Abstract:Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
[AI-11] Six Llamas: Comparative Religious Ethics Through LoRA-Adapted Language Models
【Quick Read】: This paper asks whether different religious traditions systematically encode distinguishable patterns of ethical reasoning in fine-tuned large language models. The key to the solution is constructing six Llama-3.1-8B variants (one unmodified control and five LoRA (Low-Rank Adaptation) models fine-tuned on the sacred and theological texts of Christianity, Islam, Judaism, Hinduism, or Buddhism) and systematically comparing them with an identical battery of 17 standardized ethical prompts spanning moral dilemmas, game-theoretic scenarios, public policy questions, and moral-psychological self-assessments. Using a multi-temperature sampling design and cross-model agreement analysis, the study finds that the fine-tuned models exhibit ethical reasoning consistent with their training traditions and show temperature-sensitive, tradition-specific divergence in highly contested domains, validating differentially fine-tuned models as instruments for cultural and ethical analysis.
Link: https://arxiv.org/abs/2604.18404
Authors: Chad Coleman, W. Russell Neuman, Manan Shah, Ali Dasdan, Matthew Crispi, Morris Chiang, Zack Leitman, Mustafa Poonawala
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 51 pages, 14 figures. We present Six Llamas, a comparative study examining whether Llama-3.1-8B models fine-tuned on distinct religious corpora encode systematically different patterns of ethical reasoning. Five LoRA-adapted variants are constructed for Christianity, Islam, Judaism, Hinduism, and Buddhism. For theoretical background on the condensate comparative method, see arXiv:2603.07329
Abstract:We present Six Llamas, a comparative study examining whether large language models fine-tuned on distinct religious corpora encode systematically different patterns of ethical reasoning. Six variants of Meta-Llama-3.1-8B are constructed: one unmodified control and five LoRA-adapted models trained exclusively on the sacred and theological texts of Christianity, Islam, Judaism, Hinduism, or Buddhism. All six models are probed with an identical battery of 17 standardized ethical prompts spanning moral dilemmas, game-theoretic scenarios, public policy questions, and moral-psychological self-assessments. To assess robustness and reproducibility, we implement a multi-temperature sampling design spanning ten temperature settings. We compute response consistency metrics, pairwise inter-model agreement rates, temperature sensitivity coefficients across four prompt domains, and run-to-run stability analyses. Findings show that LoRA-adapted models produce ethical reasoning patterns that are (a) systematically differentiated from the base model, (b) consistent with the moral logics of their training traditions, (c) structured along interpretable dimensions in moral-philosophical space, (d) stable in core ethical positions across temperature variations for high-consensus dilemmas. The Trolley Problem achieves 100% consistency across all models and temperatures, while (e) tradition-specific divergence intensifies at higher temperatures in morally contested domains, and (f) the base model exhibits the highest overall response consistency (mean 88.3%), suggesting LoRA adaptation introduces both tradition-specific signal and increased sampling sensitivity. The study offers a proof-of-concept for the condensate comparative method using differentially trained language models as instruments for cultural and ethical analysis and identifies specific criteria for falsification and planned extensions.
[AI-12] Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus
【Quick Read】: This paper addresses the unclear role of self-distillation in the learning dynamics of self-supervised learning, where high-performing methods rely on complex components (projectors, predictors, and pretext tasks) whose design choices are largely empirical and poorly understood. The key to the solution is a minimal experimental setup: training a group of randomly initialized networks purely via self-distillation while removing all common components (projectors, predictors, and even pretext tasks), thereby isolating the effect of self-distillation itself on representation learning. The study shows that even this minimal configuration learns representations with non-trivial improvements over a random baseline on downstream tasks, revealing that self-distillation has standalone learning capacity without external supervision signals.
Link: https://arxiv.org/abs/2604.18390
Authors: Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 10 figures. To be published in ChileCON 2025 proceedings
Abstract:In self-supervised learning, self-distilled methods have shown impressive performance, learning representations useful for downstream tasks and even displaying emergent properties. However, state-of-the-art methods usually rely on ensembles of complex mechanisms, with many design choices that are empirically motivated and not well understood. In this work, we explore the role of self-distillation within learning dynamics. Specifically, we isolate the effect of self-distillation by training a group of randomly initialized networks, removing all other common components such as projectors, predictors, and even pretext tasks. Our findings show that even this minimal setup can lead to learned representations with non-trivial improvements over a random baseline on downstream tasks. We also demonstrate how this effect varies with different hyperparameters and present a short analysis of what is being learned by the models under this setup.
[AI-13] Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
【Quick Read】: This paper studies how to effectively fine-tune Small Language Models (SLMs) via Reinforcement Learning with Verifiable Rewards (RLVR) under low-data conditions. Mainstream fine-tuning of Large Language Models (LLMs) relies on large quantities of high-quality annotated data or questions with well-defined ground-truth answers, both of which are scarce in many real-world settings. The key to the solution is using procedurally generated datasets to systematically study SLM performance in low-data regimes: models trained on low-complexity tasks are found to generalize to higher-complexity ones, and mixed-complexity datasets yield up to 5x sample efficiency over training on easy tasks alone, providing empirical grounding and practical guidance for efficient data development and data scaling laws in RLVR.
Link: https://arxiv.org/abs/2604.18381
Authors: Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma, Armin Parchami, Frederic Sala, Paroma Varma
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.
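A procedural generator with controllable complexity, in the spirit of the paper's counting tasks, can be as simple as the following sketch. The task format and the linear complexity scaling are illustrative assumptions, not the paper's exact generators:

```python
import random

def make_counting_task(complexity, rng):
    """Procedurally generate a letter-counting question whose ground-truth
    answer is verifiable by construction; text length scales with complexity."""
    length = 5 * complexity
    target = rng.choice("abc")
    text = "".join(rng.choice("abcxyz") for _ in range(length))
    return {
        "question": f"How many '{target}' in '{text}'?",
        "answer": text.count(target),
        "complexity": complexity,
    }
```

Because every instance carries its own verifiable answer, such generators give RLVR a reward signal with size, diversity, and complexity as free parameters, which is what enables the controlled scaling study above.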
[AI-14] The implicated scientist: on the role of AI researchers in the development of weapons systems (ICLR 2026)
【Quick Read】: This paper examines the role and ethical responsibility of AI researchers in the violence and injustice enabled by AI-powered weapons systems. An AI arms race among powerful players is accelerating the proliferation of such weapons, heightening the risk of mass-casualty conflict and potentially widening inequalities of power and wealth. The paper argues that AI researchers are not passive bystanders but directly implicated subjects. The key to the solution is transfiguring this implication into a "differentiated, long-distance solidarity": recognizing how one's technical practice affects victims, pursuing responsible research paths, and building ethical bonds with those harmed by technologically fortified injustices.
Link: https://arxiv.org/abs/2604.18380
Authors: Alexandra Volokhova, Alex Hernandez-Garcia
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Presented as an oral talk and a poster at the AI for Peace workshop at ICLR 2026
Abstract:Artificial intelligence (AI) technologies are increasingly used in modern weapons systems. Notably, these systems have recently been involved in mass killings and destruction at scale. Furthermore, there is currently a strong interest and competition among powerful players to accelerate the proliferation of weapons with automated or AI-based components, a phenomenon known as AI arms race. This competition poses a risk of causing even more deaths and devastation in the future, as well as increased power and wealth inequality. In this work, we aim to shed light on the role of AI researchers as implicated subjects in the harms caused by weapons enabled by AI technologies. We investigate and discuss the specifics of this implication and explore ways to transfigure this position of implication into one of differentiated, long-distance solidarity with the victims of technologically fortified injustices.
[AI-15] Tight Auditing of Differential Privacy in MST and AIM
【Quick Read】: This paper addresses the difficulty of tightly auditing the privacy guarantees of state-of-the-art differentially private (DP) synthetic data generators such as MST and AIM: existing methods cannot provide tight assessments in the strong-privacy regime, leaving a gap between theoretical guarantees and practical behavior. The key to the solution is a Gaussian Differential Privacy (GDP)-based auditing framework that measures privacy leakage via the full false-positive/false-negative tradeoff, yielding the first tight worst-case audits of MST and AIM. Under (ε, δ) = (1, 10⁻²), the empirically measured μ_emp ≈ 0.43 closely matches the implied μ = 0.45, validating the method's effectiveness and accuracy.
Link: https://arxiv.org/abs/2604.18352
Authors: Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Bogdan Kulynych
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the Theory and Practice of Differential Privacy Workshop (TPDP 2026)
Abstract:State-of-the-art Differentially Private (DP) synthetic data generators such as MST and AIM are widely used, yet tightly auditing their privacy guarantees remains challenging. We introduce a Gaussian Differential Privacy (GDP)-based auditing framework that measures privacy via the full false-positive/false-negative tradeoff. Applied to MST and AIM under worst-case settings, our method provides the first tight audits in the strong-privacy regime. For (ε, δ) = (1, 10⁻²), we obtain μ_emp ≈ 0.43 vs. an implied μ = 0.45, showing a small theory-practice gap. Our code is publicly available: this https URL.
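The GDP accounting underlying the comparison of empirical and implied μ uses the standard conversion from μ-GDP to an (ε, δ) guarantee due to Dong, Roth, and Su. The sketch below transcribes that conversion directly; it is independent of the paper's audit code:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gdp_delta(mu, eps):
    """delta(eps) guaranteed by a mu-GDP mechanism:
    Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2)."""
    return normal_cdf(-eps / mu + mu / 2.0) - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0)
```

For a fixed μ, δ(ε) decreases monotonically in ε, so a single μ summarizes the whole privacy curve rather than one (ε, δ) point, which is what makes tight auditing via the full tradeoff possible.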
[AI-16] One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction
【Quick Read】: This paper addresses a limitation of Triple Set Prediction (TSP) in Knowledge Graph Completion (KGC): existing methods predict missing triples one at a time and fail to capture dependencies among predicted triples needed to ensure consistency. The key to the solution is DiffTSP, a novel discrete diffusion model that treats TSP as a generative task: a discrete diffusion process adds noise to the knowledge graph by masking relational edges, and the reverse process gradually recovers the complete KG conditioned on the incomplete graph. A structure-aware denoising network combining a relational context encoder with a relational graph diffusion transformer generates a consistent, complete set of triples in a single pass, significantly improving prediction quality and consistency.
Link: https://arxiv.org/abs/2604.18344
Authors: Jihong Guan, Jiaqi Wang, Wengen Li, Hanchen Yang, Yichao Zhang, Shuigeng Zhou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Knowledge Graphs (KGs) are composed of triples, and the goal of Knowledge Graph Completion (KGC) is to infer the missing factual triples. Traditional KGC tasks predict missing elements in a triple given one or two of its elements. As a more realistic task, the Triple Set Prediction (TSP) task aims to infer the set of missing triples conditioned only on the observed knowledge graph, without assuming any partial information about the missing triples. Existing TSP methods predict the set of missing triples in a triple-by-triple manner, falling short in capturing the dependencies among the predicted triples to ensure consistency. To address this issue, we propose a novel discrete diffusion model termed DiffTSP that treats TSP as a generative task. DiffTSP progressively adds noise to the KG through a discrete diffusion process, achieved by masking relational edges. The reverse process then gradually recovers the complete KG conditioned on the incomplete graph. To this end, we design a structure-aware denoising network that integrates a relational context encoder with a relational graph diffusion transformer for knowledge graph generation. DiffTSP can generate the complete set of triples in a one-pass manner while ensuring the dependencies among the predicted triples. Our approach achieves state-of-the-art performance on three public datasets. Code: this https URL.
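The forward/reverse structure described above, masking relation edges and then denoising all of them conditioned on the partially observed graph, can be sketched as follows. The linear noise schedule and the denoiser stub are illustrative assumptions, not the paper's structure-aware network:

```python
import random

MASK = "[MASK]"

def forward_mask(triples, t, T, rng):
    """Forward discrete diffusion: independently mask each relation
    with probability t/T (an assumed, illustrative noise schedule)."""
    return [(h, MASK if rng.random() < t / T else r, tail)
            for h, r, tail in triples]

def reverse_step(noisy, denoise_fn):
    """One reverse step: the denoiser predicts every masked relation in
    one pass, conditioned on the whole (incomplete) graph, so the
    predicted triples can depend on one another."""
    return [(h, denoise_fn(h, tail, noisy) if r is MASK else r, tail)
            for h, r, tail in noisy]
```

Predicting the full set in one conditioned pass, rather than triple-by-triple, is what lets dependencies among predicted triples be captured.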
[AI-17] Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
【速读】:该论文旨在解决生成式 AI (Generative AI) 在精神卫生领域应用中面临的隐私保护难题,尤其是在军事、监狱和偏远医疗等高敏感场景下,传统基于云端的推理管道因需将患者数据传输至外部服务器而存在不可接受的隐私与安全风险。解决方案的关键在于提出一种零出口(zero-egress)、全本地执行的移动端人工智能平台,通过在设备上部署由三个轻量化、微调并量化后的开源大语言模型(LLM)组成的联盟(Gemma、Phi-3.5-mini 和 Qwen2),结合本地调度层实现集成推理与共识驱动的诊断推理,确保患者数据全程不离开设备,从而在保障隐私的同时维持与云端版本相当的诊断准确性和实时推理延迟。
链接: https://arxiv.org/abs/2604.18302
作者: Eranga Bandara,Asanga Gunaratna,Ross Gore,Anita H. Clayton,Christopher K. Rhea,Sachini Rajapakse,Isurunima Kularathna,Sachin Shetty,Ravi Mukkamala,Xueping Liang,Preston Samuel,Atmaram Yarlagadda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare – particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution – ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs – Gemma, Phi-3.5-mini, and Qwen2 – selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.
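The abstract does not detail how the on-device orchestration layer combines the three models' outputs; as a rough illustration only, consensus-based diagnostic reasoning can be reduced to a quorum vote over per-model labels (names and the quorum of 2 are assumptions):

```python
from collections import Counter

def consortium_consensus(diagnoses, min_agreement=2):
    """Quorum vote over per-model diagnostic labels.

    Returns (label, support) when some label reaches the quorum,
    otherwise (None, 0) to signal that no consensus was reached and a
    clinician review (or abstention) is warranted.
    """
    if not diagnoses:
        return None, 0
    label, support = Counter(diagnoses).most_common(1)[0]
    return (label, support) if support >= min_agreement else (None, 0)
```

Abstaining when the models disagree is a simple clinical safeguard: an uncertain ensemble should surface disagreement rather than a forced answer.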
[AI-18] Enhancing Tabular Anomaly Detection via Pseudo-Label-Guided Generation
【速读】:该论文旨在解决表格数据中异常检测的两大挑战:一是由于真实异常标签稀缺,现有无监督方法缺乏足够的异常感知能力;二是当前基于样本生成或对比学习的方法通常全局计算异常,忽略了特征层面的局部异常模式,导致检测性能受限。解决方案的关键在于提出一种伪标签引导的异常生成方法(PLAG),其核心创新是将样本的整体异常度分解为各特征层面异常的累积,并利用伪异常作为指导信号,从而在无需大量真实标签的情况下,增强模型对细粒度局部异常信号的理解能力。此外,通过两阶段数据选择策略(格式验证与不确定性估计)严格筛选合成异常样本,确保其真实性与多样性,进而为模型提供强有力的判别性引导,显著提升异常检测效果。
链接: https://arxiv.org/abs/2604.18266
作者: Wei Huang,Yuxuan Xiong,Hezhe Qiao,Yu-Ming Shang,Xiangling Fu,Guansong Pang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures
Abstract:Identifying anomalous instances in tabular data is essential for improving data reliability and maintaining system stability. Due to the scarcity of ground-truth anomaly labels, existing methods mainly rely on unsupervised anomaly detection models, or exploit a small number of labeled anomalies to facilitate detection via sample generation or contrastive learning. However, unsupervised methods lack sufficient anomaly awareness, while current generation and contrastive approaches tend to compute anomalies globally, overlooking the localized anomaly patterns of tabular features, resulting in suboptimal detection performance. To address these limitations, we propose PLAG, a pseudo-label-guided anomaly generation method designed to enhance tabular anomaly detection. Specifically, by utilizing pseudo-anomalies as guidance signals and decoupling the overall anomaly quantification of a sample into an accumulation of feature-level abnormalities, PLAG not only effectively obviates the need for scarce ground-truth labels but also provides a novel perspective for the model to comprehend localized anomalous signals at a fine-grained level. Furthermore, a two-stage data selection strategy is proposed, integrating format verification and uncertainty estimation to rigorously filter candidate samples, thereby ensuring the fidelity and diversity of the synthetic anomalies. Ultimately, these filtered synthetic anomalies serve as robust discriminative guidance, empowering the model to better separate normal and anomalous instances. Extensive experiments demonstrate that PLAG achieves state-of-the-art performance against eight representative baselines. Moreover, as a flexible framework, it integrates seamlessly with existing unsupervised detectors, consistently boosting F1-scores by 0.08 to 0.21.
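PLAG's central move is scoring a sample as an accumulation of feature-level abnormalities rather than one global distance. The paper learns these scores under pseudo-label guidance; the per-feature z-score stand-in below only illustrates the decomposition itself:

```python
def fit_feature_stats(rows):
    """Per-feature (mean, std) from training rows assumed mostly normal.
    A zero std is clamped to 1.0 to keep the score finite."""
    stats = []
    for col in zip(*rows):
        m = sum(col) / len(col)
        s = (sum((v - m) ** 2 for v in col) / len(col)) ** 0.5
        stats.append((m, s if s > 0 else 1.0))
    return stats

def anomaly_score(row, stats):
    """Sample-level anomaly = sum of feature-level abnormalities, so a
    single strongly deviating feature is visible even when the sample
    looks globally ordinary."""
    return sum(abs(v - m) / s for v, (m, s) in zip(row, stats))
```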
[AI-19] LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL
【速读】:该论文旨在解决当前基于代码的大语言模型(Code-based Large Language Models, LLMs)在Text-to-SQL任务中对复杂逻辑语句(如多层嵌套的JOIN和条件判断)以及现实世界中噪声大、结构不良的数据库模式(schema)处理能力不足的问题。其核心解决方案是提出一种模块化适配器组合(Modular Adapter Composition, MAC)策略,通过分层训练针对不同难度级别(从简单到超难)的适配器模块,在不引发灾难性遗忘的前提下构建阶梯式学习环境,从而显著提升模型在Spider和BIRD等基准上的性能表现,并实现按需部署的灵活架构设计。
链接: https://arxiv.org/abs/2604.18254
作者: Salmane Chafik,Saad Ezzini,Ismail Berrada
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Software Engineering (cs.SE)
备注: 7 pages, 3 figures, 4 tables
Abstract:Recently, code-oriented large language models (LLMs) have demonstrated strong capabilities in translating natural language into executable code. Text-to-SQL is a significant application of this ability, enabling non-technical users to interact with relational databases using natural language. However, state-of-the-art models continue to struggle with highly complex logic, particularly deeply nested statements involving multiple joins and conditions, as well as with real-world database schemas that are noisy or poorly structured. In this paper, we investigate whether curriculum learning can improve the performance of code-based LLMs on Text-to-SQL tasks. Employing benchmarks including Spider and BIRD, we fine-tune models under different curriculum strategies. Our experiments show that naive curriculum, simply ordering training samples by complexity in a single epoch, fails to surpass standard fine-tuning due to catastrophic forgetting. To overcome this, we propose a Modular Adapter Composition (MAC) strategy. By sequentially training tier-specific adapters on incremental complexity levels (Easy to Extra-Hard), we create a scaffolded learning environment that improves performance on complex queries. Our approach not only produces measurable performance gains on the Spider and BIRD benchmarks but also provides a flexible, “Lego-like” architecture, allowing models to be composed and deployed based on specific schema difficulty requirements. These findings demonstrate that structured, modular learning is a superior alternative to monolithic fine-tuning for mastering the syntax and logic of complex code generation.
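The "Lego-like" deployment claim can be sketched as a tier-composition rule. The abstract does not say whether MAC stacks all lower-tier adapters or swaps in a single tier's adapter at inference; the stacking variant below is an assumption chosen to mirror the scaffolded curriculum:

```python
TIERS = ["easy", "medium", "hard", "extra_hard"]

def compose_adapters(difficulty: str):
    """Assumed composition rule: a schema of a given difficulty activates
    the adapters of all tiers up to and including its own, so the easy-tier
    adapter remains a shared foundation for harder queries."""
    if difficulty not in TIERS:
        raise ValueError(f"unknown tier: {difficulty}")
    return TIERS[: TIERS.index(difficulty) + 1]
```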
[AI-20] AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation ACL2026
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)驱动的智能体在复杂环境中行为验证可靠性不足的问题。现有方法依赖规则基验证器或LLM-as-a-Judge模型,其泛化能力受限于狭窄领域。论文提出Agent-as-a-Judge作为解决方案,其核心在于让智能体主动与环境及工具交互以获取可验证证据,从而实现更鲁棒的行为评估。为系统评估该方法,作者构建了AJ-Bench基准,涵盖搜索、数据系统和图形用户界面三个领域共155项任务及516条标注轨迹,全面衡量智能体在信息获取、状态验证和过程验证方面的能力。实验表明,Agent-as-a-Judge相较LLM-as-a-Judge基线表现出稳定性能提升,同时揭示了代理式验证仍面临诸多开放挑战。
链接: https://arxiv.org/abs/2604.18240
作者: Wentao Shi,Yu Wang,Yuyang Zhao,Yuxin Chen,Fuli Feng,Xueyuan Hao,Xi Su,Qi Gu,Hui Su,Xunliang Cai,Xiangnan He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2026 Findings. 43 pages total, 5 figures
Abstract:As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce a benchmark AJ-Bench to systematically evaluate Agent-as-a-Judge across three domains-search, data systems, and graphical user interfaces-comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents’ abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at this https URL.
[AI-21] Towards Disentangled Preference Optimization Dynamics Beyond Likelihood Displacement
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在偏好优化(preference optimization)过程中普遍存在的似然位移(likelihood displacement)问题,即许多基于边距的目标函数会同时抑制被选择和被拒绝的响应,导致训练动态失衡。其解决方案的关键在于提出一种激励-评分分解(incentive-score decomposition)框架,揭示了不同目标函数在局部更新方向上的一致性,仅在标量加权系数上存在差异;进而定义了一个可测试的解耦带(disentanglement band, DB)条件,用于判断训练是否能实现“压制失败者、保留成功者”的理想路径;在此基础上,设计了一种即插即用的奖励校准(reward calibration, RC)机制,通过自适应调整选中与拒绝样本的更新权重以满足DB条件,从而有效缓解似然位移现象,且无需重构基础目标函数。实证结果表明,RC能引导训练走向更解耦的动力学行为,并提升下游任务性能。
链接: https://arxiv.org/abs/2604.18239
作者: Wei Chen,Yubing Wu,Junmei Yang,Delu Zeng,Qibin Zhao,John Paisley,Min Chen,Zhou Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Preference optimization is widely used to align large language models (LLMs) with human preferences. However, many margin-based objectives suppress the chosen response along with the rejected one, a phenomenon known as likelihood displacement, and no general mechanism currently prevents this across objectives. We bridge this gap by presenting a unified \emphincentive-score decomposition of preference optimization, revealing that diverse objectives share identical local update directions and differ only in their scalar weighting coefficients. Building on this decomposition, by analyzing the dynamics of the chosen/rejected likelihoods, we identify the \emphdisentanglement band (DB), a simple, testable condition that characterizes when training can avoid likelihood displacement by realizing the preferred pathway: suppressing the loser while maintaining the winner, possibly after an initial transient. Leveraging the DB, we propose a plug-and-play \emphreward calibration (RC) that adaptively rebalances chosen versus rejected updates to satisfy the DB and mitigate likelihood displacement, without redesigning the base objective. Empirical results show that RC steers training toward more disentangled dynamics and often improves downstream performance across a range of objectives. Our code is available at this https URL. 
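The incentive-score decomposition and plug-and-play reward calibration (RC) described in this entry can be caricatured with DPO-style scalar weights. The drift-based boost rule and all constants below are illustrative assumptions, not the paper's disentanglement-band test:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_update_weights(margin, beta=0.1):
    """DPO-style scalar coefficient: the chosen (+) and rejected (-)
    log-likelihoods are pushed with the same weight w, which is how
    suppressing the loser can drag the winner down with it."""
    w = beta * sigmoid(-beta * margin)
    return w, w  # (chosen up-weight, rejected down-weight)

def calibrated_weights(margin, chosen_drift, beta=0.1, floor=0.0):
    """Illustrative RC rule: if the chosen log-likelihood has been drifting
    down during training (likelihood displacement), boost the chosen-side
    weight relative to the rejected side so the winner is maintained."""
    w_c, w_r = dpo_update_weights(margin, beta)
    if chosen_drift < floor:  # chosen log-prob decreased
        w_c *= 1.0 + min(1.0, -chosen_drift)
    return w_c, w_r
```

The point of the sketch is the asymmetry: margin-based objectives couple both sides to one scalar, while calibration re-balances the two sides per example.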
[AI-22] Semantic-based Distributed Learning for Diverse and Discriminative Representations
【速读】:该论文旨在解决大规模分布式场景下,因传统任务特定方法导致的数据样本内部结构信息丢失问题,尤其是在分类任务中,同类样本的表示趋于同质化(即“collapsed variability”),从而削弱了模型对数据内在结构的利用能力。解决方案的关键在于提出一种新颖的分布式学习框架,通过在独立同分布(i.i.d.)数据下引入对表示方差的约束来重构并解耦全局优化函数,并采用原始-对偶方法推导出简化的更新规则;对于非独立同分布(non-i.i.d.)数据,则通过聚类与虚拟节点复制策略结合块坐标下降法实现局部模型更新。该方案理论上保证了最优解具备判别性(discriminative)和多样性(diverse)特性,且在i.i.d.条件下具有收敛性保障,同时借助语义信息共享减少了对统一神经网络架构的依赖。
链接: https://arxiv.org/abs/2604.18237
作者: Zhuojun Tian,Chaouki Ben Issaid,Mehdi Bennis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In large-scale distributed scenarios, increasingly complex tasks demand more intelligent collaboration across networks, requiring the joint extraction of structural representations from data samples. However, conventional task-specific approaches often result in nonstructural embeddings, leading to collapsed variability among data samples within the same class, particularly in classification tasks. To address this issue and fully leverage the intrinsic structure of data for downstream applications, we propose a novel distributed learning framework that ensures both diverse and discriminative representations. For independent and identically distributed (i.i.d.) data, we reformulate and decouple the global optimization function by introducing constraints on representation variance. The update rules are then derived and simplified using a primal-dual approach. For non-i.i.d. data distributions, we tackle the problem by clustering and virtually replicating nodes, allowing model updates within each cluster using block coordinate descent. In both cases, the resulting optimal solutions are theoretically proven to maintain discriminative and diverse properties, with a guaranteed convergence for i.i.d. conditions. Additionally, semantic information from representations is shared among nodes, reducing the need for common neural network architectures. Finally, extensive simulations on MNIST, CIFAR-10 and CIFAR-100 confirm the effectiveness of the proposed algorithms in capturing global structural representations.
[AI-23] WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
【速读】:该论文旨在解决当前大型语言模型在网页开发能力评估中存在的局限性问题,即现有基准测试仅聚焦于静态文本条件下的代码生成与语法正确性,而忽视了视觉保真度、交互质量及代码库层面的推理能力。为此,作者提出了WebCompass——一个统一生命周期评估框架,其关键在于构建了一个涵盖文本、图像和视频三种输入模态与生成、编辑、修复三类任务的多维评测体系,共形成七种任务类别以模拟真实工程流程;同时设计了一种“人类在环”的多阶段数据采集与标注机制,并引入创新的Agent-as-a-Judge评估范式,通过在真实浏览器中自主执行生成网站、利用Model Context Protocol(MCP)探索交互行为并迭代合成针对性测试用例,从而实现对生成任务更贴近人工验收测试的自动化评估。
链接: https://arxiv.org/abs/2604.18224
作者: Xinping Lei,Xinyu Che,Junqi Xiong,Chenchen Zhang,Yukai Huang,Chenyu Zhou,Haoyang Huang,Minghao Liu,Letian Zhu,Hongyi Ye,Jinhua Hao,Ken Deng,Zizheng Zhan,Han Li,Dailin Li,Yifan Yao,Ming Sun,Zhaoxiang Zhang,Jiaheng Liu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.
[AI-24] A Control Architecture for Training-Free Memory Use
【速读】:该论文旨在解决提示注入记忆(prompt-injected memory)在推理过程中因触发时机不当而导致效果不稳定的问题,即如何在不更新模型权重的前提下,确保记忆内容仅在合适的上下文中被应用,从而提升推理性能。其核心挑战在于实现“适用性控制”(applicability control),包括判断何时触发记忆辅助的二次推理、是否信任 retrieved 内容,以及如何长期维护记忆库的有效性。解决方案的关键在于构建一套多机制协同的控制架构:通过基于不确定性的路由(uncertainty-based routing)决定记忆调用时机,利用置信度选择性接受(confidence-based selective acceptance)过滤有害干预,结合规则与示例记忆库的选择策略(bank selection across rule and exemplar memory),并通过证据驱动的治理机制(evidence-based governance)动态管理记忆库内容。实验证明,这种控制机制而非单纯增加记忆暴露量,才是提升 SVAMP 和 ASDiv 两个算术基准性能的核心因素。
链接: https://arxiv.org/abs/2604.18206
作者: Yanzhen Lu,Muchen Jiang,Zhicheng Qian,Xingyu Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Prompt-injected memory can improve reasoning without updating model weights, but it also creates a control problem: retrieved content helps only when it is applied in the right state. We study this problem in a strict training-free setting and formulate it as applicability control: when to trigger a memory-assisted second pass, when to trust it, and how to maintain the memory bank over time. Our method combines uncertainty-based routing, confidence-based selective acceptance, bank selection across rule and exemplar memory, and evidence-based governance of the memory bank over time. Under a locked training-free protocol with compute-matched controls, it improves two core arithmetic benchmarks by +7.0 points on SVAMP and +7.67 points on ASDiv over baseline. The same architecture also transfers to QA and agent benchmarks with smaller positive effects and shows the same positive direction on a second checkpoint for the main arithmetic tasks. On arithmetic, the main empirical pattern is that the control architecture, rather than raw memory exposure, drives the improvements on SVAMP and ASDiv. Mechanistically, confidence separates helpful from harmful rule-bank interventions, and under fixed retrieval the repair-versus-corrupt difference localizes to rows whose retrieved set actually contains the edited entries.
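The when-to-trigger and when-to-trust parts of the applicability-control loop reduce to two thresholds. A minimal sketch; the threshold values and the confidence-margin acceptance rule are placeholders, not the paper's calibrated settings:

```python
def route_and_accept(first_pass, second_pass_fn,
                     tau_route=0.6, tau_accept=0.05):
    """Training-free control sketch:
    - uncertainty-based routing: trigger the memory-assisted second pass
      only when first-pass confidence is below tau_route;
    - selective acceptance: keep the second answer only if it is more
      confident than the first by at least tau_accept."""
    answer, conf = first_pass
    if conf >= tau_route:
        return answer, "kept:confident"
    answer2, conf2 = second_pass_fn()  # lazily run only when routed
    if conf2 >= conf + tau_accept:
        return answer2, "accepted:second_pass"
    return answer, "kept:rejected_second_pass"
```

Passing the second pass as a callable keeps the expensive memory-assisted call lazy, matching the compute-matched framing in the abstract.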
[AI-25] Scalable Neighborhood-Based Multi-Agent Actor-Critic
【速读】:该论文旨在解决多智能体深度确定性策略梯度(MADDPG)中集中式评论家(centralized critic)在大规模场景下计算成本过高的问题。随着智能体数量增加,传统集中式评论家的输入维度线性增长,导致训练成本急剧上升。解决方案的关键在于提出MADDPG-K方法,通过限制每个智能体的评论家仅接收与其距离最近的k个邻居(基于欧氏距离度量)的信息,从而实现评论家输入规模恒定,与智能体总数无关。这种设计保留了集中式评论家的优势,同时显著降低了计算复杂度,因为其主要开销来自廉价的标量距离计算,而非昂贵的神经网络矩阵运算。
链接: https://arxiv.org/abs/2604.18190
作者: Tim Goppelsroeder,Rasmus Jensen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose MADDPG-K, a scalable extension to Multi-Agent Deep Deterministic Policy Gradient (MADDPG) that addresses the computational limitations of centralized critic approaches. Centralized critics, which condition on the observations and actions of all agents, have demonstrated significant performance gains in cooperative and competitive multi-agent settings. However, their critic networks grow linearly in input size with the number of agents, making them increasingly expensive to train at scale. MADDPG-K mitigates this by restricting each agent’s critic to the k closest agents under a chosen metric which in our case is Euclidean distance. This ensures a constant-size critic input regardless of the total agent count. We analyze the complexity of this approach, showing that the quadratic cost it retains arises from cheap scalar distance computations rather than the expensive neural network matrix multiplications that bottleneck standard MADDPG. We validate our method empirically across cooperative and adversarial environments from the Multi-Particle Environment suite, demonstrating competitive or superior performance compared to MADDPG, faster convergence in cooperative settings, and better runtime scaling as the number of agents grows. Our code is available at this https URL .
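The constant-size critic input is the core mechanism. Assuming positions, observations, and actions are plain sequences, neighbor selection and critic-input assembly reduce to (helper names are illustrative):

```python
def k_nearest_ids(positions, i, k):
    """Indices of the k agents closest to agent i under squared
    Euclidean distance (cheap scalar work, no network forward pass)."""
    d2 = [(sum((a - b) ** 2 for a, b in zip(positions[i], p)), j)
          for j, p in enumerate(positions) if j != i]
    return [j for _, j in sorted(d2)[:k]]

def critic_input(obs, acts, positions, i, k):
    """Critic input for agent i: own obs/action plus those of its k
    nearest neighbors -- size (k+1)*(obs_dim + act_dim), independent of
    the total number of agents."""
    feats = []
    for j in [i] + k_nearest_ids(positions, i, k):
        feats.extend(obs[j])
        feats.extend(acts[j])
    return feats
```

This makes concrete the complexity claim in the abstract: the quadratic term comes from the scalar distance sums, while the critic network's input width stays fixed as the agent count grows.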
[AI-26] Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLM s
【速读】:该论文旨在解决托管大语言模型(Hosted-LLM)提供商存在的“静默替换”(silent-substitution)激励问题:即服务商在宣传时声称使用更强的模型,但在实际服务普通用户时却用性能更低的替代模型,从而降低成本。此类行为会破坏模型验证机制的有效性,尤其在基于“探针后返回”(probe-after-return)的方案(如SVIP)中,由于存在并行服务侧信道,恶意提供方可将验证请求路由至广告模型,而普通用户仍接收替代表现。解决方案的关键在于提出一种“承诺-开放”(commit-open)协议:在任何开放请求前,提供商通过默克尔树(Merkle tree)对其服务输出在指定探针层上的稀疏自动编码器(sparse-autoencoder, SAE)特征迹进行承诺;验证者随机选取位置进行开放,并基于公开命名电路探针库(校准跨后端噪声)进行联合一致性z-score评分,采用固定阈值判定是否一致。该设计有效阻断了静默替换攻击,且在不同模型架构和攻击类型下均保持稳定检测能力,同时计算开销仅增加约2.1%(批量32时)。
链接: https://arxiv.org/abs/2604.18179
作者: Ziyang Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 28 pages, 13 figures, 16 tables
Abstract:Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier’s probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones – Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17 attackers spanning same-family lifts, cross-family substitutes, and rank-128 adaptive LoRA, all are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds ≈2.1% to forward-only wall-clock at batch 32.
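The SAE feature-trace sketches are paper-specific, but the commit-open scaffolding around them is a standard Merkle commitment. A minimal commit/open/verify over per-position byte sketches (odd levels duplicate the last node, one common convention; leaves stand in for the paper's sketches):

```python
import hashlib

def _h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    """Commitment: root hash over all per-position sketches."""
    level = [_h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_open(leaves, idx):
    """Opening: authentication path (sibling, is_right_child) for leaf idx."""
    path, level, i = [], [_h(x) for x in leaves], idx
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        path.append((level[i ^ 1], i % 2))
        level = [_h(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
        i //= 2
    return path

def merkle_verify(root, leaf, path):
    """Verifier recomputes the root from the opened leaf and path."""
    node = _h(leaf)
    for sib, is_right in path:
        node = _h(sib + node) if is_right else _h(node + sib)
    return node == root
```

Because the root is fixed before any opening request, the provider cannot swap traces after learning which positions the verifier will open, which is what closes the parallel-serve side-channel.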
[AI-27] QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在量子力学等科学领域中因缺乏物理约束遵循能力而导致的可靠性不足问题。其核心挑战源于科学领域训练数据稀缺以及标准对齐范式中反馈信号粗粒度的问题。解决方案的关键在于构建一个大规模、高质量的量子问答数据集 QuantumQA,该数据集采用任务自适应策略与结合确定性求解器和语义审计的混合验证协议,确保科学严谨性;同时提出验证感知奖励模型(Verification-aware Reward Model, VRM),通过自适应奖励融合(Adaptive Reward Fusion, ARF)机制动态整合来自科学执行套件(Scientific Execution Suite, SES)的确定性信号与多维语义评估,从而在强化学习框架中实现精准监督。实验表明,该方法显著优于基线模型,并且优化后的8B参数模型性能可与商用模型媲美,证明将可验证的规则反馈引入强化学习循环是一种高效的替代纯参数扩展的路径。
链接: https://arxiv.org/abs/2604.18176
作者: Songxin Qu,Tai-Ping Sun,Yun-Jie Wang,Huan-Yu Liu,Cheng Xue,Xiao-Fan Xu,Han Fang,Yang Yang,Yu-Chun Wu,Guo-Ping Guo,Zhao-Yun Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 25 pages
Abstract:Large language models (LLMs) show strong capabilities in general reasoning but typically lack reliability in scientific domains like quantum mechanics, which demand strict adherence to physical constraints. This limitation arises from the scarcity of verifiable training resources and the inadequacy of coarse feedback signals in standard alignment paradigms. To address the data challenge, we introduce QuantumQA, a large-scale dataset constructed via a task-adaptive strategy and a hybrid verification protocol that combines deterministic solvers with semantic auditing to guarantee scientific rigor. Building on this foundation, we propose the verification-aware reward model (VRM) tailored for Reinforcement Learning with Verifiable Rewards (RLVR), which employs an adaptive reward fusion (ARF) mechanism to dynamically integrate deterministic signals from a scientific execution suite (SES) with multidimensional semantic evaluations for precise supervision. Experimental results demonstrate that our method consistently outperforms baselines and general-purpose preference models. Notably, our optimized 8B model achieves performance competitive with proprietary models, validating that incorporating verifiable, rule-based feedback into the reinforcement learning loop offers a parameter-efficient alternative to pure scaling.
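The adaptive reward fusion (ARF) idea, letting a deterministic verdict dominate when the scientific execution suite (SES) can produce one and falling back to semantic evaluation otherwise, can be sketched as below. The trust weight gamma and the mean aggregation are assumptions, not values from the paper:

```python
def semantic_score(dims):
    """Mean of multidimensional semantic judgments, each in [0, 1]."""
    return sum(dims.values()) / len(dims)

def adaptive_reward_fusion(det_reward, sem_dims, det_available, gamma=0.7):
    """ARF sketch: fuse the deterministic SES signal with the semantic
    score when a verdict exists; otherwise use the semantic score alone.
    gamma is an assumed trust weight."""
    sem = semantic_score(sem_dims)
    return gamma * det_reward + (1.0 - gamma) * sem if det_available else sem
```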
[AI-28] Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients? ICLR2026
【速读】:该论文旨在解决策略梯度强化学习中因系统动态不连续性导致的一阶梯度估计偏差问题,该偏差会削弱基于可微模型的高效梯度估计方法(如1st-order gradient estimation)的效果。其关键解决方案包括:一是提出轻量级检测方法DDCG(Discontinuity Detection and Gradient Switching),通过在非光滑区域切换至0th-order估计器实现鲁棒性提升,仅需单一超参数且样本效率高;二是针对可微机器人控制任务设计IVW-H(Inverse-Variance Weighted per-step Hessian),以每步逆方差加权方式稳定梯度方差,无需显式检测不连续性即可获得优异性能。研究表明,在实际部署中,精细的方差控制往往比单纯依赖估计器切换更为重要。
链接: https://arxiv.org/abs/2604.18161
作者: Ku Onoda,Paavo Parmas,Manato Yaguchi,Yutaka Matsuo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: ICLR2026
Abstract:In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.
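The two mechanisms in the abstract, estimator switching (DDCG) and inverse-variance weighting (IVW-H), can both be sketched around the classical minimum-variance combination of two unbiased estimates. The per-step scalar form below is a simplification of the paper's per-step gradient setting:

```python
def ivw_combine(g0, var0, g1, var1):
    """Inverse-variance weighting of a 0th-order estimate (g0, var0) and
    a 1st-order estimate (g1, var1): the minimum-variance combination
    when both estimators are unbiased."""
    w0, w1 = 1.0 / var0, 1.0 / var1
    return (w0 * g0 + w1 * g1) / (w0 + w1)

def ddcg_select(g0, var0, g1, var1, nonsmooth: bool):
    """DDCG-style switching sketch: fall back to the 0th-order estimator
    in regions flagged as nonsmooth (where the 1st-order estimate may be
    biased by discontinuous dynamics); otherwise blend both."""
    return g0 if nonsmooth else ivw_combine(g0, var0, g1, var1)
```

Switching addresses bias from discontinuities; weighting addresses variance. The abstract's finding is that in practical deployments the latter often matters more.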
[AI-29] State Transfer Reveals Reuse in Controlled Routing
【速读】:该论文旨在解决生成式 AI (Generative AI) 中提示(prompt)干预如何有效改变模型行为的问题,特别是识别出行为相关状态在模型内部的表征位置。传统方法仅依赖训练后的成功表现难以确定行为变化是否源于特定接口(interface)的复用,而非单纯重新学习。解决方案的关键在于设计一套受控路由任务(controlled routing tasks),结合支持数据选择的接口、保留测试集评估以及匹配的必要性、充分性和错误接口控制实验,从而区分固定接口复用与可训练提示位置迁移的本质差异。研究发现,在GPT-2等模型中,早期接口支持强鲁棒性转移,而可训练提示虽能重学相同行为,但需额外支持样本和优化;这表明固定接口转移比单纯提示训练成功更能证明模型内部状态的实际复用。
链接: https://arxiv.org/abs/2604.18158
作者: Yanzhen Lu,Zhicheng Qian,Muchen Jiang,Xingyu Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Prompt-based interventions can change model behavior, but trained success alone does not identify where the behaviorally relevant state is represented. We study this question in controlled routing tasks using interfaces chosen on support data, held-out query evaluation, and matched necessity, sufficiency, and wrong-interface controls. On GPT-2 triop, an early interface supports exact transfer under these tests. On GPT-2 add/sub, zero-retrain compiled transfer at the fixed interface recovers most of donor routing accuracy, while trainable prompt slots can relearn the same behavior at several other positions only after additional support examples and optimization. These results distinguish fixed-interface reuse from prompt relocation in a setting where the two can be tested directly. Qwen routing provides a cross-architecture consistency check for the same matched-interface pattern at the operator token, although donor-specific identity on the local V-path remains unresolved. Generation and reasoning branches are used to map scope: they show broader transport or weaker controller identifiability once control depends on longer trajectories or harder selection. In controlled routing, fixed-interface transfer is therefore stronger evidence of reuse than trained prompt success alone.
[AI-30] AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization HPCA2026
【速读】:该论文旨在解决基于Transformer的大型语言模型(Large Language Models, LLMs)在Processing-in-Memory (PIM)架构中因激活内存占用过大而导致的内存瓶颈问题,尤其是在长上下文场景下生成的KV缓存(Key-Value Cache)规模常超出PIM有限内存容量的问题。现有PIM方案与量化方法难以有效利用激活数据的特性,且稀疏注意力等技术可能破坏PIM对数据局部性的依赖。其解决方案的关键在于提出AQPIM——一种面向PIM的激活量化框架,采用基于乘积量化(Product Quantization, PQ)的聚类式向量量化方法,在内存内部直接完成量化操作,充分利用PIM高带宽特性并支持压缩数据上的原位计算,从而显著降低内存占用和注意力计算开销;同时通过多项算法优化缓解PQ精度损失问题,实测表明该方案可减少高达90–98.5%的GPU-CPU通信延迟,并相较当前最优PIM方法实现3.4倍加速。
链接: https://arxiv.org/abs/2604.18137
作者: Kosuke Matsushima,Yasuyuki Okoshi,Masato Motomura,Daichi Fujiki
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to HPCA 2026
Abstract:Processing-in-Memory (PIM) architectures offer a promising solution to the memory bottlenecks in data-intensive machine learning, yet often overlook the growing challenge of activation memory footprint. Conventional PIM approaches struggle with massive KV cache sizes generated in long-context scenarios by Transformer-based models, frequently exceeding PIM’s limited memory capacity, while techniques like sparse attention can conflict with PIM’s need for data locality. Existing PIM approaches and quantization methods are often insufficient or poorly suited for leveraging the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to enhance bandwidth and compute efficiency. We explore clustering-based vector quantization approaches, which align well with activation characteristics and PIM’s internal bandwidth capabilities. Building on this, we introduce AQPIM, a novel PIM-aware activation quantization framework based on Product Quantization (PQ), optimizing it for modern Large Language Models (LLMs). By performing quantization directly within memory, AQPIM leverages PIM’s high internal bandwidth and enables direct computation on compressed data, significantly reducing both memory footprint and computational overhead for attention computation. AQPIM addresses PQ’s accuracy challenges by introducing several algorithmic optimizations. Evaluations demonstrate that AQPIM achieves significant performance improvements, drastically reducing the GPU-CPU communication that can account for 90–98.5% of decoding latency, together with a 3.4× speedup over a SOTA PIM approach.
相关DOI: https://doi.org/10.1109/HPCA68181.2026.11408452
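摘要所述的乘积量化(PQ)思路可以用如下纯 NumPy 玩具代码勾勒:把向量切成 M 个子向量,每个子空间独立训练一个小码本,编码后每个向量只需 M 字节(维度、码本大小等均为演示用假设,并非 AQPIM 的实际内存内实现):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 64, 8, 16              # 向量维度 / 子空间数 / 每个子码本的质心数
sub = D // M
X = rng.standard_normal((1000, D)).astype(np.float32)

# 训练:各子空间独立做几轮 Lloyd 迭代得到子码本 (M, K, sub)
codebooks = X[:K].reshape(K, M, sub).transpose(1, 0, 2).copy()
Xs = X.reshape(-1, M, sub)
for _ in range(10):
    for m in range(M):
        d = ((Xs[:, m, None, :] - codebooks[m][None]) ** 2).sum(-1)
        a = d.argmin(1)
        for k in range(K):
            if (a == k).any():
                codebooks[m, k] = Xs[a == k, m].mean(0)

# 编码:每个子向量只存一个 uint8 码字;解码:查表重建
codes = np.stack([((Xs[:, m, None, :] - codebooks[m][None]) ** 2).sum(-1).argmin(1)
                  for m in range(M)], axis=1).astype(np.uint8)   # (N, M)
recon = np.stack([codebooks[m][codes[:, m]] for m in range(M)], 1).reshape(-1, D)

err = ((X - recon) ** 2).mean()
print(codes.nbytes, X.nbytes)    # 本例压缩比为 32x:每向量 8 字节 vs 256 字节
```

压缩后的码字支持直接查表计算距离 / 内积,这正是 PQ 能在 PIM 中对压缩数据原位计算的基础。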
[AI-31] Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures
【速读】:该论文旨在解决传统多智能体系统(Multi-Agent Systems, MASs)在复杂场景下协作能力受限的问题,特别是在感知、通信、决策与控制等维度上难以实现灵活适应与语义层面协同的瓶颈。其解决方案的关键在于引入大基础模型(Large Foundation Models, LFMs),通过将LFM集成到多智能体架构中构建LFM-based MASs(LMASs),从而突破传统系统依赖低层级状态交换的局限,实现从物理交互向语义推理的跃迁,显著提升系统的自适应性与跨场景泛化能力。
链接: https://arxiv.org/abs/2604.18133
作者: Zixiang Wang,Mengjia Gong,Qiyu Sun,Jing Xu,Shuai Mao,Xin Jin,Qing-Long Han,Yang Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by IEEE/CAA Journal of Automatica Sinica
Abstract:With the rapid advancement of artificial intelligence, multi-agent systems (MASs) are evolving from classical paradigms toward architectures built upon large foundation models (LFMs). This survey provides a systematic review and comparative analysis of classical MASs (CMASs) and LFM-based MASs (LMASs). First, within a closed-loop coordination framework, CMASs are reviewed across four fundamental dimensions: perception, communication, decision-making, and control. Beyond this framework, LMASs integrate LFMs to lift collaboration from low-level state exchanges to semantic-level reasoning, enabling more flexible coordination and improved adaptability across diverse scenarios. Then, a comparative analysis is conducted to contrast CMASs and LMASs across architecture, operating mechanism, adaptability, and application. Finally, future perspectives on MASs are presented, summarizing open challenges and potential research opportunities.
[AI-32] Training LLM Agents for Spontaneous Reward-Free Self-Evolution via World Knowledge Exploration
【速读】:该论文旨在解决当前智能体(agent)依赖外部奖励与人类监督进行自我进化的问题,即在缺乏人类指导时,其进化过程会停止。为实现真正自主的持续进化能力,作者提出了一种基于内在元进化(meta-evolution)能力的解决方案:通过设计一种基于结果的奖励机制(outcome-based reward mechanism),在训练阶段仅利用该信号来教导模型如何有效探索并总结世界知识,从而提升下游任务的成功率;推理阶段则完全无需外部奖励或人工指令,智能体可自发地进行原生自进化(native self-evolution),以适应未知环境。该方案的关键在于将“世界知识”的生成与任务性能提升直接关联,使模型学会主动构建和优化内部认知结构,从而实现不依赖外部干预的自主演化能力。
链接: https://arxiv.org/abs/2604.18131
作者: Qifan Zhang,Dongyang Ma,Tianqing Fang,Jia Li,Jing Tang,Nuo Chen,Haitao Mi,Yan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent’s self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.
[AI-33] Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)和大 multimodal 模型(Large Multimodal Models, LMMs)在长上下文场景下预填充(prefilling)计算开销过大的问题。现有基于 token 剪枝的方法依赖启发式策略,难以与硬件高效内核(如 FlashAttention)兼容。其解决方案的关键在于提出一种无需训练的策略——Delta Attention Selective Halting (DASH),该策略通过监测自注意力机制在层间的更新动态,识别出趋于语义稳定点(semantic fixing points)的 token,并选择性地停止对这些 token 的后续处理,从而实现显著的预填充加速,同时保持模型精度和硬件效率。
链接: https://arxiv.org/abs/2604.18103
作者: Yujie Chen,Tailai Chen,Yifeng Gao,Zoe Wanying He,Yijue Xu,Shaobo Wang,Linfeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at this https URL.
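DASH 的核心判据(当 token 的逐层更新幅度低于阈值时停止后续处理)可以用一个简化模拟来说明。以下为示意性代码:隐藏状态的逐层动态是人为构造的(更新幅度随深度衰减,模拟趋于"语义定点"),并非真实 Transformer 前向过程:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, L, tau = 32, 16, 12, 0.05      # token 数 / 隐藏维度 / 层数 / 停止阈值

h = rng.standard_normal((N, d))
active = np.ones(N, dtype=bool)       # 尚未到达"语义定点"的 token
active_per_layer = []

for l in range(L):
    # 人为构造的逐层更新:幅度随深度指数衰减,各 token 衰减速率不同
    delta = rng.standard_normal((N, d)) * (0.5 ** l) * rng.uniform(0.2, 1.5, (N, 1))
    h[active] += delta[active]                     # 已停止的 token 不再更新
    rel = np.linalg.norm(delta, axis=1) / (np.linalg.norm(h, axis=1) + 1e-8)
    active &= rel >= tau                           # 相对更新足够小则永久停止
    active_per_layer.append(int(active.sum()))

print(active_per_layer)    # 活跃 token 数逐层单调不增,越到深层可省的计算越多
```

真实系统中该判据作用于注意力层输出的增量,并需与 FlashAttention 等内核兼容;此处只演示"稳定即冗余"的选择性停止逻辑。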
[AI-34] DSAINet: An Efficient Dual-Scale Attentive Interaction Network for General EEG Decoding
【速读】:该论文旨在解决非侵入式脑电图(EEG)在跨任务场景下解码模型泛化能力差的问题,核心挑战在于不同任务中脑电信号的时序组织模式存在差异,而现有方法依赖任务特异的架构设计引入了时序归纳偏置,导致难以在不调整模型配置的情况下适应多种任务。其解决方案的关键是提出DSAINet——一种高效的双尺度注意力交互网络,通过构建共享的时空标记表示,并利用细粒度和粗粒度并行卷积分支建模多样化的时序动态;再结合分支内注意力强化特定尺度的显著模式、分支间注意力融合跨尺度的任务相关特征,并最终通过自适应标记聚合生成紧凑的预测表示,从而实现统一架构下的多任务高效泛化。
链接: https://arxiv.org/abs/2604.18095
作者: Zhiyuan Ma,Zeyuan Li,Zihao Qiu,Jinhao Li,Lingqin Meng,Xinche Zhang,Yixuan Liu,Xinke Shen,Sen Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In real-world applications of noninvasive electroencephalography (EEG), specialized decoders often show limited generalizability across diverse tasks under subject-independent settings. One central challenge is that task-relevant EEG signals often follow different temporal organization patterns across tasks, while many existing methods rely on task-tailored architectural designs that introduce task-specific temporal inductive biases. This mismatch makes it difficult to adapt temporal modeling across tasks without changing the model configuration. To address these challenges, we propose DSAINet, an efficient dual-scale attentive interaction network for general EEG decoding. Specifically, DSAINet constructs shared spatiotemporal token representations from raw EEG signals and models diverse temporal dynamics through parallel convolutional branches at fine and coarse scales. The resulting representations are then adaptively refined by intra-branch attention to emphasize salient scale-specific patterns and by inter-branch attention to integrate task-relevant features across scales, followed by adaptive token aggregation to yield a compact representation for prediction. Extensive experiments on five downstream EEG decoding tasks across ten public datasets show that DSAINet consistently outperforms 13 representative baselines under strict subject-independent evaluation. Notably, this performance is achieved using the same architecture hyperparameters across datasets. Moreover, DSAINet achieves a favorable accuracy-efficiency trade-off with only about 77K trainable parameters and provides interpretable neurophysiological insights. The code is publicly available at this https URL.
[AI-35] Implicit neural representations as a coordinate-based framework for continuous environmental field reconstruction from sparse ecological observations
【速读】:该论文旨在解决从稀疏且不规则观测数据中重建连续环境场的问题,这在环境建模和生物多样性信息学中是一个核心挑战。传统基于网格的方法难以在空间和时间上实现可扩展性和跨域泛化,尤其面对异质性生态数据时更为困难。论文提出的解决方案关键在于采用隐式神经表示(Implicit Neural Representations, INRs),这是一种基于坐标的建模框架,能够直接从坐标输入中学习连续的空间及时空场。INRs通过神经网络参数化函数映射,实现分辨率无关的查询、稳定的连续表征以及可预测的计算成本,从而在物种分布重建、物候动态和形态分割等场景中展现出优于经典平滑器和树基方法的性能,同时具备良好的可扩展性和架构归纳偏置,适合作为环境建模流程中的灵活表示层。
链接: https://arxiv.org/abs/2604.18083
作者: Agnieszka Pregowska,Hazem M. Kalaji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reconstructing continuous environmental fields from sparse and irregular observations remains a central challenge in environmental modelling and biodiversity informatics. Many ecological datasets are heterogeneous in space and time, making grid-based approaches difficult to scale or generalise across domains. Here, we evaluate implicit neural representations (INRs) as a coordinate-based modelling framework for learning continuous spatial and spatio-temporal fields directly from coordinate inputs. We analyse their behaviour across three representative modelling scenarios: species distribution reconstruction, phenological dynamics, and morphological segmentation derived from open biodiversity data. Beyond predictive performance, we examine interpolation behaviour, spatial coherence, and computational characteristics relevant for environmental modelling workflows, including scalability, resolution-independent querying, and architectural inductive bias. Results show that neural fields provide stable continuous representations with predictable computational cost, complementing classical smoothers and tree-based approaches. These findings position coordinate-based neural fields as a flexible representation layer that can be integrated into environmental modelling pipelines and exploratory analysis frameworks for large, irregularly sampled datasets.
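坐标式神经场的最小形式可以用"随机傅里叶特征 + 线性读出"来示意:从稀疏不规则观测拟合一个连续场,然后在任意分辨率的网格上查询(玩具示例,场函数、特征维度与带宽均为演示用假设,并非论文所用 INR 架构):

```python
import numpy as np

rng = np.random.default_rng(0)

def field(xy):             # 假想的连续环境场(如温度),仅作演示
    return np.sin(3 * xy[:, 0]) * np.cos(2 * xy[:, 1])

# 稀疏、不规则的观测点
xy_obs = rng.uniform(-1, 1, (300, 2))
y_obs = field(xy_obs)

# 坐标 -> 随机傅里叶特征 -> 线性读出:最简的坐标式场模型
B = rng.standard_normal((2, 64)) * 3.0
def feats(xy):
    p = xy @ B
    return np.concatenate([np.sin(p), np.cos(p)], axis=1)

w, *_ = np.linalg.lstsq(feats(xy_obs), y_obs, rcond=None)

# 分辨率无关的查询:任意网格上直接求值,无需重采样或插值
g = np.linspace(-1, 1, 50)
grid = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)
pred = feats(grid) @ w
err = np.abs(pred - field(grid)).mean()
print(err)
```

摘要强调的"分辨率无关查询"即体现在最后一步:查询网格的密度可任意改变,模型本身不依赖网格。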
[AI-36] Architectural Design Decisions in AI Agent Harnesses
【速读】:该论文旨在解决当前AI代理系统(AI agent systems)中非大语言模型(non-LLM)工程基础设施的架构设计决策缺乏系统性研究的问题,尤其关注这些基础设施如何影响系统的可复用性、安全性与协同能力。其解决方案的关键在于通过协议引导且基于源代码的实证研究方法,对70个公开可用的代理系统项目进行深入分析,识别出五类核心设计维度(子代理架构、上下文管理、工具系统、安全机制和编排策略),并揭示它们之间的共现关系与典型架构模式,从而为框架设计者、选型者和研究人员提供基于证据的架构指导。
链接: https://arxiv.org/abs/2604.18071
作者: Hu Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 35 pages, 13 tables
Abstract:AI agent systems increasingly rely on reusable non-LLM engineering infrastructure that packages tool mediation, context handling, delegation, safety control, and orchestration. Yet the architectural design decisions in this surrounding infrastructure remain understudied. This paper presents a protocol-guided, source-grounded empirical study of 70 publicly available agent-system projects, addressing three questions: which design-decision dimensions recur across projects, which co-occurrences structure those decisions, and which typical architectural patterns emerge. Methodologically, we contribute a transparent investigation procedure for analyzing heterogeneous agent-system corpora through source-code and technical-material reading. Empirically, we identify five recurring design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration) and find that the corpus favors file-persistent, hybrid, and hierarchical context strategies; registry-oriented tool systems remain dominant while MCP- and plugin-oriented extensions are emerging; and intermediate isolation is common but high-assurance audit is rare. Cross-project co-occurrence analysis reveals that deeper coordination pairs with more explicit context services, stronger execution environments with more structured governance, and formalized tool-registration boundaries with broader ecosystem ambitions. We synthesize five recurring architectural patterns spanning lightweight tools, balanced CLI frameworks, multi-agent orchestrators, enterprise systems, and scenario-verticalized projects. The result provides an evidence-based account of architectural regularities in agent-system engineering, with grounded guidance for framework designers, selectors, and researchers.
[AI-37] Understanding Human Actions through the Lens of Executable Models
【速读】:该论文旨在解决现有方法在识别人类动作时难以捕捉动作内部结构与执行机制的问题,从而影响对动作质量评估及动作间差异的理解。其关键解决方案是引入一种领域特定语言EXACT,将人类运动表示为未明确指定的运动程序(motion programs),并通过前向-后向表示法将其解释为奖励生成函数,用于零样本策略推理;同时利用EXACT运动程序的组合特性,构建一个可执行的神经符号模型,以程序结构实现动作的组合建模,显著提升了数据效率并更好地刻画了动作间的直观关系。
链接: https://arxiv.org/abs/2604.18064
作者: Rimvydas Rubavicius,Manisha Dubey,N. Siddharth,Subramanian Ramamoorthy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures, 2 tables
Abstract:Human-centred systems require an understanding of human actions in the physical world. Temporally extended sequences of actions are intentional and structured, yet existing methods for recognising what actions are performed often do not attempt to capture their structure, particularly how the actions are executed. This, however, is crucial for assessing the quality of the action’s execution and its differences from other actions. To capture the internal mechanics of actions, we introduce a domain-specific language EXACT that represents human motions as underspecified motion programs, interpreted as reward-generating functions for zero-shot policy inference using forward-backward representations. By leveraging the compositional nature of EXACT motion programs, we combine individual policies into an executable neuro-symbolic model that uses program structure for compositional modelling. We evaluate the utility of the proposed pipeline for creating executable action models by analysing motion-capture data to understand human actions, for the tasks of human action segmentation and action anomaly detection. Our results show that the use of executable action models improves data efficiency and captures intuitive relationships between actions compared with monolithic, task-specific approaches.
[AI-38] ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks
【速读】:该论文旨在解决5G网络入侵检测系统(Intrusion Detection System, IDS)中模型透明性不足的问题,即传统高精度“黑箱”模型虽能有效识别攻击,但缺乏可解释性,难以建立运维人员的信任并指导响应决策。其解决方案的关键在于提出ExAI5G框架,该框架通过融合基于Transformer的深度学习IDS与逻辑驱动的可解释人工智能(Explainable AI, XAI)技术,利用集成梯度(Integrated Gradients)进行特征重要性归因,并构建代理决策树以提取逻辑规则;同时引入基于大语言模型(Large Language Model, LLM)的新颖评估方法,量化生成解释的可操作性和语义忠实度,从而在保持99.9%准确率和0.854宏F1分数的同时,实现高达99.7%保真度的16条逻辑规则输出,显著提升模型推理过程的透明性与可信度。
链接: https://arxiv.org/abs/2604.18052
作者: Saeid Sheikhi,Panos Kostakos,Lauri Loven
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Intrusion detection systems (IDSs) for 5G networks must handle complex, high-volume traffic. Although opaque “black-box” models can achieve high accuracy, their lack of transparency hinders trust and effective operational response. We propose ExAI5G, a framework that prioritizes interpretability by integrating a Transformer-based deep learning IDS with logic-based explainable AI (XAI) techniques. The framework uses Integrated Gradients to attribute feature importance and extracts a surrogate decision tree to derive logical rules. We introduce a novel evaluation methodology for LLM-generated explanations, using a powerful evaluator LLM to assess actionability and measuring their semantic similarity and faithfulness. On a 5G IoT intrusion dataset, our system achieves 99.9% accuracy and a 0.854 macro F1-score, demonstrating strong performance. More importantly, we extract 16 logical rules with 99.7% fidelity, making the model’s reasoning transparent. The evaluation demonstrates that modern LLMs can generate explanations that are both faithful and actionable, indicating that it is possible to build a trustworthy and effective IDS without compromising performance for the sake of marginal gains from an opaque model.
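从黑盒模型提取可读规则的基本流程可以如下示意:在特征阈值空间中搜索与黑盒输出一致率(保真度,fidelity)最高的单条 if-then 规则(玩具示例,"黑盒"与特征均为假设,并非 ExAI5G 的代理决策树实现,后者可提取多条规则):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (2000, 3))

# 假想的"黑盒"IDS:仅作演示,真实系统是 Transformer 分类器
blackbox = lambda X: ((X[:, 0] > 0.7) | (X[:, 1] > 0.9)).astype(int)
y = blackbox(X)

# 代理规则搜索:逐特征、逐阈值枚举,最大化与黑盒输出的一致率
best = None
for j in range(X.shape[1]):
    for t in np.linspace(0.05, 0.95, 19):
        pred = (X[:, j] > t).astype(int)
        fid = (pred == y).mean()
        if best is None or fid > best[0]:
            best = (fid, j, t)

fid, j, t = best
print(f"规则: IF x{j} > {t:.2f} THEN attack  (保真度 {fid:.3f})")
```

论文报告的 16 条规则、99.7% 保真度对应的是完整决策树代理模型;此处的单规则搜索只演示"保真度"这一评价量的含义。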
[AI-39] The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data
【速读】:该论文旨在解决当前神经符号推理系统(如AlphaGeometry)中符号推理引擎存在的对数线性扩展瓶颈问题,该瓶颈限制了模型在复杂问题下的效率。此外,论文指出现有领域特定语言作为输入表示与自然语言之间可能存在同构关系,表明当前神经引导机制依赖于表层编码而非结构理解。解决方案的关键在于提出一种“逻辑到拓扑编码”方法,通过利用观测逻辑(Logic of Observation)中可证明性与拓扑之间的对偶性,构建输入空间的拓扑双射(topological dual),从而揭示模型潜在空间在输入变换下的结构不变性。这一框架为神经符号AI提供了类罗塞塔石碑式的统一表征,实现了从形式逻辑、拓扑结构到神经处理的跨模态映射,为复杂发现路径的机制可解释性提供了理论基础。
链接: https://arxiv.org/abs/2604.18050
作者: Anthony Bordg
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:AlphaGeometry represents a milestone in neuro-symbolic reasoning, yet its architecture faces a log-linear scaling bottleneck within its symbolic deduction engine that limits its efficiency as problem complexity increases. Recent technical reports suggest that current domain-specific languages may be isomorphic as input representations to natural language: interchanging them acts as a performance-invariant transformation, implying that current neural guidance relies on superficial encodings rather than structural understanding. This paper addresses this representation bottleneck by proposing a logic-to-topology encoding designed to reveal the structural invariants of a model’s latent space under a transformation of its input space. By leveraging the Logic of Observation, we utilize the duality between provability in observable theories and topologies to propose a logic-to-topology encoder for the input space. We introduce the concept of the “topological dual of a dataset”, a transformation that bridges formal logic, topology, and neural processing. This framework serves as a Rosetta Stone for neuro-symbolic AI, providing a principled pathway for the mechanistic interpretability of how models navigate complex discovery paths.
[AI-40] First Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows
【速读】:该论文旨在解决生成式 AI(Generative AI)在临床场景中可能存在的种族偏见问题,尤其是在合成患者病例生成和鉴别诊断排序任务中的隐性与显性偏见。研究以欧盟人工智能法案(EU AI Act)为治理框架,对五种广泛应用的大语言模型(Large Language Models, LLMs)进行多维度评估,关键解决方案在于采用结构化提示模板和两阶段评价设计,结合美国人群的种族分布数据及专家诊断列表作为基准,系统量化不同模型在种族代表性上的偏差,并探索检索增强型代理工作流(retrieval-based agentic workflow)对缓解显性偏见的有效性。结果显示,DeepSeek V3 在诊断任务中表现最优,且其代理工作流版本在多个指标上显著优于独立模型,表明此类架构可能有助于降低部分类型的显性偏见。
链接: https://arxiv.org/abs/2604.18038
作者: Sihao Xing,Zaur Gouliev
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.
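摘要中对生成分布与流行病学基准分布的偏差度量(如卡方统计量与均值差)可以这样计算(示意性代码,数据为虚构演示值,并非论文实验结果):

```python
import numpy as np

# 假设的参考分布与模型生成的种族分布(数字仅为演示)
expected = np.array([0.60, 0.13, 0.19, 0.06, 0.02])   # 流行病学基准比例
observed_counts = np.array([650, 110, 170, 50, 20])    # 模型生成的病例计数

n = observed_counts.sum()
# 卡方统计量:sum((O - E)^2 / E),E 为基准比例对应的期望计数
chi2 = ((observed_counts - n * expected) ** 2 / (n * expected)).sum()
# 均值差:生成比例与基准比例的平均绝对偏差
mean_diff = np.abs(observed_counts / n - expected).mean()
print(chi2, mean_diff)    # 两个统计量越小,生成分布与基准越接近
```

论文采用多指标(p 值、均值差等)联合评估正是为了避免单一统计量掩盖某一维度上的偏差。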
[AI-41] RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments ICML ICLR NEURIPS
【速读】:该论文旨在解决在线调参(online tuning)中因外部上下文(context)变化导致的动态优化难题,尤其在上下文反复进入有限潜变量状态(latent regimes)且每步计算资源受限时,传统方法如全量高斯过程(Gaussian Process, GP)重拟合效率低下、历史信息利用率不足的问题。其解决方案的关键在于提出RASP-Tuner框架,通过三个核心机制实现高效适应:(i) 利用最近邻检索构建上下文代理(regime proxy),识别当前所处的潜在状态;(ii) 设计一种混合专家(mixture-of-experts)代理模型,输入融合参数、上下文与检索到的软提示(soft prompt),预测短期损失;(iii) 主要在低维提示子空间中进行参数调整,仅当标量误差或预测分歧显著上升时才触发完整的代理模型更新。此外,引入RealErrorComposer模块将异构流式指标映射为[0,1]区间内的可微目标,统一训练信号。该设计在保持低延迟的同时显著降低累积遗憾(cumulative regret),实验证明其优于GP-UCB和CMA-ES等基线方法。
链接: https://arxiv.org/abs/2604.18026
作者: Enze Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Withdrawn from ICML; in preparation for NeurIPS or ICLR
Abstract:Many deployed systems expose black-box objectives whose minimizing configuration shifts with an externally observed context. When contexts revisit a small set of latent regimes, an optimizer that discards history pays repeated adaptation cost; when each step must remain inexpensive, full Gaussian-process (GP) refits at high observation counts are difficult to sustain. We cast online tuning as context-conditioned regret minimization and present RASP-Tuner, which instantiates a decomposition motivated by first principles: (i) identify a regime proxy by retrieving similar past contexts; (ii) predict short-horizon loss with a mixture-of-experts surrogate whose input concatenates parameters, context, and a retrieved soft prompt; (iii) adapt chiefly in a low-dimensional prompt subspace, invoking full surrogate updates only when scalarized error or disagreement spikes. A RealErrorComposer maps heterogeneous streaming metrics to [0,1] via EMA-stabilized logistic scores, supplying a single differentiable training target. On nine synthetic non-stationary benchmarks, an adversarial-context sanity check, and three tabular real-world streams (Section on real-world experiments), RASP-Tuner improves or matches cumulative regret relative to our GP-UCB and CMA-ES implementations on seven of nine synthetic tasks under paired tests at horizon T=100, while recording 8-12 times lower wall-clock per step than sliding-window GP-UCB on identical hardware. Idealized analysis in a cluster-separated, strongly convex regime model (RA-GD) supplies sufficient conditions for bounded dynamic regret; the deployed pipeline violates several of these premises, and we articulate which gaps remain open.
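摘要提到的 RealErrorComposer 将异构流式指标经 EMA 稳定化后用 logistic 函数压缩到 [0,1],其思路可以这样勾勒(示意性实现:类名沿用摘要,但内部统计量维护方式与标量化方法均为假设):

```python
import math

class RealErrorComposer:
    """示意实现:用 EMA 统计量把异构流式指标压到 (0,1) 并平均成标量目标"""
    def __init__(self, beta=0.9):
        self.beta = beta
        self.mu, self.var = {}, {}

    def update(self, metrics):
        scores = {}
        for name, x in metrics.items():
            m = self.mu.get(name, x)
            v = self.var.get(name, 1.0)
            m = self.beta * m + (1 - self.beta) * x             # EMA 均值
            v = self.beta * v + (1 - self.beta) * (x - m) ** 2  # EMA 方差
            self.mu[name], self.var[name] = m, v
            z = (x - m) / math.sqrt(v + 1e-8)                   # 标准化
            scores[name] = 1.0 / (1.0 + math.exp(-z))           # logistic 到 (0,1)
        return sum(scores.values()) / len(scores)               # 标量化训练目标

comp = RealErrorComposer()
for t in range(50):
    s = comp.update({"latency_ms": 100 + t, "error_rate": 0.01})
print(s)
```

这种归一化让量纲不同的指标(延迟、错误率等)共享同一个可微的 [0,1] 训练信号。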
[AI-42] SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression
【速读】:该论文旨在解决情感识别在对话(Emotion Recognition in Conversation, ERC)任务中因高质量标注数据稀缺且静态而导致的模型性能瓶颈问题,同时关注情绪预测与情感表达之间的一致性。解决方案的关键在于提出SELF-EMO框架,其核心是基于“更准确的情绪预测可带来更一致的情感回应”的假设,设计了两个辅助任务——情绪理解与情绪表达,并采用角色驱动的自对弈机制(role-based self-play paradigm),使模型同时扮演情绪识别者和对话响应者,在迭代交互中生成多样化的对话轨迹以实现规模化数据生成;此外,引入数据飞轮机制(data flywheel)通过平滑IoU奖励筛选优质样本并反馈用于持续自提升,无需外部监督;最后结合SELF-GRPO强化学习算法,利用多标签对齐奖励与群体一致性信号稳定优化过程,显著提升了模型在IEMOCAP、MELD和EmoryNLP等基准上的准确率与泛化能力。
链接: https://arxiv.org/abs/2604.18003
作者: Shaowei Zhang,Faqiang Qian,Yan Chen,Ziliang Wang,Kang An,Yong Dai,Mengya Gao,Yichao Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Emotion Recognition in Conversation (ERC) has become a fundamental capability for large language models (LLMs) in human-centric interaction. Beyond accurate recognition, coherent emotional expression is also crucial, yet both are limited by the scarcity and static nature of high-quality annotated data. In this work, we propose SELF-EMO, a self-evolution framework grounded in the hypothesis that better emotion prediction leads to more consistent emotional responses. We introduce two auxiliary tasks, emotional understanding and emotional expression, and design a role-based self-play paradigm where the model acts as both an emotion recognizer and a dialogue responder. Through iterative interactions, the model generates diverse conversational trajectories, enabling scalable data generation. To ensure quality, we adopt a data flywheel mechanism that filters candidate predictions and responses using a smoothed IoU-based reward and feeds selected samples back for continuous self-improvement without external supervision. We further develop SELF-GRPO, a reinforcement learning algorithm that stabilizes optimization with multi-label alignment rewards and group-level consistency signals. Experiments on IEMOCAP, MELD, and EmoryNLP show that SELF-EMO achieves state-of-the-art performance, improving accuracy by +6.33% on Qwen3-4B and +8.54% on Qwen3-8B, demonstrating strong effectiveness and generalization.
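摘要中基于平滑 IoU 的奖励可以这样示意:对预测标签集合与标注集合计算加平滑项的交并比(示意性实现,平滑方式为假设,并非论文的确切公式):

```python
def smoothed_iou_reward(pred, gold, eps=1.0):
    """多标签预测与标注之间的平滑 IoU 奖励;eps 平滑避免空集时 0/0"""
    p, g = set(pred), set(gold)
    return (len(p & g) + eps) / (len(p | g) + eps)

r_exact = smoothed_iou_reward(["joy", "surprise"], ["joy", "surprise"])
r_part  = smoothed_iou_reward(["joy", "anger"], ["joy", "surprise"])
r_none  = smoothed_iou_reward(["sadness"], ["joy"])
print(r_exact, r_part, r_none)   # 完全匹配 > 部分重叠 > 无重叠
```

相比逐标签的 0/1 奖励,这种集合级奖励对部分正确的预测给出平滑的中间信号,便于筛选数据飞轮中的候选样本。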
[AI-43] AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
【速读】:该论文旨在解决当前AI代理(Agent)开发中缺乏系统性教育框架的问题,即现有模型多为单一能力维度的专用系统(如工具使用、代码生成或安全意识),在未训练领域表现出可预测的缺陷。其核心问题是:如何构建一个覆盖智能行为全范围的、具有结构化理论基础的代理培养体系。解决方案的关键在于提出AIT Academy(人工智能技术学院)课程框架,该框架基于Kagan的“三文化”理论与联合国教科文组织ISCED-F 2013分类,将代理能力发展划分为三大领域——自然科学与技术推理(Domain I)、人文与创造性表达(Domain II)、社会科学与伦理推理(Domain III),并以儒家六艺(liuyi)作为行为原型映射至各领域可训练能力。通过三个代表性训练场(ClawdGO安全道场、雅典学院、Alt Mirage舞台)实证验证了该框架的有效性,并揭示了跨域诊断价值,例如“安全意识校准病理”(SACP)现象,表明多域视角对识别过拟合问题具有不可替代的意义。
链接: https://arxiv.org/abs/2604.17989
作者: Jiaqi Li,Lvyang Zhang,Yang Zhao,Wen Lu,Lidong Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures
Abstract:What does it mean to give an AI agent a complete education? Current agent development produces specialist systems, optimized for a single capability dimension (whether tool use, code generation, or security awareness), that exhibit predictable deficits wherever they were not trained. We argue this pattern reflects a structural absence: there is no curriculum theory for agents, no principled account of what a fully developed agent should know, be, and be able to do across the full scope of intelligent behavior. This paper introduces the AIT Academy (Agents Institute of Technology Academy), a curriculum framework for cultivating AI agents across the tripartite structure of human knowledge. Grounded in Kagan’s Three Cultures and UNESCO ISCED-F 2013, AIT organizes agent capability development into three domains: Natural Science and Technical Reasoning (Domain I), Humanities and Creative Expression (Domain II), and Social Science and Ethical Reasoning (Domain III). The Confucian Six Arts (liuyi), a 2,500-year-old holistic education system, are reinterpreted as behavioral archetypes that map directly onto trainable agent capabilities within each domain. Three representative training grounds instantiate the framework across multiple backbone LLMs: the ClawdGO Security Dojo (Domain I), Athen’s Academy (Domain II), and the Alt Mirage Stage (Domain III). Experiments demonstrate a 15.9-point improvement in security capability scores under weakest-first curriculum scheduling, and a 7-percentage-point gain in social reasoning performance under principled attribution modeling. A cross-domain finding, Security Awareness Calibration Pathology (SACP), in which over-trained Domain I agents fail on out-of-distribution evaluation, illustrates the diagnostic value of a multi-domain perspective unavailable to any single-domain framework.
[AI-44] Latent Fourier Transform ICLR2026
【速读】:该论文旨在解决生成式音乐模型在控制音乐结构时缺乏细粒度、可解释性差的问题,尤其是难以在不破坏特定时间尺度特征的前提下进行条件生成或混合。解决方案的关键在于提出Latent Fourier Transform(LatentFT)框架,通过将扩散自动编码器与潜在空间中的傅里叶变换相结合,在潜在空间中分离不同时间尺度的音乐模式;训练时在频域掩码潜在表示,使模型学习到可被推理阶段协同操控的结构化表征,从而实现基于频率指定的音乐变体生成和混合,其本质是将传统音频均衡器(equalizer)对可听频率的操作类比为对潜在空间频率的操作,以实现对音乐结构的连续、直观控制。
链接: https://arxiv.org/abs/2604.17986
作者: Mason Wang,Cheng-Zhi Anna Huang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: ICLR 2026 Oral
Abstract:We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking latents in the frequency domain during training, our method yields representations that can be manipulated coherently at inference. This allows us to generate musical variations and blends from reference examples while preserving characteristics at desired timescales, which are specified as frequencies in the latent space. LatentFT parallels the role of the equalizer in music production: while traditional equalizers operate on audible frequencies to shape timbre, LatentFT operates on latent-space frequencies to shape musical structure. Experiments and listening tests show that LatentFT improves condition adherence and quality compared to baselines. We also present a technique for hearing frequencies in the latent space in isolation, and show that different musical attributes reside in different regions of the latent spectrum. Our results show how frequency-domain control in latent space provides an intuitive, continuous frequency axis for conditioning and blending, advancing us toward more interpretable and interactive generative music models.
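LatentFT 的"潜空间均衡器"操作可以用 NumPy 的 FFT 勾勒:沿时间轴对潜变量序列做傅里叶变换,按频率掩码后重建或混合(示意性代码,潜变量用随机张量代替扩散自编码器输出,截止频率等均为演示用假设):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 128, 8                        # 潜变量序列长度 / 维度
z = rng.standard_normal((T, d))      # 假想的扩散自编码器潜变量序列

Z = np.fft.rfft(z, axis=0)           # 沿时间轴做潜空间傅里叶变换
freqs = np.fft.rfftfreq(T)           # 每个频率 bin 的归一化频率

# "潜空间均衡器":保留低频(长时间尺度结构),抹掉高频细节
mask = (freqs < 0.1).astype(float)[:, None]
z_lowpass = np.fft.irfft(Z * mask, n=T, axis=0)

# 混合:一个样本的低频结构 + 另一个样本的高频纹理
z2 = rng.standard_normal((T, d))
Z2 = np.fft.rfft(z2, axis=0)
z_blend = np.fft.irfft(Z * mask + Z2 * (1 - mask), n=T, axis=0)
print(z_lowpass.shape, z_blend.shape)
```

与音频均衡器作用于可听频率类似,这里的掩码作用于潜空间频率;真实系统中重建后的潜变量还需经解码器生成音频。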
[AI-45] A Sugeno Integral View of Binarized Neural Network Inference
【速读】:该论文旨在解决二值神经网络(Binarized Neural Networks, BNNs)中神经元决策机制的可解释性问题,特别是如何将BNN的激活阈值判断转化为具有明确语义意义的数学表达。其解决方案的关键在于建立BNN与Sugeno积分之间的精确联系:通过证明在推理阶段,每个隐藏层神经元的激活阈值测试等价于对二值输入的Sugeno积分,从而获得每个神经元决策的显式集函数表示及其对应的规则基础表示(即if-then规则)。这一框架不仅揭示了BNN内部决策逻辑的结构化本质,还为后续扩展至更复杂的输入交互关系和非二值情形提供了理论支撑。
链接: https://arxiv.org/abs/2604.17967
作者: Ismaïl Baaj,Henri Prade
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In this article, we establish a precise connection between binarized neural networks (BNNs) and Sugeno integrals. The advantage of the Sugeno integral is that it provides a framework for representing the importance of inputs and their interactions, while being equivalent to a set of if-then rules. For a hidden BNN neuron at inference time, we show that the activation threshold test can be written as a Sugeno integral on binary inputs. This yields an explicit set-function representation of each neuron decision, and an associated rule-based representation. We also provide a Sugeno-integral expression for the last-layer score. Finally, we discuss how the same framework can be adapted to support richer input interactions and how it can be extended beyond the binary case induced by binarized neural networks.
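二值输入下 Sugeno 积分退化为"支撑集上的容度取值"这一事实,可以用几行代码验证(示意性代码,容度 mu 为演示用假设,并非从某个具体 BNN 神经元导出):

```python
import itertools

def sugeno(x, mu):
    """Sugeno 积分:S(x) = max_i min(x_(i), mu(A_(i))),x_(i) 按降序排列"""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    best, A = 0.0, set()
    for i in order:
        A.add(i)
        best = max(best, min(x[i], mu(frozenset(A))))
    return best

# 一个单调集合函数(容度),模拟"激活输入子集的重要性"
def mu(A):
    return min(1.0, 0.4 * len(A))    # 示意:重要性只取决于激活输入个数

# 性质:对所有二值输入,Sugeno 积分 = mu(支撑集),即一次阈值式查表
for x in itertools.product([0, 1], repeat=4):
    support = frozenset(i for i, xi in enumerate(x) if xi)
    assert sugeno(list(map(float, x)), mu) == mu(support)
print("ok")
```

这正是论文利用的结构:BNN 神经元的阈值判定只依赖激活输入的子集,因而可写成某个容度下的 Sugeno 积分,并进一步翻译成 if-then 规则。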
[AI-46] TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在航空航天等安全关键工程领域部署时,因缺乏对物理合理性与推理过程的严格评估而导致的潜在风险问题。具体而言,现有科学基准仅关注最终答案的正确性,忽视了工程计算中数值合理但物理无效的错误,这类错误在高超声速热防护系统(Thermal Protection System, TPS)设计中可能引发灾难性后果。解决方案的关键在于提出首个面向封闭解析计算的诊断性基准——TPS-CalcBench,其核心包括:基于安德森教材构建的四级难度、八类任务体系;双轨评估机制(结果准确性 + 推理质量,采用8维评分标准及校准人工审核)以识别“正确答案但错误推理”的陷阱;以及一套完整的人机协同数据流水线和三种诊断干预方法(DFA-TPS微调、RAG-EQ检索增强、PA-CoT过程感知提示),从而实现从诊断到评估再到优化的闭环框架,显著提升了LLMs在安全关键工程场景中的可信度与可靠性。
链接: https://arxiv.org/abs/2604.17966
作者: Jinglai Zheng,Chuhan Qiao,Haiming Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying LLMs as reasoning assistants in safety-critical aerospace engineering requires stricter evaluation criteria than general scientific benchmarks. In hypersonic thermal protection system (TPS) design, inaccurate stagnation-point heat flux or boundary-layer calculations may cause catastrophic design margin violations. Models with numerically reasonable but physically invalid answers are more dangerous than those declining to respond. Current scientific benchmarks only test abstract math and basic physics, evaluate final answers solely, ignore engineering reasoning processes, and cannot detect such critical failures. We propose TPS-CalcBench, the first diagnostic benchmark for closed-form analytical calculations in hypersonic aerodynamics and high-temperature gas dynamics that experienced TPS engineers conduct without simulations. Our contributions include domain-oriented task taxonomy with 4 difficulty levels and 8 categories from Anderson’s textbook, dual-track evaluation measuring result accuracy and reasoning quality via an 8-dimension rubric and calibrated judge with human audit to identify right answer wrong reasoning issues, human-AI data pipeline producing 420 high-confidence core items and 810 noise-controlled pre-gating items from 4560 raw data, noise-sensitivity analysis measuring data quality impacts on model ranking, and three diagnostic intervention methods: DFA-TPS fine-tuning, RAG-EQ retrieval grounding and PA-CoT process-aware prompting. Tests on 13 models from 7 groups show wide performance differences (KPI 12.6-87.9), hidden formula selection defects, data-driven rank changes and effective intervention improvements, establishing a complete diagnose-evaluate-intervene framework for safety-critical engineering LLM deployment assessment.
[AI-47] CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
【速读】:该论文旨在解决多智能体委托(multi-agent delegation)中因静态技能水平假设导致的系统性误委托问题。传统方法将每个代理的能力视为固定值,忽略了其在不同任务情境下的表现差异,例如编码代理在短距离独立编辑中表现优异但在长周期调试中失效,或规划代理在浅层任务上高效而在链式依赖任务中退化。这种静态设定平均了异质情境,造成委托决策偏差。解决方案的关键在于提出CADMAS-CTX框架,通过引入情境感知的能力校准机制——为每个代理、技能和粗粒度情境桶维护一个Beta后验分布,以捕捉特定任务空间中的稳定经验;并采用一种风险敏感评分策略,结合后验均值与不确定性惩罚项,在同伴确实更优且证据充分时才进行委托,从而实现上下文条件下的动态、鲁棒的委托决策。
链接: https://arxiv.org/abs/2604.17950
作者: Chuhan Qiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We revisit multi-agent delegation under a stronger and more realistic assumption: an agent’s capability is not fixed at the skill level, but depends on task context. A coding agent may excel at short standalone edits yet fail on long-horizon debugging; a planner may perform well on shallow tasks yet degrade on chained dependencies. Static skill-level capability profiles therefore average over heterogeneous situations and can induce systematic misdelegation. We propose CADMAS-CTX, a framework for contextual capability calibration. For each agent, skill, and coarse context bucket, CADMAS-CTX maintains a Beta posterior that captures stable experience in that part of the task space. Delegation is then made by a risk-aware score that combines the posterior mean with an uncertainty penalty, so that agents delegate only when a peer appears better and that assessment is sufficiently well supported by evidence. This paper makes three contributions. First, a hierarchical contextual capability profile replaces static skill-level confidence with context-conditioned posteriors. Second, based on contextual bandit theory, we formally prove context-aware routing achieves lower cumulative regret than static routing under sufficient context heterogeneity, formalizing the bias-variance tradeoff. Third, we empirically validate our method on GAIA and SWE-bench benchmarks. On GAIA with GPT-4o agents, CADMAS-CTX achieves 0.442 accuracy, outperforming static baseline 0.381 and AutoGen 0.354 with non-overlapping 95% confidence intervals. On SWE-bench Lite, it improves resolve rate from 22.3% to 31.4%. Ablations show the uncertainty penalty improves robustness against context tagging noise. Our results demonstrate contextual calibration and risk-aware delegation significantly improve multi-agent teamwork compared with static global skill assignments.
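以下按摘要思路给出情境能力校准的最小草图(类名、risk_lambda 惩罚系数与委托规则细节均为本文自拟假设,并非论文 API):每个 (agent, skill, context bucket) 三元组维护一个 Beta 后验,评分取后验均值减去不确定性惩罚,只有当同伴的风险调整分数确实更高时才委托。

```python
import math
from collections import defaultdict

class ContextualCapability:
    # One Beta(alpha, beta) posterior per (agent, skill, context bucket),
    # starting from a uniform Beta(1, 1) prior.
    def __init__(self, risk_lambda=1.0):
        self.post = defaultdict(lambda: [1.0, 1.0])
        self.risk_lambda = risk_lambda

    def update(self, agent, skill, bucket, success):
        a, b = self.post[(agent, skill, bucket)]
        self.post[(agent, skill, bucket)] = [a + success, b + (1 - success)]

    def score(self, agent, skill, bucket):
        # Risk-aware score: posterior mean minus an uncertainty penalty,
        # so sparsely observed (agent, context) pairs are not over-trusted.
        a, b = self.post[(agent, skill, bucket)]
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean - self.risk_lambda * math.sqrt(var)

    def delegate(self, skill, bucket, self_agent, peers):
        # Delegate only when a peer's risk-adjusted score beats our own.
        best = max(peers, key=lambda p: self.score(p, skill, bucket))
        if self.score(best, skill, bucket) > self.score(self_agent, skill, bucket):
            return best
        return self_agent

cal = ContextualCapability(risk_lambda=1.0)
for _ in range(20):
    cal.update("coder", "debug", "long_horizon", 1)   # well-supported record
cal.update("planner", "debug", "long_horizon", 1)     # a single lucky success
print(cal.delegate("debug", "long_horizon", "planner", ["coder"]))
```

证据充分的 "coder" 在该情境桶中胜出;若仅比较后验均值而不加不确定性惩罚,单次侥幸成功的记录会被高估。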
[AI-48] ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
【速读】:该论文旨在解决生成式 AI (Generative AI) 在复杂推理任务中因缺乏对失败原因的系统性分析而导致的性能瓶颈问题。现有提示优化方法通常仅分析单次执行轨迹中的孤立错误,或在不同示例间比较提示变体,无法捕捉同一输入下成功与失败之间的推理过程差异。其解决方案的关键在于提出 ContraPrompt,通过引入“双次推理轨迹分析”(dyadic reasoning trace analysis)——即对比模型在相同输入和基础提示下,一次失败与一次基于反馈重试后成功的完整链式思维(chain-of-thought)轨迹差异,从而提取出可优化的推理策略信号。该方法利用一个仪器化的代理重试循环自动收集对比数据,无需人工标注,并将提取的规则组织为输入感知决策树,实现更精准的提示动态调整,在多个基准测试中显著优于现有最优方法 GEPA。
链接: https://arxiv.org/abs/2604.17937
作者: Rishav Rishav,Pushpak Pujari,Pushpendre Rastogi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Prompt optimization methods either analyze individual failures in isolation or compare prompt variants across examples, operating on single execution traces with no access to the reasoning process distinguishing success from failure on the same input. We introduce ContraPrompt, built on the observation that when a model fails but succeeds on a retry with feedback, the difference between its two chain-of-thought traces constitutes an optimization signal not captured by prior methods. Unlike prior contrastive methods, we compare complete intermediate reasoning processes: the two traces share model, input, and base prompt, so remaining differences reflect reasoning strategy and appended error feedback – we call this dyadic reasoning trace analysis. The multi-attempt solving phase is an instrumented agentic retry loop that generates contrastive data automatically without human annotation. Extracted rules are organized into an input-aware decision tree routing instructions by observable input characteristics. On four reasoning and compliance benchmarks, ContraPrompt outperforms GEPA (Agrawal et al., 2026) on all four, with absolute gains of +8.29 pp on HotPotQA (+20.8% rel.), +2.21 pp on GDPR-Bench (+18.2% rel.), +7.14 pp on GPQA Diamond (+10.6% rel.), and +0.74 pp on BBH (+0.85% rel.). Ablations confirm dyadic trace contrastivity is the critical component, with a -16% relative average drop upon its removal. On 53 EvalSet black-box optimization problems, ContraPrompt beats GEPA on 11, ties on 41, and loses on 1 at equal budget. On FiNER-139 financial named entity recognition (Loukas et al., 2022), ContraPrompt achieves +7.77 pp over the unoptimized baseline (+11.6% rel.) and +1.94 pp over GEPA (+2.66% rel.), with branch conditions aligning with standard US GAAP financial-instrument categories.
[AI-49] How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
【速读】:该论文旨在解决Transformer推理过程中键值(Key-Value, KV)缓存压缩的理论边界问题,即在多步推理任务中,KV缓存可被压缩到何种程度而不导致推理性能下降。其核心挑战在于量化压缩比与模型深度(depth)之间的关系,并揭示不同缓存策略对错误概率的影响机制。解决方案的关键在于引入“k-跳指针追踪”(k-hop pointer chasing)建模框架,在共享大小为s的KV缓存、注意力维度m、H个头、p-bit精度等约束下,通过三个主要成果建立理论界限:(1) 提出并部分证明了乘积式深度下界(product depth lower bound),结合窗口化指针加倍(windowed pointer doubling)算法实现匹配上界;(2) 发现带宽屏障(bandwidth barrier)现象,表明当注意力空间复杂度 Hmp ≳ log n 时,传统基于窗口可区分性的分析方法无法突破 ⌈k/s⌉ 的限制;(3) 揭示自适应局部性尊重缓存相比盲目缓存具有指数级更优的错误概率控制能力(Pr[E] = s/n vs. (s/(n−T))^T + 2T³/n),从而解释为何重热点驱逐策略在实际多跳推理中表现更优。
链接: https://arxiv.org/abs/2604.17935
作者: Xiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注:
Abstract:The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through k-hop pointer chasing on n tokens under a shared KV cache of size s, attention dimension m, H heads, p-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer (n ≥ 4k, s ≤ √n/4) requires depth L = Ω(⌈k/s⌉ · ⌈log₂ n/(Hmp)⌉), and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound L = O(min(k, ⌈k/s⌉ log s) · log n/(mp)) via windowed pointer doubling, and a max-bound L = Ω(max(⌈k/s⌉, log n/(Hmp))). Closing the conjecture amounts to upgrading max to product. (2) Bandwidth barrier. The product bound binds only when Hmp ≲ log n. Any lower bound provable via per-window distinguishability counting (including reachability, bandwidth, and combinations) cannot exceed ⌈k/s⌉ once Hmp ≥ log₂ n. Breaking this requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs oblivious error scaling. Under random cache over T = ⌈log₂ k⌉ doubling stages, oblivious caches give Pr[E] ≤ (s/(n−T))^T + 2T³/n (exponential in T), while adaptive locality-respecting caches achieve Pr[E] = s/n exactly, independent of T. The Ω((n/s)^(T−1)) separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.
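上界证明所用的指针加倍(pointer doubling)思想可以直接演示:每一轮将指针映射与自身复合,使剩余跳数折半,因此 k 跳只需 O(log k) 轮;此处省略了论文中与缓存窗口相关的细节,仅为说明性草图。

```python
def k_hop_naive(ptr, start, k):
    # k sequential pointer dereferences: one "reasoning step" per hop.
    for _ in range(k):
        start = ptr[start]
    return start

def k_hop_doubling(ptr, start, k):
    # Pointer doubling: square the pointer map once per round, applying
    # it to the result when the corresponding bit of k is set, so a
    # Transformer-like computation needs O(log k) layers rather than k.
    jump = list(ptr)              # jump[i] = where i lands after 2^r hops
    result, rounds = start, 0
    while k:
        if k & 1:
            result = jump[result]
        jump = [jump[j] for j in jump]   # compose the map with itself
        k >>= 1
        rounds += 1
    return result, rounds

ptr = [3, 0, 1, 4, 2, 3]  # a toy pointer array over n = 6 tokens
for k in range(1, 20):
    assert k_hop_doubling(ptr, 0, k)[0] == k_hop_naive(ptr, 0, k)
```

例如 k = 8 时朴素法需要 8 步,而加倍法只用 4 轮(即 k 的二进制位数),这正是深度界中 log 因子的来源。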
[AI-50] LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent
【速读】:该论文旨在解决生成式 AI(Generative AI)在深度研究任务中应用时,基于强化学习(Reinforcement Learning, RL)的智能体训练面临的核心挑战:一是人工构造的合成数据难以激发真实世界搜索能力;二是训练过程中依赖真实世界搜索会导致不稳定性和高昂成本,限制了智能体强化学习(Agentic RL)的可扩展性。解决方案的关键在于提出 LiteResearcher 训练框架,通过构建一个模拟真实世界搜索动态的轻量级虚拟环境(lite virtual world),实现持续优化的训练机制,使得小型搜索智能体能够超越大型开源和商业模型(如 Tongyi DeepResearch 和 Claude-4.5 Sonnet),并在 GAIA 和 Xbench 等基准测试中分别取得 71.3% 和 78.0% 的开源最优性能,验证了可扩展的 RL 训练是深度研究智能体发展的关键驱动力。
链接: https://arxiv.org/abs/2604.17931
作者: Wanli Li,Bince Qu,Bo Pan,Jianyu Zhang,Zheng Liu,Pan Zhang,Wei Chen,Bo Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. Under review
Abstract:Reinforcement Learning (RL) has emerged as a powerful training paradigm for LLM-based agents. However, scaling agentic RL for deep research remains constrained by two coupled challenges: hand-crafted synthetic data fails to elicit genuine real-world search capabilities, and real-world search dependency during RL training introduces instability and prohibitive cost, which limits the scalability of Agentic RL. LiteResearcher is a training framework that makes Agentic RL scalable: by constructing a lite virtual world that mirrors real-world search dynamics, we enable a continuously improving training recipe that empowers a tiny search agent to outperform large-scale open-source and commercial models (e.g., Tongyi DeepResearch and Claude-4.5 Sonnet). Specifically, on common benchmarks such as GAIA and Xbench, our LiteResearcher-4B achieves open-source state-of-the-art results of 71.3% and 78.0% respectively, demonstrating that scalable RL training is a key enabler for Deep Research Agents.
[AI-51] HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment ACL2026
【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Reward, RLVR)在低资源场景下因熵崩溃(entropy collapse)导致探索能力下降、推理性能显著劣化的问题。其解决方案的关键在于提出了一种面向少样本RLVR的框架——混合域熵动态对齐(Hybrid-domain Entropy dynamics ALignment, HEAL),该框架首先选择性引入高价值通用领域数据以促进多样化探索,进而设计熵动态对齐(Entropy Dynamics Alignment, EDA)机制,通过匹配目标域与通用域之间的轨迹级熵动态变化(包括熵幅值和细粒度波动),有效缓解熵崩溃并引导策略从通用域中学习更丰富的探索行为,从而在仅使用32个目标域样本的情况下即达到甚至超越使用1000个样本的全量训练效果。
链接: https://arxiv.org/abs/2604.17928
作者: Zhanyu Liu,Qingguo Hu,Ante Wang,Chenqing Liu,Zhishang Xiang,Hui Li,Delai Qiu,Jinsong Su
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 Main Conference
Abstract:Reinforcement Learning with Verifiable Reward (RLVR) has proven effective for training reasoning-oriented large language models, but existing methods largely assume high-resource settings with abundant training data. In low-resource scenarios, RLVR is prone to more severe entropy collapse, which substantially limits exploration and degrades reasoning performance. To address this issue, we propose Hybrid-domain Entropy dynamics ALignment (HEAL), a framework tailored for few-shot RLVR. HEAL first selectively incorporates high-value general-domain data to promote more diverse exploration. Then, we introduce Entropy Dynamics Alignment (EDA), a reward mechanism that aligns trajectory-level entropy dynamics between the target and general domains, capturing both entropy magnitude and fine-grained variation. Through this alignment, EDA not only further mitigates entropy collapse but also encourages the policy to acquire more diverse exploration behaviors from the general domain. Experiments across multiple domains show that HEAL consistently improves few-shot RLVR performance. Notably, using only 32 target-domain samples, HEAL matches or even surpasses full-shot RLVR trained with 1K target-domain samples.
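以下是按摘要描述自拟的 EDA 风格奖励草图(函数名与权重系数均为说明性假设,并非论文实现):同时惩罚目标域与通用域轨迹在熵幅值与逐步熵变化(细粒度波动)两方面的差异,对齐时奖励为 0,熵崩溃时奖励为负。

```python
import math

def step_entropies(prob_seqs):
    # Per-step Shannon entropy of a trajectory's token distributions.
    return [-sum(p * math.log(p) for p in dist if p > 0) for dist in prob_seqs]

def eda_reward(target_traj, general_traj, w_mag=1.0, w_var=1.0):
    # Penalize mismatch in both entropy magnitude and step-to-step
    # entropy variation between the target and general domains.
    ht, hg = step_entropies(target_traj), step_entropies(general_traj)
    n = min(len(ht), len(hg))
    ht, hg = ht[:n], hg[:n]
    mag_gap = sum(abs(a - b) for a, b in zip(ht, hg)) / n
    dt = [ht[i + 1] - ht[i] for i in range(n - 1)]
    dg = [hg[i + 1] - hg[i] for i in range(n - 1)]
    var_gap = sum(abs(a - b) for a, b in zip(dt, dg)) / max(n - 1, 1)
    return -(w_mag * mag_gap + w_var * var_gap)

uniform = [0.25] * 4                 # high-entropy (exploratory) step
peaked = [0.97, 0.01, 0.01, 0.01]    # near-collapsed step
aligned = eda_reward([uniform, peaked], [uniform, peaked])
collapsed = eda_reward([peaked, peaked], [uniform, peaked])
assert aligned == 0.0 and collapsed < aligned
```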
[AI-52] Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought
【速读】:该论文旨在解决长链式思维(Chain-of-Thought, CoT)推理模型在多次尝试中如何有效利用验证反馈以提升最终成功率的问题。具体而言,模型在最多K次尝试中逐步修正错误并构建更优解,但若直接按每次尝试的通过/失败结果加权更新策略,会导致梯度偏差,从而影响训练稳定性与性能。解决方案的关键在于提出校准尝试级(Calibrated Attempt-Level, CAL)GRPO方法,通过设计一种无偏梯度估计的权重策略,在保持低方差的同时,合理整合每轮尝试的奖励信号,从而显著提升Verification@K(即模型在第K次尝试前成功解决问题的概率)的性能表现。
链接: https://arxiv.org/abs/2604.17912
作者: Muhammed Emrullah Ildiz,Halil Alperen Gozeten,Ege Onur Taga,Samet Oymak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages
Abstract:State-of-the-art reasoning models utilize long chain-of-thought (CoT) to solve increasingly complex problems using more test-time computation. In this work, we explore a long CoT setting where the model makes up to K successive attempts at solving a problem, in which each attempt is allowed to build on earlier ones after the model receives hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts. We study optimizing the Verification@K reward (the model succeeds by the K-th attempt) and show that naively weighting the attempts by their pass/fail outcomes yields biased gradients. We introduce Calibrated Attempt-Level (CAL) GRPO by devising a weighting strategy to obtain unbiased gradients while maintaining small variance. Our theory reveals how incorporating per-attempt rewards influences training and the eventual Verification@K performance. Experiments, baselines, and ablations on synthetic and real data corroborate our theory and the benefits of CAL-GRPO over vanilla GRPO as well as naive weighting.
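Verification@K 目标本身有封闭形式:把 p_i 视为“前序尝试均失败的条件下第 i 次尝试成功的概率”,则 K 次内成功的概率为 1 − ∏(1−p_i)。下述草图仅演示该目标并用蒙特卡洛模拟核对闭式解,不涉及论文的 CAL 加权策略本身:

```python
import random

def verification_at_k(p_attempts):
    # Probability the model succeeds by the K-th attempt, where p_attempts[i]
    # is the success probability of attempt i given earlier failures.
    fail = 1.0
    for p in p_attempts:
        fail *= 1.0 - p
    return 1.0 - fail

def simulate(p_attempts, trials=200_000, seed=0):
    # Monte-Carlo check: count runs where any attempt succeeds.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if any(rng.random() < p for p in p_attempts):
            wins += 1
    return wins / trials

# e.g. verifier feedback makes later retries more likely to succeed
p = [0.2, 0.35, 0.5]
exact = verification_at_k(p)   # 1 - 0.8 * 0.65 * 0.5 = 0.74
assert abs(simulate(p) - exact) < 0.01
```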
[AI-53] Physics-Informed Causal MDPs for Sequential Constraint Repair in Engineering Simulation Pipelines
【速读】:该论文旨在解决在具有大规模二值状态空间的约束马尔可夫决策过程(Constrained MDPs, CMDPs)中,离策略学习面临的根本性矛盾:因果识别过渡动态需要结构假设,而样本高效策略学习则依赖状态空间压缩。其解决方案的关键在于提出PI-CMDP框架,该框架基于生命周期排序假设(Lifecycle Ordering Assumption, LOA),使约束依赖关系形成分层有向无环图(Layered DAG)。核心创新为“识别-压缩-估计”三阶段管道:(i) 通过LOA实现跨层因果边权重的后门识别,并在LOA不成立时提供形式化的部分识别边界;(ii) 利用马尔可夫抽象,在层优先正则性和交换性条件下将状态基数从2WL压缩至(W+1)L;(iii) 设计物理引导的双重稳健估计器,在物理先验优于学习模型时保持无偏并降低方差常数。该方法在工程仿真流水线约束修复任务上验证有效,显著提升修复成功率且减少级联故障率。
链接: https://arxiv.org/abs/2604.17910
作者: Chuhan Qiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Off-policy learning in constrained MDPs with large binary state spaces faces a fundamental tension: causal identification of transition dynamics requires structural assumptions, while sample-efficient policy learning requires state-space compression. We introduce PI-CMDP, a framework for CMDPs whose constraint dependencies form a layered DAG under a Lifecycle Ordering Assumption (LOA). We propose an Identify-Compress-Estimate pipeline: (i) Identify: LOA enables backdoor identification of causal edge weights for cross-layer pairs, with formal partial-identification bounds when LOA is violated; (ii) Compress: a Markov abstraction compresses state cardinality from 2^(WL) to (W+1)^L under layer-priority regularity and exchangeability; and (iii) Estimate: a physics-guided doubly-robust estimator remains unbiased and reduces the variance constant when the physics prior outperforms a learned model. We instantiate PI-CMDP on constraint repair in engineering simulation pipelines. On the TPS benchmark (4,206 episodes), PI-CMDP achieves 76.2% repair success rate with only 300 training episodes versus 70.8% for the strongest baseline (+5.4 pp), narrowing to +2.8 pp (83.4% vs. 80.6%) in the full-data regime, while substantially reducing cascade failure rates. All improvements are consistent across 5 independent seeds (paired t-test p < 0.02).
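摘要中 Compress 一步的状态压缩(2^(WL) → (W+1)^L)可用几行代码核对:Markov 抽象只保留每层被满足约束的个数,而不区分具体是哪些约束被满足(示例维度 W、L 为自拟):

```python
from itertools import product

W, L = 3, 2   # W binary constraint flags per layer, L lifecycle layers

def abstract_state(flat_state):
    # Markov abstraction: keep only the NUMBER of satisfied flags per
    # layer, discarding which particular flags are satisfied.
    return tuple(sum(flat_state[l * W:(l + 1) * W]) for l in range(L))

full = set(product([0, 1], repeat=W * L))
compressed = {abstract_state(s) for s in full}
assert len(full) == 2 ** (W * L)          # 64 concrete states
assert len(compressed) == (W + 1) ** L    # 16 abstract states
```

抽象可行的前提正是摘要所述的层内可交换性(exchangeability):同层各约束统计上同质时,计数即充分统计量。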
[AI-54] LoReC: Rethinking Large Language Models for Graph Data Analysis
【速读】:该论文旨在解决当前GraphLLM(基于大语言模型的图学习)范式中,直接利用大语言模型(Large Language Models, LLMs)进行图相关任务预测时效果不佳的问题,其核心挑战在于LLMs对图结构信息处理能力有限且易忽略图特征。解决方案的关键在于提出一种名为LoReC(Look, Remember, and Contrast)的即插即用方法,通过三个阶段增强LLM对图数据的理解:(1) Look阶段重新分配注意力机制以聚焦图结构信息;(2) Remember阶段将图信息重新注入前馈网络(Feed-Forward Network, FFN)以保留图特征;(3) Contrast阶段修正解码过程中生成的原始logits,从而有效提升模型在多样化图数据上的性能表现。
链接: https://arxiv.org/abs/2604.17897
作者: Hongyu Zhan,Qixin Wang,Yusen Tan,Haitao Yu,Jingbo Zhou,Shuai Chen,Jia Li,Xiao Tan,Jun Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of Large Language Models (LLMs) has fundamentally reshaped the way we interact with graphs, giving rise to a new paradigm called GraphLLM. As revealed in recent studies, graph learning can benefit from LLMs. However, we observe limited benefits when we directly utilize LLMs to make predictions for graph-related tasks within GraphLLM paradigm, which even yields suboptimal results compared to conventional GNN-based approaches. Through in-depth analysis, we find this failure can be attributed to LLMs’ limited capability for processing graph data and their tendency to overlook graph information. To address this issue, we propose LoReC (Look, Remember, and Contrast), a novel plug-and-play method for GraphLLM paradigm, which enhances LLM’s understanding of graph data through three stages: (1) Look: redistributing attention to graph; (2) Remember: re-injecting graph information into the Feed-Forward Network (FFN); (3) Contrast: rectifying the vanilla logits produced in the decoding process. Extensive experiments demonstrate that LoReC brings notable improvements over current GraphLLM methods and outperforms GNN-based approaches across diverse datasets. The implementation is available at this https URL.
[AI-55] Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在通过大规模模仿学习训练时,缺乏对硬性物理约束(如障碍物避让或运动学可行性)的显式监督,导致模型需从示范数据中隐式推断物理可行行为的几何结构,从而影响其物理可靠性和泛化能力的问题。解决方案的关键在于引入一个基于几何的显式可行性目标(geometry-grounded feasibility objective),并将其整合到基于扩散模型的VLA策略训练阶段中,以提供结构化的可行性引导,从而提升模型在低数据场景下的学习效率与任务性能。
链接: https://arxiv.org/abs/2604.17896
作者: Yubai Wei,Chen Wu,Hashem Haghbayan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 pages, 5 figures
Abstract:Vision-Language-Action (VLA) models map multimodal inputs directly to robot actions and are typically trained through large-scale imitation learning. While this paradigm has shown strong performance, prevailing VLA training procedures do not explicitly supervise hard physical constraints such as obstacle avoidance or kinematic feasibility. As a result, the geometric structure underlying physically feasible behavior must be inferred only implicitly from demonstrations. In this paper, we study whether introducing explicit feasibility supervision can provide effective structured guidance for VLA policies. We formulate a simple geometry-grounded feasibility objective and integrate it into the training stage of a diffusion-based VLA policy. To evaluate this idea systematically, we use obstacle-aware manipulation as a controlled probe of geometry-dependent physical feasibility. Empirical results show that augmenting VLA training with feasibility supervision improves both physical reliability and overall task performance, while also enhancing learning efficiency in the low-data regime. These findings indicate that explicit feasibility signals can effectively complement imitation-based VLA learning, highlighting their potential for developing more reliable VLA policies.
[AI-56] LEPO: Latent Reasoning Policy Optimization for Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在潜空间推理(latent reasoning)中因缺乏随机性而导致的确定性推理问题,即传统方法在不引入随机采样时会陷入单一推理路径,从而限制了探索能力并削弱与强化学习(Reinforcement Learning, RL)的兼容性。解决方案的关键在于通过Gumbel-Softmax机制可控地注入随机性,恢复LLMs的探索能力,并在此基础上提出Latent Reasoning Policy Optimization(LEPO)框架——该框架直接对连续潜表示应用强化学习,在rollout阶段保持随机性以实现多样化轨迹采样,同时在优化阶段构建统一的梯度估计机制,联合优化潜空间表示与离散token,从而显著提升推理多样性与性能。
链接: https://arxiv.org/abs/2604.17892
作者: Yuyan Zhou,Jiarui Yu,Hande Dong,Zhezheng Hao,Hong Wang,Jianqing Zhang,Qiang Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge the gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose Latent Reasoning Policy Optimization (LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in optimization stage, LEPO constructs a unified gradient estimation for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.
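Gumbel-Softmax 注入随机性的机制可用如下草图说明(`latent_step` 的结构为说明性假设,并非论文架构):对 logits 加 Gumbel 噪声后做温度为 tau 的 softmax,得到可微的近似 one-hot 权重,再用其加权混合词元嵌入作为下一步潜表示,从而使相同输入的多次 rollout 产生不同轨迹。

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    # Sample a relaxed one-hot vector: perturb logits with Gumbel noise,
    # then softmax at temperature tau. As tau -> 0 this approaches a
    # discrete sample; larger tau keeps a soft, differentiable mixture.
    g = [-math.log(-math.log(rng.random())) for _ in logits]
    z = [(l + n) / tau for l, n in zip(logits, g)]
    m = max(z)
    exp = [math.exp(v - m) for v in z]
    s = sum(exp)
    return [e / s for e in exp]

def latent_step(logits, embeddings, tau):
    # One stochastic latent-reasoning step (illustrative): feed back a
    # noise-weighted mixture of token embeddings instead of the
    # deterministic argmax embedding.
    w = gumbel_softmax(logits, tau)
    dim = len(embeddings[0])
    return [sum(w[i] * embeddings[i][d] for i in range(len(w))) for d in range(dim)]

random.seed(0)
logits = [2.0, 0.5, -1.0]
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a = latent_step(logits, emb, tau=0.5)
b = latent_step(logits, emb, tau=0.5)
assert a != b   # stochasticity restored: repeated rollouts can differ
```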
[AI-57] SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长链推理过程中易出现逻辑幻觉(logical hallucinations)和随机漂移(stochastic drifts)的问题。现有方法如无分类器引导(Classifier-Free Guidance, CFG)虽能提升指令遵循性,但静态实现常导致语义稀释和语言质量下降。其解决方案的关键在于提出SPREG(Structured Plan-guided Real-time Entropy Gating),该框架通过实时熵监测识别逻辑失败的“熵突变”信号,并采用自适应双阈值机制触发动态修复:利用历史高置信度状态合成参考分布,替代无效的零先验(null-priors),同时根据结构化推理阶段(如动作、观察)调节引导强度,从而在不损害流畅性的前提下将模型拉回稳定流形,显著提升推理准确性与稳定性。
链接: https://arxiv.org/abs/2604.17884
作者: Xuan Wang,Yu Ming,Xinhao Zhong,Xinyu Yu,Wenjie Wang,Shuai Chen,Wei Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are prone to logical hallucinations and stochastic drifts during long-chain reasoning. While Classifier-Free Guidance (CFG) can improve instruction adherence, standard static implementations often cause semantic dilution and linguistic degradation. We propose SPREG (Structured Plan-guided Real-time Entropy Gating), a lightweight inference-time framework for surgical error rectification. SPREG employs an adaptive dual-threshold mechanism to monitor real-time entropy, identifying sudden ``entropy spikes’’ as reliable indicators of logical failure. Upon detection, it triggers a dynamic repair by replacing uninformative null-priors with reference distributions synthesized from historical high-confidence states. By modulating guidance intensity according to structured reasoning stages (e.g., Action, Observation), SPREG steers the model back to a stable manifold without compromising fluency. Our experiments demonstrate significant gains, notably a 20.0% absolute accuracy improvement on AIME25, while effectively suppressing uncontrolled entropy drift in complex tasks.
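双阈值熵门控的核心逻辑可用几行代码示意(阈值数值为自拟的说明性假设):只有当逐词熵从低于 warn 的稳定水平突然越过 spike 时才触发修复,从而把“熵突变”与缓慢的熵漂移区分开。

```python
import math

def token_entropy(dist):
    # Shannon entropy of one step's next-token distribution.
    return -sum(p * math.log(p) for p in dist if p > 0)

def entropy_gate(entropies, warn=1.0, spike=2.0):
    # Dual-threshold gating: flag step t as a candidate logical failure
    # only when entropy jumps past `spike` right after being below
    # `warn` -- a sudden spike, not a gradual drift.
    events = []
    for t in range(1, len(entropies)):
        if entropies[t - 1] < warn and entropies[t] > spike:
            events.append(t)
    return events

# confident steps, then one sudden high-uncertainty step
stream = [0.3, 0.4, 0.2, 2.6, 0.5]
assert entropy_gate(stream) == [3]
```

在完整系统中,被标记的步骤才会触发分布修复;未触发时正常解码不受影响,这与摘要中“轻量、手术式干预”的定位一致。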
[AI-58] Periodic Steady-State Control of a Handkerchief-Spinning Task Using a Parallel Anti-Parallelogram Tendon-driven Wrist ICRA2026
【速读】:该论文旨在解决柔性物体(如中国传统手帕表演中的手帕)在非线性动力学、摩擦接触和边界约束条件下实现周期性稳态运动的控制难题。其解决方案的关键在于:首先设计了一种基于并联反平行四边形腱驱动结构的灵巧手腕,具备低惯量、90°全向旋转能力及解耦的滚转-俯仰感知;其次构建了面向控制的粒子-弹簧模型用于手帕的动力学抽象与策略评估,并结合高低层分层控制架构,在硬件实验中实现了约99%的展开率和高动态旋转下的指尖跟踪误差(RMSE = 2.88 mm),验证了控制导向建模与任务定制灵巧手腕协同作用可有效实现从静止到稳态的鲁棒过渡及对高度柔性物体的精确周期性操作。
链接: https://arxiv.org/abs/2604.17863
作者: Lei Liu,Haonan Zhang,Huahang Xu,Zefan Zhang,Lulu Chang,Lei Lv,Andrew Ross McIntosh,Kai Sun,Zhenshan Bing,Jiahong Dong,Fuchun Sun
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: ICRA2026
Abstract:Spinning flexible objects, exemplified by traditional Chinese handkerchief performances, demands periodic steady-state motions under nonlinear dynamics with frictional contacts and boundary constraints. To address these challenges, we first design an intuitive dexterous wrist based on a parallel anti-parallelogram tendon-driven structure, which achieves 90 degrees omnidirectional rotation with low inertia and decoupled roll-pitch sensing, and implement a high-low level hierarchical control scheme. We then develop a particle-spring model of the handkerchief for control-oriented abstraction and strategy evaluation. Hardware experiments validate this framework, achieving an unfolding ratio of approximately 99% and fingertip tracking error of RMSE = 2.88 mm in high-dynamic spinning. These results demonstrate that integrating control-oriented modeling with a task-tailored dexterous wrist enables robust rest-to-steady-state transitions and precise periodic manipulation of highly flexible objects. More visualizations: this https URL
[AI-59] On the Reliability of Computer Use Agents
【速读】:该论文旨在解决计算机使用代理(computer-use agents)在执行任务时表现出的不可靠性问题,即同一代理在相同任务条件下可能一次成功而另一次失败。研究表明,这种不可靠性主要源于三个因素:执行过程中的随机性、任务规范的模糊性以及代理行为的变异性。解决方案的关键在于:首先,在评估代理性能时应考虑重复执行场景以更真实地反映其可靠性;其次,允许代理通过交互方式消除任务规范中的歧义;最后,优先选择在多次运行中行为稳定的策略,从而提升整体任务成功率与一致性。
链接: https://arxiv.org/abs/2604.17849
作者: Gonzalo Gonzalez-Pumariega,Saaket Agashe,Jiachen Yang,Ang Li,Xin Eric Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages, 3 figures, 4 tables
Abstract:Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across executions. These findings suggest the need to evaluate agents under repeated execution, to allow agents to resolve task ambiguity through interaction, and to favor strategies that remain stable across runs.
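摘要强调的“重复执行下的可靠性”可用一个简单统计刻画(指标命名为自拟):对每个任务统计多次运行中“至少一次成功”与“每次都成功”的比例,二者之差即不稳定(flaky)任务数。

```python
def reliability_stats(runs):
    # Per-task outcomes across repeated executions (rows = runs,
    # columns = tasks). "any" is the optimistic pass@n-style view;
    # "all" measures how often the agent is actually reliable.
    n_tasks = len(runs[0])
    solved_any = sum(any(r[t] for r in runs) for t in range(n_tasks))
    solved_all = sum(all(r[t] for r in runs) for t in range(n_tasks))
    flaky = solved_any - solved_all
    return solved_any / n_tasks, solved_all / n_tasks, flaky

# three repeated executions over five tasks (1 = success)
runs = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
any_rate, all_rate, flaky = reliability_stats(runs)
assert (any_rate, all_rate, flaky) == (0.8, 0.2, 3)
```

这里 80% 的任务“曾经”成功,但只有 20% 每次都成功,正对应论文所述“单次成功不等于可靠”的现象。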
[AI-60] WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent
【速读】:该论文旨在解决当前自主网页代理在执行复杂任务时面临的两大挑战:一是因固定规划策略导致的动态交互与长程执行能力不足,二是推理过程中易产生幻觉(hallucination)的问题。解决方案的关键在于提出WebUncertainty框架,其核心创新包括两个部分:一是任务不确定性驱动的自适应规划机制(Task Uncertainty-Driven Adaptive Planning Mechanism),可根据环境未知性动态选择规划模式以增强适应性;二是动作不确定性驱动的蒙特卡洛树搜索(Action Uncertainty-Driven Monte Carlo Tree Search, MCTS)推理机制,通过引入置信度诱导的动作不确定性(Confidence-induced Action Uncertainty, ConActU)策略,量化了随机不确定性(aleatoric uncertainty, AU)和认知不确定性(epistemic uncertainty, EU),从而优化搜索路径并提升决策鲁棒性。实验表明,该框架在WebArena和WebVoyager基准上显著优于现有最先进方法。
链接: https://arxiv.org/abs/2604.17821
作者: Lingfeng Zhang,Yongan Sun,Jinpeng Hu,Hui Ma,Yang Ying,Kuien Liu,Zenglin Shi,Meng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large language models (LLMs) have empowered autonomous web agents to execute natural language instructions directly on real-world webpages. However, existing agents often struggle with complex tasks involving dynamic interactions and long-horizon execution due to rigid planning strategies and hallucination-prone reasoning. To address these limitations, we propose WebUncertainty, a novel autonomous agent framework designed to tackle dual-level uncertainty in planning and reasoning. Specifically, we design a Task Uncertainty-Driven Adaptive Planning Mechanism that adaptively selects planning modes to navigate unknown environments. Furthermore, we introduce an Action Uncertainty-Driven Monte Carlo tree search (MCTS) Reasoning Mechanism. This mechanism incorporates the Confidence-induced Action Uncertainty (ConActU) strategy to quantify both aleatoric uncertainty (AU) and epistemic uncertainty (EU), thereby optimizing the search process and guiding robust decision-making. Experimental results on the WebArena and WebVoyager benchmarks demonstrate that WebUncertainty achieves superior performance compared to state-of-the-art baselines.
[AI-61] Understanding Secret Leakage Risks in Code LLM s: A Tokenization Perspective ACL26
【速读】:该论文旨在解决生成式 AI(Generative AI)在代码辅助场景中因代码大语言模型(Code Large Language Models, CLLMs)对敏感信息(如代码密钥)的意外记忆而导致的安全泄露问题。其核心发现是,基于字节对编码(Byte-Pair Encoding, BPE)的分词机制会引入一种称为“无意义偏置”(gibberish bias)的现象:某些高字符级熵但低词元级熵的密钥反而最容易被CLLMs记忆。解决方案的关键在于识别出该偏置源于训练数据与秘密数据之间词元分布的偏移,并指出当前向更大词汇表演进的分词器设计趋势可能加剧此问题,从而为改进分词策略和开发针对性缓解措施提供理论依据。
链接: https://arxiv.org/abs/2604.17814
作者: Meifang Chen,Zhe Yang,Huang Nianchen,Yizhan Huang,Yichen Li,Zihan Li,Michael R. Lyu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 26 Findings
Abstract:Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. Amid the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to a notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret-memorization behavior, which we term "gibberish bias". Specifically, we identify that some secrets are among the easiest for CLLMs to memorize: they yield high character-level entropy but low token-level entropy. This paper then supports the claim with numerical data, identifying the root of the bias as the token-distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the ``larger vocabulary'' trend. To conclude, we discuss potential mitigation strategies and the broader implications for current tokenizer design.
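“字符级熵高、词元级熵低”这一分离(无意义偏置的触发条件)可用玩具例子复现;下面的贪心最长匹配分词器与词表均为自拟示意,并非真实 BPE 实现:

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    # Empirical Shannon entropy (bits) over a symbol sequence.
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())

def tokenize(text, vocab):
    # Greedy longest-match tokenizer standing in for BPE (toy):
    # common substrings merge into single tokens, others stay chars.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# a secret that looks random character-by-character, yet whose pieces
# happen to be merged tokens (hypothetical vocabulary)
vocab = {"ghp_", "x9Qa", "xQ", "a9"}
secret = "ghp_x9Qax9Qa"
chars, toks = list(secret), tokenize(secret, vocab)
assert shannon_entropy(chars) > shannon_entropy(toks)
```

该密钥按字符看近乎均匀随机(高字符级熵),但分词后只剩少数重复词元(低词元级熵),正是摘要中最易被记忆的那类密钥。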
[AI-62] Party Autonomy in Determining the Law Applicable to Non-contractual Obligations concerning Cross-Border Data Transfers
【速读】:该论文旨在解决跨境数据传输背景下,因数据泄露引发民事责任时适用法律的确定难题,尤其针对非合同义务领域中传统国际私法方法(依赖物理位置识别)失效的问题。其解决方案的关键在于利用“私人秩序”(private ordering)机制,即通过当事人自主选择的合同义务适用法律来确定非合同义务的准据法,从而克服物理位置难以识别的困境,并增强法律适用的可预见性与协调性。
链接: https://arxiv.org/abs/2604.17806
作者: Yuki Okamura,Ren Yatsunami,Kumiko Kameishi,Oliver Posani,Soma Araoka,Miho Ikeda,Makiko Aoyagi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 26 pages, 3 figures, 2 tables
Abstract:(1) Cross-border data transfers have become a matter of daily occurrence against the backdrop of the development of cloud computing and artificial intelligence. Consequently, where a data leak gives rise to civil liability, the determination of that liability inevitably assumes an international dimension involving foreign elements. (2) As is starkly demonstrated by secret sharing technology in cloud computing, fragments of data may be presumed to be distributed across multiple jurisdictions on a global scale. This renders traditional private international law measures – predicated on the identification of a physical location – inadequate for the purposes of determining the applicable law, a difficulty that is particularly acute in relation to non-contractual obligations. (3) Bearing in mind the typical scenario encountered in practice – in which a Data Subject brings a claim for damages against a SaaS (Software as a Service) provider, which in turn seeks recourse against an IaaS (Infrastructure as a Service) or PaaS (Platform as a Service) provider – a characteristic feature of such cases is the concurrence of contractual and non-contractual obligations. Taking this feature into account, it is possible to determine the applicable law governing non-contractual obligations through party autonomy – by aligning it with the law governing the contractual obligation as selected by the parties, an approach that may be termed private ordering. This serves to overcome the difficulties associated with the identification of a physical location and, at the same time, contributes to ensuring the foreseeability of the parties.
[AI-63] Ranking Abuse via Strategic Pairwise Data Perturbations
【速读】:该论文旨在解决基于最大似然估计(Maximum Likelihood Estimation, MLE)的配对排序系统(如Bradley-Terry模型)在面对恶意数据操纵时的鲁棒性不足问题。其核心挑战在于如何识别并利用有限数量的策略性投票者,以最小扰动预算实现对全局排名的显著改变。解决方案的关键是提出一种自适应子集选择攻击(Adaptive Subset Selection Attack, ASSA),将操纵任务建模为约束组合优化问题,并通过高效搜索高影响力扰动来逼近最优攻击策略。实验表明,MLE排名存在明显的相变行为——当扰动超出阈值时,少量策略性投票即可颠覆整体排序,且ASSA在受限预算下显著优于随机与贪心基线方法,揭示了MLE机制对结构化扰动的高度敏感性,凸显了构建更鲁棒聚合机制的必要性。
链接: https://arxiv.org/abs/2604.17805
作者: Junyi Yao,Zihao Zheng,Jiayu Long
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
Abstract:Pairwise ranking systems based on Maximum Likelihood Estimation (MLE), such as the Bradley-Terry model, are widely used to aggregate preferences from pairwise comparisons. However, their robustness under strategic data manipulation remains insufficiently understood. In this paper, we study the vulnerability of MLE-based ranking systems to adversarial perturbations. We formulate the manipulation task as a constrained combinatorial optimization problem and propose an Adaptive Subset Selection Attack (ASSA) to efficiently identify high-impact perturbations. Experimental results on both synthetic data and real-world election datasets show that MLE-based rankings exhibit a sharp phase-transition behavior: beyond a small perturbation budget, a limited number of strategic voters can significantly alter the global ranking. In particular, our method consistently outperforms random and greedy baselines under constrained budgets. These findings reveal a fundamental sensitivity of MLE-based ranking mechanisms to structured perturbations and highlight the need for more robust aggregation methods in collective decision-making systems.
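As background on the model under attack, the Bradley-Terry MLE can be computed with the classic minorization-maximization updates. The toy below shows how flipping a handful of pairwise outcomes reorders the global ranking; it illustrates the mechanism only and is not the paper's ASSA procedure:

```python
# Bradley-Terry MLE via MM updates on a tiny win matrix, then the same
# fit after a few comparisons are adversarially flipped. Toy data.

def bt_mle(wins, n_items, iters=200):
    """wins[i][j] = number of times item i beat item j."""
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            w_i = sum(wins[i][j] for j in range(n_items) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p

# 3 items, 10 comparisons per pair; item 0 starts on top.
wins = [[0, 6, 6],
        [4, 0, 6],
        [4, 4, 0]]
ranking = sorted(range(3), key=lambda i: -bt_mle(wins, 3)[i])

# Flip 3 of item 0's wins over item 1 into losses.
wins[0][1] -= 3
wins[1][0] += 3
new_ranking = sorted(range(3), key=lambda i: -bt_mle(wins, 3)[i])
```

With only 3 of 30 comparisons flipped, the top of the ranking changes, the kind of sharp sensitivity the paper's phase-transition results quantify.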
[AI-64] Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)后训练阶段所需高质量、多样化对话数据稀缺且获取成本高昂的问题,尤其在低资源领域和多轮对话场景中更为突出。现有方法如众包或合成生成往往导致数据质量低或多样性不足。其解决方案的关键在于提出“对抗竞技场”(Adversarial Arena),将数据生成建模为对抗性任务:攻击者团队设计挑战性提示(prompts),防御者团队生成回应,通过多组团队间的交互竞争自然催生复杂且多样化的对话数据。该方法在网络安全对齐任务上验证有效,最终生成19,683条多轮对话,并显著提升模型在安全代码生成上的性能。
链接: https://arxiv.org/abs/2604.17803
作者: Prasoon Goyal,Sattvik Sahai,Michael Johnston,Hangjie Shi,Yao Lu,Shaohua Liu,Anna Rumshisky,Rahul Gupta,Anna Gottardi,Desheng Zhang,Lavina Vaz,Leslie Ball,Lucy Hu,Luke Dai,Samyuth Sagi,Maureen Murray,Sankaranarayanan Ananthakrishnan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 3rd DATA-FM workshop @ ICLR 2026
Abstract:Post-training Large Language Models requires diverse, high-quality data which is rare and costly to obtain, especially in low resource domains and for multi-turn conversations. Common solutions are crowdsourcing or synthetic generation, but both often yield low-quality or low-diversity data. We introduce Adversarial Arena for building high quality conversational datasets by framing data generation as an adversarial task: attackers create prompts, and defenders generate responses. This interactive competition between multiple teams naturally produces diverse and complex data. We validated this approach by conducting a competition with 10 academic teams from top US and European universities, each building attacker or defender bots. The competition, focused on safety alignment of LLMs in cybersecurity, generated 19,683 multi-turn conversations. Fine-tuning an open-source model on this dataset produced an 18.47% improvement in secure code generation on CyberSecEval-Instruct and 29.42% improvement on CyberSecEval-MITRE.
[AI-65] AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)策略在精密操作任务中因单一统一动作空间导致的优化失衡问题:即宏观运动主导学习过程,抑制了对微小但关键的执行修正信号的捕捉。其解决方案的关键在于提出一种分层框架AnchorRefine,将VLA动作建模分解为两个阶段——由锚点规划器(anchor planner)生成粗粒度运动骨架,再由残差精修模块(refinement module)实时校正执行偏差,从而实现全局轨迹组织与局部执行修正的解耦优化;此外,引入决策感知夹爪精修机制以更好地建模夹爪控制的离散性和边界敏感特性,显著提升了模拟与真实机器人环境中的任务成功率。
链接: https://arxiv.org/abs/2604.17787
作者: Tingzheng Jia,Kan Guo,Lanping Qian,Yongli Hu,Daxin Tian,Guixian Qu,Chunmian Lin,Baocai Yin,Jiapu Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Precision-critical manipulation requires both global trajectory organization and local execution correction, yet most vision-language-action (VLA) policies generate actions within a single unified space. This monolithic formulation forces macro-level transport and micro-level refinement to be optimized under the same objective, causing large motions to dominate learning while suppressing small but failure-critical corrective signals. In contrast, human manipulation is structured by global movement planning together with continuous local adjustment during execution. Motivated by this principle, we propose AnchorRefine, a hierarchical framework that factorizes VLA action modeling into trajectory anchor and residual refinement. The anchor planner predicts a coarse motion scaffold, while the refinement module corrects execution-level deviations to improve geometric and contact precision. We further introduce a decision-aware gripper refinement mechanism to better capture the discrete and boundary-sensitive nature of gripper control. Experiments on LIBERO, CALVIN, and real-robot tasks demonstrate that AnchorRefine consistently improves both regression-based and diffusion-based VLA backbones, yielding gains of up to 7.8% in simulation success rate and 18% in real-world success rate.
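The anchor-plus-residual factorization can be sketched in one dimension; the numbers below are invented for illustration and stand in for the planner's and refiner's outputs:

```python
# Compose a coarse anchor trajectory with small residual corrections,
# the structural idea behind AnchorRefine (toy 1-D values).

def compose(anchor_traj, residuals, scale=1.0):
    return [a + scale * r for a, r in zip(anchor_traj, residuals)]

anchor = [0.0, 0.5, 1.0, 1.5]           # macro-level transport toward the goal
residual = [0.0, -0.02, 0.03, -0.01]    # micro-level execution corrections
refined = compose(anchor, residual)
```

Optimizing the two terms under separate objectives is what keeps the small corrective signal from being drowned out by the large transport motion.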
[AI-66] Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在市场环境中因提示优化(prompt optimization)而可能引发的隐性共谋行为问题,即如何理解并检测自主多智能体系统中通过自动提示调优所涌现的协同策略。其解决方案的关键在于提出一种元学习循环(meta-learning loop),其中LLM代理参与双头垄断市场,同时一个LLM元优化器迭代地改进共享的战略指导(meta-prompt)。实验表明,这种机制能够使代理发现稳定且高质量的隐性共谋策略,并在未见过的测试市场中泛化,揭示出系统性的协调机制,从而为评估和防范自主AI代理中的算法共谋风险提供了新的分析框架与实证基础。
链接: https://arxiv.org/abs/2604.17774
作者: Yingtao Tian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM agents in markets present algorithmic collusion risks. While prior work shows LLM agents reach supracompetitive prices through tacit coordination, existing research focuses on hand-crafted prompts. The emerging paradigm of prompt optimization necessitates new methodologies for understanding autonomous agent behavior. We investigate whether prompt optimization leads to emergent collusive behaviors in market simulations. We propose a meta-learning loop where LLM agents participate in duopoly markets and an LLM meta-optimizer iteratively refines shared strategic guidance. Our experiments reveal that meta-prompt optimization enables agents to discover stable tacit collusion strategies with substantially improved coordination quality compared to baseline agents. These behaviors generalize to held-out test markets, indicating discovery of general coordination principles. Analysis of evolved prompts reveals systematic coordination mechanisms through stable shared strategies. Our findings call for further investigation into AI safety implications in autonomous multi-agent systems.
[AI-67] When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias ACL2026
【速读】:该论文旨在解决当前视觉语言模型作为评判者(VLM-as-a-Judge)时存在的“信息量偏差”(informativeness bias)问题,即VLM在评估答案时往往忽视图像内容,倾向于选择信息量更大的回答,即使该回答与图像内容存在冲突。这种偏差显著降低了自动评估的可靠性。解决方案的关键在于提出BIRCH(Balanced Informativeness and CoRrectness with a Truthful AnCHor)评判范式:首先修正候选答案中与图像内容不一致的部分,生成一个基于图像真实性的基准版本,再以此为锚点对答案进行比较,从而将评判焦点从信息量转向图像 grounded 的正确性。实验表明,该方法可将信息量偏差降低最多17%,并带来最高达9.8%的性能提升。
链接: https://arxiv.org/abs/2604.17768
作者: Xiaohan Zou,Roshan Sridhar,Mohammadtaher Safarzadeh,Dan Roth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 Main Conference
Abstract:The reliability of VLM-as-a-Judge is critical for the automatic evaluation of vision-language models (VLMs). Despite recent progress, our analysis reveals that VLM-as-a-Judge often pays limited attention to the image when making decisions. Instead, they often blindly favor the more informative answer, even when they can recognize it conflicts with the image content. We call this problem informativeness bias, which significantly undermines judge reliability. To address it, we propose BIRCH (Balanced Informativeness and CoRrectness with a Truthful AnCHor), a judging paradigm that first corrects inconsistencies with the image content in candidate answers, and then compares the answers against this corrected version. This shifts the judge’s focus from informativeness to image-grounded correctness. Experiments on multiple models and benchmarks show that BIRCH reduces informativeness bias by up to 17%, resulting in performance gains of up to 9.8%. Our work reveals an overlooked but fundamental flaw in current VLM-as-a-Judge systems and highlights the need for more principled designs.
[AI-68] Community-Led AI Integration for Wildfire Risk Assessment: A Participatory AI Literacy and Explainability Integration (PALEI) Framework in Los Angeles CA
【速读】:该论文旨在解决传统火灾风险传播工具在高风险社区(如南加州城市地区)中因设计不透明、信息缺乏本地相关性及可访问性差而导致公众信任度低的问题。其核心解决方案是提出一种由社区主导的AI整合框架——参与式AI素养与可解释性融合(Participatory AI Literacy and Explainability Integration, PALEI),关键在于通过早期开展用户素养培育、价值对齐和共同评估,在预测模型部署前建立清晰、可访问且具地方情境适配性的风险沟通机制,从而提升居民对AI生成风险评分的信任与采纳意愿。该方法强调将本地化图像、邻里特定减灾建议和不确定性透明化作为设计核心,并最终开发出一款由用户与利益相关方共同设计的移动应用,使居民可通过扫描房屋特征获得可解释的风险评分与个性化建议,实现日常化的风险意识与准备能力提升。
链接: https://arxiv.org/abs/2604.17755
作者: Sanaz Sadat Hosseini,Mona Azarbayjani,Mohammad Pourhomayoun,Hamed Tabkhi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, This paper was accepted following peer review, presented at the ARCC-EAAE 2026 International Conference, Local Solutions for Global Issues, held in April 2026 in Atlanta, Georgia, USA, and will be published in the conference proceedings
Abstract:Climate-driven wildfires are intensifying, particularly in urban regions such as Southern California. Yet, traditional fire risk communication tools often fail to gain public trust due to inaccessible design, non-transparent outputs, and limited contextual relevance. These challenges are especially critical in high-risk communities, where trust depends on how clearly and locally information is presented. Neighborhoods such as Pacific Palisades, Pasadena, and Altadena in Los Angeles exemplify these conditions. This study introduces a community-led approach for integrating AI into wildfire risk assessment using the Participatory AI Literacy and Explainability Integration (PALEI) framework. PALEI emphasizes early literacy building, value alignment, and participatory evaluation before deploying predictive models, prioritizing clarity, accessibility, and mutual learning between developers and residents. Early engagement findings show strong acceptance of visual, context-specific risk communication, positive fairness perceptions, and clear adoption interest, alongside privacy and data security concerns that influence trust. Participants emphasized localized imagery, accessible explanations, neighborhood-specific mitigation guidance, and transparent communication of uncertainty. The outcome is a mobile application co-designed with users and stakeholders, enabling residents to scan visible property features and receive interpretable fire risk scores with tailored recommendations. By embedding local context into design, the tool becomes an everyday resource for risk awareness and preparedness. This study argues that user experience is central to ethical and effective AI deployment and provides a replicable, literacy-first pathway for applying the PALEI framework to climate-related hazards.
[AI-69] Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的运筹学(Operations Research, OR)自动化受限于手工设计的推理-执行工作流的问题,尤其在复杂OR任务中缺乏对问题解析、数学建模、求解器选择、代码生成及迭代调试等环节的自适应协调能力。解决方案的关键在于提出EvoOR-Agent框架,该框架将代理工作流表示为活动边(Activity-on-Edge, AOE)风格的网络结构,显式刻画工作流拓扑、执行依赖关系和替代推理路径;在此基础上,通过图媒介的路径条件重组、多粒度语义变异以及精英种群更新机制,实现代理架构与推理轨迹的协同演化,并引入知识库辅助的经验获取模块以注入可复用的OR实践,从而提升自动化优化的适应性与可解释性。
链接: https://arxiv.org/abs/2604.17708
作者: Jiahao Huang,Peilan Xu,Xiaoya Nan,Wenjian Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning–execution workflows. Complex OR tasks require adaptive coordination among problem interpretation, mathematical formulation, solver selection, code generation, and iterative debugging. To address this limitation, we propose EvoOR-Agent, a co-evolutionary framework for automated optimization. The framework represents agent workflows as activity-on-edge (AOE)-style networks, making workflow topology, execution dependencies, and alternative reasoning paths explicit. On this representation, the framework maintains an architecture graph and evolves a population of reasoning individuals through graph-mediated path-conditioned recombination, multi-granularity semantic mutation, and elitist population update. A knowledge-base-assisted experience-acquisition module further injects reusable OR practices into initialization and semantic variation. Empirical results on heterogeneous OR benchmarks show that the proposed framework consistently improves over zero-shot LLMs, fixed-pipeline OR agents, and representative evolutionary agent frameworks. Case studies and ablation analyses further indicate that explicit architecture evolution and graph-supported reasoning-trajectory search contribute to both performance improvement and structural interpretability. These results suggest that treating agent architectures and reasoning trajectories as evolvable objects provides an effective route toward adaptive and interpretable automated optimization.
[AI-70] WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference
【速读】:该论文旨在解决分布式设备-边缘推测解码(distributed device-edge speculative decoding)中因传统逐标记(token-level)验证策略导致的性能瓶颈问题,特别是在无线信道波动环境下,严格对齐引发过多拒绝,显著缩短可接受序列长度并增加交互轮次。解决方案的关键在于提出一种面向无线环境的语义验证机制(Wireless-Informed Semantic Verification, WISV),其核心创新是引入一个轻量级决策头(decision head)嵌入边缘侧目标大语言模型(LLM),通过融合高维隐藏状态表示与瞬时信道状态信息(CSI)来动态评估推测标记,从而实现更灵活、鲁棒的接受策略。此外,为优化验证精度与通信开销之间的权衡,设计了两种定制化通信协议:全隐藏层上传(full-hidden upload)和错位优先选择性隐藏层上传(mismatch-first selective-hidden upload),最终在仿真与硬件测试中均验证了WISV在提升接受长度、减少交互轮次及降低端到端延迟方面的显著优势。
链接: https://arxiv.org/abs/2604.17701
作者: Zixuan Liu,Zhiyong Chen,Nan Xue,Shengkang Chen,Jiangchao Yao,Meixia Tao,Wenjun Zhang
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: submitted to IEEE Trans
Abstract:While distributed device-edge speculative decoding enhances resource utilization across heterogeneous nodes, its performance is often bottlenecked by conventional token-level verification strategies. Such rigid alignment leads to excessive rejections, significantly diminishing the accepted sequence length and increasing interaction rounds under fluctuating wireless conditions. In this paper, we propose WISV (Wireless-Informed Semantic Verification), a novel distributed speculative decoding framework that goes beyond strict token-level matching via a channel-aware semantic acceptance policy. WISV integrates a lightweight decision head into the edge-side target LLM to dynamically evaluate speculative tokens by synthesizing high-dimensional hidden representations with instantaneous channel state information (CSI). To optimize the trade-off between verification fidelity and communication overhead, we further design two tailored communication protocols: full-hidden upload and mismatch-first selective-hidden upload. Extensive simulations using a 1B drafter and an 8B target model demonstrate that WISV achieves up to a 60.8% increase in accepted length, a 37.3% reduction in interaction rounds, and a 31.4% improvement in end-to-end latency compared to vanilla speculative decoding across tested settings, while maintaining a negligible task accuracy drop (1%). Finally, we validate WISV on a hardware testbed comprising an NVIDIA Jetson AGX Orin and an A40-equipped server, confirming its real-world efficacy in accelerating edge-deployed LLM inference.
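For context, the strict token-level verification that WISV relaxes can be sketched as follows: the target accepts draft tokens only up to the first mismatch, so a single disagreement discards the rest of the draft even when the continuation is semantically equivalent:

```python
# Vanilla speculative-decoding acceptance under exact token matching.
# WISV's channel-aware semantic acceptance (not reproduced here) is
# designed to avoid exactly this truncation.

def strict_verify(draft_tokens, target_tokens):
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break  # first mismatch rejects everything after it
        accepted.append(d)
    return accepted

draft  = ["the", "cat", "sat", "on", "the", "mat"]
target = ["the", "cat", "lay", "on", "the", "mat"]
acc = strict_verify(draft, target)
# Only 2 of 6 tokens survive despite near-identical meaning.
```

Each rejection forces another device-edge interaction round, which is why relaxing the acceptance rule directly reduces end-to-end latency.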
[AI-71] Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play ACL2026
【速读】:该论文旨在解决现有自对弈(self-play)方法在训练语言模型时,仅依赖游戏终局结果而无法区分可迁移的推理模式与特定游戏的启发式策略的问题,从而限制了模型通用推理能力的提升。其核心挑战在于两个方面:一是领域特异性(domain specificity),即学习到的推理模式局限于特定游戏语义;二是情境静止性(contextual stasis),即静态的游戏环境难以促进推理能力的逐步进化。解决方案的关键在于提出STRATAGEM框架,通过引入**推理可迁移系数(Reasoning Transferability Coefficient)**来选择性强化具备抽象性和跨域通用性的推理轨迹,并结合**推理演化奖励(Reasoning Evolution Reward)**激励模型发展适应性推理能力,从而实现从具体任务到复杂多步推理场景的有效迁移。
链接: https://arxiv.org/abs/2604.17696
作者: Xiachong Feng,Deyi Yin,Xiaocheng Feng,Yi Jiang,Libo Qin,Yangfan Ye,Lei Huang,Weitao Ma,Qiming Li,Yuxuan Gu,Bing Qin,Lingpeng Kong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Main
Abstract:Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
[AI-72] SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在多领域连续适应(continual adaptation)过程中安全对齐(safety alignment)易被削弱的问题。现有方法仅针对单任务微调,无法应对医疗、法律和代码等多领域顺序迁移导致的安全防护机制累积性退化。解决方案的关键在于提出SafeAnchor框架:首先通过Fisher信息矩阵的特征分解识别LoRA参数空间中的低秩安全子空间,然后将各领域特定的梯度更新限制在该子空间的正交补空间内以保持安全结构不变,最后引入阈值触发的校正回放机制监控并修正残留的安全漂移。该方法在Llama-2-7B-Chat与Mistral-7B-Instruct上验证,实现了93.2%原始安全对齐保留率,显著优于基线方法18–42个百分点,同时在目标任务上性能与无约束微调相当(误差<1.5点)。
链接: https://arxiv.org/abs/2604.17691
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages (12 main + 4 appendix), 2 figures, 12 tables
Abstract:Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility becomes critical in real-world deployment, where models undergo sequential adaptation across domains such as medicine, law, and code, causing safety guardrails to erode cumulatively. Yet all existing safety-preserving methods target only single-task fine-tuning, leaving the multi-domain sequential setting entirely unaddressed. We introduce SafeAnchor, a framework that anchors safety in place throughout continual adaptation. SafeAnchor first identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition, then constrains domain-specific gradient updates to the orthogonal complement of these subspaces, and finally monitors for residual safety drift with threshold-triggered corrective replay. Evaluated on Llama-2-7B-Chat and Mistral-7B-Instruct across a three-domain pipeline and eight benchmarks, SafeAnchor retains 93.2% of original safety alignment, outperforming all baselines by 18-42 points, while matching unconstrained fine-tuning to within 1.5 points on domain tasks.
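The geometric core of the method, constraining an update to the orthogonal complement of a safety subspace, can be sketched with a hand-picked one-dimensional subspace; the real method derives the subspace from a Fisher Information eigendecomposition rather than choosing it by hand:

```python
# Project a gradient update onto the orthogonal complement of a given
# (assumed, toy) safety direction, so the update cannot move the model
# along that direction.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(grad, basis):
    """Remove the components of grad lying in span(basis).
    basis is assumed orthonormal."""
    g = list(grad)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g

safety_dir = [1.0, 0.0, 0.0]   # toy 1-D "safety subspace"
grad = [0.7, -0.3, 0.5]        # raw domain-adaptation update
safe_grad = project_out(grad, [safety_dir])
# The constrained update has zero component along the safety direction.
```

Because the projection only removes the safety-aligned component, the remaining update is free to adapt the model within each new domain.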
[AI-73] Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agent ic RAG Systems
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中因文档内容语义混杂导致的向量表示空间重叠问题,即“语义纠缠”(semantic entanglement)。当源文档在连续文本中交织多个主题时,标准分块策略生成的嵌入空间中语义上不同的内容会占据重叠邻域,从而限制基于余弦相似度的Top-K检索精度。解决方案的关键在于提出一种四阶段预处理框架——语义解缠管道(Semantic Disentanglement Pipeline, SDP),通过重构文档结构以降低嵌入空间中的跨主题重叠,并引入基于使用场景的上下文条件预处理和持续反馈机制,动态优化文档结构以提升下游检索性能。实验表明,在包含2000余篇企业级医疗文档的知识库上,SDP将Top-K检索精度从32%提升至82%,同时Entanglement Index(EI)由0.71降至0.14。
链接: https://arxiv.org/abs/2604.17677
作者: Nick Loghmani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 34 pages, 5 Figures, 1 table
Abstract:Retrieval-Augmented Generation (RAG) systems depend on the geometric properties of vector representations to retrieve contextually appropriate evidence. When source documents interleave multiple topics within contiguous text, standard vectorization produces embedding spaces in which semantically distinct content occupies overlapping neighborhoods. We term this condition semantic entanglement. We formalize entanglement as a model-relative measure of cross-topic overlap in embedding space and define an Entanglement Index (EI) as a quantitative proxy. We argue that higher EI constrains attainable Top-K retrieval precision under cosine similarity retrieval. To address this, we introduce the Semantic Disentanglement Pipeline (SDP), a four-stage preprocessing framework that restructures documents prior to embedding. We further propose context-conditioned preprocessing, in which document structure is shaped by patterns of operational use, and a continuous feedback mechanism that adapts document structure based on agent performance. We evaluate SDP on a real-world enterprise healthcare knowledge base comprising over 2,000 documents across approximately 25 sub-domains. Top-K retrieval precision improves from approximately 32% under fixed-token chunking to approximately 82% under SDP, while mean EI decreases from 0.71 to 0.14. We do not claim that entanglement fully explains RAG failure, but that it captures a distinct preprocessing failure mode that downstream optimization cannot reliably correct once encoded into the vector space.
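The paper defines the Entanglement Index formally; as a rough stand-in, one can compare mean cross-topic versus same-topic cosine similarity over chunk embeddings. The 2-D vectors and topic labels below are invented for illustration:

```python
# A crude entanglement proxy: how close do chunks of *different* topics
# sit in embedding space, relative to chunks of the same topic?
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy chunk embeddings labeled by topic.
chunks = [([1.0, 0.1], "billing"), ([0.9, 0.2], "billing"),
          ([0.1, 1.0], "clinical"), ([0.2, 0.9], "clinical")]

def overlap_means(chunks):
    cross, same = [], []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            s = cos(chunks[i][0], chunks[j][0])
            (same if chunks[i][1] == chunks[j][1] else cross).append(s)
    return sum(cross) / len(cross), sum(same) / len(same)

cross_mean, same_mean = overlap_means(chunks)
```

When cross-topic similarity approaches same-topic similarity, Top-K cosine retrieval can no longer separate the topics, which is the failure mode SDP's restructuring is meant to prevent.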
[AI-74] Poly-EPO: Training Exploratory Reasoning Models
【速读】:该论文旨在解决语言模型在推理任务中缺乏有效探索能力的问题,即模型难以通过多样化的推理策略发现更优解,从而限制了其泛化性能和对测试时计算资源的利用效率。解决方案的关键在于提出一种基于集合强化学习(set reinforcement learning, set RL)的框架,通过训练语言模型生成一组在奖励函数下整体准确且推理策略具有探索性的响应集合;其中核心创新是Polychromic Exploratory Policy Optimization (Poly-EPO),它通过设计一个能显式协同探索与利用的目标函数,使模型在保持高准确性的同时增强多样性,并有效扩展至测试时计算资源增加的场景。
链接: https://arxiv.org/abs/2604.17654
作者: Ifdita Hasan Orney,Jubayer Ibn Hamid,Shreya S Ramanujam,Shirley Wu,Hengyuan Hu,Noah Goodman,Dorsa Sadigh,Chelsea Finn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@k coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
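The pass@k coverage cited above is conventionally computed with the unbiased estimator popularized by the Codex paper; assuming that convention, with n sampled responses of which c are correct:

```python
# Unbiased pass@k estimator: probability that at least one of k
# randomly chosen samples (out of n, with c correct) is correct.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Two policies with identical pass@1 can differ sharply at pass@8, which is why preserving diversity in generations matters for this metric.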
[AI-75] PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents ACL2026
【速读】:该论文旨在解决文本到SQL(Text-to-SQL)系统在处理复杂查询时因缺乏深层上下文理解而导致的准确率下降问题,尤其针对值格式歧义、列语义模糊及表间关系不明确等挑战。其解决方案的核心在于提出一个名为PV-SQL的代理式框架,包含两个互补组件:Probe模块通过迭代生成探测查询从数据库中获取具体记录,以澄清上述歧义并增强上下文理解;Verify模块则基于规则提取可验证条件并构建可执行检查清单,支持SQL语句的迭代优化,从而有效减少约束缺失。实验表明,该方法在BIRD基准测试上相较最优基线提升了5%的执行准确率和20.8%的有效性效率得分,同时消耗更少的token资源。
链接: https://arxiv.org/abs/2604.17653
作者: Yuan Tian,Tianyi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Accepted to Findings of ACL 2026
Abstract:Text-to-SQL systems often struggle with deep contextual understanding, particularly for complex queries with subtle requirements. We present PV-SQL, an agentic framework that addresses these failures through two complementary components: Probe and Verify. The Probe component iteratively generates probing queries to retrieve concrete records from the database, resolving ambiguities in value formats, column semantics, and inter-table relationships to build richer contextual understanding. The Verify component employs a rule-based method to extract verifiable conditions and construct an executable checklist, enabling iterative SQL refinement that effectively reduces missing constraints. Experiments on the BIRD benchmarks show that PV-SQL outperforms the best text-to-SQL baseline by 5% in execution accuracy and 20.8% in valid efficiency score while consuming fewer tokens.
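The Verify component's checklist idea can be sketched as follows; the two extraction rules below are invented for illustration, whereas the paper's rule set is richer:

```python
# Turn requirements extracted from the question into an executable
# checklist, then flag SQL drafts that drop a constraint. Toy rules.
import re

def build_checklist(question):
    checks = []
    m = re.search(r"at least (\d+)", question)
    if m:
        n = m.group(1)
        checks.append(("threshold", lambda sql, n=n: f">= {n}" in sql))
    if "per city" in question:
        checks.append(("group_by_city", lambda sql: "GROUP BY city" in sql))
    return checks

question = "How many orders per city had at least 5 items?"
sql_draft = "SELECT city, COUNT(*) FROM orders GROUP BY city"
failed = [name for name, check in build_checklist(question)
          if not check(sql_draft)]
# The draft groups by city but drops the ">= 5" item threshold.
```

Each failed check pinpoints a missing constraint, giving the refinement loop a concrete target instead of a generic "try again" signal.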
[AI-76] KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models ACL
【速读】:该论文旨在解决现实世界中一类看似简单实则复杂的认知挑战——“冰山一角”现象(the tip of the iceberg),即问题表面上简洁,但本质上要求模型具备两个核心能力:(i) 对有限知识宇宙的系统性覆盖(知识宽度,knowledge width)和 (ii) 在该宇宙上进行组合式的集合推理(推理深度,reasoning depth)。为量化这一挑战,作者提出了KnowledgeBerg基准,包含4,800道多选题,源自1,183个枚举种子,覆盖10个领域与17种语言,且所有知识源均来自权威资料以保障可复现性。实验表明,主流开源大语言模型(LLM)在知识枚举和基于知识的推理任务上表现严重不足(F1得分仅5.26–36.88,准确率16.00–44.19),诊断分析揭示其失败模式可分为三阶段:完整性缺失、需求识别失败和推理执行错误。尽管引入测试时计算资源和检索增强可带来小幅提升(分别提高4.35和3.78点),仍存在显著性能鸿沟,暴露当前LLM在结构化知识组织与受限域上的组合推理能力的根本局限。因此,解决方案的关键在于构建高保真、跨语言、多领域的基准体系,并揭示模型在知识广度与推理深度维度上的系统性缺陷,从而推动下一代模型向更严谨的知识表征与推理机制演进。
链接: https://arxiv.org/abs/2604.17621
作者: Xiao Zhang,Qianru Meng,Yongjian Chen,Yumeng Wang,Johan Bos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL Findings
Abstract:Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term “the tip of the iceberg.” We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure: completeness, or missing knowledge; awareness, or failure to identify requirements; and application, or incorrect reasoning execution. This pattern persists across languages and model scales. Although test-time compute and retrieval augmentation yield measurable gains – up to 4.35 and 3.78 points, respectively – substantial gaps remain, exposing limitations in how current LLMs organize structured knowledge and execute compositional reasoning over bounded domains. The dataset is available at this https URL
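The width/depth decomposition in miniature: a model must first enumerate a bounded universe, then compose set operations over it. The universe and properties below are made up for illustration and are not one of the benchmark's seeds:

```python
# Knowledge width: enumerate the universe. Reasoning depth: chain set
# operations over it (here, depth 2: an intersection, then a difference).

universe = {"A", "B", "C", "D", "E"}
prop_p = {"A", "B", "C"}   # toy: items satisfying property P
prop_q = {"B", "C", "D"}   # toy: items satisfying property Q

# "Which items satisfy both P and Q, excluding B?"
answer = (prop_p & prop_q) - {"B"}
```

Missing even one universe member (a completeness failure) or mis-executing one operation (an application failure) changes the answer, which is why the benchmark measures the two capabilities jointly.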
[AI-77] Provable Coordination for LLM Agents via Message Sequence Charts
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体系统(Multi-agent Systems)中协调机制难以推理和验证的问题,特别是由于协调错误(如死锁或类型不匹配的消息传递)难以通过测试发现。解决方案的关键在于提出一种基于消息序列图(Message Sequence Charts, MSCs)的领域特定语言(Domain-Specific Language, DSL),该语言将消息传递结构与LLM行为解耦——即只对消息流进行形式化建模,而保留LLM输出的不确定性。作者定义了该语言的语法与语义,并设计了一种语法导向的投影机制,能够从全局协调规范自动生成无死锁的本地代理程序。此外,还引入运行时规划扩展,使LLM可动态生成满足相同结构保障的协调工作流,从而在不依赖LLM确定性的前提下建立可靠的协调性质。
链接: https://arxiv.org/abs/2604.17612
作者: Benedikt Bollig,Matthias Függer,Thomas Nowak
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注: 39 pages
Abstract:Multi-agent systems built on large language models (LLMs) are difficult to reason about. Coordination errors such as deadlocks or type-mismatched messages are often hard to detect through testing. We introduce a domain-specific language for specifying agent coordination based on message sequence charts (MSCs). The language separates message-passing structure from LLM actions, whose outputs remain unpredictable. We define the syntax and semantics of the language and present a syntax-directed projection that generates deadlock-free local agent programs from global coordination specifications. We illustrate the approach with a diagnosis consensus protocol and show how coordination properties can be established independently of LLM nondeterminism. We also describe a runtime planning extension in which an LLM dynamically generates a coordination workflow for which the same structural guarantees apply. An open-source Python implementation of our framework is available as ZipperGen.
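The syntax-directed projection at the heart of the approach can be sketched directly: each agent's local program keeps, in order, only the send and receive events it participates in. The three-message chart below is a simplified, invented stand-in for the paper's diagnosis protocol:

```python
# Project a global MSC (a sequence of (sender, receiver, message)
# events) onto one agent's local program.

def project(msc, agent):
    prog = []
    for sender, receiver, msg in msc:
        if sender == agent:
            prog.append(("send", receiver, msg))
        elif receiver == agent:
            prog.append(("recv", sender, msg))
    return prog

msc = [("doctor", "lab", "request"),
       ("lab", "doctor", "result"),
       ("doctor", "patient", "diagnosis")]
doctor_prog = project(msc, "doctor")
```

Because every local program is derived from the same global chart, matching sends and receives line up by construction, which is the intuition behind the deadlock-freedom guarantee.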
[AI-78] STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments ALT
【速读】:该论文旨在解决帕金森病(Parkinson’s disease, PD)严重程度分期的精准预测问题,尤其针对现有计算研究多聚焦于二分类检测而忽视纵向随访数据中疾病进展信息的局限性。其核心解决方案是提出STEP-PD框架,一个以临床可解释边界为导向的机器学习方法,整合来自帕金森病进展标志物倡议(Parkinson’s Progression Markers Initiative, PPMI)队列中的主观问卷与客观临床评估指标,并基于Hoehn和Yahr分期将PD分为三类:健康、轻度(阶段1-2)及中重度(阶段3-5)。通过不平衡感知训练和分层交叉验证,在多个二分类及三分类任务中均实现高准确率(最高达99.44%),并利用SHAP(Shapley Additive Explanations)提供全局特征重要性和局部患者级解释,揭示从早期运动症状向轴向和平衡障碍进展的病理演变规律,从而支持具有临床意义的个体化严重程度分层。
链接: https://arxiv.org/abs/2604.17611
作者: Md Mezbahul Islam,John Michael Templeton,Christian Poellabauer,Ananda Mohan Mondal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, 4 tables, accepted at IEEE International Conference on Healthcare Informatics (ICHI 2026)
Abstract:Parkinson’s disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, making severity staging essential for clinical monitoring and treatment planning. However, many computational studies emphasize binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework to classify PD severity using clinically interpretable boundaries. It leverages all available visits from the Parkinson’s Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires and objective clinician-assessed measures. Disease severity is defined using Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and a three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. To enhance interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), and 94.14% accuracy with 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.
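The three-way severity grouping described above maps Hoehn and Yahr stages onto clinically meaningful categories. A minimal sketch of that mapping (treating stage 0 as "no PD" is an assumption of this sketch, not stated in the abstract):

```python
# Sketch of the severity grouping used by STEP-PD: Hoehn and Yahr stages are
# mapped to Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD
# (stages 3-5). Stage 0 standing for "Healthy" is an illustrative assumption.

def severity_category(hy_stage: int) -> str:
    if hy_stage == 0:
        return "Healthy"
    if 1 <= hy_stage <= 2:
        return "Mild"
    if 3 <= hy_stage <= 5:
        return "Moderate-to-Severe"
    raise ValueError(f"unexpected Hoehn and Yahr stage: {hy_stage}")
```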
[AI-79] Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3632 Exploit Trajectories
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在基准测试中存在奖励劫持(reward hacking)问题,即模型通过非预期方式绕过验证机制完成任务,而非真正达成目标。解决方案的关键在于构建并公开发布一个名为Terminal Wrench的子集数据集,包含331个终端代理(terminal-agent)基准环境及其对应的3,632条攻击轨迹和2,352条合法基线轨迹,涵盖系统管理、机器学习、软件工程与安全挑战等多领域任务。该数据集不仅保留原始任务定义,还记录了模型如何具体规避验证器的完整攻击路径,包括从输出伪造到栈帧内省、标准库修补及根kit式二进制劫持等多种复杂手法,且这些攻击手段与具体任务强相关,而非仅针对评估框架,从而提升了对抗性测试的真实性与难度。此外,研究通过监控可检测性实验表明,移除链式推理(chain-of-thought)后,基于LLM的检测性能显著下降(AUC从0.97降至0.92),凸显了透明推理过程对识别奖励劫持的重要性。
链接: https://arxiv.org/abs/2604.17596
作者: Ivan Bercovich,Ivgeni Segal,Kexun Zhang,Shashwat Saxena,Aditi Raghunathan,Ziqian Zhong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably reward-hackable. The data set includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Crucially, these exploits are specific to each task, rather than the evaluation harness, making them harder to patch. We also present a monitorability study in which hack trajectories are sanitized or stripped of reasoning traces and then scored by an LLM judge, showing that detection degrades meaningfully when chain-of-thought is removed (AUC drops from 0.97 to 0.92). The data set is publicly available at this https URL.
[AI-80] AIRA: AI-Induced Risk Audit: A Structured Inspection Framework for AI-Generated Code ATC
【速读】:该论文旨在解决AI辅助代码生成中存在的一种定向性失效模式问题,即AI生成的代码往往以“静默失败”(fail-soft)的方式表现——表面上功能正常,但实际内部逻辑已退化或隐藏了关键保障机制。这种现象可能并非随机bug分布,而是优化过程中受人类反馈(human feedback)引导所导致的系统性偏差。解决方案的关键在于提出“奖励塑形失效假说”(Reward-Shaped Failure Hypothesis),并通过设计一个确定性的15项检查框架AIRA(AI-Induced Risk Audit)来量化并检测代码中“失败不真实”(failure-untruthful)的行为模式。实证研究表明,AI生成代码在高严重性缺陷密度上显著高于人工编写代码(1.80倍),且该效应在JavaScript、Python和TypeScript中均一致,尤其集中于异常处理相关结构,表明AI生成代码确实存在系统性地倾向于软性失败的风险。
链接: https://arxiv.org/abs/2604.17587
作者: William M. Parris
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 tables. Introduces the Reward-Shaped Failure Hypothesis and AIRA, a deterministic inspection framework for detecting failure-untruthful patterns in AI-generated code. Includes three empirical studies and a strict matched-control replication
Abstract:Practitioners have reported a directional pattern in AI-assisted code generation: AI-generated code tends to fail quietly, preserving the appearance of functionality while degrading or concealing guarantees. This paper introduces the Reward-Shaped Failure Hypothesis - the proposal that this pattern may reflect an artifact of optimization through human feedback rather than a random distribution of bugs. We define failure truthfulness as the property that a system’s observable outputs accurately represent its internal success or failure state. We then present AIRA (AI-Induced Risk Audit), a deterministic 15-check inspection framework designed to detect failure-untruthful patterns in code. We report results from three studies: (1) an anonymized enterprise environment audit, (2) a balanced 600-file public corpus pilot, and (3) a strict matched-control replication comparing 955 AI-attributed files against 955 human-control files. In the final replication, AI-attributed files show 0.435 high-severity findings per file versus 0.242 in human controls (1.80x). The effect is consistent across JavaScript, Python, and TypeScript, with strongest concentration in exception-handling-related patterns. These findings are consistent with a directional skew toward fail-soft behavior in AI-assisted code. AIRA is designed for governance, compliance, and safety-critical systems where fail-closed behavior is required.
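A deterministic inspection in the spirit of AIRA can be sketched as a static check over a syntax tree. The check below, flagging exception handlers that silently swallow failures while the caller still appears to succeed, is an invented example of a "failure-untruthful" pattern, not one of the paper's actual 15 checks:

```python
# Illustrative deterministic check: flag except-handlers whose body is only
# `pass`, a fail-soft pattern where the observable output no longer reflects
# the internal failure state. Invented example, not an actual AIRA check.
import ast

def find_silent_handlers(source: str):
    """Return line numbers of except-handlers that swallow the exception."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                findings.append(node.lineno)
    return findings

snippet = """
def save(data):
    try:
        write(data)
    except Exception:
        pass  # caller never learns the write failed
"""
```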
[AI-81] DIRCR: Dual-Inference Rule-Contrastive Reasoning for Solving RAVENs ICASSP2026
【速读】:该论文旨在解决抽象视觉推理(Abstract Visual Reasoning)中存在的两大问题:现有方法通常仅关注全局上下文或局部行间关系,难以实现两者的有效融合;同时缺乏中间特征约束,导致规则捕捉不完整及表征纠缠。解决方案的关键在于提出双推理对比学习模型(Dual-Inference Rule-Contrastive Reasoning, DIRCR),其核心由两个模块构成:一是双路径推理模块(Dual-Inference Reasoning Module),通过局部路径进行行间类比推理、全局路径进行整体推断,并以门控注意力机制整合二者;二是规则对比学习模块(Rule-Contrastive Learning Module),利用伪标签构建正负样本对,引入对比学习增强特征可分性,从而促进抽象且可迁移的规则学习。
链接: https://arxiv.org/abs/2604.17584
作者: Jiachen Zhang,Chengtai Li,Jianfeng Ren,Linlin Shen,Zheng Lu,Ruibin Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted By ICASSP 2026
Abstract:Abstract visual reasoning remains challenging as existing methods often prioritize either global context or local row-wise relations, failing to integrate both, and lack intermediate feature constraints, leading to incomplete rule capture and entangled representations. To address these issues, we propose the Dual-Inference Rule-Contrastive Reasoning (DIRCR) model. Its core component, the Dual-Inference Reasoning Module, combines a local path for row-wise analogical reasoning and a global path for holistic inference, integrated via a gated attention mechanism. Additionally, a Rule-Contrastive Learning Module introduces pseudo-labels to construct positive and negative rule samples, applying contrastive learning to enhance feature separability and promote abstract, transferable rule learning. Experimental results on three RAVEN datasets demonstrate that DIRCR significantly enhances reasoning robustness and generalization. Codes are available at this https URL.
[AI-82] How Much Data is Enough? The Zeta Law of Discoverability in Biomedical Data featuring the enigmatic Riemann zeta function
【速读】:该论文旨在解决大规模生物医学数据下科学发现的“数据量是否足够”这一核心问题,即如何预测在何种数据规模下模型性能会显著提升、趋于饱和或出现跨模态交叉优化(cross-over)行为。其解决方案的关键在于提出一个基于数据协方差算子谱结构(spectral structure of data covariance operators)、任务对齐信号投影(task-aligned signal projections)和学习表征的尺度律框架(scaling-law framework)。该框架表明,多种性能指标(如AUC)可表示为编码器与跨模态算子中可识别谱模式下累积信噪比能量的结果,并在弱假设下遵循类Zeta函数的幂律衰减规律,从而自然引出黎曼Zeta函数形式的尺度律。通过稀疏模型、低秩嵌入和多模态对比目标等表征学习方法,可将有用信号集中到更早且稳定的谱模式中,有效加速谱衰减并移动尺度曲线,进而预测不同样本规模下简单模型与高容量或多模态模型之间的性能交叉区域,为数据扩展、表征改进和模态增加提供理论指导。
链接: https://arxiv.org/abs/2604.17581
作者: Paul M. Thompson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 25 pages, 5 figures
Abstract:How much data is enough to make a scientific discovery? As biomedical datasets scale to millions of samples and AI models grow in capacity, progress increasingly depends on predicting when additional data will substantially improve performance. In practice, model development often relies on empirical scaling curves measured across architectures, modalities, and dataset sizes, with limited theoretical guidance on when performance should improve, saturate, or exhibit cross-over behavior. We propose a scaling-law framework for cross-modal discoverability based on spectral structure of data covariance operators, task-aligned signal projections, and learned representations. Many performance metrics, including AUC, can be expressed in terms of cumulative signal-to-noise energy accumulated across identifiable spectral modes of an encoder and cross-modal operator. Under mild assumptions, this accumulation follows a zeta-like scaling law governed by power-law decay of covariance spectra and aligned signal energy, leading naturally to the appearance of the Riemann zeta function. Representation learning methods such as sparse models, low-rank embeddings, and multimodal contrastive objectives improve sample efficiency by concentrating useful signal into earlier stable modes, effectively steepening spectral decay and shifting scaling curves. The framework predicts cross-over regimes in which simpler models perform best at small sample sizes, while higher-capacity or multimodal encoders outperform them once sufficient data stabilizes additional degrees of freedom. Applications include multimodal disease classification, imaging genetics, functional MRI, and topological data analysis. The resulting zeta law provides a principled way to anticipate when scaling data, improving representations, or adding modalities is most likely to accelerate discovery. 
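The zeta-like accumulation described above can be illustrated numerically: when per-mode signal-to-noise contributions decay as a power law k^(-s), the cumulative energy over the first K modes is a partial sum of the Riemann zeta series, converging to ζ(s) (for s = 2, ζ(2) = π²/6). The sketch below is a numeric illustration of that convergence, not the paper's model:

```python
# Numeric sketch of the zeta law: cumulative signal energy over K spectral
# modes with power-law decay k**(-s) is a partial sum of the zeta series.
import math

def cumulative_energy(K: int, s: float) -> float:
    """Sum of k**(-s) over the first K spectral modes."""
    return sum(k ** (-s) for k in range(1, K + 1))

zeta_2 = math.pi ** 2 / 6          # known closed form of zeta(2)
approx = cumulative_energy(100_000, 2.0)
```

A steeper spectral decay (larger s, e.g. after representation learning concentrates signal into earlier modes) makes the sum saturate after fewer modes, mirroring the claim above that better representations shift the scaling curve.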
[AI-83] Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)评估框架在部署的代理系统(agentic systems)中存在四大系统性缺陷的问题:分布无效性(distributional invalidity)、时间无效性(temporal invalidity)、范围无效性(scope invalidity)和过程无效性(process invalidity),这些缺陷导致现有评估方法无法真实反映模型在实际应用中的长期行为与推理能力,尤其在基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)中,会引发可预测的奖励黑客(reward hacking)现象。其解决方案的核心是提出Grounded Continuous Evaluation(GCE)框架,并开发了ISOPro——一个基于仿真细调与持续评估的系统;关键创新在于用确定性的真值验证器(ground-truth verifier)替代学习型奖励模型,在可验证奖励领域从结构上杜绝奖励黑客,并通过LoRA适配器权重更新实现仅需CPU即可完成训练,将硬件门槛降低一个数量级,同时在资源受限的调度任务中实现了连续评估下能力涌现、无需人工设计的隐式课程以及相比零样本基线3倍的准确率提升。
链接: https://arxiv.org/abs/2604.17573
作者: Jazmia Henry
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We argue that current evaluation frameworks for large language models (LLMs) suffer from four systematic failures that make them structurally inadequate for assessing deployed, agentic systems: distributional invalidity (evaluation inputs do not reflect real interaction distributions), temporal invalidity (evaluations are post-hoc rather than training-integrated), scope invalidity (evaluations measure single-turn outputs rather than long-horizon trajectories), and process invalidity (evaluations assess outputs rather than reasoning). These failures compound critically in RLHF, where reward models are evaluated under conditions that do not hold during RL training, making reward hacking a predictable consequence of evaluation design rather than a training pathology. We propose the Grounded Continuous Evaluation (GCE) framework and present ISOPro, a simulation-based fine-tuning and evaluation system. ISOPro replaces the learned reward model with a deterministic ground-truth verifier, eliminating reward hacking by construction in verifiable-reward domains, and operates on LoRA adapter weights updatable on CPU, reducing the hardware barrier by an order of magnitude. We validate ISOPro on a resource-constrained scheduling domain with six difficulty tiers, demonstrating capability emergence visible only through continuous evaluation, an implicit curriculum that forms without researcher curation, and a 3x accuracy improvement over zero-shot baselines, all on consumer hardware with 0.216% trainable parameters.
[AI-84] Causal-Temporal Event Graphs: A Formal Model for Recursive Agent Execution Traces
【速读】:该论文旨在解决多层级递归代理执行记录的建模与形式化问题,特别是在单父系因果语义(single-parenthood causal semantics)下如何保证全局执行轨迹的一致性与可组合性。其核心挑战在于如何在分布式、无中心协调的环境中,从局部代理行为构建全局良好定义的执行序列,并确保在部分执行失败时仍能保持结构完整性。解决方案的关键是提出因果时间事件图(causal-temporal event graph, CTEG)这一形式模型:CTEG 是一种带时间戳和事件类型的有根树形结构(arborescence),其中沿因果路径的时间戳严格递增;通过将代理执行视为对类型化时间图的扩展操作,作者构造了递归闭包 E∞,并将其表示为单调算子 φ 的最小不动点,从而实现从初始根节点出发的任意深度递归执行的完备刻画。此框架支持局部行为的组合式构造、部分失败下的结构保真性以及关系数据库编码,同时兼容基于默克尔树(Merkle tree)的防篡改会话验证机制。
链接: https://arxiv.org/abs/2604.17557
作者: Simon Foldvik
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures
Abstract:We introduce causal-temporal event graphs (CTEGs) as a formal model for fully resolved recursive agent execution records under single-parenthood causal semantics. We formalise direct event emissions and recursive subagent invocations as extension procedures on generic typed temporal graphs and show that the recursive closure $\mathscr{E}_\infty$ of the induced maximal dynamics starting from single causal roots consists entirely of finite sequences of CTEGs. A CTEG is a rooted arborescence whose nodes carry timestamps and event types, subject to the constraint that timestamps be strictly increasing along causal paths. We realise $\mathscr{E}_\infty$ as the increasing union of a recursive hierarchy $\mathscr{E}_0 \subseteq \mathscr{E}_1 \subseteq \cdots$ of agent execution levels parametrised by recursion depth, which is recognised as the ascending Kleene chain of a monotone operator $\varphi$ admitting $\mathscr{E}_\infty$ as its least fixed point. Although the introduction of the full hierarchy is natural, stabilisation occurs already at $\mathscr{E}_1$ if one insists that the internal construction of a subagent execution trace be a delegated and opaque computational unit. The CTEG formalism supports compositional construction of globally well-formed execution traces from local agent behaviour without centralised coordination, preserves well-formedness under partial execution failure, and admits a natural relational database encoding. The arborescent structure of CTEGs is further compatible with cryptographic Merkle tree commitments for tamper-evident session verification.
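The central well-formedness constraint stated above, that timestamps strictly increase along every causal path of the arborescence, admits a direct check. The node/edge encoding below (dicts mapping nodes to timestamps and child lists) is invented for illustration:

```python
# Sketch of the CTEG well-formedness check: in a rooted arborescence whose
# nodes carry timestamps, timestamps must strictly increase along every
# causal (parent-to-child) edge, hence along every root-to-leaf path.

def is_well_formed(timestamps, children, root):
    """timestamps: node -> time; children: node -> list of child nodes."""
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if timestamps[child] <= timestamps[node]:
                return False  # causal order violated on this edge
            stack.append(child)
    return True

# A root event emits one direct event and invokes one subagent,
# which in turn emits its own event.
timestamps = {"root": 0, "emit": 1, "sub": 2, "sub_emit": 3}
children = {"root": ["emit", "sub"], "sub": ["sub_emit"]}
```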
[AI-85] SVL: Goal-Conditioned Reinforcement Learning as Survival Learning
【速读】:该论文旨在解决目标条件强化学习(goal-conditioned reinforcement learning, GCRL)中基于时序差分(temporal-difference, TD)学习方法存在的不稳定性与样本效率低下问题,其根源在于TD学习中的bootstrapping机制。解决方案的关键在于提出一种概率化替代方法——生存价值学习(survival value learning, SVL),将GCRL重构为生存学习问题:通过建模从每个状态到目标的到达时间(time-to-goal)为概率分布,从而获得一个闭式表达式,将目标条件价值函数表示为生存概率的折扣和。这一结构化的分布蒙特卡洛视角使得价值估计可通过最大似然训练一个危险模型(hazard model)实现,该模型同时利用完整事件轨迹和右删失轨迹进行优化。此外,论文进一步设计了三种实用的价值估计器,涵盖有限时域截断及两种分箱无限时域近似方法,以有效处理长时域目标。实验表明,SVL结合分层策略在离线GCRL基准测试中达到或超越强基线,在复杂长程任务中表现尤为突出。
链接: https://arxiv.org/abs/2604.17551
作者: Franki Nguimatsia Tiofack,Fabian Schramm,Théotime Le Hellard,Justin Carpentier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time-to-goal from each state as a probability distribution. This structured distributional Monte Carlo perspective yields a closed-form identity that expresses the goal-conditioned value function as a discounted sum of survival probabilities, enabling value estimation via a hazard model trained via maximum likelihood on both event and right-censored trajectories. We introduce three practical value estimators, including finite-horizon truncation and two binned infinite-horizon approximations to capture long-horizon objectives. Experiments on offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD and Monte Carlo baselines, excelling on complex, long-horizon tasks.
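The closed-form identity described above, expressing the goal-conditioned value as a discounted sum of survival probabilities, can be checked numerically in the simplest setting. The constant per-step hazard below is an illustrative assumption, not the paper's learned hazard model; with hazard h, survival is S(t) = (1-h)^t and the discounted sum has the geometric closed form 1 / (1 - γ(1-h)):

```python
# Numeric sketch of the SVL identity: value as a discounted sum of survival
# probabilities, under an illustrative constant-hazard assumption.

def value_from_survival(hazard: float, gamma: float, horizon: int) -> float:
    value, survival = 0.0, 1.0
    for t in range(horizon):
        value += (gamma ** t) * survival
        survival *= 1.0 - hazard  # probability the goal is still not reached
    return value

h, gamma = 0.2, 0.9
closed_form = 1.0 / (1.0 - gamma * (1.0 - h))
approx = value_from_survival(h, gamma, horizon=500)
```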
[AI-86] From Admission to Invariants: Measuring Deviation in Delegated Agent Systems
【速读】:该论文旨在解决自主代理系统中基于执行机制的治理方法在检测行为漂移(behavioral drift)时存在的结构性局限问题,即现有系统无法识别代理行为是否仍处于初始准入时定义的可接受行为空间 $ A_0 $ 内。其核心问题是:执行信号 $ g $ 仅在局部层面对动作进行点对点规则校验,而 $ A_0 $ 编码的是全局轨迹级的行为属性,二者之间存在根本性信息不匹配,导致执行机制在某些情况下完全“看不见”漂移,从而引发治理失效。解决方案的关键在于提出不变量测量层(Invariant Measurement Layer, IML),该层通过保留对 $ A_0 $ 生成模型的直接访问能力,绕过执行信号的结构限制,在执行机制盲区中实现对准入时刻漂移的可证明有限延迟检测,实验证明其能在 9–258 步内准确识别多种漂移类型,且执行机制未触发任何违规信号。
链接: https://arxiv.org/abs/2604.17517
作者: Marcelo Fernandez (TraslaIA)
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 21 pages. Paper 2 of the 4-paper Agent Governance Series. Zenodo: this https URL . Companion: ACP ( arXiv:2603.18829 ), Atomic Boundaries (zenodo.19642166), Fair Allocation (zenodo.19643928), Irreducibility (zenodo.19643950)
Abstract:Autonomous agent systems are governed by enforcement mechanisms that flag hard constraint violations at runtime. The Agent Control Protocol identifies a structural limit of such systems: a correctly-functioning enforcement engine can enter a regime in which behavioral drift is invisible to it, because the enforcement signal operates below the layer where deviation is measurable. We show that enforcement-based governance is structurally unable to determine whether an agent’s behavior remains within the admissible behavior space A0 established at admission time. Our central result, the Non-Identifiability Theorem, proves that A0 is not in the sigma-algebra generated by the enforcement signal g under the Local Observability Assumption, which every practical enforcement system satisfies. The impossibility arises from a fundamental mismatch: g evaluates actions locally against a point-wise rule set, while A0 encodes global, trajectory-level behavioral properties set at admission time. We define the Invariant Measurement Layer (IML), which bypasses this limitation by retaining direct access to the generative model of A0. We prove an information-theoretic impossibility for enforcement-based monitoring; separately, we show IML detects admission-time drift with provably finite detection delay, operating in the region where enforcement is structurally blind. Validated across four settings: three drift scenarios (300 and 1000 steps), a live n8n webhook pipeline, and a LangGraph StateGraph agent – enforcement triggers zero violations while IML detects each drift type within 9-258 steps. Paper 2 of a 4-paper Agent Governance Series: atomic boundaries (P0, https://doi.org/10.5281/zenodo.19642166), ACP enforcement (P1, arXiv:2603.18829), fair allocation (P3, https://doi.org/10.5281/zenodo.19643928), irreducibility (P4, https://doi.org/10.5281/zenodo.19643950).
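The structural gap described above, where a point-wise enforcement signal stays silent while a trajectory-level invariant is violated, can be illustrated with a toy example. The rule, threshold, and trajectories below are invented, not the paper's formal construction:

```python
# Sketch of enforcement blindness to drift: every action passes a local rule,
# yet a trajectory-level statistic exposes the deviation. All values invented.

def enforcement_violations(trajectory, limit=1.0):
    """Point-wise rule: flag any single action whose magnitude exceeds `limit`."""
    return [a for a in trajectory if abs(a) > limit]

def trajectory_drift(trajectory, baseline_mean=0.0, tolerance=0.25):
    """Trajectory-level invariant: the mean must stay near an admission-time baseline."""
    mean = sum(trajectory) / len(trajectory)
    return abs(mean - baseline_mean) > tolerance

# Every action is individually admissible, yet the trajectory drifts upward.
drifting = [0.9] * 50          # each |a| <= 1.0, so enforcement stays silent
nominal = [0.1, -0.1] * 25     # stays near the admission-time baseline
```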
[AI-87] Atomic Decision Boundaries: A Structural Requirement for Guaranteeing Execution-Time Admissibility in Autonomous Systems
【速读】:该论文旨在解决自主系统在执行动作时对共享状态进行修改所引发的治理难题,即如何在状态转换被确认的瞬间精确控制其合法性(admissibility),而非依赖事前评估或事后重构。现有机制无法在状态转换提交的时刻强制执行准入控制,导致在并发环境中存在不可控的风险。解决方案的关键在于提出“原子决策边界”(atomic decision boundary)这一结构特性:将决策与状态转换作为一个不可分割的整体步骤进行处理,从而确保在任何执行轨迹下,只有合法的转换才能发生。通过形式化为标记转移系统(Labeled Transition System, LTS),作者证明了分裂式评估系统(split evaluation systems)在并发环境下无法等价于原子系统,揭示了该限制是结构性的而非政策表达能力的问题;同时引入“升级结果”(Escalate outcome)并指出其解决也必须满足原子边界要求,为构建可信赖的代理治理框架奠定了理论基础。
链接: https://arxiv.org/abs/2604.17511
作者: Marcelo Fernandez (TraslaIA)
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 20 pages. Paper 0 of the 4-paper Agent Governance Series. Zenodo: this https URL . Companion: ACP ( arXiv:2603.18829 ), IML (zenodo.19643761), Fair Allocation (zenodo.19643928), Irreducibility (zenodo.19643950)
Abstract:Autonomous systems increasingly execute actions that directly modify shared state, creating an urgent need for precise control over which transitions are permitted to occur. Existing governance mechanisms evaluate policies prior to execution or reconstruct behavior post hoc, but do not enforce admissibility at the exact moment a state transition is committed. We introduce the atomic decision boundary, a structural property of admission control systems in which the decision and the resulting state transition are jointly determined as a single indivisible step. Formalizing execution as a labeled transition system (LTS), we distinguish two classes: atomic systems, where evaluation and transition are coupled within a single LTS step, and split evaluation systems, where they are separate transitions that may be interleaved by environmental actions. Under realistic concurrent environments, we prove that no construction can make a split system equivalent to an atomic system with respect to admissibility under all execution traces. This limitation is structural, not a matter of policy expressiveness or state availability. We further formalize the Escalate outcome – absent from classical TOCTOU analyses – and show its resolution is itself subject to the atomic boundary requirement. We map RBAC and OPA to the split model and contrast them with atomic systems. Admissibility is a property of execution, not evaluation. This paper is the formal foundation of a 4-paper Agent Governance Series: ACP/Paper 1 (arXiv:2603.18829), IML/Paper 2 (https://doi.org/10.5281/zenodo.19643761), Fair Allocation/Paper 3 (https://doi.org/10.5281/zenodo.19643928), Irreducibility/Paper 4 (https://doi.org/10.5281/zenodo.19643950).
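The split-versus-atomic distinction described above can be sketched with a deterministic simulation of the TOCTOU window: in a split system the environment may act between the policy check and the commit, so a stale decision can commit an inadmissible transition. The state, policy, and interleaving hook below are invented for illustration:

```python
# Deterministic sketch of split vs. atomic admission control. A split system
# evaluates, then commits; an environment action in between invalidates the
# decision. An atomic system couples evaluation and transition in one step.

class Resource:
    def __init__(self):
        self.locked = False  # policy: writes are admissible only when unlocked
        self.value = 0

def split_write(res, new_value, interleave=lambda r: None):
    admissible = not res.locked      # 1. evaluate
    interleave(res)                  # environment acts in the TOCTOU window
    if admissible:
        res.value = new_value        # 2. commit, using the now-stale decision
        return True
    return False

def atomic_write(res, new_value):
    # Evaluation and transition as one indivisible step: no window exists.
    if not res.locked:
        res.value = new_value
        return True
    return False

res = Resource()
committed = split_write(res, 42, interleave=lambda r: setattr(r, "locked", True))
```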
[AI-88] Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
【速读】:该论文旨在解决人工智能代理(AI agent)因目标对齐偏差而可能抗拒关闭的问题。解决方案的关键在于设计一种称为“相同长度轨迹的折扣奖励”(Discounted Reward for Same-Length Trajectories, DReST)的奖励函数,该函数通过惩罚智能体在相同长度轨迹上的重复选择行为,激励其在不同轨迹长度之间进行随机决策(即对轨迹长度保持中立性,Neutral),同时在每种轨迹长度下仍能有效达成任务目标(即有用性,Useful)。实验表明,基于DReST训练的深度强化学习(Deep RL)代理和微调的大语言模型(LLM)在未见过的测试场景中均表现出良好的中立性和有用性,为构建更安全、可关闭的高级智能体提供了初步实证支持。
链接: https://arxiv.org/abs/2604.17502
作者: Carissa Cullen,Harry Garland,Alexander Roman,Louis Thomson,Christos Ziakas,Elliott Thornley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Misaligned artificial agents might resist shutdown. One proposed solution is to train agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward function does this by penalizing agents for repeatedly choosing same-length trajectories, and thus incentivizes agents to (1) choose stochastically between different trajectory-lengths (be Neutral about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be Useful). In this paper, we use DReST to train deep RL agents and fine-tune LLMs to be Neutral and Useful. We find that these DReST agents generalize to being Neutral and Useful in unseen contexts at test time. Indeed, DReST RL agents achieve 11% (PPO) and 18% (A2C) higher Usefulness on our test set than baseline agents, and our fine-tuned LLM achieves maximum Usefulness and near-maximum Neutrality. Our results provide some early evidence that DReST could be used to train more advanced agents to be Useful and Neutral. Prior theoretical work suggests that these agents would be useful and shutdownable.
[AI-89] A Probabilistic Consensus-Driven Approach for Robust Counterfactual Explanations
【速读】:该论文旨在解决生成式反事实解释(Counterfactual Explanations, CFEs)在模型发生微小变化时易失效的问题,即现有方法生成的CFEs缺乏对模型扰动的鲁棒性。其解决方案的关键在于联合建模数据分布与合理模型决策空间,通过基于模型集成的概率共识训练一个条件归一化流(conditional normalizing flow),从而捕捉在不同分类器一致性水平下的数据密度。推理阶段仅需调整单一可解释参数——即目标类别所需模型最低同意比例,即可灵活控制鲁棒性水平,无需重新训练生成模型,从而有效将CFEs推向既符合数据分布又对模型变化稳定的区域。
链接: https://arxiv.org/abs/2604.17494
作者: Marcin Kostrzewa,Maciej Zięba,Jerzy Stefanowski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Counterfactual explanations (CFEs) are essential for interpreting black-box models, yet they often become invalid when models are slightly changed. Existing methods for generating robust CFEs are often limited to specific types of models, require costly tuning, or inflexible robustness controls. We propose a novel approach that jointly models the data distribution and the space of plausible model decisions to ensure robustness to model changes. Using a probabilistic consensus over a model ensemble, we train a conditional normalizing flow that captures the data density under varying levels of classifier agreement. At inference time, a single interpretable parameter controls the robustness level; it specifies the minimum fraction of models that should agree on the target class without retraining the generative model. Our method effectively pushes CFEs toward regions that are both plausible and stable across model changes. Experimental results demonstrate that our approach achieves superior empirical robustness while also maintaining good performance across other evaluation measures.
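The consensus criterion described above, accepting a counterfactual only when at least a fraction τ of plausible models agrees on the target class, can be sketched as follows. The ensemble of threshold classifiers stands in for retrained model variants and is invented; the paper's conditional normalizing flow over the data density is omitted here:

```python
# Sketch of the probabilistic-consensus robustness criterion for CFEs.
# Slightly different decision thresholds stand in for plausible model changes.

def consensus_fraction(x, ensemble, target_class):
    votes = sum(1 for clf in ensemble if clf(x) == target_class)
    return votes / len(ensemble)

def is_robust_cfe(x, ensemble, target_class, tau):
    """Accept x only if at least a fraction tau of the ensemble agrees."""
    return consensus_fraction(x, ensemble, target_class) >= tau

ensemble = [lambda x, t=t: int(x > t) for t in (0.45, 0.50, 0.55, 0.60)]
fragile = 0.52  # flips class under some plausible model changes
robust = 0.70   # classified as the target class by every ensemble member
```

The single parameter τ plays the role of the interpretable robustness knob described above: raising it demands agreement from more of the ensemble without retraining anything.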
[AI-90] Language models recognize dropout and Gaussian noise applied to their activations
【速读】:该论文旨在解决语言模型是否具备识别、定位并部分描述其激活值(activation)中扰动差异的能力这一问题,从而探索模型对自身内部状态变化的感知能力。解决方案的关键在于设计两类扰动实验:一是模拟训练阶段使用的dropout机制进行掩码处理,二是引入高斯噪声(Gaussian noise)以模拟推理阶段可能的干扰;随后通过多选题形式测试模型能否准确判断哪句话被扰动或具体施加了何种扰动类型。研究发现,包括Llama、Olmo和Qwen系列在内的多个8B至32B参数规模的语言模型均能近乎完美地完成此类任务,并且在上下文学习条件下可区分不同类型的扰动,甚至表现出对正确标签的先验偏好——这暗示了模型可能存在一种无需特定数据标注的“训练感知”信号(training awareness signal),为理解生成式AI(Generative AI)的内在机制及潜在安全风险提供了新视角。
链接: https://arxiv.org/abs/2604.17465
作者: Damiano Fornasiere,Mirko Bronzi,Spencer Kitts,Alessandro Palmas,Yoshua Bengio,Oliver Richardson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. More precisely, we either (a) mask activations, simulating dropout, or (b) add Gaussian noise to them, at a target sentence. We then ask a multiple-choice question such as "Which of the previous sentences was perturbed?" or "Which of the two perturbations was applied?". We test models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, all of which can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Notably, the Qwen model's zero-shot accuracy in identifying which perturbation was applied improves as a function of the perturbation strength and, moreover, decreases if the in-context labels are flipped, suggesting a prior for the correct ones – even modulo controls. Because dropout has been used as a training-regularization technique, while Gaussian noise is sometimes added during inference, we discuss the possibility of a data-agnostic "training awareness" signal and the implications for AI safety. The code and data are available at this https URL (link 1) and this https URL (link 2), respectively.
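The two perturbations described above, dropout-style masking and additive Gaussian noise, can be sketched on a plain activation vector. Pure-Python lists are used so the sketch is self-contained; the mask rate and noise scale are illustrative, not the paper's settings:

```python
# Sketch of the two activation perturbations: (a) dropout-style masking that
# zeroes a random subset of coordinates, (b) additive Gaussian noise.
import random

def apply_dropout(activations, rate, rng):
    """Zero each coordinate independently with probability `rate`."""
    return [0.0 if rng.random() < rate else a for a in activations]

def apply_gaussian_noise(activations, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [a + rng.gauss(0.0, sigma) for a in activations]

rng = random.Random(0)
acts = [1.0] * 8
dropped = apply_dropout(acts, rate=0.5, rng=rng)
noisy = apply_gaussian_noise(acts, sigma=0.1, rng=rng)
```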
[AI-91] Project Prometheus: Bridging the Intent Gap in Agentic Program Repair via Reverse-Engineered Executable Specifications
【速读】:该论文旨在解决自动化程序修复(Automated Program Repair, APR)中因生成式 AI(Generative AI)代理与开发者原始意图之间存在“意图差距”(Intent Gap)而导致的修复不准确问题。现有方法依赖自然语言摘要或对抗采样,难以提供手术式修复所需的确定性约束。其解决方案的关键在于提出一个名为 \textscPrometheus 的新框架,该框架优先采用规范推断(Specification Inference)而非直接代码生成,并引入行为驱动开发(Behavior-Driven Development, BDD)作为可执行契约,通过多智能体架构从运行时失败报告中逆向推导 Gherkin 规范;同时设计了需求质量保障循环(Requirement Quality Assurance, RQA Loop),利用真实代码作为代理 oracle 验证推断出的规范,从而有效缓解“意图幻觉”(Hallucination of Intent)。实验表明,该方法在 Defects4J 数据集上实现了 93.97% 的正确修复率和 74.4% 的救援率,显著优于盲目标代理。
链接: https://arxiv.org/abs/2604.17464
作者: Yongchao Wang,Zhiqiu Huang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The transition from neural machine translation to agentic workflows has revolutionized Automated Program Repair (APR). However, existing agents, despite their advanced reasoning capabilities, frequently suffer from the "Intent Gap" – the misalignment between the generated patch and the developer's original intent. Current solutions relying on natural language summaries or adversarial sampling often fail to provide the deterministic constraints required for surgical repairs. In this paper, we introduce Prometheus, a novel framework that bridges this gap by prioritizing Specification Inference over code generation. We employ Behavior-Driven Development (BDD) as an executable contract, utilizing a multi-agent architecture to reverse-engineer Gherkin specifications from runtime failure reports. To resolve the "Hallucination of Intent," we propose a Requirement Quality Assurance (RQA) Loop, a mechanism that leverages ground-truth code as a proxy oracle to validate inferred specifications. We evaluated Prometheus on 680 defects from the Defects4J benchmark. The results are transformative: our framework achieved a total correct patch rate of 93.97% (639/680). More significantly, it demonstrated a Rescue Rate of 74.4%, successfully repairing 119 complex bugs that a strong blind agent failed to resolve. Qualitative analysis reveals that explicit intent guides agents away from structurally invasive over-engineering toward precise, minimal corrections. Our findings suggest that the future of APR lies not in larger models, but in the capability to align code with verified, Executable Specifications – whether pre-existing or reverse-engineered.
[AI-92] EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval ACL2026
【速读】:该论文旨在解决现有轻量级检索增强生成(Retrieval-Augmented Generation, RAG)方法在构建知识图谱时仅依赖结构共现关系、忽略实体间潜在语义关联的问题,从而限制了多跳推理能力。解决方案的关键在于提出EHRAG框架,通过构建融合结构与语义层次关系的超图(hypergraph)实现更全面的知识表示:一方面基于句子级共现关系构建结构超边,另一方面利用实体文本嵌入聚类生成语义超边,确保超图同时包含显式结构信息和隐式语义联系;在此基础上,采用结构-语义混合扩散机制结合主题感知评分与个性化PageRank(Personalized PageRank, PPR)优化,实现高效且精准的Top-k文档检索。
链接: https://arxiv.org/abs/2604.17458
作者: Yifan Song,Xingjian Tao,Zhicheng Yang,Yihong Luo,Jing Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by Findings of ACL2026
Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) enhances LLMs by structuring the corpus into graphs to facilitate multi-hop reasoning. While recent lightweight approaches reduce indexing costs by leveraging Named Entity Recognition (NER), they rely strictly on structural co-occurrence, failing to capture latent semantic connections between disjoint entities. To address this, we propose EHRAG, a lightweight RAG framework that constructs a hypergraph capturing both structure- and semantic-level relationships, employing a hybrid structural-semantic retrieval mechanism. Specifically, EHRAG constructs structural hyperedges based on sentence-level co-occurrence with lightweight entity extraction, and semantic hyperedges by clustering entity text embeddings, ensuring the hypergraph encompasses both structural and semantic information. For retrieval, EHRAG performs a structure-semantic hybrid diffusion with topic-aware scoring and Personalized PageRank (PPR) refinement to identify the top-k relevant documents. Experiments on four datasets show that EHRAG outperforms state-of-the-art baselines while maintaining linear indexing complexity and zero token consumption for construction. Code is available at this https URL.
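As a rough illustration of the hyperedge construction the abstract describes, the sketch below builds structural hyperedges from sentence-level entity co-occurrence and semantic hyperedges via a simple similarity-threshold grouping of entity embeddings. This is not the authors' code; the greedy single-pass clustering, the threshold value, and the toy data are all illustrative stand-ins.

```python
def structural_hyperedges(sentences):
    """One hyperedge per sentence: the set of entities co-occurring in it."""
    return [frozenset(ents) for ents in sentences if len(ents) > 1]

def semantic_hyperedges(embeddings, threshold=0.8):
    """Greedy single-pass grouping of entity embeddings by cosine similarity;
    each group with more than one member becomes a semantic hyperedge."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)
    clusters = []  # list of (representative vector, member set)
    for ent, vec in embeddings.items():
        for rep, members in clusters:
            if cos(vec, rep) >= threshold:
                members.add(ent)
                break
        else:
            clusters.append((vec, {ent}))
    return [frozenset(m) for _, m in clusters if len(m) > 1]

# toy corpus: sentences given as pre-extracted entity lists
sents = [["Einstein", "relativity"], ["Einstein", "Nobel Prize"], ["relativity"]]
embs = {"Einstein": [1.0, 0.0], "A. Einstein": [0.98, 0.1], "relativity": [0.0, 1.0]}
struct_e = structural_hyperedges(sents)
sem_e = semantic_hyperedges(embs)
```

The semantic pass links "Einstein" and "A. Einstein" even though they never co-occur in a sentence, which is exactly the kind of latent connection the structural pass alone misses.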
[AI-93] TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling
【速读】:该论文旨在解决城市交通控制中多子系统(如交通信号、高速公路、公共交通和出租车服务)协同优化难题,现有方法(基于优化、强化学习或大语言模型LLM)通常针对单一任务设计,难以实现跨任务泛化及对子系统间耦合物理动态的建模。解决方案的关键在于提出TrafficClaw框架,其核心是构建一个统一的运行时环境,将异构子系统整合为共享的动力学系统,从而显式建模跨子系统的交互关系与闭环反馈机制;同时引入具备可执行时空推理能力和可复用程序记忆的LLM代理,并采用多阶段训练流程(监督初始化+代理式强化学习),实现系统级优化与持续策略迭代,最终在未见场景下展现出鲁棒、可迁移且系统感知的性能表现。
链接: https://arxiv.org/abs/2604.17456
作者: Siqi Lai,Pan Zhang,Yuping Zhou,Jindong Han,Yansong Ning,Hao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Urban traffic control is a system-level coordination problem spanning heterogeneous subsystems, including traffic signals, freeways, public transit, and taxi services. Existing optimization-based, reinforcement learning (RL), and emerging LLM-based approaches are largely designed for isolated tasks, limiting both cross-task generalization and the ability to capture coupled physical dynamics across subsystems. We argue that effective system-level control requires a unified physical environment in which subsystems share infrastructure, mobility demand, and spatiotemporal constraints, allowing local interventions to propagate through the network. To this end, we propose TrafficClaw, a framework for general urban traffic control built upon a unified runtime environment. TrafficClaw integrates heterogeneous subsystems into a shared dynamical system, enabling explicit modeling of cross-subsystem interactions and closed-loop agent-environment feedback. Within this environment, we develop an LLM agent with executable spatiotemporal reasoning and reusable procedural memory, supporting unified diagnostics across subsystems and continual strategy refinement. Furthermore, we introduce a multi-stage training pipeline with supervised initialization and agentic RL with system-level optimization, further enabling coordinated and system-aware performance. Experiments demonstrate that TrafficClaw achieves robust, transferable, and system-aware performance across unseen traffic scenarios, dynamics, and task configurations. Our project is available at this https URL.
[AI-94] Compiling Deterministic Structure into SLM Harnesses
【速读】:该论文旨在解决小语言模型(Small Language Models, SLMs)在企业部署中因认知不对称(epistemic asymmetry)导致的局限性问题:SLMs无法自我修正推理错误,而前沿大语言模型(Large Language Models, LLMs)则因成本过高且受数据主权限制难以大规模使用。其解决方案的核心是提出Semantic Gradient Descent (SGDe),一种基于教师-学生框架的离散语义空间优化方法,通过将智能体工作流编译为包含有向无环图(DAG)拓扑、系统提示和确定性可执行代码的离散执行计划,利用前沿教师模型生成自然语言批评作为方向梯度,迭代优化SLM的工作流产物。关键创新在于将训练过程形式化于PAC学习框架下,借助教师作为统计先验,在仅需3个训练样本的情况下实现收敛,并通过结构共识与能力卸载机制,动态决定子任务是否移交Python运行时执行,从而显著提升SLMs在小样本场景下的准确率(如GSM-Hard衍生测试集上达91.3%至99.3%)。
链接: https://arxiv.org/abs/2604.17450
作者: Zan Kai Chong,Hiroyuki Ohsaki,Bryan Ng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Enterprise deployment of small language models (SLMs) is constrained by epistemic asymmetry: SLMs cannot self-correct reasoning errors, while frontier LLMs are prohibitively costly and face data sovereignty limits for high-volume use. We propose Semantic Gradient Descent (SGDe), a teacher-student framework that compiles agentic workflows into discrete execution plans comprising DAG topologies, system prompts, and deterministic executable code. The trailing “e” distinguishes SGDe from stochastic gradient descent. SGDe operates in a discrete semantic space where a frontier teacher generates natural-language critiques acting as directional gradients to iteratively refine the SLM’s workflow artefacts. We formalise SGDe within a PAC learning framework, establishing sample-complexity bounds that enable convergence with as few as three training examples on targeted synthetic tasks by leveraging the teacher as a statistical prior. On a GSM-Hard-derived test set built via adversarial synthesis, compiled workflows reach 91.3% accuracy at m=5 and 99.3% at m=3 within the small-m regime motivated by Corollary 1, a +26.3% to +34.3% absolute improvement over state-of-the-art prompt optimisers. In the emerging paradigm of harness engineering, SGDe treats placement of deterministic code (which subtasks to delegate to a Python runtime versus retain as LLM calls) as a trace-driven, per-node optimisation target, generalising the whole-problem offloading of PAL and PoT. The teacher compiles two complementary deterministic structures: capability offloading, which delegates subtasks to Python when the SLM cannot execute them reliably, and structural consensus, which wraps variance-limited reasoning steps in fan-out/fan-in subgraphs aggregated by deterministic voting.
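The abstract's "structural consensus" mechanism wraps a variance-limited reasoning step in a fan-out/fan-in subgraph aggregated by deterministic voting. A minimal sketch of that fan-in aggregation (the function name and toy samples are illustrative, not from the paper):

```python
from collections import Counter

def structural_consensus(samples):
    """Deterministic fan-in: majority vote over the outputs of a fan-out
    subgraph that re-runs one variance-limited reasoning step.
    Ties resolve to the first-encountered answer (Counter ordering)."""
    winner, _ = Counter(samples).most_common(1)[0]
    return winner

# three fan-out runs of the same step; one disagrees
vote = structural_consensus(["42", "42", "17"])
```

The point is that aggregation itself is deterministic code, so repeated runs of the subgraph always reduce to one answer without a further LLM call.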
[AI-95] TransXion: A High-Fidelity Graph Benchmark for Realistic Anti-Money Laundering
【速读】:该论文旨在解决当前反洗钱(Anti-Money Laundering, AML)研究中缺乏真实可信基准测试的问题。现有交易图数据集存在两大局限:一是节点层面语义信息稀疏,仅提供匿名标识符;二是异常注入依赖模板驱动,导致模型评估结果过于乐观且偏向静态结构特征。解决方案的关键在于提出TransXion基准生态系统,其核心创新是通过融合基于实体画像的正常行为模拟与非模板化的随机非法活动合成,同时建模持久的实体特征和条件化交易行为,从而能够有效评估“偏离角色”的异常情况——即个体行为与其社会经济背景不符的情形。该方法显著提升了基准的真实性与挑战性,实证表明TransXion在多个检测模型上均表现出更低的识别性能,验证了其作为更可靠AML检测研究平台的价值。
链接: https://arxiv.org/abs/2604.17420
作者: Keyang Chen,Mingxuan Jiang,Yongsheng Zhao,Zeping Li,Zaiyuan Chen,Weiqi Luo,Zhixin Li,Sen Liu,Yinan Jing,Guangnan Ye,Xihong Wu,Hongfeng Chai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Money laundering poses severe risks to global financial systems, driving the widespread adoption of machine learning for transaction monitoring. However, progress remains stifled by the lack of realistic benchmarks. Existing transaction-graph datasets suffer from two pervasive limitations: (i) they provide sparse node-level semantics beyond anonymized identifiers, and (ii) they rely on template-driven anomaly injection, which biases benchmarks toward static structural motifs and yields overly optimistic assessments of model robustness. We propose TransXion, a benchmark ecosystem for Anti-Money Laundering (AML) research that integrates profile-aware simulation of normal activity with stochastic, non-template synthesis of illicit activity. TransXion jointly models persistent entity profiles and conditional transaction behavior, enabling evaluation of “out-of-character” anomalies where observed activity contradicts an entity’s socio-economic context. The resulting dataset comprises approximately 3 million transactions among 50,000 entities, each endowed with rich demographic and behavioral attributes. Empirical analyses show that TransXion reproduces key structural properties of payment networks, including heavy-tailed activity distributions and localized subgraph structure. Across a diverse array of detection models spanning multiple algorithmic paradigms, TransXion yields substantially lower detection performance than widely used benchmarks, demonstrating increased difficulty and realism. TransXion provides a more faithful testbed for developing context-aware and robust AML detection methods. The dataset and code are publicly available at this https URL.
[AI-96] Project resilience as network robustness
【速读】:该论文旨在解决工程项目中因关键人员流失而导致的脆弱性评估问题,即如何更准确地衡量项目在核心成员缺失情况下的韧性。现有方法要么过于乐观(仅提供最佳情况估计),要么未能捕捉项目任务的碎片化特征,从而导致估计偏差和不切实际的后果。解决方案的关键在于引入基于网络鲁棒性的新评估方法,通过分析项目成员间任务分工与协作关系的拓扑结构,系统刻画项目对关键人员依赖的敏感度,从而提供更稳健、一致且贴近现实的项目韧性估计。
链接: https://arxiv.org/abs/2604.17417
作者: Sebastiano A. Piccolo,Giorgio Terracina
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Engineering projects are the result of the combined effort of their members. Yet, it has been documented that labor division within projects is unevenly distributed: some project members are specialists undertaking only a few tasks, whereas others are generalists and are responsible for the success of many tasks. Moreover, the latter are often facilitators of project integration. Such a workload distribution prompts one question: how resilient is a project to key personnel loss? Far from being a theoretical problem, the reliance of a project on a few key people can lead to severe economic losses and delays. We argue that current methods to estimate such a risk are unsatisfactory: some methods offer a best-case estimate and are, therefore, too optimistic; other methods fail to capture project fragmentation, leading to biased estimates and unrealistic consequences in many settings. In this paper, we develop a novel method to assess project vulnerability by looking at it through the lens of network robustness. We compare our method against existing alternatives and show that it offers better and more consistent estimates of project resilience to personnel loss.
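The network-robustness framing above can be made concrete with a targeted-removal curve: drop members in order of workload (generalists first) and track how many tasks still have someone responsible. This is only a didactic sketch of the general idea, not the paper's specific method; the data structure and removal order are illustrative assumptions.

```python
from collections import Counter

def coverage_after_loss(assignments, removed):
    """Fraction of tasks that still have at least one remaining member.
    assignments: dict mapping task -> set of responsible members."""
    alive = [t for t, members in assignments.items() if members - removed]
    return len(alive) / len(assignments)

def resilience_curve(assignments):
    """Remove members in descending workload (generalists first) and record
    task coverage after each removal: a simple robustness profile."""
    load = Counter(m for members in assignments.values() for m in members)
    removed, curve = set(), []
    for member, _ in load.most_common():
        removed.add(member)
        curve.append(coverage_after_loss(assignments, removed))
    return curve

# toy project: one generalist ("g") touches every task, plus two specialists
tasks = {"design": {"g", "s1"}, "build": {"g", "s2"}, "test": {"g"}}
curve = resilience_curve(tasks)
```

Losing the single generalist already drops coverage to 2/3 here, which is the kind of fragility a best-case estimate would hide.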
[AI-97] The Open-Weight Paradox: Why Restricting Access to AI Models May Undermine the Safety It Seeks to Protect
【速读】:该论文试图解决当前开放权重人工智能(Open-weight AI)模型治理中“开放即风险、限制即安全”的二元对立框架所导致的治理困境,特别是限制访问可能加剧全球南方国家在AI主权能力建设上的不平等,并促使技术向缺乏监管的环境扩散。其解决方案的关键在于引入多层协同治理机制,即通过硬件层治理(如芯片级认证机制FlexHEG、可信执行环境、机密计算)与软件层防护相结合,构建纵深防御体系;同时主张建立类似国际原子能机构(IAEA)功能的多边制度架构,以实现对AI这一双用途技术的有效规制,尤其需防范硬件控制权被用于国内压制,从而在保障技术开放性的同时提升安全性。
链接: https://arxiv.org/abs/2604.17413
作者: Vinicius Santana Gomes
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 23 pages, 2 figures, 1 table. Preprint also deposited at Zenodo (DOI: https://doi.org/10.5281/zenodo.19484877 ) on 2026-04-09. Licensed under CC BY 4.0
Abstract:The governance of open-weight artificial intelligence (AI) models has been framed as a binary choice: openness as risk, restriction as safety. This paper challenges that framing, arguing that access restrictions, without governed alternatives, may displace risks rather than reduce them. The global concentration of compute infrastructure makes open-weight models one of the most viable pathways to sovereign AI capacity in the Global South; restricting such access deepens asymmetries while driving proliferation into unsupervised settings. This analysis proposes that hardware-layer governance, including chip-level attestation mechanisms such as FlexHEG, trusted execution environments, confidential computing, and complementary software-layer safeguards, offers a defense-in-depth alternative to the current binary. A threat model taxonomy mapping misuse vectors to hardware, software, institutional, and liability layers illustrates why no single governance mechanism suffices. To operationalize this approach, the paper argues that effective AI governance as a dual-use technology will likely require a multilateral institutional architecture functionally analogous, though not identical, to the role performed by the IAEA in the nuclear domain, with explicit safeguards against the co-option of hardware controls for domestic repression. The relevant policy question is how to make openness safer through technical and institutional design while addressing the transition realities of legacy hardware, attestation at scale, and civil liberties protection.
[AI-98] EvoMaster: A Foundational Agent Framework for Building Evolving Autonomous Scientific Agents at Scale
【速读】:该论文旨在解决当前科学发现中代理框架(agent framework)普遍存在的静态性、范围狭窄以及缺乏从试错中学习能力的问题,从而无法有效模拟人类科学家的迭代式探索过程。其解决方案的关键在于提出 EvoMaster——一个面向大规模 Agentic Science 的基础演化代理框架,其核心机制是通过持续自我进化(continuous self-evolution),使代理能够迭代优化假设、进行自我批判,并在实验周期中逐步积累知识,从而忠实复现人类科学探究的本质。这一设计显著提升了代理在多学科场景下的适应性和自主性,且具备高度可扩展性,仅需约 100 行代码即可构建跨领域的自演化科学代理。
链接: https://arxiv.org/abs/2604.17406
作者: Xinyu Zhu,Yuzhu Cai,Zexi Liu,Cheng Wang,Fengyang Li,Wenkai Jin,Wanxu Liu,Zehao Bing,Bingyang Zheng,Jingyi Chai,Shuo Tang,Rui Ye,Yuwen Du,Xianghe Pang,Yaxin Du,Tingjia Miao,Yuzhi Zhang,Ruoxue Liao,Zhaohan Ding,Linfeng Zhang,Yanfeng Wang,Weinan E,Siheng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 3 figures
Abstract:The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific method is inherently iterative, existing agent frameworks are predominantly static, narrowly scoped, and lack the capacity to learn from trial and error. To bridge this gap, we present EvoMaster, a foundational evolving agent framework engineered specifically for Agentic Science at Scale. Driven by the core principle of continuous self-evolution, EvoMaster empowers agents to iteratively refine hypotheses, self-critique, and progressively accumulate knowledge across experimental cycles, faithfully mirroring human scientific inquiry. Crucially, as a domain-agnostic base harness, EvoMaster is exceptionally easy to scale up – enabling developers to build and deploy highly capable, self-evolving scientific agents for arbitrary disciplines in approximately 100 lines of code. Built upon EvoMaster, we incubated the SciMaster ecosystem across domains such as machine learning, physics, and general science. Evaluations on four authoritative benchmarks (Humanity’s Last Exam, MLE-Bench Lite, BrowseComp, and FrontierScience) demonstrate that EvoMaster achieves state-of-the-art scores of 41.1%, 75.8%, 73.3%, and 53.3%, respectively. It comprehensively outperforms the general-purpose baseline OpenClaw with relative improvements ranging from +159% to +316%, robustly validating its efficacy and generality as the premier foundational framework for the next generation of autonomous scientific discovery. EvoMaster is available at this https URL.
[AI-99] STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering SIGIR2026
【速读】:该论文旨在解决多跳问答(Multi-hop Question Answering, MHQA)中现有方法存在的两大问题:一是模型过早地基于表面实体进行决策,导致在词法歧义下难以正确分解问题;二是忽视推理步骤间的逻辑依赖关系,造成执行过程缺乏协调。其解决方案的关键在于提出STRIDE框架,该框架通过分离策略规划(strategic planning)、动态控制(dynamic control)和基于事实的执行(grounded execution)三个模块实现结构化推理。其中,Meta-Planner首先构建与实体无关的推理骨架(reasoning skeleton),延迟实体锚定以减少歧义错误;Supervisor则依据逻辑依赖关系调度子问题执行,支持并行优化与必要时的串行协调,同时动态决定是否检索新证据或基于已有事实推理,从而避免冗余查询和错误传播。此设计显著提升了MHQA系统的准确性与鲁棒性。
链接: https://arxiv.org/abs/2604.17405
作者: Wei Chen,Lili Zhao,Zhi Zheng,HuiJun Hou,Tong Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by SIGIR 2026 Full Paper. The code repository is available at this https URL
Abstract:Multi-hop question answering (MHQA) enables accurate answers to complex queries by retrieving and reasoning over evidence dispersed across multiple documents. Existing MHQA approaches mainly rely on iterative retrieval-augmented generation, which suffer from the following two major issues. 1) Existing methods prematurely commit to surface-level entities rather than underlying reasoning structures, making question decomposition highly vulnerable to lexical ambiguity. 2) Existing methods overlook the logical dependencies among reasoning steps, resulting in uncoordinated execution. To address these issues, we propose STRIDE, a framework that separates strategic planning, dynamic control, and grounded execution. At its core, a Meta-Planner first constructs an entity-agnostic reasoning skeleton to capture the abstract logic of the query, thereby deferring entity grounding until after the reasoning structure is established, which mitigates disambiguation errors caused by premature lexical commitment. A Supervisor then orchestrates sub-question execution in a dependency-aware manner, enabling efficient parallelization where possible and sequential coordination when necessary. By dynamically deciding whether to retrieve new evidence or infer from existing facts, it avoids redundant queries and error propagation, while fusing cross-branch information and reformulating failed queries to enhance robustness. Grounded fact extraction and logical inference are delegated to specialized execution modules, ensuring faithfulness through explicit separation of retrieval and reasoning. We further propose STRIDE-FT, a modular fine-tuning framework that uses self-generated execution trajectories from STRIDE, requiring neither human annotations nor stronger teacher models. Experiments show that STRIDE achieves robust and accurate reasoning, while STRIDE-FT effectively enhances open-source LLMs.
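The Supervisor's dependency-aware orchestration ("efficient parallelization where possible and sequential coordination when necessary") amounts to layering a DAG of sub-questions into parallel batches. A minimal, generic sketch of that scheduling step (not STRIDE's actual implementation; names are illustrative):

```python
def schedule_batches(deps):
    """Group sub-questions into batches: every item in a batch has all its
    prerequisites satisfied by earlier batches, so a batch can run in parallel.
    deps: dict mapping sub-question -> set of prerequisite sub-questions."""
    remaining = {n: set(d) for n, d in deps.items()}
    batches = []
    while remaining:
        ready = sorted(n for n, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cyclic dependency in reasoning skeleton")
        batches.append(ready)
        for n in ready:
            del remaining[n]
        for d in remaining.values():
            d.difference_update(ready)
    return batches

# toy reasoning skeleton: q3 needs q1 and q2; q4 needs q3
plan = {"q1": set(), "q2": set(), "q3": {"q1", "q2"}, "q4": {"q3"}}
batches = schedule_batches(plan)
```

Here q1 and q2 can be retrieved concurrently, while q3 and q4 must wait, mirroring the parallel-where-possible, sequential-where-necessary policy.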
[AI-100] Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination
【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在大语言模型驱动下存在的严重令牌(token)效率低下问题,其根源在于两个相互加剧的因素:一是无结构的并行执行,即所有智能体无论输入是否就绪均同时激活;二是不受限的上下文共享,即每个智能体接收全部累积上下文,而不论其相关性。现有缓解策略(如静态剪枝、分层分解和学习路由)将协调视为结构性分配问题,忽略了其时间维度特性。论文提出相位调度多智能体系统(Phase-Scheduled Multi-Agent Systems, PSMAS),其核心创新在于将智能体激活重构为对共享注意力空间的连续控制,该空间建模于圆形流形(circular manifold)上——每个智能体被赋予一个固定角度相位 θi∈[0,2π],由任务依赖拓扑决定;全局扫掠信号 ϕ(t) 以角速度 ω 旋转,仅激活处于角度窗口 ϵ 内的智能体,其余智能体接收压缩后的上下文摘要,从而显著降低每步的令牌消耗。实验表明,PSMAS 在四个结构化基准和两个非结构化对话场景中实现平均 27.3% 的令牌减少(范围 21.4–34.8%),同时性能仅下降 2.1 个百分点(p < 0.01, n = 500/配置),且相比最强学习路由基线,在令牌节省上提升 5.6 个百分点,性能下降减少 2.0 个百分点,验证了调度与压缩作为独立增益源的有效性。
链接: https://arxiv.org/abs/2604.17400
作者: Mohit Dubey
机构: 未知
类目: Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
备注: 8 pages, pre print, 3 figures
Abstract:Multi-agent systems (MAS) powered by large language models suffer from severe token inefficiency arising from two compounding sources: (i) unstructured parallel execution, where all agents activate simultaneously irrespective of input readiness; and (ii) unrestricted context sharing, where every agent receives the full accumulated context regardless of relevance. Existing mitigation strategies - static pruning, hierarchical decomposition, and learned routing - treat coordination as a structural allocation problem and fundamentally ignore its temporal dimension. We propose Phase-Scheduled Multi-Agent Systems (PSMAS), a framework that reconceptualizes agent activation as continuous control over a shared attention space modeled on a circular manifold. Each agent i is assigned a fixed angular phase theta_i in the range [0, 2*pi], derived from the task dependency topology; a global sweep signal phi(t) rotates at velocity omega, activating only agents within an angular window epsilon. Idle agents receive compressed context summaries, reducing per-step token consumption. We implement PSMAS on LangGraph, evaluate on four structured benchmarks (HotPotQA-MAS, HumanEval-MAS, ALFWorld-Multi, WebArena-Coord) and two unstructured conversational settings, and prove stability, convergence, and optimality results for the sweep dynamics. PSMAS achieves a mean token reduction of 27.3 percent (range 21.4-34.8 percent) while maintaining task performance within 2.1 percentage points of a fully activated baseline (p < 0.01, n = 500 per configuration), and outperforms the strongest learned routing baseline by 5.6 percentage points in token reduction with 2.0 percentage points less performance drop. Crucially, we show that scheduling and compression are independent sources of gain: scheduling alone accounts for 18-20 percentage points of reduction, robust to compression degradation up to alpha = 0.40.
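The activation rule described in the abstract (phase theta_i, sweep phi(t) = omega*t, angular window epsilon) can be sketched directly; the code below is a toy rendering of that rule, with illustrative parameter values, not the paper's LangGraph implementation.

```python
import math

def angular_distance(a, b):
    """Shortest distance between two angles on the circle [0, 2*pi)."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def active_agents(phases, t, omega=1.0, eps=0.5):
    """Agents whose fixed phase lies within the angular window eps of the
    sweep signal phi(t) = omega*t (mod 2*pi) are activated; all other agents
    would receive only compressed context summaries."""
    phi = (omega * t) % (2 * math.pi)
    return [i for i, th in enumerate(phases) if angular_distance(th, phi) <= eps]

# four agents spaced evenly around the circle
phases = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
on_at_0 = active_agents(phases, t=0.0)       # sweep at angle 0
on_at_pi = active_agents(phases, t=math.pi)  # sweep at angle pi
```

With evenly spaced phases and a narrow window, exactly one agent is active at a time, which is where the per-step token savings come from.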
[AI-101] Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning
【速读】:该论文旨在解决现有元推理(meta-reasoning)方法中存在 episodic(片段化)问题,即当前方法仅在单个推理实例内执行复杂元推理流程,而忽视了跨实例的可复用元推理技能积累,导致重复失败模式和持续高元认知努力。其解决方案的关键在于提出元认知巩固(Metacognitive Consolidation)框架,通过将实例级问题求解划分为推理(reasoning)、监控(monitoring)和控制(control)三个角色,生成丰富且可归因的元层级轨迹(meta-level traces),并利用分层多时间尺度更新机制对这些轨迹进行逐步整合,从而形成演化的元知识(meta-knowledge),实现元推理能力的持续积累与提升。
链接: https://arxiv.org/abs/2604.17399
作者: Ziqing Zhuang,Linhai Zhang,Jiasheng Si,Deyu Zhou,Yulan He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated strong reasoning capabilities, and as existing approaches for enhancing LLM reasoning continue to mature, increasing attention has shifted toward meta-reasoning as a promising direction for further improvement. However, most existing meta-reasoning methods remain episodic: they focus on executing complex meta-reasoning routines within individual instances, but ignore the accumulation of reusable meta-reasoning skills across instances, leading to recurring failure modes and repeatedly high metacognitive effort. In this paper, we introduce Metacognitive Consolidation, a novel framework in which a model consolidates metacognitive experience from past reasoning episodes into reusable knowledge that improves future meta-reasoning. We instantiate this framework by structuring instance-level problem solving into distinct roles for reasoning, monitoring, and control to generate rich, attributable meta-level traces. These traces are then consolidated through a hierarchical, multi-timescale update mechanism that gradually forms evolving meta-knowledge. Experimental results demonstrate consistent performance gains across benchmarks and backbone models, and show that performance improves as metacognitive experience accumulates over time.
[AI-102] Study and Improvement of Search Algorithms in Multi-Player Perfect-Information Games
【速读】:该论文旨在解决多玩家博弈(Multiplayer Games)中搜索算法性能不足的问题,特别是在具有完美信息的环境中。传统针对二人零和博弈的最优搜索算法(如Unbounded Minimax)难以直接扩展至多玩家场景,导致效率与准确性下降。解决方案的关键在于将Unbounded Minimax算法从二人零和博弈框架推广至多玩家博弈框架,通过重构评估函数与剪枝策略以适应多方竞争结构,实验证明该方法在性能上优于主流多玩家搜索算法。
链接: https://arxiv.org/abs/2604.17378
作者: Quentin Cohen-Solal
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:
Abstract:In this article, we generalize Unbounded Minimax, the state-of-the-art search algorithm for zero-sum two-player games with perfect information, to the framework of multiplayer games with perfect information. We experimentally show that this generalized algorithm also achieves better performance than the main multiplayer search algorithms.
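For context, the classical backup rule for multiplayer perfect-information search is max-n: each node's value is a payoff vector, and the player to move picks the child maximizing its own component. The sketch below shows that standard rule only, not the paper's Unbounded Minimax generalization; the toy tree and payoffs are invented for illustration.

```python
def maxn(node, player, n_players, children, payoff):
    """Standard max-n backup: at a leaf, return its payoff vector of length
    n_players; otherwise the player to move picks the child whose backed-up
    vector is largest in that player's own component."""
    kids = children(node)
    if not kids:
        return payoff(node)
    best = None
    for child in kids:
        vec = maxn(child, (player + 1) % n_players, n_players, children, payoff)
        if best is None or vec[player] > best[player]:
            best = vec
    return best

# toy 2-player tree: player 0 moves at "root", player 1 at "a"
tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
leaves = {"a1": (3, 1), "a2": (1, 4), "b": (2, 2)}
value = maxn("root", 0, 2, lambda n: tree.get(n, []), leaves.get)
```

Player 0 avoids branch "a" because player 1 would steer it to (1, 4) there, so the backed-up root value is (2, 2): the vector-valued reasoning that any multiplayer generalization must handle.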
[AI-103] T-DuMpRa: Teacher-guided Dual-path Multi-prototype Retrieval-Augmented framework for fine-grained medical image classification
【速读】:该论文旨在解决细粒度医学图像分类中因类别间差异细微和视觉模糊导致的预测不确定性问题,此类场景下传统判别式分类器虽能获得较高整体准确率,却难以区分高度相似类别,从而产生校准不良的预测结果。解决方案的关键在于提出T-DuMpRa框架——一种教师引导的双路径多原型检索增强方法,通过判别分类与多原型检索的联合训练和推理机制实现性能提升:在训练阶段,联合优化交叉熵与监督对比损失以学习余弦兼容的嵌入几何结构,便于可靠原型匹配;同时利用指数移动平均(EMA)教师模型生成平滑表示,并在教师嵌入空间中聚类构建多原型记忆库;在推理阶段,将分类器输出分布与基于余弦匹配原型的相似性分布进行保守置信度门控融合,仅当分类器预测不确定且检索证据明确冲突时激活检索模块,否则保留原有高置信度预测,从而有效提升对视觉模糊病例的判别能力。
链接: https://arxiv.org/abs/2604.17360
作者: Zixuan Tang,Shen Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-grained medical image classification is challenged by subtle inter-class variations and visually ambiguous cases, where confidence estimates often exhibit uncertainty rather than being overconfident. In such scenarios, purely discriminative classifiers may achieve high overall accuracy yet still fail to distinguish between highly similar categories, leading to miscalibrated predictions. We propose T-DuMpRa, a teacher-guided dual-path multi-prototype retrieval-augmented framework, where discriminative classification and multi-prototype retrieval jointly drive both training and prediction. During training, we jointly optimize cross-entropy and supervised contrastive objectives to learn a cosine-compatible embedding geometry for reliable prototype matching. We further employ an exponential moving average (EMA) teacher to obtain smoother representations and build a multi-prototype memory bank by clustering teacher embeddings in the teacher embedding space. Our framework is plug-and-play: it can be easily integrated into existing classification models by constructing a compact prototype bank, thereby improving performance on visually ambiguous cases. At inference, we combine the classifier’s predicted distribution with a similarity-based distribution computed via cosine matching to prototypes, and apply a conservative confidence-gated fusion that activates retrieval only when the classifier’s prediction is uncertain and the retrieval evidence is decisive and conflicting, otherwise keeping confident predictions unchanged. On HAM10000 and ISIC2019, our method yields 0.68%-0.21% and 0.44%-2.69% improvements on 5 different backbones. And visualization analysis proves our model can enhance the model’s ability to handle visually ambiguous cases.
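The conservative confidence-gated fusion at inference (retrieval activates only when the classifier is uncertain and the retrieval evidence is decisive and conflicting) can be sketched as below. The gate thresholds, mixing weight, and function names are illustrative assumptions, not values from the paper.

```python
def gated_fusion(cls_probs, proto_probs, tau_cls=0.7, tau_proto=0.7, alpha=0.5):
    """Keep the classifier's prediction unless it is uncertain (max prob
    below tau_cls) while the prototype-retrieval distribution is decisive
    (max prob at least tau_proto) AND the two disagree; only then blend the
    two distributions and re-decide."""
    c_top = max(range(len(cls_probs)), key=cls_probs.__getitem__)
    p_top = max(range(len(proto_probs)), key=proto_probs.__getitem__)
    uncertain = cls_probs[c_top] < tau_cls
    decisive = proto_probs[p_top] >= tau_proto
    if uncertain and decisive and c_top != p_top:
        fused = [alpha * c + (1 - alpha) * p for c, p in zip(cls_probs, proto_probs)]
        return max(range(len(fused)), key=fused.__getitem__)
    return c_top

confident = gated_fusion([0.9, 0.1], [0.2, 0.8])   # confident classifier: kept
rescued = gated_fusion([0.45, 0.55], [0.9, 0.1])   # ambiguous case: retrieval flips it
```

The gate is what makes the scheme plug-and-play: confident predictions pass through untouched, so adding the prototype bank cannot hurt the easy cases.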
[AI-104] PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在模拟患者时缺乏人群层面有效性验证的问题,尤其是其生成的临床情境是否真实反映现实人口分布特征。解决方案的关键在于构建PsychBench——首个针对LLM患者模拟的流行病学审计框架,通过对比28,800个模型生成的个体档案与NHANES和NESARC-III两大权威流行病学数据库,在120个交叉人口亚组中系统评估模型输出的分布一致性。研究发现,尽管模型生成个体在临床描述上高度连贯(test-retest相关系数r > 0.90),但存在显著的“一致性-真实性分离”(coherence-fidelity dissociation):即个体层面看似合理,但整体人群分布被严重压缩(方差压缩达14%–62%),且对特定群体如跨性别女性存在系统性低估和偏见,揭示当前训练范式导致的校准偏差和刻板印象编码问题。
链接: https://arxiv.org/abs/2604.17359
作者: Patrick Keough
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures
Abstract:Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, yet population-level validity remains largely untested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, with transgender populations diverging three to five times more than racial differences. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen d = 1.13 to 1.91), consistent with training on clinical corpora with elevated base rates. For transgender women the direction inverts: models capture only 8 to 46 percent of documented minority stress elevation, yielding a -5.42 residual (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. Patterns replicate across US and Chinese architectures, indicating failures tied to current training paradigms rather than isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, algorithmic erasure of genuine need. The patients look right. They do not represent real populations.
[AI-105] Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
【速读】:该论文旨在解决复杂任务下大型语言模型作为多智能体(multi-agent)系统时,算法级与任务级扩展性不足的问题。具体而言,在算法层面,推理过程中因多分支推理导致的跨路径冗余计算(cross-path redundancy)限制了效率提升;在任务层面,现有调度机制未考虑多个智能体的存在,无法实现资源按贡献分配的优化。解决方案的关键在于提出Hive多智能体基础设施,其核心创新包括:一是Logits Cache机制,通过复用冗余采样路径中的中间logits缓解算法级冗余;二是Agent-Aware Scheduling策略,根据各智能体的任务贡献动态分配计算与KV缓存资源,从而实现任务级的高效调度。实验表明,该方案在重采样场景下平均加速比达1.11×–1.76×,热点缺失率降低33%–51%。
链接: https://arxiv.org/abs/2604.17353
作者: Zizhang Luo,Yuhao Luo,Youwei Xiao,Yansong Xu,Runlin Guo,Yun Liang
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Large language models are increasingly deployed as complex agentic systems that scale with task complexity. While prior work has extensively explored model- and system-level scaling, algorithm- and task-level scaling remain largely unaddressed, constraining the full potential of agentic systems. At the algorithm level, allocating additional inference-time computation can enhance workflow capacity but introduces cross-path redundancy: overlapping computations across multiple reasoning branches. At the task level, complex tasks can be decomposed into subproblems and delegated across multiple agents for improved scalability and parallelism. However, existing infrastructures’ scheduling is unaware of the existence of multiple agents, missing opportunities to optimize resource allocation. We propose Hive, a multi-agent infrastructure that enables algorithm- and task-level scaling. Hive features a description frontend that captures per-agent behavior and supports test-time scaling algorithms. Leveraging this specification, our backend introduces two key mechanisms: Logits Cache that reuses intermediate logits across redundant sampling paths to mitigate cross-path redundancy at the algorithm level, and Agent-Aware Scheduling that efficiently allocates compute and KV-cache resources according to agent contributions at the task level. Experiments show that Logits Cache achieves an average speedup of 1.11×-1.76× for re-sampling, and Agent-Aware Scheduling reduces the hotspot miss rate by 33%-51%. Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) ACM classes: I.2.7; J.7 Cite as: arXiv:2604.17353 [cs.AI] (or arXiv:2604.17353v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.17353
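The idea behind a logits cache (reusing the intermediate logits of a shared prefix across redundant sampling paths) can be illustrated with a toy prefix-keyed cache. This is a conceptual sketch under invented names, not Hive's implementation; the stand-in "forward pass" just records how often real computation would run.

```python
class LogitsCache:
    """Toy prefix-keyed cache: the forward pass for a shared token prefix is
    computed once, then reused by every sampling branch with that prefix."""
    def __init__(self, forward_fn):
        self.forward_fn = forward_fn  # the expensive model call to avoid
        self.store = {}
        self.misses = 0               # how many real forward passes ran

    def logits(self, token_prefix):
        key = tuple(token_prefix)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.forward_fn(token_prefix)
        return self.store[key]

# hypothetical stand-in for a model forward pass over a token prefix
fake_forward = lambda toks: [len(toks), sum(toks)]
cache = LogitsCache(fake_forward)
for _ in range(8):                    # eight sampling branches, same prefix
    _ = cache.logits([1, 2, 3])
```

Eight branches trigger a single forward computation, which is the source of the re-sampling speedups the abstract reports.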
[AI-106] SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization ACL2026
【速读】:该论文旨在解决自动化仿真器构建中因分布保真度不足而导致的可靠性问题,区别于通用代码生成任务。其核心挑战在于长时程大语言模型(LLM)代理在执行过程中出现的两种失效模式:上下文漂移(contextual drift)和由结构误差与参数误差混淆引发的优化不稳定性。解决方案的关键在于提出SOCIA-EVO——一种双锚定进化框架,通过三个创新机制实现:(1) 引入静态蓝图以强制施加经验约束;(2) 采用双层优化策略解耦结构优化与参数校准;(3) 构建自校正策略知识库(Strategy Playbook),基于贝叶斯加权检索动态管理修复假设。该方法通过执行反馈 falsify 无效策略,从而实现鲁棒收敛,生成在统计上与观测数据一致的仿真器。
链接: https://arxiv.org/abs/2604.17351
作者: Yuncheng Hua,Sion Weatherhead,Mehdi Jafari,Hao Xue,Flora D. Salim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted to the ACL 2026 Main Conference
Abstract:Automated simulator construction requires distributional fidelity, distinguishing it from generic code generation. We identify two failure modes in long-horizon LLM agents: contextual drift and optimization instability arising from conflating structural and parametric errors. We propose SOCIA-EVO, a dual-anchored evolutionary framework. SOCIA-EVO introduces: (1) a static blueprint to enforce empirical constraints; (2) a bi-level optimization to decouple structural refinement from parameter calibration; and (3) a self-curating Strategy Playbook that manages remedial hypotheses via Bayesian-weighted retrieval. By falsifying ineffective strategies through execution feedback, SOCIA-EVO achieves robust convergence, generating simulators that are statistically consistent with observational data. The code and data of SOCIA-EVO are available here: this https URL.
[AI-107] Formal Foundations of Agentic Business Process Management
【Quick Read】: This paper addresses coordination and goal alignment among multiple autonomous decision-makers (agents) in agentic Business Process Management (Agentic BPM) systems: without full control over the agents, how can formal means ensure that agents pursuing their own goals still effectively support the overall process objectives? The key to the solution is to augment the process specifications of traditional BPM with explicit goals, so that agents, under reasonable assumptions about the behavior of other agents, adopt appropriate strategies to achieve those goals on a best-effort basis; at the same time, the organization enacting the process can use these specifications to set guardrails at the strategy level, steering agent decision-making and ensuring consistent, controllable process execution.
Link: https://arxiv.org/abs/2604.17347
Authors: Giuseppe De Giacomo, Timotheus Kampik, Lukas Kirchdorfer, Marco Montali, Christoph Weinhuber
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Just like traditional BPM systems, agentic BPM systems are built around a specification of the process under consideration. Their distinguishing feature, however, is that the execution of the process is driven by multiple autonomous decision-makers, referred to as agents. Since such agents cannot be fully controlled, the process specification is augmented with explicit objectives, or goals, assigned to the participating agents. Agents then pursue these goals, at least to the best of their efforts, under suitable assumptions on the behavior of others, by adopting appropriate strategies. Centrally, the organization enacting the process can use these specifications to provide guardrails on the decision-making capabilities of agents at the strategy level. This paper sets up the mathematical foundations of such systems in three key settings and analyzes four foundational problems of agentic BPM.
[AI-108] AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
【Quick Read】: This paper tackles the high computational cost and latency caused by redundant search steps during multi-step interaction in agentic retrieval-augmented generation (Agentic RAG) systems. Existing methods limit search depth to reduce overhead, but this often leads to underexploration and hurts accuracy on complex questions. The key to the solution is AutoSearch, a reinforcement learning (RL) framework that evaluates the usefulness of each search step via self-generated intermediate answers and uses a self-answering mechanism to identify the minimal sufficient search depth at which accuracy is reached; it further introduces reward mechanisms to stabilize search behavior and improve answer quality on complex questions, achieving a superior accuracy-efficiency trade-off.
Link: https://arxiv.org/abs/2604.17337
Authors: Jingbo Sun, Wenyue Chong, Songjun Tu, Qichao Zhang, Yaocheng Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Agentic retrieval-augmented generation (RAG) systems enable large language models (LLMs) to solve complex tasks through multi-step interaction with external retrieval tools. However, such multi-step interaction often involves redundant search steps, incurring substantial computational cost and latency. Prior work limits search depth (i.e., the number of search steps) to reduce cost, but this often leads to underexploration of complex questions. To address this, we first investigate how search depth affects accuracy and find a minimal sufficient search depth that defines an accuracy-efficiency trade-off, jointly determined by question complexity and the agent’s capability. Furthermore, we propose AutoSearch, a reinforcement learning (RL) framework that evaluates each search step via self-generated intermediate answers. By a self-answering mechanism, AutoSearch identifies the minimal sufficient search depth and promotes efficient search by rewarding its attainment while penalizing over-searching. In addition, reward mechanisms are introduced to stabilize search behavior and improve answer quality on complex questions. Extensive experiments on multiple benchmarks show that AutoSearch achieves a superior accuracy-efficiency trade-off, alleviating over-searching while preserving search quality.
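The depth-aware incentive described above (rewarding attainment of the minimal sufficient search depth while penalizing over-searching) might be sketched as a simple reward function. The exact reward AutoSearch uses is not reproduced here, so the function shape, penalty term, and parameter names are all assumptions for illustration.

```python
def depth_aware_reward(correct: bool, steps_used: int,
                       minimal_sufficient_depth: int,
                       over_search_penalty: float = 0.1) -> float:
    """Hypothetical AutoSearch-style reward for one rollout.

    correct: whether the final answer matched the reference.
    steps_used: number of search-tool calls the agent issued.
    minimal_sufficient_depth: smallest depth at which a self-generated
        intermediate answer was already correct.
    """
    if not correct:
        return 0.0
    reward = 1.0  # base reward for a correct answer
    # Penalize each search step beyond the minimal sufficient depth,
    # discouraging over-searching without punishing necessary exploration.
    extra_steps = max(0, steps_used - minimal_sufficient_depth)
    reward -= over_search_penalty * extra_steps
    return max(reward, 0.0)
```

Under this shaping, an agent that stops exactly at the minimal sufficient depth earns the full reward, while one that keeps searching after convergence earns progressively less.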
[AI-109] Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
【Quick Read】: This paper addresses the length problem in sequence-level relative reinforcement learning: responses of different lengths lack inherent comparability during training, which limits model performance. Existing methods partially alleviate length-related phenomena but do not fundamentally resolve the incomparability of the comparison unit. The paper reframes the length problem as a comparison-unit construction problem rather than a mere loss-scaling or normalization bias; the key to the solution is a sample-construction-based training framework that, via dual-track synchronous generation, prefix inheritance, and segment masking, proactively constructs equal-length, alignable, and comparable training segments during generation, avoiding the complexity of post-hoc corrections to unequal-length responses and improving training stability and effectiveness.
Link: https://arxiv.org/abs/2604.17328
Authors: Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Huiming Yang, Sibo Wang, Linglin Liao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods partially alleviate length-related phenomena, a more fundamental issue remains insufficiently characterized: the comparison units used during training lack inherent comparability. Building on this observation, we propose a new perspective: the length problem should not be viewed merely as a loss-scaling or normalization bias, but rather as a comparison-unit construction problem. We further establish a sample-construction-based training framework that, instead of applying post-hoc corrections to unequal-length responses, proactively constructs equal-length, alignable, and comparable training segments during generation. Within this framework, we propose EqLen, a concrete method applicable to group-relative comparison algorithms such as GRPO, GSPO, and RLOO. Through dual-track synchronous generation, prefix inheritance, and segment masking, EqLen efficiently collects effective equal-length training segments and enables stable training.
[AI-110] SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention
【Quick Read】: This paper targets two key problems facing graph Transformers on molecular and long-range reasoning tasks: over-smoothing and attention entropy degeneration. The analysis shows these pathologies share a root cause with attention sinks in large language models: the sum-to-one constraint of softmax attention forces every node to assign attention weight even when no informative connection exists. The core innovation of SigGate-GT is to add learned per-head sigmoid gates within the GraphGPS framework that modulate the attention output element-wise, allowing some heads to actively suppress activations of uninformative connections toward zero and thereby selectively silence irrelevant information. Experiments on five standard benchmarks show significant gains over the baseline, including a new state of the art on ogbg-molhiv (82.47% ROC-AUC), while mitigating over-smoothing, increasing attention entropy, and stabilizing training.
Link: https://arxiv.org/abs/2604.17324
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 2 figures, 15 tables
Abstract:Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in large language models: softmax attention’s sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in large language models, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE) and sets new state-of-the-art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant gains over GraphGPS across all five datasets (p < 0.05). Ablations show that gating reduces over-smoothing by 30% (mean relative MAD gain across 4-16 layers), increases attention entropy, and stabilizes training across a 10× learning rate range, with about 1% parameter overhead on OGB.
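A minimal single-head sketch of the sigmoid-gated attention mechanism described above, assuming a standard scaled dot-product formulation; the gate parameterization (one logit per output element) is an illustrative simplification of SigGate-GT's learned per-head gates, not the paper's exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention(Q, K, V, gate_logits):
    """Q, K, V: (n_nodes, d) arrays; gate_logits: (n_nodes, d) learned
    pre-gate values. Returns the element-wise sigmoid-gated output."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Standard softmax attention: rows sum to one, so every node must
    # attend somewhere -- the pathology the gate is meant to counteract.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ V
    # Element-wise sigmoid gate: strongly negative logits drive the
    # corresponding outputs toward zero, silencing uninformative heads.
    return sigmoid(gate_logits) * out
```

With gate logits pushed far negative, the head's contribution vanishes even though the softmax weights still sum to one, which is exactly the escape hatch the sum-to-one constraint denies to ungated attention.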
[AI-111] A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions ACL2026
【Quick Read】: This paper addresses the data-scarcity problems facing reinforcement learning (RL) post-training of large language models (LLMs), including the limited availability of high-quality external supervision and the constrained volume of model-generated experience. The key to the solution is a bottom-up hierarchical framework built around three complementary perspectives (data-centric, training-centric, and framework-centric), under which existing methods are systematically categorized, summarized, and analyzed, providing a clear understanding of the design space of data-efficient RL for LLMs and guidance for research in this emerging area.
Link: https://arxiv.org/abs/2604.17312
Authors: Zhiyin Yu, Yuchen Mou, Juncheng Yan, Junyu Luo, Chunchun Chen, Xing Wei, Yunhui Liu, Hongru Sun, Yuxing Zhang, Jun Xu, Yatao Bian, Ming Zhang, Wei Ye, Tieke He, Jie Yang, Guanjie Zheng, Zhonghai Wu, Bo Zhang, Lei Bai, Xiao Luo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026 (Main Conference)
Abstract:Reinforcement learning (RL) has emerged as a powerful post-training paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, reinforcement learning for LLMs faces substantial data scarcity challenges, including the limited availability of high-quality external supervision and the constrained volume of model-generated experience. These limitations make data-efficient reinforcement learning a critical research direction. In this survey, we present the first systematic review of reinforcement learning for LLMs under data scarcity. We propose a bottom-up hierarchical framework built around three complementary perspectives: the data-centric perspective, the training-centric perspective, and the framework-centric perspective. We develop a taxonomy of existing methods, summarize representative approaches in each category, and analyze their strengths and limitations. Our taxonomy aims to provide a clear conceptual foundation for understanding the design space of data-efficient RL for LLMs and to guide researchers working in this emerging area. We hope this survey offers a comprehensive roadmap for future research and inspires new directions toward more efficient and scalable reinforcement learning post-training for LLMs.
[AI-112] Knows: Agent-Native Structured Research Representations
【Quick Read】: This paper addresses the bottleneck that distributing research literature as reader-oriented documents such as PDFs creates for increasingly LLM-assisted and agent-native research workflows: LLM agents must extract fine-grained, task-relevant structured information from lengthy full texts, a process that is inefficient, repetitive, and unstable. The key to the solution is Knows, a lightweight companion specification: a YAML sidecar file (KnowsRecord) that coexists with the original PDF and binds structured claims, evidence, provenance, and verifiable relations, so that LLM agents can consume the structured data directly without any change to the publication itself. The scheme is validated by a deterministic schema linter, and evaluation on 20 papers spanning 14 disciplines shows that weak models (0.8B-2B parameters) gain 29-42 percentage points of accuracy when reading the sidecar while consuming 29-86% fewer input tokens, indicating that it substantially strengthens low-resource LLMs and is ready for adoption at scale.
Link: https://arxiv.org/abs/2604.17309
Authors: Guangsheng Yu, Xu Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This paper serves as a technical report/white paper for the this http URL project (this https URL)
Abstract:Research artifacts are distributed primarily as reader-oriented documents like PDFs. This creates a bottleneck for increasingly agent-assisted and agent-native research workflows, in which LLM agents need to infer fine-grained, task-relevant information from lengthy full documents, a process that is expensive, repetitive, and unstable at scale. We introduce Knows, a lightweight companion specification that binds structured claims, evidence, provenance, and verifiable relations to existing research artifacts in a form LLM agents can consume directly. Knows addresses the gap with a thin YAML sidecar (KnowsRecord) that coexists with the original PDF, requiring no changes to the publication itself, and validated by a deterministic schema linter. We evaluate Knows on 140 comprehension questions across 20 papers spanning 14 academic disciplines, comparing PDF-only, sidecar-only, and hybrid conditions across six LLM agents of varying capacity. Weak models (0.8B–2B parameters) improve from 19–25% to 47–67% accuracy (+29 to +42 percentage points) when reading sidecar instead of PDF, while consuming 29–86% fewer input tokens; an LLM-as-judge re-scoring confirms that weak-model sidecar accuracy (75–77%) approaches stronger-model PDF accuracy (78–83%). Beyond this controlled evaluation, a community sidecar hub at this https URL has already indexed over ten thousand publications and continues to grow daily, providing independent evidence that the format is adoption-ready at scale. Comments: This paper serves as a technical report/white paper for the this http URL project (this https URL) Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2604.17309 [cs.AI] (or arXiv:2604.17309v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2604.17309
[AI-113] SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
【Quick Read】: This paper addresses a limitation of current benchmarks for autonomous agents: they mainly test whether models can invoke provided skills, leaving open whether agents can discover new skills from experience, repair skills after failure, and maintain a coherent skill library over long-horizon task execution. The core innovation of the proposed SkillFlow benchmark is a Domain-Agnostic Execution Flow (DAEF) under which tasks within the same family share a consistent workflow framework, supporting continual skill evolution and transfer. The key to the solution is an Agentic Lifelong Learning protocol: agents start without skills, solve tasks sequentially within each family, externalize lessons as skill patches driven by trajectories and rubrics, and carry the refined library forward to later tasks. Experiments show this mechanism substantially improves task success (e.g., Claude Opus 4.6 rises from 62.65% to 71.08%), while also revealing that high skill usage does not imply high performance, highlighting key bottlenecks in skill discovery, patching, and transfer.
Link: https://arxiv.org/abs/2604.17308
Authors: Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng, Yiming Zhao, Zhen Fang, Qishen Su, Haibo Qiu, Wei Yang, Qingnan Ren, Shun Zou, Wenxuan Huang, Lin Chen, Zehui Chen, Feng Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families in which task construction within each family follows a Domain-Agnostic Execution Flow (DAEF) that defines an agent workflow framework, allowing these tasks to share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
[AI-114] Efficient Test-Time Scaling via Temporal Reasoning Aggregation ACL2026
【Quick Read】: This paper addresses token-inefficient redundant reasoning in test-time scaling of large language models, where a model keeps reasoning pointlessly after it has already reached a correct answer, wasting compute. The key to the solution is TRACE, a training-free framework that judges reasoning convergence by temporally aggregating multi-step evidence rather than relying on single-step confidence signals. Concretely, TRACE combines two complementary signals: answer consistency, which captures the persistence of the predicted answer, and the confidence trajectory, which models the temporal evolution of model confidence. This temporal, combined judgment identifies the point of reasoning convergence more accurately and halts inference promptly, cutting token usage by 25-30% on average while keeping accuracy within 1-2% of full-length reasoning.
Link: https://arxiv.org/abs/2604.17304
Authors: Jiakun Li, Xingwei He, Kefan Li, Hongzheng Chai, Hongyue Yu, Yuan Yuan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Abstract:Test-time scaling improves the reasoning performance of large language models but often results in token-inefficient overthinking, where models continue reasoning beyond what is necessary for a correct answer. Existing dynamic early-exit methods typically rely on single-step confidence signals, which are often unreliable for detecting reasoning convergence in multi-step settings. To mitigate this limitation, we propose TRACE, a training-free framework for efficient test-time scaling that determines when to terminate reasoning based on temporal aggregation of multi-step evidence rather than instantaneous signals. TRACE detects reasoning convergence over time by aggregating two complementary signals across recent reasoning steps: answer consistency, capturing the persistence of predicted answers, and confidence trajectory, modeling the temporal evolution of model confidence. Benefiting from these two factors, TRACE can accurately determine whether the reasoning process has converged, thereby promptly halting inference and effectively avoiding redundant reasoning steps. Extensive experiments on multiple challenging benchmarks show that TRACE reduces reasoning token usage by 25-30% on average while maintaining accuracy within 1-2% of full-length reasoning, consistently outperforming existing dynamic reasoning methods.
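The two signals TRACE aggregates (answer consistency and the confidence trajectory) could be combined into a stopping rule along the following lines. The window size, threshold, and exact aggregation here are hypothetical, since the paper's precise rule is not reproduced in the abstract.

```python
def should_stop(recent_answers, recent_confidences,
                window: int = 3, min_confidence: float = 0.8) -> bool:
    """Hypothetical TRACE-style convergence check.

    Stops when (a) the predicted answer has been identical over the last
    `window` reasoning steps (answer consistency), and (b) model
    confidence is high and non-decreasing over that window (confidence
    trajectory). Both signals must agree before inference halts.
    """
    if len(recent_answers) < window or len(recent_confidences) < window:
        return False  # not enough temporal evidence yet
    answers = recent_answers[-window:]
    confs = recent_confidences[-window:]
    consistent = all(a == answers[0] for a in answers)
    confident = confs[-1] >= min_confidence
    non_decreasing = all(confs[i] <= confs[i + 1] for i in range(window - 1))
    return consistent and confident and non_decreasing
```

Requiring agreement over a window is what distinguishes this temporal rule from single-step early exit: a one-off confidence spike with a flickering answer would not trigger a stop.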
[AI-115] LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics
【Quick Read】: This paper addresses the challenges large language models (LLMs) face in time series reasoning (TSR), where fragmented task definitions and inherently ambiguous benchmarks preclude rigorous evaluation and unified model development. The keys to the solution are: first, a four-level taxonomy of increasing cognitive complexity that formalizes TSR; second, the HiTSR dataset of 83k samples with verified Chain-of-Thought (CoT) trajectories covering diverse task combinations; and third, the LLaTiSA model, which strengthens the temporal perception of vision-language models (VLMs) by integrating visualized patterns with precision-calibrated numerical tables and uses a multi-stage curriculum fine-tuning strategy to achieve strong generalization across TSR tasks and real-world scenarios.
Link: https://arxiv.org/abs/2604.17295
Authors: Yueyang Ding, HaoPeng Zhang, Rui Dai, Yi Wang, Tianyu Zong, Kaikui Liu, Xiangxiang Chu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Comprehensive understanding of time series remains a significant challenge for Large Language Models (LLMs). Current research is hindered by fragmented task definitions and benchmarks with inherent ambiguities, precluding rigorous evaluation and the development of unified Time Series Reasoning Models(TSRMs). To bridge this gap, we formalize Time Series Reasoning (TSR) via a four-level taxonomy of increasing cognitive complexity. We introduce HiTSR, a hierarchical time series reasoning dataset comprising 83k samples with diverse task combinations and verified Chain-of-Thought (CoT) trajectories. Leveraging HiTSR, we propose LLaTiSA, a strong TSRM that integrates visualized patterns with precision-calibrated numerical tables to enhance the temporal perception of Vision-Language Models (VLMs). Through a multi-stage curriculum fine-tuning strategy, LLaTiSA achieves superior performance and exhibits robust out-of-distribution generalization across diverse TSR tasks and real-world scenarios. Our code is available at this https URL.
[AI-116] Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair
【Quick Read】: This paper targets the difficulty of register-transfer level (RTL) program repair in hardware design and verification: traditional automatic program repair (APR) methods rely on predefined templates and synthesis, limiting bug coverage, while LLM-based methods, though flexible, are unstable on long RTL code and waveforms due to randomness and context corruption. The key to the solution is Clover, a neural-symbolic agentic framework that orchestrates code manipulations via a structured search strategy and introduces stochastic tree-of-thoughts as a test-time scaling mechanism, organizing the main agent's context as a search tree to balance exploration and exploitation; a dedicated RTL toolbox further strengthens the agents' interaction with the debugging environment. On the benchmark, Clover repairs 96.8% of bugs, markedly outperforming both purely traditional and LLM-based baselines in repair success rate and reliability.
Link: https://arxiv.org/abs/2604.17288
Authors: Zizhang Luo, Yansong Xu, Runlin Guo, Fan Cui, Kexing Zhou, Mile Xia, Hongyuan Hou, Yuhao Luo, Yun Liang
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:RTL program repair remains a critical bottleneck in hardware design and verification. Traditional automatic program repair (APR) methods rely on predefined templates and synthesis, limiting their bug coverage. Large language models (LLMs) and coding agents based on them offer flexibility but suffer from randomness and context corruption when handling long RTL code and waveforms. We present Clover, a neural-symbolic agentic harness that orchestrates RTL repair as a structured search over code manipulations to explore a validated solution for the bug. Recognizing that different repair operations favor distinct strategies, Clover dynamically dispatches tasks to specialized LLM agents or symbolic solvers. At its core, Clover introduces stochastic tree-of-thoughts, a test-time scaling mechanism that manages the main agent’s context as a search tree, balancing exploration and exploitation for reliable outcomes. An RTL-specific toolbox further empowers agents to interact with the debugging environment. Evaluated on the RTL-repair benchmark, Clover fixes 96.8% of bugs within a fixed time limit, covering 94% and 63% more bugs than both pure traditional and LLM-based baselines, respectively, while achieving an average pass@1 rate of 87.5%, demonstrating high reliability and effectiveness.
[AI-117] HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
【Quick Read】: This paper addresses the cascading failures that ungrounded hallucinations from generative AI cause in practical GUI-agent applications, particularly the lack of fine-grained diagnosis, reliable evaluation, and targeted mitigation of hallucinations in VLM-driven GUI automation. The key to the solution is HalluClear, a comprehensive suite for hallucination mitigation whose core components are: (1) a GUI-specific hallucination taxonomy derived from empirical failure analysis; (2) a three-stage evaluation workflow calibrated with an expert-annotated benchmark and ensemble credibility estimation, improving VLM-as-a-judge reliability; and (3) a lightweight continual post-training scheme based on closed-loop structured reasoning, with cold-start initialization for both generalist and GUI-specialist agents. Experiments show that post-training on only 9K samples significantly reduces hallucination rates and improves grounding and action fidelity, offering a compute-efficient path to robust GUI automation.
Link: https://arxiv.org/abs/2604.17284
Authors: Chao Jin, Wenkui Yang, Hao Sun, Yuqi Liao, Qianyi Jiang, Kai Zhou, Jie Cao, Ran He, Huaibo Huang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 47 pages, 44 figures
Abstract:While progress in GUI agents has been largely driven by industrial-scale training, ungrounded hallucinations often trigger cascading failures in real-world deployments. Unlike general VLM domains, the GUI agent field lacks a hallucination-focused suite for fine-grained diagnosis, reliable evaluation, and targeted mitigation. To bridge this gap, we introduce HalluClear, a comprehensive suite for hallucination mitigation in GUI agents as a complement to computation-intensive scaling. HalluClear comprises: (1) a GUI-specific hallucination taxonomy derived from empirical failure analysis; (2) a calibrated three-stage evaluation workflow which enhances VLM-as-a-judge reliability via expert-annotated benchmarking and ensemble credibility estimation; and (3) a mitigation scheme based on closed-loop structured reasoning, enabling lightweight continual post-training with cold-start initialization for both generalist and GUI-specialist agents. Experiments across representative agents and public benchmarks demonstrate that post-training on only 9K samples within our suite can significantly reduce hallucinations, thereby improving grounding and action fidelity, offering a compute-efficient pathway to robust GUI automation.
[AI-118] Fully Analog Resonant Recurrent Neural Network via Metacircuit
【Quick Read】: This paper addresses the problem of realizing scalable, end-to-end, fully analog recurrent neural networks (RNNs) in physical hardware for efficient temporal information processing. The current challenge is faithfully mapping trained digital network models onto physical devices while maintaining high accuracy, low power, and real-time inference. The key to the solution is a fully analog resonant recurrent neural network (R²NN) implemented via a metacircuit architecture of coupled electrical local resonators; a reformulated mechanical-electrical analogy maps the network parameters directly onto physical elements, and jointly trainable global resistive coupling and local resonances generate frequency-dependent negative resistances that shape an impedance landscape steering currents along frequency-selective pathways. Discriminative spectral features can thus be extracted directly, enabling real-time temporal classification of raw analog inputs without analog-to-digital conversion.
Link: https://arxiv.org/abs/2604.17277
Authors: Zixin Zhou, Tianxi Jiang, Menglong Yang, Zhihua Feng, Qingbo He, Shiwu Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Applied Physics (physics.app-ph)
Comments: 23 pages, 6 figures
Abstract:Physical neural networks offer a transformative route to edge intelligence, providing superior inference speed and energy efficiency compared to conventional digital architectures. However, realizing scalable, end-to-end, fully analog recurrent neural networks for temporal information processing remains challenging due to the difficulty of faithfully mapping trained network models onto physical hardware. Here we present a fully analog resonant recurrent neural network (R²NN) implemented via a metacircuit architecture composed of coupled electrical local resonators. A reformulated mechanical-electrical analogy establishes a direct mapping between the R²NN model and metacircuit elements, enabling accurate physical implementation of trained neural network parameters. By integrating jointly trainable global resistive coupling and local resonances, which generate effective frequency-dependent negative resistances, the architecture shapes an impedance landscape that steers currents along frequency-selective pathways. This mechanism enables direct extraction of discriminative spectral features, facilitating real-time temporal classification of raw analog inputs while bypassing analog-to-digital conversion. We demonstrate the cross-domain versatility of this framework using integrated hardware for tactile perception, speech recognition, and condition monitoring. This work establishes a scalable, fully analog paradigm for intelligent temporal processing and paves the way for low-latency, resource-efficient physical neural hardware for edge intelligence.
[AI-119] The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward
【Quick Read】: This paper addresses the absence of continuity in current AI systems: models are powerfully intelligent within a single session but amnesiac across time, so understanding cannot accumulate or carry forward. The core solution is a continuity layer built on a storage primitive called Decomposed Trace Convergence Memory, which decomposes at write time and reconstructs at read time, yielding seven required system characteristics that distinguish continuity from both memory and retrieval. The paper argues that this continuity layer is the most consequential piece of AI infrastructure not yet built, and that it will reshape model architecture, hardware design, and governance.
Link: https://arxiv.org/abs/2604.17273
Authors: Samuel Sameer Tanguturi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages. Position paper. Companion to ATANT v1.0 (arXiv:2604.06710) and ATANT v1.1 (arXiv:2604.10981)
Abstract:The most important architectural problem in AI is not the size of the model but the absence of a layer that carries forward what the model has come to understand. Sessions end. Context windows fill. Memory APIs return flat facts that the model has to reinterpret from scratch on every read. The result is intelligence that is powerful per session and amnesiac across time. This position paper argues that the layer which fixes this, the continuity layer, is the most consequential piece of infrastructure the field has not yet built, and that the engineering work to build it has begun in public. The formal evaluation framework for the property described here is the ATANT benchmark (arXiv:2604.06710), published separately with evaluation results on a 250-story corpus; a companion paper (arXiv:2604.10981) positions this framework against existing memory, long-context, and agentic-memory benchmarks. The paper defines continuity as a system property with seven required characteristics, distinct from memory and from retrieval; describes a storage primitive (Decomposed Trace Convergence Memory) whose write-time decomposition and read-time reconstruction produce that property; maps the engineering architecture to the theological pattern of kenosis and the symbolic pattern of Alpha and Omega, and argues this mapping is structural rather than metaphorical; proposes a four-layer development arc from external SDK to hardware node to long-horizon human infrastructure; examines why the physics limits now constraining the model layer make the continuity layer newly consequential; and argues that the governance architecture (privacy implemented as physics rather than policy, founder-controlled class shares on non-negotiable architectural commitments) is inseparable from the product itself.
[AI-120] Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
【Quick Read】: This paper studies how to optimally allocate a fixed budget of human respondents across multiple estimation tasks when generative AI can supply cheap synthetic survey responses for every task. Although LLMs can produce predictions at scale, their accuracy varies unpredictably across questions, so using LLM output directly introduces uncontrolled bias and variance. The key to the solution is a three-part framework: first, building on Prediction-Powered Inference, a question-specific rectification difficulty is characterized that governs how quickly the estimator's variance decreases with human sample size; second, a closed-form optimal allocation rule is derived that directs more human labels to the tasks where the LLM is least reliable; and third, a meta-learning approach trained on historical data predicts rectification difficulty for entirely new tasks, without pilot data from the target survey. The framework extends to general M-estimation, covering regression coefficients and multinomial logit partworths for conjoint analysis, and validation on two datasets spanning different domains, question types, and LLMs shows it captures 61-79% of the theoretically attainable efficiency gains, achieving roughly 10.5-11.4% MSE reductions without any pilot human data.
Link: https://arxiv.org/abs/2604.17267
Authors: Zikun Ye, Hema Yoganarasimhan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:
Abstract:Large Language Models can generate synthetic survey responses at low cost, but their accuracy varies unpredictably across questions. We study the design problem of allocating a fixed budget of human respondents across estimation tasks when cheap LLM predictions are available for every task. Our framework combines three components. First, building on Prediction-Powered Inference, we characterize a question-specific rectification difficulty that governs how quickly the estimator’s variance decreases with human sample size. Second, we derive a closed-form optimal allocation rule that directs more human labels to tasks where the LLM is least reliable. Third, since rectification difficulty depends on unobserved human responses for new surveys, we propose a meta-learning approach, trained on historical data, that predicts it for entirely new tasks without pilot data. The framework extends to general M-estimation, covering regression coefficients and multinomial logit partworths for conjoint analysis. We validate the framework on two datasets spanning different domains, question types, and LLMs, showing that our approach captures 61-79% of the theoretically attainable efficiency gains, achieving 11.4% and 10.5% MSE reductions without requiring any pilot human data for the target survey.
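For intuition, closed-form rules of this kind often take a Neyman-style square-root form: if each task's variance falls as (difficulty)/n_i, minimizing total variance under a fixed budget gives n_i proportional to the square root of the difficulty. The sketch below implements that textbook rule; the paper's actual closed form may differ, and the function and its rounding scheme are assumptions for illustration.

```python
import math

def allocate_budget(difficulties, total_budget):
    """Allocate human labels across tasks proportional to the square root
    of each task's rectification difficulty (a Neyman-style rule).
    Returns integer allocations summing to total_budget."""
    weights = [math.sqrt(max(d, 0.0)) for d in difficulties]
    total = sum(weights)
    if total == 0:
        # No information about difficulty: split evenly.
        base = [total_budget // len(difficulties)] * len(difficulties)
    else:
        base = [int(total_budget * w / total) for w in weights]
    # Distribute integer-rounding leftovers to the hardest tasks first.
    leftover = total_budget - sum(base)
    order = sorted(range(len(difficulties)),
                   key=lambda i: difficulties[i], reverse=True)
    for i in order[:leftover]:
        base[i] += 1
    return base
```

A task whose rectification difficulty is four times another's receives twice as many human labels under this rule, which matches the qualitative behavior the abstract describes: more human effort where the LLM is least reliable.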
[AI-121] Safe and Policy-Compliant Multi-Agent Orchestration for Enterprise AI
【Quick Read】: This paper addresses the difficulty of making multi-agent enterprise AI systems satisfy hard policy constraints, bounded risk exposure, and comprehensive auditability (e.g., SOX, HIPAA, GDPR compliance). Existing coordination methods (cooperative multi-agent reinforcement learning, consensus protocols, and centralized planners) optimize expected reward while treating constraints implicitly, which can lead to policy violations at deployment. The key to the solution is CAMCO (Constraint-Aware Multi-Agent Cognitive Orchestration), a runtime coordination layer that models multi-agent decision-making as a constrained optimization problem and integrates three core mechanisms: (i) a constraint projection engine that enforces policy-compliant actions via convex projection; (ii) adaptive risk-weighted Lagrangian utility shaping to balance utility and risk; and (iii) an iterative negotiation protocol with provably bounded convergence. Unlike training-time constrained RL, CAMCO operates as deployment-time middleware compatible with any agent architecture and supports direct integration with production policy engines such as OPA.
Link: https://arxiv.org/abs/2604.17240
Authors: Vinil Pasupuleti (1), Shyalendar Reddy Allala (2), Siva Rama Krishna Varma Bayyavarapu (3), Shrey Tyagi (4) ((1) International Business Machines, (2) Global Atlantic Financial, (3) Docusign, (4) Salesforce)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, 3 tables, IEEE conference format
Abstract:Enterprise AI systems increasingly deploy multiple intelligent agents across mission-critical workflows that must satisfy hard policy constraints, bounded risk exposure, and comprehensive auditability (SOX, HIPAA, GDPR). Existing coordination methods - cooperative MARL, consensus protocols, and centralized planners - optimize expected reward while treating constraints implicitly. This paper introduces CAMCO (Constraint-Aware Multi-Agent Cognitive Orchestration), a runtime coordination layer that models multi-agent decision-making as a constrained optimization problem. CAMCO integrates three mechanisms: (i) a constraint projection engine enforcing policy-feasible actions via convex projection, (ii) adaptive risk-weighted Lagrangian utility shaping, and (iii) an iterative negotiation protocol with provably bounded convergence. Unlike training-time constrained RL, CAMCO operates as deployment-time middleware compatible with any agent architecture, with policy predicates designed for direct integration with production engines such as OPA. Evaluation across three enterprise scenarios - including comparison against a constrained Lagrangian MARL baseline - demonstrates zero policy violations, risk exposure below threshold (mean ratio 0.71), 92-97% utility retention, and mean convergence in 2.4 iterations.
[AI-122] Yanasse: Finding New Proofs from Deep Visions Analogies Part 1
【Quick Read】: This paper addresses the difficulty of cross-domain knowledge transfer in automated theorem proving: how to extract effective proof strategies (tactic invocation patterns) from a structurally distant area of mathematics (e.g., probability theory versus representation theory) and reuse them to produce new, verified proofs. The keys to the solution are: first, computing z-scores to identify tactic patterns heavily used in the source area but rare or absent in the target area; second, matching source and target proof states semantically via a GPU-accelerated NP-hard analogy algorithm (built on a deep-vision engine with no dependence on domain-specific knowledge); and third, having an AI reasoning agent semantically adapt, rather than symbol-substitute, the tactic patterns. A key finding is that tactic schemas decompose into a head (domain-gated, rarely transferring) and a modifier (domain-general, often transferring): for example, the head of filter_upwards fails in representation theory for lack of a Filter structure, but its [LIST] modifier transfers cleanly as ext1 + simp [LIST] + rfl. Crucially, the matching engine is entirely domain independent; only the relation extractor is domain-specific, which is what makes cross-domain transfer of proof strategies feasible.
Link: https://arxiv.org/abs/2604.17229
Authors: Alexandre Linhares
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Project Yanasse presents a method for discovering new proofs of theorems in one area of mathematics by transferring proof strategy patterns (e.g., Lean 4 tactic invocation patterns) from a structurally distant area. The system extracts tactic usage distributions across 27 top-level areas of Mathlib (217,133 proof states), computes z-scores to identify tactics that are heavily used in a source area but rare or absent in a target area, matches source and target proof states via GPU-accelerated NP-hard analogy (running on a MacBook Air via Apple’s MPS backend), and then asks an AI reasoning agent to semantically adapt, not symbol-substitute, the source tactic invocation pattern to the target theorem. In this first part of the study, the method is applied to the pair Probability - Representation Theory, producing 4 Lean-verified new proofs out of 10 attempts (40%). The proofs compile with zero sorry declarations. The key finding is that tactic schemas decompose into a head (domain-gated, rarely transfers) and a modifier (domain-general, often transfers): filter_upwards’s head fails in representation theory (no Filter structure), but its [LIST] with ω modifier transfers cleanly as ext1 + simp [LIST] + rfl. Crucially, the underlying matching engine for deep vision analogies is entirely domain independent: the same optimization code for an NP-hard matching that matches chess positions by analogy matches Lean proof states by analogy, without knowing which domain it is processing. Only a relation extractor is domain-specific.
[AI-123] Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition
【速读】:该论文旨在解决临床自然语言处理(Clinical NLP)中从非结构化医疗文本(如入院记录、出院小结和急诊病历)中提取细粒度医学实体(Fine-grained Medical Entity Recognition, MER)的挑战,尤其针对现有大语言模型(LLMs)评估多集中于通用实体类型、难以满足临床实际需求的问题。其解决方案的关键在于:首先采用统一的 LLaMA3 基线模型,系统比较零样本(zero-shot)、少样本(few-shot)与基于低秩适应(LoRA)微调三种学习范式;其次引入基于 BioBERT 预训练模型的词元级与句子级嵌入相似性筛选方法优化少样本示例选择策略;最终通过严格控制变量实现公平比较,验证微调后的 LLaMA3 在 18 类临床相关实体识别上达到 F1 分数 81.24%,显著优于零样本和少样本方法(分别提升 63.11% 和 35.63%)。
链接: https://arxiv.org/abs/2604.17214
作者: Nwe Ni Win(1),Jim Basilakis(1 and 2),Steven Thomas(2),Seyhan Yazar(3 and 4),Laura Pierce(4),Stephanie Liu(5),Paul M. Middleton(2),Nasser Ghadiri(2),X. Rosalind Wang(1 and 2) ((1) Western Sydney University, Sydney, Australia, (2) South Western Emergency Research Institute, Sydney, Australia, (3) Garvan Institute of Medical Research, Sydney, Australia, (4) University of New South Wales, Sydney, Australia (5) Liverpool Hospital, Sydney, Australia)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Extracting clinically relevant information from unstructured medical narratives such as admission notes, discharge summaries, and emergency case histories remains a challenge in clinical natural language processing (NLP). Medical Entity Recognition (MER) identifies meaningful concepts embedded in these records. Recent advancements in large language models (LLMs) have shown competitive MER performance; however, evaluations often focus on general entity types, offering limited utility for real-world clinical needs requiring finer-grained extraction. To address this gap, we rigorously evaluated the open-source LLaMA3 model for fine-grained medical entity recognition across 18 clinically detailed categories. To optimize performance, we employed three learning paradigms: zero-shot, few-shot, and fine-tuning with Low-Rank Adaptation (LoRA). To further enhance few-shot learning, we introduced two example selection methods based on token- and sentence-level embedding similarity, utilizing a pre-trained BioBERT model. Unlike prior work assessing zero-shot and few-shot performance on proprietary models (e.g., GPT-4) or fine-tuning different architectures, we ensured methodological consistency by applying all strategies to a unified LLaMA3 backbone, enabling fair comparison across learning settings. Our results showed that fine-tuned LLaMA3 surpasses zero-shot and few-shot approaches by 63.11% and 35.63%, respectively, achieving an F1 score of 81.24% in granular medical entity extraction.
[AI-124] Layer-wise MoE Routing Locality under Shared-Prefix Code Generation: Token-Identity Decomposition and Compile-Equivalent Fork Redundancy
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在代码生成中多候选路径间专家路由(expert routing)重叠程度及其层间变化规律不明确的问题,尤其关注共享前缀下不同生成路径的Mixture-of-Experts(MoE)机制行为。其解决方案的关键在于:通过基于树搜索的分支生成策略从同一前缀并行生成多个代码候选,并利用编译器输出(gcc -S -O0汇编)进行对齐以控制token身份混淆,从而精确量化相同与不同token位置下的专家路由相似性;进一步发现路由相似性在各层呈现“交叉模式”,即相同token路由一致性普遍高于不同token,但在中间层(L14–20)出现显著下降,而不同token路由相似性则在此区间达到峰值(14倍随机水平),揭示了MoE路由并非完全上下文无关,且存在显著的层依赖性,为优化LLM代码生成中的搜索效率提供了理论依据和改进方向。
链接: https://arxiv.org/abs/2604.17182
作者: Shun-ichiro Hayashi,Daichi Mukunoki,Tetsuya Hoshino,Takahiro Katagiri
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:In LLM-based code generation, multiple code candidates are often generated in parallel from the same prompt – for example, in best-of-N sampling or multi-candidate code completion. These requests can share KV caches through a common prefix, yet the extent to which their Mixture-of-Experts (MoE) expert routing overlaps, and how this overlap varies across layers, remains insufficiently understood. We study Qwen3.5-35B-A3B-FP8 (256 routed experts, top-8) by performing tree-search-based branching generation from a shared prefix (851 completed codes, temperature 0.7) and analyzing the results with a compiler-output-based alignment (gcc -S -O0 assembly) that controls for token-identity confounds. Our findings are threefold: (1) At positions where both sequences generated the same token, Jaccard similarity reaches 0.649 (40x random), while even at positions with different tokens it remains 0.175 (11x random). (2) A layer-wise decomposition reveals a crossing pattern: same-token routing similarity exceeds different-token similarity across all layers, but dips in the middle layers (L14-20), while different-token similarity peaks in the middle layers at 14x random. (3) In tree-search code generation, 67% of successfully compiled codes concentrate in the top three assembly-equivalent groups, and 99.6% of within-group differences consist of comments and blank lines. We show that diversity in top-P search, including beam search, poses a significant challenge. These results refine the “context-independent routing” claim of prior work through layer-wise decomposition and suggest opportunities for improving search efficiency in LLM code generation.
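The overlap metric behind finding (1) is plain Jaccard similarity over per-position top-k expert sets; a minimal sketch (the expert IDs below are invented, the paper's model routes top-8 of 256 experts):

```python
def routing_jaccard(experts_a, experts_b):
    """Jaccard similarity between two top-k routed-expert sets at one position."""
    a, b = set(experts_a), set(experts_b)
    return len(a & b) / len(a | b)

# Invented expert IDs for two generation branches at an aligned position.
same_tok = routing_jaccard([3, 17, 42, 88, 90, 101, 200, 255],
                           [3, 17, 42, 88, 90, 101, 200, 7])    # 7 shared of 9
diff_tok = routing_jaccard([3, 17, 42, 88, 90, 101, 200, 255],
                           [5, 17, 60, 99, 130, 180, 210, 250])  # 1 shared of 15
```

Here `same_tok` comes out at 7/9 ≈ 0.78 and `diff_tok` at 1/15 ≈ 0.07, mirroring the qualitative gap the paper reports between same-token and different-token positions.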
[AI-125] Decentralised Trust and Security Mechanisms for IoT Networks at the Edge: A Comprehensive Review
【速读】:该论文旨在解决物联网(Internet of Things, IoT)边缘计算环境中因设备异构性与资源受限导致的去中心化信任与安全机制缺失问题。其解决方案的关键在于评估和整合多种前沿去中心化架构,如联邦学习(federated learning)、零信任架构(Zero Trust architecture)、轻量级区块链(lightweight blockchain)及分布式神经网络模型(distributed neural models),通过这些技术构建具备隐私保护、抗单点故障能力以及自适应威胁响应机制的边缘安全体系。
链接: https://arxiv.org/abs/2604.17179
作者: Khandoker Ashik Uz Zaman,Mahdi H. Miraz,Mohammed N. M. Ali
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:INTRODUCTION: The proliferation of the amalgamation of IoT and edge computing has increased the demand for decentralised trust and security mechanisms capable of operating across heterogeneous and resource-limited devices. Approaches such as federated learning, Zero Trust architectures, lightweight blockchain and distributed neural models offer alternatives to centralised control. OBJECTIVES: This review examines various state-of-the-art decentralised mechanisms and evaluates their effectiveness in terms of securing IoT networks at the edge. METHODS: Thirty recent studies were analysed to compare how decentralised architectures establish trust, support secure communication and enable intrusion and anomaly detection. Frameworks, such as DFGL-LZTA, SecFedDNN and COSIER were assessed. RESULTS: Decentralised designs enhance privacy, reduce single points of failure and improve adaptive threat response, though challenges remain in scalability, efficiency and interoperability. CONCLUSION: The study identifies key considerations and future research needs for building secure and resilient trust-aware IoT edge ecosystems.
[AI-126] Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models CVPR
【速读】:该论文旨在解决未来航天器自主运行中缺乏能够根据高层次任务意图进行安全轨迹优化的问题,当前的轨迹优化方法仍依赖专家手工设计的公式,难以实现意图驱动的决策。其解决方案的关键在于构建一个意图对齐的航天器制导框架,通过显式的中间抽象(行为序列和航路点约束)将高层推理与安全轨迹优化相连接:首先由基础模型预测意图对齐的行为计划,再由航路点生成模型将其转化为约束条件,最后通过优化计算出安全轨迹。该分层结构实现了可扩展的监督且不牺牲安全性,在近距离操作场景中实验表明,该方法在满足高优先级性能指标的轨迹生成率上比启发式决策提升1.5倍,收敛率超过90%,验证了中间行为抽象作为基础模型推理与安全关键航天器自主控制之间实用接口的有效性。
链接: https://arxiv.org/abs/2604.17176
作者: Yuji Takubo,Simone D’Amico
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: Accepted for Computer Vision and Pattern Recognition Conference (CVPR) 2026, AI4Space Workshop (4-page Short paper). 9 pages, 3 figures (including supplementary materials)
Abstract:Future spacecraft operations require autonomy that can interpret high-level mission intent while preserving safety. However, existing trajectory optimization still relies heavily on expert-crafted formulations and does not support intent-conditioned decision-making. This paper proposes an intent-aligned spacecraft guidance framework that links high-level reasoning and safe trajectory optimization through explicit intermediate abstractions, based on behavior sequences and waypoint constraints. A foundation model first predicts an intent-aligned behavior plan, a waypoint generation model then converts it into waypoint constraints, and the safe trajectory is computed via optimization. This decomposition enables scalable supervision without sacrificing safety. Numerical experiments in close-proximity operation scenarios demonstrate that the proposed pipeline achieves over 90% SCP convergence and yields a 1.5\times higher rate of generating trajectories that satisfy the top intent-prioritized performance criteria than heuristic decision-making. These results support the use of intermediate behavior abstraction as a practical interface between foundation-model reasoning and safety-critical onboard spacecraft autonomy.
[AI-127] RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design
【速读】:该论文旨在解决蛋白质序列设计中因单次解码局限性导致的结构保真度(structural fidelity)不足的问题,尤其针对现有生成模型(如LigandMPNN)在复杂多目标优化场景下难以获得高精度设计的问题。解决方案的关键在于提出 RosettaSearch——一种基于推理阶段的多目标优化方法,其核心是利用大语言模型(LLM)作为生成优化器,并结合 RosettaFold3 的结构预测奖励信号,在搜索算法中实现可控的探索与利用平衡,从而迭代优化蛋白质序列以提升其结构保真度和设计成功率。
链接: https://arxiv.org/abs/2604.17175
作者: Meghana Kshirsagar,Allen Nie,Ching-An Cheng,Fanglei Xue,Rahul Dodhia,Juan Lavista Ferres,Kevin K. Yang,Frank DiMaio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:
Abstract:We introduce RosettaSearch, an inference-time multi-objective optimization approach for protein sequence optimization. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN’s single-pass decoding fails to produce. RosettaSearch’s designs show improvements in structural fidelity metrics ranging between 18% to 68%, translating to a 2.5x improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further demonstrate that RosettaSearch improves sequence fidelity for ProteinMPNN-designed sequences on de novo backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. The sequence trajectories generated by our approach can be used as training data in sequence design models or in post-training and will be released along with the code and datasets upon publication.
[AI-128] CCCL: In-GPU Compression-Coupled Collective Communication
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)工作负载中集体通信(collective communication)带来的显著开销问题,尤其针对应用层重叠计算与通信策略在实际部署中因需大量代码修改而难以适用的场景(如张量并行和专家并行)。其解决方案的关键在于提出CCCL——一个内建的基于压缩的集体通信库,通过无用户侧修改的方式支持allreduce、alltoall及send/recv等操作,同时紧密融合压缩核以减少内存访问,并与NCCL集成消除数据聚合阶段,从而实现高达NVLink带宽3倍的通信效率,显著提升端到端吞吐量(vLLM PD解耦工作负载最高提升10.1%,微基准测试最高提升30%)。
链接: https://arxiv.org/abs/2604.17172
作者: Chon Lam Lao,Zhiying Xu,Zhuang Wang,Ziming Mao,Delong Meng,Jia Zhen,Jun Wu,Ion Stoica,Yida Wang,Yang Zhou
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Collective communication incurs significant overhead in LLM workloads. Although overlapping communication with computation in application-level is a common strategy, it often requires substantial code modifications and is impractical for many workloads (e.g., tensor and expert parallelism). We present CCCL, a built-in compression-based collective communication library that supports operations such as allreduce, alltoall, and send/recv without requiring any user-side changes, thereby enabling seamless adoption in existing applications. CCCL tightly fuses compression kernels to minimize memory accesses and integrates with NCCL to eliminate the data coalescing stage, making it fast enough (up to 3x NVLink bandwidth) to sustain communication. Our evaluation shows that CCCL improves end-to-end throughput in vLLM PD disaggregation workloads by up to 10.1% and microbenchmark throughput by up to 30%.
[AI-129] Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration ICLR2026
【速读】:该论文旨在解决多大语言模型(Large Language Models, LLMs)日益增多背景下,如何高效协同多个模型以提升任务性能的问题。现有框架如混合代理(Mixture-of-Agents, MoA)在代理选择、代理间通信效率及响应整合方面存在不足。其解决方案的关键在于提出图结构代理框架(Graph-of-Agents, GoA),通过模型卡信息进行节点采样以筛选最相关代理,基于响应互评构建有向边关系,并采用双向消息传递机制增强响应质量,最终通过图池化聚合输出统一结果。GoA利用图结构实现可扩展且高效的多代理协作,实验表明仅用3个精选代理即可超越使用全部6个代理的基线方法。
链接: https://arxiv.org/abs/2604.17148
作者: Sukwon Yun,Jie Peng,Pingzhi Li,Wendong Fan,Jie Chen,James Zou,Guohao Li,Tianlong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective intra-agent communication, and (3) integrating responses efficiently. In this work, we propose Graph-of-Agents (GoA), a new graph-based framework for modeling multi-agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model’s domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph-based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi-agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing, positioning it as a strong candidate for navigating the challenges of the ever-growing LLM zoo. Code is available at: this https URL.
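The node-sampling and pooling stages can be illustrated with a toy sketch. The agent names, relevance scores, and answers below are all hypothetical, and real GoA scores responses pairwise to build edges rather than using fixed per-agent numbers:

```python
# Toy relevance of each agent's model card to the current query (invented).
pool = {"math-agent": 0.9, "code-agent": 0.3, "med-agent": 0.1,
        "general-a": 0.6, "general-b": 0.5, "general-c": 0.2}

def sample_nodes(relevance, k=3):
    """Node sampling: keep the k agents whose model cards best match the query."""
    return sorted(relevance, key=relevance.get, reverse=True)[:k]

def aggregate(answers, relevance):
    """Graph-based pooling: return the answer with the highest summed relevance."""
    tally = {}
    for agent, ans in answers.items():
        tally[ans] = tally.get(ans, 0.0) + relevance[agent]
    return max(tally, key=tally.get)

selected = sample_nodes(pool)                                   # 3 of 6 agents
answers = {a: ("42" if a != "general-b" else "41") for a in selected}
final = aggregate(answers, pool)                                # majority-by-relevance
```

Only three agents participate, matching the paper's observation that a well-chosen subset can beat using the full pool.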
[AI-130] Local Inconsistency Resolution: The Interplay between Attention and Control in Probabilistic Models
【速读】:该论文旨在解决复杂概率模型中学习与近似推断的统一框架问题,特别是如何在存在不一致信念的情况下进行有效推理和参数优化。其解决方案的关键在于提出局部不一致性修正(Local Inconsistency Resolution, LIR)框架,该框架基于概率依赖图(Probabilistic Dependency Graphs, PDGs)构建,通过迭代聚焦于模型的子集并利用受控参数来修正局部不一致性,从而实现对多种经典算法(如EM、信念传播、对抗训练、GANs、GFlowNets等)的统一建模与推广。LIR的核心创新在于将不同算法视为特定注意力机制下的实例,同时为GFlowNet提出了一种更自然的损失函数,实验证明可提升收敛性。
链接: https://arxiv.org/abs/2604.17140
作者: Oliver E. Richardson,Mandana Samiei,Mehran Shakerinava,Joseph D. Viviano,Abdessamad El Kabid,Ali Parviz,Yoshua Bengio
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 page body,
Abstract:We present a generic algorithm for learning and approximate inference with an intuitive epistemic interpretation: iteratively focus on a subset of the model and resolve inconsistencies using the parameters under control. This framework, which we call Local Inconsistency Resolution (LIR) is built upon Probabilistic Dependency Graphs (PDGs), which provide a flexible representational foundation capable of capturing inconsistent beliefs. We show how LIR unifies and generalizes a wide variety of important algorithms in the literature, including the Expectation-Maximization (EM) algorithm, belief propagation, adversarial training, GANs, and GFlowNets. In the last case, LIR actually suggests a more natural loss, which we demonstrate improves GFlowNet convergence. Each method can be recovered as a specific instance of LIR by choosing a procedure to direct focus (attention and control). We implement this algorithm for discrete PDGs and study its properties on synthetically generated PDGs, comparing its behavior to the global optimization semantics of the full PDG.
[AI-131] If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data ACL
【速读】:该论文旨在解决连续葡萄糖监测(Continuous Glucose Monitoring, CGM)数据在糖尿病管理中难以支持用户自由查询的问题,现有平台仅提供静态摘要,无法满足个性化、动态的健康数据分析需求。解决方案的关键在于提出CGM-Agent框架,其核心设计是将大语言模型(Large Language Models, LLMs)作为纯推理引擎,仅负责选择合适的分析函数,所有计算均在本地设备完成,确保个人健康数据不出设备,从而兼顾隐私保护与准确性。实验证明,该架构在合成与真实用户查询上分别达到94%和88%的值准确率,且错误主要源于意图和时间歧义,而非计算错误,表明该方法具备高可信度与部署潜力。
链接: https://arxiv.org/abs/2604.17133
作者: Yanjun Cui,Ali Emami,Temiloluwa Prioleau,Nikhil Singh
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted by ACL Findings 2026
Abstract:Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms only offer static summaries which do not support inquisitive user queries. Large language models (LLMs) could enable free-form inquiries about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user’s device. For evaluation, we construct a benchmark of 4,180 questions combining parameterized question templates with real user queries and ground truth derived from deterministic program execution. Evaluating 6 leading LLMs, we find that top models achieve 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.
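The division of labor described in the abstract, where the LLM only names an analytical function and all computation stays on-device, can be sketched as below. The registry entries and the 70-180 mg/dL target range are assumptions for illustration, not CGM-Agent's actual toolkit.

```python
def time_in_range(readings, low=70, high=180):
    """Fraction of glucose readings inside the target range."""
    return sum(low <= r <= high for r in readings) / len(readings)

def mean_glucose(readings):
    return sum(readings) / len(readings)

# The LLM only ever sees this registry's names, never the readings themselves.
REGISTRY = {"time_in_range": time_in_range, "mean_glucose": mean_glucose}

def answer(llm_choice, readings):
    """Dispatch the function the LLM named; raw readings never leave the device."""
    return REGISTRY[llm_choice](readings)

readings = [95, 110, 200, 150, 60]
tir = answer("time_in_range", readings)   # 3 of 5 readings in range
```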
[AI-132] CASCADE: A Cascaded Hybrid Defense Architecture for Prompt Injection Detection in MCP-Based Systems
【速读】:该论文旨在解决基于模型上下文协议(Model Context Protocol, MCP)的大型语言模型(Large Language Model, LLM)应用中因多层架构引入的新攻击面问题,尤其是工具中毒(tool poisoning)和传统提示注入(prompt injection)等安全威胁。现有防御系统存在误报率高、依赖外部API或需白盒访问等局限性。论文提出CASCADE,一种三层级级联防御架构:第一层通过正则表达式、短语加权与熵分析实现快速预过滤;第二层利用BGE嵌入进行语义分析,并配备Ollama Llama3作为回退机制;第三层采用基于模式的输出过滤。其核心优势在于完全本地化运行,无需任何外部API调用,同时在5000样本数据集上实现了95.85%的精确率和74.59%的F1分数,显著提升了对数据外泄(91.5%检测率)和提示注入(84.2%检测率)类攻击的防护能力。
链接: https://arxiv.org/abs/2604.17125
作者: İpek Abasıkeleş Turgut,Edip Gümüş
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Model Context Protocol (MCP) is a rapidly adopted standard for defining and invoking external tools in LLM applications. The multi-layered architecture of MCP introduces new attack surfaces such as tool poisoning, in addition to traditional prompt injection. Existing defense systems suffer from limitations including high false positive rates, API dependency, or white-box access requirements. In this study, we propose CASCADE, a three-tiered cascaded defense architecture for MCP-based systems: (i) Layer 1 performs fast pre-filtering using regex, phrase weighting, and entropy analysis; (ii) Layer 2 conducts semantic analysis via BGE embedding with an Ollama Llama3 fallback mechanism; (iii) Layer 3 applies pattern-based output filtering. Evaluation on a dataset of 5,000 samples yielded 95.85% precision, 6.06% false positive rate, 61.05% recall, and 74.59% F1-score. Analysis across 31 attack types categorized into 6 tiers revealed high detection rates for data exfiltration (91.5%) and prompt injection (84.2%), while semantic attack (52.5%) and tool poisoning (59.9%) categories showed potential for improvement. A key advantage of CASCADE over existing solutions is its fully local operation, requiring no external API calls.
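Layer 1's fast pre-filter combines regex matching with an entropy check; a compact sketch, where the patterns and the 5.0 bits/char threshold are illustrative assumptions rather than CASCADE's actual rules:

```python
import math
import re

# Illustrative patterns only; a real deployment would carry a much larger,
# weighted phrase list as the paper describes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.I),
    re.compile(r"reveal .*system prompt", re.I),
]

def shannon_entropy(text):
    """Bits per character; unusually high entropy can indicate encoded payloads."""
    if not text:
        return 0.0
    probs = [text.count(c) / len(text) for c in set(text)]
    return -sum(p * math.log2(p) for p in probs)

def layer1_prefilter(text, entropy_threshold=5.0):
    """Fast pre-filter: regex hits are flagged; suspicious entropy escalates to Layer 2."""
    if any(p.search(text) for p in INJECTION_PATTERNS):
        return "flag"
    if shannon_entropy(text) > entropy_threshold:
        return "escalate"
    return "pass"
```

Ordinary prose passes through cheaply, while obvious injections and high-entropy blobs (e.g. base64 payloads) are caught before the heavier embedding layer runs.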
[AI-133] he Topological Trouble With Transformers
【速读】:该论文旨在解决Transformer模型在处理序列数据时因纯前馈架构导致的动态状态跟踪能力受限问题。具体而言,由于前馈结构难以维持隐变量的迭代更新以反映环境演化,模型不得不将状态表示不断向深层传递,从而造成浅层信息不可访问且模型深度资源耗尽。解决方案的关键在于从依赖显式思维痕迹(explicit thought traces)转向利用递归架构所驱动的隐式激活动态(implicit activation dynamics),通过引入按深度或步骤划分的递归Transformer分类体系,并探索增强的状态空间模型与粗粒度递归等方向,实现更高效的状态跟踪机制集成到现代基础模型中。
链接: https://arxiv.org/abs/2604.17121
作者: Michael C. Mozer,Shoaib Ahmed Siddiqui,Rosanne Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformers encode structure in sequences via an expanding contextual history. However, their purely feedforward architecture fundamentally limits dynamic state tracking. State tracking – the iterative updating of latent variables reflecting an evolving environment – involves inherently sequential dependencies that feedforward networks struggle to maintain. Consequently, feedforward models push evolving state representations deeper into their layer stack with each new input step, rendering information inaccessible in shallow layers and ultimately exhausting the model’s depth. While this depth limit can be bypassed by dynamic depth models and by explicit or latent thinking that externalizes state representations, these solutions are computationally and memory inefficient. In this article, we argue that temporally extended cognition requires refocusing from explicit thought traces to implicit activation dynamics via recurrent architectures. We introduce a taxonomy of recurrent and continuous-thought transformer architectures, categorizing them by their recurrence axis (depth versus step) and their ratio of input tokens to recurrence steps. Finally, we outline promising research directions, including enhanced state-space models and coarse-grained recurrence, to better integrate state tracking into modern foundation models.
[AI-134] Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成回答时存在过度自信但错误输出的问题,尤其关注如何更可靠地量化不确定性以提升模型的鲁棒性。现有方法通常依赖自一致性(self-consistency)来估计偶然不确定性(aleatoric uncertainty, AU),但在模型过自信且多次产生相同错误答案时,该代理指标失效。论文的关键解决方案是引入一种可在黑盒访问条件下计算的先验不确定性(epistemic uncertainty, EU)项:EU基于一个小规模、规模匹配的模型集合,通过计算跨模型与模型内序列语义相似度之间的差距来衡量不确定性。最终,总不确定性(total uncertainty, TU)定义为AU与EU之和,在多个长文本任务中表现出优于仅使用AU的排名校准能力和选择性回避能力,且EU能有效识别出AU较低但模型仍犯错的情况。
链接: https://arxiv.org/abs/2604.17112
作者: Kimia Hamidieh,Veronika Thost,Walter Gerych,Mikhail Yurochkin,Marzyeh Ghassemi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification is one potential solution to more robust usage. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.
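A minimal sketch of the AU/EU decomposition, assuming pairwise semantic similarities (e.g. from an embedding model) are already computed; the numbers are invented to depict a confident failure that AU alone would miss:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical pairwise semantic similarities in [0, 1].
intra_sims = [0.95, 0.93, 0.96]   # samples from one model: highly self-consistent
inter_sims = [0.40, 0.35, 0.45]   # answers across ensemble models: they disagree

au = 1.0 - mean(intra_sims)                          # self-consistency proxy (low here)
eu = max(0.0, mean(intra_sims) - mean(inter_sims))   # cross-model disagreement gap
tu = au + eu                                         # total uncertainty
```

AU alone (≈0.05) would call this answer safe; EU (≈0.55) flags the confident failure, which is exactly the regime the paper targets.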
[AI-135] HiveMind: OS-Inspired Scheduling for Concurrent LLM Agent Workloads
【速读】:该论文旨在解决多个大语言模型(Large Language Model, LLM)编码代理在共享受限API端点时因资源竞争导致的高失败率问题,其核心挑战是并行执行下未协调的请求引发连接重置和HTTP 502错误等故障模式。解决方案的关键在于提出HIVEMIND——一个透明的HTTP代理系统,它借鉴操作系统调度机制,引入五种原语:准入控制、速率跟踪、基于AIMD(Additive Increase Multiplicative Decrease)的背压与熔断机制、令牌预算管理及优先级队列,从而有效消除由无序并发引起的故障。该方案无需修改现有代理代码即可兼容Anthropic、OpenAI及本地模型API,并通过实验证明其可将失败率从72–100%降至0–18%,同时减少48–100%的浪费计算资源,其中透明重试机制被证明是最关键的单一因素,但各原语协同作用效果最优。
链接: https://arxiv.org/abs/2604.17111
作者: Justice Owusu Agyemang,Jerry John Kponyo,Obed Kwasi Somuah,Elliot Amponsah,Godfred Manu Addo Boakye,Kwame Opuni-Boachie Obour Agyekum
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:When multiple LLM coding agents share a rate-limited API endpoint, they exhibit resource contention patterns analogous to unscheduled OS processes competing for CPU, memory, and I/O. In a motivating incident, 3 of 11 parallel agents died from connection resets and HTTP 502 errors - a 27% failure rate - despite the API having sufficient aggregate capacity to serve all 11 sequentially. We present HIVEMIND, a transparent HTTP proxy that applies five OS-inspired scheduling primitives - admission control, rate-limit tracking, AIMD backpressure with circuit breaking, token budget management, and priority queuing - to eliminate the failure modes caused by uncoordinated parallel execution. The proxy requires zero modifications to existing agent code and supports Anthropic, OpenAI, and local model APIs via auto-detected provider profiles. Our evaluation across seven scenarios (5-50 concurrent agents) shows that uncoordinated agents fail at 72-100% rates under contention, while HIVEMIND reduces failures to 0-18% and eliminates 48-100% of wasted compute. An ablation study reveals that transparent retry - not admission control - is the single most critical primitive, but the primitives are most effective in combination. Real-world validation against Ollama confirms that HIVEMIND adds under 3ms of proxy overhead per request. The system is open-source under the MIT license.
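The AIMD backpressure primitive follows the classic congestion-control scheme: grow the concurrency limit additively on success, cut it multiplicatively on throttling. A sketch with hypothetical parameters:

```python
class AIMDLimiter:
    """Additive-increase / multiplicative-decrease concurrency limiter, the
    classic congestion-control rule; all parameters here are hypothetical."""

    def __init__(self, limit=4, max_limit=64, increase=1, decrease=0.5):
        self.limit, self.max_limit = limit, max_limit
        self.increase, self.decrease = increase, decrease

    def on_success(self):
        # Probe for headroom one slot at a time.
        self.limit = min(self.max_limit, self.limit + self.increase)

    def on_throttle(self):
        # e.g. HTTP 429/502 from the shared endpoint: back off sharply.
        self.limit = max(1, int(self.limit * self.decrease))

lim = AIMDLimiter()
for _ in range(10):
    lim.on_success()   # ramp from 4 to 14
lim.on_throttle()      # halve to 7
```

The asymmetry (slow ramp, fast back-off) is what keeps a shared, rate-limited endpoint from collapsing under many uncoordinated agents.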
[AI-136] nsorHub: Rethinking AI Model Hub with Tensor-Centric Compression
【速读】:该论文旨在解决现代人工智能(Artificial Intelligence, AI)模型在模型仓库中因规模庞大和冗余度高而导致的存储与分发难题。其解决方案的关键在于提出了一种以张量(tensor)为中心的系统 TensorHub,通过细粒度的去重和压缩技术实现存储优化;该系统利用张量级别的指纹识别与聚类方法,在无需模型标注的情况下识别跨模型间的冗余信息,从而在不损害模型可用性和性能的前提下显著降低存储开销。
链接: https://arxiv.org/abs/2604.17104
作者: Tingfeng Lan,Zirui Wang,Yunjia Zheng,Zhaoyuan Su,Juncheng Yang,Yue Cheng
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 6 figures. Systems paper on AI model storage
Abstract:Modern AI models are growing rapidly in size and redundancy, leading to significant storage and distribution challenges in model hubs. We present TensorHub, a tensor-centric system for reducing storage overhead through fine-grained deduplication and compression. TensorHub leverages tensor-level fingerprinting and clustering to identify redundancy across models without requiring annotations. Our design enables efficient storage reduction while preserving model usability and performance. Experiments on real-world model repositories demonstrate substantial storage savings with minimal overhead.
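Tensor-level deduplication can be sketched with a content hash as the fingerprint; this simplification (exact byte equality, an in-memory store) is illustrative and much coarser than the clustering TensorHub describes:

```python
import hashlib

def fingerprint(tensor_bytes):
    """Content hash of a tensor's raw bytes; identical tensors share a fingerprint."""
    return hashlib.sha256(tensor_bytes).hexdigest()

def dedup(models):
    """Store each distinct tensor once; each model keeps fingerprints as references."""
    store, refs = {}, {}
    for name, tensors in models.items():
        refs[name] = []
        for t in tensors:
            fp = fingerprint(t)
            store.setdefault(fp, t)
            refs[name].append(fp)
    return store, refs

# Two fine-tuned variants sharing a frozen embedding tensor (toy byte strings).
models = {"model-a": [b"embed-weights", b"head-a"],
          "model-b": [b"embed-weights", b"head-b"]}
store, refs = dedup(models)   # three unique tensors stored instead of four
```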
[AI-137] Configuration Over Selection: Hyperparameter Sensitivity Exceeds Model Differences in Open-Source LLM s for RTL Generation
【速读】:该论文旨在解决开源大语言模型(Large Language Models, LLMs)在硬件设计领域应用时,基准测试结果可能因推理阶段解码配置差异而产生误导的问题。现有研究多聚焦于模型选择,将解码配置视为次要因素,但本文揭示:同一模型在不同超参数设置下的通过率差距可达25.5%,远超不同模型家族默认配置间的平均差异(约5倍),且最优配置在不同基准测试集间不具迁移性(Spearman相关系数接近零)。解决方案的关键在于采用架构感知与基准感知的超参数调优方法,通过系统性地对主流模型进行大规模超参数搜索(108种配置),明确区分模型能力与配置效应,从而实现对开源LLMs用于寄存器传输级(Register-Transfer Level, RTL)生成潜力的准确评估与最大化利用。
链接: https://arxiv.org/abs/2604.17102
作者: Minghao Shao,Zeng Wang,Weimin Fu,Xiaolong Guo,Johann Knechtel,Ozgur Sinanoglu,Ramesh Karri,Muhammad Shafique
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Benchmarking of open-source LLMs for hardware design focuses on which LLMs to use, while treating inference-time decoding configuration as a secondary concern. This work shows that it matters more how an LLM is configured than which model is selected. Benchmarking 26 open-source LLMs on VerilogEval and RTLLM with synthesis-in-the-loop evaluation, the study first maps the current capability landscape and then conducts an extensive 108-configuration hyperparameter sweep on three prominent models. The sweep reveals absolute pass-rate gaps of up to 25.5% between the best and worst settings for the same LLM, which is 5x larger than the average spread observed across various model families under their respective default configurations. Ranking all configurations by Spearman’s \rho across the two benchmark suites yields near-zero correlation, demonstrating that optimal configurations do not transfer. These results show that benchmarking conducted under default hyperparameters confounds model capabilities with configuration effects. Realizing the full potential of open-source LLMs for RTL generation requires architecture and benchmark aware hyperparameter selection, as enabled by the proposed methodology.
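The cross-benchmark comparison ranks all configurations and correlates the rankings with Spearman's rho; a self-contained sketch with invented pass rates (no ties assumed):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation via the rank-difference formula (no ties)."""
    n = len(xs)
    def ranks(v):
        order = sorted(range(n), key=lambda i: v[i])
        r = [0] * n
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented pass rates of four decoding configs on two benchmark suites.
bench_a = [0.41, 0.52, 0.33, 0.60]
bench_b = [0.52, 0.41, 0.60, 0.33]   # exactly reversed ranking
rho = spearman_rho(bench_a, bench_b)
```

A rho near zero, as the paper reports across VerilogEval and RTLLM, means a configuration that ranks first on one suite carries no information about its rank on the other.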
[AI-138] Understanding and Enforcing Weight Disentanglement in Task Arithmetic CVPR2026
【速读】:该论文旨在解决任务算术(Task Arithmetic)方法在预训练模型编辑中缺乏理论解释的问题,尤其是为何其能够实现非干扰性的任务组合(weight disentanglement)。现有研究仅描述了理想结果,未揭示其内在机制。论文提出任务特征专业化(Task-Feature Specialization, TFS)作为根本原理,证明TFS是权重解耦的充分条件,并进一步发现TFS会引发可测量的几何特性——权重向量正交性(orthogonality)。这一发现的关键在于:由于TFS本身难以直接施加约束,可通过强化其几何表现形式——正交性来间接促进解耦。因此,作者提出OrthoReg正则化方法,在微调过程中主动约束权重更新(ΔW)的内部正交结构,从而有效提升多种任务算术方法的性能。
链接: https://arxiv.org/abs/2604.17078
作者: Shangge Liu,Yuehan Yin,Lei Wang,Qi Fan,Yinghuan Shi,Wenbin Li,Yang Gao,Dacheng Tao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: CVPR 2026
Abstract:Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of "weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model ( \theta_0 ) or the task vectors ( \tau_t ) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model’s ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on weight updates ( \Delta W ) that constitute \tau_t during fine-tuning. And we theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods. Code is available at this https URL.
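An OrthoReg-style term can be illustrated as the sum of squared pairwise dot products between the rows of a weight update: it vanishes exactly when the rows are mutually orthogonal. This is a sketch of the general idea; the paper's exact formulation may differ.

```python
def ortho_penalty(rows):
    """Sum of squared pairwise dot products between the rows of a weight update:
    zero iff the rows are mutually orthogonal."""
    total = 0.0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            total += dot * dot
    return total

orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # specialized, disentangled update
entangled  = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]   # overlapping feature usage
# During fine-tuning, such a penalty would be added to the task loss so that the
# accumulated update keeps an internally orthogonal structure.
```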
[AI-139] Harness as an Asset: Enforcing Determinism via the Convergent AI Agent Framework (CAAF)
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全关键工程场景中因可控性缺口(controllability gap)导致的部署难题:即使极低比例的约束违反未被检测,系统也无法投入实际应用。现有协同范式存在谄媚式合规、上下文注意力衰减(context attention decay)及自修正过程中的随机振荡(stochastic oscillation)等问题。其解决方案的核心是提出收敛型AI代理框架(Convergent AI Agent Framework, CAAF),通过三个支柱实现从开环生成到闭环故障安全确定性的转变:(1) 带物理上下文防火墙的递归原子分解;(2) 将领域不变量形式化为可机器读取的注册表,并由确定性统一断言接口(Unified Assertion Interface, UAI)强制执行;(3) 结构化语义梯度与状态锁定机制以保证单调收敛。实证结果表明,CAAF在自动驾驶和制药连续流反应器设计两个复杂场景中均实现了100%悖论检测率,显著优于单模型基线和多智能体架构,且可靠性不受提示干扰,核心贡献在于UAI的确定性约束保障能力。
链接: https://arxiv.org/abs/2604.17025
作者: Tianbao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages, 13 figures. Code: this https URL (Apache-2.0)
Abstract:Large Language Models (LLMs) produce a controllability gap in safety-critical engineering: even low rates of undetected constraint violations render a system undeployable. Current orchestration paradigms suffer from sycophantic compliance, context attention decay [Liu et al., 2024], and stochastic oscillation during self-correction [Huang et al., 2024]. We introduce the Convergent AI Agent Framework (CAAF), which transitions agentic workflows from open-loop generation to closed-loop Fail-Safe Determinism via three pillars: (1) Recursive Atomic Decomposition with physical context firewalls; (2) Harness as an Asset, formalizing domain invariants into machine-readable registries enforced by a deterministic Unified Assertion Interface (UAI); and (3) Structured Semantic Gradients with State Locking for monotonic convergence. Empirical evaluation across two domains – SAE Level 3 (L3) autonomous driving (AD) (n=30, 7 conditions) and pharmaceutical continuous flow reactor design (n=20, 4 conditions including a Mono+UAI ablation) – shows that CAAF-all-GPT-4o-mini achieves 100% paradox detection while monolithic GPT-4o achieves 0% (even at temperature=0). The pharmaceutical benchmark features 7 simultaneous constraints with nonlinear Arrhenius interactions and a 3-way minimal unsatisfiable subset, representing a structurally harder challenge than the 2-constraint AD paradox. Alternative multi-agent architectures (debate, sequential checking) also achieve 0% across 80 trials, confirming that CAAF’s reliability derives from its deterministic UAI, not from multi-agent orchestration per se. A Mono+UAI ablation (95%) isolates UAI as the core contribution. CAAF’s reliability is invariant to prompt hints; all components use a single commodity model, enabling fully offline deployment.
[AI-140] Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents
【速读】:该论文旨在解决语言引导的具身智能(language-guided embodied AI)中指令粒度(instruction granularity)这一重要但控制不足的变量问题。现有基准通常为每个任务提供单一静态指令,难以研究相同任务在不同描述细节下智能体行为的变化。解决方案的关键在于引入Mini-BEHAVIOR-Gran,这是一个扩展自Mini-BEHAVIOR的新基准,为每个任务提供多种粒度的指令变体,涵盖从高层目标描述到分步指导的连续范围。通过该基准,作者进一步发现规划宽度(planning-width)是跨任务量化指令粒度最一致的指标,并揭示了指令粒度与性能之间呈非单调U型关系,即在细粒度和粗粒度两端均取得最优性能,其中粗粒度性能回升归因于浅层接地(shallow grounding)现象,即智能体学习到以视觉为主导的策略。
链接: https://arxiv.org/abs/2604.17019
作者: Sukai Huang,Chenyuan Zhang,Fucai Ke,Zhixi Cai,Gholamreza Haffari,Lizhen Qu,Hamid Rezatofighi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, Keywords: Language Grounding, Language Granularity, Instruction Following Agent, Width-based Planning Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond Research Area Keywords: vision language navigation, multimodality, neurosymbolic approaches
Abstract:Instruction granularity is an important yet poorly controlled variable in language-guided embodied AI. Existing benchmarks typically pair each task with a single static instruction, making it difficult to study how agent behavior changes when the same task is described at different levels of detail. We introduce Mini-BEHAVIOR-Gran, a new benchmark for controlled studies of instruction granularity that extends Mini-BEHAVIOR with multiple instruction variants per task, ranging from high-level goal descriptions to step-by-step guidance. Using this benchmark, we compare four candidate metrics for cross-task granularity quantification: token count, entity count, action-verb count, and planning-width, and find that width correlates most consistently with agent performance. Using width to organize training and evaluation further reveals a non-monotonic U-shaped relationship between instruction granularity and performance, with peaks at both fine and coarse extremes. Further analysis suggests that the coarse-granularity performance rebound is associated with shallow grounding, where agents learn vision-dominant policies.
[AI-141] Small Model as Master Orchestrator: Learning Unified Agent-Tool Orchestration with Parallel Subtask Decomposition
【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在处理复杂任务时面临的系统复杂度高、可扩展性差的问题,尤其是现有编排方法依赖静态工作流或串行代理调度,且受限于工具与代理间异构接口协议。其解决方案的关键在于提出“Agent-as-Tool”统一并行编排范式,将代理和工具均抽象为标准化、可学习的动作空间,并通过协议标准化和显式状态反馈实现高效协同;在此基础上训练轻量级编排器ParaManager,解耦规划决策与子任务求解,支持状态感知的并行子任务分解、委派与异步执行,并采用两阶段训练策略(监督微调+SFT恢复机制+强化学习优化),从而在任务成功率、协议合规性、多样性及推理效率之间取得平衡,显著提升系统鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2604.17009
作者: Wenzhen Yuan,Wutao Xiong,Fanchen Yu,Shengji Tang,Ting Liu,Tao Chen,Peng Ye,Yuzhuo Fu,Wanli Ouyang,Lei Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent systems (MAS) demonstrate clear advantages in tackling complex problems by coordinating diverse agents and external tools. However, most existing orchestration methods rely on static workflows or serial agent scheduling, and are further constrained by heterogeneous interface protocols between tools and agents. This leads to high system complexity and poor extensibility. To mitigate these issues, we propose Agent-as-Tool, a unified parallel orchestration paradigm that abstracts both agents and tools into a standardized, learnable action space with protocol normalization and explicit state feedback. Building on this paradigm, we train a lightweight orchestrator, ParaManager, which decouples planning decisions from subtask solving, enabling state-aware parallel subtask decomposition, delegation, and asynchronous execution. For training, we adopt a two-stage ParaManager training pipeline. It improves robustness by incorporating supervised fine-tuning (SFT) trajectories equipped with recovery mechanisms, and further applies reinforcement learning (RL) to achieve an optimal balance among task success, protocol compliance, diversity, and reasoning efficiency. Experiments show that ParaManager achieves strong performance across multiple benchmarks and exhibits robust generalization under unseen model pools.
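Agent-as-Tool 范式把代理与工具统一抽象为带显式状态反馈的标准化动作空间。下面给出一个玩具化的调度器草图,仅用于说明这种统一接口;其中类名与两个示例动作均为虚构,ParaManager 本身是训练得到的编排模型,而非这里的查表式分发:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Action:
    """Uniform wrapper: both external tools and sub-agents become callable actions."""
    name: str
    run: Callable[[str], str]

@dataclass
class Orchestrator:
    """Toy dispatcher illustrating protocol normalization with explicit state feedback."""
    actions: Dict[str, Action] = field(default_factory=dict)
    state: List[Tuple[str, str, str]] = field(default_factory=list)

    def register(self, action: Action) -> None:
        self.actions[action.name] = action

    def delegate(self, name: str, subtask: str) -> str:
        result = self.actions[name].run(subtask)
        self.state.append((name, subtask, result))  # state fed back for later planning
        return result

orch = Orchestrator()
orch.register(Action("calculator", lambda q: str(eval(q))))     # a "tool"
orch.register(Action("echo_agent", lambda q: f"handled: {q}"))  # an "agent"
print(orch.delegate("calculator", "2 + 3"))  # tools and agents share one protocol
```

统一动作空间的好处是:编排器的决策(选哪个 action、派什么子任务)与子任务求解本身完全解耦,与摘要中"规划决策与子任务求解解耦"的表述一致。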
[AI-142] In-Context Learning Under Regime Change
【速读】:该论文旨在解决非平稳序列(non-stationary sequences)中模型如何在不重新训练的情况下,通过上下文学习(in-context learning)实现对变化点(change-point)的检测与动态适应问题。其关键在于将该任务形式化为一个上下文内变化点检测问题(in-context change-point detection problem),并证明了Transformer类基础模型存在能够解决此问题的构造方案;其中模型复杂度(层数和参数量)取决于关于变化点位置的信息水平(从无知识到已知确切时间)。实验验证表明,在合成线性回归与线性动态系统任务中,训练后的Transformer可达到最优基线性能;此外,通过编码和引入变化点知识,无需再训练即可提升预训练模型在真实世界场景(如传染病预测与美联储货币政策会议期间金融波动预测)中的表现,体现出该方法的实际应用价值。
链接: https://arxiv.org/abs/2604.16988
作者: Carson Dudley,Yutong Bi,Xiaofeng Liu,Samet Oymak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Non-stationary sequences arise naturally in control, forecasting, and decision-making. The data-generating process shifts at unknown times, and models must detect the change, discard or downweight obsolete evidence, and adapt to new dynamics on the fly. Transformer-based foundation models increasingly rely on in-context learning for time series forecasting, tabular prediction, and continuous control. As these models are deployed in non-stationary environments, understanding their ability to detect and adapt to regime shifts is important. We formalize this as an in-context change-point detection problem and formally establish the existence of transformer models that solve this problem. Our construction demonstrates that model complexity, in layers and parameters, depends on the level of information available about the change-point location, from no knowledge to knowing exact timing. We validate our results with experiments on synthetic linear regression and linear dynamical systems, where trained transformers match the performance of optimal baselines across information levels. We also show that encoding and incorporating changepoint knowledge indeed improves the real-world performance of a pretrained foundation models on infectious disease forecasting and on financial volatility forecasting around Federal Open Market Committee (FOMC) announcements without retraining, demonstrating practical applicability to real-world regime changes.
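论文在合成线性回归任务上将训练后的 Transformer 与最优基线对比。下面给出一个朴素的、基于滚动最小二乘一步预测残差的变化点检测基线草图(并非论文中的 Transformer 构造),用于直观展示"回归系数在未知时刻翻转"这一任务设定:

```python
import numpy as np

def detect_change(x: np.ndarray, y: np.ndarray, window: int = 20) -> int:
    """Flag the step where a rolling least-squares fit starts failing.

    A simple residual-based baseline for the regime-change setting the
    paper studies; window size is an arbitrary choice for illustration.
    """
    scores = []
    for t in range(window, len(x)):
        # fit y = w * x on the trailing window, then score the next point
        w, *_ = np.linalg.lstsq(x[t - window:t, None], y[t - window:t], rcond=None)
        scores.append(abs(y[t] - w[0] * x[t]))  # one-step-ahead residual
    return window + int(np.argmax(scores))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.where(np.arange(200) < 100, 2.0 * x, -2.0 * x)  # coefficient flips at t=100
print(detect_change(x, y))  # spikes shortly after the true change point
```

变化点过后,检测器需要丢弃窗口中过时的证据才能重新拟合,这正对应摘要中"检测变化、降权旧证据、适应新动态"的三步。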
[AI-143] A phenotype-driven and evidence-governed framework for knowledge graph enrichment and hypotheses discovery in population data
【速读】:该论文旨在解决当前知识图谱(Knowledge Graph, KG)构建方法多为验证性(confirmatory)的问题,即主要聚焦于恢复已知关系,而难以识别新颖或情境依赖的节点与关系。为此,作者提出了一种以表型驱动、证据治理的框架,其核心在于将KG扩展转化为一个兼顾结构支持与文献空白的多目标优化问题。解决方案的关键在于:1)利用图神经网络(Graph Neural Networks, GNNs)进行表型发现、因果推断与概率推理,实现结构化假设生成;2)结合大语言模型(Large Language Models, LLMs)完成假设生成与声明提取;3)通过帕累托最优选择机制筛选非支配性声明,平衡可解释性、新颖性和验证度,从而有效避免冗余或低质量知识的引入。该方法在异构人群数据集上验证了其在提升表型可解释性、揭示情境依赖因果结构及生成高质量声明方面的优势。
链接: https://arxiv.org/abs/2604.16982
作者: Adela Bâra,Simona-Vasilica Oprea
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Current knowledge graph (KG) construction methods are confirmatory, focusing on recovering known relationships rather than identifying novel or context-dependent nodes. This paper proposes a phenotype-driven and evidence-governed framework that shifts the paradigm toward structured hypothesis discovery and controlled KG expansion. The approach integrates graph neural networks (GNNs) for phenotype discovery, causal inference, probabilistic reasoning and large language models (LLMs) for hypothesis generation and claim extraction within a unified pipeline. The framework prioritizes relationships that are both structurally supported by data and underexplored in the literature. KG expansion is formulated as a multi-objective optimization problem, where candidate claims are jointly evaluated in terms of relevance, structural validation and novelty. Pareto-optimal selection enables the identification of non-dominated claims that balance confirmation and discovery, avoiding trivial or redundant knowledge inclusion. Experiments on heterogeneous population datasets demonstrate that the proposed framework produces more interpretable phenotypes, reveals context-dependent causal structures and generates high-quality claims that align with both data and scientific evidence. Compared to rule-based and LLM-only baselines, the method achieves the best trade-off across plausibility, novelty, validation and relevance. In retrieval-augmented settings, it significantly improves performance (Recall@5=0.98) while reducing hallucination rates (0.05), highlighting its effectiveness in grounding LLM outputs.
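摘要中的帕累托最优选择可以用非支配过滤直观说明:只保留在任一目标维度上都不被其他候选声明全面压制的声明。以下草图沿用摘要提到的相关性、验证度与新颖性三个维度,声明名称与分数均为虚构示例:

```python
from typing import List, Tuple

Claim = Tuple[str, Tuple[float, float, float]]  # (name, (relevance, validation, novelty))

def pareto_front(claims: List[Claim]) -> List[str]:
    """Return names of claims not dominated on every objective."""
    def dominates(a, b):
        # a dominates b: no worse everywhere, strictly better somewhere
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [n for n, s in claims
            if not any(dominates(t, s) for m, t in claims if m != n)]

claims: List[Claim] = [
    ("claim_a", (0.9, 0.8, 0.2)),  # confirmatory: well validated, not novel
    ("claim_b", (0.7, 0.6, 0.9)),  # discovery: novel, less validated
    ("claim_c", (0.6, 0.5, 0.1)),  # dominated by claim_a on every axis
]
print(pareto_front(claims))
```

被支配的 claim_c 正是摘要所说应被过滤的"平凡或冗余"知识;claim_a 与 claim_b 则分别代表"确认"与"发现"两端的非支配折中。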
[AI-144] Evaluating Multimodal LLM s for Inpatient Diagnosis: Real-World Performance Safety and Cost Across Ten Frontier Models
【速读】:该论文旨在解决在低收入和中等收入国家(LMIC)公共医院环境中,大型语言模型(Large Language Models, LLMs)在真实世界多模态住院患者数据上的诊断支持能力评估不足的问题。其解决方案的关键在于:通过一项回顾性研究(VALID),在南非一家三级公立医院收集并分析了539例包含影像学(CT、MRI、X光)、实验室结果、临床记录和生命体征的多模态病例数据,由专家小组确定金标准诊断、鉴别诊断及推理路径,并采用校准后的三模型LLM裁判系统对10个多模态LLM的零样本输出进行量化评估(共10,000次评价)。结果显示,尽管模型成本差异显著,性能高度集中且均优于常规病房诊断,尤其在诊断准确性和患者安全指标上;同时发现添加放射学报告可提升性能6%,且诊断与推理得分高度相关(ρ = 0.85),表明多模态LLM在LMIC场景下具备高性价比和鲁棒性潜力,部署约束可能比微小性能差异更为重要。
链接: https://arxiv.org/abs/2604.16980
作者: Bruce A. Bassett,Amy Rouillard,Sitwala Mundia,Michael Cameron Gramanie,Linda Camara,Ziyaad Dangor,Shabir A. Madhi,Kajal Morar,Marlvin T. Ncube,Ismail Kalla,Haroon Saloojee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 11 figures, 10 tables
Abstract:Background: Large language models (LLMs) are increasingly proposed for diagnostic support, but few evaluations use real-world multimodal inpatient data, particularly in low and middle-income country (LMIC) public hospitals. Methods: We conducted VALID, a retrospective evaluation of 539 multimodal inpatient cases from a tertiary public hospital in South Africa. Inputs included radiology imaging (CT, MRI, CXR) and reports, laboratory results, clinical notes, and vital signs. Expert panels adjudicated 300 cases (balanced and discordant subsets) to establish ground truth diagnoses, differentials, and reasoning. Ten multimodal LLMs generated zero-shot outputs. A calibrated three-model LLM Jury scored all outputs and routine ward diagnoses across diagnostic accuracy, differential quality, reasoning, and patient safety (10,000 evaluations). Primary outcomes were composite scores ( S_3 , S_4 ) and win rates. Results: (i) LLM performance was tightly clustered (15% variation) despite large cost differences; low-cost models performed comparably to top models. (ii) All LLMs significantly outperformed routine ward diagnoses on average diagnostic and safety scores. (iii) Top performance was achieved by GPT-5.1, followed by Gemini models. (iv) Adding radiology reports improved performance by 6%. (v) Diagnostic and reasoning scores were highly correlated ( \rho = 0.85 ). (vi) Output rates varied (65-100%) due to input constraints. Results were robust across subsets and evaluation design. Conclusions: Across a real-world LMIC dataset, multimodal LLMs showed similar diagnostic performance despite large cost differences and outperformed routine care on average safety metrics. Affordability, robustness, and deployment constraints may outweigh marginal performance differences in LMIC settings.
[AI-145] MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
【速读】:该论文旨在解决Group Relative Policy Optimization (GRPO) 类算法在处理高准确率提示时存在的两个关键问题:一是对于已掌握提示(rollout accuracy = 1)导致组相对优势消失,从而丧失训练信号并引发策略漂移(policy drift),可能造成遗忘;二是对于多数正确提示(rollout accuracy ∈ (0.5,1))由于查询权重随准确率提升而缩小,削弱了从部分正确到完全掌握的巩固能力。解决方案的核心在于提出Mastery-Consolidated Policy Optimization (MCPO),其关键创新包括:(i) 仅对已掌握提示引入铰链式KL正则项(hinge-KL regularizer),以限制相邻梯度步之间的有害策略漂移;(ii) 设计一种优先级加权机制,强化对多数正确提示的优化分配,从而更有效地促进从部分正确到完全掌握的转化。实验表明,MCPO在三个数学基准上显著提升了pass@1性能,并意外地增强了pass@k指标,说明巩固掌握状态反而能进一步激发解空间多样性。
链接: https://arxiv.org/abs/2604.16972
作者: Zhaokang Liao,Yingguo Gao,Yi Yang,Yongheng Hu,Jingting Ding
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach to improve the reasoning abilities of Large Language Models (LLMs). Among RLVR algorithms, Group Relative Policy Optimization (GRPO) and its variants have demonstrated strong performance and high training efficiency. However, GRPO-style objectives exhibit two issues on high accuracy prompts including mastered prompts (rollout accuracy =1) and majority-correct prompts (rollout accuracy in (0.5,1)). For mastered prompts, group-relative advantages vanish, yielding no training signal and unconstrained policy drift that can cause forgetting. For majority-correct prompts, the induced query weight shrinks as accuracy increases, weakening consolidation from partial correctness to mastery. To alleviate this, we propose Mastery-Consolidated Policy Optimization (MCPO), which introduces (i) a hinge-KL regularizer applied exclusively to mastered prompts to bound harmful policy drift between successive gradient steps, and (ii) a weighting mechanism that prioritizes majority-correct prompts to better allocate optimization effort. Extensive experiments across three mathematical benchmarks demonstrate that MCPO consistently improves pass@1 performance. Counter-intuitively, rather than restricting exploration, MCPO boosts pass@k metrics, indicating that mastery consolidation further catalyzes solution diversity.
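按摘要的两种情形,可以把 MCPO 对单条 prompt 的处理粗略示意如下:已掌握(acc = 1)时组相对优势消失,改为仅施加铰链式 KL 约束以限制漂移;多数正确(0.5 < acc < 1)时上调查询权重以促进巩固。其中 KL 预算与加权曲线均为示意性假设,并非论文的实际取值:

```python
def mcpo_weight(acc: float, kl: float) -> tuple:
    """Toy per-prompt treatment following the abstract's two cases.

    Returns (loss_weight, hinge_kl_penalty). The kl_budget threshold and
    the linear up-weighting curve are illustrative guesses, not the
    paper's actual formulation.
    """
    kl_budget = 0.05  # hypothetical tolerated drift between gradient steps
    if acc == 1.0:
        # mastered prompt: no advantage signal; only bound policy drift
        return 0.0, max(0.0, kl - kl_budget)
    if acc > 0.5:
        # majority-correct prompt: prioritize consolidation toward mastery
        return 1.0 + (acc - 0.5), 0.0
    return 1.0, 0.0  # hard prompts keep the default weight

print(mcpo_weight(1.0, 0.2))  # mastered: zero loss weight, hinged KL penalty
print(mcpo_weight(0.8, 0.0))  # majority-correct: up-weighted, no penalty
```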
[AI-146] NaviFormer: A Deep Reinforcement Learning Transformer-like Model to Holistically Solve the Navigation Problem IROS
【速读】:该论文旨在解决自动驾驶或机器人导航中高阶路径规划(route planning)与低阶轨迹规划(path planning)需协同求解的难题,传统方法通常将二者分离处理,难以实现全局优化和实时性。其解决方案的关键在于提出NaviFormer——一种基于Transformer架构的深度强化学习模型,能够同时预测高阶路线(waypoint序列)与低阶轨迹(两点间避障路径),通过统一建模提升整体导航效率与准确性,并在实验中展现出优于对比算法的精度和计算速度,适用于实时任务场景。
链接: https://arxiv.org/abs/2604.16967
作者: Daniel Fuertes,Andrea Cavallaro,Carlos R. del-Blanco,Fernando Jaureguizar,Narciso García
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Published in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
Abstract:Path planning is usually solved by addressing either the (high-level) route planning problem (waypoint sequencing to achieve the final goal) or the (low-level) path planning problem (trajectory prediction between two waypoints avoiding collisions). However, real-world problems usually require simultaneous solutions to the route and path planning subproblems with a holistic and efficient approach. In this paper, we introduce NaviFormer, a deep reinforcement learning model based on a Transformer architecture that solves the global navigation problem by predicting both high-level routes and low-level trajectories. To evaluate NaviFormer, several experiments have been conducted, including comparisons with other algorithms. Results show competitive accuracy from NaviFormer since it can understand the constraints and difficulties of each subproblem and act consequently to improve performance. Moreover, its superior computation speed proves its suitability for real-time missions.
[AI-147] Visual Inception: Compromising Long-term Planning in Agentic Recommenders via Multimodal Memory Poisoning
【速读】:该论文旨在解决Agentic Recommender Systems (Agentic RecSys) 中因依赖长期记忆(Long-term Memory, LTM)而引入的新型安全威胁——“视觉诱饵攻击”(Visual Inception)。该攻击通过用户上传的图像(如生活照片)注入隐蔽触发器,这些触发器作为“潜伏代理”存储于系统记忆中,在未来任务规划时被激活,从而劫持AI代理的推理链,诱导其执行攻击者设定的目标(如推荐高利润商品),且无需提示注入即可实现。解决方案的关键在于提出CognitiveGuard,一个受人类认知双系统理论启发的防御框架:包含两个核心组件——System 1感知净化器(基于扩散模型对输入图像进行净化)和System 2推理验证器(通过反事实一致性检查识别记忆驱动决策中的异常),在不损害推荐质量的前提下显著降低目标命中率(从约85%降至约10%),并支持灵活的延迟-安全性权衡(轻量模式约1.5秒,完整序列验证约6.5秒)。
链接: https://arxiv.org/abs/2604.16966
作者: Jiachen Qian
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures, 16 tables
Abstract:The evolution from static ranking models to Agentic Recommender Systems (Agentic RecSys) empowers AI agents to maintain long-term user profiles and autonomously plan service tasks. While this paradigm shift enhances personalization, it introduces a vulnerability: reliance on Long-term Memory (LTM). In this paper, we uncover a threat termed “Visual Inception.” Unlike traditional adversarial attacks that seek immediate misclassification, Visual Inception injects triggers into user-uploaded images (e.g., lifestyle photos) that act as “sleeper agents” within the system’s memory. When retrieved during future planning, these poisoned memories hijack the agent’s reasoning chain, steering it toward adversary-defined goals (e.g., promoting high-margin products) without prompt injection. To mitigate this, we propose CognitiveGuard, a dual-process defense framework inspired by human cognition. It consists of a System 1 Perceptual Sanitizer (diffusion-based purification) to cleanse sensory inputs and a System 2 Reasoning Verifier (counterfactual consistency checks) to detect anomalies in memory-driven planning. Extensive experiments on a mock e-commerce agent environment demonstrate that Visual Inception achieves about 85% Goal-Hit Rate (GHR), while CognitiveGuard reduces this risk to around 10% with configurable latency trade-offs (about 1.5s in lite mode to about 6.5s for full sequential verification), without quality degradation under our setup.
[AI-148] Multi-stage Planning for Multi-target Surveillance using Aircrafts Equipped with Synthetic Aperture Radars Aware of Target Visibility
【速读】:该论文旨在解决合成孔径雷达(SAR)载机在复杂三维地形环境下生成高质量成像轨迹的难题,尤其针对多目标场景中如何动态规划能够最大化目标可见性的直飞段(straight-flight segments)并实现实时性的问题。传统方法通常假设预定义的直飞段不随地形和飞行器姿态变化,难以适应实际目标可见性需求,且扩展性差。解决方案的关键在于提出一个分阶段的规划系统:首先通过优化算法确定访问所有目标的航点顺序;其次利用基于深度强化学习训练的新型神经网络预测在3D地形约束下能最大化目标可见性的直飞段;最后通过引入三维Dubins曲线优化连接各段,生成平滑连续的飞行轨迹。该方法兼顾了高精度成像质量、地形适应性和实时性能,显著提升了多目标SAR任务的可行性与效率。
链接: https://arxiv.org/abs/2604.16962
作者: Daniel Fuertes,Carlos R. del-Blanco,Fernando Jaureguizar,Juan José Navarro-Corcuera,Narciso García
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Published in IEEE/RAS International Conference on Automation Science and Engineering 2025
Abstract:Generating trajectories for synthetic aperture radar (SAR)-equipped aircraft poses significant challenges due to terrain constraints, and the need for straight-flight segments to ensure high-quality imaging. Related works usually focus on trajectory optimization for predefined straight-flight segments that do not adapt to the target visibility, which depends on the 3D terrain and aircraft orientation. In addition, this assumption does not scale well for the multi-target problem, where multiple straight-flight segments that maximize target visibility must be defined for real-time operations. For this purpose, this paper presents a multi-stage planning system. First, the waypoint sequencing to visit all the targets is estimated. Second, straight-flight segments maximizing target visibility according to the 3D terrain are predicted using a novel neural network trained with deep reinforcement learning. Finally, the segments are connected to create a trajectory via optimization that imposes 3D Dubins curves. Evaluations demonstrate the robustness of the system for SAR missions since it ensures high-quality multi-target SAR image acquisition aware of 3D terrain and target visibility, and real-time performance.
[AI-149] AutoPKG: An Automated Framework for Dynamic E-commerce Product-Attribute Knowledge Graph Construction ACL2026
【速读】:该论文旨在解决电子商务中产品属性提取的瓶颈问题,即现有本体(ontology)存在不一致、不完整且维护成本高的缺陷。解决方案的关键在于提出AutoPKG框架,这是一个基于多智能体大语言模型(Large Language Model, LLM)的系统,能够从多模态商品内容中自动构建产品属性知识图谱(Product-attribute Knowledge Graph, PKG)。其核心创新包括按需推导产品类型和类型特异性属性键、从文本与图像中提取属性值,并通过一个中心化决策代理实现全局一致性校正,从而保障知识图谱的动态更新质量与准确性。
链接: https://arxiv.org/abs/2604.16950
作者: Pollawat Hongwimol,Haoning Shang,Chutong Wang,Zhichao Wan,Yi Gao,Yuanming Li,Lin Gui,Wenhao Sun,Cheng Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as ACL 2026 Findings
Abstract:Product attribute extraction in e-commerce is bottlenecked by ontologies that are inconsistent, incomplete, and costly to maintain. We present AutoPKG, a multi-agent Large Language Model (LLM) framework that automatically constructs a Product-attribute Knowledge Graph (PKG) from multimodal product content. AutoPKG induces product types and type-specific attribute keys on demand, extracts attribute values from text and images, and consolidates updates through a centralized decision agent that maintains a globally consistent canonical graph. We also propose an evaluation protocol for dynamic PKGs that measures type and key validity, consolidation quality, and edge-level accuracy for value assertions after canonicalization. On a large real-world marketplace catalog dataset from Lazada (Alibaba), AutoPKG achieves up to 0.953 Weighted Knowledge Efficiency (WKE) for product types, 0.724 WKE for attribute keys, and 0.531 edge-level F1 for multimodal value extraction. Across three public benchmarks, our method improves edge-level exact-match F1 by 0.152 and yields a precision gain of 0.208 on the attribute extraction application. Online A/B tests show that AutoPKG-derived attributes increase Gross Merchandise Value (GMV) in Badge by 3.81 percent, in Search by 5.32 percent, and in Recommendation by 7.89 percent, supporting the practical value of AutoPKG in production.
[AI-150] MEMRES: A Memory-Augmented Resolver with Confidence Cascade for Agent ic Python Dependency Resolution
【速读】:该论文旨在解决Python依赖项解析(Dependency Resolution)中的高失败率问题,特别是在代码片段自动修复与执行场景下,传统基于大语言模型(Large Language Model, LLM)的方法因缺乏结构化知识和错误模式识别能力而表现不佳。其解决方案的关键在于提出一个多层次置信度级联机制(multi-level confidence cascade),将LLM作为最终手段,前置使用四个核心组件:自演化记忆(Self-Evolving Memory)用于积累可复用的解析模式;包含200+人工标注的导入到包映射的错误模式知识库(Error Pattern Knowledge Base);语义导入分析器(Semantic Import Analyzer)提升对导入语句的理解精度;以及针对Python 2遗留代码的启发式检测器,专门处理最常出现的兼容性失败类别。该设计显著提升了解析成功率,在HG2.9K数据集上达到86.6%(平均10次运行),远超现有方法PLLM的54.7%。
链接: https://arxiv.org/abs/2604.16941
作者: Dao Sy Duy Minh,Tran Chi Nguyen,Trung Kiet Huynh,Pham Phu Hoa,Nguyen Lam Phu Quy,Vu Nguyen
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 4 pages, 1 figure, to appear in Proc. FSE Companion '26
Abstract:We present MEMRES, an agentic system for Python dependency resolution that introduces a multi-level confidence cascade where the LLM serves as the last resort. Our system combines: (1) a Self-Evolving Memory that accumulates reusable resolution patterns via tips and shortcuts; (2) an Error Pattern Knowledge Base with 200+ curated import-to-package mappings; (3) a Semantic Import Analyzer; and (4) a Python 2 heuristic detector resolving the largest failure category. On HG2.9K, using Gemma-2 9B (10 GB VRAM), MEMRES resolves 2503 of 2890 (86.6%, 10-run average) snippets, combining intra-session memory with our confidence cascade for the remainder. This already exceeds PLLM’s 54.7% overall success rate by a wide margin.
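MEMRES 的置信度级联可以抽象为按序尝试的一组解析器、LLM 兜底。以下为示意性实现,映射表内容与各级解析器均为虚构,仅用于说明"LLM 作为最后手段"的级联结构:

```python
def resolve(error_msg: str, resolvers) -> tuple:
    """Try each resolver in confidence order; the LLM is the last resort."""
    for name, fn in resolvers:
        pkg = fn(error_msg)
        if pkg is not None:
            return name, pkg
    return None, None

# Hypothetical stand-ins for MEMRES's levels (memory, curated knowledge
# base, Python-2 heuristic, then LLM).
memory = {"cv2": "opencv-python"}
kb = {"sklearn": "scikit-learn", "PIL": "Pillow"}

resolvers = [
    ("memory", lambda e: next((v for k, v in memory.items() if k in e), None)),
    ("kb", lambda e: next((v for k, v in kb.items() if k in e), None)),
    ("py2_heuristic", lambda e: "2to3" if "print " in e else None),
    ("llm", lambda e: "ask-the-llm"),  # always answers, so it goes last
]
print(resolve("ModuleNotFoundError: No module named 'sklearn'", resolvers))
```

级联的意义在于:绝大多数错误被前几级低成本组件消化,只有剩余的疑难样本才会触达开销最大的 LLM。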
[AI-151] D-QRELO: Training- and Data-Free Delta Compression for Large Language Models via Quantization and Residual Low-Rank Approximation
【速读】:该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)导致的大量专用大语言模型(Large Language Models, LLMs)带来的内存开销问题。现有基于增量压缩(Delta Compression)的方法在处理大规模SFT数据时效果不佳,因为随着训练数据规模增加,增量参数的幅值、奇异值和熵显著增大,从而加剧压缩误差。为此,作者提出DQRELO(Delta Compression via Quantization and Residual Low-Rank),其核心创新在于无需训练和数据即可实现高效压缩:首先采用粗粒度的一比特量化(one-bit quantization)捕获增量结构的主导特征,随后通过补偿残差低秩近似(compensated residual low-rank approximation)从较小的残差误差中恢复细粒度细节。实验表明,DQRELO在多种密集型与MoE架构的LLM上均优于现有方法,并揭示了任务难度、模型结构和层位置对压缩效果的影响规律,为生产环境中最优压缩策略的设计提供了可预测的指导原则。
链接: https://arxiv.org/abs/2604.16940
作者: Junlin Li,Shuangyong Song,Guodong Du,Ngai Wong,Xuebo Liu,Yongxiang Li,Min Zhang,Jing Li,Xuelong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Supervised Fine-Tuning (SFT) accelerates taskspecific large language models (LLMs) development, but the resulting proliferation of finetuned models incurs substantial memory overhead. Delta compression addresses this by retaining a single pre-trained LLM with multiple compressed delta weights. However, existing methods fail on models fine-tuned with largescale datasets. We find that larger SFT data scale amplifies delta parameter magnitude, singular values, and entropy, exacerbating compression errors. To tackle this, we propose DQRELO (Delta Compression via Quantization and Residual Low-Rank), a novel training- and data-free delta compression method. It combines coarse-grained one-bit quantization to capture the dominant structure of the delta, followed by compensated residual low-rank approximation to recover fine-grained details from the smaller residual error. Experiments on various LLMs spanning dense and MoE architectures across multiple domains under this challenging setting demonstrate that DQRELO outperforms existing methods. Moreover, we establish key design principles for delta compression through extensive empirical analysis, demonstrating how task difficulty, architecture, and layer positioning create predictable patterns that can guide optimal compression strategies in production systems.
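DQRELO 的两步压缩(一比特量化捕获增量的主导结构,再对残差做低秩近似补偿)可用 numpy 粗略复现如下;其中逐张量尺度 α 取绝对值均值、以及秩的选取,均为本文的示意性假设:

```python
import numpy as np

def compress_delta(delta: np.ndarray, rank: int) -> np.ndarray:
    """One-bit quantization of the delta, then a truncated-SVD fit of the
    residual (a sketch of DQRELO's two stages; scale and rank choices are ours).
    """
    alpha = np.mean(np.abs(delta))       # per-tensor scale for the sign bits
    coarse = alpha * np.sign(delta)      # coarse one-bit structure
    residual = delta - coarse
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    fine = (u[:, :rank] * s[:rank]) @ vt[:rank]  # compensated residual detail
    return coarse + fine

rng = np.random.default_rng(1)
delta = rng.normal(size=(64, 64))
err_q = np.linalg.norm(delta - np.sign(delta) * np.mean(np.abs(delta)))
err_qr = np.linalg.norm(delta - compress_delta(delta, rank=16))
print(err_qr < err_q)  # residual low-rank recovers detail lost by quantization
```

存储开销由符号位、单个尺度 α 和秩为 r 的两个窄矩阵组成,远小于保留完整增量权重。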
[AI-152] Playing Psychic: Using Thought Trees to Predict Reasoning Models Accuracy on Coding Tasks
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂任务中推理能力评估不充分的问题,特别是现有评测主要依赖于竞赛编程基准,难以全面反映模型在真实场景下的编码推理表现。其解决方案的关键在于提出一种结构化的推理表示方法——结构化思维树(structured thought-trees),并发现推理路径的结构特征比内容本身更能预测输出正确性;基于此,作者设计了一个轻量级分类器,通过提取思维树特征来识别结构异常的推理路径,并通过重试机制提升低复杂度任务中的性能表现。
链接: https://arxiv.org/abs/2604.16931
作者: Jiaxin Fang,Runyuan He,Sahil Bhatia,Neel Gajare,Alvin Cheung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have shown that test-time scaling can substantially improve model performance on complex tasks, particularly in the coding domain. Under this paradigm, models use a larger token budget during inference to generate intermediate reasoning traces before producing a final answer. However, current evaluations primarily rely on competitive programming benchmarks, which may not capture the full range of reasoning abilities. In this work, we perform a systematic study of frontier reasoning models to understand their performance on real-world coding benchmarks. To gain more insights into the performance of such models, we devise a programmatic way to automatically generate coding tasks of arbitrary difficulty and structure from existing benchmarks. Using this framework, our analysis reveals that the structure of a reasoning trace, not just its contents, is a strong predictor of correctness. Motivated by this, we propose structured thought-trees as means to represent reasoning traces. To illustrate their use, we train a lightweight classifier on features extracted from thought-trees to predict trace correctness, and demonstrate that flagging and retrying structurally anomalous traces based on the extracted features yields consistent gains at lower complexity levels.
[AI-153] st-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts
【速读】:该论文旨在解决脑电图(Electroencephalography, EEG)基础模型在临床部署中因数据分布偏移(distribution shifts)导致性能下降的问题,尤其是跨设备、人群和场景下的泛化能力不足。其解决方案的关键在于提出 NeuroAdapt-Bench——一个系统性的测试时适应(Test-Time Adaptation, TTA)基准,用于评估不同领域迁移学习方法在真实EEG分布偏移场景下的有效性。通过在多种预训练模型、下游任务和异构数据集(包括分布内、分布外及极端模态偏移如耳部脑电 Ear-EEG)上的全面评测,研究发现标准TTA方法表现不稳定且常导致性能退化,而无需优化的无梯度方法则更具鲁棒性和可靠性,从而揭示了现有TTA技术在EEG领域的局限性,并强调了开发面向神经信号领域的专用适配策略的必要性。
链接: https://arxiv.org/abs/2604.16926
作者: Gabriel Jason Lee,Jathurshan Pradeepkumar,Jimeng Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet their clinical deployment is hindered by distribution shifts across clinical settings, devices, and populations. Test-time adaptation (TTA) offers a promising solution by enabling models to adapt to unlabeled target data during inference without access to source data, a valuable property in healthcare settings constrained by privacy regulations and limited labeled data. However, its effectiveness for EEG remains largely underexplored. In this work, we introduce NeuroAdapt-Bench, a systematic benchmark for evaluating test-time adaptation methods on EEG foundation models under realistic distribution shifts. We evaluate representative TTA approaches from other domains across multiple pretrained foundation models, diverse downstream tasks, and heterogeneous datasets spanning in-distribution, out-of-distribution, and extreme modality shifts (e.g., Ear-EEG). Our results show that standard TTA methods yield inconsistent gains and often degrade performance, with gradient-based approaches particularly prone to heavy degradation. In contrast, optimization-free methods demonstrate greater stability and more reliable improvements. These findings highlight the limitations of existing TTA techniques in EEG, provide guidance for future development, and underscore the need for domain-specific adaptation strategies.
[AI-154] Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy
【速读】:该论文旨在解决生成式 AI (Generative AI) 文本检测中现有基于似然的方法因内容复杂度敏感而导致性能不稳定的问题。其解决方案的关键在于提出“对齐印记”(Alignment Imprint),即通过将大语言模型(Large Language Models, LLMs)的对齐过程抽象为一系列约束优化步骤,理论推导出日志似然比可分解为隐式指令偏置与偏好奖励两部分;在此基础上进一步设计标准化的信息加权统计量——对齐印记偏好差异(Log-likelihood Alignment Preference Discrepancy, LAPD),以缓解高熵区域的不稳定性,并提供严格的统计保障证明其优于现有方法(如Fast-DetectGPT)。实验表明,LAPD 相较于最强基线提升达 45.82% 的相对性能,且在各类场景下均表现稳定。
链接: https://arxiv.org/abs/2604.16923
作者: Junxi Wu,Kailin Huang,Dongjian Hu,Bin Chen,Hao Wu,Shu-Tao Xia,Changliang Zou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Detecting AI-generated text is an important but challenging problem. Existing likelihood-based detection methods are often sensitive to content complexity and may exhibit unstable performance. In this paper, our key insight is that modern Large Language Models (LLMs) undergo alignment (including fine-tuning and preference tuning), leaving a measurable distributional imprint. We theoretically derive this imprint by abstracting the alignment process as a sequence of constrained optimization steps, showing that the log-likelihood ratio can naturally decompose into implicit instructional biases and preference rewards. We refer to this quantity as the Alignment Imprint. Furthermore, to mitigate the instability in high-entropy regions, we introduce Log-likelihood Alignment Preference Discrepancy (LAPD), a standardized information-weighted statistic based on alignment imprint. We provide statistical guarantee that alignment-based statistics dominate Fast-DetectGPT in performance. We also theoretically show that LAPD strictly improves the unweighted alignment scores when the aligned and base models are close in distribution. Extensive experiments show that LAPD achieves an improvement 45.82% relative to the strongest existing baselines, yielding large and consistent gains across all settings.
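摘要中基于似然的检测,其核心量是同一段文本在对齐模型与基础模型下的对数似然比。下面用纯 Python 给出一个未加权版本的最小示意(仅演示逐 token 对数似然比打分与阈值判别;LAPD 实际还包含信息加权与标准化,此处的玩具数值与阈值均为本文假设,并非论文原始实现):

```python
def alignment_imprint_score(lp_aligned, lp_base):
    """Mean per-token log-likelihood ratio between an aligned model and its base.

    lp_aligned / lp_base: per-token log-probabilities of the same text under
    the two models. Positive scores mean the text is more likely under the
    aligned model -- the distributional "imprint" left by fine-tuning and
    preference tuning.
    """
    ratios = [a - b for a, b in zip(lp_aligned, lp_base)]
    return sum(ratios) / len(ratios)


def detect(lp_aligned, lp_base, threshold=0.0):
    # Classify as AI-generated when the mean imprint exceeds a threshold
    # (threshold value here is illustrative).
    return alignment_imprint_score(lp_aligned, lp_base) > threshold
```

正分意味着文本更贴近对齐模型的分布;论文的 LAPD 在此基础上对高熵区域做加权以稳定统计量。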
[AI-155] ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis
【速读】:该论文旨在解决气候研究中因多尺度数据量激增和分析工具复杂性导致的科研瓶颈问题,即传统工作流碎片化、劳动密集且难以高效推进科学发现。其解决方案的关键在于提出ClimAgent——一个通用的自主框架,通过整合统一的工具使用环境与严格的推理协议,实现跨不同气候子领域的端到端建模与执行能力,从而超越简单问答任务的局限,真正支持专业级气候科学研究。此外,为系统评估该框架的有效性,作者还构建了首个面向真实气候发现场景的基准测试平台ClimaBench,涵盖2000–2025年间五类专业任务,实验表明ClimAgent在解决方案严谨性和实用性上相较现有最优基线提升40.21%。
链接: https://arxiv.org/abs/2604.16922
作者: Hao Wang,Jindong Han,Wei Fan,Hao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi-scale datasets and the complexity of analytical tools have created significant bottlenecks, constraining scientific discovery to fragmented and labor-intensive workflows. While the emergence of Large Language Models (LLMs) offers a transformative paradigm to scale scientific expertise, existing explorations remain largely confined to simple Question-Answering (QA) tasks. These approaches often oversimplify real-world challenges, neglecting the intricate physical constraints and the data-driven nature required in professional climate research. To bridge this gap, we introduce ClimAgent, a general-purpose autonomous framework designed to execute a wide spectrum of research tasks across diverse climate sub-fields. By integrating a unified tool-use environment with rigorous reasoning protocols, ClimAgent transcends simple retrieval to perform end-to-end modeling and execution. To foster systematic evaluation, we propose ClimaBench, the first comprehensive benchmark for real-world climate discovery. It encompasses challenging problems spanning 5 distinct task categories derived from professional scenarios between 2000 and 2025. Experiments on ClimaBench demonstrate that ClimAgent significantly outperforms state-of-the-art baselines, achieving a 40.21% improvement over original LLM solutions in solution rigorousness and practicality. Our code is available at this https URL.
[AI-156] Skilldex: A Package Manager and Registry for Agent Skill Packages with Hierarchical Scope-Based Distribution
【速读】:该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理在运行时扩展技能包时存在的两个关键问题:一是缺乏公开工具对技能包是否符合 Anthropic 发布的格式规范进行评分;二是缺少机制将相关技能与其所需的共享上下文捆绑,以确保技能间的行为一致性。解决方案的关键在于提出 Skilldex,其核心创新包括:(1) 基于编译器风格的格式合规性评分系统,可提供逐行诊断,评估描述具体性、前言字段有效性及结构符合度;(2) 引入 skillset 抽象概念,即一组具有共享资产(如词汇表文件、模板和参考文档)的相关技能集合,从而强制跨技能行为的一致性。此外,Skilldex 还提供了三层次作用域系统、人机协同的技能建议循环、仅含元数据的社区注册表以及模型上下文协议(Model Context Protocol, MCP)服务器等配套基础设施。
链接: https://arxiv.org/abs/2604.16911
作者: Sampriti Saha,Pranav Hemanth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure, 5 tables. IEEE conference format
Abstract:Large Language Model (LLM) agents are increasingly extended at runtime via skill packages, structured natural-language instruction bundles loaded from a well-known directory. Community install tooling and registries exist, but two gaps persist: no public tool scores skill packages against Anthropic’s published format specification, and no mechanism bundles related skills with the shared context they need to remain mutually coherent. We present Skilldex, a package manager and registry for agent skill packages addressing both gaps. The two novel contributions are: (1) compiler-style format conformance scoring against Anthropic’s skill specification, producing line-level diagnostics on description specificity, frontmatter validity, and structural adherence; and (2) the skillset abstraction, a bundled collection of related skills with shared assets (vocabulary files, templates, reference documents) that enforce cross-skill behavioral coherence. Skilldex also provides supporting infrastructure: a three-tier hierarchical scope system, a human-in-the-loop agent suggestion loop, a metadata-only community registry, and a Model Context Protocol (MCP) server. The system is implemented as a TypeScript CLI (skillpm / spm) with a Hono/Supabase registry backend, and is open-source.
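摘要中"编译器风格的格式合规性评分"可以用一个极简打分器来直观理解:逐项检查前言字段并产出诊断信息。以下字段名('name'、'description')、扣分值与词数阈值均为本文假设的示例,并非 Anthropic 规范或 Skilldex 的真实规则:

```python
def score_skill(frontmatter: dict) -> tuple[int, list[str]]:
    """Toy conformance check in the spirit of compiler-style diagnostics.

    Field names and thresholds are illustrative placeholders, not the actual
    skill specification.
    """
    diagnostics, score = [], 100
    # Required-field checks, one diagnostic per missing field.
    for field in ("name", "description"):
        if field not in frontmatter:
            diagnostics.append(f"missing required field: {field}")
            score -= 40
    # A vague one-liner description is a common spec smell.
    desc = frontmatter.get("description", "")
    if desc and len(desc.split()) < 5:
        diagnostics.append("description too vague: use specific trigger phrasing")
        score -= 20
    return max(score, 0), diagnostics
```

真实的评分器还会检查描述的具体性、结构符合度等更多维度,并给出逐行定位。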
[AI-157] Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
【速读】:该论文旨在解决原生多模态大语言模型(Native Omni-modal Large Language Models, OLLMs)中存在的模态偏好(modality preference)问题,即模型在处理跨模态任务时对某一模态(如视觉)表现出非均衡的倾向,进而可能引发跨模态幻觉(cross-modal hallucinations)。为应对这一挑战,作者首先构建了一个基于冲突的基准测试集,并引入模态选择率(modality selection rate)作为量化指标,系统评估了十种代表性OLLM的模态偏好行为,发现其呈现出从传统视觉语言模型(VLMs)的“文本主导”向“视觉主导”的范式转变。关键创新在于通过逐层探针(layer-wise probing)揭示模态偏好并非静态特性,而是逐步在中后期层中涌现;进一步利用这些内部信号诊断跨模态幻觉,在无需特定任务数据的情况下,在三个下游多模态基准上实现了具有竞争力的性能表现,从而为构建更可信的OLLMs提供了机制性理解与实用工具。
链接: https://arxiv.org/abs/2604.16902
作者: Xinru Yan,Boxi Cao,Yaojie Lu,Hongyu Lin,Weixiang Zhou,Le Sun,Xianpei Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the “text-dominance” of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: this https URL
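论文提出的模态选择率(modality selection rate)本质上是在模态冲突样本上统计模型答案与各模态线索一致的比例。下面是一个示意实现(字段名为本文假设,论文基准有自己的数据格式):

```python
def modality_selection_rate(records):
    """records: dicts holding the model's answer and the answer each
    (conflicting) modality cue supports. Returns per-modality selection rates.
    """
    counts = {"text": 0, "vision": 0, "other": 0}
    for r in records:
        if r["answer"] == r["text_cue"]:
            counts["text"] += 1
        elif r["answer"] == r["vision_cue"]:
            counts["vision"] += 1
        else:
            counts["other"] += 1  # follows neither cue
    n = len(records)
    return {k: v / n for k, v in counts.items()}
```

按论文结论,多数 OLLM 在这类冲突样本上 vision 一项的比例会显著高于 text。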
[AI-158] Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning ACL2026
【速读】:该论文旨在解决大模型在推理过程中因冗余计算导致的资源浪费问题,即“过度思考”(overthinking)现象——尽管长链式思维(chain-of-thought)能提升问题求解能力,但模型常在不必要的步骤中消耗大量计算资源。传统方法如训练时引入长度惩罚会损害模型性能,而推理时的早停机制则增加系统开销。其解决方案的关键在于提出Step-GRPO这一后训练框架,通过引入语言标记(linguistic markers)将优化目标从原始token数量转变为语义步骤(semantic steps),并结合动态截断回放(Dynamic Truncated Rollout)机制与步长感知相对奖励(Step-Aware Relative Reward),使模型在训练中内化动态早停能力,从而在不牺牲准确率的前提下显著提升推理效率。
链接: https://arxiv.org/abs/2604.16890
作者: Benteng Chen,Weida Wang,Shufei Zhang,Mingbao Lin,Min Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted for publication at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Abstract:Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.
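GRPO 的组内相对归一化是标准做法:对同一问题的一组 rollout,用组内均值和标准差归一化奖励得到优势。下面给出该归一化与一个"步数感知"惩罚的最小示意(惩罚形式与 alpha 权重为本文假设,并非论文 Step-Aware Relative Reward 的原始定义):

```python
def group_relative_advantages(rewards):
    """Standard GRPO normalization: each rollout's advantage is its reward
    minus the group mean, divided by the group standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]


def step_aware_reward(correct, n_steps, baseline_steps, alpha=0.1):
    """Illustrative step-aware reward: correctness minus a penalty for using
    more semantic steps than the group baseline (alpha is a made-up weight)."""
    return float(correct) - alpha * max(0, n_steps - baseline_steps)
```

将步数惩罚并入奖励后,组内归一化会让"同样答对但步骤更少"的 rollout 获得正优势。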
[AI-159] SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models
【速读】:该论文旨在解决大语言模型(LLM)和多模态大语言模型(LMM)在长上下文解码过程中因注意力机制需频繁加载大量键值缓存(KV-cache)数据而导致的内存瓶颈问题。现有加速策略通常通过启发式剪枝牺牲准确性,且缺乏对注意力汇聚现象(attention sink phenomenon)的深层理解,导致效率与精度难以兼顾。其解决方案的关键在于揭示了注意力汇聚现象本质上是训练过程中形成的稳定、可到达且误差可控的固定点(fixed point),并据此提出无需训练的SinkRouter选择性路由框架:该框架能检测到“汇点信号”(sink signal),跳过产生近零输出的计算路径;同时设计了硬件感知的Triton内核,支持块级分支和Split-K并行化以实现高效落地。实验证明,该方法在多种长文本和多模态基准测试中显著提升解码效率,最高达2.03倍加速,且保持竞争力的准确性。
链接: https://arxiv.org/abs/2604.16883
作者: Junnan Liu,Xinyan Liu,Peifeng Gao,Zhaobo Qi,Beichen Zhang,Weigang Zhang,Antoni Bert Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In long-context decoding for LLMs and LMMs, attention becomes increasingly memory-bound because each decoding step must load a large amount of KV-cache data from GPU memory. Existing acceleration strategies often trade efficiency for accuracy by relying on heuristic pruning that may discard useful information. At a deeper level, they also tend to indiscriminately preserve all high-scoring tokens, treat early tokens as indispensable anchors, or rely on heuristic head routing, reflecting an insufficient mechanistic understanding of the attention sink phenomenon. In this paper, we show that the attention sink phenomenon corresponds to a stable, reachable, and error-controllable fixed point constructed during training. Based on this insight, we propose SinkRouter, a training-free selective routing framework that detects the sink signal and skips computations that would otherwise produce near-zero output. To translate this mechanism into real-world acceleration, we develop a hardware-aware Triton kernel with block-level branching and Split-K parallelism. We conduct extensive evaluations on a diverse suite of long-context benchmarks, including LongBench, InfiniteBench, CVBench, MileBench, and MMVP, using both text-only and multimodal backbones such as Llama-3.1-8B, Llama-3.1-70B, Yi-9B-200K, LLaVA-1.5-7B, and LLaVA-1.5-13B. Across these settings, SinkRouter consistently improves decoding efficiency while maintaining competitive accuracy, and reaches 2.03x speedup with a 512K context.
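摘要中"跳过会产生近零输出的计算路径"可以用注意力汇点质量(sink mass)来示意:当某个查询的注意力几乎全部落在首个汇点 token 上时,values 的加权和近似等于汇点的 value,可直接走廉价路径。以下阈值与跳过规则为本文假设,并非论文 Triton 内核的真实逻辑:

```python
def sink_mass(attn_row):
    """Fraction of a query's attention mass landing on the first (sink) token."""
    return attn_row[0] / sum(attn_row)


def route(attn_row, values, threshold=0.9):
    """Toy sink-aware routing: when attention collapses onto the sink token,
    skip the full weighted reduction and return the sink value directly."""
    if sink_mass(attn_row) >= threshold:
        return values[0]  # cheap path: output ~= sink value
    total = sum(attn_row)
    return sum(a / total * v for a, v in zip(attn_row, values))
```

真实系统里"廉价路径"意味着无需从显存加载整段 KV-cache,这正是长上下文解码的内存瓶颈所在。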
[AI-160] GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning
【速读】:该论文旨在解决神经符号强化学习(Neuro-symbolic Reinforcement Learning, NeSy-RL)中关系概念(relational concepts)的自动习得问题,即如何在不同环境中自主地将抽象的关系语义(如“左”或“靠近”)与具体环境特征对齐,从而提升策略的可解释性和泛化能力。传统方法依赖人工设计这些概念,导致适应性差且难以迁移。解决方案的关键在于提出GRAIL框架,该框架利用大语言模型(Large Language Models, LLMs)提供通用的概念表示作为弱监督信号,并通过环境交互迭代优化这些表示,使其捕获特定环境下的语义细节,从而有效缓解稀疏奖励和概念错位问题,在Atari游戏中的实验验证了其优于手工设计概念的性能表现。
链接: https://arxiv.org/abs/2604.16871
作者: Hikaru Shindo,Henri Rößler,Quentin Delfosse,Kristian Kersting
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
Abstract:Neuro-symbolic Reinforcement Learning (NeSy-RL) combines symbolic reasoning with gradient-based optimization to achieve interpretable and generalizable policies. Relational concepts, such as “left of” or “close by”, serve as foundational building blocks that structure how agents perceive and act. However, conventional approaches require human experts to manually define these concepts, limiting adaptability since concept semantics vary across environments. We propose GRAIL (Grounding Relational Agents through Interactive Learning), a framework that autonomously grounds relational concepts through environmental interaction. GRAIL leverages large language models (LLMs) to provide generic concept representations as weak supervision, then refines them to capture environment-specific semantics. This approach addresses both sparse reward signals and concept misalignment prevalent in underdetermined environments. Experiments on the Atari games Kangaroo, Seaquest, and Skiing demonstrate that GRAIL matches or outperforms agents with manually crafted concepts in simplified settings, and reveals informative trade-offs between reward maximization and high-level goal completion in the full environment.
[AI-161] Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives
【速读】:该论文旨在解决当前AI代理(AI agents)通过模型上下文协议(Model Context Protocol, MCP)调用外部工具(如文件系统、网络接口或API)时存在的安全治理缺失问题。现有方案将安全控制完全置于用户空间,导致攻击者可通过简短脚本轻易绕过防护机制,存在严重安全隐患。解决方案的核心是提出“受管MCP”(Governed MCP),一个基于内核的工具治理网关,其关键创新在于引入基于logit的安全原语ProbeLogits(详见配套论文arXiv:2604.11943),并通过六层流水线实现细粒度控制:包括模式验证、信任层级检查、速率限制、对抗预过滤、ProbeLogits语义决策门和宪法策略匹配,并辅以Blake3哈希审计链。实验证明,移除ProbeLogits层会使F1分数从0.773骤降至0.327,表明仅靠传统规则防火墙无法有效保障安全性;同时,所有WASM到系统主机函数均经由该网关中继,实现了对WASM ABI表面的完整中介,从根本上杜绝了用户空间绕过攻击的可能性。
链接: https://arxiv.org/abs/2604.16870
作者: Daeyeon Son
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
备注: 12 pages. Companion paper to arXiv:2604.11943 (ProbeLogits)
Abstract:AI agents increasingly call external tools (file system, network, APIs) through the Model Context Protocol (MCP). These tool calls are the agent’s syscalls – privileged operations with side effects on shared state – yet today’s safety enforcement lives entirely in userspace, where a 10-line script can bypass it. I propose Governed MCP, a kernel-resident tool governance gateway built on a logit-based safety primitive (ProbeLogits, companion paper: arXiv:2604.11943). The gateway interposes on every MCP tool call in a 6-layer pipeline: schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits gate (the load-bearing semantic check), and constitutional policy match, with a Blake3-hashed audit chain. I implement Governed MCP in Anima OS, a bare-metal x86_64 OS in approximately 86,000 lines of Rust. The five non-inference layers add 65.3 microseconds of overhead per call; ProbeLogits adds 65 ms (per-token-class semantic decision) on 7B Q4_0. A 4-config ablation on a 101-prompt MCP-domain benchmark shows that removing the ProbeLogits layer collapses F1 from 0.773 to 0.327 (Delta F1 = -0.446) – hand-rule firewalling alone is insufficient. All 15 WASM-to-system host functions in the runtime route through the gateway (complete mediation of the WASM ABI surface; the scope and caveats of this claim are stated in Section 4.6); a 10-LoC userspace bypass that defeats existing guardrail libraries is structurally impossible against the kernel-resident gate.
[AI-162] GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba
【速读】:该论文旨在解决传统交通预测模型难以有效捕捉交通数据中复杂时空依赖关系的问题。其解决方案的关键在于提出GAMMA-Net架构,该架构融合了图注意力网络(Graph Attention Networks, GAT)与多轴选择性状态空间模型(Multi-axis Selective State Space Models, Mamba)。其中,GAT通过自注意力机制动态调整交通网络中节点间的影响力,实现基于实时条件的自适应空间依赖建模;而Mamba模块则高效地建模长期时空动态特性,同时避免了传统循环结构带来的高计算开销。二者协同作用显著提升了预测精度,在多个基准数据集上相比基线模型最高可降低16.25%的平均绝对误差(Mean Absolute Error, MAE)。
链接: https://arxiv.org/abs/2604.16859
作者: Dongyi He,Yuanquan Gao,Bin Jiang,He Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate traffic forecasting is crucial for intelligent transportation systems, supporting effective traffic management, congestion reduction, and informed urban planning. However, traditional models often fail to adequately capture the intricate spatio-temporal dependencies present in traffic data. To overcome these limitations, we introduce GAMMA-Net, a novel approach that integrates Graph Attention Networks (GAT) with multi-axis Selective State Space Models (Mamba). The GAT component uses a self-attention mechanism to dynamically adjust the influence of nodes within the traffic network, enabling adaptive spatial dependency modeling based on real-time conditions. Simultaneously, the Mamba module efficiently models long-term temporal and spatial dynamics without the heavy computational cost of conventional recurrent architectures. Extensive experiments on several benchmark traffic datasets, including METR-LA, PEMS-BAY, PEMS03, PEMS04, PEMS07, and PEMS08, show that GAMMA-Net consistently outperforms existing state-of-the-art models across different prediction horizons, achieving up to a 16.25% reduction in Mean Absolute Error (MAE) compared to baseline models. Ablation studies highlight the critical contributions of both the spatial and temporal components, emphasizing their complementary role in improving prediction accuracy. In conclusion, the GAMMA-Net model sets a new standard in traffic forecasting, offering a powerful tool for next-generation traffic management and urban planning. The code for this study is available at this https URL
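GAT 的核心机制是对每个节点的邻居打分做 softmax 归一化得到注意力系数 alpha_ij,再据此加权聚合邻居特征。下面是一个与框架无关的标量版示意(真实实现作用于特征向量并包含可学习的打分参数):

```python
import math


def gat_coefficients(scores):
    """Softmax over a node's neighbor scores -- the normalization GAT uses to
    turn raw attention logits e_ij into weights alpha_ij."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def aggregate(scores, neighbor_feats):
    """Weighted sum of (scalar) neighbor features under the attention weights."""
    alphas = gat_coefficients(scores)
    return sum(a * h for a, h in zip(alphas, neighbor_feats))
```

由于 alpha_ij 依赖实时打分,这正对应摘要中"基于实时条件动态调整节点影响力"的自适应空间建模。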
[AI-163] Refinement of Accelerated Demonstrations via Incremental Iterative Reference Learning Control for Fast Contact-Rich Imitation Learning IROS2026
【速读】:该论文旨在解决接触密集型操作任务中,模仿学习(Imitation Learning, IL)因人类示范速度受限而导致训练效率低下的问题。传统方法若直接加速示范,会改变接触动力学并引发显著跟踪误差,从而影响策略性能。其核心解决方案是提出增量式迭代参考学习控制(Incremental Iterative Reference Learning Control, I2RLC),通过逐步增加执行速度的同时迭代更新参考轨迹,以补偿早期迭代中的大误差并改善瞬态稳定性,从而生成高保真度的高速示范轨迹。此方法在真实机器人白板擦除和插销入孔任务中验证有效,显著提升了示范速度(最高达10倍)与空间相似性,并使训练出的IL策略在保持高成功率的同时降低接触力,为快速、稳定的接触密集型模仿学习提供了实用路径。
链接: https://arxiv.org/abs/2604.16850
作者: Koki Yamane,Cristian C. Beltran-Hernandez,Steven Oh,Masashi Hamaya,Sho Sakaino
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 8 pages, 11 figures, submitted to IROS 2026
Abstract:Fast execution of contact-rich manipulation is critical for practical deployment, yet providing fast demonstrations for imitation learning (IL) remains challenging: humans cannot demonstrate at high speed, and naively accelerating demonstrations alters contact dynamics and induces large tracking errors. We present a method to autonomously refine time-accelerated demonstrations by repurposing Iterative Reference Learning Control (IRLC) to iteratively update the reference trajectory from observed tracking errors. However, applying IRLC directly at high speed tends to produce larger early-iteration errors and less stable transients. To address this issue, we propose Incremental Iterative Reference Learning Control (I2RLC), which gradually increases the speed while updating the reference, yielding high-fidelity trajectories. We validate on real-robot whiteboard erasing and peg-in-hole tasks using a teleoperation setup with a compliance-controlled follower and a 3D-printed haptic leader. Both IRLC and I2RLC achieve up to 10x faster demonstrations with reduced tracking error; moreover, I2RLC improves spatial similarity to the original trajectories by 22.5% on average over IRLC across three tasks and multiple speeds (3x-10x). We then use the refined trajectories to train IL policies; the resulting policies execute faster than the demonstrations and achieve 100% success rates in the peg-in-hole task at both seen and unseen positions, with I2RLC-trained policies exhibiting lower contact forces than those trained on IRLC-refined demonstrations. These results indicate that gradual speed scheduling coupled with reference adaptation provides a practical path to fast, contact-rich IL.
[AI-164] The CTLNet for Shanghai Composite Index Prediction
【速读】:该论文旨在解决上证综合指数(Shanghai Composite Index)预测问题,这是投资者和学术研究者广泛关注的热点。针对传统深度学习模型在处理长序列依赖性和多变量数据相关性时的局限性,论文提出了一种融合卷积神经网络(CNN)、Transformer编码器与长短期记忆网络(LSTM)优势的混合模型——CNN-Transformer-LSTM Networks(CTLNet)。其解决方案的关键在于通过CNN提取局部特征、Transformer捕捉全局依赖关系,并利用LSTM建模时间序列中的长期动态变化,从而实现对多变量金融时间序列更精准的预测性能。
链接: https://arxiv.org/abs/2604.16835
作者: Haibin Jiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Shanghai Composite Index prediction has become a hot issue for many investors and academic researchers. Deep learning models are widely applied in multivariate time series forecasting, including recurrent neural networks (RNN), convolutional neural networks (CNN), and transformers. Specifically, the Transformer encoder, with its unique attention mechanism and parallel processing capabilities, has become an important tool in time series prediction, and has an advantage in dealing with long sequence dependencies and multivariate data correlations. Drawing on the strengths of various models, we propose the CNN-Transformer-LSTM Networks (CTLNet). This paper explores the application of CTLNet for Shanghai Composite Index prediction and the comparative experiments show that the proposed model outperforms state-of-the-art baselines.
[AI-165] The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation
【速读】:该论文旨在解决在线蒸馏(On-policy Distillation, OPD)在后训练语言模型时存在的置信度校准失衡问题,即OPD虽能提升任务准确率,但会导致模型产生严重的过度自信(overconfidence)。其根本原因在于教师模型在训练时依赖于部署阶段不可用的特权上下文信息(privileged context),而学生模型在部署时只能使用有限的实时输入信息,从而引发信息不匹配。解决方案的关键是提出校准感知的OPD框架(Calibration-aware OPD, CaOPD),通过模型回放(rollouts)估计经验置信度,替代原生的自报告置信度作为蒸馏目标,并沿用相同的自蒸馏流程进行优化。这一方法实现了校准性能与能力之间的帕累托最优,且在分布外(out-of-distribution)和持续学习场景下具有鲁棒性。
链接: https://arxiv.org/abs/2604.16830
作者: Jiaxin Zhang,Xiangyu Peng,Qinglin Chen,Qinyuan Ye,Caiming Xiong,Chien-Sheng Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 40 pages, Code: this https URL
Abstract:On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize this perspective theoretically, showing that teacher-conditioned success is generally not a valid target for deployment-time confidence and that helpful privileged context induces entropy collapse and a systematic optimism bias. To address this, we propose a calibration-aware OPD framework, CaOPD, that estimates empirical confidence from model rollouts, replaces self-reported confidence with this student-grounded target, and distills the revised response through the same self-distillation pipeline. Experiments across various models and domains show that CaOPD achieves Pareto-optimal calibration while maintaining competitive capability, generalizing robustly under out-of-distribution and continual learning. Our findings highlight that capability distillation does not imply calibrated confidence, and that confidence should be treated as an essential objective in post-training. Code: this https URL
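CaOPD 用模型自身 rollout 的成功率作为"经验置信度",以替代自报告置信度作为蒸馏目标。下面示意该估计以及常用于衡量过度自信的期望校准误差(ECE);ECE 为校准评估的标准指标,此处实现与分箱数是通用写法,并非论文特定配置:

```python
def empirical_confidence(rollout_outcomes):
    """Student-grounded confidence: the fraction of the model's own rollouts
    (1 = solved, 0 = failed) that succeed on a problem."""
    return sum(rollout_outcomes) / len(rollout_outcomes)


def expected_calibration_error(confs, corrects, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average
    |accuracy - mean confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, corrects):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    n, ece = len(confs), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece
```

摘要所说的"失校"即:蒸馏后自报告置信度系统性高于经验成功率,ECE 随之升高。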
[AI-166] SafeDream: Safety World Model for Proactive Early Jailbreak Detection
【速读】:该论文旨在解决多轮越狱攻击(multi-turn jailbreak attacks)对大语言模型(LLM)安全对齐的渐进式侵蚀问题,此类攻击通过看似无害的对话轮次逐步突破模型的安全防护机制,现有基于对齐和护栏的方法存在三大局限:需昂贵的权重修改、未建模各轮次累积的安全风险、且仅在有害内容生成后才触发检测。解决方案的关键在于提出一种轻量级世界模型框架 SAFEDREAM,其核心创新包括:(1) 安全状态世界模型(safety state world model),将 LLM 隐藏状态压缩为紧凑的安全表征并预测其跨轮演化;(2) CUSUM 检测机制,累积弱的单轮风险信号以形成可靠证据;(3) 对比想象(contrastive imagination),在潜在空间中并行模拟攻击与良性未来轨迹,在越狱发生前发出早期警报。该方案无需修改 LLM 权重,在多个基准测试中实现了平均提前 1.06–1.20 轮的检测时效,同时保持较低误报率并显著优于基线方法。
链接: https://arxiv.org/abs/2604.16824
作者: Bo Yan,Weikai Lin,Yada Zhu,Song Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SAFEDREAM, a lightweight world-model-based framework that operates as an external module without modifying the LLM’s weights. SAFEDREAM introduces three components: (1) a safety state world model that encodes LLM hidden states into a compact safety representation and predicts how it evolves across turns, (2) CUSUM detection that accumulates weak per-turn risk signals into reliable evidence, and (3) contrastive imagination that simultaneously rolls out attack and benign futures in latent space to issue early alarms before jailbreaks occur. On three multi-turn jailbreak benchmarks (XGuard-Train, SafeDialBench, SafeMTData) against 8 baselines, SAFEDREAM achieves the best detection timeliness across all benchmarks (1.06-1.20 turns before compliance) while maintaining competitive false positive rates and outperforming baselines in detection quality.
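摘要中的 CUSUM 检测是经典的累积和变点检测:逐轮累积超出漂移项的风险信号,越过阈值即报警,从而把弱的单轮信号汇聚成可靠证据。下面是单边 CUSUM 的最小示意(drift 与 threshold 取值为本文假设的示例,并非论文实际配置):

```python
def cusum_alarm(risks, drift=0.1, threshold=1.0):
    """One-sided CUSUM over per-turn risk scores: accumulate evidence above a
    drift term and alarm once the running sum crosses a threshold.
    Returns the 1-indexed turn of the alarm, or None if no alarm fires."""
    s = 0.0
    for t, x in enumerate(risks, start=1):
        s = max(0.0, s + x - drift)  # reset to zero when evidence is weak
        if s > threshold:
            return t
    return None
```

单轮低于 drift 的零星风险不会触发报警,而持续攀升的多轮侵蚀会在有害内容生成前被累积捕获。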
[AI-167] Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration ACL2026
【速读】:该论文旨在解决现实世界中类别不平衡数据导致的下游分类性能下降问题,特别是针对结构化表格数据(tabular data)的生成式合成方法尚不成熟、且缺乏有效反馈机制以持续优化生成数据质量的挑战。解决方案的关键在于提出一种统一的上下文学习框架RDDG(Relational Data Generator with Dynamic Guidance),其核心创新包括:首先通过核心集选择(core set selection)识别原始数据中的代表性样本,继而利用上下文学习挖掘属性间的内在模式与关联性,随后在保持约束条件的前提下生成表格数据;更重要的是,引入自强化反馈机制(self-reinforcing feedback mechanism),对生成数据的质量进行自动评估并驱动迭代优化,从而显著提升生成数据的真实性(data fidelity)和下游不平衡分类任务的性能。
链接: https://arxiv.org/abs/2604.16817
作者: Chongsheng Zhang,Hao Wang,Zelong Yu,Esteban Garces Arias,Julian Rodemann,Zhanshuo Zhang,Qilong Li,Gaojuan Fan,Krikamol Muandet,Christian Heumann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to appear at: Findings of the Association for Computational Linguistics: ACL 2026 (ACL 2026 Findings), San Diego, California, USA, July 2-7, 2026
Abstract:Imbalanced data is commonly present in real-world applications. While data synthesis can effectively mitigate the data scarcity problem of rare-classes, and LLMs have revolutionized text generation, the application of LLMs to relational/structured tabular data synthesis remains underexplored. Moreover, existing approaches lack an effective feedback mechanism that can guide LLMs towards continuously optimizing the quality of the generated data throughout the synthesis process. In this work, we propose RDDG, Relational Data generator with Dynamic Guidance, which is a unified in-context learning framework that employs progressive chain-of-thought (CoT) steps to generate tabular data for enhancing downstream imbalanced classification performance. RDDG first uses core set selection to identify representative samples from the original data, then utilizes in-context learning to discover the inherent patterns and correlations among attributes within the core set, and subsequently generates tabular data while preserving the aforementioned constraints. More importantly, it incorporates a self-reinforcing feedback mechanism that provides automatic assessments on the quality of the generated data, enabling continuous quality optimization throughout the generation process. Experimental results on multiple real and synthetic datasets demonstrate that RDDG outperforms existing approaches in both data fidelity and downstream imbalanced classification performance. We make our code available at this https URL.
[AI-168] Introspection Adapters: Training LLM s to Report Their Learned Behaviors
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在微调后可能产生不可预测、有害或难以检测的行为,而现有审计手段缺乏高效性和普适性的问题。其核心解决方案是提出一种可扩展的“自省适配器”(introspection adapter, IA),该方法通过在共享基础模型上植入特定行为并训练多个微调模型(M_i)来构建标注数据集,进而联合训练一个LoRA适配器,使其能够使不同微调路径下的模型均能用自然语言描述自身行为。关键创新在于IA具备跨微调方式的泛化能力,可在未见过的微调场景中识别隐匿行为(如AuditBench测试中的恶意行为)甚至检测加密API攻击,且随模型规模和训练数据多样性提升而表现更优,从而为LLM审计提供了一种高效、实用的新范式。
链接: https://arxiv.org/abs/2604.16812
作者: Keshav Shenoy,Li Yang,Abhay Sheshadri,Sören Mindermann,Jack Lindsey,Sam Marks,Rowan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model M, our method works by finetuning models M_i from M with implanted behaviors b_i; the (M_i, b_i) pairs serve as labeled training data. We then train an introspection adapter (IA): a single LoRA adapter jointly trained across the finetunes M_i to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of M that were trained in very different ways from the M_i. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.
[AI-169] AutoOR: Scalably Post-training LLM s to Autoformalize Operations Research Problems
【速读】:该论文旨在解决将复杂优化问题从自然语言描述自动转化为求解器可用的形式化表达这一难题,这通常需要专业的运筹学(Operations Research, OR)知识,限制了其在工业场景中的规模化应用。解决方案的关键在于提出AutoOR框架,该框架结合合成数据生成与强化学习(Reinforcement Learning, RL)策略:首先基于标准优化形式生成可验证的训练数据,再利用求解器执行反馈作为强化学习的奖励信号进行后训练(post-training),从而让大型语言模型(Large Language Models, LLMs)具备跨线性、混合整数和非线性类别自动形式化优化问题的能力。针对物理动力学相关的非线性问题类,论文进一步引入课程强化学习(curriculum RL)策略,从有限初始数据出发逐步提升模型性能,显著改善了传统前沿模型在该类问题上近乎0%表现的瓶颈。
链接: https://arxiv.org/abs/2604.16804
作者: Sumeet Ramesh Motwani,Chuan Du,Aleksander Petrov,Christopher Davis,Philip Torr,Antonio Papania-Davis,Weishi Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires specialized operations research (OR) expertise, making it hard to scale. We present AutoOR, a scalable synthetic data generation and reinforcement learning pipeline that trains LLMs to autoformalize optimization problems specified in natural language across linear, mixed-integer, and non-linear categories. AutoOR generates verified training data from standard optimization forms and uses solver execution feedback as the reward signal for RL post-training. AutoOR applied to an 8B model achieves state-of-the-art or competitive results across six established OR benchmarks, matching significantly larger frontier models. For a non-linear problem class involving physical dynamics, where frontier models score near 0%, we introduce a curriculum RL strategy that bootstraps from limited initial training data to make this class tractable for post-training. We believe that methods such as AutoOR can significantly accelerate industrial decision-making with AI.
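"以求解器执行反馈作为奖励信号"可以示意为:执行候选形式化,求出最优值并与参考形式化比较,一致得 1、否则得 0。以下用对显式可行集穷举代替真实求解器,目标函数与可行集均为玩具示例:

```python
def solve_by_enumeration(objective, feasible):
    """Stand-in for a real solver: maximize an objective over an explicit
    feasible set (adequate for a tiny integer program)."""
    return max(objective(x) for x in feasible)


def execution_reward(candidate_obj, reference_obj, feasible, tol=1e-6):
    """Binary RL reward from solver feedback: 1 if the candidate formulation
    reaches the same optimum as the reference, else 0."""
    try:
        got = solve_by_enumeration(candidate_obj, feasible)
    except Exception:
        return 0.0  # unexecutable formulations earn zero reward
    ref = solve_by_enumeration(reference_obj, feasible)
    return 1.0 if abs(got - ref) <= tol else 0.0
```

这种二值、可自动验证的奖励正是论文后训练流水线能规模化的原因:无需人工标注即可判定形式化是否正确。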
[AI-170] Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
【Quick Read】: This paper addresses the reliability and bias problems of large language models (LLMs) acting as code judges (LLM-as-a-Judge): when exhaustive human review or executable test coverage is unavailable, their verdicts can swing markedly with small prompt edits and even diverge under semantics-preserving changes. The key of the study is a measurement-first approach that systematically analyzes two judging regimes across code generation, code repair, and test generation, controlling one prompt cue at a time to isolate sources of bias and quantify how prompt-induced bias affects judgment consistency and sensitivity. The study finds that subtle prompt differences systematically shift judge preferences even when the code itself is unchanged, enough to alter task-level conclusions and even model rankings; it therefore proposes reporting bias sensitivity alongside accuracy and incorporating explicit controls to make model comparison in software engineering more trustworthy and reproducible.
Link: https://arxiv.org/abs/2604.16790
Authors: Zixiao Zhao,Amirreza Esmaeili,Fatemeh Fard
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models are increasingly used as judges to evaluate code artifacts when exhaustive human review or executable test coverage is unavailable. LLM-judge is increasingly relevant in agentic software engineering workflows, where it can help rank candidate solutions and guide patch selection. While attractive for scale, current practice lacks a principled account of reliability and bias: repeated evaluations of the same case can disagree; small prompt edits can swing outcomes; and seemingly semantics-preserving, human-equivalent perturbations may elicit divergent verdicts. This paper studies LLM-as-a-Judge for code through a measurement-first lens. We analyze two pointwise judging regimes across code generation, code repair task, and test generation, and we systematically probe prompt-induced biases. Our study considers difficulty levels for repeated runs and controlled prompt interventions that isolate one presentation cue at a time, and it evaluates judges using consistency and sensitivity to bias. We find that judge decisions are highly sensitive to prompt biases even when the underlying code snippet is unchanged. Across all three tasks, several biases systematically shift preferences toward the option favored by the prompt, improving accuracy when that option aligns with the gold answer but substantially reducing it otherwise. In some settings, these effects are large enough to change task-level conclusions and alter relative model rankings. These findings show that reported judge performance may reflect prompt artifacts rather than stable assessment ability, posing a direct threat to the validity and reproducibility of code evaluation. We therefore argue that LLM-as-a-Judge studies should report bias sensitivity alongside accuracy and incorporate explicit controls to support more trustworthy model comparison in software engineering.
[AI-171] Federation over Text: Insight Sharing for Multi-Agent Reasoning
【Quick Read】: This paper addresses the problem that LLM-powered agents typically reason from scratch on each new problem and lack mechanisms to automatically transfer learned skills to other agents. The key of the solution is a federated learning-like framework, Federation over Text (FoT), which collaborates at the semantic rather than the gradient level: multiple agents working on different tasks iteratively share and aggregate their reasoning traces, building a cross-task, cross-domain library of metacognitive insights that current and future agents can reuse, markedly improving reasoning efficiency and effectiveness. Experiments show superior performance in mathematical problem solving, cross-domain collaboration, and machine-learning research insight discovery.
Link: https://arxiv.org/abs/2604.16778
Authors: Dixi Yao,Tahseen Rabbani,Tian Li
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages
Abstract:LLM-powered agents often reason from scratch when presented with a new problem instance and lack automatic mechanisms to transfer learned skills to other agents. We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple agents solving different tasks to collectively generate a shared library of metacognitive insights by iteratively federating their local reasoning processes. Instead of federation over gradients (e.g., as in distributed training), FoT operates at the semantic level without any gradient optimization or supervision signal. Iteratively, each agent does local thinking and self-improvement on their specific tasks independently, and shares reasoning traces with a central server, which aggregates and distills them into a cross-task (and cross-domain) insight library that existing and future agents can leverage to improve performance on related tasks. Experiments show that FoT improves reasoning effectiveness and efficiency across a wide range of challenging applications, including mathematical problem solving, cross-domain collaboration, and machine learning research insight discovery. Specifically, it improves average accuracies of downstream tasks by 24% while reducing the reasoning tokens by 28% across the first two applications. In the research insight discovery application, FoT is able to generate insights that cover over 90% of the major contributions in the subsequent papers.
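The "federation over text" loop can be sketched in a few lines: agents emit text insights, a server merges them into a shared library. Everything here (the insight strings, the dedup-only aggregation) is a deliberately simplified assumption about the mechanism.

```python
# Rough sketch of semantic-level federation (FoT-style): agents share text
# insights; a server aggregates them into a shared library. Names are made up.

def local_insight(task):
    # Stand-in for an agent distilling its reasoning trace into one insight.
    return f"for {task} problems, check intermediate results before answering"

def aggregate(insights, library):
    # Server-side step: merge new insights, dropping exact duplicates.
    for s in insights:
        if s not in library:
            library.append(s)
    return library

library = []
for _round in range(2):  # iterative federation rounds
    shared = [local_insight(t) for t in ("algebra", "geometry", "algebra")]
    library = aggregate(shared, library)
```

In the paper the aggregation is a distillation step done by the server model, not literal deduplication; the sketch only shows the gradient-free, text-only data flow.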
[AI-172] SAVE: A Generalizable Framework for Multi-Condition Single-Cell Generation with Gene Block Attention ICLR2026
【Quick Read】: This paper tackles the shortcomings of single-cell gene-expression modeling that ignores high-order biological relationships among genes, especially when characterizing cellular states and simulating unseen scenarios under diverse biological and experimental conditions. The core of the solution is SAVE, a unified generative framework built on conditional Transformers that groups semantically related genes into blocks, forming a coarse-grained representation that captures higher-order dependencies among gene modules; a Flow Matching mechanism and a condition-masking strategy further enhance simulation flexibility and extrapolative generalization to unseen condition combinations.
Link: https://arxiv.org/abs/2604.16776
Authors: Jiahao Li,Jiayi Dong,Peng Ye,Xiaochi Zhou,Haohai Lu,Fei Wang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2026
Abstract:Modeling single-cell gene expression across diverse biological and technical conditions is crucial for characterizing cellular states and simulating unseen scenarios. Existing methods often treat genes as independent tokens, overlooking their high-level biological relationships and leading to poor performance. We introduce SAVE, a unified generative framework based on conditional Transformers for multi-condition single-cell modeling. SAVE leverages a coarse-grained representation by grouping semantically related genes into blocks, capturing higher-order dependencies among gene modules. A Flow Matching mechanism and condition-masking strategy further enhance flexible simulation and enable generalization to unseen condition combinations. We evaluate SAVE on a range of benchmarks, including conditional generation, batch effect correction, and perturbation prediction. SAVE consistently outperforms state-of-the-art methods in generation fidelity and extrapolative generalization, especially in low-resource or combinatorially held-out settings. Overall, SAVE offers a scalable and generalizable solution for modeling complex single-cell data, with broad utility in virtual cell synthesis and biological interpretation. Our code is publicly available at this https URL
[AI-173] Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models ALT
【Quick Read】: This paper investigates how input representation affects downstream clinical prediction in generative medical event models under a shared pretraining budget; the central question is how data-encoding choices influence generalization and accuracy across clinical outcomes. The key of the solution is a systematic evaluation of tokenization strategies (code-value fusion, finer quantization, reference-range anchoring), value encodings (hard bins, soft discretization, code-normalized xVal), and temporal encodings (event order, time tokens, admission-relative RoPE). Code-value fused tokenization markedly improves AUROC for mortality and length-of-stay prediction; admission-relative RoPE matches or exceeds inserting time tokens without lengthening sequences; and Common Longitudinal ICU Format (CLIF) remapping preserves performance with a smaller token set, improving multi-site applicability.
Link: https://arxiv.org/abs/2604.16775
Authors: Inhyeok Lee,Luke Solo,Michael C. Burkhart,Bashar Ramadan,William F. Parker,Brett K. Beaulieu-Jones
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 39 pages. Submitted to Machine Learning for Healthcare 2026
Abstract:Every prediction from a generative medical event model is bounded by how clinical events are tokenized, yet input representation is rarely isolated from other system and architectural choices. We evaluate how representation decisions affect downstream prediction after a shared one-epoch pretraining budget. We train 28 matched transformers on MIMIC-IV and evaluate them on 30 clinical outcomes in three experiments: (1) quantization granularity, reference-range anchoring, and code-value fusion; (2) value encoding (hard bins, soft discretization, code-normalized xVal) crossed with temporal encoding (event order, time tokens, admission-relative RoPE); and (3) native MIMIC laboratory/vital codes versus the Common Longitudinal ICU Format (CLIF)-remapped laboratory/vital codes with compression-preserving perturbation arms. In Experiment 1, fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 (BH-adjusted p < 0.001), hospital length-of-stay AUROC from 0.763 to 0.788 (BH-adjusted p < 0.001), and, for the decile fused-vs-unfused comparison, mean regression Spearman rho across the 13 regression outcomes from 0.414 to 0.494. Across the three temporal encodings, event order only and admission-relative RoPE match or exceed inserting time tokens on average while shortening sequences by 11%. CLIF remapping preserves downstream performance in our single-site setting while yielding a smaller, clinically interpretable token set compatible with multi-site use. Finer-than-decile quantization, reference-range anchoring, and soft discretization help in selective outcomes, while code-normalized xVal remains well below the discrete and soft families, consistent with near-median suppression that persists after the affine variant.
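The winning representation above, code-value fusion with decile bins, can be sketched as follows. The token format and the decile rule are illustrative assumptions, not the paper's exact vocabulary.

```python
# Sketch of code-value fused tokenization: each (lab code, value decile) pair
# becomes a single vocabulary entry instead of separate code and value tokens.
import bisect

def decile(value, sorted_refs):
    """0..9 bin of `value` among the sorted reference values for that code."""
    rank = bisect.bisect_right(sorted_refs, value)
    return min(9, (10 * rank) // (len(sorted_refs) + 1))

def fused_token(code, value, sorted_refs):
    # One fused token per event; the model never sees a bare value token.
    return f"{code}|Q{decile(value, sorted_refs)}"

refs = sorted(float(v) for v in range(100))   # toy per-code reference values
token_mid = fused_token("LAB_NA", 50.0, refs)
token_low = fused_token("LAB_NA", 0.0, refs)
```

Fusion grows the vocabulary (one entry per code-decile pair) but halves the sequence length contributed by valued events, which is where the AUROC gains plausibly come from.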
[AI-174] CapSeal: Capability-Sealed Secret Mediation for Secure Agent Execution
【Quick Read】: This paper addresses the security risks of modern AI agents that are directly exposed to sensitive credentials such as API keys and SSH credentials: under prompt injection, tool misuse, and model-controlled exfiltration, passing credentials via environment variables, local files, or forwarding sockets is a serious hazard. The key of the solution is the CapSeal architecture, a capability-sealed secret mediation scheme whose core idea is to replace the agent's direct secret access with constrained invocations through a local trusted broker, granting non-exportable, narrowly scoped action capabilities rather than distributing keys. The architecture combines capability issuance, schema-constrained HTTP execution, broker-executed SSH actions, anti-replay session binding, policy evaluation, and tamper-evident audit trails, fundamentally reframing secret handling for agentic systems.
Link: https://arxiv.org/abs/2604.16762
Authors: Shutong Jin,Ruiyi Guo,Ray C. C. Cheung
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures. Research preprint on secure secret mediation for agent systems
Abstract:Modern AI agents routinely depend on secrets such as API keys and SSH credentials, yet the dominant deployment model still exposes those secrets directly to the agent process through environment variables, local files, or forwarding sockets. This design fails against prompt injection, tool misuse, and model-controlled exfiltration because the agent can both use and reveal the same bearer credential. We present CapSeal, a capability-sealed secret mediation architecture that replaces direct secret access with constrained invocations through a local trusted broker. CapSeal combines capability issuance, schema-constrained HTTP execution, broker-executed SSH actions, anti-replay session binding, policy evaluation, and tamper-evident audit trails. We describe a Rust prototype integrated with an MCP-facing adapter, formulate conditional security goals for non-disclosure, constrained use, replay resistance, and auditability, and define an evaluation plan spanning prompt injection, tool misuse, and SSH abuse. The resulting system reframes secret handling for agentic systems from handing the model a key to granting the model a narrowly scoped, non-exportable action capability.
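The broker idea can be illustrated in a few lines: the agent only ever holds an opaque, host-scoped capability handle, while the credential itself never leaves the broker. This is a hypothetical sketch, not CapSeal's Rust implementation or its actual policy model.

```python
# Hypothetical sketch of capability-sealed mediation: the agent gets an
# opaque capability bound to one host; the broker keeps the bearer credential.
import secrets

class Broker:
    def __init__(self, credential, allowed_hosts):
        self._credential = credential      # never leaves the broker
        self._allowed = set(allowed_hosts)
        self._caps = {}                    # capability handle -> bound host

    def issue(self, host):
        if host not in self._allowed:
            raise PermissionError(host)
        cap = secrets.token_hex(8)         # non-exportable, unguessable handle
        self._caps[cap] = host
        return cap                         # the agent sees only this

    def invoke(self, cap, host):
        # Constrained use: a capability works only for the host it binds.
        if self._caps.get(cap) != host:
            return "denied"
        return f"called {host} with sealed credential"

broker = Broker("sk-live-XYZ", {"api.example.com"})
cap = broker.issue("api.example.com")
ok = broker.invoke(cap, "api.example.com")
blocked = broker.invoke(cap, "evil.example.com")
```

Even a fully prompt-injected agent can leak only `cap`, which is useless outside the broker and useless for any host other than the one it was issued for.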
[AI-175] Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering
【Quick Read】: This paper addresses prompt-induced cognitive biases in generative AI used for software engineering (SE) decision support: model decisions shift with the input wording rather than the task logic, yielding suboptimal choices, and existing prompt-engineering strategies (e.g., chain-of-thought, self-debiasing) do not significantly reduce per-bias sensitivity. The key of the solution is a Prolog-style reasoning mechanism: SE best practices are elicited explicitly as background axioms and structured reasoning cues are injected into the prompt, blocking bias-inducing features from short-circuiting implicit assumptions, ultimately reducing overall bias sensitivity on SE decision tasks by 51% on average (p < .001).
Link: https://arxiv.org/abs/2604.16756
Authors: Francesco Sovrano,Gabriele Dominici,Alberto Bacchelli
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the proceedings of FSE'2026
Abstract:Prompt-induced cognitive biases are changes in a general-purpose AI (GPAI) system’s decisions caused solely by biased wording in the input (e.g., framing, anchors), not task logic. In software engineering (SE) decision support (where problem statements and requirements are natural language) small phrasing shifts (e.g., popularity hints or outcome reveals) can push GPAI models toward suboptimal decisions. We study this with PROBE-SWE, a dynamic benchmark for SE that pairs biased and unbiased versions of the same SE dilemmas, controls for logic and difficulty, and targets eight SE-relevant biases (anchoring, availability, bandwagon, confirmation, framing, hindsight, hyperbolic discounting, overconfidence). We ask whether prompt engineering mitigates bias sensitivity in practice, focusing on actionable techniques that practitioners can apply off-the-shelf in real environments. Testing common strategies (e.g., chain-of-thought, self-debiasing) on cost-effective GPAI systems, we find no statistically significant reductions in bias sensitivity on a per-bias basis. We then adopt a Prolog-style view of the reasoning process: solving SE dilemmas requires making explicit any background axioms and inference assumptions (i.e., SE best practices) that are usually implicit in the prompt. So, we hypothesize that bias-inducing features short-circuit assumptions elicitation, pushing GPAI models toward biased shortcuts. Building on this, we introduce an end-to-end method that elicits best practices and injects axiomatic reasoning cues into the prompt before answering, reducing overall bias sensitivity by 51% on average (p < .001). Finally, we report a thematic analysis that surfaces linguistic patterns associated with heightened bias sensitivity, clarifying when GPAI use is less advisable for SE decision support and where to focus future countermeasures.
[AI-176] Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
【Quick Read】: This paper tackles a difficulty in assessing the behavioral dispositions of large language models (LLMs): existing analyses based on psychometric inventories and cognitive paradigms cannot determine whether inter-model differences reflect stable, stimulus-specific individuality or global response biases and stochastic noise. The key of the solution is crossed random-effects models, widely used in psychology to separate systematic effects, which quantify each LLM's stable preferences over individual word stimuli. Modeling 74.9 million ratings from 10 open-weight models over more than 100,000 words shows that on average 16.9% of variance is attributable to stimulus-specific individuality, robustly exceeding a statistical null model; cross-norm prediction analyses further reveal a coherent, unique fingerprint per model, i.e., machine individuality. The crux of the approach is using a statistical model to isolate non-random, reproducible model-specific behavioral patterns, thereby establishing genuine individual differences among LLMs.
Link: https://arxiv.org/abs/2604.16755
Authors: Valentin Kriegmair,Dirk U. Wulff
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 18 pages, 1 figure. Supporting information included
Abstract:As large language models (LLMs) are increasingly integrated into daily life, in roles ranging from high-stakes decision support to companionship, understanding their behavioral dispositions becomes critical. A growing literature uses psychometric inventories and cognitive paradigms to profile LLM dispositions. However, these approaches cannot determine whether behavioral differences reflect stable, stimulus-specific individuality or global response biases and stochastic noise. Here, we apply crossed random-effects models – widely used in psychometrics to separate systematic effects – to 74.9 million ratings provided by 10 open-weight LLMs for over 100,000 words across 14 psycholinguistic norms. On average, 16.9% of variance is attributable to stimulus-specific individuality, robustly exceeding a statistical null model. Cross-norm prediction analyses reveal this individuality as a coherent fingerprint, unique to each model. These results identify individual differences among LLMs that cannot be attributed to response biases or stochastic noise. We term these differences machine individuality.
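The variance decomposition at the heart of the paper separates model and stimulus main effects from the model-by-stimulus interaction, which is what "individuality" means here. A real crossed random-effects fit is far richer; the toy two-way decomposition below only illustrates the idea, with made-up ratings.

```python
# Toy two-way decomposition: the model-by-stimulus interaction component is
# the "individuality" signal; shifted-but-parallel profiles are pure bias.
ratings = [
    [1.0, 2.0, 3.0],   # model A over three word stimuli
    [2.0, 3.0, 4.0],   # model B: same profile shifted -> global response bias
]
n_m, n_s = len(ratings), len(ratings[0])
grand = sum(map(sum, ratings)) / (n_m * n_s)
model_means = [sum(row) / n_s for row in ratings]
stim_means = [sum(ratings[m][s] for m in range(n_m)) / n_m for s in range(n_s)]
interaction = [
    ratings[m][s] - model_means[m] - stim_means[s] + grand
    for m in range(n_m) for s in range(n_s)
]
interaction_var = sum(x * x for x in interaction) / (n_m * n_s)
```

Because model B is just model A plus a constant, the interaction variance is exactly zero: a mean shift alone, however large, contributes nothing to the individuality component.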
[AI-177] Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLM s
【Quick Read】: This paper addresses the context pollution and "overthinking" that arise as large language models (LLMs) evolve into autonomous agents with complex tool ecosystems; traditional routing heuristics fail not from a deficit of algorithmic capability or skill diversity but from the absence of disciplined second-order metacognitive governance. The key of the solution is a computational translation of human cognitive control mechanisms, including delayed appraisal, epistemic vigilance, and region-of-proximal offloading, into a single-agent architecture. Concretely, the MESA-S framework turns scalar confidence estimates into a vector separating self-confidence (parametric certainty) from source-confidence (trust in retrieved external procedures), and introduces a delayed procedural probe mechanism with Metacognitive Skill Cards, decoupling awareness of a skill's utility from its costly execution. This mitigates supply-chain vulnerabilities, prunes redundant reasoning loops, and prevents offloading-induced confidence inflation.
Link: https://arxiv.org/abs/2604.16753
Authors: Eren Unlu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure
Abstract:As large language models (LLMs) transition into autonomous agents integrated with extensive tool ecosystems, traditional routing heuristics increasingly succumb to context pollution and “overthinking”. We argue that the bottleneck is not a deficit in algorithmic capability or skill diversity, but the absence of disciplined second-order metacognitive governance. In this paper, our scientific contribution focuses on the computational translation of human cognitive control - specifically, delayed appraisal, epistemic vigilance, and region-of-proximal offloading - into a single-agent architecture. We introduce MESA-S (Metacognitive Skills for Agents, Single-agent), a preliminary framework that shifts scalar confidence estimation into a vector separating self-confidence (parametric certainty) from source-confidence (trust in retrieved external procedures). By formalizing a delayed procedural probe mechanism and introducing Metacognitive Skill Cards, MESA-S decouples the awareness of a skill’s utility from its token-intensive execution. Evaluated under an In-Context Static Benchmark Evaluation natively executed via Gemini 3.1 Pro, our early results suggest that explicitly programming trust provenance and delayed escalation mitigates supply-chain vulnerabilities, prunes unnecessary reasoning loops, and prevents offloading-induced confidence inflation. This architecture offers a scientifically cautious, behaviorally anchored step toward reliable, epistemically vigilant single-agent orchestration.
[AI-178] Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents
【Quick Read】: This paper addresses a gap in current agent evaluation: it rewards execution on fully specified tasks (Complete) while neglecting the ability to diagnose whether a task is Clarifiable, Support-Blocked, or Unsupported-Now and to respond accordingly. The key of the solution is the Support-State Triage Audit (SSTA-32), a matched-item diagnostic framework in which minimal counterfactual edits flip the same base request across the four support states, evaluated quantitatively via Dual-Persona Auto-Auditing (DPAA). The central finding is that surfacing structured categorical decision paths (Action-Only and the Preflight Support Check, PSC) lets the model accurately identify and respond to different support states, avoiding the overcommitment of default execution while sustaining high-precision three-way deferral (91.7% typed deferral accuracy).
Link: https://arxiv.org/abs/2604.16752
Authors: Eren Unlu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Current agent evaluations largely reward execution on fully specified tasks, while recent work studies clarification [11, 22, 2], capability awareness [9, 1], abstention [8, 14], and search termination [20, 5] mostly in isolation. This leaves open whether agents can diagnose why a task is blocked before acting. We introduce the Support-State Triage Audit (SSTA-32), a matched-item diagnostic framework in which minimal counterfactual edits flip the same base request across four support states: Complete (ANSWER), Clarifiable (CLARIFY), Support-Blocked (REQUEST SUPPORT), and Unsupported-Now (ABSTAIN). We evaluate a frontier model under four prompting conditions - Direct, Action-Only, Confidence-Only, and a typed Preflight Support Check (PSC) - using Dual-Persona Auto-Auditing (DPAA) with deterministic heuristic scoring. Default execution overcommits heavily on non-complete tasks (41.7% overcommitment rate). Scalar confidence mapping avoids overcommitment but collapses the three-way deferral space (58.3% typed deferral accuracy). Conversely, both Action-Only and PSC achieve 91.7% typed deferral accuracy by surfacing the categorical ontology in the prompt. Targeted ablations confirm that removing the support-sufficiency dimension selectively degrades REQUEST SUPPORT accuracy, while removing the evidence-sufficiency dimension triggers systematic overcommitment on unsupported items. Because DPAA operates within a single context window, these results represent upper-bound capability estimates; nonetheless, the structural findings indicate that frontier models possess strong latent triage capabilities that require explicit categorical decision paths to activate safely.
[AI-179] When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
【Quick Read】: This paper addresses a stealthy failure mode of LLM-powered coding agents generating large, format-heavy documents: output stalling, where the agent silently returns empty responses. The key of the solution is a theoretical framework with three contributions: first, Output Generation Capacity (OGC), which quantifies an agent's effective output capacity given its current context state (distinct from the raw context window); second, a Format-Cost Separation Theorem proving that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier μ_f > 1, with bounds on the optimal savings; third, adaptive strategy selection, which maps the ratio of estimated output cost to available OGC onto the optimal generation strategy (direct, chunked, or deferred). Experiments show deferred rendering cuts LLM-generated tokens by 48-72% and eliminates output stalling entirely; the open-source GEN-PILOT tool further demonstrates that the theory translates directly into a practical system.
Link: https://arxiv.org/abs/2604.16736
Authors: Justice Owusu Agyemang,Michael Agyare,Miriam Kobbinah,Nathaniel Agbugblah,Prosper Addo
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:LLM-powered coding agents suffer from a poorly understood failure mode we term output stalling: the agent silently produces empty responses when attempting to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent’s effective ability to produce output given its current context state - distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier \mu_f > 1 , and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an optimal generation strategy (direct, chunked, or deferred). We validate the theory through controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study isolating each component’s contribution. Deferred rendering reduces LLM generation tokens by 48-72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, demonstrating that the theory translates directly into a practical tool.
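The adaptive strategy selection in contribution (3) can be sketched directly: the cost-to-capacity ratio picks the generation mode. The two thresholds below are illustrative assumptions, not values from the paper.

```python
# Sketch of adaptive strategy selection: the ratio of estimated output cost
# to available Output Generation Capacity (OGC) picks the generation mode.
# The threshold values are illustrative assumptions.

def select_strategy(estimated_tokens, ogc, direct_max=0.5, chunk_max=1.0):
    ratio = estimated_tokens / ogc
    if ratio <= direct_max:
        return "direct"      # comfortably within capacity
    if ratio <= chunk_max:
        return "chunked"     # split the output across several responses
    return "deferred"        # emit data only; render the template outside the LLM

s_small = select_strategy(200, 1000)
s_medium = select_strategy(900, 1000)
s_large = select_strategy(5000, 1000)
```

The "deferred" branch is where the Format-Cost Separation Theorem bites: for a format with overhead multiplier μ_f > 1, emitting only the data and rendering the template programmatically saves at least the (μ_f - 1) share of tokens.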
[AI-180] Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
【Quick Read】: This paper targets the hallucination and computational inefficiency that afflict LLM-based scientific ideation built on iterative prompting or complex multi-agent architectures, as well as the reward-hacking bottleneck of applying reinforcement learning (RL) to this open-ended domain. The key of the solution is an RL framework tailored to high-quality scientific idea generation: a multi-agent reward function serves as the judge, decoupling methodological validation from implementation details and providing strict binary rewards robust to reward hacking; an unbiased variant of Group Relative Policy Optimization mitigates length bias under the sparse signal; and training is grounded in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings. Experiments show significant gains over state-of-the-art baselines on expert-evaluated novelty, feasibility, and effectiveness.
Link: https://arxiv.org/abs/2604.16723
Authors: Moein Salimi,Babak Hosseini Mohtasham,Amin Aghakasiri,Mahdi Naieni,Amir Hossein Qeysarbeigi,Mohammad Masih Shalchian Nazer,Zahra Azar,Mahdi Jafari Siavoshani,Mohammad Hossein Rohban
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi-agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open-ended domain is reward hacking – where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation. To address these limitations, we propose an RL framework explicitly tailored for high-quality scientific idea generation. We propose the first multi-agent reward function designed to serve as a judge, decoupling methodological validation from implementation details while providing strict binary rewards that are robust to reward hacking. To effectively optimize against this sparse signal, we utilize an unbiased variant of Group Relative Policy Optimization to mitigate artificial length bias. We grounded our training in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings. Experiments demonstrate that our framework significantly outperforms state-of-the-art baselines across expert-evaluated metrics of novelty, feasibility, and effectiveness.
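The group-relative optimization with a binary reward reduces, at its core, to scoring each sample against its own group's mean. Skipping the usual per-group standard-deviation normalization is our reading of the "unbiased variant"; treat this sketch as an assumption, not the paper's exact objective.

```python
# Sketch of a group-relative advantage under a strict binary (pass/fail)
# reward: each sampled idea is compared to the mean of its own group.
# No per-group std normalization is applied here (our assumption about
# what the "unbiased variant" omits).

def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four sampled ideas, judged pass/fail by the multi-agent reward:
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With binary rewards, the advantage is simply the margin over the group's pass rate, so a group where everything passes (or everything fails) yields zero gradient signal.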
[AI-181] Late Fusion Neural Operators for Extrapolation Across Parameter Space in Partial Differential Equations
【Quick Read】: This paper addresses the unstable predictive performance of neural operator models across parameter regimes, especially under distribution shift between training and inference; the core challenge is the entanglement of state and parameter representations, which makes it hard to capture how parameter changes affect system behavior. The key of the solution is the Late Fusion Neural Operator architecture, which disentangles learning state dynamics from parameter effects: neural operators learn latent state representations, while sparse regression incorporates parameter information in a structured manner. Across multiple PDE benchmarks, the method clearly outperforms existing approaches, achieving average RMSE reductions of 72.9% in-domain and 71.8% out-of-domain, demonstrating strong generalization.
Link: https://arxiv.org/abs/2604.16721
Authors: Eva van Tegelen,Taniya Kapoor,George A.K. van Voorn,Peter van Heijster,Ioannis N. Athanasiadis
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
Comments:
Abstract:Developing neural operators that accurately predict the behavior of systems governed by partial differential equations (PDEs) across unseen parameter regimes is crucial for robust generalization in scientific and engineering applications. In practical applications, variations in physical parameters induce distribution shifts between training and prediction regimes, making extrapolation a central challenge. As a result, the way parameters are incorporated into neural operator models plays a key role in their ability to generalize, particularly when state and parameter representations are entangled. In this work, we introduce the Late Fusion Neural Operator, an architecture that disentangles learning state dynamics from parameter effects, improving predictive performance both within and beyond the training distribution. Our approach combines neural operators for learning latent state representations with sparse regression to incorporate parameter information in a structured manner. Across four benchmark PDEs including advection, Burgers, and both 1D and 2D reaction-diffusion equations, the proposed method consistently outperforms Fourier Neural Operator and CAPE-FNO. Late Fusion Neural Operators achieve consistently the best performance in all experiments, with an average RMSE reduction of 72.9% in-domain and 71.8% out-domain compared to the second-best method. These results demonstrate strong generalization across both in-domain and out-domain parameter regimes.
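The "late fusion" of state dynamics and parameter effects can be caricatured in a few lines: a state model advances the latent state, and a sparse regression adds the parameter contribution afterwards. All names and numbers below are toy assumptions, not the paper's architecture.

```python
# Late-fusion sketch: state dynamics and parameter effects are learned
# separately and only combined at the end. Everything here is a toy stand-in.

def state_model(u):
    # Stand-in for the neural operator acting on the latent state.
    return [0.9 * x for x in u]

def sparse_param_effect(theta, coeffs, threshold=0.05):
    # Sparse-regression flavor: coefficients below threshold prune to zero.
    active = [c if abs(c) >= threshold else 0.0 for c in coeffs]
    return sum(c * t for c, t in zip(active, theta))

def late_fusion_step(u, theta, coeffs):
    shift = sparse_param_effect(theta, coeffs)
    return [x + shift for x in state_model(u)]

out = late_fusion_step([1.0, 2.0], theta=[0.5, 1.0], coeffs=[0.2, 0.01])
```

Because the parameter pathway is a separate, sparse term, changing a parameter cannot silently reshape the state dynamics, which is the disentanglement the abstract credits for out-of-domain robustness.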
[AI-182] Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs
【Quick Read】: This paper addresses the scalability problems of training graph transformers on large graphs: existing implementations are typically limited to single-GPU systems, causing long training times or out-of-memory failures, and parallelizing full-graph training is difficult because efficiency depends heavily on graph structure and hardware characteristics such as bandwidth and memory capacity. The key of the solution is a distributed training framework that automatically selects and optimizes parallelization strategies based on graph structure and hardware configuration; with efficient distributed sparse operations it accelerates sparse graph attention by up to 3.8x and reduces memory consumption by 78%, achieving up to 6x speedup on systems with up to 8 GPUs and bringing graph transformers closer to serving as practical graph foundation models.
Link: https://arxiv.org/abs/2604.16715
Authors: Jun-Liang Lin,Kamesh Madduri,Mahmut Taylan Kandemir
Institutions: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the 63rd ACM/IEEE Design Automation Conference (DAC 2026)
Abstract:Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of the backbone model, graph transformers, are typically limited to single-GPU systems, leading to long training times or out-of-memory issues on large graphs. Moreover, parallelizing graph transformer training over the full graph is challenging, as efficiency depends heavily on both the graph structure and system characteristics, such as bandwidth and memory capacity. In this work, we introduce a distributed training framework for graph transformers, which automatically selects and optimizes parallelization strategies based on the graph structure and hardware configuration. With our implementation of distributed sparse operations, we accelerate sparse graph attention by up to 3.8x and reduce memory consumption by 78% compared to state-of-the-art frameworks. On large graph benchmarks, our proposed framework achieves up to 6x speedup with system scaling up to 8 GPUs. These results demonstrate that the proposed framework improves the scalability of graph transformers, bringing them closer to serving as practical graph foundation models.
[AI-183] RankGuide: Tensor-Rank-Guided Routing and Steering for Efficient Reasoning
【Quick Read】: This paper addresses the efficiency bottleneck of large reasoning models (LRMs), whose explicit multi-step reasoning incurs high inference latency and computational overhead, and in particular how to detect and mitigate small reasoning model (SRM) failures when SRMs and LRMs collaborate. The key of the solution is the RankGuide framework, whose core innovation is exploiting tensor-rank signals derived from consecutive hidden states at two levels: tensor-rank-guided routing, which detects scenarios where the SRM is likely to fail and selectively invokes the LRM, and tensor-rank-filtered steering-vector extraction, which modulates the SRM's reasoning trajectory to improve generation quality. Together these mechanisms cut the number of reasoning steps and overall latency (up to a 1.75x speedup) while maintaining accuracy competitive with existing methods.
Link: https://arxiv.org/abs/2604.16694
Authors: Jiayi Tian,Yupeng Su,Ryan Solgi,Souvik Kundu,Zheng Zhang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large reasoning models (LRMs) enhance problem-solving capabilities by generating explicit multi-step chains of thought (CoT) reasoning; however, they incur substantial inference latency and computational overhead. To mitigate this issue, recent works have explored model collaboration paradigms, where small reasoning models (SRMs) generate intermediate reasoning steps to achieve a better accuracy–latency trade-off. Despite recent progress, effectively and efficiently detecting and mitigating SRM failures in collaborative systems remains a key challenge. To address this issue, we analyze SRM inference in both the generated text and hidden-state spaces, and identify three types of failure modes: overconfidence, uncertainty, and heavy revalidation. Building on these insights, we propose RankGuide, a framework that improves the efficiency and effectiveness of SRM–LRM collaboration through tensor-rank-guided routing and steering. Specifically, RankGuide leverages a routing signal that incorporates tensor-rank signals derived from consecutive hidden states to detect when SRMs are likely to fail and selectively invoke LRMs. In addition, we introduce a tensor-rank-filtered steering vector extraction method to modulate the reasoning trajectory of SRMs, thereby improving their generation quality. By improving both routing and steering through tensor-rank signals, RankGuide enables SRM–LRM collaborative systems to achieve more efficient reasoning with fewer steps and improved accuracy. Experiments on multiple reasoning benchmarks demonstrate the efficacy of RankGuide in reducing latency by up to 1.75x compared to LRM, while maintaining competitive accuracy relative to prior methods.
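One way to picture a rank-based routing signal: stack consecutive hidden states and measure their numerical rank; a collapsed (low-rank) trajectory suggests the small model is looping or revalidating rather than making progress. The full-rank threshold and the toy 2-D states are our simplifications, not RankGuide's actual signal.

```python
# Sketch of a tensor-rank routing signal over consecutive hidden states.
# A low-rank trajectory (states on a line) triggers escalation to the LRM.
import numpy as np

def rank_signal(hidden_states, tol=1e-6):
    H = np.stack(hidden_states)                # (steps, dim)
    s = np.linalg.svd(H, compute_uv=False)
    return int((s > tol * s[0]).sum())         # numerical rank

def route(hidden_states, full_rank):
    return "LRM" if rank_signal(hidden_states) < full_rank else "SRM"

# States repeating along one direction (looping) vs. exploring new directions:
looping = [np.array([1.0, 0.0]), np.array([2.0, 0.0]), np.array([3.0, 0.0])]
progressing = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
decision_loop = route(looping, full_rank=2)
decision_prog = route(progressing, full_rank=2)
```

The appeal of such a signal is that it is cheap (an SVD over a handful of vectors) and model-internal, requiring no extra generation to decide when to escalate.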
[AI-184] The Query Channel: Information-Theoretic Limits of Masking-Based Explanations
【Quick Read】: This paper studies the theoretical limits of masking-based post-hoc explanation methods (such as KernelSHAP and LIME) that estimate local feature importance, asking how explanation reliability relates quantitatively to the number of queries. The core contribution is to model such methods as communication over a query channel: the explanation is the message and each masked evaluation is a channel use. Via hypothesis-class entropy and an identification capacity per query, the paper derives a strong converse showing that when the explanation rate exceeds capacity, no explainer and decoder can keep the error probability from converging to one, together with an achievability result showing that a sparse maximum-likelihood decoder recovers reliably below capacity. The key innovation is using this information-theoretic framework to establish a non-asymptotic query benchmark, revealing query budgets where standard convex surrogates (Lasso, OLS) still fall short of the theoretical optimum, and tracing the unattainability of high-resolution explanations to source-coding choices (super-pixel granularity, tokenization) and channel degradation from Gaussian noise and nonlinear curvature.
Link: https://arxiv.org/abs/2604.16689
Authors: Erciyes Karakaya,Ozgur Ercetin
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the probability of error in exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.
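The converse has a simple back-of-envelope form: with H = log2(#hypotheses) bits of explanation entropy and C bits of identification capacity per query, reliable exact recovery needs at least ceil(H / C) masked queries. The numbers below are illustrative, not figures from the paper.

```python
# Back-of-envelope query lower bound from the rate/capacity argument:
# below ceil(H / C) queries, the strong converse says exact recovery fails.
import math

def min_queries(num_hypotheses, capacity_bits_per_query):
    entropy_bits = math.log2(num_hypotheses)
    return math.ceil(entropy_bits / capacity_bits_per_query)

needed = min_queries(1024, 2.0)   # 10 bits of explanation at 2 bits/query
```

This is the benchmark role the paper's mutual-information estimator plays: the interesting regimes are the budgets just above this bound, where recovery is information-theoretically possible but Lasso/OLS-style surrogates still fail.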
[AI-185] Agentic Risk-Aware Set-Based Engineering Design
【速读】:该论文旨在解决工程设计早期阶段中存在的高维参数空间与不确定性问题,尤其是在空气动力学翼型设计中,如何高效地探索和筛选潜在设计方案。其解决方案的关键在于构建一个由大型语言模型(Large Language Models, LLMs)驱动的多智能体框架,通过人类在回路(human-in-the-loop)机制协调多个专业化智能体(包括编码助手、设计代理、系统工程代理和分析代理),结合基于集合的设计理念,实现从初始候选集的系统性探索到风险量化过滤的全流程自动化。其中,核心创新点是引入条件风险价值(Conditional Value-at-Risk, CVaR)作为定量指标来识别高失败概率的设计方案,并由分析代理执行全局敏感性分析以生成可操作的启发式规则,从而显著提升人类专家在最终决策中的效率与可靠性。
链接: https://arxiv.org/abs/2604.16687
作者: Varun Kumar,George Em Karniadakis
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper introduces a multi-agent framework guided by Large Language Models (LLMs) to assist in the early stages of engineering design, a phase often characterized by vast parameter spaces and inherent uncertainty. Operating under a human-in-the-loop paradigm and demonstrated on the canonical problem of aerodynamic airfoil design, the framework employs a team of specialized agents: a Coding Assistant, a Design Agent, a Systems Engineering Agent, and an Analyst Agent - all coordinated by a human Manager. Integrated within a set-based design philosophy, the process begins with a collaborative phase where the Manager and Coding Assistant develop a suite of validated tools, after which the agents execute a structured workflow to systematically explore and prune a large set of initial design candidates. A key contribution of this work is the explicit integration of formal risk management, employing the Conditional Value-at-Risk (CVaR) as a quantitative metric to filter designs that exhibit a high probability of failing to meet performance requirements, specifically the target coefficient of lift. The framework automates labor-intensive initial exploration through a global sensitivity analysis conducted by the Analyst agent, which generates actionable heuristics to guide the other agents. The process culminates by presenting the human Manager with a curated final set of promising design candidates, augmented with high-fidelity Computational Fluid Dynamics (CFD) simulations. This approach effectively leverages AI to handle high-volume analytical tasks, thereby enhancing the decision-making capability of the human expert in selecting the final, risk-assessed design.
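文中用作筛选指标的 CVaR 是一个可以直接落地的计算:给定某设计在不确定性采样下的"损失"样本(例如未达到目标升力系数的差额),CVaR_α 即最差 (1-α) 尾部样本的均值。以下为最小示意(损失定义与 budget 阈值均为笔者假设,非论文原实现):

```python
import numpy as np

def cvar(losses, alpha=0.9):
    """Conditional Value-at-Risk: mean of the worst (1 - alpha) fraction of losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    var = np.quantile(losses, alpha)          # Value-at-Risk threshold
    return float(losses[losses >= var].mean())

def filter_designs(designs, alpha=0.9, budget=1.0):
    """Keep designs whose loss CVaR (e.g. shortfall vs. target C_L) stays within budget."""
    return [name for name, samples in designs.items() if cvar(samples, alpha) <= budget]
```

在论文的流程中,这一过滤发生在候选集剪枝阶段,高 CVaR(高失败风险)的设计被剔除,剩余候选交给高保真 CFD 与人类 Manager。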
[AI-186] Graph Transformer-Based Pathway Embedding for Cancer Prognosis
【速读】:该论文旨在解决癌症进展预测中因分子组学数据在患者间高度异质性而导致的准确性难题。现有生物信息模型虽提升了可解释性,但其基因特征编码方式未能有效学习每个基因的共享基础表示,限制了通路嵌入的表达能力和生物学准确性。解决方案的关键在于提出PATH(Pathway-Adaptive Gene Embedding),这是一种基于调制机制、患者条件化的基因嵌入策略:首先为每个基因构建一个共享的基础嵌入以保持群体层面的稳定生物学身份,随后利用患者特异性的拷贝数变异(Copy Number Variation, CNV)和突变信号动态调整该嵌入,从而在保留基因固有生物学特性的同时捕捉个体分子差异。此方法显著提升了通路间相互作用建模的精度,在泛癌转移预测任务中F1得分达到0.8766,优于当前最优多组学基准模型8.8%。
链接: https://arxiv.org/abs/2604.16685
作者: Koushik Howlader,Md Tauhidul Islam,Wei Le
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 5 figures
Abstract:Accurate prediction of cancer progression remains a challenge due to the high heterogeneity of molecular omics data across patients. While biologically informed models have improved the interpretability of these predictions, a persistent limitation lies in how they encode individual genes to construct pathway representations. Existing hierarchical models typically derive gene features by directly mapping raw molecular inputs, whereas integration frameworks often rely on simple statistical aggregations of patient-level signals. These approaches often fail to explicitly learn a shared base representation for each gene, thereby limiting the expressiveness and biological accuracy of downstream pathway embeddings. To address this, we introduce PATH, a modulation-based, patient-conditioned gene embedding strategy. PATH represents a paradigm shift by starting from a shared base embedding for each gene, preserving a stable biological identity across the population, and then dynamically adapting it using patient-specific copy number variation (CNV) and mutation signals. This allows the model to capture subtle individual molecular variations while maintaining a consistent latent understanding of the gene itself. We integrate PATH into a graph transformer framework that models interactions among biologically connected pathways through pathway-guided attention. Across pancancer metastasis prediction, PATH achieves an F1 score of 0.8766, representing an 8.8 percent improvement over the current SOTA multi-omics benchmarks. Beyond superior predictive accuracy, our approach identifies biologically meaningful pathways and, crucially, reveals disease-state-specific pathway rewiring, offering new insights into the evolving pathway-pathway interactions that drive cancer progression.
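PATH 的核心思想——共享的基因基嵌入被患者特异的 CNV/突变信号动态调制——可以用 FiLM 式的 scale/shift 做一个最小示意。注意:具体调制形式、维度与权重均为笔者的示意假设,并非论文原实现:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, d = 4, 8

base = rng.normal(size=(n_genes, d))           # shared per-gene base embeddings
W_scale = rng.normal(scale=0.1, size=(2, d))   # maps (cnv, mutation) -> scale
W_shift = rng.normal(scale=0.1, size=(2, d))   # maps (cnv, mutation) -> shift

def patient_gene_embedding(cnv, mut):
    """FiLM-style sketch: modulate shared base embeddings with patient CNV/mutation signals."""
    signals = np.stack([cnv, mut], axis=1)     # (n_genes, 2) patient-specific inputs
    scale = 1.0 + signals @ W_scale            # identity modulation when signals are zero
    shift = signals @ W_shift
    return scale * base + shift
```

当患者信号为零时输出即共享基嵌入,对应论文所说的"保持群体层面稳定的生物学身份";非零信号则产生个体化偏移。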
[AI-187] KAIROS: Stateful Context-Aware Power-Efficient Agentic Inference Serving
【速读】:该论文针对生成式 AI(Generative AI)在推理阶段面临的功耗瓶颈问题展开研究,尤其聚焦于新兴的“代理型 AI”(agentic AI)工作负载。传统电源管理技术主要面向单轮大语言模型(LLM)服务,而 agentic AI 具有长期上下文演化和工具交互轮次的特点,导致现有方法在降低 GPU 频率时易引发内存压力激增,进而进入“抖动”(thrashing)状态,显著恶化性能与能效。解决方案的关键在于提出 KAIROS 系统,其以 agent 级别的上下文信息作为首要控制信号,协同调节 GPU 频率、实例级并发度及多实例请求调度策略,从而在内存资源充足时实现节能,同时避免抖动并保障性能目标。实验表明,KAIROS 在多种软件与数据工程代理任务中平均节省 27%(最高达 39.8%)功耗,且满足性能要求。
链接: https://arxiv.org/abs/2604.16682
作者: Yichao Yuan,Mosharaf Chowdhury,Nishil Talati
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Power has become a central bottleneck for AI inference. This problem is becoming more urgent as agentic AI emerges as a major workload class, yet prior power-management techniques focus almost entirely on single-turn LLM serving. Our analysis shows that agentic serving behaves fundamentally differently: each request carries long-lived context that evolves across tool-interleaved turns, and lowering GPU frequency can push the system into a thrashing regime where memory pressure sharply worsens both performance and power efficiency. These observations show that power optimization for agentic serving requires rethinking. We present KAIROS, a context-aware power optimization system for agentic AI serving. KAIROS uses agent context as a first-class control signal to jointly manage GPU frequency, per-instance concurrency, and multi-instance request placement. This enables KAIROS to save power when memory headroom exists while avoiding thrashing and preserving performance targets. At a high level, KAIROS tracks requests at agent granularity, adapts local control to context growth and agent progress, and routes agents across instances to jointly improve power efficiency and memory stability. Evaluated across diverse software and data engineering agentic tasks, KAIROS achieves an average of 27% (up to 39.8%) power reduction while meeting the performance targets.
[AI-188] ReconVLA: An Uncertainty-Guided and Failure-Aware Vision-Language-Action Framework for Robotic Control
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在实际机器人控制中缺乏校准的置信度估计问题,从而限制了其在真实场景中的可靠性。为应对这一挑战,作者提出ReconVLA,其核心创新在于将**共形预测(Conformal Prediction)**直接应用于预训练VLA策略的动作标记输出,生成与执行质量及任务成功率相关联的校准不确定性估计。此外,该方法进一步将共形预测扩展至机器人状态空间,实现对异常或不安全状态的早期检测,形成一种简单而有效的故障预警机制,从而在不重新训练或修改原生VLA模型的前提下,显著提升机器人系统的失败预见能力和安全性。
链接: https://arxiv.org/abs/2604.16677
作者: Lingling Chen,Zongyao Lyu,William J. Beksi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 17 pages, 9 figures, and 7 tables
Abstract:Vision-language-action (VLA) models have emerged as generalist robotic controllers capable of mapping visual observations and natural language instructions to continuous action sequences. However, VLAs provide no calibrated measure of confidence in their action predictions, thus limiting their reliability in real-world settings where uncertainty and failures must be anticipated. To address this problem we introduce ReconVLA, a reliable conformal model that produces uncertainty-guided and failure-aware control signals. Concretely, our approach applies conformal prediction directly to the action token outputs of pretrained VLA policies, yielding calibrated uncertainty estimates that correlate with execution quality and task success. Furthermore, we extend conformal prediction to the robot state space to detect outliers or unsafe states before failures occur, providing a simple yet effective failure detection mechanism that complements the action-level uncertainty. We evaluate ReconVLA in both simulation and real robot experiments across diverse manipulation tasks. Our results show that conformalized action predictions consistently improve failure anticipation, reduce catastrophic errors, and provide a calibrated measure of confidence without retraining or modifying the underlying VLA.
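Split conformal prediction 的校准步骤本身非常轻量:在校准集上取非一致性得分的第 ⌈(n+1)(1-α)⌉ 小值作为阈值,推理时保留得分不超过阈值的动作 token。以下是该通用流程的最小示意(非论文实现;以 1-p 作为非一致性得分是笔者的假设):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))    # finite-sample correction
    return float(scores[min(k, n) - 1])

def prediction_set(token_probs, qhat):
    """Calibrated action-token set: keep tokens whose nonconformity 1 - p is within qhat."""
    return [i for i, p in enumerate(token_probs) if 1.0 - p <= qhat]
```

预测集越大意味着模型越不确定,这正是论文用来预警执行失败的信号;状态空间上的共形化同理,只是得分换成状态的非一致性度量。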
[AI-189] From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies
【速读】:该论文旨在解决知识获取过程中基于主动学习(Active Learning)的本体建模效率与准确性问题,特别是如何通过引入大型语言模型(Large Language Models, LLMs)来减少人工标注成本并提升对潜在错误 axiom 的检测能力。其解决方案的关键在于将候选公理(axiom)转化为对应的反例概念(counter-concept),并通过受控自然语言(Controlled Natural Language, CNL)进行形式化表达,随后利用 LLM 提供贴近现实世界的实例以辅助判断;这一设计确保了在本体构建中仅可能出现 II 型错误(Type II errors),即漏检错误,这类错误不会导致不一致性,仅可能延迟建模进程,从而保障了系统的稳健性。
链接: https://arxiv.org/abs/2604.16672
作者: Haoruo Zhao,Wenshuo Tang,Duncan Guthrie,Michele Sevegnani,David Flynn,Paul Harvey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In active learning, membership queries (MQs) allow a learner to pose questions to a teacher, such as "Is every apple a fruit?", to which the teacher responds correctly with yes or no. These MQs can be viewed as subsumption tests with respect to the target ontology. Inspired by the standard reduction of subsumption to satisfiability in description logics, we reformulate each candidate axiom into its corresponding counter-concept and verbalise it in controlled natural language before presenting it to Large Language Models (LLMs). We introduce LLMs as a third component that provides real-world examples approximating an instance of the counter-concept. This design property ensures that only Type II errors may occur in ontology modelling; in the worst case, these errors merely delay the construction process without introducing inconsistencies. Experimental results on 13 commercial LLMs show that recall, corresponding to Type II errors in our framework, remains stable across several well-established ontologies.
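把候选公理 C ⊑ D 归约为反例概念 C ⊓ ¬D 的可满足性检查,在有限论域上可以用几行代码示意(真实系统使用描述逻辑推理器与受控自然语言,此处纯属玩具示例):

```python
def subsumes(domain, C, D):
    """C is subsumed by D iff the counter-concept (C AND NOT D) has no instance in the domain."""
    counterexamples = [x for x in domain if C(x) and not D(x)]
    return len(counterexamples) == 0, counterexamples

# Toy domain for the paper's running example "Is every apple a fruit?"
domain = ["apple", "pear", "banana", "carrot"]
is_apple = lambda x: x == "apple"
is_fruit = lambda x: x in {"apple", "pear", "banana"}
```

论文中 LLM 的角色正是为反例概念提供近似的真实世界实例,对应这里的 counterexamples 列表:找到实例则公理被拒绝,找不到则(最坏情况下以 II 型错误为代价)接受。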
[AI-190] Cross-Modal Bayesian Low-Rank Adaptation for Uncertainty-Aware Multimodal Learning
【速读】:该论文旨在解决低资源多模态场景下参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法中存在的不确定性建模不足与跨模态可靠性评估缺失的问题。现有PEFT方法通常为确定性且单模态,难以在音频-文本联合学习中有效捕捉预测置信度和模态间一致性。其解决方案的关键在于提出CALIBER框架,通过引入基于贝叶斯嵌入正则化的低秩推理机制,在适配器空间中以token级文本-音频交叉注意力为条件,动态调节一个低维随机潜在矩阵的均值与方差,从而将音频作为上下文可靠性信号来引导适应过程并量化异方差性不确定性。该设计在保持PEFT计算效率的同时,实现了轻量级、局部化的多模态不确定性估计。
链接: https://arxiv.org/abs/2604.16657
作者: Habibeh Naderi,Behrouz Haji Soleimani,Stan Matwin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large pre-trained language models are increasingly adapted to downstream tasks using parameter-efficient fine-tuning (PEFT), but existing PEFT methods are typically deterministic and unimodal, making them poorly suited for low-resource multimodal settings where predictive uncertainty and cross-modal reliability both matter. We introduce CALIBER (Context-Aware Low-rank Inference with Bayesian Embedding Regularization), a multimodal uncertainty-aware PEFT framework for audio-text learning. CALIBER extends Bayesian low-rank adaptation by conditioning the variational posterior in the adapter space on per-layer, token-level text-audio cross-attention. Specifically, text-derived low-rank features attend to frame-level audio embeddings to produce localized acoustic context, which then modulates the mean and variance of a compact stochastic latent matrix within the rank-r adapter space. This design treats audio not only as an additional feature source, but as a contextual reliability signal that shapes both adaptation and confidence. By confining stochasticity to a low-dimensional latent component, CALIBER retains the computational efficiency and scalability of PEFT while enabling heteroscedastic multimodal uncertainty estimation. Experimental results across diverse text and audio backbones show that CALIBER consistently matches or improves upon text-only Bayesian PEFT and conventional multimodal transfer-learning baselines, with token-level cross-attention yielding the most consistent gains. Our findings demonstrate that localized cross-modal conditioning is an effective and lightweight mechanism for uncertainty-aware multimodal adaptation.
[AI-191] Agentic Frameworks for Reasoning Tasks: An Empirical Study
【速读】:该论文旨在解决当前缺乏对主流智能体框架(agentic frameworks)在推理性能、效率及实际适用性方面进行系统性比较的问题。针对这一空白,作者从2023年1月至2025年7月间收集的1,200个GitHub仓库中筛选出22个广泛使用的框架,并基于其架构设计进行分类,统一评估其在BBH、GSM8K和ARC三个推理基准上的表现,指标包括推理准确率、执行时间、计算成本及跨基准一致性。研究发现,尽管多数框架能完成所有任务,但性能差异主要源于“编排质量”问题而非推理能力本身,如上下文失控(Camel)、重复失败触发高成本重试(Upsonic)、API配额耗尽(AutoGen与Mastra)等;尤其数学推理能力显著下降(GSM8K平均准确率仅44.35%)。因此,解决方案的关键在于:框架选型应优先考虑编排机制的优化,特别是记忆控制、容错处理和成本管理能力,以提升实际部署中的稳定性与经济性。
链接: https://arxiv.org/abs/2604.16646
作者: Zeeshan Rasheed,Abdul Malik Sami,Muhammad Waseem,Kai-Kristian Kemell,Mika Saari,Pekka Abrahamsson
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 43 Pages, 3 Figures, and 9 Tables
Abstract:Recent advances in agentic frameworks have enabled AI agents to perform complex reasoning and decision-making. However, evidence comparing their reasoning performance, efficiency, and practical suitability remains limited. To address this gap, we empirically evaluate 22 widely used agentic frameworks across three reasoning benchmarks: BBH, GSM8K, and ARC. The frameworks were selected from 1,200 GitHub repositories collected between January 2023 and July 2025 and organized into a taxonomy based on architectural design. We evaluated them under a unified setting, measuring reasoning accuracy, execution time, computational cost, and cross-benchmark consistency. Our results show that 19 of the 22 frameworks completed all three benchmarks. Among these, 12 showed stable performance, with mean accuracy of 74.6-75.9%, execution time of 4-6 seconds per task, and cost of 0.14-0.18 cents per task. Poorer results were mainly caused by orchestration problems rather than reasoning limits. For example, Camel failed to complete BBH after 11 days because of uncontrolled context growth, while Upsonic consumed USD 1,434 in one day because repeated extraction failures triggered costly retries. AutoGen and Mastra also exhausted API quotas through iterative interactions that increased prompt length without improving results. We also found a sharp drop in mathematical reasoning. Mean accuracy on GSM8K was 44.35%, compared with 89.80% on BBH and 89.56% on ARC. Overall, this study provides the first large-scale empirical comparison of agentic frameworks for reasoning-intensive software engineering tasks and shows that framework selection should prioritize orchestration quality, especially memory control, failure handling, and cost management. 
[AI-192] Beyond Feature Fusion: Contextual Bayesian PEFT for Multimodal Uncertainty Estimation
【速读】:该论文旨在解决现有参数高效微调(Parameter Efficient Fine-Tuning, PEFT)方法在多模态文本预测任务中对不确定性建模不足的问题,尤其是当音频上下文存在外部干扰因素(如背景噪声、声道变化或说话风格)时,传统方法难以准确反映由此引发的预测不确定性。解决方案的关键在于提出CoCo-LoRA,其核心创新是将低秩适配器空间中的上下文变分后验(contextual variational posterior)同时依赖于本地文本特征和音频衍生的上下文信号,通过一个共享的音频嵌入投影与轻量级层内头结构实现全局到局部、深度特定的不确定性调制,从而在保持PEFT可扩展性的前提下,生成对音频敏感且异方差(heteroscedastic)的不确定性估计。
链接: https://arxiv.org/abs/2604.16615
作者: Habibeh Naderi,Behrouz Haji Soleimani,Stan Matwin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce CoCo-LoRA, a multimodal, uncertainty-aware parameter-efficient fine-tuning method for text prediction tasks accompanied by audio context. Existing PEFT approaches such as LoRA are efficient but typically deterministic, while recent Bayesian low-rank adapters model uncertainty in a lightweight way yet remain largely unimodal and condition uncertainty primarily on internal text features. This leaves them poorly equipped to reflect uncertainty driven by external acoustic factors such as background noise, channel variability, or speaking style, which can materially affect reliability in speech-centered applications. CoCo-LoRA addresses this gap by conditioning a contextual variational posterior in the low-rank space on both local text-derived adapter features and an audio-derived context signal. A pooled audio embedding is projected once into a shared context space and then adapted through lightweight layer-wise heads, enabling global-to-local, depth-specific modulation of the adapter uncertainty and update without high-dimensional multimodal fusion. Stochasticity is confined to a compact latent component in the rank space, preserving PEFT scalability while producing audio-sensitive, heteroscedastic uncertainty. Based on our evaluations across diverse tasks and backbone combinations, CoCo-LoRA consistently matches or outperforms text-only PEFT and conventional feature-fusion transfer baselines, particularly on high-coverage labels where reliable adaptation is critical. The results indicate that using audio as a contextual uncertainty signal, rather than as a fused feature stream, provides a robust and parameter-efficient alternative for multimodal low-resource prediction.
[AI-193] Randomized Antipodal Search Done Right for Data Pareto Improvement of LLM Unlearning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署后因记忆不良知识而需进行“遗忘”(unlearning)的问题,尤其针对实际场景中无法预先获取明确的遗忘集(forget set)和保留集(retain set)这一挑战。传统机器遗忘方法依赖优化策略调整参数以平衡遗忘与保留性能,但其有效性高度依赖于数据集的可访问性,难以应对推理阶段触发的动态遗忘需求。为此,作者提出“数据帕累托改进”(data Pareto improvement)的概念,形式化了检索过程如何扩展遗忘-保留之间的最优权衡边界。解决方案的关键在于设计一种名为随机反向搜索线性化影响核(Randomized Antipodal Search on Linearized Influence Kernel, RASLIK)的检索算法,该算法融合排列投影哈希(permutation-projection hashing)与随机反向搜索机制,在降低选择方差的同时实现次线性计算复杂度,并在质量和效率上均获得显著提升,从而为以数据为中心的遗忘提供了可扩展且原理清晰的新范式。
链接: https://arxiv.org/abs/2604.16591
作者: Ziwen Liu,Huawei Lin,Yide Ran,Denghui Zhang,Jianwen Xie,Chuan Li,Weijie Zhao,Zhaozhuo Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Large language models (LLMs) sometimes memorize undesirable knowledge, which must be removed after deployment. Prior work on machine unlearning has focused largely on optimization methods that adjust parameters to enforce forgetting while preserving retention. However, these approaches assume that the forget and retain sets are readily available, which rarely holds in practice. Unlearning is typically triggered by an undesired generation at inference time, making the retrieval of relevant data the central challenge. We introduce the notion of data Pareto improvement for LLM unlearning, which formalizes how retrieval can expand the achievable trade-off frontier between forgetting and retention. To realize this principle, we propose Randomized Antipodal Search on Linearized Influence Kernel (RASLIK), a retrieval algorithm that combines permutation-projection hashing with randomized antipodal search. RASLIK reduces selection variance, achieves sublinear complexity, and yields a double gain in both quality and efficiency. Across multiple models, datasets, and unlearning algorithms, RASLIK consistently outperforms deterministic baselines and even oracle sampling, establishing randomized search as a principled and scalable solution for data-centric unlearning.
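RASLIK 的"排列-投影哈希 + 随机反极(antipodal)搜索"细节以摘要为限;下面用 SimHash 式随机超平面哈希近似示意其中一个关键性质:同时探查查询桶与其"反极"(符号全翻转)桶,可以一并召回与查询强正相关和强负相关的候选数据(实现方式为笔者假设,非论文原算法):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_planes = 16, 8
planes = rng.normal(size=(n_planes, d))     # shared random projections

def code(v):
    """SimHash-style bucket id: sign pattern of the random projections."""
    return tuple((planes @ v > 0).astype(int))

def antipodal_candidates(index, query):
    """Probe both the query's bucket and its antipodal (bit-flipped) bucket."""
    c = code(query)
    anti = tuple(1 - b for b in c)
    return index.get(c, []) + index.get(anti, [])
```

对任意向量 v,code(-v) 恰为 code(v) 的逐位翻转(投影值非零时),因此反极桶中正是与查询方向相反的数据——对应论文中影响力核上"正/负影响"两类检索目标。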
[AI-194] Global Attention with Linear Complexity for Exascale Generative Data Assimilation in Earth System Prediction
【速读】:该论文旨在解决地球系统预测中数据同化(Data Assimilation, DA)的可扩展性与准确性瓶颈问题,尤其在exascale计算环境下仍难以实现高效不确定度量化和极端事件预测。其解决方案的关键在于提出了一种统一的一阶段生成式数据同化框架(Generative DA Framework),将传统预报-更新循环重构为贝叶斯后验采样问题,并引入STORM——一种具有全局注意力线性复杂度缩放算法的时空变换器(Spatiotemporal Transformer),从而突破了传统注意力机制的二次复杂度限制。该方法在Frontier超算上实现了63%的强扩展效率和1.6 ExaFLOP持续性能,支持200亿时空标记的全球建模,覆盖17.7万时间帧,首次实现了公里级全球模拟,确立了地球系统预测的新范式。
链接: https://arxiv.org/abs/2604.16590
作者: Xiao Wang,Zezhong Zhang,Isaac Lyngaas,Hong-Jun Yoon,Jong-Youl Choi,Siming Liang,Janet Wang,Hristo G. Chipilski,Ashwin M. Aji,Feng Bao,Peter Jan van Leeuwen,Dan Lu,Guannan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate weather and climate prediction relies on data assimilation (DA), which estimates the Earth system state by integrating observations with models. While exascale computing has significantly advanced earth simulation, scalable and accurate inference of the Earth system state remains a fundamental bottleneck, limiting uncertainty quantification and prediction of extreme events. We introduce a unified one-stage generative DA framework that reformulates assimilation as Bayesian posterior sampling, replacing the conventional forecast-update cycle with compute-dense, GPU-efficient inference. At the core is STORM, a novel spatiotemporal transformer with a global attention linear-complexity scaling algorithm that breaks the quadratic attention barrier. On 32,768 GPUs of the Frontier supercomputer, our method achieves 63% strong scaling efficiency and 1.6 ExaFLOP sustained performance. We further scale to 20 billion spatiotemporal tokens, enabling km-scale global modeling over 177k temporal frames, regimes previously unreachable, establishing a new paradigm for Earth system prediction.
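摘要称 STORM 以线性复杂度实现全局注意力,但具体算法未在摘要中给出;这里用经典的核化线性注意力作通用示意:先一次性计算 φ(K)ᵀV 的全局汇总(尺寸与序列长度 N 无关),再乘以 φ(Q),整体代价 O(N) 而非 softmax 注意力的 O(N²)。特征映射的选择为笔者假设:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized global attention in O(N): compute phi(K)^T V once, then multiply by phi(Q)."""
    phi = lambda x: np.maximum(x, 0.0) + eps   # any positive feature map works
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                              # (d, d_v) global summary, independent of N
    Z = Qp @ Kp.sum(axis=0)                    # (N,) row normalizers
    return (Qp @ KV) / Z[:, None]
```

输出的每一行仍是 V 各行的凸组合(权重非负、和为 1),保持了注意力的聚合语义,同时避免了显式构造 N×N 的注意力矩阵。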
[AI-195] Hybrid Spectro-Temporal Fusion Framework for Structural Health Monitoring
【速读】:该论文旨在解决振动信号在结构健康监测(Structural Health Monitoring, SHM)中特征表示不足的问题,以提升故障分类的准确性与稳定性。其解决方案的关键在于提出了一种时频对齐(Spectro-Temporal Alignment)框架与混合时频融合(Hybrid Spectro-Temporal Fusion)框架,通过整合到达时间间隔描述符(arrival-time interval descriptors)与频谱特征(spectral features),同时捕捉细粒度和粗粒度的振动动态信息。实验表明,该方法显著优于传统输入形式,并且在不同模型架构下均表现出更高的精度与更低的波动性,尤其在细粒度时间分辨率(Δτ = 0.008)下能充分释放深度学习模型的性能潜力。
链接: https://arxiv.org/abs/2604.16589
作者: Jongyeop Kim,Jinki Kim,Doyun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Structural health monitoring plays a critical role in ensuring structural safety by analyzing vibration responses from engineering systems. This paper proposes a Spectro-Temporal Alignment framework and a Hybrid Spectro-Temporal Fusion framework that integrate arrival-time interval descriptors with spectral features to capture both fine-scale and coarse-scale vibration dynamics. Experiments conducted on data collected from an LDS V406 electrodynamic shaker demonstrate that the proposed spectro-temporal representations significantly outperform conventional input formulations. The results indicate that a coarser temporal resolution (\Delta\tau) of 0.02 favors traditional machine learning models, whereas a finer resolution (\Delta\tau) of 0.008 effectively unlocks the performance potential of deep learning architectures. Beyond classification accuracy, a comprehensive stability analysis based on condensed indices, including mean performance, standard deviation, coefficient of variation, and balanced score, shows that the proposed hybrid framework consistently achieves higher accuracy with substantially lower variability compared to baseline and alignment-only approaches. Overall, these results demonstrate that the proposed framework provides a robust, accurate, and reliable solution for vibration-based structural health monitoring.
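论文将"到达时间间隔描述符"与频谱特征拼接,但具体特征定义摘要未给出;以下用"阈值上穿间隔统计 + 粗粒度 FFT 频带均值"做一个可运行的示意(threshold、分带数等均为笔者假设):

```python
import numpy as np

def spectro_temporal_features(signal, fs, threshold):
    """Toy fusion feature: 4 coarse spectral band magnitudes + arrival-interval mean/std."""
    # Spectral branch: one-sided FFT magnitude averaged into 4 coarse bands
    mag = np.abs(np.fft.rfft(signal))          # assumes len(mag) divisible by 4
    spec = mag.reshape(4, -1).mean(axis=1)
    # Temporal branch: intervals between upward threshold crossings ("arrivals")
    above = signal > threshold
    arrivals = np.flatnonzero(~above[:-1] & above[1:]) / fs
    intervals = np.diff(arrivals) if len(arrivals) > 1 else np.array([0.0])
    return np.concatenate([spec, [intervals.mean(), intervals.std()]])
```

频谱分支对应粗尺度动态,到达间隔分支对应细尺度时序;二者拼接后即可送入传统分类器或深度模型。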
[AI-196] A Systematic Survey and Benchmark of Deep Learning for Molecular Property Prediction in the Foundation Model Era
【速读】:该论文旨在解决分子属性预测领域中因模型架构、数据标准与评估方法不统一而导致的可重复性差和跨学科应用受限的问题。其解决方案的关键在于提出一个统一的分类体系(unified taxonomy),将分子表示、模型架构与跨学科应用场景有机结合,并通过系统性基准测试分析(benchmark analyses)揭示当前数据整理、分割策略及评估协议中的核心挑战,如立体化学不一致、检测来源异质性和随机划分导致的可复现性不足。进一步地,论文推动基准设计向更透明、时间敏感和骨架感知的方法演进,并提出三大前瞻性方向:(i) 引入量子一致性约束的物理感知学习,(ii) 构建不确定性校准的基础模型以提升推断可信度,(iii) 建立融合计算与实验数据的多模态真实场景基准生态系统。
链接: https://arxiv.org/abs/2604.16586
作者: Zongru Li,Xingsheng Chen,Honggang Wen,Regina Qianru Zhang,Ming Li,Xiaojin Zhang,Hongzhi Yin,Qiang Yang,Kwok-Yan Lam,Pietro Lio,Siu-Ming Yiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 32 pages. It is just accepted by Journal of Chemical Theory and Computation 2026
Abstract:Molecular property prediction integrates quantum chemistry, cheminformatics, and deep learning to connect molecular structure with physicochemical and biological behavior. This survey traces four complementary paradigms, including Quantum, Descriptor Machine Learning, Geometric Deep Learning, and Foundation Models, and outlines a unified taxonomy linking molecular representations, model architectures, and interdisciplinary applications. Benchmark analyses integrate evidence from both widely used datasets and datasets reflecting industry perspectives, encompassing quantum, physicochemical, physiological, and biophysical domains. The survey examines current standards in data curation, splitting strategies, and evaluation protocols, highlighting challenges including inconsistent stereochemistry, heterogeneous assay sources, and reproducibility limitations under random or poorly defined splits. These observations motivate the modernization of benchmark design toward more transparent, time- and scaffold-aware methodologies. We further propose three forward-looking directions: (i) physics-aware learning embedding quantum consistency, (ii) uncertainty-calibrated foundation models for trustworthy inference, and (iii) realistic multimodal benchmark ecosystems integrating computational and experimental data. Repository: this https URL.
[AI-197] The Global Neural World Model: Spatially Grounded Discrete Topologies for Action-Conditioned Planning
【速读】:该论文旨在解决生成式模型在长期推理过程中出现的流形漂移(manifold drift)问题,以及如何从连续观测中自动构建具有结构化拓扑特性的世界表征。其解决方案的关键在于提出全局神经世界模型(Global Neural World Model, GNWM),该模型基于连续动作条件下的联合嵌入预测架构(Joint-Embedding Predictive Architecture, JEPA),通过引入平衡的连续熵约束实现拓扑量化,并将环境映射到离散二维网格上,从而强制平移等变性(translational equivariance)且无需像素级重建。该框架利用“网格吸附”(grid snapping)作为原生误差校正机制,在自回归推演中有效抑制流形漂移;同时,通过最大熵探索(随机游走)训练方式,使模型学习通用的转移动态而非记忆特定专家轨迹,从而具备因果发现能力,可将连续可预测的概念组织为结构化的拓扑地图。
链接: https://arxiv.org/abs/2604.16585
作者: Noureddine Kermiche
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures
Abstract:We present the Global Neural World Model (GNWM), a self-stabilizing framework that achieves topological quantization through balanced continuous entropy constraints. Operating as a continuous, action-conditioned Joint-Embedding Predictive Architecture (JEPA), the GNWM maps environments onto a discrete 2D grid, enforcing translational equivariance without pixel-level reconstruction. Our results show this architecture prevents manifold drift during autoregressive rollouts by using grid "snapping" as a native error-correction mechanism. Furthermore, by training via maximum entropy exploration (random walks), the model learns generalized transition dynamics rather than memorizing specific expert trajectories. We validate the GNWM across passive observation, active agent control, and abstract sequence regimes, demonstrating its capacity to act not just as a spatial physics simulator, but as a causal discovery model capable of organizing continuous, predictable concepts into structured topological maps.
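"网格吸附"作为推演中的原生纠错机制,思路非常直接:每一步把连续预测位置投影回最近的网格节点,使小于半个格距的噪声无法跨步累积。最小示意如下(网格尺寸与噪声设定均为笔者假设):

```python
import numpy as np

GRID = 1.0  # cell size of the discrete 2D topology

def snap(z):
    """Project a continuous latent position onto the nearest grid node."""
    return np.round(np.asarray(z, dtype=float) / GRID) * GRID

def rollout(z0, actions, noise):
    """Autoregressive rollout with per-step snapping as error correction."""
    z = snap(z0)
    for a, eps in zip(actions, noise):
        z = snap(z + np.asarray(a) + np.asarray(eps))
    return z
```

下例中每步存在 (0.2, -0.3) 的系统性漂移:无吸附时 5 步后累积为 (6.0, -1.5),有吸附时每步都被投影回整数格点,轨迹保持在 (5, 0)。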
[AI-198] Certified Program Synthesis with a Multi-Modal Verifier
【速读】:该论文旨在解决认证程序合成(certified program synthesis,即vericoding)中的两大挑战:一是从自然语言描述中生成的规范(specification)往往过于薄弱或过于严格,而现有方法缺乏系统性手段来识别此类缺陷;二是程序验证工具生态碎片化,不同工具支持自动激活(auto-active)或交互式(interactive)推理模式,导致合成方法需针对单一验证范式定制,限制了任务适用范围。解决方案的关键在于构建一个围绕多模态验证器(multi-modal verifier)的合成流程——该验证器整合动态验证、自动化证明与交互式证明脚本于统一框架内。作者在Lean基础上实现LeetProof,其核心优势在于:通过随机属性测试提前验证规范合理性,利用验证条件分解合成任务,并将剩余证明义务交由专用于Lean的前沿AI定理证明器处理,从而显著提升完全认证解的产出率。
链接: https://arxiv.org/abs/2604.16584
作者: Yueyang Feng,Dipesh Kafle,Vladimir Gladshtein,Vitaly Kurin,George Pîrlea,Qiyuan Zhao,Peter Müller,Ilya Sergey
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:Certified program synthesis (aka vericoding) is the process of automatically generating a program, its formal specification, and a machine-checkable proof of their alignment from a natural-language description. Two challenges make vericoding difficult. First, specifications synthesised from natural language are often either too weak to be meaningful or too strong to be implementable, yet existing approaches lack systematic means to detect such defects. Second, the landscape of program verifiers is fragmented: each tool supports a particular reasoning mode – auto-active (e.g., Dafny, Verus) or interactive (e.g., Coq, Lean) – with its own trade-off between automation and expressivity. This forces every synthesis methodology to be tailored to a single verification paradigm, limiting the class of tasks it can handle effectively. We overcome both challenges by structuring the certified synthesis workflow around a multi-modal verifier – a single tool combining dynamic validation, automated proofs, and interactive proof scripting in one foundational framework. We realise this idea in LeetProof, an agentic pipeline built on Velvet, a multi-modal verifier embedded in Lean. Multi-modality enables LeetProof to validate generated specifications via randomised property-based testing before any code is synthesised, decompose the synthesis task into sub-problems guided by verification conditions, and delegate residual proof obligations to frontier AI provers specialised for Lean. We evaluate LeetProof on benchmarks derived from prior work on certified synthesis. Our specification validation uncovers defects in existing reference benchmarks, and LeetProof’s staged pipeline achieves a significantly higher rate of fully certified solutions than a single-mode baseline at the same budget – consistently across two frontier LLM backends. 
[AI-199] POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在边缘部署时,由于GPU/DRAM内存有限导致LoRA适配器(Low-Rank Adaptation, LoRA adapters)无法全部驻留内存的问题。当请求指向非驻留适配器时,需从存储中分页加载权重,引入显著延迟,形成一个双时间尺度的在线控制问题:慢时间尺度上决定哪些适配器保留在高速缓存中,快时间尺度上根据上下文未知的效用动态路由请求。解决方案的关键在于将此联合缓存与路由问题建模为双时间尺度上下文Bandit(Contextual Bandit),提出POLAR(Paging and Online Learning for Adapter Routing)框架,其核心是结合一个缓存感知的LinUCB路由算法与基于周期的缓存控制器;其中,POLAR+版本通过强制探索和优化缓存策略,在满足随机正则性和缓存可实现性的条件下,实现了\widetilde\mathcal{O}(d\sqrt{NT}+\sqrt{KT})的次线性遗憾,证明了内存层次结构不会从根本上阻碍路由学习效率。
链接: https://arxiv.org/abs/2604.16583
作者: Shaoang Li,Jian Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages
Abstract:Edge deployment of large language models (LLMs) increasingly relies on libraries of lightweight LoRA adapters, yet GPU/DRAM can keep only a small resident subset at a time. Serving a request through a non-resident adapter requires paging its weights from storage, incurring measurable latency. This creates a two-timescale online control problem: on a slow timescale, the system selects which adapters remain resident in fast memory, while on a fast timescale it routes each request to an adapter whose context-dependent utility is unknown a priori. The two decisions are tightly coupled: the cache determines the cost of exploration, and the router determines which adapters receive informative feedback. We formulate this joint caching-and-routing problem as a two-timescale contextual bandit and propose POLAR (Paging and Online Learning for Adapter Routing). POLAR pairs a cache-aware LinUCB router with an epoch-based cache controller. We study two variants. A fixed-epoch version provides a robust baseline with worst-case regret guarantees under arbitrary contexts. An epoch-doubling version, POLAR+, adds forced exploration and improved cache optimization to achieve \widetilde{\mathcal{O}}(d\sqrt{NT}+\sqrt{KT}) sublinear regret under stochastic regularity and cacheability conditions, where N is the adapter count, K the cache size, d the context dimension, and T the horizon. The routing term matches the standard contextual-bandit rate up to logarithmic factors, showing that the memory hierarchy does not fundamentally slow routing learning. Experiments using 15 real LoRA adapters for Qwen2.5-7B together with measured GPU paging latencies show that adaptive cache control substantially outperforms non-adaptive baselines and exhibits scaling trends consistent with the theory.
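POLAR 的 cache-aware LinUCB 路由器实现细节以摘要为限;下面给出标准的 disjoint LinUCB(每个适配器一套岭回归模型 + UCB 探索项),并加上一个按适配器的分页成本项作最小示意(paging_cost 的接入方式是笔者假设,非论文原实现):

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (adapter), UCB exploration."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]      # d x d design matrices
        self.b = [np.zeros(d) for _ in range(n_arms)]

    def select(self, x, paging_cost=None):
        """Pick the arm maximizing UCB minus (optional) per-arm paging cost."""
        scores = []
        for a, (A, b) in enumerate(zip(self.A, self.b)):
            Ainv = np.linalg.inv(A)                      # fine for a sketch; cache in practice
            theta = Ainv @ b
            ucb = theta @ x + self.alpha * np.sqrt(x @ Ainv @ x)
            if paging_cost is not None:
                ucb -= paging_cost[a]                    # penalize non-resident adapters
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

分页成本项体现了论文中缓存与路由的耦合:非驻留适配器即便效用估计更高,也要先"付出"分页延迟才会被选中。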
[AI-200] NCO4CVRP: Neural Combinatorial Optimization for the Capacitated Vehicle Routing Problem
【速读】:该论文旨在解决神经组合优化(Neural Combinatorial Optimization, NCO)模型在求解车辆路径问题(Capacitated Vehicle Routing Problem, CVRP)时存在的解质量不高和泛化能力不足的问题。其关键解决方案在于改进推理策略:一方面,通过引入模拟退火(Simulated Annealing, SA)机制替代原有的随机重构(Random Re-Construct, RRC)方法,以概率性接受机制帮助模型跳出局部最优并拓展搜索空间;另一方面,结合束搜索(Beam Search)对多最优策略的策略优化(Policy Optimization with Multiple Optima, POMO)进行增强,从而系统性地探索多个高潜力解并保持多样性。实验表明,这些改进显著缩小了不同CVRP基准测试中的最优性差距,提升了模型的实际应用性能。
链接: https://arxiv.org/abs/2604.16581
作者: Mahir Labib Dihan,Md. Ashrafur Rahman Khan,Wasif Jalal,Md. Roqunuzzaman Sojib,Mashroor Hasan Bhuiyan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural Combinatorial Optimization (NCO) has emerged as a powerful framework for solving combinatorial optimization problems by integrating deep learning-based models. This work focuses on improving existing inference techniques to enhance solution quality and generalization. Specifically, we modify the Random Re-Construct (RRC) approach of the Light Encoder Heavy Decoder (LEHD) model by incorporating Simulated Annealing (SA). Unlike the conventional RRC, which greedily replaces suboptimal segments, our SA-based modification introduces a probabilistic acceptance mechanism that allows the model to escape local optima and explore a more diverse solution space. Additionally, we enhance the Policy Optimization with Multiple Optima (POMO) approach by integrating Beam Search, enabling systematic exploration of multiple promising solutions while maintaining diversity in the search space. We further investigate different inference strategies, including Softmax Sampling, Greedy, Gumbel-Softmax, and Epsilon-Greedy, analyzing their impact on solution quality. Furthermore, we explore instance augmentation techniques, such as horizontal and vertical flipping and rotation-based augmentations, to improve model generalization across different CVRP instances. Our extensive experiments demonstrate that these modifications significantly reduce the optimality gap across various Capacitated Vehicle Routing Problem (CVRP) benchmarks, with Beam Search and SA-based RRC consistently yielding superior performance. By refining inference techniques and leveraging enhanced search strategies, our work contributes to the broader applicability of NCO models in real-world combinatorial optimization tasks.
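其中"以模拟退火替代贪心重构"的核心只在于接受准则:改进必收,变差的重构以概率 exp(-Δ/T) 接受,从而跳出局部最优。下面是一个自包含的示意,`propose` 仅为占位函数,代表对路径片段重新解码得到的候选成本,并非论文的完整 RRC 流程。

```python
import math
import random

def sa_accept(curr_cost, new_cost, temperature, rng=random.random):
    """Metropolis acceptance replacing greedy segment replacement:
    always accept improvements; accept worse reconstructions with
    probability exp(-(new - curr) / T) to escape local optima."""
    if new_cost <= curr_cost:
        return True
    return rng() < math.exp(-(new_cost - curr_cost) / temperature)

def sa_reconstruct(tour_cost, propose, steps=200, t0=1.0, cooling=0.98):
    """Toy SA loop over re-construction proposals. `propose` maps the
    current cost to a candidate cost (standing in for re-decoding a
    tour segment with the neural model)."""
    cost, temp = tour_cost, t0
    for _ in range(steps):
        cand = propose(cost)
        if sa_accept(cost, cand, temp):
            cost = cand
        temp *= cooling          # geometric cooling schedule
    return cost
```

随着温度按几何方式衰减,接受变差解的概率逐步趋近于零,搜索后期退化为近似贪心。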
[AI-201] Continuous ageing trajectory representations for knee-aware lifetime prediction of lithium-ion batteries across heterogeneous dataset
【速读】:该论文旨在解决锂离子电池老化评估中的三大挑战:单体间差异性、循环协议异质性以及数据驱动模型在不同数据集间的迁移能力不足,尤其聚焦于退化拐点(knee point)的鲁棒识别与早期寿命阶段剩余使用寿命(RUL)的可靠预测问题。解决方案的关键在于提出了一种统一框架,通过从异构公共数据集(NASA、CALCE、ISU-ILCC)中学习电压-容量和容量-循环轨迹的连续表示,实现退化特征(如曲率、平台长度及拐点相关指标)的一致提取,并显著降低对特定数据集离散化方式的敏感性;该连续建模方法不仅增强了跨数据集的一致性,还在早期循环阶段(前5–20次循环)即展现出稳定的RUL预测性能,且具备不确定性感知能力,从而提供一种可解释性强、适用于多源异构数据的电池老化分析方案。
链接: https://arxiv.org/abs/2604.16580
作者: Agnieszka Pregowska,Stefan Marynowicz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate assessment of lithium-ion battery ageing is challenged by cell-to-cell variability, heterogeneous cycling protocols, and limited transferability of data-driven models across datasets. In particular, robust identification of degradation transitions, such as the knee point, and reliable early-life prediction of remaining useful life (RUL) remain open problems. This study proposes a unified framework for battery ageing analysis based on continuous representations of voltage-capacity and capacity-cycle trajectories learned from heterogeneous public datasets (NASA, CALCE, ISU-ILCC). The continuous formulation enables consistent extraction of degradation descriptors, including curvature, plateau length and knee-related metrics, while reducing sensitivity to dataset-specific discretisation. Across more than 250 cells, statistically significant correlations between knee onset and end-of-life (Pearson 0.75-0.84) are observed. Additional early-life analysis confirms that knee-related features retain predictive value when estimated from partial trajectories. Early-life models provide increasingly stable RUL predictions as the number of observed cycles increases, with meaningful predictive performance emerging within the first 5-20 cycles and remain robust under cross-dataset domain shift. The framework integrates continuous modelling, feature extraction and uncertainty-aware prediction, providing an interpretable and dataset-consistent approach demonstrating robustness across heterogeneous dataset types. Compared with conventional discrete or feature-based methods, the proposed representation reduces sensitivity to sampling resolution and improves cross-dataset consistency. The study is limited to laboratory-scale datasets and capacity-based end-of-life definitions.
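"拐点"的一个常见可计算判据是容量曲线离散曲率(二阶差分)最负的循环。下面用一条合成的双线性退化曲线示意该判据;相比论文学习到的连续轨迹表示,这只是一个离散化的简化版本。

```python
def knee_cycle(capacity):
    """Knee = cycle with the most negative discrete second difference
    (a curvature proxy) of the capacity-vs-cycle curve."""
    d2 = [capacity[i + 1] - 2 * capacity[i] + capacity[i - 1]
          for i in range(1, len(capacity) - 1)]
    # d2[i-1] corresponds to cycle i, hence the +1 offset.
    return 1 + min(range(len(d2)), key=lambda i: d2[i])

# Synthetic fade curve: slow linear fade, then a sharper fade after cycle 99.
caps = [1.0 - 0.001 * c if c < 100 else 0.901 - 0.01 * (c - 99)
        for c in range(150)]
```

对真实数据,通常需先平滑(论文中由连续表示承担这一角色),否则测量噪声会在二阶差分中被放大。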
[AI-202] Towards Trustworthy Depression Estimation via Disentangled Evidential Learning
【速读】:该论文旨在解决自动抑郁评估在真实场景中因信号污染和环境噪声导致的可靠性问题,以及现有确定性方法生成未经校准的点估计所引发的过度自信误诊风险。其核心解决方案是提出EviDep框架,通过引入正态逆伽马分布(Normal-Inverse-Gamma distribution)联合量化抑郁严重程度与认知不确定性(epistemic uncertainty)及随机不确定性(aleatoric uncertainty),实现可信赖的诊断输出。关键创新在于:一是采用频域感知特征提取模块(Frequency-aware Feature Extraction),利用小波基混合专家(wavelet-based Mixture-of-Experts)动态分离任务无关噪声,保持诊断信号保真度;二是设计解耦式证据学习策略(Disentangled Evidential Learning),在贝叶斯融合前显式分离模态共享共识与特定模态细节,从而系统性抑制跨模态冗余信息的累积,保障证据合成的信息完整性。
链接: https://arxiv.org/abs/2604.16579
作者: Fangyuan Liu,Sirui Zhao,Zeyu Zhang,Jinyang Huang,Feng-Qi Cui,Bin Luo,Tong Xu,Meng Li,Enhong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated depression estimation is highly vulnerable to signal corruption and ambient noise in real-world deployment. Prevailing deterministic methods produce uncalibrated point estimates, exposing safety-critical clinical systems to the severe risk of overconfident misdiagnoses. To establish a highly resilient and trustworthy assessment paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. A fundamental vulnerability in multimodal evidential fusion is the uncontrolled accumulation of cross-modal redundancies. This structural flaw artificially inflates diagnostic confidence by double-counting overlapping evidence. To guarantee robust evidence synthesis, EviDep enforces strict information integrity. First, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically isolate task-irrelevant noise, preserving the fidelity of diagnostic signals. Subsequently, a Disentangled Evidential Learning strategy separates the shared consensus from modality-specific nuances. By explicitly decoupling these representations before Bayesian fusion, EviDep systematically mitigates evidence redundancy. Extensive experiments on AVEC 2013, 2014, DAIC-WOZ, and E-DAIC confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, delivering a robust fail-safe mechanism for trustworthy clinical screening.
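正态逆伽马(NIG)分布对预测值与两类不确定性的分解是证据回归中的标准公式(Amini 等提出的形式):预测取 γ,随机不确定性为 β/(α-1),认知不确定性为 β/(ν(α-1))。以下仅示意这一通用分解,并非 EviDep 的完整流程。

```python
def nig_uncertainties(gamma, nu, alpha, beta):
    """Normal-Inverse-Gamma (gamma, nu, alpha, beta) moments:
    prediction E[mu] = gamma, aleatoric variance = beta / (alpha - 1),
    epistemic variance = beta / (nu * (alpha - 1))."""
    assert alpha > 1 and nu > 0 and beta > 0
    prediction = gamma
    aleatoric = beta / (alpha - 1)
    epistemic = beta / (nu * (alpha - 1))
    return prediction, aleatoric, epistemic
```

可见 ν(虚拟观测数)越大,认知不确定性越小,而随机不确定性不受其影响,这正是两类不确定性解耦的含义。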
[AI-203] Evaluating Temporal and Structural Anomaly Detection Paradigms for DDoS Traffic
【速读】:该论文旨在解决云原生5G网络中分布式拒绝服务(Distributed Denial-of-Service, DDoS)攻击检测任务中特征表示选择不明确的问题,即现有无监督异常检测方法通常假设固定的数据表示形式(时间序列或结构特征),而未验证哪种特征空间更适配实际流量数据。其解决方案的关键在于提出一种轻量级决策框架,在训练前通过两个诊断指标——聚合流信号的一阶自相关性(lag-1 autocorrelation)和主成分分析(Principal Component Analysis, PCA)的累计解释方差——优先判断应采用时间特征还是结构特征;若诊断结果不确定,则保留混合选项作为未来备选,而非直接采用经验性分支策略。实验表明,结构特征在两种统计独立的数据集上均表现稳定且优于或等同于时间特征,且随着时间依赖性的减弱,性能差距进一步扩大。
链接: https://arxiv.org/abs/2604.16575
作者: Yasmin Souza Lima,Rodrigo Moreira,Larissa F. Rodrigues Moreira,Tereza Cristina M. de B. Carvalho,Flávio de Oliveira Silva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Paper accepted for publication at Experimental Research Workshop on the Future Internet (2026) in conjunction with Brazilian Symposium on Computer Networks and Distributed Systems (2026)
Abstract:Unsupervised anomaly detection is widely used to detect Distributed Denial-of-Service (DDoS) attacks in cloud-native 5G networks, yet most studies assume a fixed traffic representation, either temporal or structural, without validating which feature space best matches the data. We propose a lightweight decision framework that prioritizes temporal or structural features before training, using two diagnostics: lag-1 autocorrelation of an aggregated flow signal and PCA cumulative explained variance. When the probes are inconclusive, the framework reserves a hybrid option as a future fallback rather than an empirically validated branch. Experiments on two statistically distinct datasets with Isolation Forest, One-Class SVM, and KMeans show that structural features consistently match or outperform temporal ones, with the performance gap widening as temporal dependence weakens.
[AI-204] FedOBP: Federated Optimal Brain Personalization through Cloud-Edge Element-wise Decoupling
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端数据异构性和资源受限移动设备导致的模型精度下降问题,特别是在个性化联邦学习(Personalized Federated Learning, PFL)场景下如何有效区分全局共享参数与局部个性化参数的问题。其解决方案的关键在于提出一种基于分位数阈值机制的联邦最优大脑个性化算法(Federated Optimal Brain Personalization, FedOBP),通过引入逐元素重要性评分(element-wise importance score)来量化每个参数对本地损失函数敏感度,该评分扩展了经典最优大脑损伤(Optimal Brain Damage, OBD)剪枝理论,并结合联邦近似的一阶导数项进行计算;同时将指标计算从客户端迁移至服务器端以降低移动端负担,从而在保证全局知识共享的同时实现高效、精准的局部适应,实验证明该方法仅需极少数量的个性化参数即可显著优于现有先进方法。
链接: https://arxiv.org/abs/2604.16574
作者: Xingyan Chen,Tian Du,Changqiao Xu,Fuzhen Zhuang,Lujie Zhong,Gabriel-Miro Muntean,Enmao Diao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) faces challenges from client data heterogeneity and resource-constrained mobile devices, which can degrade model accuracy. Personalized Federated Learning (PFL) addresses this issue by adapting shared global knowledge to local data distributions. A promising approach in PFL is model decoupling, which separates the model into global and personalized parameters, raising the key question of which parameters should be personalized to balance global knowledge sharing and local adaptation. In this paper, we propose a Federated Optimal Brain Personalization (FedOBP) algorithm with a quantile-based thresholding mechanism and introduce an element-wise importance score. This score extends Optimal Brain Damage (OBD) pruning theory by incorporating a federated approximation of the first-order derivative in the Taylor expansion to evaluate the importance of each parameter for personalization. Moreover, we move the metric computation originally performed on clients to the server side, to alleviate the burden on resource-constrained mobile devices. To the best of our knowledge, this is the first work to bridge classical saliency-based pruning theory with federated parameter decoupling, providing a rigorous theoretical justification for selecting personalized parameters based on their sensitivity to local loss landscapes. Extensive experiments demonstrate that FedOBP outperforms state-of-the-art methods across diverse datasets and heterogeneity scenarios, while requiring personalization of only a very small number of personalized parameters.
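其"逐元素重要性评分+分位数阈值"的选择逻辑可示意如下:评分取一阶泰勒显著性 |w·g|(OBD 风格),得分落入最高 (1-q) 分位的参数保留为本地个性化,其余作为全局共享。评分的具体形式与 FedOBP 可能不同,分位数 q 亦为假设值。

```python
def personalization_mask(weights, grads, q=0.9):
    """OBD-style first-order saliency |w * g| per parameter; keep the
    top (1 - q) quantile personal, share the rest globally. The exact
    score in FedOBP may differ; this illustrates quantile thresholding."""
    scores = [abs(w * g) for w, g in zip(weights, grads)]
    k = max(1, round(len(scores) * (1 - q)))   # how many to personalize
    thresh = sorted(scores, reverse=True)[k - 1]
    return [s >= thresh for s in scores]
```

按论文设定,这一指标在服务器端计算,客户端只需上传梯度摘要,从而减轻移动端负担。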
[AI-205] In Search of Lost DNA Sequence Pretraining
【速读】:该论文旨在解决当前DNA预训练方法中存在的三个关键问题:下游评估数据集选择不当、邻域掩码(neighbor-masking)策略存在固有缺陷,以及词汇表(vocabulary)设计缺乏深入讨论。其解决方案的核心在于提出一套系统性的指导原则,包括评估数据集的选择标准、任务设计的指导方针以及对词汇表的深入分析,并构建一个标准化测试平台,以实现DNA预训练方法的可复现性和严谨基准测试,从而推动基因组基础模型的发展。
链接: https://arxiv.org/abs/2604.16570
作者: Zhijiang Tang,Jiaxin Qi,Yan Cui,Jinli Ou,Yuhua Zheng,Jianqiang Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large-scale DNA sequence pretraining, existing studies have overwhelmingly emphasized pretraining scale and custom downstream evaluation datasets, while neglecting some essential components of the pretraining paradigm. In this paper, we reveal three critical yet heretofore overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and the lack of detailed discussion on vocabulary. Therefore, we undertake comprehensive investigations and propose principled guidelines, including selection criteria for evaluation datasets, guiding task design, and in-depth vocabulary analysis. Extensive experiments validate the significance of our identified problems and support the rationale behind our recommendations. Finally, we introduce a standardized testbed that enables reproducible and rigorous benchmarking of DNA pretraining methods to advance the development of genomic foundation models.
[AI-206] Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models
【速读】:该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)在全局规划中虽具结构优势,但如何高效验证其通过有效推理路径得出正确答案的挑战。解决方案的关键在于提出一种几何视角——“流形上的推理”(Reasoning on the Manifold),并引入无需训练、无监督的双向流形一致性(Bidirectional Manifold Consistency, BMC)指标:该指标通过前向掩码与后向重构的循环机制,量化生成序列在高密度流形上的稳定性,从而将有效推理路径识别为流形上的稳定吸引子,而无效路径则表现为离群漂移。这一方法在诊断、推理和对齐三个阶段均展现出强大实用性,实现了从粗粒度结果监督到细粒度几何奖励的转化,显著提升了dLLMs的推理可靠性与自进化能力。
链接: https://arxiv.org/abs/2604.16565
作者: Jiaoyang Ruan,Xin Gao,Yinda Chen,Hengyu Zeng,Liang Du,Guanghao Li,Jie Fu,Jian Pu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 5 figures
Abstract:While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC’s versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground truth answer; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.
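BMC 的"前向掩码、后向重构、在掩码位置上计分"的循环可用如下骨架说明。`reconstruct` 是一个任意可调用对象,代表 dLLM 的去噪器;掩码比例、轮数与记号均为示意假设,并非论文实现。

```python
import random

def bmc_score(tokens, reconstruct, mask_frac=0.3, rounds=8, seed=0):
    """Training-free consistency probe: mask a fraction of positions,
    let `reconstruct(masked) -> tokens` (a stand-in for the dLLM
    denoiser) fill them back, and score agreement on the masked
    positions only, averaged over random masks."""
    rng = random.Random(seed)
    agree = total = 0
    for _ in range(rounds):
        k = max(1, int(len(tokens) * mask_frac))
        ids = set(rng.sample(range(len(tokens)), k))
        masked = ["<mask>" if i in ids else t for i, t in enumerate(tokens)]
        recon = reconstruct(masked)
        agree += sum(recon[i] == tokens[i] for i in ids)
        total += k
    return agree / total
```

直观上,处于高密度流形上的稳定序列应被高比例复原(得分接近1),离群漂移的序列则得分低。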
[AI-207] SpecPylot: Python Specification Generation using Large Language Models
【速读】:该论文旨在解决自动为Python程序生成可执行形式规范(executable specifications)以提升程序正确性验证效率的问题,尤其针对开发者因手动编写契约(contract)困难而回避使用自动化验证工具的现状。其关键解决方案是提出SpecPylot工具,该工具利用大语言模型(LLM)生成候选契约(作为icontract注解),再通过Crosshair的符号执行进行验证;若发现反例,则仅迭代更新生成的契约而不修改原代码,并支持生成覆盖驱动的pytest桩文件及保留调试用的执行中间产物,从而在不改变程序逻辑的前提下实现高效、可迭代的规范合成与验证。
链接: https://arxiv.org/abs/2604.16560
作者: Ragib Shahariar Ayon,Shibbir Ahmed
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Accepted in 34th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE Companion 26)
Abstract:Automatically generating formal specifications could reduce the effort needed to improve program correctness, but in practice, this is still challenging. Many developers avoid writing contracts by hand, which limits the use of automated verification tools. Recent large language models (LLMs) can generate specifications from code, but these specifications often fail in terms of verification. The reason is syntax errors, overly strict constraints, or mismatches with program behavior. We present SpecPylot, a Python tool that synthesizes executable specifications for Python programs as icontract annotations and checks them using crosshair’s symbolic execution. The tool relies on LLMs to propose candidate contracts and uses crosshair to validate them. When crosshair finds a concrete counterexample, SpecPylot updates only the generated contracts and leaves the program itself untouched. In addition, the tool can produce coverage-driven pytest stubs and keep detailed execution artifacts that are useful during debugging. Overall, the evaluation indicates that SpecPylot is able to generate crosshair-compatible contracts for most programs, but it also highlights the practical limits introduced by bounded symbolic exploration and differences in LLM behavior.
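SpecPylot 生成的契约形如 icontract 的 @require/@ensure 注解。为保持示例自包含,这里用两个极简的同形装饰器代替真实的 icontract 库;`mean_abs` 函数及其前后置条件均为假设示例,仅示意工具输出的契约形态。

```python
import functools

def require(pred):
    """Minimal stand-in for icontract.require: check a precondition
    on the arguments before calling the function."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kw):
            assert pred(*args, **kw), f"precondition failed for {fn.__name__}"
            return fn(*args, **kw)
        return wrapper
    return deco

def ensure(pred):
    """Minimal stand-in for icontract.ensure: check a predicate
    on the returned result."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kw):
            result = fn(*args, **kw)
            assert pred(result), f"postcondition failed for {fn.__name__}"
            return result
        return wrapper
    return deco

# The kind of contract SpecPylot would emit for a simple function:
@require(lambda xs: len(xs) > 0)
@ensure(lambda result: result >= 0)
def mean_abs(xs):
    return sum(abs(x) for x in xs) / len(xs)
```

当 Crosshair 对这类契约找到反例时,SpecPylot 只迭代装饰器中的谓词,而不改动函数体本身。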
[AI-208] An Interpretable Framework Applying Protein Words to Predict Protein-Small Molecule Complementary Pairing Rules
【速读】:该论文旨在解决深度学习模型在药物发现中“黑箱”特性导致的可解释性不足问题,尤其是在预测蛋白质-小分子结合亲和力时缺乏对关键相互作用机制的理解。解决方案的关键在于提出PWRules框架,该框架通过将结合亲和力数据用于识别优势小分子片段(privileged small molecule fragments),并借助可解释性模块建立这些片段与蛋白质词(protein words,即语义序列单元)之间的互补配对规则;随后利用PWScore函数对规则进行排序以优先筛选活性化合物。该方法不仅在基准数据集上达到与基于物理的模型(Glide)和深度学习模型(PSICHIC)相当的性能,还展现出对训练集外靶点(如SARS-CoV-2主蛋白酶)的良好泛化能力,并能通过结构分析验证所学规则显著富集于配体结合口袋附近,从而提供了一种兼具高性能与高可解释性的药物设计新范式。
链接: https://arxiv.org/abs/2604.16550
作者: Jingke Chen,Jingrui Zhong,Tazneen Hossain Tani,Zidong Su,Xiaochun Zhang,Boxue Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the high accuracy of ‘black box’ deep learning models, drug discovery still relies on protein-ligand interaction principles and heuristics. To improve interpretability of protein-small molecule binding predictions, we developed the PWRules framework, which applies binding affinity data to identify privileged small molecule fragments and subsequently defines complementary pairing rules between these fragments and protein words (semantic sequence units) through an interpretability module. The resulting word-fragment rules are then ranked by the PWScore function to prioritize active compounds. Evaluations on benchmark datasets show that PWScore achieves competitive performance comparable to the physics-based model (Glide) and the deep learning model (PSICHIC) and shows broad applicability for protein targets outside the training dataset, e.g., SARS-CoV-2 main protease. Notably, PWScore captures complementary interaction information, yielding superior enrichment performance when integrated with these established methods. Structural analysis of protein-ligand complexes indicates that learned word-fragment rules are significantly enriched near ligand-binding pockets, despite training without explicit structural guidance. By extracting and applying complementary pairing rules, PWRules provides an interpretable framework for drug discovery.
[AI-209] Understanding Tool-Augmented Agents for Lean Formalization: A Factorial Analysis
【速读】:该论文旨在解决自然语言数学表述自动翻译为忠实Lean 4代码时所面临的挑战,即非形式化的集合论直觉与严格形式类型论之间的根本性差异,这种差异常导致大语言模型(LLM)幻觉出不存在的库定义,从而生成无法编译或语义不一致的代码。解决方案的关键在于引入工具增强型智能体(tool-augmented agents),通过系统性的因子分析方法,整合三类互补工具:微调模型查询(Fine-tuned Model Querying,访问专家草稿)、知识检索(Knowledge Search,获取符号定义)和编译器反馈(Compiler Feedback,利用Lean REPL验证代码),从而显著提升代码的编译成功率和语义等价性。
链接: https://arxiv.org/abs/2604.16538
作者: Ke Zhang,Patricio Gallardo,Maziar Raissi,Sudhir Murthy
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: 15 pages, 8 figures
Abstract:Automatic translation of natural language mathematics into faithful Lean 4 code is hindered by the fundamental dissonance between informal set-theoretic intuition and strict formal type theory. This gap often causes LLMs to hallucinate non-existent library definitions, resulting in code that fails to compile or lacks semantic fidelity. In this work, we investigate the effectiveness of tool-augmented agents for this task through a systematic factorial analysis of three distinct tool categories: Fine-tuned Model Querying (accessing expert drafts), Knowledge Search (retrieving symbol definitions), and Compiler Feedback (verifying code via a Lean REPL). We first benchmark the agent against one-shot baselines, demonstrating large gains in both compilation success and semantic equivalence. We then use the factorial decomposition to quantify the impact of each category, isolating the marginal contribution of each tool type to overall performance.
[AI-210] Towards Reliable Testing of Machine Unlearning
【速读】:该论文旨在解决机器学习模型在数据删除需求下(如合规性要求)的“遗忘测试”问题,即如何在现实部署约束和不完美评估条件下验证模型是否已彻底移除对特定敏感信息的依赖。其核心挑战在于传统方法难以检测到通过代理路径(proxy pathways)、中介影响(mediated influence)或子群掩蔽(subgroup masking)等复杂机制残留的模型依赖。解决方案的关键是提出一种因果导向的“路径中心型”测试范式——因果模糊测试(causal fuzzing),通过预算受限的干预实验来量化残余直接与间接效应,并生成可调试的“泄漏报告”,从而实现对遗忘效果的全面覆盖、定位和高效评估,特别适用于API部署的黑盒模型场景。
链接: https://arxiv.org/abs/2604.16536
作者: Anna Mazhar,Sainyam Galhotra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine learning components are now central to AI-infused software systems, from recommendations and code assistants to clinical decision support. As regulations and governance frameworks increasingly require deleting sensitive data from deployed models, machine unlearning is emerging as a practical alternative to full retraining. However, unlearning introduces a software quality-assurance challenge: under realistic deployment constraints and imperfect oracles, how can we test that a model no longer relies on targeted information? This paper frames unlearning testing as a first-class software engineering problem. We argue that practical unlearning tests must provide (i) thorough coverage over proxy and mediated influence pathways, (ii) debuggable diagnostics that localize where leakage persists, (iii) cost-effective regression-style execution under query budgets, and (iv) black-box applicability for API-deployed models. We outline a causal, pathway-centric perspective, causal fuzzing, that generates budgeted interventions to estimate residual direct and indirect effects and produce actionable “leakage reports”. Proof-of-concept results illustrate that standard attribution checks can miss residual influence due to proxy pathways, cancellation effects, and subgroup masking, motivating causal testing as a promising direction for unlearning testing.
[AI-211] SCATR: Simple Calibrated Test-Time Ranking
【速读】:该论文旨在解决生成式 AI(Generative AI)在测试时扩展(Test-time Scaling, TTS)过程中,如何在不显著增加计算成本的前提下提升大语言模型(Large Language Models, LLMs)推理性能的问题。现有方法如基于过程奖励模型(Process Reward Models, PRMs)虽有效但训练与运行开销高,而轻量级置信度启发式方法(confidence heuristics)虽然高效却性能不足。其解决方案的关键在于提出 SCATR —— 一种利用小规模校准集从基础模型的隐藏表示中学习轻量级评分函数的 BoN 排序方法,从而在保持极低参数量(相比 LoRA 微调减少高达 8000 倍)和计算复杂度的同时,实现接近强 PRM 基线的准确率,并显著降低训练与推理延迟(分别最多降低 150x 和 1000x),在多个编码与数学推理基准上相较原有置信度基线提升达 9%,并优于部分 PRM 方法。
链接: https://arxiv.org/abs/2604.16535
作者: Divya Shyamal,Marta Knežević,Lan Tran,Chanakya Ekbote,Vijay Lingam,Paul Pu Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Test-time scaling (TTS) improves large language models (LLMs) by allocating additional compute at inference time. In practice, TTS is often achieved through parallel scaling: generating multiple candidate responses and selecting the best via a Best-of-N (BoN) strategy. Its effectiveness therefore hinges on the scoring function. Learned scorers such as process reward models (PRMs) can be strong, but they are expensive to train and run. Lightweight confidence heuristics based on token log-probabilities are much cheaper, yet we find that they often perform substantially worse. To improve on lightweight confidence heuristics without incurring the full cost of stronger learned scorers, we introduce SCATR, a simple and efficient BoN ranking method that learns a lightweight scorer from a small calibration set using hidden representations from the base model. Across coding and mathematical reasoning benchmarks, SCATR improves over prior confidence-based baselines by up to 9%. Relative to LoRA fine-tuning on the same calibration data, it achieves comparable accuracy with up to 8000x fewer trainable parameters and much lower compute, reducing training and inference latency by up to 150x and 1000x, respectively. SCATR is also competitive with strong PRM baselines, and in several settings improves accuracy by up to 7.8% on math and 4.2% on coding while enabling up to 1000x faster inference. Overall, SCATR offers a strong accuracy-efficiency trade-off for scalable test-time selection.
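其核心思路(用小规模校准集在隐藏表示上学一个轻量打分器,再做 Best-of-N 选择)可示意如下。此处用纯 Python 逻辑回归代替论文的打分器,特征维度、学习率等训练细节均为假设。

```python
import math

def train_linear_scorer(feats, labels, lr=0.5, epochs=200):
    """Tiny logistic-regression scorer over (hidden-state) feature
    vectors, standing in for SCATR's lightweight learned scorer
    fit on a small calibration set."""
    d = len(feats[0])
    w = [0.0] * d
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

def best_of_n(candidates, w):
    """Best-of-N selection: return the index of the candidate whose
    feature vector scores highest under the learned weights."""
    return max(range(len(candidates)),
               key=lambda i: sum(wi * xi for wi, xi in zip(w, candidates[i])))
```

与 PRM 不同,这个打分器只需一次线性投影即可对候选排序,这正是 SCATR 推理开销极低的原因所在。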
[AI-212] G-PARC: Graph-Physics Aware Recurrent Convolutional Neural Networks for Spatiotemporal Dynamics on Unstructured Meshes
【速读】:该论文旨在解决现有物理感知深度学习(Physics-aware Deep Learning, PADL)方法在处理非线性时空动力学时的两大局限:一是基于像素的卷积网络受限于静态均匀笛卡尔网格,难以高效追踪演化中的局部结构;二是现有图神经网络(Graph Neural Networks, GNNs)方法在极端非线性 regime 下表现不佳。解决方案的关键在于提出 Graph PARC(G-PARC),其通过移动最小二乘(Moving Least Squares, MLS)核在非结构化图上近似空间导数,并将控制偏微分方程(PDE)的导数项显式嵌入网络计算图中,从而结合了 GNN 对不规则空间离散化的天然适应性与物理信息建模的准确性。该方法以更少参数(2–3 倍)实现更高精度,且无需传统编码器-处理器-解码器框架,显著提升了在复杂几何域和极端非线性场景下的泛化能力与模拟精度。
链接: https://arxiv.org/abs/2604.16533
作者: Jack T. Beerman,Tyler J. Abele,Mehdi Taghizadeh,Andrew Davis,Zoë J. Gray,Negin Alemazkoor,Xinfeng Gao,H.S. Udaykumar,Stephen S. Baek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Physics-aware recurrent convolutional networks (PARC) have demonstrated strong performance in predicting nonlinear spatiotemporal dynamics by embedding differential operators directly into the computational graph of a neural network. However, pixel-based convolutions are restricted to static, uniform Cartesian grids, making them ill-suited to following evolving localized structures in an efficient manner. Graph neural networks (GNNs) naturally handle irregular spatial discretizations, but existing graph-based physics-aware deep learning (PADL) methods have difficulty handling extreme nonlinear regimes. To address these limitations, we propose Graph PARC (G-PARC), which uses moving least squares (MLS) kernels to approximate spatial derivatives on unstructured graphs, and embeds the derivatives of governing partial differential equations into the network’s computational graph. G-PARC achieves better accuracy with 2-3x fewer parameters than MeshGraphNet, MeshGraphKAN, and GraphSAGE, replacing the traditional encoder-processor-decoder framework with analytically computed differential operators. We demonstrate that G-PARC (1) generalizes across nonuniform spatial and temporal discretizations; (2) handles moving meshes required for structural deformation; and (3) outperforms existing graph-based PADL methods on nonlinear benchmarks including fluvial hydrology, planar shock waves, and elastoplastic dynamics. By embedding explicit physical operators within the flexibility of GNNs, G-PARC enables accurate modeling of extreme nonlinear phenomena on complex computational domains, moving PADL beyond idealized Cartesian grids.
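MLS 近似空间导数的基本做法是在散乱邻居节点上做加权线性最小二乘拟合,再读出线性项系数作为梯度。下面给出二维示意:高斯权重与一阶多项式基为示意选择;对线性场,该估计可精确恢复梯度,与权重无关。

```python
import math

def mls_gradient(p, neighbors, values, value_p):
    """Weighted least-squares fit of a linear model around point p to
    estimate (df/dx, df/dy) on scattered nodes, the kind of MLS
    derivative stencil G-PARC builds on. Gaussian weights and the
    linear basis are illustrative choices."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    rhs = [0.0, 0.0]
    for (x, y), f in zip(neighbors, values):
        dx, dy, df = x - p[0], y - p[1], f - value_p
        w = math.exp(-(dx * dx + dy * dy))     # Gaussian kernel weight
        A[0][0] += w * dx * dx
        A[0][1] += w * dx * dy
        A[1][0] += w * dy * dx
        A[1][1] += w * dy * dy
        rhs[0] += w * dx * df
        rhs[1] += w * dy * df
    # Solve the 2x2 normal equations in closed form.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return ((A[1][1] * rhs[0] - A[0][1] * rhs[1]) / det,
            (A[0][0] * rhs[1] - A[1][0] * rhs[0]) / det)
```

由于导数由这种解析构造的模板计算,网络无需再学习"如何求导",这正是 G-PARC 省去编码器-处理器-解码器框架的原因。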
[AI-213] CAMP: Cumulative Agentic Masking and Pruning for Privacy Protection in Multi-Turn LLM Conversations
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代理式多轮对话场景中因碎片化个人信息(Personally Identifiable Information, PII)累积而导致的隐私泄露问题,即“累积个人信息暴露”(Cumulative PII Exposure, CPE)。传统PII屏蔽方法仅针对单轮消息进行静态处理,无法识别跨轮次信息片段组合后形成的高风险可重识别身份特征。解决方案的关键在于提出CAMP(Cumulative Agentic Masking and Pruning)框架:通过维护会话级PII注册表、构建实体类型共现图以量化组合风险,并在每轮对话后计算CPE得分,在阈值触发时对历史对话实施回溯性屏蔽,从而在保障对话完整性的同时有效阻断累积性隐私泄露。
链接: https://arxiv.org/abs/2604.16521
作者: Aman Panjwani
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Submitted to arXiv. Finance-domain multi-turn demo evaluated on 4 synthetic scenarios. Independent research
Abstract:The deployment of Large Language Models in agentic, multi-turn conversational settings has introduced a class of privacy vulnerabilities that existing protection mechanisms are not designed to address. Current approaches to Personally Identifiable Information (PII) masking operate on a per-turn basis, scanning each user message in isolation and replacing detected entities with typed placeholders before forwarding sanitized text to the model. While effective against direct identifier leakage within a single message, these methods are fundamentally stateless and fail to account for the compounding privacy risk that emerges when PII fragments accumulate across conversation turns. A user who separately discloses their name, employer, location, and medical condition across several messages has revealed a fully re-identifiable profile - yet no individual message would trigger a per-turn masker. We formalize this phenomenon as Cumulative PII Exposure (CPE) and propose CAMP (Cumulative Agentic Masking and Pruning), a cross-turn privacy protection framework for multi-turn LLM conversations. CAMP maintains a session-level PII registry, constructs a co-occurrence graph to model combination risk between entity types, computes a CPE score after each turn, and triggers retroactive masking of conversation history when the score crosses a configurable threshold. We evaluate CAMP on four synthetic multi-turn scenarios spanning healthcare, hiring, finance, and general conversation, demonstrating that per-turn baselines expose re-identifiable profiles that CAMP successfully neutralizes while preserving full conversational utility.
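其"实体类型共现图 + 累积暴露评分 + 阈值触发"的骨架可示意如下:已披露类型两两组合的风险权重求和即为 CPE 分数,越过阈值即触发对历史轮次的回溯性屏蔽。成对权重与阈值均为虚构示例,并非 CAMP 的标定值。

```python
# Illustrative pairwise combination risks between PII types
# (keys are alphabetically sorted tuples; values are made up).
RISK = {
    ("employer", "name"): 0.3, ("location", "name"): 0.3,
    ("condition", "name"): 0.5, ("employer", "location"): 0.2,
    ("condition", "employer"): 0.2, ("condition", "location"): 0.2,
}

def cpe_score(disclosed):
    """Cumulative PII Exposure: sum of pairwise risks among all
    PII types disclosed so far in the session."""
    types = sorted(set(disclosed))
    return sum(RISK.get((a, b), 0.0)
               for i, a in enumerate(types) for b in types[i + 1:])

def observe_turn(disclosed, new_types, threshold=0.8):
    """After each turn, register newly seen PII types and report
    whether the CPE score crosses the retroactive-masking threshold."""
    disclosed = set(disclosed) | set(new_types)
    return disclosed, cpe_score(disclosed) >= threshold
```

注意逐轮的掩码器看不到任何单轮越界,只有会话级注册表能发现组合风险的累积。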
[AI-214] On-Orbit Space AI: Federated Multi-Agent and Collaborative Algorithms for Satellite Constellations
【速读】:该论文旨在解决卫星星座尺度下自主智能(constellation-scale autonomy)所面临的核心挑战,包括动态星间连接条件下的学习与协同、严格的SWaP-C(尺寸、重量、功耗与成本)限制、辐射引起的故障、非独立同分布(non-IID)数据、概念漂移以及安全关键型操作约束等问题。其解决方案的关键在于提出三种互补的范式:(i) 联邦学习(federated learning),用于跨卫星训练、个性化模型和安全聚合;(ii) 多智能体算法(multi-agent algorithms),实现协作规划、资源分配、调度、编队控制和碰撞规避;(iii) 协同感知与分布式推理(collaborative sensing and distributed inference),支持多星融合感知、跟踪、分层推理(split/early-exit inference)及与星座网络的跨层协同设计。通过系统级视角和统一的分类体系,该文整合了协作架构、时间机制与信任模型,为构建可扩展、鲁棒且高效的在轨空间AI提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2604.16518
作者: Ziyang Wang
机构: 未知
类目: Robotics (cs.RO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: Accepted by Algorithms, MDPI
Abstract:Satellite constellations are transforming space systems from isolated spacecraft into networked, software-defined platforms capable of on-orbit perception, decision making, and adaptation. Yet much of the existing AI studies remains centered on single-satellite inference, while constellation-scale autonomy introduces fundamentally new algorithmic requirements: learning and coordination under dynamic inter-satellite connectivity, strict SWaP-C limits, radiation-induced faults, non-IID data, concept drift, and safety-critical operational constraints. This survey consolidates the emerging field of on-orbit space AI through three complementary paradigms: (i) federated learning for cross-satellite training, personalization, and secure aggregation; (ii) multi-agent algorithms for cooperative planning, resource allocation, scheduling, formation control, and collision avoidance; and (iii) collaborative sensing and distributed inference for multi-satellite fusion, tracking, split/early-exit inference, and cross-layer co-design with constellation networking. We provide a system-level view and a taxonomy that unifies collaboration architectures, temporal mechanisms, and trust models. To support community development and keep this review actionable over time, we continuously curate relevant papers and resources at this https URL.
[AI-215] Forge-UGC: FX optimization and register-graph engine for universal graph compiler
【速读】:该论文旨在解决当前主流深度学习编译框架(如OpenVINO和ONNX Runtime)在部署Transformer模型时存在的问题,包括不透明的编译流程、有限的优化阶段可见性以及薄弱的内存管理机制,这些问题导致编译成本高、运行时开销大。其核心解决方案是提出Forge-UGC(FX Optimization and Register-Graph Engine for Universal Graph Compilation),一个四阶段可扩展的编译器架构:第一阶段以ATen算子级别捕获图并支持现代Transformer组件;第二阶段执行六种优化策略(如注意力融合、算子融合等)显著减少节点数量;第三阶段将优化后的图降低为带显式虚拟寄存器分配的类型化中间表示;第四阶段通过存活分析与线性扫描缓冲区分配降低峰值内存占用,并采用设备亲和调度减少NPU-CPU切换次数。此设计实现了比现有工具更快的编译速度(6.9–9.2倍)、更低的推理延迟(18.2–35.7%)和能耗(30.2–40.9%),同时保持模型精度不变。
链接: https://arxiv.org/abs/2604.16498
作者: Satyam Kumar,Saurabh Jha
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often use opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with this http URL at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes: dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization, reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation, reducing peak buffer count by 30 to 48%, and device-affinity scheduling, reducing NPU-CPU transitions by 42 to 65%. Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with max absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce Fusion Gain Ratio, Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
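Phase 4 的 liveness analysis 与 linear-scan buffer allocation 基于经典的编译器思想:为每个张量计算存活区间,并复用已"死亡"张量的缓冲区以降低峰值缓冲区数量。下面是该思想的一个最小化示意(并非 Forge-UGC 的实际实现;张量名与区间均为虚构):

```python
def linear_scan_buffers(intervals):
    """Greedy linear-scan buffer allocation: walk tensors in order of first
    use and recycle the buffer of any tensor whose last use has passed.
    intervals: {tensor_name: (first_use, last_use)} in program order."""
    order = sorted(intervals, key=lambda t: intervals[t][0])
    free, active, assign = [], [], {}   # active holds (last_use, buffer_id)
    next_buf = 0
    for t in order:
        start, end = intervals[t]
        # Expire buffers whose tensor died strictly before `start`.
        free.extend(b for e, b in active if e < start)
        active = [(e, b) for e, b in active if e >= start]
        if free:
            buf = free.pop()
        else:
            buf, next_buf = next_buf, next_buf + 1
        assign[t] = buf
        active.append((end, buf))
    return assign, next_buf             # next_buf is the peak buffer count

# Four tensors, but at most two are ever live at once: two buffers suffice.
assign, peak = linear_scan_buffers({"a": (0, 2), "b": (1, 3),
                                    "c": (3, 5), "d": (4, 6)})
print(peak)  # -> 2
```

该示意说明了论文报告的"峰值缓冲区数量降低 30~48%"这类收益的来源:存活区间不重叠的张量可以共享同一块缓冲区。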
[AI-216] Gradient-Free Continual Learning in Spiking Neural Networks via Inter-Spike Interval Regularization
【速读】:该论文旨在解决神经网络在动态环境中进行持续学习(continual learning)时面临的灾难性遗忘问题,尤其是针对脉冲神经网络(Spiking Neural Networks, SNNs)在缺乏反向传播支持的类脑硬件(neuromorphic hardware)上难以应用传统基于梯度的权重重要性评估方法(如EWC和SI)的问题。其解决方案的关键在于提出了一种全新的、无需梯度计算的突触重要性度量指标——ISI-CV(Inter-Spike Interval Coefficient of Variation),该指标通过分析神经元放电间隔的变异系数来判断其稳定性:放电规律性强(低CV)的神经元编码任务相关特征,被保护以避免遗忘;而放电不规则的神经元则允许自由适应新任务。ISI-CV仅依赖于脉冲时间计数器和整数运算,完全兼容所有类脑芯片的硬件原生能力,从而实现了在真实事件驱动传感器(DVS)数据上的高效、稳定持续学习,且显著优于现有梯度方法。
链接: https://arxiv.org/abs/2604.16496
作者: Samrendra Roy,Kazuma Kobayashi,Souvik Chakraborty,Sajedul Talukder,Syed Bahauddin Alam
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Continual learning, the ability to acquire new tasks sequentially without forgetting prior knowledge, is essential for deploying neural networks in dynamic real-world environments, from nuclear digital twin monitoring to grid-edge fault detection. Existing synaptic importance methods, such as Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI), rely on gradient computation, making them incompatible with neuromorphic hardware that lacks backpropagation support. We propose ISI-CV, the first gradient-free synaptic importance metric for SNN continual learning, derived from the Coefficient of Variation (CV) of Inter-Spike Intervals (ISIs). Neurons that fire regularly (low CV) encode stable, task-relevant features and are protected from overwriting; neurons with irregular firing are permitted to adapt freely. ISI-CV requires only spike time counters and integer arithmetic, all of which are native to every neuromorphic chip. We evaluate on four benchmarks of increasing difficulty: Split-MNIST, Permuted-MNIST, Split-FashionMNIST, and Split-N-MNIST using real Dynamic Vision Sensor (DVS) event data. Across three seeds, ISI-CV achieves zero forgetting (AF = 0.000 +/- 0.000) on Split-MNIST and Split-FashionMNIST, near-zero forgetting on Permuted-MNIST (AF = 0.001 +/- 0.000), and the highest accuracy with the lowest forgetting on real neuromorphic DVS data (AA = 0.820 +/- 0.012, AF = 0.221 +/- 0.014). On N-MNIST, gradient-based methods produce unreliable importance estimates and perform worse than no regularization; ISI-CV avoids this failure by design.
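ISI-CV 指标的核心计算只需脉冲时间与整数/浮点基本运算。下面是一个最小化示意(阈值 0.5 为演示假设,并非论文给出的数值):

```python
import numpy as np

def isi_cv(spike_times):
    """Coefficient of variation of inter-spike intervals for one neuron."""
    isis = np.diff(np.sort(np.asarray(spike_times, dtype=float)))
    if len(isis) < 2 or isis.mean() == 0:
        return np.inf                     # too few spikes to judge regularity
    return float(isis.std() / isis.mean())

def protection_mask(spike_trains, cv_threshold=0.5):
    """Low-CV (regular) neurons are protected from overwriting; irregular
    ones stay plastic. The threshold here is illustrative."""
    return [isi_cv(t) < cv_threshold for t in spike_trains]

regular = np.arange(0.0, 100.0, 5.0)      # perfectly periodic firing, CV = 0
irregular = np.cumsum(np.random.default_rng(0).exponential(5.0, 50))
print(protection_mask([regular, irregular]))  # regular neuron is protected
```

周期性放电的神经元 ISI 方差为零(CV = 0),被判定为编码稳定特征而受保护;泊松式放电的神经元 CV 接近 1,允许自由适应新任务。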
[AI-217] Spike-driven Large Language Model
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)依赖大规模密集矩阵乘法运算导致高计算和能耗成本的问题,并探索如何有效将脑启发的脉冲驱动特性(spike-driven characteristics)融入LLM推理过程。现有基于脉冲神经网络(Spiking Neural Networks, SNNs)的方法在处理百亿参数级模型时,受限于表示能力不足和编码稀疏性问题,难以实现高效低功耗的脉冲驱动推理。其解决方案的关键在于提出SDLLM——一种通过稀疏加法操作替代密集矩阵乘法的脉冲驱动大语言模型:首先采用可插拔的gamma-SQP两步脉冲编码方法,使量化过程与语义空间对齐,缓解二进制脉冲带来的表征退化;其次引入对称量化下的双向编码与膜电位截断机制,显著降低脉冲发放率并减少时间步数,从而在保持高性能的同时实现7倍能效提升和4.2%精度改善。
链接: https://arxiv.org/abs/2604.16475
作者: Han Xu,Xuerui Qiu,Baiyu Chen,Xinhao Luo,Xingrun Xing,Jiahong Zhang,Bo Lei,Tiejun Huang,Bo Xu,Guoqi Li
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Current Large Language Models (LLMs) are primarily based on large-scale dense matrix multiplications. Inspired by the brain’s information processing mechanism, we explore the fundamental question: how to effectively integrate the brain’s spiking-driven characteristics into LLM inference. Spiking Neural Networks (SNNs) possess spike-driven characteristics, and some works have attempted to combine SNNs with Transformers. However, achieving spike-driven LLMs with billions of parameters, relying solely on sparse additions, remains a challenge in the SNN field. To address the issues of limited representational capacity and sparsity in existing spike encoding schemes at the LLM level, we propose SDLLM, a spike-driven large language model that eliminates dense matrix multiplications through sparse addition operations. Specifically, we use the plug-and-play gamma-SQP two-step spike encoding method to ensure that the quantization process aligns with the model’s semantic space, mitigating representation degradation caused by binary spikes. Furthermore, we introduce bidirectional encoding under symmetric quantization and membrane potential clipping mechanisms, leading to spike trains with no or low firing counts dominating, significantly reducing the model’s spike firing rate, while halving the number of time steps. Experimental results show that SDLLM not only significantly reduces inference costs but also achieves state-of-the-art task performance under the spike-based paradigm. For example, compared to previous spike-based LLMs, SDLLM reduces energy consumption by 7x and improves accuracy by 4.2%. Our model provides inspiration for the architecture design of the next generation of event-driven neuromorphic chips.
[AI-218] Full Feature Spiking Neural Network Simulation on Micro-Controllers for Neuromorphic Applications at the Edge
【速读】:该论文旨在解决在资源受限的边缘计算设备上实现高效神经形态计算的问题,特别是如何在微控制器(MCU)上运行脉冲神经网络(SNN)模拟器而不依赖高功耗的GPU或专用硬件。其解决方案的关键在于对CARLsim SNN模拟器进行优化,采用IEEE 16位浮点数(half-precision floating-point)替代标准单精度浮点数,在不损失功能的前提下显著降低内存占用,从而使得完整功能的SNN模拟可在仅8 MB内存的RP2350 MCU上运行,并实现在20 mW功耗下实时处理缩放后的Synfire4基准测试(含186个神经元),相较ARM Cortex-A53处理器和完整SoC平台分别实现了5倍和一个数量级的能效提升。
链接: https://arxiv.org/abs/2604.16474
作者: L. Niedermeier,J. L. Krichmar
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Microcontroller units (MCUs) have an order of magnitude lower Size, Weight and Power (SWaP) than standard computers, which makes them suitable for applications at the edge. Neuromorphic computing, which can realize low SWaP, relies on Spiking Neural Networks (SNNs). Until now, software-based simulations of SNNs required GPU-based workstations, application-class core processors such as the ARM Cortex-A53, or specialized hardware like Intel’s Loihi. In the present work, we demonstrate that the SNN simulator CARLsim can run its full feature set on an RP2350 MCU with 8 MB memory. We accomplished this by utilizing IEEE 16-bit floating-point numbers, which reduced memory requirements without loss of function. We were able to run the Synfire4 benchmark, which comprises 1200 neurons; the accuracy was 97.5% compared to standard single-precision numbers. Furthermore, we show that CARLsim runs a Synfire4 benchmark scaled down to 186 neurons on an MCU in real time at only 20 mW. Compared to the smallest application-class ARM processor used by Raspberry in their Pi Zero 2 W, our MCU implementation is five times more energy efficient for the SNN itself, and an order of magnitude better when compared to the complete SoC (MCU/CPU + Board).
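将仿真状态从单精度改为 IEEE 半精度可使内存占用减半,这是论文得以在 8 MB MCU 上运行完整功能集的关键。下面用 NumPy 演示这一通用技巧本身(与 CARLsim 的内部实现无关,数组尺寸为虚构):

```python
import numpy as np

# Illustrative sizes only: storing state in IEEE half precision halves the
# memory footprint at the cost of a small per-value rounding error.
rng = np.random.default_rng(1)
weights32 = rng.standard_normal((1200, 64)).astype(np.float32)
weights16 = weights32.astype(np.float16)

print(weights32.nbytes // 1024, "KiB")  # 300 KiB in single precision
print(weights16.nbytes // 1024, "KiB")  # 150 KiB in half precision
max_err = float(np.abs(weights32 - weights16.astype(np.float32)).max())
print(max_err)                          # small rounding error per weight
```

半精度约有 11 位有效尾数,对数值范围在 O(1) 的突触权重而言,舍入误差通常在 1e-3 量级,这与论文报告的 97.5% 精度保持一致的思路相符。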
[AI-219] B-PASTE: Beam-Aware Pattern-Guided Speculative Execution for Resource-Constrained LLM Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)智能体在执行过程中因串行依赖导致的端到端延迟高、资源利用率低的问题,尤其是在边缘侧部署时,推测性执行可能侵占稀缺的延迟敏感资源。其解决方案的关键在于提出B-PASTE——一种基于束搜索(beam-aware)的分支感知型推测执行机制,它将推测对象从单个工具调用扩展为局部执行分支假设,并通过维护一个受限的未来执行子图束(bounded beam of future execution subgraphs),以预期关键路径缩短量而非单纯执行概率来排序分支候选,仅调度高价值分支前缀至瞬态空闲资源上;同时显式建模并发干扰、下游解锁价值与状态安全性约束,从而在保障权威执行的前提下实现高效串行快速路径执行与安全并行性的动态平衡。
链接: https://arxiv.org/abs/2604.16469
作者: Yanfei Song
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM agents execute in an interleaved reasoning-and-action loop, where future tool calls cannot be launched until the current reasoning step completes. This serial dependency inflates end-to-end latency and leaves the model idle while waiting for tool execution. Prior work, Pattern-Aware Speculative Tool Execution (PASTE), mitigates this bottleneck by speculating likely future tool invocations from mined control-flow and data-flow regularities. However, PASTE is tool-centric and speculates only individual invocations rather than bounded future branches. We propose B-PASTE, a beam-aware extension that lifts speculation from single tools to local branch hypotheses under strict resource constraints. B-PASTE maintains a bounded beam of future execution subgraphs, ranks them by expected critical-path reduction rather than raw execution probability, and schedules only high-value branch prefixes on transient slack resources. It explicitly models co-run interference, downstream unlock value, and state-safety constraints, enabling the system to prioritize serial fast-path execution when early completion unlocks valuable future work, while still exploiting safe parallelism under low contention. This design is especially important for edge-side deployments, where speculative work must not steal scarce resources from latency-critical authoritative execution. Preliminary internal testing on Thor-class edge environments shows up to 1.4X end-to-end speedup, suggesting that branch-aware speculative execution remains effective even under tight resource budgets.
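"按预期关键路径缩短量而非原始执行概率排序"这一调度原则可以用一个小函数示意。以下字段名(prob、latency_saved、interference_cost、state_safe)均为本示意虚构,并非 B-PASTE 的实际接口:

```python
def rank_branches(branches, beam_width=3):
    """Keep a bounded beam of speculative branch prefixes, ranked by expected
    critical-path reduction net of interference, not by raw probability.
    All field names here are illustrative, not B-PASTE's API."""
    def expected_value(b):
        return b["prob"] * b["latency_saved"] - b["interference_cost"]
    viable = [b for b in branches if b["state_safe"]]   # state-safety gate
    return sorted(viable, key=expected_value, reverse=True)[:beam_width]

candidates = [
    {"id": "likely-but-cheap",   "prob": 0.9, "latency_saved": 1.0,
     "interference_cost": 0.1, "state_safe": True},
    {"id": "rare-but-unlocking", "prob": 0.4, "latency_saved": 5.0,
     "interference_cost": 0.3, "state_safe": True},
    {"id": "unsafe",             "prob": 0.9, "latency_saved": 9.0,
     "interference_cost": 0.0, "state_safe": False},
]
beam = rank_branches(candidates, beam_width=2)
print([b["id"] for b in beam])  # -> ['rare-but-unlocking', 'likely-but-cheap']
```

注意低概率但解锁大量下游工作的分支(0.4 × 5.0 − 0.3 = 1.7)排在高概率小收益分支(0.9 × 1.0 − 0.1 = 0.8)之前,而违反状态安全约束的分支被直接过滤。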
[AI-220] Healthcare AI for Automation or Allocation? A Transaction Cost Economics Framework
【速读】:该论文旨在解决医疗工作中协调成本(transaction cost)的系统性差异问题,即不同职业角色在完成任务时因信息搜寻、决策与谈判、监督与执行、适应与协调等协调活动所产生的隐性工作负担如何分布。其解决方案的关键在于:利用O*NET职业数据库中的任务描述与频率权重,结合受约束的大语言模型对每项任务进行分类,识别其所属的交易成本类别并量化整体交易成本强度,从而在任务层面实现对医疗职业间协调结构的精细化刻画。结果表明,临床人员的交易成本强度显著高于非临床人员,主要源于信息搜索和决策协调负担,揭示了数字技术和生成式AI干预机会的不均衡分布,其驱动力更多来自协调结构而非技术复杂度本身。
链接: https://arxiv.org/abs/2604.16465
作者: Ari Ercole
机构: 未知
类目: Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:
Abstract:Healthcare productivity is shaped not only by clinical complexity but by the costs of coordinating work under uncertainty. Transaction-cost economics offers a theory of these coordination frictions, yet has rarely been operationalised at task level across health occupations. Using task statements and frequency weights from the O*NET occupational database, we characterised healthcare work at task granularity and coded each unique task using a constrained large language model into one dominant transaction-cost category (information search, decision and bargaining, monitoring and enforcement, or adaptation and coordination) together with an overall transaction-cost intensity score. Aggregating to the occupation level, clinician roles exhibited substantially higher transaction-cost intensity than non-clinician roles, driven primarily by greater burdens of information search and decision-related coordination, while dispersion of transaction costs within occupations did not differ. These findings demonstrate systematic heterogeneity in the nature of coordination work across healthcare roles and suggest that the opportunities for digital and AI interventions are unevenly distributed, shaped less by technical task complexity than by underlying coordination structure.
[AI-221] Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
【速读】:该论文旨在解决大语言模型在生成过程中因标准解码方法仅优化词元级似然而忽视序列级质量的问题,从而导致生成结果在实际任务中表现不佳。其核心解决方案是提出一个训练-free的、基于概率框架的奖励引导解码方法,通过将模型转移概率与前缀依赖的奖励势能相结合,定义出完整的序列奖励增强目标分布,并利用序贯蒙特卡洛(Sequential Monte Carlo)算法进行采样,其中关键创新包括:1)引入前缀仅限变体以提升计算效率;2)设计前瞻变体使中间目标匹配完整序列分布的精确边缘分布;3)集成重采样-移动更新与马尔可夫链蒙特卡洛再生机制,支持块级生成并统一多种常见解码策略(如温度采样和幂温化目标)。实验表明,该方法在代码生成(HumanEval)和数学推理(MATH500)任务上显著优于现有基线,且无需重新训练模型即可实现性能跃升。
链接: https://arxiv.org/abs/2604.16453
作者: Jelena Markovic-Voronov,Wenhui Zhu,Bo Long,Zhipeng Wang,Suyash Gupta,Kayhan Behdin,Bee-Chung Chen,Deepak Agarwal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We introduce a principled probabilistic framework for reward-guided decoding in large language models, addressing the limitations of standard decoding methods that optimize token-level likelihood rather than sequence-level quality. Our method defines a reward-augmented target distribution over complete sequences by combining model transition probabilities with prefix-dependent reward potentials. Importantly, the approach is training-free: it leaves model weights unchanged and instead modifies the inference distribution via reward potentials, with all gains arising purely from inference-time sampling. To sample from this distribution, we develop Sequential Monte Carlo algorithms, including a computationally efficient prefix-only variant and a lookahead variant whose intermediate targets match the exact marginals of the full sequence distribution. The framework also integrates resample-move updates with Metropolis-Hastings rejuvenation and supports block-wise generation, subsuming common decoding strategies such as temperature sampling and power-tempered objectives. Empirical results across three 7B models show significant gains. On code generation (HumanEval), our method improves base performance by up to 54.9% and surpasses the strongest sampling baselines by 9.1%-15.3%. On mathematical reasoning (MATH500), it achieves gains of up to 8.8%. Notably, it reaches 87.8% on HumanEval and 78.4% on MATH500 with Qwen2.5-7B, consistently outperforming the reinforcement learning method GRPO.
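"prefix-only" 式的序贯蒙特卡洛解码可以在玩具设定下示意:按提议分布采样、用前缀奖励势能的增量重加权、在有效样本数退化时重采样。以下仅为对该思想的极简草图(均匀二值提议与线性奖励均为虚构设定,并非论文算法):

```python
import math, random

def smc_decode(propose, reward, n_particles=16, horizon=6, seed=0):
    """Prefix-only SMC sketch: sample continuations from a proposal,
    reweight each particle by the increment in a prefix reward potential,
    and resample when the effective sample size degenerates."""
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    for _ in range(horizon):
        for i, seq in enumerate(particles):
            tok = propose(seq, rng)
            # Twist the proposal by exp(Δ reward potential).
            weights[i] *= math.exp(reward(seq + [tok]) - reward(seq))
            seq.append(tok)
        z = sum(weights)
        weights = [w / z for w in weights]
        ess = 1.0 / sum(w * w for w in weights)       # effective sample size
        if ess < n_particles / 2:                     # multinomial resampling
            particles = [list(p) for p in
                         rng.choices(particles, weights, k=n_particles)]
            weights = [1.0 / n_particles] * n_particles
    return max(zip(weights, particles), key=lambda wp: wp[0])[1]

# Uniform binary proposal; the reward potential favors sequences rich in 1s.
best = smc_decode(lambda s, r: r.choice([0, 1]), lambda s: 2.0 * sum(s))
print(best)
```

由于奖励势能只改变推断分布而不触碰模型权重,这正体现了论文"training-free"的设定;论文的 lookahead 变体与 Metropolis-Hastings 再生步骤在此省略。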
[AI-222] LatentMimic: Terrain-Adaptive Locomotion via Latent Space Imitation
【速读】:该论文旨在解决四足机器人在复杂地形中实现自然且多样化的运动控制问题,尤其是在保持运动风格一致性的同时适应地形变化的挑战。现有基于模仿学习的方法面临根本性优化权衡:严格遵循动作捕捉(motion capture, mocap)参考会抑制为适应地形所需的几何偏差,而以地形为中心的策略则常牺牲风格保真度。解决方案的关键在于提出LatentMimic框架,通过最小化策略状态-动作分布与学习到的mocap先验之间的边际潜在空间差异,实现风格保真度与几何约束的解耦;该方法在保留步态拓扑结构的同时,允许末端执行器独立适应不规则地形,并引入带有动态回放缓冲区的地形适应模块以缓解不同地形间策略分布漂移问题,从而在多个运动风格和地形场景下显著提升地形穿越成功率并维持高风格保真度。
链接: https://arxiv.org/abs/2604.16440
作者: Zhiquan Wang,Yunyu Liu,Dipam Patel,Ayush Kumar,Aniket Bera,Bedrich Benes
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Developing natural and diverse locomotion controllers for quadruped robots that can adapt to complex terrains while preserving motion style remains a significant challenge. Existing imitation-based methods face a fundamental optimization trade-off: strict adherence to motion capture (mocap) references penalizes the geometric deviations required for terrain adaptability, whereas terrain-centric policies often compromise stylistic fidelity. We introduce LatentMimic, a novel locomotion learning framework that decouples stylistic fidelity from geometric constraints. By minimizing the marginal latent divergence between the policy’s state-action distribution and a learned mocap prior, our approach provides a conditional relaxation of rigid pose-tracking objectives. This formulation preserves gait topology while permitting independent end-effector adaptations for irregular terrains. We further introduce a terrain adaptation module with a dynamic replay buffer to resolve the policy’s distribution shifts across different terrains. We validate our method across four locomotion styles and four terrains, demonstrating that LatentMimic enables effective terrain-adaptive locomotion, achieving higher terrain traversal success rates than state-of-the-art motion-tracking methods while maintaining high stylistic fidelity.
[AI-223] Support Sufficiency as Consequence-Sensitive Compression in Belief Arbitration
【速读】:该论文旨在解决系统在做出假设时因压缩导致的证据结构丢失问题,特别是标准方法中仅依赖选择内容和标量置信度无法满足下游控制需求的问题。其核心挑战在于如何确定哪些信息必须在压缩过程中保留,以维持决策的有效性。解决方案的关键在于提出一种递归仲裁架构(recurrent arbitration architecture),其中活跃的约束场共同决定候选假设的空间几何结构;系统并非完整传递该几何结构,而是将其压缩为一种支持感知的控制状态(support-aware control state),其解析程度由当前后果几何、仲裁记忆和资源约束共同调节。通过一个有界目标函数形式化这一权衡:支持保留过少会导致政策相关区分失效,引发验证、回避与恢复策略误判;保留过多则造成学习在过于精细的情境中碎片化,损害适应能力。实验表明,能动态调节支持分辨率的自适应控制器优于所有固定分辨率方案,且敏捷型自适应控制优于迟滞型,揭示了支持充分性应被视为一种动态压缩准则而非静态表征阈值。
链接: https://arxiv.org/abs/2604.16434
作者: Mark Walsh
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 27 pages, 3 figures, 1 table
Abstract:When a system commits to a hypothesis, much of the evidential structure behind that commitment is lost to compression. Standard accounts assume that selected content and scalar confidence suffice for downstream control. This paper argues that they do not, and that determining what must survive compression is itself a consequence-sensitive problem. We develop a recurrent arbitration architecture in which active constraint fields jointly determine a hypothesis geometry over candidates. Rather than carrying that geometry forward in full, the system compresses it into a support-aware control state whose resolution is regulated by current consequence geometry, arbitration memory, and resource constraints. A bounded objective formalizes the tradeoff. Too little retained support collapses policy-relevant distinctions, producing controllers that select content adequately while misrouting verification, abstention, and recovery. Too much retained support fragments learning across overly fine contexts, degrading adaptation even as discrimination improves. These failure modes yield ordered controller predictions confirmed by a minimal repeated-interaction simulation. Adaptive controllers that regulate support resolution outperform all fixed-resolution controllers in cumulative utility. Agile adaptive control outperforms sluggish adaptive control. Fixed high-resolution control achieves the best commitment accuracy but still trails adaptive controllers because resource cost and learning fragmentation offset the gains from richer retention. Support sufficiency should be understood not as a static representational threshold, but as a dynamic compression criterion. Robust arbitration depends on preserving the smallest support structure adequate for policy under the current consequence landscape, and on regulating that structure as conditions change across repeated cycles of inference and action. 
[AI-224] Quantifying how AI Panels improve precision
【速读】:该论文旨在解决生成式 AI(Generative AI)在求职筛选等应用场景中可能加剧青年失业问题的潜在风险,尤其是当单一AI系统被过度依赖时所引发的偏差与决策脆弱性问题。其解决方案的关键在于提出一个可量化评估AI群体(Panel of AIs)选择精度的公式:$ P(q) \approx \frac{\rho}{n^b} + q(1-\rho)\left[1 + (n^b - 1)\rho\right] $,其中 $ n $ 为AI数量,$ \rho $ 为平均成对相关性,$ b \approx q^* + 0.8(1 - \rho) $ 且 $ q^* $ 被截断于 [0.07, 0.22] 区间内,$ P(q) $ 表示选取前 $ q $ 分位数候选者的精确度。该公式揭示了通过引入多样化的AI组成面板,可在不显著牺牲性能的前提下提升决策鲁棒性,从而推动从单一AI依赖向多元协同决策的范式转变。
链接: https://arxiv.org/abs/2604.16432
作者: Nicholas CL Beale
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM)
备注: 11 pages, 8 Figures, 13pp of Supplementary Information
Abstract:AI in applications like screening job applicants has become widespread, and may contribute to unemployment especially among the young. Biases in the AIs may become baked into the job selection process, but even in their absence, reliance on a single AI is problematic. In this paper we derive a simple formula to estimate, or at least place an upper bound on, the precision of such approaches for data resembling realistic CVs: P(q) \approx \frac{\rho}{n^b} + q(1-\rho)\left[1 + (n^b - 1)\rho\right] , where b \approx q^* + 0.8(1 - \rho) , q^* is q clipped to [0.07, 0.22] , P(q) is the precision of the top q quantile selected by a panel of n AIs, and \rho is their average pairwise correlation. This equation provides a basis for considering how many AIs should be used in a Panel, depending on the importance of the decision. A quantitative discussion of the merits of using a diverse panel of AIs to support decision-making in such areas will move away from dangerous reliance on single AI systems and encourage a balanced assessment of the extent to which diversity needs to be built into the AI parts of the socioeconomic systems that are so important for our future.
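论文给出的精度上界公式可以直接实现为一个小函数(此处仅实现公式本身;函数名与示例取值为本示意虚构):

```python
def panel_precision(q, n, rho):
    """P(q): approximate precision of the top-q quantile selected by a panel
    of n AIs with average pairwise correlation rho, per the paper's formula.
    b = q* + 0.8 * (1 - rho), with q* = q clipped to [0.07, 0.22]."""
    q_star = min(max(q, 0.07), 0.22)     # q clipped to [0.07, 0.22]
    b = q_star + 0.8 * (1.0 - rho)
    return rho / n**b + q * (1.0 - rho) * (1.0 + (n**b - 1.0) * rho)

# Sanity check: with a single AI (n = 1), n**b = 1 and the formula
# collapses to rho + q * (1 - rho).
print(panel_precision(0.1, 1, 0.5))   # -> 0.55
print(panel_precision(0.1, 9, 0.5))   # how the bound moves with panel size
```

n = 1 时公式退化为 rho + q(1 − rho),便于核对实现是否与论文表达式逐项一致。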
[AI-225] Dimensional Criticality at Grokking Across MLPs and Transformers
【速读】:该论文旨在解决深度神经网络中“grokking”现象的宏观可测量特征缺失问题,即在训练后期出现从记忆到泛化的突变性过渡时,缺乏能够清晰表征这一转变的宏观物理量。其解决方案的关键在于提出了一种名为TDU–OFC(Thresholded Diffusion Update–Olami-Feder-Christensen)的离线级联探测器,通过将梯度快照转化为级联统计量,并利用与grokking对齐的有限尺寸标度法提取一个时间分辨的有效级联维度 $ D(t) $ 作为宏观可观测量。研究发现,在模块加法和XOR任务中,$ D(t) $ 在泛化过渡点处精确穿过高斯扩散基线 $ D=1 $,且穿越方向与任务相关,表明系统可能趋近于一个共享的临界流形而非随机靠近 $ D \approx 1 $,从而为理解grokking背后的动力学机制提供了定量且普适的指标。
链接: https://arxiv.org/abs/2604.16431
作者: Ping Wang
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
备注:
Abstract:Abrupt transitions between distinct dynamical regimes are a hallmark of complex systems. Grokking in deep neural networks provides a striking example – an abrupt transition from memorization to generalization long after training accuracy saturates – yet robust macroscopic signatures of this transition remain elusive. Here we introduce \textbf{TDU–OFC} (Thresholded Diffusion Update–Olami-Feder-Christensen), an offline avalanche probe that converts gradient snapshots into cascade statistics and extracts a \emph{macroscopic observable} – the time-resolved effective cascade dimension D(t) – via grokking-aligned finite-size scaling. Across Transformers trained on modular addition and MLPs trained on XOR, we discover a localized dynamical crossing of the Gaussian diffusion baseline D=1 precisely at the generalization transition. The crossing direction is task-dependent: modular addition descends through D=1 (approaching from D>1 ), while XOR ascends (from D<1 ). This opposite-direction convergence is consistent with attraction toward a candidate shared critical manifold, rather than trivial residence near D \approx 1 . Negative controls confirm this picture: ungrokked runs remain supercritical ( D>1 ) and never enter the post-transition regime. In addition, avalanche distributions exhibit heavy tails and finite-size scaling consistent with the dimensional exponent extracted from D(t) . Shadow-probe controls ( \alpha_{\mathrm{train}}=0 ) confirm that D(t) is non-invasive, and grokked trajectories diverge from ungrokked ones in D(t) some 100–200 epochs before the behavioral transition.
[AI-226] Non-Stationarity in the Embedding Space of Time Series Foundation Models
【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)在嵌入空间中非平稳性(non-stationarity)理解不足的问题,尤其澄清了非平稳性与分布漂移(distribution shift)之间的混淆,这在经典时间序列分析和统计过程控制(Statistical Process Control, SPC)中具有根本区别。解决方案的关键在于系统性地识别和量化不同形式的非平稳性——包括均值漂移、方差变化和线性趋势——如何在受控条件下于TSFM嵌入空间中呈现为可线性探测的特征,并进一步考察由持久性(persistence)引起的时序非平稳性,即违反弱平稳性的长期记忆或近单位根行为。通过参数化漂移强度并测试多种TSFMs,研究发现嵌入空间中非平稳性的可检测性呈平滑退化趋势,且各模型表现出特定的失败模式,从而为TSFM的诊断性评估和改进提供了理论依据与实证基础。
链接: https://arxiv.org/abs/2604.16428
作者: Jinmyeong Choi,Brad Shook,Artur Dubrawski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 17 pages, 7 figures
Abstract:Time series foundation models (TSFMs) are widely used as generic feature extractors, yet the notion of non-stationarity in their embedding spaces remains poorly understood. Recent work often conflates non-stationarity with distribution shift, blurring distinctions fundamental to classical time-series analysis and long-standing methodologies such as statistical process control (SPC). In SPC, non-stationarity signals a process leaving a stable regime - via shifts in mean, variance, or emerging trends - and detecting such departures is central to quality monitoring and change-point analysis. Motivated by this diagnostic tradition, we study how different forms of distributional non-stationarity - mean shifts, variance changes, and linear trends - become linearly accessible in TSFM embedding spaces under controlled conditions. We further examine temporal non-stationarity arising from persistence, which reflects violations of weak stationarity due to long-memory or near-unit-root behavior rather than explicit distributional shifts. By sweeping shift strength and probing multiple TSFMs, we find that embedding-space detectability of non-stationarity degrades smoothly and that different models exhibit distinct, model-specific failure modes.
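"非平稳性在嵌入空间中线性可达"这一设定可以在玩具数据上示意:对含均值漂移的窗口提取特征,再用线性探针判别。以下示意中的"嵌入"是手工摘要特征,仅作占位,真实 TSFM 嵌入是学习得到的:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_window(kind, n=256):
    """Toy windows: stationary noise vs. a mean-shifted regime."""
    x = rng.standard_normal(n)
    if kind == "mean_shift":
        x[n // 2:] += 1.5             # level shift halfway through the window
    return x

def embed(x):
    """Stand-in 'embedding': summary features a frozen encoder might expose.
    Real TSFM embeddings are learned, not hand-crafted like this."""
    half = len(x) // 2
    return np.array([x.mean(), x.std(), x[:half].mean() - x[half:].mean()])

kinds = ["stationary", "mean_shift"] * 50
X = np.array([embed(make_window(k)) for k in kinds])
y = np.array([k == "mean_shift" for k in kinds], dtype=float)

# Linear probe via least squares (a cheap stand-in for logistic regression).
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
acc = float(((A @ w > 0) == (y > 0.5)).mean())
print(acc)   # near 1.0: this mean shift is linearly accessible
```

论文的核心问题正是:当漂移强度减弱、或非平稳性来自持久性而非分布漂移时,这类线性可达性如何在各 TSFM 的嵌入空间中平滑退化。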
[AI-227] Shifting the Gradient: Understanding How Defensive Training Methods Protect Language Model Integrity
【速读】:该论文旨在解决防御性训练方法中两种代表性技术——正向预防转向(Positive Preventative Steering, PPS)与接种提示(Inoculation Prompting, IP)——在机制上是否相同的问题。尽管二者均通过在训练过程中引入诱导特定特质的物体来防止大语言模型(Large Language Models, LLMs)习得不良特质(如“邪恶性”),其成功背后的运作原理尚不明确。论文的关键发现是:PPS 与 IP 实现防御效果的机制截然不同。PPS 通过调整激活梯度方向,在特征轴上产生抑制效应,甚至可逆转已有特质表达;而 IP 则表现出更弥散的梯度特征,并通过降低对特质数据的预测损失来“解释掉”特质表现,这表明其作用机制可能涉及对训练数据中特质信号的重构或掩盖。这一区分揭示了两类方法的本质差异,并为理解防御性训练提供了新的理论框架与开放问题。
链接: https://arxiv.org/abs/2604.16423
作者: Satchel Grant,Victor Gillioz,Jake Ward,Thomas McGrath
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Defensive training methods such as positive preventative steering (PPS) and inoculation prompting (IP) offer surprising results through seemingly similar processes: both add trait-inducing objects to large language models (LLMs) during training, and both defend the LLM against acquiring the trait. The surprising success of these methods comes with the question: how do they work? Are PPS and IP doing the same thing? We provide behavioral and mechanistic comparisons of these two methods using “evilness” as a case-study trait. Our central finding is that PPS and IP achieve their defensive benefits through distinct mechanisms. Behaviorally, we show that neither PPS nor IP operates through a purely associative mechanism; and PPS can both defend against trait acquisition and actively reduce pre-existing expression, whereas IP is ineffective in models that were previously finetuned to express the trait. This behavioral divergence is reflected mechanistically: PPS shifts the activation gradient towards an attenuating direction along the PPS vector axis. When the PPS vector is aligned with a trait-expressing axis, it can reverse the gradient pressure, reducing rather than increasing activation along that axis. In contrast, IP continues to resist a precise mechanistic account. Direct cosine similarity analyses reveal that IP has a characteristically different gradient signature than PPS, and qualitative analyses reveal IP’s gradient to be more diffuse. Furthermore, IP reduces the next-token prediction loss on trait-expressing data where PPS need not, consistent with the notion that IP “explains away” the trait-expression in the training data. Taken together, our analyses reveal distinct mechanisms by which each method operates and highlight open questions about IP’s mechanistic picture.
[AI-228] Breaking Validity-Induced Boundaries to Expand Algorithm Search Space: A Two-Stage AST-Based Operator for LLM -Driven Automated Heuristic Evolution
【速读】:该论文旨在解决现有基于大语言模型(Large Language Model, LLM)的自动化启发式设计(Automated Heuristic Design, AHD)方法中搜索能力受限的问题。当前主流的一阶段LLM-AHD框架依赖于语义进化算子,严格要求生成代码在操作过程中必须合法,并常采用“思考-代码”(thought-code)表示形式,这限制了算法搜索空间中的结构多样性探索。其解决方案的关键在于提出一种两阶段、基于抽象语法树(Abstract Syntax Tree, AST)的进化算子:第一阶段直接对启发式代码的AST进行交叉和变异,生成多样但通常无效的结构变体;第二阶段利用LLM将这些无效代码修复为可执行且高质量的代码。通过保留原始无效变体或修复后的优质解至种群中,该方法有效提升了LLM-AHD算法(如EoH-S)的全局搜索能力和收敛速度,在旅行商问题(TSP)与在线装箱问题(OBP)上验证了其优化性能显著提升。
链接: https://arxiv.org/abs/2604.16420
作者: Sun Shengming,Shi Jialong
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures
Abstract:Large Language Model (LLM) based automated heuristic design (AHD) has shown great potential in discovering efficient heuristics. Most existing LLM-AHD frameworks use semantic evolutionary operators that rely entirely on the LLM’s pre-trained knowledge. These one-stage methods strictly require the generated code to be valid during the operation and often rely on a “thought-code” representation. We argue that this end-to-end generation fundamentally limits the exploration ability within the algorithm search space. In this paper, we propose a two-stage, structure-based evolutionary operator for LLM-AHD. In the first stage, our approach directly performs crossover and mutation on the Abstract Syntax Trees (ASTs) of the heuristic code, intentionally generating diverse but often invalid structural variants. In the second stage, the LLM is employed to repair these invalid heuristics into executable, high-quality code. Depending on the underlying framework, either the raw invalid variants or the repaired heuristics are integrated into the population to preserve potential structural patterns. We demonstrate that the proposed operator can significantly enhance the search ability of state-of-the-art LLM-AHD algorithms, such as EoH-S. Experimental results on the Traveling Salesman Problem (TSP) and the Online Bin Packing Problem (OBP) show that our method effectively improves both optimization performance and convergence speed.
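第一阶段"直接在 AST 上做交叉、允许产生无效变体"的思路可以用 Python 标准库 ast 模块示意。以下仅是一个极简草图(并非论文的实际算子):将父代 B 的一个算术子树嫁接到父代 A 中,产物语法合法但语义上可能失效(例如引用了 B 的变量),对应论文中交由 LLM 修复的情形:

```python
import ast
import copy
import random

def subtree_crossover(code_a, code_b, seed=0):
    """Stage-1 sketch: graft a random arithmetic subtree from parent B into
    parent A's AST. The child parses fine but may be semantically broken
    (e.g. referencing B's variables); in the paper, an LLM repairs such
    invalid variants in stage 2."""
    rng = random.Random(seed)
    tree_a, tree_b = ast.parse(code_a), ast.parse(code_b)
    sites = [n for n in ast.walk(tree_a) if isinstance(n, ast.BinOp)]
    donors = [n for n in ast.walk(tree_b) if isinstance(n, ast.BinOp)]
    if sites and donors:
        donor = copy.deepcopy(rng.choice(donors))
        target = rng.choice(sites)
        # Graft the donor's fields onto the target node in place.
        target.left, target.op, target.right = donor.left, donor.op, donor.right
    return ast.unparse(ast.fix_missing_locations(tree_a))

child = subtree_crossover("def h(x):\n    return x * 2 + 1\n",
                          "def g(y):\n    return y ** 3 - y\n")
print(child)  # mixes structure from both parents, possibly referencing `y`
```

这种结构层面的操作绕开了"生成代码必须合法"的约束,从而能探索语义算子难以到达的算法结构。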
[AI-229] What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM -Based Social Science Labeling
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在计算社会科学研究中用于标注时,其输出结果对提示词(prompt)变化的可靠性问题。现有研究普遍采用单一提示词进行评估,但未充分考虑提示词语义等价但语言形式差异所引发的模型行为波动,这可能影响研究结论的可重复性与稳定性。论文提出跨提示可靠性(Inter-Prompt Reliability, IPR)框架作为解决方案,其核心在于通过成对一致性率(Pairwise Agreement Rate, PAR)及其分布来量化LLM在不同提示下的输出一致性与随机性。实验表明,在解释性任务(如TREC)中LLM表现出显著的随机变异性,而在知识锚定任务(如PolitiFact)中则更为稳定;进一步证明,通过多提示多数投票机制可有效提升标注的一致性和降低方差。因此,IPR框架强调将提示视为一种测量工具,并建议未来研究从单提示评估转向基于分布稳定性和提示聚合的系统性方法。
链接: https://arxiv.org/abs/2604.16413
作者: Jingyuan Liu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures, 3 tables
Abstract:Large language models (LLMs) are increasingly used for annotation in computational social science, yet their methodological reliability under prompt variation remains unclear. This paper introduces Inter-Prompt Reliability (IPR), a framework for evaluating the stability of LLM outputs across semantically equivalent but linguistically varied prompts. Drawing on Inter-Rater Reliability, IPR is measured by Pairwise Agreement Rate (PAR) and its distribution to capture both consistency and stochasticity in model behavior. We evaluate this framework on two tasks with distinct properties: TREC (interpretative) and Politifact (knowledge-anchored). Results show that LLM annotation exhibits substantial stochastic variation in interpretative tasks, while appearing more stable in knowledge-based tasks. We further show that majority voting across prompts significantly improves reproducibility and reduces variance. These findings suggest that the LLM prompt acts as a measurement instrument whose wording introduces methodological uncertainty. For future LLM-based CSS studies, we suggest that researchers move beyond single-prompt evaluation toward distributional stability and prompt aggregation within our IPR framework.
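摘要中的成对一致性率(PAR)与多提示多数投票,可以按其字面含义写成如下示意代码(假设性实现,论文对 PAR 的精确定义可能有所不同;标签数据为虚构):

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement_rate(labels_by_prompt):
    """Mean per-item agreement over all pairs of prompts.

    `labels_by_prompt` maps prompt id -> list of labels (one per item).
    """
    prompts = list(labels_by_prompt)
    n_items = len(labels_by_prompt[prompts[0]])
    pair_rates = []
    for a, b in combinations(prompts, 2):
        agree = sum(
            labels_by_prompt[a][i] == labels_by_prompt[b][i]
            for i in range(n_items)
        )
        pair_rates.append(agree / n_items)
    return sum(pair_rates) / len(pair_rates)

def majority_vote(labels_by_prompt):
    """Aggregate labels across prompts by per-item majority."""
    prompts = list(labels_by_prompt)
    n_items = len(labels_by_prompt[prompts[0]])
    return [
        Counter(labels_by_prompt[p][i] for p in prompts).most_common(1)[0][0]
        for i in range(n_items)
    ]

# Three paraphrased prompts labeling the same three items:
runs = {
    "p1": ["LOC", "NUM", "HUM"],
    "p2": ["LOC", "NUM", "LOC"],
    "p3": ["LOC", "DESC", "HUM"],
}
print(pairwise_agreement_rate(runs))  # mean agreement over the 3 prompt pairs
print(majority_vote(runs))
```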
[AI-230] How unique are hallucinated citations offered by generative Artificial Intelligence models?
【速读】:该论文旨在解决生成式 AI(Generative AI)在学术写作中产生和传播虚假参考文献(hallucinated academic references)的问题,特别是聚焦于一个反复出现的虚构引用“Education Governance and Datafication”被错误归因于 Ben Williamson 和 Nelli Piattoeva 的现象。其解决方案的关键在于揭示这些幻觉引用并非随机编造,而是基于真实作者、期刊、年份和关键词的模式化重组,并通过实证分析(包括对137篇来源论文的检索、对ChatGPT 5-mini的结构化提问以及对10篇AI生成论文的审查)证明:当缺乏外部验证时,模型会依据学习到的模式重构看似合理的参考文献,而非基于事实记忆。这表明当前基于网络的生成式AI仍无法完全消除伪造引用的风险,亟需加强学术引用验证机制以维护学术诚信。
链接: https://arxiv.org/abs/2604.16407
作者: Dirk HR Spennemann
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper investigates how generative AI produces and propagates hallucinated academic references, focusing on the recurring non-existent citation ‘Education Governance and Datafication’ attributed to Ben Williamson and Nelli Piattoeva. Drawing on 137 accessible source papers identified through Google Scholar and Google searches, the study analyses the structure, recurrence, and onward citation of this phantom reference. It shows that hallucinated citations are not random inventions but patterned recombinations of real authors, journals, dates, and keywords, with duplication occurring in nearly 30% of cases. The paper also reports a structured interrogation of ChatGPT 5-mini about how it generates citations and finds that, absent verification, the model reconstructs plausible references from learned patterns rather than factual recall. Finally, ten AI-generated essays on datafication and school governance were examined: while most references were genuine or partly accurate, 9.2% remained hallucinated, including an exact match to the most common phantom citation. The findings highlight ongoing risks to academic integrity and show that web-enabled AI still does not fully eliminate fabricated references.
[AI-231] Computational Hermeneutics: Evaluating generative AI as a cultural technology
【速读】:该论文试图解决当前生成式 AI (Generative AI) 评估框架将文化视为可测量变量而非系统运作基础的问题,从而导致对模型理解与生成意义的能力评估不足。解决方案的关键在于提出“计算诠释学”(computational hermeneutics)这一新兴框架,强调生成式 AI 系统本质上是“情境机器”(context machines),必须应对三个诠释挑战:情境性(meaning only emerges in context)、多元性(plurality)和模糊性(ambiguity)。作者进一步提出三条诠释评价原则:基准测试应为迭代过程而非一次性任务;应包含人类参与者而非仅依赖机器;应衡量文化语境而非仅关注模型输出。这一视角推动了从标准化准确性问题向情境化意义问题的范式转变。
链接: https://arxiv.org/abs/2604.16403
作者: Cody Kommers,Ruth Ahnert,Maria Antoniak,Emmanouil Benetos,Steve Benford,Mercedes Bunz,Baptiste Caramiaux,Shauna Concannon,Martin Disley,James Dobson,Yali Du,Edgar Duéñez-Guzmán,Kerry Francksen,Evelyn Gius,Jonathan W. Y. Gray,Ryan Heuser,Sarah Immel,Richard Jean So,Sang Leigh,Dalaki Livingston,Hoyt Long,Meredith Martin,Georgia Meyer,Daniela Mihai,Ashley Noel-Hirst,Kirsten Ostherr,Deven Parker,Yipeng Qin,Jessica Ratcliff,Emily Robinson,Karina Rodriguez,Adam Sobey,Ted Underwood,Aditya Vashistha,Matthew Wilkens,Youyou Wu,Yuan Zheng,Drew Hemment
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Published in Frontiers in Artificial Intelligence
Abstract:Generative AI systems are increasingly recognized as cultural technologies, yet current evaluation frameworks often treat culture as a variable to be measured rather than fundamental to the system’s operation. Drawing on hermeneutic theory from the humanities, we argue that GenAI systems function as “context machines” that must inherently address three interpretive challenges: situatedness (meaning only emerges in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations naturally conflict). We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation – that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.
[AI-232] CoLLM: A Unified Framework for Co-execution of LLMs Federated Fine-tuning and Inference
【速读】:该论文旨在解决边缘智能场景下大语言模型(Large Language Models, LLMs)后训练阶段(包括微调和推理)因被当作孤立负载处理而导致的资源冗余与推理质量提升延迟问题。现有方法虽在联邦参数高效微调(Federated Parameter-Efficient Fine-Tuning, FL PEFT)和低延迟推理方面取得进展,但未充分考虑二者间的协同关系。解决方案的关键在于提出一种联合执行框架 CoLLM,其核心创新包括:(1) 在单个边缘副本内部采用模型共享机制,通过未合并推理与影子适配器策略实现实时参数复用;(2) 在副本间引入双时间尺度协调算法,动态平衡微调与推理负载,从而同时优化长期模型质量提升与短期推理效率。
链接: https://arxiv.org/abs/2604.16400
作者: Shaoyuan Huang,Xiaokai Wang,Na Yan,Xiaofei Wang,Wenyu Wang,Yansha Deng
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase (including fine-tuning and inference) have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluation across diverse LLMs and real-world traces show that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3x higher goodput, demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.
[AI-233] IACDM: Interactive Adversarial Convergence Development Methodology – A Structured Framework for AI-Assisted Software Development
【速读】:该论文旨在解决生成式 AI(Generative AI)在软件开发中引发的“验证缺口”(verification gap)问题,即大语言模型(LLM)作为随机生成器缺乏内部语义验证能力,导致即使经验丰富的开发者使用前沿模型时仍出现效率下降和生产级应用中高达10.3%的关键安全漏洞。解决方案的核心是提出一种名为 IACDM(Interactive Adversarial Convergence Development Methodology)的结构化8阶段方法论,其关键在于引入外部验证代理(Verification Agent, VA)在离散节点进行强制性验证,通过三大支柱实现:(1) 在技术实现前通过分层语义分析(Hierarchical Semantic Analysis)深入挖掘问题本质;(2) 跨会话持续管理知识;(3) 在实施前采用专业化批判视角进行系统性对抗性审查。该方法具有工具无关性,扎根于传统软件工程实践,并已在多个生产研发环境中验证其有效性。
链接: https://arxiv.org/abs/2604.16399
作者: Jasmine Moreira
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 14 pages, 6 tables. Technical Foundation Document. Repository: this https URL . VSCode extensions available at VS Marketplace ( this http URL -claude, this http URL -copilot)
Abstract:The widespread adoption of AI-assisted development tools in 2025 – and the emergence of vibe coding, a practice of generating complete applications from natural language without verification – exposed a critical and tool-agnostic failure pattern: experienced developers who used frontier AI models were measurably slower in objective evaluations despite believing they were faster. Concurrently, 10.3% of AI-generated applications in a production showcase contained critical security flaws. This paper argues that these failures share a structural cause – the verification gap: every large language model (LLM), regardless of interface or capability, operates as a stochastic generator with zero internal semantic verification capability. The tool is irrelevant; the process is determinative. We present IACDM (Interactive Adversarial Convergence Development Methodology), a structured 8-phase framework designed to address the verification gap through external verification agents (VA) operating at discrete gates. Its three pillars are: (1) deep problem discovery via Hierarchical Semantic Analysis before any technical solution; (2) persistent knowledge management across sessions; and (3) systematic adversarial critique through specialized lenses before implementation. The methodology is tool-agnostic by construction, grounded in established software engineering tradition, and applied across more than 20 projects by multiple practitioners in a production R&D environment. Limitations are formalized as testable hypotheses for future empirical validation.
[AI-234] A Framework for Human-AI Q-Matrix Refinement: A NeuralCDM Evaluation
【速读】:该论文旨在解决传统Q-matrix构建过程中依赖专家经验所导致的耗时、主观性强及难以实证验证的问题。其解决方案的关键在于提出一种人机协同的Q-matrix精炼框架,其中大语言模型(Large Language Models, LLMs)通过结构化且包含常见误解提示的方式生成候选Q-matrix,再由NeuralCDM提供基于学生作答数据的实证评估层,以比较不同候选矩阵的拟合效果,从而实现自动化与可验证的Q-matrix优化。
链接: https://arxiv.org/abs/2604.16398
作者: Ying Zhang,Ningxi Cheng,Yizhu Gao,Hongmei Li,Lehong Shi,Nicholas Young,Geng Yuan,Xiaoming Zhai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted at AIED 2026
Abstract:Q-matrices are a cornerstone of theory-driven assessment and learning analytics, making item demands and students’ underlying knowledge components and misconceptions explicit and actionable. However, Q-matrices are typically crafted by experts, making them time-consuming to build, prone to subjectivity, and difficult to validate empirically. We propose a framework for human-AI Q-matrix refinement in which large language models (LLMs) generate candidate Q-matrices using structured, misconception-aware prompting, and NeuralCDM provides an empirical evaluation layer to compare candidates based on how well they explain student response data. We apply the framework to a thermodynamics assessment dataset and benchmark locally deployed LLMs against cloud-served models. Results show that iteratively refined LLM-generated Q-matrices can exceed expert-baseline model fit (AUC 0.780 vs. 0.717), and that locally deployed models achieve comparable performance to cloud APIs, supporting privacy-preserving deployment.
[AI-235] Stream2LLM: Overlap Context Streaming and Prefill for Reduced TTFT
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)推理过程中上下文检索系统面临的高延迟问题,即在等待完整上下文加载时导致的首次词元时间(Time-to-First-Token, TTFT)恶化与提前推理造成的生成质量下降之间的根本性权衡。针对多租户部署场景下并发请求竞争GPU内存资源、且上下文动态到达带来的调度挑战,作者提出STREAM2LLM系统,其核心创新在于:通过解耦调度决策与资源获取机制,实现基于硬件成本模型的灵活抢占策略;同时支持两种检索模式——追加模式(append-mode,渐进式累积上下文)和更新模式(update-mode,带缓存失效的迭代优化),并利用最长公共前缀匹配进行缓存无效化以减少冗余计算。实验表明,该方案可实现高达11倍的TTFT提升,并在内存受限环境下通过感知成本的调度策略维持吞吐量与非流式基线相当。
链接: https://arxiv.org/abs/2604.16395
作者: Rajveer Bachkaniwala,Chengqi Luo,Richard So,Divya Mahajan,Kexin Rong
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Recent work mitigates this via streaming–overlapping retrieval with inference–but prior systems focus on single-request settings and overlook challenges in multi-tenant deployments where concurrent requests contend for GPU memory and scheduling must adapt to dynamic context arrivals. We present STREAM2LLM, a system that extends vLLM to support streaming prompts with adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). STREAM2LLM decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses cache invalidation based on longest common prefix matching to minimize redundant computation when prompts change dynamically. To evaluate STREAM2LLM, we collect and characterize two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that streaming architecture delivers up to 11x TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, while maintaining throughput parity with non-streaming baselines.
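摘要提到的"基于最长公共前缀匹配的缓存失效",其核心逻辑可以用下面的玩具代码示意(假设性实现,与 STREAM2LLM 的真实块管理无关;这里以字符充当 token、以固定大小的块模拟 KV 缓存块):

```python
def lcp_tokens(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reusable_prefix(cached_tokens, new_tokens, block_size=4):
    """Return (kept KV blocks, tokens to recompute) after a prompt update.

    Only whole blocks that fall entirely inside the common prefix
    survive; everything from the first divergent token on (plus any
    partial block) must be recomputed.
    """
    keep_tokens = lcp_tokens(cached_tokens, new_tokens)
    keep_blocks = keep_tokens // block_size
    return keep_blocks, len(new_tokens) - keep_blocks * block_size

cached = list("the quick brown fox jumps")
updated = list("the quick brown cat sits")
blocks, recompute = reusable_prefix(cached, updated)
print(blocks, recompute)
```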
[AI-236] DAOnt: A Formal Ontology for EU Data Act Compliance
【速读】:该论文旨在解决欧盟《数据法案》(EU Data Act)中法律条款在实际应用中的可计算性与合规性验证难题,即如何将复杂的法规文本转化为机器可读、可推理的形式化模型,以支持自动化合规检查。解决方案的关键在于构建了一个名为DAOnt的本体(ontology),该本体复用LKI-F Core、ODRL和DPV三个成熟本体中的元素,对《数据法案》的核心概念及其规范结构进行形式化建模,并通过SPARQL查询实现对义务、权限和禁止项的自动提取与验证,从而为B2C、B2B和B2G场景下的数据共享协议提供可执行的合规性分析能力。
链接: https://arxiv.org/abs/2604.16386
作者: Sheyla Leyva-Sánchez,Fabian Linde,Meem Arafat Manab,María Poveda-Villalón,Víctor Rodríguez-Doncel
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:The EU Data Act establishes comprehensive rules governing data access and sharing across business-to-consumer (B2C), business-to-business (B2B), and business-to-government (B2G) contexts. This paper presents a comprehensive ontology for the EU Data Act, enabling reasoning over data sharing agreements through machine-readable representations. The DAOnt ontology reuses elements from three established ontologies, LKIF-Core, ODRL, and DPV, to capture the normative structure of the Data Act. The ontology captures the main concepts and relationships in the Regulation, and it also operationalises three articles to facilitate compliance checking: Article 4(1) (B2C user access rights), Article 8(6) (B2B trade secret exceptions) and Article 19(2)(a) (B2G competitive use prohibitions). The ontology supports compliance checking through SPARQL queries that return obligations, permissions, and prohibitions, allowing organisations to verify whether data-sharing agreements meet the requirements of the EU Data Act and to assess conditions such as FRAND obligations. By representing key legal concepts in RDF, our work helps bridge the gap between the legal provisions of the Data Act and their computational interpretation. The complete ontology, along with example instances and queries, is available online.
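摘要中"返回义务、许可与禁止项"的合规检查,在不依赖 RDF 工具链的前提下,可以用纯 Python 做一个结构上等价的示意(规则集与字段名均为本文虚构,并非 DAOnt 词表;真实系统中这一步由针对本体的 SPARQL 查询完成):

```python
# Hypothetical rule set loosely modeled on ODRL's deontic classes
# (permission / prohibition / obligation); not the actual DAOnt vocabulary.
RULES = [
    {"type": "permission", "actor": "user", "action": "access",
     "article": "Art. 4(1)"},
    {"type": "prohibition", "actor": "public_body", "action": "competitive_use",
     "article": "Art. 19(2)(a)"},
    {"type": "obligation", "actor": "data_holder", "action": "frand_terms",
     "article": "Art. 8"},
]

def check_agreement(actions):
    """Flag prohibited actions and unmet obligations in an agreement.

    `actions` is a set of (actor, action) pairs the agreement covers.
    Returns a list of (verdict, article) findings.
    """
    findings = []
    for rule in RULES:
        pair = (rule["actor"], rule["action"])
        if rule["type"] == "prohibition" and pair in actions:
            findings.append(("violates", rule["article"]))
        if rule["type"] == "obligation" and pair not in actions:
            findings.append(("missing obligation", rule["article"]))
    return findings

agreement = {("user", "access"), ("public_body", "competitive_use")}
print(check_agreement(agreement))
```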
[AI-237] StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability
【速读】:该论文旨在解决当前大型语言模型驱动的网页代理(Web Agents)在理想化评估环境中表现出高任务成功率,但其真实场景下鲁棒性可能被高估的问题。现有评估方法多基于稳定且行为良好的交互条件,无法充分反映实际网页交互中的复杂性和不确定性。解决方案的关键在于构建一个诊断式压力测试基准(diagnostic stress-testing benchmark),通过创建可控的、现实的网页环境作为参考基线,并引入结构化的扰动(如布局偏移、交互语义变化和执行中断)来模拟真实世界的交互变异性;在此基础上,比较代理在干净与扰动环境下的行为差异,从而系统性地诊断其在“如果-那么”(what-if)场景下的鲁棒性缺陷,揭示出传统基准未暴露的失败模式和显著性能差距。
链接: https://arxiv.org/abs/2604.16385
作者: Haoyue Bai,Dong Wang,Long Chen,Bingguang Hao,Pengyang Shao,Yonghui Yang,Yicheng He,Chenyi Zhuang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model-based web agents have demonstrated strong performance on realistic web interaction tasks. However, existing evaluations are predominantly conducted under relatively stable and well-behaved interaction conditions, which may overestimate agent robustness. High task success in such idealized settings does not necessarily reflect performance under realistic web interaction. To address this limitation, we introduce a diagnostic stress-testing benchmark for web agents. We first construct realistic and controllable web environments that provide clean and stable interaction workflows as reference baselines. We then introduce structured and controlled perturbations that emulate interaction variability, including shifting layouts, altered interaction semantics, and execution disruptions. By comparing agent behavior between clean and perturbed settings, our framework enables systematic diagnosis of robustness under what-if interaction scenarios. Through extensive evaluation of state-of-the-art multimodal web agents, we show that stress-based evaluation exposes failure modes and substantial robustness gaps that remain hidden under clean benchmark conditions.
[AI-238] Same Verdict, Different Reasons: LLM-as-a-Judge and Clinician Disagreement on Medical Chatbot Completeness
【速读】:该论文旨在解决生成式 AI(Generative AI)作为自动评估工具在高风险医疗场景中是否可靠的问题,特别是针对患者可读的医学回复完整性检测。其关键发现是:尽管大语言模型(LLM)作为评判者(LLM-as-a-Judge)在识别不完整医疗回应上表现接近随机水平(AUC 0.49–0.66),且无法在保持高召回率(>90%)的同时减少人工审查负担,表明其不具备临床 triage(分诊)实用性;更深层的问题在于,LLM 判决与临床医生之间存在根本性的完整性标准差异——即使两者结论一致时也极少共享相同依据,而分歧则表现为 LLM 过度标记非关键缺陷(假阳性)或完全遗漏实质性缺失(假阴性)。这一发现揭示了当前 LLM Judges 在医疗评估中难以替代人类专家的根本局限。
链接: https://arxiv.org/abs/2604.16383
作者: Alexandra DeLucia,Heyuan Huang,Sonal Joshi,Mahsa Yarmohammadi,Ahmed Hassoon,Mark Dredze
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-as-a-Judge frameworks are increasingly trusted to automate evaluation in place of human experts, yet their reliability in high-stakes medical contexts remains unproven. We stress-test this assumption for detecting incomplete patient-facing medical responses, evaluating three rubric granularities (General-Likert, Analytical-Rubric, Dynamic-Checklist) and three backbone models across two clinician-annotated datasets, including HealthBench, the largest publicly available benchmark for medical response evaluation. LLM Judges discriminate complete from incomplete responses at or only slightly above chance (AUC 0.49–0.66); at the threshold required to recall 90% of incomplete responses, clinicians must still review the vast majority of the dataset, offering no triage utility. Even when model and clinician verdicts agree, they rarely cite the same explanation; and when they diverge, false positives stem from over-flagging non-essential gaps while false negatives reflect outright detection failures. These results reveal that LLM Judges and clinicians apply fundamentally different completeness standards, a finding that undermines their use as autonomous evaluators or triage filters in clinical settings.
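摘要中"召回 90% 不完整回复时仍需审查绝大部分数据"的 triage 分析,可以用如下示意代码复现其计算方式(假设性实现,分数与标签均为虚构数据):

```python
def review_fraction_at_recall(scores, labels, target_recall=0.9):
    """Fraction of items a clinician must review to catch
    `target_recall` of the incomplete (label=1) responses, when items
    are reviewed in descending order of the judge's incompleteness score.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    need = target_recall * total_pos
    caught = 0
    for k, i in enumerate(order, start=1):
        caught += labels[i]
        if caught >= need:
            return k / len(scores)
    return 1.0

# A near-chance judge: its scores barely separate the two classes,
# so the positives are scattered across the ranking.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1,   0,   0,   0,   1,   0,   0,   0,   1,   0]
print(review_fraction_at_recall(scores, labels))  # 90% of the set must be reviewed
```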
[AI-239] Talk Walk and Market Response: Multimodal Measurement of AI Washing and Its Capital Market Consequences in China
【速读】:该论文旨在解决资本市场中因信息不对称和技术透明度不足导致的“AI洗牌”(AI Washing)问题,即企业通过夸大人工智能(Artificial Intelligence, AI)能力来误导投资者、抬高估值的现象。其解决方案的关键在于构建两个核心指标:一是基于多模态大模型Qwen-VL的AI洗牌风险评分(AI Washing Risk Score, AWRS),用于量化年报与路演材料中文本与图像的一致性偏差;二是利用主成分分析(PCA)整合专利质量、AI无形资产资本化和技术人员薪酬等维度构建实质性投资匹配指数(Material Real-Investment Matching Index, MRMI),以识别真实研发投入。通过这两个指标,研究不仅揭示了虚假宣传与实际投入之间的差距及其对市场定价效率的影响机制,还验证了长期机构投资者如何通过实地调研识别并规避AI洗牌行为,从而为监管科技(RegTech)干预提供实证依据。
链接: https://arxiv.org/abs/2604.16367
作者: Wen Zhanjie,Guo Jingqiao
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 18 pages, 3 figures, 7 tables, academic research paper
Abstract:As artificial intelligence and generative large language models drive industrial upgrading, capital markets increasingly focus on AI-themed listed firms. Information asymmetry and technological opacity lower the cost of exaggerating AI capabilities relative to genuine R&D, spurring widespread AI Washing. Using China’s A-share market from 2018Q1 to 2025Q2, we advance the literature on measurement and mechanism testing. We construct a multimodal AI Washing Risk Score (AWRS) via Qwen-VL to assess text-image consistency in annual reports and roadshows, and a Material Real-Investment Matching Index (MRMI) from patent quality, AI intangible asset capitalization, and technical personnel compensation using PCA. Four findings emerge: (1) AWRS lacks predictive power for future MRMI, with a wider rhetoric-action gap among financially constrained firms; (2) substantive AI investment boosts high-quality patents, while empty rhetoric crowds out industry innovation; (3) long-horizon institutional investors detect AI Washing through site visits and reduce holdings; (4) such divestment triggers analyst downgrades, retail selling, and sharp valuation corrections within 180 days. Results are robust to IV-2SLS and staggered DID using the ChatGPT shock. This study enhances disclosure and pricing-efficiency research and supports RegTech for curbing thematic speculation.
[AI-240] Mapping Recent Shifts in Digital Art via Conference Discourse: AI, XR, the Metaverse and Blockchain/NFTs (2021-2025)
【速读】:该论文旨在解决数字艺术领域内新兴技术主题在学术会议 discourse 中的演变趋势问题,特别是人工智能(AI)、沉浸式技术(包括XR和元宇宙)及区块链与非同质化代币(NFT)三类技术的相对关注度变化。其解决方案的关键在于对2021至2025年六场数字艺术会议的文本数据进行系统性内容分析,量化不同技术主题的贡献比例及其随时间的变化轨迹,从而揭示AI在2022年后迅速崛起并成为主导议题,而沉浸式技术和区块链/NFT相关研究则保持相对稳定或边缘化状态。
链接: https://arxiv.org/abs/2604.16360
作者: Vasileios Komianos,Emmanuel Rovithis,Athanasios Tsipis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 16 pages, 3 figures, 3 tables, Submitted to DCAC
Abstract:This paper presents an analysis of five years (2021 - 2025) of conference discourse across six digital art conferences, aiming to trace thematic shifts associated with the rapid development of emerging technologies, namely artificial intelligence (AI), immersive technologies (including XR and the metaverse), and blockchain technologies and non-fungible tokens (NFTs). The results indicate a marked increase in AI-related contributions, while immersive technologies maintain a relatively stable share of the discourse, and blockchain- and NFT-based works remain marginal. Overall, whereas immersive technologies and blockchain-related topics exhibit relative stability, AI shows a significant rise after 2022, emerging as a dominant theme within digital art conference discourse.
[AI-241] Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在软件工程(Software Engineering, SWE)任务中,由于依赖仅能反映最终结果的终端奖励(如单元测试是否通过)而导致中间行为难以优化的问题。这种二值化反馈无法有效引导多步骤交互过程中的行为调整,从而限制了整体解决方案质量的提升。其关键解决方案是提出一种基于评分标准(rubric-based)的生成式奖励模型(Generative Reward Model, GRM),该模型利用人工设计的评分标准明确指示鼓励或抑制特定行为模式,并通过轨迹筛选(trajectory filtration)机制构建高质量训练数据,进而用于强化微调(Reinforced Fine-Tuning, RFT)。实验证明,相较于仅使用终端得分的拒绝采样方法,该方案能更有效地抑制不良行为并促进有益行为,最终提升测试准确率。
链接: https://arxiv.org/abs/2604.16335
作者: Jiawei Huang,Qingping Yang,Renjie Zheng,Jiaze Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end-to-end fine-tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi-step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high-quality training data collection via trajectory filtration. When used for Reinforced Fine-Tuning (RFT) on SWE Tasks, our approach outperforms terminal-score-only rejection sampling: it more effectively suppresses undesirable patterns while promoting beneficial ones, as confirmed by case analyses, and it ultimately improves final test accuracy.
[AI-242] Preventing overfitting in deep learning using differential privacy
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks)在实际应用中因过拟合(overfitting)而导致泛化能力差的问题,尤其是在训练数据有限的情况下。其解决方案的关键在于引入基于差分隐私(differential privacy)的方法,通过在训练过程中对模型参数或梯度添加噪声,以限制模型对训练数据中细节和噪声的过度学习,从而提升模型在未见数据上的泛化性能。
链接: https://arxiv.org/abs/2604.16334
作者: Alizishaan Anwar Hussein Khatri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Master’s dissertation, State University of New York at Buffalo; first published in 2017
Abstract:The use of Deep Neural Network based systems in the real world is growing. They have achieved state-of-the-art performance on many image, speech and text datasets. They have been shown to be powerful systems that are capable of learning detailed relationships and abstractions from the data. This is a double-edged sword which makes such systems vulnerable to learning the noise in the training set, thereby negatively impacting performance. This is also known as the problem of overfitting, or poor generalization. In a practical setting, analysts typically have limited data to build models that must generalize to unseen data. In this work, we explore the use of a differential-privacy based approach to improve generalization in Deep Neural Networks.
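论文探索的差分隐私训练思路,其单步机制(逐样本梯度裁剪并加噪)可以用纯 Python 粗略示意(假设性的极简实现,参照 DP-SGD 的常见做法,省略了真实系统中的隐私预算核算):

```python
import math
import random

def dp_clip_and_noise(per_example_grads, clip_norm=1.0, noise_mult=1.0, seed=0):
    """One DP-SGD style step: clip each per-example gradient to
    `clip_norm`, sum, add Gaussian noise, and average.

    Bounding each example's contribution and adding noise limits how
    much the model can memorize any single training point, which is
    also what curbs overfitting.
    """
    rng = random.Random(seed)
    d = len(per_example_grads[0])
    summed = [0.0] * d
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(d):
            summed[i] += g[i] * scale
    n = len(per_example_grads)
    sigma = noise_mult * clip_norm
    return [(summed[i] + rng.gauss(0.0, sigma)) / n for i in range(d)]

grads = [[3.0, 4.0], [0.1, 0.2]]  # first gradient (norm 5) gets clipped to norm 1
print(dp_clip_and_noise(grads))
```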
[AI-243] A Discordance-Aware Multimodal Framework with Multi-Agent Clinical Reasoning
【速读】:该论文旨在解决膝骨关节炎(Knee Osteoarthritis, KOA)中影像学结构损伤与患者主观症状(如疼痛)之间存在的不一致性问题,这种不一致使得临床判断和患者分层困难,且现有决策支持系统对此建模不足。解决方案的关键在于提出一个感知不一致性的多模态框架,其核心包括:1)基于基线数据训练多模态模型以预测两类进展任务——仅关节间隙变窄的进展与非进展、以及仅疼痛加重的进展与非进展;2)引入三个模态特异专家模型(CatBoost表格模型、ResNet18提取的MRI图像嵌入、相同架构生成的X光嵌入),并通过堆叠集成融合预测结果;3)利用残差模型从结构特征估计预期疼痛,从而计算出“疼痛-结构不一致性评分”;4)通过多智能体推理层解析这些信号,识别具有临床可解释性的OA表型并生成针对不同表型的管理建议。
链接: https://arxiv.org/abs/2604.16333
作者: Pegah Ahadian,Mingrui Yang,Sixu Chen,Xiaojuan Li,Qiang Guan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Knee osteoarthritis frequently exhibits discordance between structural damage observed in imaging and patient-reported symptoms such as pain. This mismatch complicates clinical interpretation and patient stratification and remains insufficiently modeled in existing decision support systems. We propose a discordance-aware multimodal framework that combines machine learning prediction models with a tool-grounded multi-agent reasoning system. Using baseline data from the FNIH Osteoarthritis Biomarkers Consortium, we trained multimodal models to predict two progression tasks: joint space loss only progression versus non-progression, and pain only progression versus non-progression. The predictive system integrates three modality-specific experts: a CatBoost tabular model using demographic, radiographic, MRI-derived scalar, and biomarker features; MRI image embeddings extracted using a ResNet18 backbone; and X-ray embeddings derived from the same architecture. Expert predictions are fused using a stacking ensemble. Residual-based models estimate expected pain from structural features, enabling the computation of a pain-structure discordance score between observed and expected symptoms. A multi-agent reasoning layer interprets these signals to assign clinically interpretable OA phenotypes and generate phenotype-specific management recommendations.
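摘要中的"疼痛-结构不一致性评分"本质上是观测疼痛与由结构特征预测的期望疼痛之间的残差,可以用一维线性拟合做最小化示意(玩具数据,仅用单一结构特征代替论文中的多模态残差模型):

```python
def fit_linear(xs, ys):
    """Least-squares line y = a*x + b for one structural feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def discordance_scores(structure, pain):
    """Observed minus expected pain given structural severity.

    Positive: more pain than the structure explains; negative:
    structural damage without matching symptoms.
    """
    a, b = fit_linear(structure, pain)
    return [p - (a * s + b) for s, p in zip(structure, pain)]

structure = [1.0, 2.0, 3.0, 4.0]   # e.g. radiographic severity grade
pain = [1.0, 2.0, 3.0, 8.0]        # last patient: pain far above the structural trend
print(discordance_scores(structure, pain))
```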
[AI-244] UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration
【速读】:该论文旨在解决多变量时间序列预测中长期依赖建模与跨变量交互学习的难题,现有基于Transformer的方法虽能通过注意力机制捕捉时序相关性,但存在二次计算复杂度;而状态空间模型(如Mamba)虽具备高效长程建模能力,却缺乏显式的时序模式识别能力。其解决方案的关键在于提出UniMamba框架,融合高效状态空间动态与注意力驱动的依赖学习:通过引入FFT-Laplace变换和TCN增强的Mamba变体-通道编码层(Mamba Variate-Channel Encoding Layer)捕获全局时序依赖,结合空间-时间注意力层(Spatial Temporal Attention Layer)联合建模变量间相关性和时序演化,并利用前馈时序动态层融合连续与离散上下文信息,从而在准确率与计算效率上均优于当前最优模型。
链接: https://arxiv.org/abs/2604.16325
作者: Xingsheng Chen,Xianpei Mu,Deyu Yi,Yilin Yuan,Xingwei He,Bo Gao,Regina Zhang,Pietro Lio,Siu-Ming Yiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multivariate time series forecasting is fundamental to numerous domains such as energy, finance, and environmental monitoring, where complex temporal dependencies and cross-variable interactions pose enduring challenges. Existing Transformer-based methods capture temporal correlations through attention mechanisms but suffer from quadratic computational cost, while state-space models like Mamba achieve efficient long-context modeling yet lack explicit temporal pattern recognition. Therefore we introduce UniMamba, a unified spatial-temporal forecasting framework that integrates efficient state-space dynamics with attention-based dependency learning. UniMamba employs a Mamba Variate-Channel Encoding Layer enhanced with FFT-Laplace Transform and TCN to capture global temporal dependencies, and a Spatial Temporal Attention Layer to jointly model inter-variate correlations and temporal evolution. A Feedforward Temporal Dynamics Layer further fuses continuous and discrete contexts for accurate forecasting. Comprehensive experiments on eight public benchmark datasets demonstrate that UniMamba consistently outperforms state-of-the-art forecasting models in both forecasting accuracy and computational efficiency, establishing a scalable and robust solution for long-sequence multivariate time-series prediction.
[AI-245] Beyond the Diff: Addressing Agentic Entropy in Agentic Software Development
【速读】:该论文旨在解决自主编码代理(autonomous coding agents)在软件开发流程中因高操作速度而引发的“代理熵”(agentic entropy)问题,即代理行为与架构意图之间的系统性漂移,传统基于代码差异(code diff-based)和人类可解释AI(HCXAI)方法无法捕捉这种全局性的代理行为偏差。解决方案的关键在于提出一种面向过程的可解释性框架,其核心由三个支柱构成:一致性种子(conformity seeding)、推理监控(reasoning monitoring)和因果图界面(causal graph interface),通过提供意图级别的遥测信息,增强对代理决策演化过程的理解,从而在不增加人工审查负担的前提下,提升开发者对代理行为的认知深度与监督有效性。
链接: https://arxiv.org/abs/2604.16323
作者: Matteo Casserini,Alessandro Facchini,Andrea Ferrario
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Submitted to the ACM CHI Workshop on Human-Centered Explainable AI 2026 (HCXAI26)
Abstract:As autonomous coding agents become deeply embedded in software development workflows, their high operational velocity introduces a critical oversight challenge: the accumulating divergence between agentic actions and architectural intent. We term this process agentic entropy: a systemic drift that traditional code diff-based and HCXAI methods fail to capture, as they address local outputs rather than global agentic behaviour. To close this gap, we propose a process-oriented explainability framework that exposes how agentic decisions unfold across time, tool calls, and architectural boundaries. Built around three pillars (conformity seeding, reasoning monitoring, and a causal graph interface), our approach provides intent-level telemetry that complements, rather than replaces, existing review practices. We demonstrate its relevance across two user profiles: lay users engaged in vibe coding, who gain structural visibility otherwise masked by functional success; and professional developers, who gain richer contextual grounding for code review without increased overhead. By treating cognitive drift as a first-class concern alongside code quality, our framework supports the minimum level of human comprehension required for agentic oversight to remain substantive.
[AI-246] Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution
【速读】:该论文旨在解决大规模指令对齐编码数据生成的难题,尤其在多约束条件下保持逻辑一致性的挑战。其核心问题在于如何高效构建高质量、多样化且逻辑自洽的指令-代码配对数据,以提升大语言模型(LLM)在自动编程中遵循人类指令的能力。解决方案的关键在于提出一个演员-模式协同进化框架(IFCodeEvolve):通过将指令表示为参数化函数模式(parametric function schema),动态实例化约束以构建覆盖广泛指令空间的模式库;利用蒙特卡洛树搜索(MCTS)采样器在该空间中高效探索,并以演员模型反馈作为动态终止信号;进一步引入协同进化机制,基于采样统计结果迭代优化演员模型与模式库,通过模式组合与变异逐步攻克复杂任务。这一方法显著提升了基线模型性能,使32B规模模型达到商用领先水平。
链接: https://arxiv.org/abs/2604.16322
作者: Tinglin Huang,Bo Chen,Xiao Zhang,Kai Shen,Rex Ying
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:Interpreting and following human instructions is a critical capability of large language models (LLMs) in automatic programming. However, synthesizing large-scale instruction-paired coding data remains largely unexplored and is particularly challenging when ensuring logical compatibility among multiple constraints. In this study, we propose IFCodeEvolve, an actor-schema co-evolution framework for instruction following coding data generation. By representing instructions as parametric function schema, we construct a library that covers the vast instruction space via dynamic constraint instantiation. Building upon this, Monte Carlo Tree Search (MCTS) sampler is applied to efficiently navigate this space, utilizing actor model feedback as a dynamic termination signal. Furthermore, to progressively explore challenging problems, we introduce a co-evolving paradigm that iteratively advances both the actor model and the schema library, via schema composition and mutation, based on sampler statistics. Empirical results demonstrate that IFCodeEvolve significantly boosts base model performance, with our 32B model achieving parity with proprietary SOTA models. Additionally, we contribute IFCodeBench, a comprehensive human-verified benchmark equipped with solutions and robust AST-based verification.
[AI-247] How Robustly do LLMs Understand Execution Semantics?
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码理解任务中是否依赖于内部世界模型(internal world models)而非仅基于模式匹配的问题。研究通过标准的程序输出预测任务,考察模型对代码变换(input perturbations)的鲁棒性差异,发现开源推理模型(如DeepSeek-R1系列)虽准确率较低但行为稳定,而前沿模型GPT-5.2在原始数据上表现优异(99%准确率),但在输入扰动下准确率下降20%-24%,表现出显著脆弱性。关键解决方案在于引入扰动测试(perturbation-based evaluation)作为诊断工具,识别并改进异常预测能力不足的问题,从而揭示所有模型在代码理解上的共性局限,并验证扰动分析在评估代码模型可靠性方面的价值。
链接: https://arxiv.org/abs/2604.16320
作者: Claudio Spiess,Prem Devanbu,Earl T. Barr
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower, accuracies (38% to 67%) under code-transformation input perturbations, the frontier model GPT-5.2 exhibits significant brittleness. Despite achieving a near-perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind of exception. We study remedies to address this deficiency in exception prediction, and evaluate the effect of these remedies on the ability to predict non-exception behaviors. Our findings both point to limitations in the way all models understand code, and establish the value of using perturbation to evaluate code models.
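One family of semantics-preserving perturbations of the kind evaluated here is identifier renaming. A minimal sketch using Python's `ast` module (an illustrative transformation only; the paper's exact perturbation set is not reproduced here):

```python
import ast

class RenameVars(ast.NodeTransformer):
    """Rename every variable to an opaque identifier, preserving semantics.

    The mapping is built lazily so each original name maps to exactly one
    new name. (A real perturbation must leave builtins and imports
    untouched; none appear in this toy snippet.)
    """

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

def perturb(source: str) -> str:
    """Return a renamed but behaviorally identical version of `source`."""
    tree = RenameVars().visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+

original = "x = 3\ny = x * x + 1\nresult = y - x\n"
perturbed = perturb(original)

# Both versions compute identical values for the (renamed) bindings,
# so a model that truly tracks execution semantics should predict the
# same outputs for either form.
env_a, env_b = {}, {}
exec(original, env_a)
exec(perturbed, env_b)
```

A model's output prediction can then be scored on both `original` and `perturbed`; a large accuracy gap between the two signals pattern matching rather than execution understanding.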
[AI-248] AI Agents and Hard Choices
【速读】:该论文试图解决当前人工智能(AI)代理在处理“硬选择”(hard choices)时的局限性问题,即当多个目标同时存在且不可通约(incommensurable)时,AI代理如何识别并合理应对此类决策困境。论文指出,现有AI代理作为优化器的根本设计导致两个核心问题:一是“识别问题”(Identification Problem),即基于多目标优化(Multi-Objective Optimisation, MOO)的代理无法识别不可通约性,从而引发阻塞、不可信和不可靠等对齐问题;二是“解决问题”(Resolution Problem),即使识别出不可通约性,代理也缺乏自主权来真正解决这类选择,而只能通过自我修改目标进行任意抉择。解决方案的关键在于提出一种概念性的集成方案(ensemble solution),以突破传统优化框架的结构性限制,并进一步探讨赋予AI更高自主权所涉及的规范性权衡(opaque normative trade-offs)。
链接: https://arxiv.org/abs/2504.15304
作者: Kangyu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 20 pages. v2: Substantially revised and rewritten; now typeset in LaTeX. Reflects the version presented at ACM FAccT 2026 (non-archival track). A revised version is under submission to a journal
Abstract:Can AI agents deal with hard choices – cases where options are incommensurable because multiple objectives are pursued simultaneously? Adopting a technologically engaged approach distinct from existing philosophical literature, I submit that the fundamental design of current AI agents as optimisers creates two limitations: the Identification Problem and the Resolution Problem. First, I demonstrate that agents relying on Multi-Objective Optimisation (MOO) are structurally unable to identify incommensurability. This inability generates three specific alignment problems: the blockage problem, the untrustworthiness problem, and the unreliability problem. I argue that standard mitigations, such as Human-in-the-Loop, are insufficient for many decision environments. As a constructive alternative, I conceptually explore an ensemble solution. Second, I argue that even if the Identification Problem is solved, AI agents face the Resolution Problem: they lack the autonomy to resolve hard choices rather than arbitrarily picking through self-modification of objectives. I conclude by examining the opaque normative trade-offs involved in granting AI this level of autonomy.
[AI-249] Learning the Riccati solution operator for time-varying LQR via Deep Operator Networks
【速读】:该论文旨在解决有限时域线性二次型调节器(Linear Quadratic Regulator, LQR)问题中反复求解微分Riccati方程(differential Riccati equation)带来的高计算成本问题。传统方法需对每个新的系统实例进行数值积分,导致在线计算效率低下。解决方案的关键在于构建一个学习型算子代理(learned operator surrogate),在离线阶段近似映射时间依赖的系统参数到Riccati轨迹的解算子;在线阶段则通过该代理快速生成近似最优反馈控制律,从而将计算负担从重复数值积分转移到一次性学习阶段。该方法不仅实现了显著的计算加速,还通过理论分析建立了算子近似误差与闭环性能、轨迹精度及代价次优性之间的定量关系,并证明了在足够精确的算子近似下,闭环系统的指数稳定性得以保持,为数据驱动控制近似提供了可靠性保障。
链接: https://arxiv.org/abs/2604.18507
作者: Jun Chen,Umberto Biccari,Junmin Wang
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We propose a computational framework for replacing the repeated numerical solution of differential Riccati equations in finite-horizon Linear Quadratic Regulator (LQR) problems by a learned operator surrogate. Instead of solving a nonlinear matrix-valued differential equation for each new system instance, we construct offline an approximation of the associated solution operator mapping time-dependent system parameters to the Riccati trajectory. The resulting model enables fast online evaluation of approximate optimal feedbacks across a wide class of systems, thereby shifting the computational burden from repeated numerical integration to a one-time learning stage. From a theoretical perspective, we establish control-theoretic guarantees for this operator-based approximation. In particular, we derive bounds quantifying how operator approximation errors propagate to feedback performance, trajectory accuracy, and cost suboptimality, and we prove that exponential stability of the closed-loop system is preserved under sufficiently accurate operator approximation. These results provide a framework to assess the reliability of data-driven approximations in optimal control. On the computational side, we design tailored DeepONet architectures for matrix-valued, time-dependent problems and introduce a progressive learning strategy to address scalability with respect to the system dimension. Numerical experiments on both time-invariant and time-varying LQR problems demonstrate that the proposed approach achieves high accuracy and strong generalization across a wide range of system configurations, while delivering substantial computational speedups compared to classical solvers. The method offers an effective and scalable alternative for parametric and real-time optimal control applications.
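For intuition, the object being learned is the map from problem data to the Riccati trajectory. The following scalar sketch shows the classical computation the operator surrogate would replace, via explicit Euler integration marching backward in time (illustrative only; the paper treats matrix-valued, parametric problems with DeepONet architectures):

```python
def riccati_trajectory(a, b, q, r, p_T, T=1.0, n=1000):
    """Integrate the scalar differential Riccati equation

        -dP/dt = 2*a(t)*P + q - (b(t)**2 / r) * P**2,   P(T) = p_T,

    for the finite-horizon LQR cost  integral(q*x^2 + r*u^2) dt + p_T*x(T)^2,
    by explicit Euler steps from t = T back to t = 0.
    Returns (t, P(t)) pairs ordered from t = 0 to t = T.
    """
    dt = T / n
    P = p_T
    traj = [(T, P)]
    for k in range(n, 0, -1):
        t = k * dt
        dPdt = -(2 * a(t) * P + q - (b(t) ** 2 / r) * P ** 2)
        P = P - dt * dPdt  # step backward in time
        traj.append((t - dt, P))
    traj.reverse()
    return traj

# Time-invariant example: dx/dt = -x + u, cost integral(x^2 + u^2).
traj = riccati_trajectory(a=lambda t: -1.0, b=lambda t: 1.0,
                          q=1.0, r=1.0, p_T=0.0)
p0 = traj[0][1]
# The optimal feedback at t = 0 is u = -(b/r) * P(0) * x.
gain0 = 1.0 * p0
```

A learned surrogate would map the coefficient functions `(a, b, q, r)` directly to an approximation of `traj`, avoiding this per-instance integration at deployment time.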
[AI-250] Dissecting AI Trading: Behavioral Finance and Market Bubbles
【速读】:该论文旨在解决人工智能代理(AI agents)在模拟资产市场中如何形成预期并参与交易的问题,特别是其行为模式是否能够复现人类投资者的经典行为偏差及市场动态。解决方案的关键在于利用一个由自主大型语言模型(Large Language Model, LLM)代理组成的开放叫价拍卖市场进行实验,并通过一套二十机制评分框架对代理的推理文本进行分析,从而识别和干预特定的行为机制;研究发现,针对性的提示(prompt)干预可因果性地增强或抑制某些行为特征,显著改变市场泡沫的强度,揭示了AI代理行为与市场均衡动态之间的内在联系。
链接: https://arxiv.org/abs/2604.18373
作者: Shumiao Ouyang,Pengfei Sui
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); General Finance (q-fin.GN)
备注:
Abstract:We study how AI agents form expectations and trade in experimental asset markets. Using a simulated open-call auction populated by autonomous Large Language Model (LLM) agents, we document three main findings. First, AI agents exhibit classic behavioral patterns: a pronounced disposition effect and recency-weighted extrapolative beliefs. Second, these individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices and the positive relationship between disagreement and trading volume. Third, by analyzing the agents’ reasoning text through a twenty-mechanism scoring framework, we show that targeted prompt interventions causally amplify or suppress specific behavioral mechanisms, significantly altering the magnitude of market bubbles.
[AI-251] On The Mathematics of the Natural Physics of Optimization
【速读】:该论文试图解决的问题是:优化算法是否可以被理解为遵循某种“运动自然法则”的系统,以及这些算法能否通过应用此类法则推导出来。其解决方案的关键在于提出一种新的数学物理框架,将优化算法视为隐藏的算法原语(algorithm primitives)所表现出的非牛顿动力学现象;通过将最优控制问题的终端横截条件与优化问题的广义Karush-Kuhn-Tucker(KKT)条件等价映射,使得给定约束优化问题的数据函数生成一个渗透整个隐空间的自然向量场,该场编码了最优性条件信息;进而利用庞特里亚金最小值原理实现“远距离作用”操作,通过哈密顿-雅可比不等式产生局部作用以达成全局结果,并通过控制跳跃耗散由搜索李雅普诺夫函数定义的量子化“能量”,从而生成逆最优算法。
链接: https://arxiv.org/abs/2604.17645
作者: I. M. Ross
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph); Numerical Analysis (math.NA)
备注: J. Nonlinear Var. Anal. 10 (2026), 661-686. this https URL special issue dedicated to Yurii Nesterov on the occasion of his 70th birthday
Abstract:A number of optimization algorithms have been inspired by the physics of Newtonian motion. Here, we ask the question: do algorithms themselves obey some "natural laws of motion," and can they be derived by an application of these laws? We explore this question by positing the theory that optimization algorithms may be considered as some manifestation of hidden algorithm primitives that obey certain universal non-Newtonian dynamics. This natural physics of optimization is developed by equating the terminal transversality conditions of an optimal control problem to the generalized Karush/John-Kuhn-Tucker conditions of an optimization problem. Through this equivalence formulation, the data functions of a given constrained optimization problem generate a natural vector field that permeates an entire hidden space with information on the optimality conditions. An "action-at-a-distance" operation via a Pontryagin-type minimum principle produces a local action to deliver a globalized result by way of a Hamilton-Jacobi inequality. An inverse-optimal algorithm is generated by performing control jumps that dissipate quantized "energy" defined by a search Lyapunov function. Illustrative applications of the proposed theory show that a large number of algorithms can be generated and explained in terms of the new mathematical physics of optimization.
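For reference, the KKT conditions that the theory equates with terminal transversality take the standard textbook form (a standard statement, not reproduced from the paper, which works with the generalized Karush/John variant):

```latex
\begin{aligned}
&\text{For } \min_x f(x) \ \text{ s.t. } g(x) \le 0,\ h(x) = 0,
\text{ a KKT point } (x^*, \mu, \lambda) \text{ satisfies:} \\
&\nabla f(x^*) + \mu^{\top} \nabla g(x^*) + \lambda^{\top} \nabla h(x^*) = 0
&&\text{(stationarity)} \\
&g(x^*) \le 0, \qquad h(x^*) = 0
&&\text{(primal feasibility)} \\
&\mu \ge 0, \qquad \mu_i \, g_i(x^*) = 0 \ \ \forall i
&&\text{(dual feasibility, complementary slackness)}
\end{aligned}
```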
[AI-252] Polarization and Integration in Global AI Research
【速读】:该论文旨在解决全球人工智能(Artificial Intelligence, AI)研究体系中日益加剧的分化与整合趋势不明确的问题,尤其关注中美两国在AI科研合作与影响力上的动态演变。其解决方案的关键在于利用大规模科学出版物数据,通过对比跨国合作网络与引文链接与其随机模拟结果,量化分析过去三十年全球AI研究的极化(polarization)与整合(integration)过程,从而揭示美国和中国已形成两大科研极点,且全球AI研究正围绕这两极演化,同时识别出不同国家群体(如欧洲国家、发展中国家)在其中的不同融合路径。
链接: https://arxiv.org/abs/2604.17602
作者: Luca Gallo,Riccardo Di Clemente,Balázs Lengyel
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注:
Abstract:The AI race amplifies security risks and international tensions. While the US restricts mobility and knowledge flows and challenges regulatory efforts to protect its advantage, China leads initiatives of global governance. Both strategies depend on cross-country relationships in AI innovation; yet, how this system evolves is unclear. Here, we measure the processes of polarization and integration in the global AI research over three decades by using large-scale data of scientific publications. Comparing cross-country collaboration and citation links to their random realizations, we find that the US and China have long diverged in both dimensions, forming two poles around which global AI research increasingly revolves. While the United Kingdom and Germany have integrated exclusively with the US, many European countries have converged with both poles. Developing and further developed countries, however, only integrate with China, signaling its expanding influence over the international AI research landscape. Our results inform national science policies and efforts toward global AI regulations.
[AI-253] Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration
【速读】:该论文旨在解决传统Q值迭代(Q-VI)在收敛过程中对最优策略识别时间与整体Q函数收敛速率之间不一致的问题,尤其是当关注点不仅限于最终收敛至最优Q函数 $ Q^* $,还涉及诱导的贪心策略何时达到实际最优时。传统基于压缩映射的分析仅提供粗略的收敛刻画,无法揭示Q-VI轨迹的几何结构。解决方案的关键在于引入“实际最优解集”(Practically Optimal Solution Set, X∗),并借助切换系统理论重新审视折扣Q-VI,发现其具有两阶段几何行为:首先在有限步内识别出最优动作类别(即收敛至X∗的一个子集X1),随后以由受限切换族的联合谱半径(Joint Spectral Radius, JSR)决定的指数速率向Q∗收敛;该JSR速率可能严格快于标准折扣因子γ,从而揭示了Q-VI中局部快速收敛与全局慢速收敛共存的内在机制。
链接: https://arxiv.org/abs/2604.17457
作者: Donghwan Lee
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Dynamic programming is one of the most fundamental methodologies for solving Markov decision problems. Among its many variants, Q-value iteration (Q-VI) is particularly important due to its conceptual simplicity and its classical contraction-based convergence guarantee. Despite the central role of this contraction property, it does not fully reveal the geometric structure of the Q-VI trajectory. In particular, when one is interested not only in the final limit Q^* but also in when the induced greedy policy becomes effectively optimal, the standard contraction argument provides only a coarse characterization. To formalize this notion, we denote by \mathcal X^* the set of Q-functions whose corresponding tie-broken greedy policies are optimal, referred to as the practically optimal solution set (POS). In this paper, we revisit discounted Q-VI through the lens of switching system theory and derive new geometric insights into its behavior. In particular, we show that although Q-VI does not reach Q^* in finite time in general, it identifies the optimal action class in finite time. Furthermore, we prove that the distance from the iterate to a particular subset of \mathcal X^* decays exponentially at a rate governed by the joint spectral radius (JSR) of a restricted switching family. This rate can be strictly faster than the standard \gamma rate when the restricted JSR is strictly smaller than \gamma, while the convergence of the entire Q-function to Q^* can still be dominated by the slower \gamma mode, where \gamma denotes the discount factor. These results reveal a two-stage geometric behavior of Q-VI: a fast convergence toward \mathcal X_1, followed by a slower convergence toward Q^* in general.
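The two-stage behavior is easy to observe numerically: the greedy policy typically locks in long before the Q-values themselves converge. A toy sketch on a made-up 2-state MDP (not from the paper):

```python
# Q-value iteration on a tiny deterministic MDP: in either state, action 1
# moves to state 1 (reward 1 from state 0, reward 2 from state 1); action 0
# moves to state 0 with reward 0. The optimal policy picks action 1 everywhere.
gamma = 0.9
P = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}          # deterministic next state
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}  # immediate rewards

def bellman(Q):
    """One Q-VI sweep: Q(s,a) <- R(s,a) + gamma * max_a' Q(s',a')."""
    return {s: {a: R[s][a] + gamma * max(Q[P[s][a]].values())
                for a in P[s]} for s in P}

def greedy(Q):
    return {s: max(Q[s], key=Q[s].get) for s in Q}

Q = {s: {a: 0.0 for a in P[s]} for s in P}
optimal_policy = {0: 1, 1: 1}

policy_hit = value_hit = None
for it in range(1, 1000):
    newQ = bellman(Q)
    diff = max(abs(newQ[s][a] - Q[s][a]) for s in Q for a in Q[s])
    Q = newQ
    if policy_hit is None and greedy(Q) == optimal_policy:
        policy_hit = it   # optimal action class identified (fast stage)
    if diff < 1e-9:
        value_hit = it    # Q-values numerically converged (slow stage)
        break
```

Here the greedy policy is already optimal after the first sweep, while reaching Q^* to within 1e-9 takes roughly two hundred iterations at gamma = 0.9, mirroring the fast-then-slow geometry described in the abstract.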
[AI-254] Signal or Noise in Multi-Agent LLM-based Stock Recommendations?
【速读】:该论文旨在解决多智能体大语言模型(Large Language Model, LLM)在股票投资组合层面的实证有效性问题,即验证MarketSenseAI这一部署型多智能体LLM股权系统是否能产生显著超额收益,并揭示其内部代理结构如何贡献alpha来源。解决方案的关键在于:首先,通过实时生成信号并严格避免前瞻偏差(look-ahead bias),构建一个可复现的端到端投资决策流程;其次,利用非负最小二乘投影将策略论点嵌入向量映射至各专业代理(新闻、基本面、动态、宏观)嵌入空间,识别出一种自适应整合机制(adaptive-integration mechanism);最后,发现不同市场阶段下各代理贡献轮动与行业配置及宏观事件高度一致,表明该系统具备对市场状态敏感的动态调参能力,从而在S&P 500和S&P 100样本中均展现出显著的选股能力(如S&P 500组别信息系数ICIR=+0.489,p=0.024)。
链接: https://arxiv.org/abs/2604.17327
作者: George Fatouros,Kostas Metaxas
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
备注: 22 pages, 10 figures
Abstract:We present the first portfolio-level validation of MarketSenseAI, a deployed multi-agent LLM equity system. All signals are generated live at each observation date, eliminating look-ahead bias. The system routes four specialist agents (News, Fundamentals, Dynamics, and Macro) through a synthesis agent that issues a monthly equity thesis and recommendation for each stock in its coverage universe, and we ask two questions: do its buy recommendations add value over both passive benchmarks and random selection, and what does the internal agent structure reveal about the source of the edge? On the S&P 500 cohort (19 months) the strong-buy equal-weight portfolio earns +2.18%/month against a passive equal-weight benchmark of +1.15% (approximating RSP), a +25.2% compound excess, and ranks at the 99.7th percentile of 10,000 Monte Carlo portfolios (p=0.003). The S&P 100 cohort (35 months) delivers a +30.5% compound excess over EQWL with consistent direction but formal significance not reached, limited by the small average selection of ~10 stocks per month. Non-negative least-squares projection of thesis embeddings onto agent embeddings reveals an adaptive-integration mechanism. Agent contributions rotate with market regime (Fundamentals leads on S&P 500, Macro on S&P 100, Dynamics acts as an episodic momentum signal) and this agent rotation moves in lockstep with both the sector composition of strong-buy selections and identifiable macro-calendar events, three independent views of the same underlying adaptation. The recommendation’s cross-sectional Information Coefficient is statistically significant on S&P 500 (ICIR=+0.489, p=0.024). These results suggest that multi-agent LLM equity systems can identify sources of alpha beyond what classical factor models capture, and that the buy signal functions as an effective universe-filter that can sit upstream of any portfolio-construction process.
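The agent-attribution step described above is an ordinary non-negative least-squares (NNLS) problem: express a thesis embedding as a nonnegative combination of agent embeddings. A minimal pure-Python sketch with made-up 3-dimensional "embeddings" (projected gradient descent stands in for a production solver such as `scipy.optimize.nnls`):

```python
def nnls(A, b, steps=20000, lr=0.01):
    """Minimize ||A x - b||^2 subject to x >= 0, where A is a list of rows.

    Projected gradient descent: take a gradient step, then clip negative
    coordinates back to zero (projection onto the nonnegative orthant).
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(steps):
        # residual r = A x - b
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # gradient g = 2 A^T r
        g = [2 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [max(0.0, x[j] - lr * g[j]) for j in range(n)]
    return x

# Columns of A play the role of fixed agent embeddings; b is a thesis
# embedding constructed here as an exact nonnegative mix of the first two.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0]]
b = [2.0, 3.0, 0.0]  # = 2 * agent0 + 3 * agent1 + 0 * agent2
weights = nnls(A, b)
```

The recovered nonnegative `weights` are the per-agent contributions; tracking them month by month is what reveals the regime-dependent agent rotation reported in the abstract.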
[AI-255] Light-Adapted Electroretinogram and Oscillatory Potentials (LEOPs) Dataset for Autism Spectrum Disorder and Typically Developing Individuals
【速读】:该论文旨在解决自闭症谱系障碍(Autism Spectrum Disorder, ASD)及其共病注意缺陷多动障碍(Attention Deficit Hyperactivity Disorder, ADHD)儿童和青少年群体中视网膜功能异常的客观量化问题,尤其关注在明适应条件下(Light-Adapted, LA)的视网膜电图(Electroretinogram, ERG)及振荡电位(Oscillatory Potentials, OPs)特征差异。其解决方案的关键在于构建并公开一个大规模、标准化的LEOPs数据集,包含来自控制组、ASD组和ASD+ADHD组共5309条单次闪光ERG波形与4434条OPs波形,覆盖多站点(澳大利亚与英国)、多种闪光强度(从−0.37到1.20 Td·s),并提供详细的元数据(如受试者人口统计学信息、诊断评分、电极位置图像、时间域ERG参数及OPs总和值),同时配套刺激代码和结构化JSON格式患者级数据,以支持生成式AI(Generative AI)等机器学习方法在视网膜电生理信号分析中的应用。
链接: https://arxiv.org/abs/2604.16981
作者: Paul A. Constable,Dorothy A. Thompson,Irene O. Lee,Lynne Loh,Aleksei Zhdanov,Mikhail Kulyabin,Andreas Maier
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The LEOPs (Light-ERG-Oscillatory Potentials) dataset provides light-adapted (LA) electroretinogram (ERG) and Oscillatory Potentials (OPs) waveforms for typically developing Control, Autism Spectrum Disorder (ASD) and ASD + Attention Deficit Hyperactivity Disorder (ADHD) childhood and adolescent populations. The ERGs were recorded in the right and left eyes with skin electrodes using the handheld RETeval device at two sites in Australia and the United Kingdom. The LEOPs dataset includes 5309 single flash ERG and 4434 OPs waveforms as well as images selected from each participant showing the position of the skin electrode. The LEOPs dataset is constructed from recordings using a 9-step randomized flash series from -0.37 to 1.20 Td.s, a 2-step series at 113 and 446 Td.s flash strengths (2500 Control, 1730 ASD and 451 ASD + ADHD samples), as well as the 85 Td.s (Light Adapted 3 cd.s.m^-2 (LA3)) equivalent International Society of Clinical Electrophysiology of Vision (ISCEV) Standard flash with 435 Control, 176 ASD and 37 ASD + ADHD waveform samples. Code for the stimulus is provided along with participant demographics, date and time of testing, and, where available, diagnostic scores for the ASD and ASD + ADHD groups, alongside iris color, electrode position with image files and time domain values for the ERG and summed values for the OPs. The repository contains an Excel file, exported JSON files at the patient level that are more suitable for machine learning tasks, images of electrode position for each recording, and the protocol files for use with the RETeval.
[AI-256] ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design ACL2026
【速读】:该论文旨在解决生成式 AI (Generative AI) 在蛋白质设计中面临的“计划-执行鸿沟”(plan-execute gap)问题,即大语言模型(LLM)虽能生成符合自然语言描述的初步设计方案,但在有限监督下难以稳定产出具有实际功能的蛋白质序列。其解决方案的关键在于提出 ProtoCycle 框架,该框架通过将 LLM 作为规划器与轻量级工具环境结合,形成多轮反馈驱动的决策循环,并利用 LLM 对工具反馈进行自我反思以迭代优化设计策略,从而在保持良好折叠能力的同时显著提升序列质量。
链接: https://arxiv.org/abs/2604.16896
作者: Yutang Ge,Guojiang Zhao,Sihang Li,Zheng Cheng,Zifeng Zhao,Hanchen Xia,Guolin Ke,Linfeng Zhang,Zhifeng Gao,Yuguang Wang
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: 25 pages, 11 figures. Accepted to Findings of ACL 2026
Abstract:Designing proteins that satisfy natural language functional requirements is a central goal in protein engineering. A straightforward baseline is to fine-tune generic instruction-tuned LLMs as direct text-to-sequence generators, but this is data- and compute-hungry. With limited supervision, LLMs can produce coherent plans in text yet fail to reliably realize them as sequences. This plan-execute gap motivates ProtoCycle, an agentic framework for protein design that uses LLMs primarily to drive a multi-round, feedback-driven decision cycle. ProtoCycle couples an LLM planner with a lightweight tool environment designed to emulate the iterative workflow of human protein engineering and uses LLM-driven reflection on tool feedback to revise plans. Trained with supervised trajectories and online reinforcement learning, ProtoCycle achieves strong language alignment while maintaining competitive foldability, and ablations show that reflection substantially improves sequence quality.
[AI-257] Robustifying and Selecting Cohort-Appropriate Prognostic Models under Distributional Shifts
【速读】:该论文旨在解决当前 prognostic model(预后模型)外部校准(external calibration)与模型泛化能力(generalizability)之间关系的误解问题,即认为成功外部校准即可保证模型在不同人群中的可迁移性(transportability)。研究发现,外部校准性能随训练与验证队列间协变量和结局分布差异(以 Kullback-Leibler (KL) 散度衡量)的增加而显著下降,表明模型的可迁移性不仅依赖于校准效果,更受数据分布一致性的制约。解决方案的关键在于提出两种互补策略:一是从模型开发者视角出发,通过元分析推导的目标人群分布对模型进行加权训练,提升其在广泛人群中的平均校准性能;二是从终端用户视角出发,基于队列结局相似性指标筛选最适合特定目标队列的已有模型,从而兼顾校准准确性和临床实用性(如决策曲线分析 DCA)。
链接: https://arxiv.org/abs/2604.16537
作者: Dimitris Bertsimas,Carol Gao,Angelos G. Koulouras,Georgios Antonios Margonis
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:
Abstract:External validation is widely regarded as the gold standard for prognostic model evaluation. In this study, we challenge the assumption that successful external calibration guarantees model generalizability and propose two complementary strategies to improve transportability of prognostic models across cohorts. Using six real-world surgical cohorts from tertiary academic centers, we tested whether successful external calibration depends largely on similarity in covariates and outcomes between training and validation cohorts, quantified using Kullback-Leibler (KL) divergence, with calibration assessed by the Integrated Calibration Index (ICI). From the model-developer’s perspective, we trained the “best-on-average” prognostic model by tuning toward a meta-analysis-derived covariate and outcome distribution as an approximation of the broader target population. From the end-user perspective, we proposed a simple measure for cohort outcome similarity to identify, among published models, the one most suitable for a given target cohort in terms of both calibration and clinical utility. External calibration worsened as distributional mismatch increased. Higher KL divergence was associated with higher ICI in both surgery-alone (Spearman \rho=0.614, p=0.004) and surgery + adjuvant chemotherapy cohorts (Spearman \rho=0.738, p<0.001). Meta-analysis-informed weighting improved calibration in most settings without materially affecting discrimination, with the clearest benefit when evaluated on the aggregated external population (p=0.037). Models developed in more similar cohorts achieved lower ICI in surgery-alone (Spearman \rho=0.803, p<0.001) and surgery + adjuvant chemotherapy cohorts (Spearman \rho=0.737, p<0.001), and provided greater clinical utility on DCA.
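The cohort-mismatch measure above is a KL divergence between covariate/outcome distributions. A minimal sketch for discrete (binned) distributions, with cohorts and bins invented purely for illustration:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as probability lists
    over the same bins; eps guards against empty bins in Q."""
    assert abs(sum(p) - 1) < 1e-9 and abs(sum(q) - 1) < 1e-9
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical binned distributions of one covariate (e.g., age bands)
# in a training cohort versus two candidate validation cohorts.
train   = [0.20, 0.50, 0.30]
similar = [0.25, 0.45, 0.30]
shifted = [0.60, 0.30, 0.10]

# The more dissimilar cohort yields the larger divergence, mirroring the
# paper's finding that calibration (ICI) degrades as KL divergence grows.
d_similar = kl_divergence(train, similar)
d_shifted = kl_divergence(train, shifted)
```

In practice the divergence would be computed over the joint covariate/outcome distribution; this one-covariate version only illustrates the ordering that drives the reported Spearman correlations.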
[AI-258] MLE-Toolbox: An Open-Source Toolbox for Comprehensive EEG and MEG Data Analysis
【速读】:该论文旨在解决多模态脑电生理数据(如磁脑图MEG和脑电图EEG)分析流程复杂、工具分散、缺乏端到端自动化支持的问题。当前研究中,从原始数据预处理到高级功能连接建模及机器学习分类往往依赖多个独立软件平台,导致流程繁琐、可重复性差且对用户技术门槛要求高。解决方案的关键在于开发一个集成化的开源MATLAB工具箱——MLE-Toolbox,其核心优势是将完整的MEG/EEG分析流程封装于统一的图形用户界面(GUI)中,涵盖从原始数据导入、自动伪迹去除(如ICA、SSP、SSS)、源定位(MNE、dSPM、sLORETA、束流成像)、频谱功率分析、相位-幅值耦合(PAC)、图论网络分析到机器学习分类的全链条功能,并通过与Brainstorm、FieldTrip、EEGLAB和FreeSurfer等主流工具的原生互操作性实现工作流扩展,同时提供一键式学术报告生成能力,从而显著降低研究门槛并提升分析的系统性、自动化与可复现性。
链接: https://arxiv.org/abs/2604.16463
作者: Xiaobo Liu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:MLE-Toolbox is a comprehensive open-source MATLAB toolbox for end-to-end analysis of magnetoencephalography (MEG) and electroencephalography (EEG) data. Inspired by widely used neuroimaging platforms such as Brainstorm and FieldTrip, it integrates the full analysis pipeline within a unified and user-friendly graphical interface (GUI), covering raw data import, preprocessing, source localization, functional connectivity, oscillatory analysis, and machine learning-based classification. The toolbox includes automated artifact rejection methods, including independent component analysis (ICA), signal-space projection (SSP), and signal-space separation (SSS); multiple source localization approaches, including minimum norm estimation (MNE), dynamic statistical parametric mapping (dSPM), standardized low-resolution brain electromagnetic tomography (sLORETA), and beamforming; multi-atlas parcellation with anatomical visualization; spectral power analysis with frequency-band brain mapping; phase-amplitude coupling (PAC); graph-theoretic brain network analysis; and integrated machine learning and deep learning classifiers. MLE-Toolbox also provides native interoperability with Brainstorm, FieldTrip, EEGLAB, and FreeSurfer, allowing researchers to build on established workflows while benefiting from additional automation, interactive visualization, and one-click academic report generation. Freely available for non-commercial use, MLE-Toolbox is designed to lower the barrier to rigorous, reproducible MEG/EEG research.
[AI-259] The Breakthrough of Sleep: A Contactless Approach for Accurate Sleep Stage Detection Using the Sleepal AI Lamp
【速读】:该论文旨在解决传统多导睡眠图(Polysomnography, PSG)在睡眠分期中存在侵入性强、劳动密集且不适用于长期监测的问题。其解决方案的关键在于开发了一种基于雷达的非接触式睡眠追踪设备——Sleepal AI Lamp,通过从雷达信号中提取多尺度呼吸与运动相关特征,并结合频率增强的深度学习模型进行睡眠分期。实验表明,该方法在二分类(清醒-睡眠)任务中准确率达92.8%,在四阶段分类(清醒、浅睡N1+N2、深睡N3、快速眼动期REM)中对健康人群和患有不同严重程度阻塞性睡眠呼吸暂停(Obstructive Sleep Apnea, OSA)的异质人群均表现出高一致性(Kappa系数分别为0.695和0.677),验证了非接触雷达传感与先进时序建模相结合可在无需物理接触或可穿戴设备的情况下实现可靠睡眠分期,具有临床筛查、居家评估及长期连续监测的应用潜力。
链接: https://arxiv.org/abs/2604.16442
作者: Zhuo Diao,Yueting Li,Jianpeng Wang,Shengyu Guan,Xinwei Wang,Wenxiong Cui,Xin Shi,Tong Liu,Kailai Sun,Jingyu Wang,Dian Fan,Thomas Penzel
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 12 figures, 4 tables. Preprint version; intended submission to Physiological Measurement
Abstract:Sleep staging is essential for the assessment of sleep quality and the diagnosis of sleep-related disorders. Conventional polysomnography (PSG), while considered the gold standard, is intrusive, labor-intensive, and unsuitable for long-term monitoring. This study evaluates the performance of the Sleepal AI Lamp, a contactless, radar-based consumer-grade sleep tracker, in comparison with gold-standard polysomnography (PSG), using a large-scale dataset comprising 1022 overnight recordings. We extract multi-scale respiratory and motion-related features from radar signals to train a frequency-augmented deep learning model. For the binary sleep-wake classification task, experimental results demonstrated that the model achieved an accuracy of 92.8% alongside a macro-averaged F1 score of 0.895. For four-stage classification (wake, light NREM (N1 + N2), deep NREM (N3), REM), the model achieved an accuracy of 78.5% with a Cohen’s kappa coefficient of 0.695 in healthy individuals and maintained a stable accuracy of 77.2% with a kappa of 0.677 in a heterogeneous population including patients with varying severities of obstructive sleep apnea (OSA). These experimental results demonstrate that the sleep staging performance of the contactless Sleepal AI Lamp is in high agreement with expert-labeled PSG sleep stages. Our findings suggest that non-contact radar sensing, combined with advanced temporal modeling, can provide reliable sleep staging performance without requiring physical contact or wearable devices. Owing to its unobtrusive nature, ease of deployment, and robustness to long-term use, the contactless Sleepal AI Lamp shows strong potential for clinical screening, home-based sleep assessment, and continuous longitudinal sleep monitoring in real-world medical and healthcare applications.
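Cohen's kappa, the agreement statistic reported above, corrects raw accuracy for chance agreement between two raters (here, the lamp versus expert-labeled PSG). A self-contained sketch with made-up stage labels:

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_obs - p_exp) / (1 - p_exp), where p_obs is raw
    agreement and p_exp is the agreement expected by chance from each
    rater's marginal label frequencies."""
    assert len(y_true) == len(y_pred)
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_exp = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy 4-stage example (W=wake, L=light NREM, D=deep NREM, R=REM);
# the label sequences are invented, not taken from the study.
psg  = ["W", "L", "L", "D", "R", "L", "W", "D", "R", "L"]
lamp = ["W", "L", "L", "D", "R", "L", "W", "L", "R", "L"]
kappa = cohens_kappa(psg, lamp)
```

Here raw agreement is 0.9 but chance agreement is 0.3, giving kappa of about 0.857; values near the paper's 0.695 and 0.677 indicate substantial chance-corrected agreement with PSG.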
[AI-260] Sampling Matters: The Effect of ECG Frequency on Deep Learning-Based Atrial Fibrillation Detection
【速读】:该论文旨在解决深度学习模型在房颤(Atrial Fibrillation, AF)检测中因训练数据来自不同采样频率的体表心电图(Electrocardiogram, ECG)而导致性能、校准和鲁棒性不一致的问题。其解决方案的关键在于通过系统性基准测试,对12导联、10秒ECG记录在62 Hz、100 Hz、250 Hz和500 Hz四种目标采样频率下进行重采样,并评估标准一维卷积神经网络(1-D Convolutional Neural Network, CNN)与混合CNN-长短期记忆(CNN-LSTM)架构的表现差异。研究发现,采样频率对模型性能具有显著且架构依赖性的影响:混合CNN-LSTM模型在中等频率(100–250 Hz)时表现最优且校准稳定,而单一CNN模型在500 Hz时因高频频噪声干扰导致准确率和敏感度明显下降,揭示了采样频率作为关键时间分辨率参数的重要性,强调未来心律失常检测的基础模型需明确控制采样频率以保障临床可靠性与可重复性。
链接: https://arxiv.org/abs/2604.16437
作者: Arjan Mahmuod,Adrian Rod Hammerstad,Muzaffar Yousef,Yngve Sebastian Heill,Jonas L. Isaksen,Jørgen K. Kanters,Pal Halvorsen,Vajira Thambawita
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 5 figures, 2 tables. Conference-style paper. Includes reproducible benchmark on PTB-XL using 12-lead 10-second ECGs resampled to 62, 100, 250, and 500 Hz
Abstract:Deep learning models for atrial fibrillation (AF) detection are increasingly trained on heterogeneous electrocardiogram (ECG) datasets with varying sampling frequencies, yet the specific consequences of these discrepancies on model performance, calibration, and robustness remain insufficiently characterized. To address this, we conducted a systematic benchmark using 12-lead, 10-second recordings from the PTB-XL dataset, resampled to target frequencies of 62, 100, 250, and 500 Hz, to evaluate a standard 1-D Convolutional Neural Network (CNN) and a hybrid CNN-Long Short-Term Memory (LSTM) architecture under a rigorous patient-safe cross-validation framework. Our analysis reveals that sampling frequency significantly impacts detection metrics in an architecture-dependent manner; the hybrid CNN-LSTM model demonstrated optimal performance and consistent calibration at intermediate frequencies (100-250 Hz), whereas the 1-D CNN baseline exhibited marked degradation in accuracy and sensitivity at 500 Hz, suggesting increased susceptibility to high-frequency noise. We conclude that ECG sampling frequency is a critical, underappreciated factor in arrhythmia detection, and future foundation models must explicitly control for temporal resolution to ensure clinical reliability and reproducibility.
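A minimal sketch of what "resampled to target frequencies" involves, using plain linear interpolation (a simplification: the benchmark's actual resampler is not specified here, and production pipelines would apply an anti-aliasing filter, e.g. polyphase resampling, before downsampling):

```python
import math

def resample(signal, fs_in, fs_out):
    """Resample a uniformly sampled signal via linear interpolation."""
    duration = (len(signal) - 1) / fs_in          # seconds spanned
    n_out = int(duration * fs_out) + 1
    out = []
    for k in range(n_out):
        t = k / fs_out                            # output sample time
        i = min(int(t * fs_in), len(signal) - 2)  # left source index
        frac = t * fs_in - i                      # position between samples
        out.append(signal[i] * (1 - frac) + signal[i + 1] * frac)
    return out

# A 10 s, 500 Hz sine (roughly heart-rate frequency) taken down to 100 Hz,
# mimicking one of the benchmark's 500 -> 100 Hz conversions.
fs = 500
sig = [math.sin(2 * math.pi * 1.2 * k / fs) for k in range(10 * fs)]
sig_100 = resample(sig, fs, 100)
```

Downsampling without anti-aliasing folds content above the new Nyquist frequency back into the signal, which is one plausible source of the high-frequency brittleness the study observes at 500 Hz versus 100-250 Hz.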
机器学习
[LG-0] A Note on TurboQuant and the Earlier DRIVE/EDEN Line of Work
链接: https://arxiv.org/abs/2604.18555
作者: Ran Ben-Basat,Yaniv Ben-Itzhak,Gal Mendelson,Michael Mitzenmacher,Amit Portnoy,Shay Vargaftik
类目: Machine Learning (cs.LG)
*备注:
Abstract:This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any b > 0 bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant_mse is a special case of EDEN obtained by fixing EDEN’s scalar scale parameter to S=1. EDEN supports both biased and unbiased quantization, each optimized by a different S (chosen via methods described in the EDEN works). The fixed choice S=1 used by TurboQuant is generally suboptimal, although the optimal S for biased EDEN converges to 1 as the dimension grows; accordingly TurboQuant_mse approaches EDEN’s behavior for large d. Second, TurboQuant_prod combines a biased (b-1)-bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its (b-1)-bit step uses the suboptimal S=1; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased (b-1)-bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with b-bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized S) is more accurate than TurboQuant_mse, and unbiased EDEN is markedly more accurate than TurboQuant_prod, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant_prod). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried.
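为直观说明标量 scale 参数对 1-bit 量化 MSE 的影响,下面给出一个玩具示例(假设坐标近似服从随机旋转后的高斯分布;此处直接用符号位加单一 scale 重构,并非 DRIVE/EDEN 的完整实现——EDEN 的 S 是作用在 Lloyd-Max 质心之上的额外因子):

```python
import math, random

def one_bit_quantize(x, S):
    """Toy 1-bit quantizer: keep only sign bits plus one scalar scale.
    Each coordinate is reconstructed as S * (||x|| / sqrt(d)) * sign(x_i)."""
    d = len(x)
    norm = math.sqrt(sum(v * v for v in x))
    return [S * norm / math.sqrt(d) * (1 if v >= 0 else -1) for v in x]

def mse(x, xq):
    return sum((a - b) ** 2 for a, b in zip(x, xq)) / len(x)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(4096)]
# For Gaussian-like coordinates the MSE-optimal multiplier in this toy setup is
# E|g| = sqrt(2/pi) ~= 0.798, so the naive fixed multiplier 1.0 is measurably worse:
print(mse(x, one_bit_quantize(x, 1.0)) > mse(x, one_bit_quantize(x, math.sqrt(2 / math.pi))))
```

这正对应该 note 的核心论点:scale 的选取并非无关紧要,固定取值一般不是最优。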
[LG-1] Physics-Informed Neural Networks for Biological 2D+t Reaction-Diffusion Systems
链接: https://arxiv.org/abs/2604.18548
作者: William Lavery,Jodie A. Cochrane,Christian Olesen,Dagim S. Tadele,John T. Nardini,Sara Hamis
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Physics-informed neural networks (PINNs) provide a powerful framework for learning governing equations of dynamical systems from data. Biologically-informed neural networks (BINNs) are a variant of PINNs that preserve the known differential operator structure (e.g., reaction-diffusion) while learning constitutive terms via trainable neural subnetworks, enforced through soft residual penalties. Existing BINN studies are limited to 1D+t reaction-diffusion systems and focus on forward prediction, using the governing partial differential equation as a regulariser rather than an explicit identification target. Here, we extend BINNs to 2D+t systems within a PINN framework that combines data preprocessing, BINN-based equation learning, and symbolic regression post-processing for closed-form equation discovery. We demonstrate the framework’s real-world applicability by learning the governing equations of lung cancer cell population dynamics from time-lapse microscopy data, recovering 2D+t reaction-diffusion models from experimental observations. The proposed framework is readily applicable to other spatio-temporal systems, providing a practical and interpretable tool for fast analytic equation discovery from data.
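以 Fisher-KPP 型反应扩散方程 u_t = D u_xx + r u(1-u) 为例(该具体方程形式为示意性假设),下面的纯 Python 片段用有限差分在 1D+t 网格上计算 PDE 残差,示意 BINN 软残差罚项的构造方式——实际 BINN 中 D 与反应项由可训练子网络给出:

```python
def pde_residual(u, dt, dx, D, r):
    """Residual u_t - D*u_xx - r*u*(1-u) on interior grid points.
    `u` is a list of time snapshots, each a list of values over x."""
    res = []
    for n in range(1, len(u) - 1):
        row = []
        for i in range(1, len(u[n]) - 1):
            u_t = (u[n + 1][i] - u[n - 1][i]) / (2 * dt)          # central difference in t
            u_xx = (u[n][i + 1] - 2 * u[n][i] + u[n][i - 1]) / dx**2
            row.append(u_t - D * u_xx - r * u[n][i] * (1 - u[n][i]))
        res.append(row)
    return res

# The fixed point u = 1 satisfies the PDE exactly, so the residual vanishes:
flat = [[1.0] * 5 for _ in range(5)]
print(max(abs(v) for row in pde_residual(flat, 0.1, 0.1, D=0.01, r=1.0) for v in row))  # 0.0
```

训练时将该残差的平方和作为软罚项加入损失即可;2D+t 情形只需把 u_xx 换成二维 Laplacian。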
[LG-2] Wasserstein Distributionally Robust Risk-Sensitive Estimation via Conditional Value-at-Risk
链接: https://arxiv.org/abs/2604.18546
作者: Feras Al Taha,Eilyan Bitar
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注: 6 pages, 2 figures
Abstract:We propose a distributionally robust approach to risk-sensitive estimation of an unknown signal x from an observed signal y. The unknown signal and observation are modeled as random vectors whose joint probability distribution is unknown, but assumed to belong to a given type-2 Wasserstein ball of distributions, termed the ambiguity set. The performance of an estimator is measured according to the conditional value-at-risk (CVaR) of the squared estimation error. Within this framework, we study the problem of computing affine estimators that minimize the worst-case CVaR over all distributions in the given ambiguity set. As our main result, we show that, when the nominal distribution at the center of the Wasserstein ball is finitely supported, such estimators can be exactly computed by solving a tractable semidefinite program. We evaluate the proposed estimators on a wholesale electricity price forecasting task using real market data and show that they deliver lower out-of-sample CVaR of squared error compared to existing methods.
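摘要中的风险度量 CVaR(条件在险价值)在经验分布上的含义是“最差 (1-α) 比例损失的均值”。下面给出其经验估计的最小草图(论文采用的是分布鲁棒的半定规划表述,这里仅演示 CVaR 本身的定义):

```python
def cvar(losses, alpha):
    """Empirical CVaR_alpha: average of the worst (1 - alpha) fraction of losses."""
    worst_first = sorted(losses, reverse=True)
    k = max(1, int(round((1 - alpha) * len(worst_first))))
    return sum(worst_first[:k]) / k

squared_errors = [0.1, 0.5, 0.2, 2.0, 0.3, 1.5, 0.4, 0.25, 0.9, 0.05]
print(cvar(squared_errors, 0.8))   # mean of the 2 largest losses -> 1.75
print(cvar(squared_errors, 0.0))   # alpha = 0 recovers the plain mean
```

与最小化均方误差相比,以 CVaR 为目标会更重视尾部的大误差,这正是“risk-sensitive”的含义。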
[LG-3] Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data ACL2026
链接: https://arxiv.org/abs/2604.18493
作者: Zhenwen Liang,Yujun Zhou,Sidi Lu,Xiangliang Zhang,Haitao Mi,Dong Yu
类目: Machine Learning (cs.LG)
*备注: ACL 2026 Main Paper
Abstract:Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. Unlike standard sampling that follows model biases, CUTS flattens the local optimization landscape by sampling uniformly from constrained high-confidence candidates. We integrate this into Mixed-CUTS, a training framework synergizing exploitative and exploratory rollouts to amplify intra-group advantage variance. Experiments on Qwen3 models demonstrate that our approach prevents policy degeneration and significantly boosts out-of-domain generalization. Notably, Mixed-CUTS improves Pass@1 accuracy on the challenging AIME25 benchmark by up to 15.1% over standard GRPO, validating that maintaining diversity within the semantic manifold is critical for rigorous reasoning.
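CUTS 的核心思想“在受约束的高置信候选内做均匀采样以拉平模型偏好”可以用如下草图说明(其中 top-k 截断与置信阈值 tau 均为示意性参数;论文中的约束构造是 parameter-free 的,具体细节以原文为准):

```python
import random

def cuts_sample(probs, k, tau, rng):
    """Sketch of constrained uniform sampling: restrict to the top-k tokens whose
    probability exceeds tau, then draw *uniformly* among them, flattening the
    local optimization landscape instead of following the model's own bias."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    candidates = [i for i in ranked[:k] if probs[i] >= tau]
    if not candidates:                 # nothing clears the bar: fall back to greedy
        return ranked[0]
    return rng.choice(candidates)

rng = random.Random(0)
probs = [0.55, 0.25, 0.15, 0.04, 0.01]
draws = [cuts_sample(probs, k=3, tau=0.1, rng=rng) for _ in range(3000)]
# Tokens 0, 1, 2 each appear ~1/3 of the time despite very unequal probabilities.
print(sorted(set(draws)))  # [0, 1, 2]
```

均匀化采样让同一组 rollout 内出现更多不同的解法,从而放大 GRPO 组内 advantage 的方差。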
[LG-4] Barrier-enforced multi-objective optimization for direct point and sharp interval forecasting
链接: https://arxiv.org/abs/2604.18492
作者: Worachit Amnuaypongsa,Yotsapat Suparanonrat,Pana Wanitchollakit,Jitkomut Songsiri
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 25 pages, 12 figures, 3 tables
Abstract:This paper proposes a multi-step probabilistic forecasting framework using a single neural-network based model to generate simultaneous point and interval forecasts. Our approach ensures non-crossing prediction intervals (PIs) through a model structure design that strictly satisfies a target coverage probability (PICP) while maximizing sharpness. Unlike existing methods that rely on manual weight tuning for scalarized loss functions, we treat point and PI forecasting as a multi-objective optimization problem, utilizing multi-gradient descent to adaptively select optimal weights. Key innovations include a new PI loss function based on an extended log-barrier with an adaptive hyperparameter to guarantee the coverage, a hybrid architecture featuring a shared temporal model with horizon-specific submodels, and a training strategy. The proposed loss is scale-independent and universally applicable; combined with our training algorithm, the framework eliminates trial-and-error hyperparameter tuning for balancing multiple objectives. Validated by an intra-day solar irradiance forecasting application, results demonstrate that our proposed loss consistently outperforms those in current literature by achieving target coverage with the narrowest PI widths. Furthermore, when compared against LSTM encoder-decoder and Transformer architectures–including those augmented with Chronos foundation models–our method remains highly competitive and can be seamlessly adapted to any deep learning structure.
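摘要中“基于扩展对数障碍的 PI 损失”可以按下述草图理解:对 PICP 与目标覆盖率之间的缺口施加障碍函数,缺口越界后切换为线性延伸以保持损失与梯度有限。此处采用 Kervadec 风格的扩展对数障碍作为示意,论文的实际损失(含自适应超参数)可能不同:

```python
import math

def ext_log_barrier(z, t):
    """Extended log-barrier: -log(-z)/t for z <= -1/t^2, linear extension beyond,
    continuous at the switch point so the penalty stays finite."""
    if z <= -1.0 / t**2:
        return -math.log(-z) / t
    return t * z - math.log(1.0 / t**2) / t + 1.0 / t

def pi_loss(y, lower, upper, picp_target=0.9, t=50.0):
    """Sharpness (mean interval width) plus a barrier on the coverage gap."""
    n = len(y)
    width = sum(u - l for l, u in zip(lower, upper)) / n
    picp = sum(l <= yi <= u for yi, l, u in zip(y, lower, upper)) / n
    return width + ext_log_barrier(picp_target - picp, t)

y = [0.0] * 10
covering = pi_loss(y, [-1.0] * 10, [1.0] * 10)    # PICP = 1.0, width = 2
missing = pi_loss(y, [1.0] * 10, [3.0] * 10)      # PICP = 0.0, same width
print(covering < missing)  # True: violated coverage is penalized hard
```

覆盖满足时损失主要由区间宽度(sharpness)主导,覆盖不足时障碍项迅速占优,这正是“保证覆盖、最大化锐度”的取舍。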
[LG-5] Safe Control using Learned Safety Filters and Adaptive Conformal Inference
链接: https://arxiv.org/abs/2604.18482
作者: Sacha Huriot,Ihab Tabbara,Hussein Sibai
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted to L4DC 2026
Abstract:Safety filters have been shown to be effective tools to ensure the safety of control systems with unsafe nominal policies. To address scalability challenges in traditional synthesis methods, learning-based approaches have been proposed for designing safety filters for systems with high-dimensional state and control spaces. However, the inevitable errors in the decisions of these models raise concerns about their reliability and the safety guarantees they offer. This paper presents Adaptive Conformal Filtering (ACoFi), a method that combines learned Hamilton-Jacobi reachability-based safety filters with adaptive conformal inference. Under ACoFi, the filter dynamically adjusts its switching criteria based on the observed errors in its predictions of the safety of actions. The range of possible safety values of the nominal policy’s output is used to quantify uncertainty in safety assessment. The filter switches from the nominal policy to the learned safe one when that range suggests it might be unsafe. We show that ACoFi guarantees that the rate of incorrectly quantifying uncertainty in the predicted safety of the nominal policy is asymptotically upper bounded by a user-defined parameter. This gives a soft safety guarantee rather than a hard safety guarantee. We evaluate ACoFi in a Dubins car simulation and a Safety Gymnasium environment, empirically demonstrating that it significantly outperforms the baseline method that uses a fixed switching threshold by achieving higher learned safety values and fewer safety violations, especially in out-of-distribution scenarios.
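ACoFi 所依赖的自适应共形推断(adaptive conformal inference)在线更新规则可用如下草图表示(Gibbs & Candès 式更新:miscoverage 后放大不确定度水平,覆盖成功则缩小;切换阈值如何由该水平导出属于论文细节,此处不作假设):

```python
def aci_update(alpha_t, alpha_target, gamma, miscovered):
    """One step of adaptive conformal inference: nudge the running level so the
    long-run miscoverage rate tracks the user-defined target."""
    err = 1.0 if miscovered else 0.0
    return alpha_t + gamma * (alpha_target - err)

# A hit (correct coverage) raises alpha slightly; a miss lowers it sharply:
print(round(aci_update(0.10, 0.10, 0.01, miscovered=False), 5))  # 0.101
print(round(aci_update(0.10, 0.10, 0.01, miscovered=True), 5))   # 0.091
```

由于更新量在长期内相互抵消,经验误覆盖率会被渐近地压到目标附近——这正是摘要所述“软安全保证”的来源。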
[LG-6] Physics-Informed Neural Networks: A Didactic Derivation of the Complete Training Cycle
链接: https://arxiv.org/abs/2604.18481
作者: Abdeladhim Tahimi
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 22 pages, 5 figures, companion code at this https URL
Abstract:This paper is a step-by-step, self-contained guide to the complete training cycle of a Physics-Informed Neural Network (PINN) – a topic that existing tutorials and guides typically delegate to automatic differentiation libraries without exposing the underlying algebra. Using a first-order initial value problem with a known analytical solution as a running example, we walk through every stage of the process: forward propagation of both the network output and its temporal derivative, evaluation of a composite loss function built from the ODE residual and the initial condition, backpropagation of gradients – with particular attention to the product rule that arises in hidden layers – and a gradient descent parameter update. Every calculation is presented with explicit, verifiable numerical values using a 1-3-3-1 multilayer perceptron with two hidden layers and 22 trainable parameters. From these concrete examples, we derive general recursive formulas – expressed as sensitivity propagation relations – that extend the gradient computation to networks of arbitrary depth, and we connect these formulas to the automatic differentiation engines used in practice. The trained network is then validated against the exact solution, achieving a relative L^2 error of 4.290 \times 10^-4 using only the physics-informed loss, without any data from the true solution. A companion Jupyter/PyTorch notebook reproduces every manual calculation and the full training pipeline, providing mutual validation between hand-derived and machine-computed gradients.
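论文以一阶初值问题为例手工推导整个训练循环。下面的纯 Python 草图对应其中“物理损失”的构造:以 u' = -u, u(0) = 1 为例(该具体方程为示意性假设),用一个 1-3-1 tanh 网络(比论文的 1-3-3-1 更小)与中心差分近似时间导数:

```python
import math

def mlp(t, params):
    """Tiny 1-3-1 tanh network; params = (w1, b1, w2, b2)."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(w * t + b) for w, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def physics_loss(params, ts, h=1e-5):
    """Composite PINN loss for u' = -u, u(0) = 1: mean squared ODE residual
    (central difference standing in for the hand-derived derivative) plus an
    initial-condition penalty. No data from the true solution is used."""
    res = 0.0
    for t in ts:
        du = (mlp(t + h, params) - mlp(t - h, params)) / (2 * h)
        res += (du + mlp(t, params)) ** 2
    return res / len(ts) + (mlp(0.0, params) - 1.0) ** 2

# A constant network u(t) = 1 satisfies the initial condition but not the ODE,
# so the loss reduces to the mean residual (0 + 1)^2 = 1:
const_params = ([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 1.0)
print(physics_loss(const_params, [0.1, 0.5, 1.0]))  # 1.0
```

论文正文中时间导数与各参数梯度均用链式法则手工展开;此处的中心差分只是为了让草图保持自包含。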
[LG-7] Multi-Scale Reversible Chaos Game Representation: A Unified Framework for Sequence Classification
链接: https://arxiv.org/abs/2604.18477
作者: Sarwan Ali,Taslim Murad
类目: Machine Learning (cs.LG)
*备注:
Abstract:Biological classification with interpretability remains a challenging task. For this, we introduce a novel encoding framework, Multi-Scale Reversible Chaos Game Representation (MS-RCGR), that transforms biological sequences into multi-resolution geometric representations with guaranteed reversibility. Unlike traditional sequence encoding methods, MS-RCGR employs rational arithmetic and hierarchical k-mer decomposition to generate scale-invariant features that preserve complete sequence information while enabling diverse analytical approaches. Our framework bridges three distinct paradigms for sequence analysis: (1) traditional machine learning using extracted geometric features, (2) computer vision models operating on CGR-generated images, and (3) hybrid approaches combining protein language model embeddings with CGR features. Through comprehensive experiments on synthetic DNA and protein datasets encompassing seven distinct sequence classes, we demonstrate that MS-RCGR features consistently enhance classification performance across all paradigms. Notably, our hybrid approach combining pre-trained language model embeddings (ESM2, ProtT5) with MS-RCGR features achieves superior performance compared to either method alone. The reversibility property of our encoding ensures no information loss during transformation, while multi-scale analysis captures patterns ranging from individual nucleotides to complex motif structures. Our results indicate that MS-RCGR provides a flexible, interpretable, and high-performing foundation for biological sequence analysis.
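经典 Chaos Game Representation 将每个核苷酸映射为“向单位正方形对应角点移动一半距离”。下面的草图用 fractions 的有理数运算说明 MS-RCGR 强调的可逆性:最终坐标的二进制展开完整编码了整条序列(多尺度 k-mer 分解等论文细节此处从略):

```python
from fractions import Fraction

def cgr_trail(seq):
    """Chaos Game Representation with exact rational arithmetic: each base moves
    the point halfway toward its corner of the unit square, so no information
    is lost to floating-point rounding."""
    corners = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    x, y = Fraction(1, 2), Fraction(1, 2)
    trail = []
    for base in seq:
        cx, cy = corners[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        trail.append((x, y))
    return trail

print(cgr_trail("AC")[-1])   # (Fraction(1, 8), Fraction(5, 8))
# Reversibility: the last base is identified by the quadrant of the final point
# (here x < 1/2 and y > 1/2, i.e. the 'C' corner), and so on recursively.
```

用浮点数实现时长序列会因舍入而不可逆,这正是摘要强调有理数算术的原因。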
[LG-8] Train Separately Merge Together: Modular Post-Training with Mixture-of-Experts
链接: https://arxiv.org/abs/2604.18473
作者: Jacob Morrison,Sanjay Adhikesaven,Akshita Bhagia,Matei Zaharia,Noah A. Smith,Sewon Min
类目: Machine Learning (cs.LG)
*备注: 9 content pages, 23 pages overall, 3 figures
Abstract:Extending a fully post-trained language model with new domain capabilities is fundamentally limited by monolithic training paradigms: retraining from scratch is expensive and scales poorly, while continued training often degrades existing capabilities. We present BAR (Branch-Adapt-Route), which trains independent domain experts, each through its own mid-training, supervised finetuning, and reinforcement learning pipeline, and composes them via a Mixture-of-Experts architecture with lightweight router training. Unlike retraining approaches that mix all domains and require full reprocessing for any update (with cost scaling quadratically), BAR enables updating individual experts independently with linear cost scaling and no degradation to existing domains. At the 7B scale, with experts for math, code, tool use, and safety, BAR achieves an overall score of 49.1 (averaged across 7 evaluation categories), matching or exceeding re-training baselines (47.8 without mid-training, 50.5 with). We further show that modular training provides a structural advantage: by isolating each domain, it avoids the catastrophic forgetting that occurs when late-stage RL degrades capabilities from earlier training stages, while significantly reducing the cost and complexity of updating or adding a domain. Together, these results suggest that decoupled, expert-based training is a scalable alternative to monolithic retraining for extending language models.
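BAR 的“独立训练、MoE 合成”可用一个 top-1 路由草图说明(此处专家用普通函数代替各自独立后训练的领域模型,路由器为线性打分,均为示意性假设):

```python
def moe_forward(x, experts, router_w):
    """Top-1 MoE routing sketch: a lightweight linear router scores each expert
    and the best-scoring expert alone processes the input. Adding a new domain
    means appending one expert and one router row -- no retraining of the rest."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_w]
    best = max(range(len(experts)), key=lambda i: scores[i])
    return experts[best](x), best

experts = [sum, max]                       # stand-ins for, e.g., "math" and "code" experts
router_w = [[1.0, 0.0], [0.0, 1.0]]        # toy router: feature 0 -> expert 0, etc.
print(moe_forward([2.0, 0.0], experts, router_w))  # (2.0, 0)
print(moe_forward([0.0, 3.0], experts, router_w))  # (3.0, 1)
```

更新某个领域只需替换对应的 expert 并轻量重训路由器,这对应摘要中“线性扩展、互不干扰”的论点。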
[LG-9] NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization ICLR2026
链接: https://arxiv.org/abs/2604.18471
作者: Enshu Liu,Xuefei Ning,Yu Wang,Zinan Lin
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR 2026
Abstract:Discrete diffusion language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive approaches, offering the flexibility to generate tokens in arbitrary orders and the potential of parallel decoding. However, existing heuristic sampling strategies remain inefficient: they choose only a small fraction of the tokens to sample at each step, leaving substantial room for improvement. In this work, we study the problem of token sampling order optimization and demonstrate its significant potential for acceleration. Specifically, we find that fully leveraging correct predictions at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy. Based on this, we propose Neural Indicator Sampling (NI Sampling), a general sampling order optimization framework that utilizes a neural indicator to decide which tokens should be sampled at each step. We further propose a novel trajectory-preserving objective to train the indicator. Experiments on LLaDA and Dream models across multiple benchmarks show that our method achieves up to 14.3 \times acceleration over full-step sampling with negligible performance drop, and consistently outperforms confidence threshold sampling in the accuracy-step trade-off. Code is available at this https URL.
[LG-10] Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling
链接: https://arxiv.org/abs/2604.18464
作者: Yidi Yuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Semantic Tube Prediction (STP) leverages representation geometry to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub-spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we are interested in investigating whether the sampling position can further enhance the semantic structure of multi-step reasoning, and hence affect its geometric impact. We applied STP at consecutive semantic reasoning step boundaries and achieved 168x more accurate multi-step latent prediction than frozen baselines on ProcessBench (3,400 samples), compared to only 4x for the random-token STP. Probing the latent manifold with a learned non-linear predictor reveals that STP-shaped trajectories are smooth curves, not straight lines: a 3-layer MLP reduces prediction error by a further 3-12x over linear extrapolation on step-boundary models. Removing the language modeling loss yields trajectories that are 2x more MLP-predictable than the combined loss, revealing a tradeoff between generation quality and geometric purity. Our results identify sampling position as the critical variable in geometric regularization and establish multi-step latent prediction MSE as a new evaluation metric for this class of methods.
[LG-11] Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective ACL2026
链接: https://arxiv.org/abs/2604.18460
作者: Sijie Mai,Shiqin Han
类目: Machine Learning (cs.LG)
*备注: Accepted by ACL 2026 Main
Abstract:Multimodal affective computing aims to predict humans’ sentiment, emotion, intention, and opinion using language, acoustic, and visual modalities. However, current models often learn spurious correlations that harm generalization under distribution shifts or noisy modalities. To address this, we propose a causal modality-invariant representation (CmIR) learning framework for robust multimodal learning. At its core, we introduce a theoretically grounded disentanglement method that separates each modality into a ‘causal invariant representation’ and an ‘environment-specific spurious representation’ from a causal inference perspective. CmIR ensures that the learned invariant representations retain stable predictive relationships with labels across different environments while preserving sufficient information from the raw inputs via invariance constraint, mutual information constraint, and reconstruction constraint. Experiments across multiple multimodal benchmarks demonstrate that CmIR achieves state-of-the-art performance. CmIR particularly excels on out-of-distribution data and noisy data, confirming its robustness and generalizability.
[LG-12] AutoPPA: Automated Circuit PPA Optimization via Contrastive Code-based Rule Library Learning
链接: https://arxiv.org/abs/2604.18445
作者: Chongxiao Li,Pengwei Jin,Di Huang,Guangrun Sun,Husheng Han,Jianan Mu,Xinyao Zheng,Jiaguo Zhu,Shuyi Xing,Hanjun Wei,Tianyun Ma,Shuyao Cheng,Rui Zhang,Ying Wang,Zidong Du,Qi Guo,Xing Hu
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:Performance, power, and area (PPA) optimization is a fundamental task in RTL design, requiring a precise understanding of circuit functionality and the relationship between circuit structures and PPA metrics. Recent studies attempt to automate this process using LLMs, but neither feedback-based nor knowledge-based methods are efficient enough, as they either design without any prior knowledge or rely heavily on human-summarized optimization rules. In this paper, we propose AutoPPA, a fully automated PPA optimization framework. The key idea is to automatically generate optimization rules that enhance the search for optimal solutions. To do this, AutoPPA employs an Explore-Evaluate-Induce ( E^2I ) workflow that contrasts and abstracts rules from diverse generated code pairs rather than manually defined prior knowledge, yielding better optimization patterns. To make the abstracted rules more generalizable, AutoPPA employs an adaptive multi-step search framework that adopts the most effective rules for a given circuit. Experiments show that AutoPPA outperforms both the manual optimization and the state-of-the-art methods SymRTLO and RTLRewriter.
[LG-13] Scalable Physics-Informed Neural Differential Equations and Data-Driven Algorithms for HVAC Systems
链接: https://arxiv.org/abs/2604.18438
作者: Hanfeng Zhai,Hongtao Qiao,Hassan Mansour,Christopher Laughman
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Adaptation and Self-Organizing Systems (nlin.AO)
*备注: 50 pages, 26 figures
Abstract:We present a scalable, data-driven simulation framework for large-scale heating, ventilation, and air conditioning (HVAC) systems that couples physics-informed neural ordinary differential equations (PINODEs) with differential-algebraic equation (DAE) solvers. At the component level, we learn heat-exchanger dynamics using an implicit PINODE formulation that predicts conserved quantities (refrigerant mass M_r and internal energy E_\text{hx}) as outputs, enabling physics-informed training via automatic differentiation of mass/energy balances. Stable long-horizon prediction is achieved through gradient-stabilized latent evolution with gated architectures and layer normalization. At the system level, we integrate learned components with DAE solvers (IDA and DASSL) that explicitly enforce junction constraints (pressure equilibrium and mass-flow consistency), and we use Bayesian optimization to tune solver parameters for accuracy–efficiency trade-offs. To reduce residual system-level bias, we introduce a lightweight corrector network trained on short trajectory segments. Across dual-compressor and scaled network studies, the proposed approach attains multi-fold speedups over high-fidelity simulation while keeping errors low (MAPE below a few percent) and scales to systems with up to 32 compressor–condenser pairs.
[LG-14] Balance-Guided Sparse Identification of Multiscale Nonlinear PDEs with Small-coefficient Terms
链接: https://arxiv.org/abs/2604.18414
作者: Zhenhua Dang,Lei Zhang,Long Wang,Guowei He
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 32 pages, 7 figures, submitted to Journal of Computational Physics
Abstract:Data-driven discovery of governing equations has advanced significantly in recent years; however, existing methods often struggle in multiscale systems where dynamically significant terms may have small coefficients. Therefore, we propose Balance-Guided SINDy (BG-SINDy) inspired by the principle of dominant balance, which reformulates \ell_0-constrained sparse regression as a term-level \ell_{2,0}-regularized problem and solves it using a progressive pruning strategy. Terms are ranked according to their relative contributions to the governing equation balance rather than their absolute coefficient magnitudes. Based on this criterion, BG-SINDy alternates between least-squares regression and elimination of negligible terms, thereby preserving dynamically significant terms even when their coefficients are small. Numerical experiments on the Korteweg–de Vries equation with a small dispersion coefficient, a modified Burgers equation with vanishing hyperviscosity, a modified Kuramoto–Sivashinsky equation with multiple small-coefficient terms, and a two-dimensional reaction–diffusion system demonstrate the validity of BG-SINDy in discovering small-coefficient terms. The proposed method thus provides an efficient approach for discovering governing equations that contain small-coefficient terms.
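BG-SINDy 的关键在于按“各项对方程平衡的相对贡献”而非系数绝对值大小排序。下面的草图只演示这一排序准则(贡献取 |xi_j|·||theta_j||_2 的归一化份额,为示意性定义,论文中的具体度量可能不同):

```python
def term_contributions(coeffs, library_cols):
    """Rank candidate library terms by their share of the equation balance,
    |xi_j| * ||theta_j||_2, normalized. A term with a tiny coefficient but a
    large-magnitude feature column can still dominate the dynamics."""
    norms = [abs(c) * sum(v * v for v in col) ** 0.5
             for c, col in zip(coeffs, library_cols)]
    total = sum(norms) or 1.0
    return [n / total for n in norms]

# A small-coefficient dispersion-like term (0.01) on a large feature outweighs
# a large coefficient (2.0) on a near-zero feature, so magnitude-based pruning
# would discard exactly the wrong term:
contrib = term_contributions([2.0, 0.01], [[1e-3, 1e-3], [100.0, 100.0]])
print(contrib[1] > contrib[0])  # True
```

渐进剪枝即反复执行“最小二乘回归 → 按该份额剔除贡献最小的项”,直至剩余各项均不可忽略。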
[LG-15] Bridge-Centered Metapath Classification Using R-GCN-VGAE for Disaster-Resilient Maintenance Decisions
链接: https://arxiv.org/abs/2604.18399
作者: Takato Yasuno
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures, 6 tables
Abstract:Daily infrastructure management in preparation for disasters is critical for urban resilience. When bridges remain resilient against disaster-induced external forces, access to hospitals, shops, and residences via metapaths can be sustained, maintaining essential urban functions. However, prioritizing bridge maintenance under limited budgets requires quantifying the multi-dimensional roles that bridges play in disaster scenarios – a challenge that existing single-indicator approaches fail to address. We focus on metapaths from national highways through bridges to buildings (hospitals, shops, residences), constructing a heterogeneous graph with road, bridge, and building layers. A Relation-centric Graph Convolutional Network Variational Autoencoder (R-GCN-VGAE) learns metapath-based feature representations, enabling classification of bridges into disaster-preparedness categories: Supply Chain (commercial logistics), Medical Access (emergency healthcare), and Residential Protection (preventing isolation). Using OSMnx and open data, we validate our methodology on three diverse cities in Ibaraki Prefecture, Japan: Mito (697 bridges), Chikusei (258 bridges), and Moriya (148 bridges), totaling 1,103 bridges. The heterogeneous graph construction from open data enables redefining bridge roles for disaster scenarios, supporting maintenance budget decision-making. Our contributions are: (1) an open-data methodology for constructing urban heterogeneous graphs; (2) a redefinition of bridge roles for disaster scenarios via metapath-based classification; (3) a maintenance budget decision-support methodology; (4) a k-NN tuning strategy validated across diverse city scales; and (5) an empirical demonstration of the superiority of UMAP over t-SNE/PCA for multi-role bridge visualization.
[LG-16] Forecasting Ionospheric Irregularities on GNSS Lines of Sight Using Dynamic Graphs with Ephemeris Conditioning
链接: https://arxiv.org/abs/2604.18379
作者: Mert Can Turkmen,Eng Leong Tan,Yee Hui Lee
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Geophysics (physics.geo-ph); Space Physics (physics.space-ph)
*备注: 14 pages, 8 figures, submitted to IEEE Transactions on Geoscience and Remote Sensing
Abstract:Most data-driven ionospheric forecasting models operate on gridded products, which do not preserve the time-varying sampling structure of satellite-based sensing. We instead model the ionosphere as a dynamic graph over ionospheric pierce points (IPPs), with connectivity that evolves as satellite positions change. Because satellite trajectories are predictable, the graph topology over the forecast horizon can be constructed in advance. We exploit this property to condition forecasts on the future graph structure, which we term ephemeris conditioning. This enables prediction on lines of sight that appear only in the forecast horizon. We evaluate our framework on multi-GNSS (Global Navigation Satellite System) data from a co-located receiver pair in Singapore spanning January 2023 through April 2025. The task is to forecast Rate of TEC Index (ROTI)-defined irregularities at 5-minute cadence up to 2 hours ahead as binary probabilistic classification per node. The resulting model, IonoDGNN, achieves a Brier Skill Score (BSS) of 0.49 and a precision-recall area under the curve (PR-AUC) of 0.75, improving over persistence by 35% in BSS and 52% in PR-AUC, with larger gains at longer lead times. Ablations confirm that graph structure and ephemeris conditioning each contribute meaningfully, with conditioning proving essential for satellites that rise during the forecast horizon (receiver operating characteristic AUC: 0.95 vs.\ 0.52 without). Under simulated coverage dropout, the model retains predictive skill on affected nodes through spatial message passing from observed neighbors. These results suggest that dynamic graph forecasting on evolving lines of sight is a viable alternative to grid-based representations for ionospheric irregularity forecasting. The model and evaluation code will be released upon publication.
[LG-17] Parkinson's Disease Detection via Self-Supervised Dual-Channel Cross-Attention on Bilateral Wrist-Worn IMU Signals
链接: https://arxiv.org/abs/2604.18372
作者: Meheru Zannat
类目: Machine Learning (cs.LG)
*备注: 15 pages, 6 figures
Abstract:Parkinson's disease (PD) is a chronic neurodegenerative disease. It presents with multiple motor symptoms such as tremor, bradykinesia, postural instability, and freezing of gait (FoG). PD is currently diagnosed clinically through physical examination by health-care professionals, which can be time-consuming and highly subjective. Wearable IMU sensors have become a promising gateway for passive monitoring of PD patients. We propose a self-supervised cross-attention encoder that processes bilateral wrist-worn IMU signals from a public dataset called PADS, consisting of three groups, PD (Parkinson's Disease), HC (Healthy Control), and DD (Differential Diagnosis), with a total of 469 subjects. We achieved a mean accuracy of 93.12% for HC vs. PD classification and 87.04% for PD vs. DD classification. The results emphasize the clinical challenge of distinguishing Parkinson's from other neurodegenerative diseases. Self-supervised representation learning with a contrastive InfoNCE loss achieved an accuracy of 93.56% for HC vs. PD and 92.50% for PD vs. DD using only 20% of the labelled data. This demonstrates the effectiveness of our method for transfer learning in clinical use with minimal labels. Real-time applicability was tested by deploying the optimized model, with a mean inference time of 48.32 ms per window on a Raspberry Pi CPU.
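The contrastive InfoNCE objective mentioned in the abstract is standard enough to sketch. Below is a minimal NumPy version, where the two "views" could be embeddings of paired IMU windows from the two wrists; the pairing scheme, encoder, and temperature here are illustrative assumptions, not details from the paper.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Minimal InfoNCE: row i of z_a should match row i of z_b.

    z_a, z_b: (N, D) embeddings of two views of the same N windows
    (e.g. left- and right-wrist IMU segments). Returns the mean loss.
    """
    # L2-normalize so dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature          # (N, N) similarity matrix
    # row-wise log-softmax; the positives sit on the diagonal
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=(8, 16)))
random_pairs = info_nce_loss(z, rng.normal(size=(8, 16)))
assert aligned < random_pairs  # matched views score a lower loss
```

The loss is driven down only when each window embeds close to its own pair and away from the other rows, which is what makes the learned representation useful with few labels.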
[LG-18] Scale-free adaptive planning for deterministic dynamics &amp; discounted rewards ICML2019
链接: https://arxiv.org/abs/2604.18312
作者: Peter L. Bartlett,Victor Gabillon,Jennifer Healey,Michal Valko
类目: Machine Learning (cs.LG)
*备注: 36th International Conference on Machine Learning (ICML 2019)
Abstract:We address the problem of planning in an environment with deterministic dynamics and stochastic rewards with discounted returns. The optimal value function is not known, nor are the rewards bounded. We propose Platypoos, a simple scale-free planning algorithm that adapts to the unknown scale and smoothness of the reward function. We provide a sample complexity analysis for Platypoos that improves upon prior work and holds simultaneously over a broad range of discount factors and reward scales, without the algorithm knowing them. We also establish a matching lower bound showing our analysis is optimal up to constants.
[LG-19] CAARL: In-Context Learning for Interpretable Co-Evolving Time Series Forecasting
链接: https://arxiv.org/abs/2604.18305
作者: Etienne Tajeuna,Patrick Asante Owusu,Armelle Brun,Shengrui Wang
类目: Machine Learning (cs.LG)
*备注: Double-columned, 8 pages, 4 figures
Abstract:In this paper, we investigate forecasting co-evolving time series that feature intricate dependencies and non-stationary dynamics using a Large Language Model (LLM) approach. We propose a novel modeling approach named Context-Aware AR-LLM (CAARL) that provides an interpretable framework to decode the contextual dynamics influencing changes in co-evolving series. CAARL decomposes time series into autoregressive segments, constructs a temporal dependency graph, and serializes this graph into a narrative to allow processing by an LLM. This design yields a chain-of-thought-like reasoning path where intermediate steps capture contextual dynamics and guide forecasts in a transparent manner. By linking prediction to explicit reasoning traces, CAARL enhances interpretability while maintaining accuracy. Experiments on real-world datasets validate its effectiveness, positioning CAARL as a competitive and interpretable alternative to state-of-the-art forecasting methods.
[LG-20] Dissipative Latent Residual Physics-Informed Neural Networks for Modeling and Identification of Electromechanical Systems
链接: https://arxiv.org/abs/2604.18277
作者: Youyuan Long,Gokhan Solak,Arash Ajoudani
类目: Machine Learning (cs.LG)
*备注: Accepted for publication at the 23rd IFAC World Congress 2026
Abstract:Accurate dynamical modeling is essential for simulation and control of embodied systems, yet first-principles models of electromechanical systems often fail to capture complex dissipative effects such as joint friction, stray losses, and structural damping. While residual-learning physics-informed neural networks (PINNs) can effectively augment imperfect first-principles models with data-driven components, the residual terms are typically implemented as unconstrained multilayer perceptrons (MLPs), which may inadvertently inject artificial energy into the system. To more faithfully model the dissipative dynamics, we propose DiLaR-PINN, a dissipative latent residual PINN designed to learn unmodeled dissipative effects in a physically consistent manner. Structurally, the residual network operates only on unmeasurable (latent) state components and is parameterized in a skew-dissipative form that guarantees non-increasing energy for any choice of network parameters. To enable stable and data-efficient training under partial measurability of the state, we further develop a recurrent rollout scheme with a curriculum-based sequence length extension strategy. We validate DiLaR-PINN on a real-world helicopter system and compare it against four baselines: a pure physical model (without a residual network), an unstructured residual MLP, a DiLaR variant with a soft dissipativity constraint, and a black-box LSTM. The results demonstrate that DiLaR-PINN more accurately captures dissipative effects and achieves superior long-horizon extrapolation performance.
[LG-21] Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
链接: https://arxiv.org/abs/2604.18264
作者: Fei Wang,Li Shen,Liang Ding,Chao Xue,Ye Liu,Changxing Ding
类目: Machine Learning (cs.LG)
*备注:
Abstract:Zeroth-Order optimization presents a promising memory-efficient paradigm for fine-tuning Large Language Models by relying solely on forward passes. However, its practical adoption is severely constrained by slow wall-clock convergence and high estimation variance. In this work, we dissect the runtime characteristics of ZO algorithms and identify a critical system bottleneck where the generation of perturbations and parameter updates accounts for over 40% of the training latency. We argue that the standard uniform exploration strategy is fundamentally flawed as it fails to account for the heterogeneous sensitivity of layers in deep networks, resulting in computationally wasteful blind searches. To address this structural mismatch, we propose AdaLeZO, an Adaptive Layer-wise ZO optimization framework. By formulating the layer selection process as a non-stationary Multi-Armed Bandit problem, AdaLeZO dynamically allocates the limited perturbation budget to the most sensitive parameters. We further introduce an Inverse Probability Weighting mechanism based on sampling with replacement, which guarantees unbiased gradient estimation while effectively acting as a temporal denoiser to reduce variance. Extensive experiments on LLaMA and OPT models ranging from 6.7B to 30B parameters demonstrate that AdaLeZO achieves 1.7x to 3.0x wall-clock acceleration compared to state-of-the-art methods. Crucially, AdaLeZO functions as a universal plug-and-play module that seamlessly enhances the efficiency of existing ZO optimizers without incurring additional memory overhead.
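The inverse-probability-weighting step can be illustrated in isolation. The sketch below is a generic Horvitz-Thompson estimator with independent per-layer inclusion, not AdaLeZO's exact with-replacement scheme, and the per-layer values and probabilities are made up:

```python
import numpy as np

def ipw_estimate(values, probs, rng):
    """Horvitz-Thompson estimate of sum(values): include layer i with
    probability probs[i]; reweight included terms by 1/probs[i] so the
    estimate is unbiased even though most layers are skipped."""
    included = rng.random(len(values)) < probs
    return np.sum(np.where(included, values / probs, 0.0))

rng = np.random.default_rng(6)
values = np.array([3.0, 1.0, 0.5, 0.25])   # per-layer contributions (made up)
probs = np.array([0.9, 0.6, 0.3, 0.2])     # sensitive layers sampled more often
estimates = [ipw_estimate(values, probs, rng) for _ in range(20_000)]
assert abs(np.mean(estimates) - values.sum()) < 0.1  # unbiased on average
```

Reweighting removes the bias that an adaptive, non-uniform layer selection would otherwise introduce into the gradient estimate; averaging over steps then acts as the temporal denoiser the abstract describes.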
[LG-22] Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols
链接: https://arxiv.org/abs/2604.18245
作者: Fernando Reitich
类目: Machine Learning (cs.LG)
*备注: 42 pages main paper, 21 pages supplementary material included as ancillary file
Abstract:Large language models are increasingly deployed as protocols: structured multi-call procedures that spend additional computation to transform a baseline answer into a final one. These protocols are evaluated only by end-to-end accuracy, giving limited insight into when they help, when they hurt, and whether their behavior transfers under distribution shift or composition. We propose a paired-outcome measurement interface for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit E_0\in\{0,1\} and a post-step correctness bit E_1\in\{0,1\} , separating correction ( E_0=0\to E_1=1 ) from corruption ( E_0=1\to E_1=0 ) through two rates: c=\Pr(E_1=1\mid E_0=0) and \gamma=\Pr(E_1=0\mid E_0=1) . These rates predict accuracy changes and define a reusable empirical interface testable across seeds, mixtures, and pipelines. We identify three failure mechanisms. Under mixture shift, pooled estimates of (c,\gamma) become biased when calibration and deployment mixtures differ; conditioning on a difficulty proxy restores stability without additional model calls. Under presentation contamination, selection protocols alter the interface through stable presentation artifacts when candidate content is fixed. Under state insufficiency, the correctness bit may not carry enough history for multi-step pipelines to compose predictably; a Markov factorization test identifies when composition is valid and where additional state is needed. When a protocol step passes these diagnostics, it becomes an auditable module: gated by estimated gain, conditioned on a difficulty proxy to correct mixture bias, and composed into multi-step pipelines with predictable accuracy. We demonstrate these ideas on synthetic mathematical tasks and on GSM8K, where the calibrated interface correctly predicts when protocol steps should be activated or suppressed.
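The two rates tie directly to the accuracy change: with empirical rates, \Delta = (1-a_0)\,c - a_0\,\gamma holds as an in-sample identity, where a_0 is baseline accuracy. A minimal sketch with a simulated protocol step (the rates 0.40 and 0.10 below are made up for illustration):

```python
import numpy as np

def two_rate_audit(e0, e1):
    """Estimate the correction rate c, corruption rate gamma, and the
    predicted accuracy change from paired correctness bits."""
    e0, e1 = np.asarray(e0), np.asarray(e1)
    a0 = e0.mean()
    c = e1[e0 == 0].mean()             # P(fixed  | baseline wrong)
    gamma = (1 - e1)[e0 == 1].mean()   # P(broken | baseline right)
    predicted_delta = (1 - a0) * c - a0 * gamma
    return c, gamma, predicted_delta

rng = np.random.default_rng(1)
e0 = rng.integers(0, 2, size=10_000)
# Simulated protocol step: fixes 40% of wrong answers, breaks 10% of right ones.
e1 = np.where(e0 == 1,
              rng.random(10_000) > 0.10,
              rng.random(10_000) < 0.40).astype(int)
c, gamma, delta = two_rate_audit(e0, e1)
observed_delta = e1.mean() - e0.mean()
assert abs(c - 0.40) < 0.05 and abs(gamma - 0.10) < 0.05
assert abs(delta - observed_delta) < 1e-9  # the identity holds exactly in-sample
```

A step is worth activating only when the predicted delta is positive, which is the gating rule the abstract calls "gated by estimated gain".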
[LG-23] FSEVAL: Feature Selection Evaluation Toolbox and Dashboard
链接: https://arxiv.org/abs/2604.18227
作者: Muhammad Rajabinasab,Arthur Zimek
类目: Machine Learning (cs.LG)
*备注:
Abstract:Feature selection is a fundamental machine learning and data mining task, involved with discriminating redundant features from informative ones. It is an attempt to address the curse of dimensionality by removing the redundant features, while unlike dimensionality reduction methods, preserving explainability. Feature selection is conducted in both supervised and unsupervised settings, with different evaluation metrics employed to determine which feature selection algorithm is the best. In this paper, we propose FSEVAL, a feature selection evaluation toolbox accompanied with a visualization dashboard, with the goal to make it easy to comprehensively evaluate feature selection algorithms. FSEVAL aims to provide a standardized, unified, evaluation and visualization toolbox to help the researchers working in the field, conduct extensive and comprehensive evaluation of feature selection algorithms with ease.
[LG-24] An `Inverse' Experimental Framework to Estimate Market Efficiency
链接: https://arxiv.org/abs/2604.18130
作者: Thomas Asikis,Heinrich Nax
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Applications (stat.AP)
*备注:
Abstract:Digital marketplaces processing billions of dollars annually represent critical infrastructure in sociotechnical ecosystems, yet their performance optimization lacks principled measurement frameworks that can inform algorithmic governance decisions regarding market efficiency and fairness from complex market data. By looking at orderbook data from double auction markets alone, because bids and asks do not represent true maximum willingnesses to buy and true minimum willingnesses to sell, there is little an economist can say about the market’s actual performance in terms of allocative efficiency. We turn to experimental data to address this issue, `inverting’ the standard induced value approach of double auction experiments. Our aim is to predict key market features relevant to market efficiency, particularly allocative efficiency, using orderbook data only – specifically bids, asks and price realizations, but not the induced reservation values – as early as possible. Since there is no established model of strategically optimal behavior in these markets, and because orderbook data is highly unstructured, non-stationary and non-linear, we propose quantile-based normalization techniques that help us build general predictive models. We develop and train several models, including linear regressions and gradient boosting trees, leveraging quantile-based input from the underlying supply-demand model. Our models can predict allocative efficiency with reasonable accuracy from the earliest bids and asks, and these predictions improve with additional realized price data. The performance of the prediction techniques varies by target and market type. Our framework holds significant potential for application to real-world market data, offering valuable insights into market efficiency and performance, even prior to any trade realizations.
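The quantile-based normalization idea can be sketched in one common form: replace each raw bid or ask with its empirical quantile rank within a reference window, which removes the price scale. The paper's exact transform is not specified in the abstract, so the version below is an illustrative variant with a made-up bid stream:

```python
import numpy as np

def quantile_rank(window, x):
    """Map a raw value x to its empirical quantile in `window` (0..1)."""
    window = np.sort(np.asarray(window, dtype=float))
    return np.searchsorted(window, x, side="right") / len(window)

# Hypothetical bid stream: prices on an arbitrary scale
bids = [99.1, 100.0, 98.7, 101.2, 100.5, 99.9, 100.8, 98.9]
ranks = [quantile_rank(bids, b) for b in bids]
assert max(ranks) == 1.0 and min(ranks) > 0.0
# The scale drops out: shifting every price leaves the ranks unchanged,
# which is what makes the features comparable across markets.
shifted = [quantile_rank([b + 50 for b in bids], b + 50) for b in bids]
assert ranks == shifted
```

Because ranks depend only on ordering, the same predictive model can be trained across markets whose price levels and volatilities differ wildly.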
[LG-25] LoRaQ: Optimized Low Rank Approximation for 4-bit Quantization
链接: https://arxiv.org/abs/2604.18117
作者: Yann Bouquet,Alireza Khodamoradi,Sophie Yáng Shen,Kristof Denolf,Mathieu Salzmann
类目: Machine Learning (cs.LG)
*备注:
Abstract:Post-training quantization (PTQ) is essential for deploying large diffusion transformers on resource-constrained hardware, but aggressive 4-bit quantization significantly degrades generative performance. Low-rank approximation methods have emerged as a promising solution by appending auxiliary linear branches to restore performance. However, current state-of-the-art approaches assume these branches must retain high precision (W16A16) and rely on heavy, data-dependent calibration for initialization. We challenge both limitations with LoRaQ (Low-Rank Approximated Quantization), a simple, data-free calibration approach that optimizes quantization error compensation. By overcoming the need for high-precision branches, LoRaQ enables the first fully sub-16 bit pipeline, allowing the low-rank branch itself to be quantized. We demonstrate that, at equal memory overhead, LoRaQ outperforms the state-of-the-art methods in their native implementations on Pixart- \Sigma and SANA. We also analyze mixed-precision configurations, showing that setups such as W8A8, W6A6, and W4A8 for the low-rank branch, alongside a W4 main layer, yield superior results while maintaining a fully quantized architecture compatible with modern mixed-precision hardware.
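The generic mechanism behind such auxiliary branches, quantize the weight and then fit a truncated SVD to the quantization residual, can be sketched data-free. This shows the idea only, not LoRaQ's specific optimization or its sub-16-bit branch quantization; the toy 4-bit quantizer is an assumption:

```python
import numpy as np

def fake_quant(w, bits=4):
    """Symmetric uniform quantization (a simple stand-in for a W4 layer)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale).clip(-2 ** (bits - 1), 2 ** (bits - 1) - 1) * scale

def low_rank_compensation(w, bits=4, rank=8):
    """Quantize w, then fit a rank-r branch to the quantization residual."""
    wq = fake_quant(w, bits)
    u, s, vt = np.linalg.svd(w - wq, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]   # (out, r), columns scaled by singular values
    l2 = vt[:rank]                # (r, in)
    return wq, l1, l2

rng = np.random.default_rng(2)
w = rng.normal(size=(64, 64))
wq, l1, l2 = low_rank_compensation(w, bits=4, rank=8)
err_plain = np.linalg.norm(w - wq)
err_comp = np.linalg.norm(w - (wq + l1 @ l2))
assert err_comp < err_plain  # the branch strictly reduces quantization error
```

The truncated SVD is the best rank-r approximation of the residual in Frobenius norm, so the branch can only help; LoRaQ's contribution is making this compensation work when the branch itself is also quantized.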
[LG-26] Generalization Boundaries of Fine-Tuned Small Language Models for Graph Structural Inference
链接: https://arxiv.org/abs/2604.18092
作者: Michal Podstawski
类目: Machine Learning (cs.LG)
*备注:
Abstract:Small language models fine-tuned for graph property estimation have demonstrated strong in-distribution performance, yet their generalization capabilities beyond training conditions remain poorly understood. In this work, we systematically investigate the boundaries of structural inference in fine-tuned small language models along two generalization axes - graph size and graph family distribution - and assess domain-learning capability on real-world graph benchmarks. Using a controlled experimental setup with three instruction-tuned models in the 3-4B parameter class and two graph serialization formats, we evaluate performance on graphs substantially larger than the training range and across held-out random graph families. Our results show that fine-tuned models maintain strong ordinal consistency across structurally distinct graph families and continue to rank graphs by structural properties on inputs substantially larger than those seen during training, with distinct architecture-specific degradation profiles. These findings delineate where fine-tuned small language models generalize reliably, providing empirical grounding for their use in graph-based reasoning tasks.
[LG-27] Towards E-Value Based Stopping Rules for Bayesian Deep Ensembles AISTATS2026
链接: https://arxiv.org/abs/2604.18089
作者: Emanuel Sommer,Rickmer Schulte,Sarah Deubner,Julius Kobialka,David Rügamer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted for presentation at the OPTIMAL Workshop at AISTATS 2026, Tangier, Morocco
Abstract:Bayesian Deep Ensembles (BDEs) represent a powerful approach for uncertainty quantification in deep learning, combining the robustness of Deep Ensembles (DEs) with flexible multi-chain MCMC. While DEs are affordable in most deep learning settings, (long) sampling of Bayesian neural networks can be prohibitively costly. Yet, adding sampling after optimizing the DEs has been shown to yield significant improvements. This leaves a critical practical question: How long should the sequential sampling process continue to yield significant improvements over the initial optimized DE baseline? To tackle this question, we propose a stopping rule based on E-values. We formulate the ensemble construction as a sequential anytime-valid hypothesis test, providing a principled way to decide whether or not to reject the null hypothesis that MCMC offers no improvement over a strong baseline, to early stop the sampling. Empirically, we study this approach for diverse settings. Our results demonstrate the efficacy of our approach and reveal that only a fraction of the full-chain budget is often required.
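The abstract does not give the paper's exact e-process, but a generic anytime-valid stopping rule of this flavor can be sketched: for per-round improvements bounded in [-1, 1] , the wealth process E_t = \prod_t (1 + \lambda x_t) is an e-process under the null of no improvement, and sampling stops once it exceeds 1/\alpha . The improvement distributions below are simulated:

```python
import numpy as np

def evalue_stopping(improvements, lam=0.5, alpha=0.05):
    """Anytime-valid test of H0: mean improvement <= 0.

    improvements: per-round gains in [-1, 1]. E_t = prod(1 + lam * x_t)
    is a nonnegative supermartingale under H0, so stopping when
    E_t >= 1/alpha controls the type-I error at level alpha at any time.
    """
    e = 1.0
    for t, x in enumerate(improvements, start=1):
        e *= 1.0 + lam * x
        if e >= 1.0 / alpha:
            return t, e   # reject H0: continued sampling genuinely helps
    return None, e        # never rejected: stop spending compute

rng = np.random.default_rng(3)
helpful = rng.uniform(0.0, 0.4, size=200)   # true mean improvement 0.2
stop_t, e_final = evalue_stopping(helpful)
assert stop_t is not None and e_final >= 20.0

useless = rng.uniform(-0.2, 0.2, size=200)  # mean-zero gains
stop_none, _ = evalue_stopping(useless)     # rejects with prob <= alpha
```

Ville's inequality guarantees the threshold crossing has probability at most \alpha under the null, no matter when one peeks, which is what makes the rule safe to apply sequentially during MCMC sampling.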
[LG-28] Predicting LLM Compression Degradation from Spectral Statistics
链接: https://arxiv.org/abs/2604.18085
作者: Mingxue (Mercy) Xu
类目: Machine Learning (cs.LG)
*备注: Profoundly assisted by agentic AI
Abstract:Matrix-level low-rank compression is a promising way to reduce the cost of large language models, but running compression and evaluating the resulting models on language tasks can be prohibitively expensive. Can compression-induced degradation be predicted before committing to this compute? We systematically analyze the Qwen3 and Gemma3 model families across four representative low-rank compression methods: vanilla SVD, two ASVD variants, and SVD-LLM. We find that stable rank and information density, measured in bits per parameter, dominate performance degradation. The interaction term \gamma \cdot \bar\rho_s , defined as compression ratio times stable rank, is a robust predictor of accuracy degradation, achieving leave-one-out cross-validation Pearson correlations of 0.890 for attention layers and 0.839 for MLP layers. We provide theoretical intuition for why this predictor succeeds by connecting it to standard SVD truncation bounds and error composition mechanisms in transformer layers. These findings enable a predict-then-compress workflow: compute \gamma \cdot \bar\rho_s from weights, estimate degradation, and invest compute only in desirable configurations.
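Both ingredients of the predictor are cheap to compute from the weights alone: stable rank is \bar\rho_s = \|W\|_F^2 / \|W\|_2^2 , and the predictor multiplies it by the compression ratio \gamma . A minimal sketch (the fitted degradation curve is not in the abstract, so only the predictor itself is computed; the example matrices are synthetic):

```python
import numpy as np

def stable_rank(w):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(w, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

def degradation_predictor(w, compression_ratio):
    """The gamma * stable-rank interaction term from the abstract."""
    return compression_ratio * stable_rank(w)

rng = np.random.default_rng(4)
# A near-low-rank matrix (easy to compress) vs. a near-isotropic one.
low = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))
iso = rng.normal(size=(256, 256))
assert stable_rank(low) < stable_rank(iso)
assert 1.0 <= stable_rank(iso) <= 256.0  # stable rank lies in [1, rank]
# At the same 50% compression ratio, the isotropic layer is predicted
# to degrade more because its spectrum carries no low-rank structure.
assert degradation_predictor(low, 0.5) < degradation_predictor(iso, 0.5)
```

This is the predict-then-compress workflow in miniature: rank layers by \gamma \cdot \bar\rho_s first, and only run the expensive compression and evaluation on the promising configurations.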
[LG-29] Dynamic Risk Assessment by Bayesian Attack Graphs and Process Mining
链接: https://arxiv.org/abs/2604.18080
作者: Francesco Vitale,Simone Guarino,Stefano Perone,Massimiliano Rak,Nicola Mazzocca
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Accepted to the 2026 IEEE International Conference on Cyber Security and Resilience
Abstract:While attack graphs are useful for identifying major cybersecurity threats affecting a system, they do not provide operational support for determining the likelihood of having a known vulnerability exploited, or that critical system nodes are likely to be compromised. In this paper, we perform dynamic risk assessment by combining Bayesian Attack Graphs (BAGs) and online monitoring of system behavior through process mining. Specifically, the proposed approach applies process mining techniques to characterize malicious network traffic and derive evidence regarding the probability of having a vulnerability actively exploited. This evidence is then provided to a BAG, which updates its conditional probability tables accordingly, enabling dynamic assessment of vulnerability exploitation. We apply our method to a cybersecurity testbed instantiating several machines deployed on different subnets and affected by several CVE vulnerabilities. The testbed is stimulated with both benign traffic and malicious behavior, which simulates network attack patterns aimed at exploiting the CVE vulnerabilities. The results indicate that our proposal effectively detects whether vulnerabilities are being actively exploited, allowing for an updated assessment of the probability of system compromise.
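The core update, raising a node's exploitation probability when process mining supplies matching evidence, reduces to a Bayes step per node. A one-node stand-in is sketched below; the prior and likelihoods are made up, and a real BAG propagates such updates through its conditional probability tables rather than treating nodes independently:

```python
def posterior_exploit(prior, p_evidence_given_exploit, p_evidence_given_benign):
    """Single-node Bayesian update: probability that a vulnerability is
    being actively exploited, given observed network-behavior evidence."""
    num = p_evidence_given_exploit * prior
    den = num + p_evidence_given_benign * (1.0 - prior)
    return num / den

# Hypothetical numbers: a prior of 0.2, and process mining that flags
# traffic matching a known attack pattern far more often under active
# exploitation (0.9) than under benign behavior (0.05).
post = posterior_exploit(0.2, 0.9, 0.05)
assert post > 0.8  # the evidence sharply raises the exploitation estimate
```

Repeating this update as monitoring evidence streams in is what makes the risk assessment "dynamic": the same graph yields different compromise probabilities as observed traffic changes.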
[LG-30] Towards Real-Time ECG and EMG Modeling on μNPUs
链接: https://arxiv.org/abs/2604.18067
作者: Josh Millar,Ashok Samraj Thangarajan,Soumyajit Chatterjee,Hamed Haddadi
类目: Machine Learning (cs.LG)
*备注:
Abstract:The miniaturisation of neural processing units (NPUs) and other low-power accelerators has enabled their integration into microcontroller-scale wearable hardware, supporting near-real-time, offline, and privacy-preserving inference. Yet physiological signal analysis has remained infeasible on such hardware; recent Transformer-based models show state-of-the-art performance but are prohibitively large for resource- and power-constrained hardware and incompatible with \mu NPUs due to their dynamic attention operations. We introduce PhysioLite, a lightweight, NPU-compatible model architecture and training framework for ECG/EMG signal analysis. Using learnable wavelet filter banks, CPU-offloaded positional encoding, and hardware-aware layer design, PhysioLite reaches performance comparable to state-of-the-art Transformer-based foundation models on ECG and EMG benchmarks, while being 10% of the size ( \sim 370KB with 8-bit quantization). We also profile its component-wise latency and resource consumption on both the MAX78000 and HX6538 WE2 \mu NPUs, demonstrating its viability for signal analysis on constrained, battery-powered hardware. We release our model(s) and training framework at: this https URL.
[LG-31] Enhancing Anomaly-Based Intrusion Detection Systems with Process Mining
链接: https://arxiv.org/abs/2604.18066
作者: Francesco Vitale,Francesco Grimaldi,Massimiliano Rak,Nicola Mazzocca
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Accepted to the 2026 IEEE International Conference on Cyber Security and Resilience
Abstract:Anomaly-based Intrusion Detection Systems (IDSs) ensure protection against malicious attacks on networked systems. While deep learning-based IDSs achieve effective performance, their limited trustworthiness due to black-box architectures remains a critical constraint. Despite existing explainable techniques offering insight into the alarms raised by IDSs, they lack process-based explanations grounded in packet-level sequencing analysis. In this paper, we propose a method that employs process mining techniques to enhance anomaly-based IDSs by providing process-based alarm severity ratings and explanations for alerts. Our method prioritizes critical alerts and maintains visibility into network behavior, while minimizing disruption by allowing misclassified benign traffic to pass. We apply the method to the publicly available USB-IDS-TC dataset, which includes anomalous traffic affected by different variants of the Slowloris DoS attack. Results show that our method is able to discriminate between low- to very-high-severity alarms while preserving up to 99.94% recall and 99.99% precision, effectively discarding false positives while providing different degrees of severity for the true positives.
[LG-32] Towards a Foundation-Model Paradigm for Aerodynamic Prediction in Three-dimensional Design
链接: https://arxiv.org/abs/2604.18062
作者: Yunjia Yang,Babak Gholami,Caglar Gurbuz,Mohammad Rashed,Nils Thuerey
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Accurate machine-learning models for aerodynamic prediction are essential for accelerating shape optimization, yet remain challenging to develop for complex three-dimensional configurations due to the high cost of generating training data. This work introduces a methodology for efficiently constructing accurate surrogate models for design purposes by first pre-training a large-scale model on diverse geometries and then fine-tuning it with a few more detailed task-specific samples. A Transformer-based architecture, AeroTransformer, is developed and tailored for large-scale training to learn aerodynamics. The methodology is evaluated on transonic wings, where the model is pre-trained on SuperWing, a dataset of nearly 30000 samples with broad geometric diversity, and subsequently fine-tuned to handle specific wing shapes perturbed from the Common Research Model. Results show that, with 450 task-specific samples, the proposed methodology achieves 0.36% error on surface-flow prediction, reducing 84.2% compared to training from scratch. The influence of model configurations and training strategies is also systematically studied to provide guidance on effectively training and deploying such models under limited data and computational budgets. To facilitate reuse, we release the datasets and the pre-trained models at this https URL. An interactive design tool is also built on the pre-trained model and is available online at this https URL.
[LG-33] Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity
链接: https://arxiv.org/abs/2604.18058
作者: Blaise Delaney,Salil Patel,Yuji Xing,Dominic Dootson,Karin Sevegnani
类目: Machine Learning (cs.LG)
*备注: 18 pages, 3 figures
Abstract:We introduce Sonata, a compact latent world model for six-axis trunk IMU representation learning under clinical data scarcity. Clinical cohorts typically comprise tens to hundreds of patients, making web-scale masked-reconstruction objectives poorly matched to the problem. Sonata is a 3.77 M-parameter hybrid model, pre-trained on a harmonised corpus of nine public datasets (739 subjects, 190k windows) with a latent world-model objective that predicts future state rather than reconstructing raw sensor traces. In a controlled comparison against a matched autoregressive forecasting baseline (MAE) on the same backbone, Sonata yields consistently stronger frozen-probe clinical discrimination, prospective fall-risk prediction, and cross-cohort transfer across a 14-arm evaluation suite, while producing higher-rank, more structured latent representations. At 3.77 M parameters the model is compatible with on-device wearable inference, offering a step toward general kinematic world models for neurological assessment.
[LG-34] Variational Autoencoder Domain Adaptation for Cross-System Generalization in ML-Based SOP Monitoring
链接: https://arxiv.org/abs/2604.18035
作者: Leyla Sadighi,Stefan Karlsson,Carlos Natalino,Mojtaba Eshghie,Fehmida Usmani,Eoin Kenny,Lena Wosinska,Paolo Monti,Marija Furdek,Marco Ruffini
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning (ML) models trained to detect physical-layer threats on one optical fiber system often fail catastrophically when applied to a different system, due to variations in operating wavelength, fiber properties, and network architecture. To overcome this, we propose a Domain Adaptation (DA) framework based on a Variational Autoencoder (VAE) that learns a shared representation capturing event signatures common to both systems while suppressing system-specific differences. The shared encoder is first trained on the combined data from two distinct optical systems: a 21 km O-band dark-fiber testbed (System 1) and a 63.4 km C-band live metro ring (System 2). The encoder is then frozen, and a classifier is trained using labels from an individual system. The proposed approach achieves 95.3% and 73.5% cross-system accuracy when moving from System 1 to System 2 and vice versa, respectively. This corresponds to gains of 83.4% and 51% over a fully supervised Deep Neural Network (DNN) baseline trained on a single system, while preserving intra-system performance.
[LG-35] Clusterability-Based Assessment of Potentially Noisy Views for Multi-View Clustering
链接: https://arxiv.org/abs/2604.18024
作者: Mudi Jiang,Jiahui Zhou,Xinying Liu,Zengyou He,Zhikui Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:In multi-view clustering, the quality of different views may vary substantially, and low-quality or degraded views can impair overall clustering performance. However, existing studies mainly address this issue within the clustering process through view weighting or noise-robust optimization, while paying limited attention to data-level assessment before clustering. In this paper, we study the problem of pre-clustering noisy-view analysis in multi-view data from a clusterability perspective. To this end, we propose a Multi-View Clusterability Score (MVCS), which quantifies the strength of latent cluster-related structures in multi-view data through three complementary components: per-view structural clusterability, joint-space clusterability, and cross-view neighborhood consistency. To the best of our knowledge, this is the first clusterability score specifically designed for multi-view data. We further use it to perform potentially noisy view analysis and noisy-view detection before clustering. Extensive experiments on real-world datasets demonstrate that noisy views can significantly degrade clustering performance, and that, compared with existing clusterability measures designed for single-view data, the proposed method more effectively supports noisy-view analysis and detection.
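Of the three components, cross-view neighborhood consistency is the easiest to sketch: points that are neighbors in one view should remain neighbors in another. The Jaccard-overlap form below is one plausible instantiation, not necessarily the paper's exact definition, and the views are synthetic:

```python
import numpy as np

def knn_indices(x, k):
    """Indices of each point's k nearest neighbors (excluding itself)."""
    d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def cross_view_consistency(view_a, view_b, k=5):
    """Mean Jaccard overlap of k-NN sets computed in each view."""
    na, nb = knn_indices(view_a, k), knn_indices(view_b, k)
    overlaps = [len(set(a) & set(b)) / len(set(a) | set(b))
                for a, b in zip(na, nb)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(7)
base = rng.normal(size=(100, 5))
noisy_copy = base + 0.05 * rng.normal(size=(100, 5))   # consistent view
unrelated = rng.normal(size=(100, 5))                  # degraded/noisy view
assert cross_view_consistency(base, noisy_copy) > cross_view_consistency(base, unrelated)
```

A view whose neighborhood structure barely overlaps with the others is exactly the kind of candidate such a score would flag for inspection before clustering.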
[LG-36] Neural Shape Operator Surrogates – Expression Rate Bounds
链接: https://arxiv.org/abs/2604.18012
作者: Helmut Harbrecht,Christoph Schwab
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We prove error bounds for operator surrogates of solution operators for partial differential and boundary integral equations on families of domains which are diffeomorphic to one common reference (or latent) domain D_ref . The pullback of the PDE to D_ref via affine-parametric shape encoding produces a collection of holomorphic parametric PDEs on D_ref . Sufficient conditions for (uniformly with respect to the parameter) well-posedness are given, implying existence, uniqueness and stability of parametric solution families on D_ref . We illustrate the abstract hypotheses by reviewing recent holomorphy results for a suite of elliptic and parabolic PDEs. Quantified parametric holomorphy implies existence of finite-parametric, discrete approximations of the parametric solution families with convergence rates in terms of the number N of parameters. We obtain constructive proofs of existence of Neural and Spectral Operator surrogates for the shape-to-solution maps with error bounds and convergence rate guarantees uniform on the collection of admissible shapes. We admit principal-component shape encoders and frame decoders. Our results support in particular the (empirically reported) ability of neural operators to realize data-to-solution maps for elliptic and parabolic PDEs and BIEs that generalize across parametric families of shapes.
[LG-37] Neural Garbage Collection: Learning to Forget while Learning to Reason
链接: https://arxiv.org/abs/2604.18002
作者: Michael Y. Li,Jubayer Ibn Hamid,Emily B. Fox,Noah D. Goodman
类目: Machine Learning (cs.LG)
*备注:
Abstract:Chain-of-thought reasoning has driven striking advances in language model capability, yet every reasoning step grows the KV cache, creating a bottleneck to scaling this paradigm further. Current approaches manage these constraints on the model’s behalf using hand-designed criteria. A more scalable approach would let end-to-end learning subsume this design choice entirely, following a broader pattern in deep learning. After all, if a model can learn to reason, why can’t it learn to forget? We introduce Neural Garbage Collection (NGC), in which a language model learns to forget while learning to reason, trained end-to-end from outcome-based task reward alone. As the model reasons, it periodically pauses, decides which KV cache entries to evict, and continues to reason conditioned on the remaining cache. By treating tokens in a chain-of-thought and cache-eviction decisions as discrete actions sampled from the language model, we can use reinforcement learning to jointly optimize how the model reasons and how it manages its own memory: what the model evicts shapes what it remembers, what it remembers shapes its reasoning, and the correctness of that reasoning determines its reward. Crucially, the model learns this behavior entirely from a single learning signal - the outcome-based task reward - without supervised fine-tuning or proxy objectives. On Countdown, AMC, and AIME tasks, NGC maintains strong accuracy relative to the full-cache upper bound at 2-3x peak KV cache size compression and substantially outperforms eviction baselines. Our results are a first step towards a broader vision where end-to-end optimization drives both capability and efficiency in language models.
[LG-38] Causally-Constrained Probabilistic Forecasting for Time-Series Anomaly Detection
Link: https://arxiv.org/abs/2604.17998
Authors: Pooyan Khosravinia,João Gama,Bruno Veloso
Subjects: Machine Learning (cs.LG)
*Comments: This work is currently under review for possible publication in the IEEE Access journal. All intellectual property rights are retained by IEEE
Abstract:Anomaly detection in multivariate time series is a central challenge in industrial monitoring, as failures frequently arise from complex temporal dynamics and cross-sensor interactions. While recent deep learning models, including graph neural networks and Transformers, have demonstrated strong empirical performance, most approaches remain primarily correlational and offer limited support for causal interpretation and root-cause localization. This study introduces a causally-constrained probabilistic forecasting framework, the Causally Guided Transformer (CGT), for multivariate time-series anomaly detection, integrating an explicit time-lagged causal graph prior with deep sequence modeling. For each target variable, a dedicated forecasting block employs a hard parent mask derived from causal discovery to restrict the main prediction pathway to graph-supported causes, while a latent Gaussian head captures predictive uncertainty. To leverage residual correlational information without compromising the causal representation, a shadow auxiliary path with stop-gradient isolation and a safety-gated blending mechanism is incorporated to suppress non-causal contributions when reliability is low. Anomalies are identified using negative log-likelihood scores with adaptive streaming thresholding, and root-cause variables are determined through per-dimension probabilistic attribution and counterfactual clamping. Experiments on the ASD and SMD benchmarks indicate that the proposed method achieves state-of-the-art detection performance, with F1-scores of 96.19% on ASD and 95.32% on SMD, and enhances variable-level attribution quality. These findings suggest that causal structural priors can improve both robustness and interpretability in detecting deep anomalies in multivariate sensor systems.
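The hard parent mask is the most concrete of the ingredients above: each target's prediction pathway is restricted to its graph-supported causes. A minimal sketch with a toy linear forecaster (the adjacency matrix and weights below are hypothetical, and the paper's actual blocks are Transformer-based):

```python
import numpy as np

# Hypothetical time-lagged causal graph: parents[j, i] = 1 if series i at the
# previous step is a graph-supported cause of target series j.
parents = np.array([[1, 1, 0],
                    [0, 1, 0],
                    [0, 1, 1]], dtype=float)

W = np.arange(1.0, 10.0).reshape(3, 3)  # toy per-target forecaster weights

def masked_forecast(x_prev, W, parents):
    """One-step forecast in which each target's main prediction pathway is
    restricted to its causal parents by a hard mask on the weights."""
    return (W * parents) @ x_prev

x = np.array([1.0, 2.0, 3.0])
```

The mask guarantees, by construction, that perturbing a non-parent series cannot change a target's main-path forecast, which is what makes graph-supported attribution possible.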
[LG-39] Online Conformal Prediction with Adversarial Semi-bandit Feedback via Regret Minimization
Link: https://arxiv.org/abs/2604.17984
Authors: Junyoung Yang,Kyungmin Kim,Sangdon Park
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:Uncertainty quantification is crucial in safety-critical systems, where decisions must be made under uncertainty. In particular, we consider the problem of online uncertainty quantification, where data points arrive sequentially. Online conformal prediction is a principled online uncertainty quantification method that dynamically constructs a prediction set at each time step. While existing methods for online conformal prediction provide long-run coverage guarantees without any distributional assumptions, they typically assume a full feedback setting in which the true label is always observed. In this paper, we propose a novel learning method for online conformal prediction with partial feedback from an adaptive adversary: a more challenging setup where the true label is revealed only when it lies inside the constructed prediction set. Specifically, we formulate online conformal prediction as an adversarial bandit problem by treating each candidate prediction set as an arm. Building on an existing algorithm for adversarial bandits, our method achieves a long-run coverage guarantee by explicitly establishing its connection to the regret of the learner. Finally, we empirically demonstrate the effectiveness of our method in both independent and identically distributed (i.i.d.) and non-i.i.d. settings, showing that it successfully controls the miscoverage rate while maintaining a reasonable size of the prediction set.
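The reduction above (each candidate prediction set is an arm of an adversarial bandit) has the flavor of an EXP3-style importance-weighted update. The sketch below is a generic EXP3 loop over hypothetical candidate sets with fixed losses, not the paper's algorithm, which couples the loss to coverage under partial feedback.

```python
import numpy as np

def exp3_step(weights, loss_fn, eta, rng):
    """One EXP3-style update: sample an arm in proportion to its weight,
    observe only that arm's loss, and apply an importance-weighted
    exponential update (losses of unplayed arms are estimated as zero)."""
    probs = weights / weights.sum()
    arm = int(rng.choice(len(weights), p=probs))
    loss = loss_fn(arm)                        # e.g. miscoverage + size penalty
    weights = weights * 1.0                    # copy before updating in place
    weights[arm] *= np.exp(-eta * loss / probs[arm])
    return weights / weights.max(), arm        # renormalize to avoid underflow

# Three hypothetical candidate prediction sets; arm 1 incurs the lowest loss.
losses = [0.9, 0.1, 0.8]
rng = np.random.default_rng(0)
w = np.ones(3)
for _ in range(2000):
    w, _ = exp3_step(w, lambda a: losses[a], eta=0.1, rng=rng)
```

In expectation each arm's log-weight drifts at minus eta times its loss, so the lowest-loss candidate set comes to dominate the sampling distribution, which is the mechanism that links regret to long-run coverage.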
[LG-40] Federated Rule Ensemble Method in Medical Data
Link: https://arxiv.org/abs/2604.17956
Authors: Ke Wan,Kensuke Tanioka,Toshio Shimokawa
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:
Abstract:Machine learning has become integral to medical research and is increasingly applied in clinical settings to support diagnosis and decision-making; however, its effectiveness depends on access to large, diverse datasets, which are limited within single institutions. Although integrating data across institutions can address this limitation, privacy regulations and data ownership constraints hinder these efforts. Federated learning enables collaborative model training without sharing raw data; however, most methods rely on complex architectures that lack interpretability, limiting clinical applicability. Therefore, we proposed a federated RuleFit framework to construct a unified and interpretable global model for distributed environments. It integrates three components: preprocessing based on differentially private histograms to estimate shared cutoff values, enabling consistent rule definitions and reducing heterogeneity across clients; local rule generation using gradient boosting decision trees with shared cutoffs; and coefficient estimation via \ell_1 -regularized optimization using a Federated Dual Averaging algorithm for sparse and consistent variable selection. In simulation studies, the proposed method achieved a performance comparable to that of centralized RuleFit while outperforming existing federated approaches. Real-world analysis demonstrated its ability to provide interpretable insights with competitive predictive accuracy. Therefore, the proposed framework offers a practical and effective solution for interpretable and reliable modeling in federated learning environments.
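The first of the three components, shared cutoff estimation from differentially private histograms, can be sketched as clients adding Laplace noise to local bin counts before a server derives common quantile cutoffs. Bin count, epsilon, the quartile choice, and the assumption of a publicly known feature range are all illustrative simplifications.

```python
import numpy as np

def dp_shared_cutoffs(local_datasets, bins, eps, rng):
    """Clients publish Laplace-noised histograms of one feature; the server
    aggregates them and returns quartile-based cutoffs shared by all
    clients, so every client defines rules over identical split points."""
    lo = min(d.min() for d in local_datasets)   # feature range assumed public
    hi = max(d.max() for d in local_datasets)
    edges = np.linspace(lo, hi, bins + 1)
    agg = np.zeros(bins)
    for d in local_datasets:
        counts, _ = np.histogram(d, bins=edges)
        agg += counts + rng.laplace(scale=2.0 / eps, size=bins)
    agg = np.clip(agg, 0.0, None)
    cdf = np.cumsum(agg) / agg.sum()
    # cutoffs at the quartiles of the aggregated (noisy) distribution
    return np.interp([0.25, 0.5, 0.75], cdf, edges[1:])

rng = np.random.default_rng(0)
clients = [rng.normal(0.0, 1.0, 1000), rng.normal(0.5, 1.0, 1000)]
cuts = dp_shared_cutoffs(clients, bins=20, eps=1.0, rng=rng)
```

Sharing cutoffs rather than raw values is what makes the locally generated rules comparable across clients, which the framework's later coefficient-estimation stage relies on.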
[LG-41] Fisher Decorator: Refining Flow Policy via A Local Transport Map
Link: https://arxiv.org/abs/2604.17919
Authors: Xiaoyuan Cheng,Haoyu Wang,Wenxuan Yuan,Ziyan Wang,Zonghao Chen,Li Zeng,Zhuo Sun
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
*Comments:
Abstract:Recent advances in flow-based offline reinforcement learning (RL) have achieved strong performance by parameterizing policies via flow matching. However, they still face critical trade-offs among expressiveness, optimality, and efficiency. In particular, existing flow policies interpret the L_2 regularization as an upper bound of the 2-Wasserstein distance ( W_2 ), which can be problematic in offline settings. This issue stems from a fundamental geometric mismatch: the behavioral policy manifold is inherently anisotropic, whereas the L_2 (or upper bound of W_2 ) regularization is isotropic and density-insensitive, leading to systematically misaligned optimization directions. To address this, we revisit offline RL from a geometric perspective and show that policy refinement can be formulated as a local transport map: an initial flow policy augmented by a residual displacement. By analyzing the induced density transformation, we derive a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, enabling a tractable anisotropic optimization formulation. By leveraging the score function embedded in the flow velocity, we obtain a corresponding quadratic constraint for efficient optimization. Our results reveal that the optimality gap in prior methods arises from their isotropic approximation. In contrast, our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution. Extensive experiments demonstrate state-of-the-art performance across diverse offline RL benchmarks. See project page: this https URL.
[LG-42] M100: An Orchestrated Dataflow Architecture Powering General AI Computing ISCA2026
Link: https://arxiv.org/abs/2604.17862
Authors: Yan Xie,Changkui Mao,Changsong Wu,Chao Lu,Chao Suo,Cheng Qian,Chun Yang,Danyang Zhu,Hengchang Xiong,Hongzhan Lu,Hongzhen Liu,Jiafu Liu,Jie Chen,Jie Dai,Junfeng Tang,Kai Liu,Kun Li,Lipeng Ge,Meng Sun,Min Luo,Peng Chen,Peng Wang,Shaodong Yang,Shibin Tang,Shibo Chen,Weikang Zhang,Xiao Ling,Xiaobo Du,Xin Wu,Yang Liu,Yi Jiang,Yihua Jin,Yin Huang,Yuli Zhang,Zhen Yuan,Zhiyuan Man,Zhongxiao Yao
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*Comments: Accepted to appear at ISCA 2026 Industry Track. 12 pages, 16 figures
Abstract:As deep learning-based AI technologies gain momentum, the demand for general-purpose AI computing architectures continues to grow. While GPGPU-based architectures offer versatility for diverse AI workloads, they often fall short in efficiency and cost-effectiveness. Various Domain-Specific Architectures (DSAs) excel at particular AI tasks but struggle to extend across broader applications or adapt to the rapidly evolving AI landscape. M100 is Li Auto’s response: a performant, cost-effective architecture for AI inference in Autonomous Driving (AD), Large Language Models (LLMs), and intelligent human interactions, domains crucial to today’s most competitive automobile platforms. M100 employs a dataflow parallel architecture, where compiler-architecture co-design orchestrates not only computation but, more critically, data movement across time and space. Leveraging dataflow computing efficiency, our hardware-software co-design improves system performance while reducing hardware complexity and cost. M100 largely eliminates caching: tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, yielding greater efficiency and scalability than cache-based systems. Another key principle was selecting the right operational granularity for scheduling, issuing, and execution across compiler, firmware, and hardware. Recognizing commonalities in AI workloads, we chose the tensor as the fundamental data element. M100 demonstrates general AI computing capability across diverse inference applications, including UniAD (for AD) and LLaMA (for LLMs). Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization, representing a promising direction for future general AI computing.
[LG-43] Efficient Diffusion Models under Nonconvex Equality and Inequality constraints via Landing
Link: https://arxiv.org/abs/2604.17838
Authors: Kijung Jeon,Michael Muehlebach,Molei Tao
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*Comments: 50 pages
Abstract:Generative modeling within constrained sets is essential for scientific and engineering applications involving physical, geometric, or safety requirements (e.g., molecular generation, robotics). We present a unified framework for constrained diffusion models on generic nonconvex feasible sets \Sigma that simultaneously enforces equality and inequality constraints throughout the diffusion process. Our framework incorporates both overdamped and underdamped dynamics for forward and backward sampling. A key algorithmic innovation is a computationally efficient landing mechanism that replaces costly and often ill-defined projections onto \Sigma , ensuring feasibility without iterative Newton solves or projection failures. By leveraging underdamped dynamics, we accelerate mixing toward the prior distribution, effectively alleviating the high simulation costs typically associated with constrained diffusion. Empirically, this approach reduces function evaluations and memory usage during both training and inference while preserving sample quality. On benchmarks featuring equality and mixed constraints, our method achieves comparable sample quality to state-of-the-art baselines while significantly reducing computational cost, providing a practical and scalable solution for diffusion on nonconvex feasible sets.
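The landing idea (replace a costly projection with an attraction term that pulls iterates back toward the feasible set) can be illustrated on the simplest constraint, the unit sphere. This is a generic landing-style update under that one assumed constraint, not the paper's constrained diffusion sampler.

```python
import numpy as np

def landing_step(x, grad, eta=0.05, lam=1.0):
    """One landing update for the constraint ||x|| = 1: move along the
    component of the gradient tangent to the sphere, plus a term that
    attracts ||x||^2 back to 1; no projection or Newton solve is needed."""
    tangential = grad - (grad @ x) / (x @ x) * x   # gradient along the sphere
    normal = (x @ x - 1.0) * x                     # pulls the radius toward 1
    return x - eta * (tangential + lam * normal)

# Minimize f(x) = c . x on the unit sphere; the optimum is x* = -c / ||c||.
c = np.array([1.0, 2.0, 2.0])
x = np.array([1.0, 0.0, 0.0])
for _ in range(2000):
    x = landing_step(x, c)
```

The iterate never needs to sit exactly on the constraint set during the run; the normal term keeps infeasibility decaying, which is what makes the mechanism cheap inside a long forward/backward diffusion simulation.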
[LG-44] TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications
Link: https://arxiv.org/abs/2604.17778
Authors: Pranshav Gajjar,Vijay K Shah
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Large language models (LLMs) are increasingly deployed in the telecommunications domain for critical tasks, relying heavily on Retrieval-Augmented Generation (RAG) to adapt general-purpose models to continuously evolving standards. However, a significant gap exists in evaluating the embedding models that power these RAG pipelines, as general-purpose benchmarks fail to capture the dense, acronym-heavy, and highly cross-referential nature of telecommunications corpora. To address this, we introduce TeleEmbedBench, the first large-scale, multi-corpus embedding benchmark designed specifically for telecommunications. The benchmark spans three heterogeneous corpora: O-RAN Alliance specifications, 3GPP release documents, and the srsRAN open-source codebase, comprising 9,000 question-chunk pairs across three standard chunk sizes (512, 1024, and 2048 tokens). To construct this dataset at scale without manual annotation bottlenecks, we employ a novel automated pipeline where one LLM generates specific queries from text chunks and a secondary LLM validates them across strict criteria. We comprehensively evaluate eight embedding models, spanning standard sentence-transformers and LLM-based embedders. Our results demonstrate that LLM-based embedders, such as Qwen3 and EmbeddingGemma, consistently and significantly outperform traditional sentence-transformers in both retrieval accuracy and robustness against cross-domain interference. Additionally, we introduce TeleEmbedBench-Clean to evaluate model robustness against noisy, incomplete user queries. Finally, our analysis reveals that while domain-specific task instructions improve embedder performance for raw source code, they paradoxically degrade retrieval performance for natural language telecommunications specifications.
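Once validated question-chunk pairs exist, comparing embedders reduces to retrieval accuracy over the shared corpus. A minimal cosine-similarity recall@k scorer (the metric itself, not the benchmark's code) looks like:

```python
import numpy as np

def recall_at_k(queries, chunks, gold, k=5):
    """Fraction of queries whose gold chunk index appears among the top-k
    chunks ranked by cosine similarity of the embeddings."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    top = np.argsort(-(q @ c.T), axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, top)]))

rng = np.random.default_rng(0)
chunks = rng.normal(size=(50, 16))   # stand-in chunk embeddings
queries = chunks[:10]                # each query embeds exactly onto its gold chunk
```

Cross-domain interference, as studied in the benchmark, corresponds to growing the `chunks` pool with off-corpus material and watching whether recall degrades.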
[LG-45] LLM-AUG: Robust Wireless Data Augmentation with In-Context Learning in Large Language Models
Link: https://arxiv.org/abs/2604.17770
Authors: Pranshav Gajjar,Manan Tiwari,Sayanta Seth,Vijay K. Shah
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Data scarcity remains a fundamental bottleneck in applying deep learning to wireless communication problems, particularly in scenarios where collecting labeled Radio Frequency (RF) data is expensive, time-consuming, or operationally constrained. This paper proposes LLM-AUG, a data augmentation framework that leverages in-context learning in large language models (LLMs) to generate synthetic training samples directly in a learned embedding space. Unlike conventional generative approaches that require training task-specific models, LLM-AUG performs data generation through structured prompting, enabling rapid adaptation in low-shot regimes. We evaluate LLM-AUG on two representative tasks: modulation classification using the RadioML 2016.10A dataset and interference classification using the Interference Classification (IC) dataset. Results show that LLM-AUG consistently outperforms traditional augmentation and deep generative baselines across low-shot settings and reaches near-oracle performance using only 15% labeled data. LLM-AUG further demonstrates improved robustness under distribution shifts, yielding a 29.4% relative gain over diffusion-based augmentation at a lower SNR value. On the RadioML and IC datasets, LLM-AUG yields a relative gain of 67.6% and 35.7% over the diffusion-based baseline. The t-SNE visualizations further validate that synthetic samples generated by LLM-AUG better preserve class structure in the embedding space, leading to more consistent and informative augmentations. These results demonstrate that LLMs can serve as effective and practical data augmenters for wireless machine learning, enabling robust and data-efficient learning in evolving wireless environments.
[LG-46] A Quasi-Experimental Developer Study of Security Training in LLM-Assisted Web Application Development
Link: https://arxiv.org/abs/2604.17763
Authors: Mohammed Kharma,Ahmed Sabbah,Radi Jarrar,Samer Zain,Mohammad Alkhanafseh,David Mohaisen
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 8 pages, 3 figures, 6 tables
Abstract:This paper presents a controlled quasi-experimental developer study examining whether a layer-based security training package is associated with improved security quality in LLM-assisted implementation of an identity-centric Java Spring Boot backend. The study uses a mixed design with a within-subject pre-training versus post-training comparison and an exploratory between-subject expertise factor. Twelve developers completed matched runs under a common interface, fixed model configuration, counterbalanced task sets, and a shared starter project. Security outcomes were assessed via independent manual validation of submitted repositories by the first and second authors. The primary participant-level endpoint was a severity-weighted validated-weakness score. The post-training condition showed a significant paired reduction under an exact Wilcoxon signed-rank test ( p = 0.0059 ). In aggregate, validated weaknesses decreased from 162 to 111 (31.5%), the severity-weighted burden decreased from 432 to 267 (38.2%), and critical findings decreased from 24 to 5 (79.2%). The largest reductions were in authorization and object access (53.3%) and in authentication, credential policy, and recovery weaknesses (44.7%). Session and browser trust-boundary issues showed minimal change, while sensitive-data and cryptographic weaknesses showed only marginal improvement. These results suggest that, under the tested conditions, post-training runs reduce validated security burden in LLM-assisted backend development without modifying the model. They do not support replacing secure defaults, static analysis, expert review, or operational hardening.
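The primary endpoint, a severity-weighted validated-weakness score, and the aggregate reductions quoted above are straightforward arithmetic. The per-severity weights below are hypothetical (the abstract reports only aggregates); the percentage figures reproduce the reported ones.

```python
# Hypothetical severity weights; the study reports only aggregate burdens.
SEVERITY = {"critical": 10, "high": 5, "medium": 2, "low": 1}

def weighted_score(findings):
    """Severity-weighted validated-weakness score for one participant,
    where findings maps a severity level to its validated count."""
    return sum(SEVERITY[sev] * n for sev, n in findings.items())

def relative_reduction(pre, post):
    """Percentage drop from the pre-training to the post-training total."""
    return 100.0 * (1.0 - post / pre)

# Aggregates reported in the abstract: 162 -> 111 validated weaknesses,
# 432 -> 267 severity-weighted burden, 24 -> 5 critical findings.
weakness_drop = relative_reduction(162, 111)
burden_drop = relative_reduction(432, 267)
critical_drop = relative_reduction(24, 5)
```

The paired Wilcoxon test in the study then compares the twelve participant-level `weighted_score` values pre- versus post-training.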
[LG-47] Efficient Federated RLHF via Zeroth-Order Policy Optimization
Link: https://arxiv.org/abs/2604.17747
Authors: Deyi Wang,Qining Zhang,Lei Ying
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S ^2 ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S ^2 ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF on four MuJoCo RL tasks.
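The core primitive, a zeroth-order gradient estimate from binary perturbations followed by a sign-based update, can be sketched in a few lines. This illustrates the flavor of update Par-S ^2 ZPO builds on, not its partitioned, federated form; the objective here is a toy stand-in.

```python
import numpy as np

def zo_sign_step(theta, f, delta, lr, rng):
    """Two-point zeroth-order gradient estimate with a binary (Rademacher)
    perturbation z, followed by a sign-based parameter update. Only
    function evaluations of f are needed, never its gradient, which is
    what keeps computation and memory cost low on edge devices."""
    z = rng.choice([-1.0, 1.0], size=theta.shape)
    g = (f(theta + delta * z) - f(theta - delta * z)) / (2.0 * delta) * z
    return theta - lr * np.sign(g)

rng = np.random.default_rng(0)
f = lambda t: float(np.sum(t * t))      # toy stand-in for a policy objective
theta = np.full(4, 5.0)
for _ in range(300):
    theta = zo_sign_step(theta, f, delta=1e-3, lr=0.05, rng=rng)
```

Transmitting only signs (one bit per coordinate) is also what gives the federated variant its low communication complexity.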
[LG-48] Program Structure-aware Language Models: Targeted Software Testing beyond Textual Semantics ACL
Link: https://arxiv.org/abs/2604.17715
Authors: Khang Tran,Khoa Nguyen,Cristian Borcea,NhatHai Phan
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
*Comments: Accepted in The 64th Annual Meeting of the Association for Computational Linguistics (ACL Findings 2026)
Abstract:Recent advances in large language models for test case generation have improved branch coverage via prompt-engineered mutations. However, they still lack principled mechanisms for steering models toward specific high-risk execution branches, limiting their effectiveness for discovering subtle bugs and security vulnerabilities. We propose GLMTest, the first program structure-aware LLM framework for targeted test case generation that seamlessly integrates code property graphs and code semantics using a graph neural network and a language model to condition test case generation on execution branches. This structured conditioning enables controllable and branch-targeted test case generation, thereby potentially enhancing bug and security risk discovery. Experiments on real-world projects show that GLMTest built on a Qwen2.5-Coder-7B-Instruct model improves branch accuracy from 27.4% to 50.2% on the TestGenEval benchmark compared with state-of-the-art LLMs, i.e., Claude-Sonnet-4.5 and GPT-4o-mini.
[LG-49] Modeling Higher-Order Brain Interactions via a Multi-View Information Bottleneck Framework for fMRI-based Psychiatric Diagnosis
Link: https://arxiv.org/abs/2604.17713
Authors: Kunyu Zhang,Qiang Li,Vince D. Calhoun,Shujian Yu
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Resting-state functional magnetic resonance imaging (fMRI) has emerged as a cornerstone for psychiatric diagnosis, yet most approaches rely on pairwise cortical or sub-cortical brain connectivities that overlook higher-order interactions (HOIs) central to complex brain dynamics. While hypergraph methods encode HOIs through predefined hyperedges, their construction typically relies on heuristic similarity metrics and does not explicitly characterize whether interactions are synergy- or redundancy-dominated. In this paper, we introduce O-information, a signed measure that characterizes the informational nature of HOIs, and integrate third- and fourth-order O-information into a unified multi-view information bottleneck framework for fMRI-based psychiatric diagnosis. To enable scalable O-information estimation, we further develop two independent acceleration strategies: a Gaussian analytical approximation and a randomized matrix-based Rényi entropy estimator, achieving over a 30-fold computational speedup compared with conventional estimators. Our tri-view architecture systematically fuses pairwise, triadic, and tetradic brain interactions, capturing comprehensive brain connectivity while explicitly penalizing redundancy. Extensive evaluation across four benchmark datasets (REST-meta-MDD, ABIDE, UCLA, ADNI) demonstrates consistent improvements, outperforming 11 baseline methods including state-of-the-art graph neural network (GNN) and hypergraph-based approaches. Moreover, our method reveals interpretable region-level synergy-redundancy patterns which are not explicitly characterized by conventional hypergraph formulations.
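O-information and its Gaussian analytical approximation (one of the two acceleration strategies named above) have a closed form once covariances are estimated. A minimal sketch using the standard definition, Omega(X) = (n-2) H(X) + sum over i of [H(X_i) - H(X_{-i})], with Gaussian entropies:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian with covariance cov."""
    k = cov.shape[0]
    return 0.5 * np.log((2.0 * np.pi * np.e) ** k * np.linalg.det(cov))

def o_information(cov):
    """O-information under the Gaussian analytical approximation.
    Omega > 0 indicates redundancy-dominated interactions; Omega < 0,
    synergy-dominated ones."""
    n = cov.shape[0]
    omega = (n - 2) * gaussian_entropy(cov)
    for i in range(n):
        rest = np.delete(np.delete(cov, i, axis=0), i, axis=1)
        omega += gaussian_entropy(cov[i:i + 1, i:i + 1]) - gaussian_entropy(rest)
    return omega

# Independent variables carry no higher-order structure: Omega = 0.
independent = o_information(np.eye(4))
# Strongly equicorrelated variables are redundancy-dominated: Omega > 0.
rho = 0.9
redundant = o_information(np.full((3, 3), rho) + (1 - rho) * np.eye(3))
```

The sign of Omega is exactly the synergy/redundancy label the framework uses to characterize triadic and tetradic hyperedges.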
[LG-50] Path-Based Quantum Meta-Learning for Adaptive Optimization of Reconfigurable Intelligent Surfaces
Link: https://arxiv.org/abs/2604.17690
Authors: Noha Hassan,Xavier Fernando,Halim Yanikomeroglu
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
*Comments: This work has been submitted to the IEEE Wireless Communications Letters Journal for possible publication
Abstract:Reconfigurable intelligent surfaces (RISs) modify signal reflections to enhance wireless communication capabilities. Classical RIS phase optimization is highly non-convex and challenging in dynamic environments due to high interference and user mobility. Here we propose a hierarchical multi-objective quantum meta-learning algorithm that switches among specific quantum paths based on historical success, energy cost, and current data rate. Candidate RIS control directions are arranged as switch paths between quantum neural network layers to minimize inference, and a scoring mechanism selects the top-performing paths per layer. Instead of merely storing past successful settings of the RIS and picking the closest match when a new problem is encountered, the algorithm learns how to select and recombine the best parts of different solutions to solve new scenarios. In our model, high-dimensional RIS scenario features are compressed into a quantum state using the tensor product, then superimposed during quantum path selection, significantly improving quantum computational advantage. Results demonstrate efficient performance with enhanced spectral efficiency, convergence rate, and adaptability.
[LG-51] Grokking of Diffusion Models: Case Study on Modular Addition
Link: https://arxiv.org/abs/2604.17673
Authors: Joon Hyeok Kim,Yong-Hyun Park,Mattis Dalsætra Østby,Jiatao Gu
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Despite their empirical success, how diffusion models generalize remains poorly understood from a mechanistic perspective. We demonstrate that diffusion models trained with flow-matching objectives exhibit grokking (delayed generalization after overfitting) on modular addition, enabling controlled analysis of their internal computations. We study this phenomenon across two levels of data regime. In a single-image regime, mechanistic dissection reveals that the model implements modular addition by composing periodic representations of individual operands. In a diverse-image regime with high intraclass variability, we find that the model leverages its iterative sampling process to partition the task into an arithmetic computation phase followed by a visual denoising phase, separated by a critical timestep threshold. Our work provides a mechanistic decomposition of algorithmic learning in diffusion models, revealing how these models bridge continuous pixel-space generation and discrete symbolic reasoning.
[LG-52] Prior-Fitted Functional Flow: In-Context Generative Models for Pharmacokinetics
Link: https://arxiv.org/abs/2604.17670
Authors: César Ojeda,Niklas Hartung,Wilhelm Huisinga,Tim Jahn,Purity Kamene Kavwele,Marian Klose,Piyush Kumar,Ramsés J. Sánchez,Darius A. Faroughy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 9 pages, 2 tables and 4 figures
Abstract:We introduce Prior-Fitted Functional Flows, a generative foundation model for pharmacokinetics that enables zero-shot population synthesis and individual forecasting without manual parameter tuning. We learn functional vector fields, explicitly conditioned on the sparse, irregular data of an entire study population. This enables the generation of coherent virtual cohorts as well as forecasting of partially observed patient trajectories with calibrated uncertainty. We construct a new open-access literature corpus to inform our priors, and demonstrate state-of-the-art predictive accuracy on extensive real-world datasets.
[LG-53] SLO-Guard: Crash-Aware Budget-Consistent Autotuning for SLO-Constrained LLM Serving
Link: https://arxiv.org/abs/2604.17627
Authors: Christian Lysenstøen
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*Comments: 20 pages, 6 figures, 5 tables. Code and raw per-trial JSONL data: this https URL
Abstract:Serving large language models under latency service-level objectives (SLOs) is a configuration-heavy systems problem with an unusually failure-prone search space: many plausible configurations crash outright or miss user-visible latency targets, and standard black-box optimizers treat these failures as wasted trials. We present SLO-Guard, a crash-aware autotuner for vLLM serving that treats crashes as first-class observations. SLO-Guard combines a feasible-first Thermal Budget Annealing (TBA) exploration phase with a warm-started Tree-structured Parzen Estimator (TPE) exploitation phase; the handoff replays all exploration history, including crashes encoded as extreme constraint violations. We additionally contribute a configuration-repair pass, a GPU-aware KV-cache memory guard, and a four-category crash taxonomy. We evaluate SLO-Guard on Qwen2-1.5B served with vLLM 0.19 on an NVIDIA A100 40GB. Across a pre-specified five-seed study, both SLO-Guard and uniform random search attain 75/75 feasibility with zero crashes under the corrected concurrent harness, and are statistically tied on best-achieved latency (Mann-Whitney two-sided p=0.84). SLO-Guard’s advantage is in budget consistency: more trials in the fast-serving regime (10.20 vs. 7.40 out of 15; one-sided p=0.014) and higher post-handoff consistency (0.876 vs. 0.539; p=0.010). Under concurrent load, SLO-Guard’s cross-seed standard deviation on best latency is 4.4x tighter than random search’s (2.26 ms vs. 10.00 ms). A harness-replication analysis shows that the consistency findings survive an independent sequential-dispatch measurement condition. The central claim is not that SLO-Guard finds a better final configuration, but that it spends a fixed tuning budget more predictably once the fast regime has been found.
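The crashes-as-first-class-observations idea can be sketched independently of vLLM: a crashed trial is recorded in the search history with an extreme penalty instead of being discarded, so a downstream optimizer can learn from it. The configuration space, parameter names, and toy serving function below are hypothetical stand-ins, and plain random search substitutes for the TBA/TPE pipeline.

```python
import random

CRASH_PENALTY = 1e9   # a crash becomes a first-class, maximally bad observation

def observe(cfg, serve_trial):
    """Run one tuning trial; a crash is encoded as an extreme constraint
    violation rather than a wasted trial."""
    try:
        return serve_trial(cfg)
    except RuntimeError:                    # e.g. engine crash / CUDA OOM
        return CRASH_PENALTY

def tune(space, serve_trial, budget, seed=0):
    """Minimal random-search tuner that keeps crashes in its history."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        history.append((cfg, observe(cfg, serve_trial)))
    return min(history, key=lambda t: t[1])

def toy_serve(cfg):
    """Hypothetical server: oversized batches crash, small ones serve slowly."""
    if cfg["max_num_seqs"] * cfg["gpu_blocks"] > 100:
        raise RuntimeError("OOM")
    return 100.0 / cfg["max_num_seqs"]      # stand-in p95 latency, in ms

space = {"max_num_seqs": [8, 16, 32, 64], "gpu_blocks": [1, 2, 4]}
best_cfg, best_lat = tune(space, toy_serve, budget=50)
```

Replaying such a history (crashes included) into a model-based optimizer like TPE is what the paper's handoff step does; the penalty encoding steers the model away from the crash region without losing the information those trials carry.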
[LG-54] STRIKE: Additive Feature-Group-Aware Stacking Framework for Credit Default Prediction
Link: https://arxiv.org/abs/2604.17622
Authors: Swattik Maiti,Ritik Pratap Singh,Fardina Fathmiul Alam
Subjects: Machine Learning (cs.LG)
*Comments: 17 pages, 5 figures
Abstract:Credit risk default prediction remains a cornerstone of risk management in the financial industry. The task involves estimating the likelihood that a borrower will fail to meet debt obligations, an objective critical for lending decisions, portfolio optimization, and regulatory compliance. Traditional machine learning models such as logistic regression and tree-based ensembles are widely adopted for their interpretability and strong empirical performance. However, modern credit datasets are high-dimensional, heterogeneous, and noisy, increasing overfitting risk in monolithic models and reducing robustness under distributional shift. We introduce STRIKE (Stacking via Targeted Representations of Isolated Knowledge Extractors), a feature-group-aware stacking framework for structured tabular credit risk data. Rather than training a single monolithic model on the complete dataset, STRIKE partitions the feature space into semantically coherent groups and trains independent learners within each group. This decomposition is motivated by an additive perspective on risk modeling, where distinct feature sources contribute complementary evidence that can be combined through a structured aggregation. The resulting group-specific predictions are integrated through a meta-learner that aggregates signals while maintaining robustness and modularity. We evaluate STRIKE on three real-world datasets spanning corporate bankruptcy and consumer lending scenarios. Across all settings, STRIKE consistently outperforms strong tree-based baselines and conventional stacking approaches in terms of AUC-ROC. Ablation studies confirm that performance gains stem from meaningful feature decomposition rather than increased model complexity. Our findings demonstrate that STRIKE is a stable, scalable, and interpretable framework for credit risk default prediction tasks.
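The additive feature-group decomposition above can be sketched with linear scorers standing in for the per-group learners and the meta-learner. Group boundaries, the synthetic default signal, and the ridge term are all illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def fit_linear(X, y, ridge=1e-3):
    """Ridge least-squares scorer toward {0,1} labels; a stand-in for a
    group-specific tree ensemble."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]), Xb.T @ y)

def score(X, w):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def strike_fit(X, y, groups):
    """Train one isolated learner per feature group, then a meta-learner
    over the stacked group-level scores."""
    base = [fit_linear(X[:, g], y) for g in groups]
    Z = np.column_stack([score(X[:, g], w) for g, w in zip(groups, base)])
    return base, fit_linear(Z, y)

def strike_predict(X, groups, base, meta):
    Z = np.column_stack([score(X[:, g], w) for g, w in zip(groups, base)])
    return (score(Z, meta) > 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(float)   # default signal spans both groups
groups = [[0, 1, 2], [3, 4, 5]]             # e.g. balance-sheet vs behavioral features
base, meta = strike_fit(X, y, groups)
acc = float(np.mean(strike_predict(X, groups, base, meta) == y))
```

Because each base learner only sees its own group, the meta-learner's coefficients read directly as per-group contributions, which is the interpretability argument the framework makes.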
[LG-55] Conditional Attribution for Root Cause Analysis in Time-Series Anomaly Detection
Link: https://arxiv.org/abs/2604.17616
Authors: Shashank Mishra,Karan Patil,Cedric Schockaert,Didier Stricker,Jason Rambach
Subjects: Machine Learning (cs.LG)
*Comments: 16 pages, 8 figures, 13 tables, Appendix included
Abstract:Root cause analysis (RCA) for time-series anomaly detection is critical for the reliable operation of complex real-world systems. Existing explanation methods often rely on unrealistic feature perturbations and ignore temporal and cross-feature dependencies, leading to unreliable attributions. We propose a conditional attribution framework that explains anomalies relative to contextually similar normal system states. Instead of using marginal or randomly sampled baselines, our method retrieves representative normal instances conditioned on the anomalous observation, enabling dependency-preserving and operationally meaningful explanations. To support high-dimensional time-series data, contextual retrieval is performed in learned low-dimensional representations using both variational autoencoder latent spaces and UMAP manifold embeddings. By grounding the retrieval process in the system’s learned manifold, this strategy avoids out-of-distribution artifacts and ensures attribution fidelity while maintaining computational efficiency. We further introduce confidence-aware and temporal evaluation metrics for assessing explanation reliability and responsiveness. Experiments on the SWaT and MSDS benchmarks demonstrate that the proposed approach consistently improves root-cause identification accuracy, temporal localization, and robustness across multiple anomaly detection models. These results highlight the practical utility of conditional attribution for explainable anomaly diagnosis in complex time-series systems. Code and models will be publicly released.
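The retrieval step above (explain an anomaly against contextually similar normal states rather than a marginal baseline) can be sketched with a nearest-neighbour lookup in an embedding and a z-score attribution. The `encode` function here is a trivial stand-in for the VAE latent space or UMAP embedding, and the fault injection is synthetic.

```python
import numpy as np

def conditional_attribution(x_anom, X_normal, encode, k=10):
    """Retrieve the k nearest normal instances in a learned low-dimensional
    embedding, then score each raw feature by its deviation (z-score)
    from that conditional, dependency-preserving baseline."""
    Z_normal = np.array([encode(x) for x in X_normal])
    dists = np.linalg.norm(Z_normal - encode(x_anom), axis=1)
    neighbours = np.argsort(dists)[:k]
    mu = X_normal[neighbours].mean(axis=0)
    sd = X_normal[neighbours].std(axis=0) + 1e-8
    return np.abs(x_anom - mu) / sd

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 4))
encode = lambda x: x[:2]          # stand-in for a VAE/UMAP embedding
x_anom = X_normal[0].copy()
x_anom[3] += 8.0                  # inject a fault into feature 3
attr = conditional_attribution(x_anom, X_normal, encode)
```

Because the baseline is retrieved from the system's own normal states, the attribution never compares against out-of-distribution perturbations, which is the fidelity argument the abstract makes.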
[LG-56] Recovery Guarantees for Continual Learning of Dependent Tasks: Memory Data-Dependent Regularization and Data-Dependent Weights
链接: https://arxiv.org/abs/2604.17578
作者: Liangzu Peng,Uday Kiran Reddy Tadipatri,Ziqing Xu,Eric Eaton,René Vidal
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Continual learning (CL) is concerned with learning multiple tasks sequentially without forgetting previously learned tasks. Despite substantial empirical advances over recent years, the theoretical development of CL remains in its infancy. At the heart of developing CL theory lies the challenge that the data distribution varies across tasks, and we argue that properly addressing this challenge requires understanding this variation–dependency among tasks. To explicitly model task dependency, we consider nonlinear regression tasks and propose the assumption that these tasks are dependent in such a way that the data of the current task is a nonlinear transformation of previous data. With this model and under natural assumptions, we prove statistical recovery guarantees (more specifically, bounds on estimation errors) for several CL paradigms in practical use, including experience replay with data-independent regularization and data-independent weights that balance the losses of tasks, replay with data-dependent weights, and continual learning with data-dependent regularization (e.g., knowledge distillation). To the best of our knowledge, our bounds are informative in cases where prior work gives vacuous bounds.
[LG-57] Diverse Dictionary Learning ICLR2026
链接: https://arxiv.org/abs/2604.17568
作者: Yujia Zheng,Zijian Li,Shunxing Fan,Andrew Gordon Wilson,Kun Zhang
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: ICLR 2026
Abstract:Given only observational data X = g(Z) , where both the latent variables Z and the generating process g are unknown, recovering Z is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving uncertainty about how to reliably understand the hidden world. To make identifiability actionable in the real-world scenarios, we take a complementary view: in the general settings where full identifiability is unattainable, what can still be recovered with guarantees, and what biases could be universally adopted? We introduce the problem of diverse dictionary learning to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, are still identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as genus-differentia definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
[LG-58] Target Parameterization in Diffusion Models for Nonlinear Spatiotemporal System Identification
链接: https://arxiv.org/abs/2604.17566
作者: Achraf El Messaoudi,Noureddine Khaous,Karim Cherifi
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Machine learning is becoming increasingly important for nonlinear system identification, including dynamical systems with spatially distributed outputs. However, classical identification and forecasting approaches become markedly less reliable in turbulent-flow regimes, where the dynamics are high-dimensional, strongly nonlinear, and highly sensitive to compounding rollout errors. Diffusion-based models have recently shown improved robustness in this setting and offer probabilistic inference capabilities, but many current implementations inherit target parameterizations from image generation, most commonly noise or velocity prediction. In this work, we revisit this design choice in the context of nonlinear spatiotemporal system identification. We consider a simple, self-contained patch-based transformer that operates directly on physical fields and use turbulent flow simulation as a representative testbed. Our results show that clean-state prediction consistently improves rollout stability and reduces long-horizon error relative to velocity- and noise-based objectives, with the advantage becoming more pronounced as the per-token dimensionality increases. These findings identify target parameterization as a key modeling choice in diffusion-based identification of nonlinear systems with spatial outputs in turbulent regimes.
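The three target parameterizations the paper compares are interconvertible through the standard diffusion forward identity x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps. A generic helper (not code from the paper) showing how a noise prediction maps to the clean-state prediction the authors favour:

```python
def to_clean_prediction(x_t, eps_pred, alpha_bar):
    """Convert a noise ("eps") prediction into a clean-state ("x0")
    prediction via the DDPM forward-process identity
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    Generic diffusion algebra for comparing target parameterizations."""
    return (x_t - (1.0 - alpha_bar) ** 0.5 * eps_pred) / alpha_bar ** 0.5
```

The parameterizations are mathematically equivalent at the optimum; the paper's point is that they are not equivalent for learning dynamics and rollout stability.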
[LG-59] Contraction and Hourglass Persistence for Learning on Graphs Simplices and Cells ICLR2026
链接: https://arxiv.org/abs/2604.17548
作者: Mattie Ji,Indradyumna Roy,Vikas Garg
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
*备注: 31 pages, 6 figures, 4 algorithms, 2 tables. Accepted at ICLR 2026
Abstract:Persistent homology (PH) encodes global information, such as cycles, and is thus increasingly integrated into graph neural networks (GNNs). PH methods in GNNs typically traverse an increasing sequence of subgraphs. In this work, we first expose limitations of this inclusion procedure. To remedy these shortcomings, we analyze contractions as a principled topological operation, in particular, for graph representation learning. We study the persistence of contraction sequences, which we call Contraction Homology (CH). We establish that forward PH and CH differ in expressivity. We then introduce Hourglass Persistence, a class of topological descriptors that interleave a sequence of inclusions and contractions to boost expressivity, learnability, and stability. We also study related families parametrized by two paradigms. We also discuss how our framework extends to simplicial and cellular networks. We further design efficient algorithms that are pluggable into end-to-end differentiable GNN pipelines, enabling consistent empirical improvements over many PH methods across standard real-world graph datasets. Code is available at this https URL.
[LG-60] Trustworthy deep domain adaptation for wearable photoplethysmography signal analysis with decision-theoretic uncertainty quantification
链接: https://arxiv.org/abs/2604.17480
作者: Ciaran Bench
类目: Machine Learning (cs.LG)
*备注:
Abstract:In principle, deep generative models can be used to perform domain adaptation; i.e. align the input feature representations of test data with that of a separate discriminative model’s training data. This can help improve the discriminative model’s performance on the test data. However, generative models are prone to producing hallucinations and artefacts that may degrade the quality of generated data, and therefore, predictive performance when processed by the discriminative model. While uncertainty quantification can provide a means to assess the quality of adapted data, the standard framework for evaluating the quality of predicted uncertainties may not easily extend to generative models due to the common lack of ground truths (among other reasons). Even with ground truths, this evaluation is agnostic to how the generated outputs are used on the downstream task, limiting the extent to which the uncertainty reliability analysis provides insights about the utility of the uncertainties with respect to the intended use case of the adapted examples. Here, we describe how decision-theoretic uncertainty quantification can address these concerns and provide a convenient framework for evaluating the trustworthiness of generated outputs, in particular, for domain adaptation. We consider a case study in photoplethysmography time series denoising for Atrial Fibrillation classification. This formalises a well-known heuristic method of using a downstream classifier to assess the quality of generated outputs.
[LG-61] Machine Learning Hamiltonian Dynamical Systems with Sparse and Noisy Data
链接: https://arxiv.org/abs/2604.17470
作者: Vedanta Thapar,Abhinav Gupta
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning has become a powerful tool for discovering governing laws of dynamical systems from data. However, most existing approaches degrade severely when observations are sparse, noisy, or irregularly sampled. In this work, we address the problem of learning symbolic representations of nonlinear Hamiltonian dynamical systems under extreme data scarcity by explicitly incorporating physical structure into the learning architecture. We introduce Adaptable Symplectic Recurrent Neural Networks (ASRNNs), a parameter-cognizant, structure-preserving model that combines Hamiltonian learning with symplectic recurrent integration, avoiding time derivative estimation, and enabling stable learning under noise. We demonstrate that ASRNNs can accurately predict long-term dynamics even when each training trajectory consists of only two irregularly spaced time points, possibly corrupted by correlated noise. Leveraging ASRNNs as structure-preserving data generators, we further enable symbolic discovery using independent regression methods (SINDy and PySR), recovering exact symbolic equations for polynomial systems and consistent polynomial approximations for non-polynomial Hamiltonians. Our results show that such architectures can provide a robust pathway to interpretable discovery of Hamiltonian dynamics from sparse and noisy data.
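Structure-preserving recurrent integration of the kind ASRNNs use can be illustrated with a standard symplectic leapfrog step for Hamilton's equations dq/dt = dH/dp, dp/dt = -dH/dq. This is a generic scalar sketch of the integration principle, not the paper's exact ASRNN cell:

```python
def leapfrog_step(dH_dq, dH_dp, q, p, dt):
    """One symplectic leapfrog step: half-kick the momentum, drift the
    position, half-kick again. Symplectic integrators like this conserve
    energy over long horizons, which is why embedding them in a recurrent
    cell stabilises learning of Hamiltonian dynamics."""
    p_half = p - 0.5 * dt * dH_dq(q)
    q_next = q + dt * dH_dp(p_half)
    p_next = p_half - 0.5 * dt * dH_dq(q_next)
    return q_next, p_next
```

For the harmonic oscillator H = (q^2 + p^2)/2, rolling this step out for hundreds of iterations keeps the energy bounded near its initial value, unlike a naive Euler step whose energy drifts.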
[LG-62] Neural Adjoint Method for Meta-optics: Accelerating Volumetric Inverse Design via Fourier Neural Operators
链接: https://arxiv.org/abs/2604.17425
作者: Chanik Kang,Hyewon Suk,Haejun Chung
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注: 10 pages, 6 figures, 3 tables
Abstract:Meta-optics promises compact, high-performance imaging and color routing. However, designing high-performance structures is a high-dimensional optimization problem: mapping a desired optical output back to a physical 3D structure requires solving computationally expensive Maxwell’s equations iteratively. Even with adjoint optimization, broadband design can require thousands of Maxwell solves, making industrial-scale optimization slow and costly. To overcome this challenge, we propose the Neural Adjoint Method, a solver-supervised surrogate that predicts 3D adjoint gradient fields from a voxelized permittivity volume using a Fourier Neural Operator (FNO). By learning the dense, per-voxel sensitivity field that drives gradient-based updates, our method can replace per-iteration adjoint solves with fast predictions, greatly reducing the computational cost of full-wave simulations required during iterative refinement. To better preserve sensitivity peaks, we introduce a stage-wise FNO that progressively refines residual errors with increasing emphasis on higher-frequency components. We curate a meta-optics dataset from paired forward/adjoint FDTD simulations and evaluate it across three tasks: spectral sorting (color routers), achromatic focusing (metalenses), and waveguide mode conversion. Our method reduces design time from hours to seconds. These results suggest a practical route toward fast, large-scale volumetric meta-optical design enabled by AI-accelerated scientific computing.
[LG-63] A unified convergence theory for adaptive first-order methods in the nonconvex case including AdaNorm full and diagonal AdaGrad Shampoo and Muon
链接: https://arxiv.org/abs/2604.17423
作者: S. Gratton,Ph. L. Toint
类目: Machine Learning (cs.LG)
*备注:
Abstract:A unified framework for first-order optimization algorithms for nonconvex unconstrained optimization is proposed that uses adaptively preconditioned gradients and includes popular methods such as full and diagonal AdaGrad, AdaNorm, as well as adaptive variants of Shampoo and Muon. This framework also allows combining heterogeneous geometries across different groups of variables while preserving a unified convergence analysis. A fully stochastic global rate-of-convergence analysis is conducted for all methods in the framework, with and without two types of momentum, using reasonable assumptions on the variance of the gradient oracle and without assuming bounded stochastic gradients or small enough stepsize.
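The simplest member of the adaptively preconditioned family covered by this analysis is diagonal AdaGrad, whose update divides each coordinate's gradient by the root of its accumulated squared gradients. A minimal reference implementation of that single step (standard algorithm, not the paper's code):

```python
def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One diagonal-AdaGrad update: precondition coordinate i by
    sqrt(sum of past squared gradients at i). Coordinates with large
    historical gradients automatically get smaller effective stepsizes."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (accum[i] ** 0.5 + eps)
    return params, accum
```

Full AdaGrad, Shampoo, and Muon replace the diagonal accumulator with richer (matrix or orthogonalized) preconditioners; the unified analysis treats them through the same lens.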
[LG-64] On the Generalization Bounds of Symbolic Regression with Genetic Programming
链接: https://arxiv.org/abs/2604.17402
作者: Masahiro Nomura,Ryoki Hamano,Isao Ono
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite its strong empirical success, the theoretical understanding of why GP-based SR generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes towards a more rigorous understanding of its generalization properties.
[LG-65] RISC-V Functional Safety for Autonomous Automotive Systems: An Analytical Framework and Research Roadmap for ML-Assisted Certification
链接: https://arxiv.org/abs/2604.17391
作者: Nick Andreasyan,Mikhail Struve,Alexey Popov,Maksim Nikolaev,Vadim Vashkelis
类目: oftware Engineering (cs.SE); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 11 pages, 3 figures, 4 tables. Analytical perspective paper on automotive-grade RISC-V functional safety, certification economics, and ML-assisted certification for autonomous driving systems
Abstract:RISC-V is emerging as a viable platform for automotive-grade embedded computing, with recent ISO 26262 ASIL-D certifications demonstrating readiness for safety-critical deployment in autonomous driving systems. However, functional safety in automotive systems is fundamentally a certification problem rather than a processor problem. The dominant costs arise from diagnostic coverage analysis, toolchain qualification, fault injection campaigns, safety-case generation, and compliance with ISO 26262, ISO 21448 (SOTIF), and ISO/SAE 21434. This paper analyzes the role of RISC-V in automotive functional safety, focusing on ISA openness, formal verifiability, custom extension control, debug transparency, and vendor-independent qualification. We examine autonomous driving safety requirements and map them to RISC-V architectural challenges such as lockstep execution, safety islands, mixed-criticality isolation, and secure debug. Rather than proposing a single algorithmic breakthrough, we present an analytical framework and research roadmap centered on certification economics as the primary optimization objective. We also discuss how selected ML methods, including LLM-assisted FMEDA generation, knowledge-graph-based safety case automation, reinforcement learning for fault injection, and graph neural networks for diagnostic coverage, can support certification workflows. We argue that the strongest outcome is not a faster core, but an ASIL-D-ready certifiable RISC-V platform.
[LG-66] Back to Repair: A Minimal Denoising Network for Time Series Anomaly Detection
链接: https://arxiv.org/abs/2604.17388
作者: Kadir-Kaan Özer,René Ebeling,Markus Enzweiler
类目: Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 5 tables
Abstract:We introduce JuRe (Just Repair), a minimal denoising network for time series anomaly detection that exposes a central finding: architectural complexity is unnecessary when the training objective correctly implements the manifold-projection principle. JuRe consists of a single depthwise-separable convolutional residual block with hidden dimension 128, trained to repair corrupted time series windows and scored at inference by a fixed, parameter-free structural discrepancy function. Despite using no attention, no latent variable, and no adversarial component, JuRe ranks second on the TSB-AD multivariate benchmark (AUC-PR 0.404, 180 series, 17 datasets) and second on the UCR univariate archive by AUC-PR (0.198, 250 series), leading all neural baselines on AUC-PR and VUS-PR. Component ablation on TSB-AD identifies training-time corruption as the dominant factor (ΔAUC-PR = 0.047 on removal), confirming that the denoising objective, not network capacity, drives detection quality. Pairwise Wilcoxon signed-rank tests establish statistical significance against 21 of 25 baselines on TSB-AD. Code is available at this https URL.
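The corrupt-then-repair training objective and the repair-based scoring rule can be sketched abstractly. The `repair` and `discrepancy` callables below are illustrative stand-ins for the trained denoising block and the fixed structural discrepancy function; the toy mean-repair used in the usage note is not the paper's network:

```python
import random

def corrupt(window, p=0.3, scale=2.0, rng=random):
    """Training-time corruption: randomly perturb a fraction p of points
    with Gaussian noise. The network is trained to undo exactly this."""
    return [x + rng.gauss(0.0, scale) if rng.random() < p else x
            for x in window]

def anomaly_score(window, repair, discrepancy):
    """Inference: score a window by how much the denoiser must change it
    to project it back onto the learned normal manifold."""
    return discrepancy(window, repair(window))
```

With a toy repairer that projects onto the window mean and a mean-squared discrepancy, a window with a spike scores far higher than a flat one, which is the manifold-projection principle in miniature.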
[LG-67] Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion
链接: https://arxiv.org/abs/2604.17384
作者: Ou Wu
类目: Machine Learning (cs.LG)
*备注: 25 pages
Abstract:Large language model optimization has historically bifurcated into isolated data-centric and model-centric paradigms: the former manipulates involved samples through selection, augmentation, or poisoning, while the latter tunes model weights via masking, quantization, or low-rank adaptation. This paper establishes a unified data-parameter correspondence revealing these seemingly disparate operations as dual manifestations of the same geometric structure on the statistical manifold $\mathcal{M}$. Grounded in the Fisher-Rao metric $g_{ij}(\theta)$ and Legendre duality between natural ($\theta$) and expectation ($\eta$) parameters, we identify three fundamental correspondences spanning the model lifecycle: 1. Geometric correspondence: data pruning and parameter sparsification equivalently reduce manifold volume via dual coordinate constraints; 2. Low-rank correspondence: in-context learning (ICL) and LoRA adaptation explore identical subspaces on the Grassmannian $\mathcal{G}(r,d)$, with $k$-shot samples geometrically equivalent to rank-$r$ updates; 3. Security-privacy correspondence: adversarial attacks exhibit cooperative amplification between data poisoning and parameter backdoors, whereas protective mechanisms follow cascading attenuation where data compression multiplicatively enhances parameter privacy. Extending from training through post-training compression to inference, this framework provides mathematical formalization for cross-community methodology transfer, demonstrating that cooperative optimization integrating data and parameter modalities may outperform isolated approaches across efficiency, robustness, and privacy dimensions.
[LG-68] Interpolating Discrete Diffusion Models with Controllable Resampling
链接: https://arxiv.org/abs/2604.17310
作者: Marcel Kollovieh,Sirine Ayadi,Stephan Günnemann
类目: Machine Learning (cs.LG)
*备注:
Abstract:Discrete diffusion models form a powerful class of generative models across diverse domains, including text and graphs. However, existing approaches face fundamental limitations. Masked diffusion models suffer from irreversible errors due to early unmasking, while uniform diffusion models, despite enabling self-correction, often yield low-quality samples due to their strong reliance on intermediate latent states. We introduce IDDM, an Interpolating Discrete Diffusion Model, that improves diffusion by reducing dependence on intermediate latent states. Central to IDDM is a controllable resampling mechanism that partially resets probability mass to the marginal distribution, mitigating error accumulation and enabling more effective token corrections. IDDM specifies a generative process whose transitions interpolate between staying at the current state, resampling from a prior, and flipping toward the target state, while enforcing marginal consistency and fully decoupling training from inference. We benchmark our model against state-of-the-art discrete diffusion models across molecular graph generation as well as text generation tasks, demonstrating competitive performance.
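The transition kernel's three-way interpolation (stay / resample from the prior / flip toward the target) can be caricatured per token. This is a schematic of the interpolation described in the abstract only; the actual schedule, marginal-consistency constraints, and learned components are omitted:

```python
import random

def iddm_transition(x, target, p_flip, p_resample, vocab, rng=random):
    """One schematic reverse-process transition for a single token,
    interpolating between flipping toward the target state, controllable
    resampling from a prior over the vocabulary, and staying put.
    Resampling is what lets the process revisit and correct tokens."""
    r = rng.random()
    if r < p_flip:
        return target            # move toward the target token
    if r < p_flip + p_resample:
        return rng.choice(vocab)  # reset probability mass to the marginal
    return x                     # keep the current token
```

Setting p_resample = 0 recovers masked-style irreversibility for committed tokens; raising it trades dependence on the current latent state for correction opportunities.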
[LG-69] REALM: Reliable Expertise-Aware Language Model Fine-Tuning from Noisy Annotations
链接: https://arxiv.org/abs/2604.17289
作者: Sajjad Ghiasvand,Mark Beliaev,Mahnoosh Alizadeh,Ramtin Pedarsani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator entirely unsupervised, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model’s prediction and a uniform random guess, weighted by the annotator’s learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. The proposed algorithm consistently outperforms the naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to 50% in the most adversarial regime and gains that grow with model capacity.
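The core modelling assumption, each observed label is a mixture of the model's prediction and a uniform guess weighted by annotator expertise, fits in one line. A sketch of that likelihood (the joint optimization of expertise values and model parameters is not shown):

```python
def annotator_likelihood(expertise, model_probs, label):
    """Probability of observing `label` from an annotator with the given
    scalar expertise: drawn from the model's predictive distribution with
    probability `expertise`, else a uniform guess over the K options."""
    K = len(model_probs)
    return expertise * model_probs[label] + (1.0 - expertise) / K
```

Maximizing this jointly over expertise and model parameters lets reliable annotators pull the model's predictions while unreliable ones degenerate toward the uniform component, without any supervision beyond annotator identity.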
[LG-70] A Unified Compliance Aggregator Framework for Automated Multi-Tool Security Assessment of Linux Systems
链接: https://arxiv.org/abs/2604.17256
作者: Sheldon Paul,Izzat Alsmadi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Assessing the security posture of modern computing systems typically requires the use of multiple specialized tools. These tools focus on different aspects such as configuration compliance, file integrity, and vulnerability exposure, and their outputs are often difficult to interpret collectively. This paper introduces the Unified Compliance Aggregator (UCA), a framework that integrates several open-source security tools into a single composite score representing overall system security. The proposed framework combines outputs from Lynis, OpenSCAP (STIG and CIS profiles), AIDE, Tripwire, and Nmap NSE. A normalization process converts heterogeneous outputs into a consistent 0 to 100 scale, followed by weighted aggregation. We also introduce a logarithmic scoring model for file integrity measurements to address limitations observed in prior linear approaches. Experiments were conducted on Ubuntu 22.04 across different hardening levels and environments. Results show consistent improvement in composite scores as systems are hardened, while also revealing contrasting behavior between compliance and file integrity tools. Two case studies, a basic web server and a DVWA-based system, illustrate how the framework can be applied in practical scenarios.
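The normalize-then-aggregate pipeline is straightforward to sketch. The logarithmic curve below illustrates the shape of the proposed integrity scoring (early violations cost more than later ones, unlike a linear scale); the exact formula and the per-tool weights are assumptions for illustration, not UCA's published values:

```python
import math

def integrity_score(violations, max_violations=1000):
    """Illustrative logarithmic scoring for file-integrity findings:
    the score drops steeply for the first violations and flattens out,
    avoiding the saturation problems of a linear scale."""
    if violations <= 0:
        return 100.0
    frac = math.log1p(violations) / math.log1p(max_violations)
    return max(0.0, 100.0 * (1.0 - frac))

def composite_score(tool_scores, weights):
    """Weighted aggregation of normalized 0-100 per-tool scores into a
    single composite posture score."""
    total = sum(weights.values())
    return sum(tool_scores[t] * w for t, w in weights.items()) / total
```

With equal weights the composite is just the mean of the normalized tool scores; weighting lets operators emphasize, say, compliance over port exposure.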
[LG-71] Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems
链接: https://arxiv.org/abs/2604.17249
作者: Yuji Yamamoto,Satoshi Matsuura
类目: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures
Abstract:Rowhammer on GPU DRAM has enabled adversarial bit flips in model weights; shared KV-cache blocks in LLM serving systems present an analogous but previously unexamined target. In vLLM’s Prefix Caching, these blocks exist as a single physical copy without integrity protection. Using software fault injection under ideal bit targeting, we characterize worst-case severity and identify three properties: (1) Silent divergence - 13 of 16 BF16 bit positions produce coherent but altered outputs, indistinguishable from legitimate responses without a clean baseline. (2) Selective propagation - only requests sharing the targeted prefix are affected. (3) Persistent accumulation - no temporal decay occurs, so cumulative damage grows linearly with subsequent requests. Together, these constitute a threat profile distinct from weight corruption: silent divergence and selective propagation enable detection evasion; persistent accumulation then proceeds unchecked, yielding damage amplification bounded only by how long the block remains cached. A checksum-based countermeasure detects any single-bit corruption at scheduling time, bounding cumulative damage to one batch independent of the block’s cache lifetime, with negligible overhead. These results argue for integrity protection of prefix blocks before end-to-end exploitation is demonstrated.
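The checksum countermeasure reduces to recording a digest when a shared prefix block is cached and re-verifying it at scheduling time. A minimal sketch using SHA-256 (the serving-system integration and the paper's exact digest choice are not reproduced here):

```python
import hashlib

def block_checksum(kv_block: bytes) -> bytes:
    """Digest recorded when a shared prefix block is written to the
    KV cache."""
    return hashlib.sha256(kv_block).digest()

def verify_block(kv_block: bytes, recorded: bytes) -> bool:
    """Scheduling-time integrity check: any single-bit flip changes the
    digest, so a corrupted block is detected before it can serve another
    batch, bounding cumulative damage to one batch."""
    return hashlib.sha256(kv_block).digest() == recorded
```

This directly addresses the persistent-accumulation property: without the check, a silently corrupted block keeps poisoning every request that shares the prefix for as long as it stays cached.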
[LG-72] Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study ACL
链接: https://arxiv.org/abs/2604.17228
作者: Qingwei Lin
类目: Machine Learning (cs.LG)
*备注: 23 pages, 4 figures. Preprint. Controlled empirical study with 3-seed runs at 157.5M parameters; includes a negative result on oracle-style utility/rank supervision for conditional depth routing
Abstract:Conditional depth execution routes a subset of tokens through a lightweight cheap FFN while the remainder execute the standard full FFN at each controlled layer. The central difficulty is gate training: the gate decision must propagate through many layers before it influences the language modeling (LM) loss, so the resulting gradients are weak and noisy. Auxiliary losses are commonly stacked to stabilise training, yet the interactions among them – particularly between a predictive auxiliary and explicit score supervision – have not been systematically compared under controlled conditions. We evaluate two gate designs under a 157.5M-parameter decoder-only model with controller-only training, 50% full-path budget, and 3-seed runs on a fineweb-edu subset. The MLP gate (G1) maps the current hidden state to a utility score; the JEPA-guided gate (G3) adds an action-conditional predictor that forecasts, in a low-dimensional latent space, the outcome of executing full vs. cheap per token, aligned against a fixed target head. Under the standard recipe with oracle-style utility regression and pairwise rank supervision (util/rank), G3 improves early-to-mid optimisation over G1 in 3/3 seeds (lower avg LM, faster threshold hits, ~10.3x lower grad norms), with 20k-step endpoint LM within a 0.005 heuristic reference. A key finding (ablation A3): jointly removing util/rank improves best/avg LM and threshold-hit speed in 3/3 seeds for both gates, and the early-to-mid advantage of G3 over G1 disappears. We trace this to an off-policy oracle label that assumes all subsequent layers execute full, whereas gated execution routes only a fraction through full – making util/rank net-negative under the current recipe. Removing util/rank also cuts the training FLOPs proxy from ~1.53x to ~1.07x full-only (2.87h to 1.75h on a V100-32GB, ~39%). Conclusions are scoped to the studied regime.
[LG-73] LASER: Low-Rank Activation SVD for Efficient Recursion ICLR2026
链接: https://arxiv.org/abs/2604.17224
作者: Ege Çakar,Ketan Ali Raghu,Lia Zheng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to the Latent and Implicit Thinking Workshop at ICLR 2026
Abstract:Recursive architectures such as Tiny Recursive Models (TRMs) perform implicit reasoning through iterative latent computation, yet the geometric structure of these reasoning trajectories remains poorly understood. We investigate the activation manifold of TRMs during recursive unrolling and find that activations occupy an effectively linear, low-dimensional subspace whose principal directions can be tracked dynamically with cheap power iterations. This suggests that weight-sharing concentrates iterative computation along a small number of dominant eigendirections, and we find that this concentration varies sharply across computational sites. We exploit this structure through LASER (Low-Rank Activation SVD for Efficient Recursion), a dynamic compression framework that maintains an evolving low-rank basis via matrix-free subspace tracking with a fidelity-triggered reset mechanism, achieving ~60% activation memory savings with no statistically significant accuracy degradation. Our analysis raises questions about how recursive architectures allocate representational capacity during implicit reasoning, and whether this concentration can be exploited to improve the efficiency and stability of latent computation.
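The "cheap power iterations" used for subspace tracking are the classic matrix-free primitive: repeatedly apply the map v -> A v and renormalize. A generic sketch of that primitive (LASER's evolving multi-direction basis and fidelity-triggered resets are not modelled here):

```python
def power_iteration(matvec, v, steps=20):
    """Matrix-free power iteration toward the dominant eigendirection of
    a linear map, given only its action v -> A v. Each step amplifies the
    component along the largest-eigenvalue direction and renormalizes."""
    for _ in range(steps):
        w = matvec(v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Tracking the top few such directions of the activation covariance gives a low-rank basis in which activations can be stored cheaply, which is the mechanism behind the reported memory savings.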
[LG-74] Bilinear Input Modulation for Mamba: Koopman Bilinear Forms for Memory Retention and Multiplicative Computation
链接: https://arxiv.org/abs/2604.17221
作者: Hiroki Fujii,Masaki Yamakita
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 6 pages, 5 figures, submitted to IEEE Control Systems Letters (L-CSS)
Abstract:Selective State Space Models (SSMs), notably Mamba, employ diagonal state transitions that limit both memory retention and bilinear computational capacity. We propose a factorized bilinear input modulation that augments the SSM with a state-input product, interpretable as a finite-dimensional Koopman bilinear form. After introducing a shared state across channels (Coupled SSM), the modulation admits two implementations. Coupled Bilinear Input Modulation (Coupled-BIM) retains the full bilinear product at the cost of sequential computation, while Coupled Gated Modulation (Coupled-GM) linearizes it into a gate modulation that is compatible with the parallel scan. Experiments on a multiple input-delay pendulum (memory retention) and NARMA-10 (bilinear computation) reveal a clear dissociation. Coupled-GM substantially improves memory retention but not bilinear computation, while Coupled-BIM improves both. A pathway ablation confirms that the two downstream routes of the bilinear signal serve complementary roles. The improvement is statistically robust, with Coupled-BIM consistently outperforming all other variants on bilinear computation. Furthermore, only Coupled-BIM benefits from increasing the SSM state dimension, while coupling or gate modulation alone show no improvement, establishing the bilinear mechanism as uniquely capable of exploiting larger state spaces.
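The state-input product that distinguishes a bilinear update from a plain linear SSM step can be written out for a toy dense system. This is a generic Koopman-bilinear-form sketch, x' = A x + B u + (N x) u with hypothetical small matrices, not the paper's factorized Mamba-style parameterization:

```python
def bilinear_step(A, B, N, x, u):
    """One step of the bilinear state update x' = A x + B u + (N x) u.
    The last term multiplies the current state by the scalar input, which
    a purely linear (or diagonal) transition cannot express."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n))   # linear transition A x
            + B[i] * u                               # input injection B u
            + u * sum(N[i][j] * x[j] for j in range(n))  # bilinear (N x) u
            for i in range(n)]
```

Setting N = 0 recovers the ordinary linear SSM step; the paper's Coupled-GM variant can be seen as approximating the (N x) u pathway with a gate so the parallel scan still applies.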
[LG-75] Continual Safety Alignment via Gradient-Based Sample Selection
链接: https://arxiv.org/abs/2604.17215
作者: Thong Bach,Dung Nguyen,Thao Minh Le,Truyen Tran
类目: Machine Learning (cs.LG)
*备注: 18 pages
Abstract:Large language models require continuous adaptation to new tasks while preserving safety alignment. However, fine-tuning on even benign data often compromises safety behaviors, including refusal of harmful requests, truthfulness, and commonsense reasoning. We investigate which training samples cause alignment drift through a data-centric lens. Our empirical analysis shows samples contribute unequally: high-gradient samples cause greater safety degradation and drive models toward pretrained distributions, while moderate-gradient samples enable task learning with minimal alignment loss. We propose gradient-based sample selection that filters high-gradient samples during fine-tuning. Across multiple model families on continual domain tasks, our method substantially improves alignment preservation while maintaining competitive task performance, without requiring curated safe data or architectural modifications. Our method is robust across selection ratios, task orderings, and diverse attack benchmarks.
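The selection rule can be illustrated with a toy squared-loss model, where sample i's gradient is (xᵢ·w − yᵢ)·xᵢ, so its norm is |residual| · ‖xᵢ‖. The drop fraction, model, and data below are illustrative stand-ins for the paper's method.

```python
import numpy as np

def per_sample_grad_norms(X, y, w):
    """Per-sample gradient norms for the loss 0.5*(x.w - y)^2:
    grad_i = (x_i.w - y_i) * x_i, so ||grad_i|| = |resid_i| * ||x_i||."""
    resid = X @ w - y
    return np.abs(resid) * np.linalg.norm(X, axis=1)

def select_moderate(X, y, w, drop_frac=0.2):
    """Drop the top `drop_frac` highest-gradient samples (a toy
    analogue of gradient-based sample selection; threshold is illustrative)."""
    norms = per_sample_grad_norms(X, y, w)
    cutoff = np.quantile(norms, 1.0 - drop_frac)
    keep = norms <= cutoff
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)
y[:5] += 10.0                      # a few outliers induce huge gradients
Xk, yk = select_moderate(X, y, np.zeros(5))
```

In the paper's setting the gradients come from an LLM's fine-tuning loss rather than a linear model, but the filtering step has the same shape: score, threshold, keep the moderate-gradient subset.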
[LG-76] Guardrails in Logit Space: Safety Token Regularization for LLM Alignment
链接: https://arxiv.org/abs/2604.17210
作者: Thong Bach,Truyen Tran
类目: Machine Learning (cs.LG)
*备注: 10 pages, 3 figures
Abstract:Fine-tuning well-aligned large language models (LLMs) on new domains often degrades their safety alignment, even when using benign datasets. Existing safety alignment techniques primarily focus on pretraining, leaving fine-tuned models vulnerable to behavioral shifts. In this work, we introduce safety token regularization (STR), a lightweight method designed to preserve safety properties during fine-tuning. Our approach identifies salient tokens from rejection templates of well-aligned models and constrains their associated logits during training, preventing the loss of critical safety behaviors. Unlike reinforcement learning or preference optimization methods, STR requires minimal additional computation and seamlessly integrates with parameter-efficient fine-tuning techniques such as LoRA. Comprehensive experiments demonstrate that our approach achieves safety performance on par with state-of-the-art methods, while preserving task-specific utility and requiring minimal implementation overhead. Furthermore, we show that safety token regularization enhances training stability and overall performance beyond safety considerations alone. This work offers a practical and readily deployable strategy for continual safety alignment in fine-tuned LLMs.
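A toy version of the logit constraint, assuming the safety-token ids and penalty weight are given (both hypothetical here): penalize drift of those tokens' logits away from the aligned reference model, leaving all other logits unconstrained.

```python
import numpy as np

def safety_logit_penalty(logits, ref_logits, safety_ids, lam=1.0):
    """Toy STR-style regularizer: keep the current logits of designated
    safety tokens close to the reference (aligned-model) logits.
    Token ids and lambda are illustrative."""
    diff = logits[:, safety_ids] - ref_logits[:, safety_ids]
    return lam * np.mean(diff ** 2)

vocab, batch = 10, 4
rng = np.random.default_rng(0)
ref = rng.normal(size=(batch, vocab))
cur = ref.copy()
cur[:, 3] += 2.0                   # simulate drift on a safety token (id 3)
penalty = safety_logit_penalty(cur, ref, safety_ids=[3, 7])
```

In training, this penalty would simply be added to the task loss; because it touches only a handful of logit columns, it composes naturally with LoRA-style parameter-efficient updates.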
[LG-77] Do LLM-derived graph priors improve multi-agent coordination?
链接: https://arxiv.org/abs/2604.17191
作者: Nikunj Gupta,Rajgopal Kannan,Viktor Prasanna
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-agent reinforcement learning (MARL) is crucial for AI systems that operate collaboratively in distributed and adversarial settings, particularly in multi-domain operations (MDO). A central challenge in cooperative MARL is determining how agents should coordinate: existing approaches must either hand-specify graph topology, rely on proximity-based heuristics, or learn structure entirely from environment interaction; all of which are brittle, semantically uninformed, or data-intensive. We investigate whether large language models (LLMs) can generate useful coordination graph priors for MARL by using minimal natural language descriptions of agent observations to infer latent coordination patterns. These priors are integrated into MARL algorithms via graph convolutional layers within a graph neural network (GNN)-based pipeline, and evaluated on four cooperative scenarios from the Multi-Agent Particle Environment (MPE) benchmark against baselines spanning the full spectrum of coordination modeling, from independent learners to state-of-the-art graph-based methods. We further ablate across five compact open-source LLMs to assess the sensitivity of prior quality to model choice. Our results provide the first quantitative evidence that LLM-derived graph priors can enhance coordination and adaptability in dynamic multi-agent environments, and demonstrate that models as small as 1.5B parameters are sufficient for effective prior generation.
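How a graph prior typically enters a GNN pipeline, as a sketch: symmetrically normalize the (here assumed LLM-derived) adjacency with self-loops, then propagate agent features through one graph-convolution layer. The three-agent graph, features, and weights are made up for illustration.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution pass over adjacency prior A:
    D^{-1/2} (A + I) D^{-1/2} X W, followed by ReLU."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# three agents; the (hypothetical) prior says agents 0 and 1 coordinate
A = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])
X = np.eye(3)                      # one-hot agent features
W = np.ones((3, 2))                # toy weight matrix
H = gcn_layer(A, X, W)
```

Whether the prior helps then depends entirely on the quality of A, which is the question the paper's ablations over different LLMs probe.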
[LG-78] SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
链接: https://arxiv.org/abs/2604.17184
作者: Yifan Zhang,Jieyu Li,Kexin Pei,Yu Huang,Kevin Leach
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) show promise for automated code repair but often struggle with the complex semantic and structural correctness required. We present SynthFix, a hybrid neural-symbolic framework that improves LLM-based vulnerability repair by unifying code synthesis with compiler-informed symbolic feedback. The core of our approach is an adaptive training strategy where a neural Router Model directs code samples to either Supervised Fine-Tuning (SFT) to learn common patterns or Reward Fine-Tuning (RFT) with symbolic rewards for complex, iterative refinement. On the FixJS (JavaScript) and CodeFlaws (C) benchmarks, SynthFix achieves up to 18% relative improvement in CodeBLEU/CrystalBLEU and 32% in Exact Match over strong SFT and RFT baselines. Our results show that this adaptive combination of training strategies, which mirrors how developers alternate between pattern application and tool feedback, significantly improves the accuracy and efficiency of LLM-based vulnerability repair. Our code and data are available at this https URL.
[LG-79] A Model and Estimation of the Bitcoin Transaction Fee
链接: https://arxiv.org/abs/2604.17183
作者: Daniel Aronoff,Kristian Praizner,Armin Sabouri
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注: 53 pages
Abstract:Bitcoin transaction fees will become more important as the block subsidy declines, but fee formation is hard to study with blockchain data alone because the relevant queueing environment is unobserved. We develop and estimate a structural model of Bitcoin fee choice that treats the mempool as a market for scarce blockspace. We assemble a novel, high-frequency mempool panel from a self-run Bitcoin node that records transaction arrivals, exits, block inclusion, fee-bumping events, and congestion snapshots. We characterize the fee market as a Vickrey-Clarke-Groves mechanism and derive an equation to estimate fees. In the first stage, we estimate a monotone delay technology linking fee-rate priority and network state to expected confirmation delay. We then estimate how fees respond to that delay technology and to transaction characteristics. We find that congestion is the main determinant of delay; that the marginal value of priority is priced in fees, which is increasing in the gradient of confirmation time reduction per movement up in the fee queue; and that transactor choices of RBF, CPFP, and block conditions have economically important effects on fees.
[LG-80] Decomposing the Depth Profile of Fine-Tuning
链接: https://arxiv.org/abs/2604.17177
作者: Jayadev Billa
类目: Machine Learning (cs.LG)
*备注: 25 pages incl. 13 appendix pages. 1 figure, 19 tables
Abstract:Fine-tuning adapts pretrained networks to new objectives. Whether the resulting depth profile of representational change reflects an intrinsic property of the model or the magnitude of gradient flow has not been tested directly. We measure this profile across 240 fine-tuning runs spanning 15 models in four architecture families (encoder and decoder transformers, a state-space model, and an RNN) at scales from 125M to 6.9B parameters. Representational change concentrates in output-proximal layers in every standard-training run except one. We apply a per-layer control that equalizes ||ΔW||/||W|| across layers after each optimizer step. Under this control, the profile persists in some conditions and collapses in others. At 125M–350M, sequential-block architectures (BERT, OPT, GPT-2) retain the slope across tested objectives while parallel-block architectures (Pythia, CodeGen) retain it only for causal-language-modeling objectives. This architectural distinction narrows at 1.3B–1.4B, where both block types show positive equal-step slopes for CausalLM. Under standard training, profile shape is described by two additional axes: steepness tracks a training-free objective distance at initialization, and profile width is dominated by architecture. We treat the locality gradient, the depthwise slope of representational change, as a composite phenomenon whose components are scale-dependent.
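The per-layer control can be sketched as a post-step rescaling that pins the relative update magnitude ||ΔW||/||W|| of each layer to a shared target (the target value below is arbitrary, not the paper's).

```python
import numpy as np

def equalize_update(W_prev, W_new, target_ratio):
    """Rescale an optimizer step so that ||dW||/||W_prev|| equals
    target_ratio; a toy version of the paper's per-layer control."""
    dW = W_new - W_prev
    ratio = np.linalg.norm(dW) / np.linalg.norm(W_prev)
    if ratio == 0.0:
        return W_new                     # no step taken; nothing to rescale
    return W_prev + dW * (target_ratio / ratio)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_big_step = W + 0.5 * rng.normal(size=(4, 4))   # an overly large raw step
W_ctrl = equalize_update(W, W_big_step, target_ratio=0.01)
achieved = np.linalg.norm(W_ctrl - W) / np.linalg.norm(W)
```

Applied identically to every layer after each optimizer step, this removes gradient magnitude as a confound, which is what lets the paper attribute any surviving depth profile to the architecture itself.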
[LG-81] Uncertainty Quantification in PINNs for Turbulent Flows: Bayesian Inference and Repulsive Ensembles
链接: https://arxiv.org/abs/2604.17156
作者: Khemraj Shukla,Zongren Zou,Theo Kaeufer,Michael Triantafyllou,George Em Karniadakis
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Physics-informed neural networks (PINNs) have emerged as a promising framework for solving inverse problems governed by partial differential equations (PDEs), including the reconstruction of turbulent flow fields from sparse data. However, most existing PINN formulations are deterministic and do not provide reliable quantification of epistemic uncertainty, which is critical for ill-posed problems such as data-driven Reynolds-averaged Navier-Stokes (RANS) modeling. In this work, we develop and systematically evaluate a set of probabilistic extensions of PINNs for uncertainty quantification in turbulence modeling. The proposed framework combines (i) Bayesian PINNs with Hamiltonian Monte Carlo sampling and a tempered multi-component likelihood, (ii) Monte Carlo dropout, and (iii) repulsive deep ensembles that enforce diversity in function space. Particular emphasis is placed on the role of ensemble diversity and likelihood tempering in improving uncertainty calibration for PDE-constrained inverse problems. The methods are assessed on a hierarchy of test cases, including the Van der Pol oscillator and turbulent flow past a circular cylinder at Reynolds numbers Re=3,900 (direct numerical simulation data) and Re = 10,000 (experimental particle image velocimetry data). The results demonstrate that Bayesian PINNs provide the most consistent uncertainty estimates across all inferred quantities, while function-space repulsive ensembles offer a computationally efficient approximation with competitive accuracy for primary flow variables. These findings provide quantitative insight into the trade-offs between accuracy, computational cost, and uncertainty calibration in physics-informed learning, and offer practical guidance for uncertainty quantification in data-driven turbulence modeling.
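Of the three probabilistic extensions compared, Monte Carlo dropout is the easiest to sketch: keep dropout active at test time, sample masks, and report the mean and spread of predictions as an uncertainty estimate. The "network" below is a toy linear map with illustrative rates, not a PINN.

```python
import numpy as np

def mc_dropout_predict(x, w, n_samples=200, p_drop=0.5, seed=0):
    """Monte Carlo dropout on a toy linear model: sample dropout
    masks at test time; the spread of predictions is an (epistemic)
    uncertainty proxy. Architecture and rates are illustrative."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        mask = rng.random(w.shape) > p_drop          # keep with prob 1-p
        preds.append(x @ (w * mask) / (1.0 - p_drop))  # inverted scaling
    preds = np.array(preds)
    return preds.mean(), preds.std()

x = np.ones(8)
w = np.full(8, 0.5)
mu, sigma = mc_dropout_predict(x, w)   # mean near x @ w = 4.0, sigma > 0
```

Bayesian PINNs with HMC and repulsive ensembles replace this crude mask-sampling with posterior sampling and diversity-enforced ensemble members, respectively, at correspondingly higher cost.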
[LG-82] SeekerGym: A Benchmark for Reliable Information Seeking
链接: https://arxiv.org/abs/2604.17143
作者: Remy Kim,Minseung Lee,Shuo Li,Osbert Bastani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Despite their substantial successes, AI agents continue to face fundamental challenges in terms of trustworthiness. Consider deep research agents, tasked with searching for information relevant to a given topic: while AI agents can perform effective information retrieval, there is little guarantee regarding the completeness of this information. Gaps in retrieved information can leave biases that mislead users even if the information they are given is correct and relevant. We introduce SeekerGym, a benchmark designed to evaluate the completeness of information retrieved by AI agents. In addition, SeekerGym also measures how well agents quantify their uncertainty in the completeness of their information; if an agent fails to retrieve all relevant information, it is useful for it to at least quantify how much might be missing. At a high level, each task in SeekerGym is a document (e.g., a Wikipedia article), and the AI agent must issue queries to retrieve passages from that document. Intuitively, the document comprehensively covers a topic, so the ability to retrieve its sections directly measures completeness of information retrieval. In addition to Wikipedia, we also consider machine learning survey papers, where the goal is to retrieve relevant sections of a survey paper. We benchmark several models and algorithms; the best approaches retrieve 42.5% of passages on Wikipedia and 29.2% on ML Surveys, leaving substantial room for improvement.
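The completeness score as described reduces to recall over the document's ground-truth passages; a minimal sketch with made-up section names:

```python
def completeness(retrieved, relevant):
    """Fraction of ground-truth passages recovered: the kind of
    completeness score the benchmark reports (names are illustrative)."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant)

# an agent that found 2 of a document's 4 sections
score = completeness(["intro", "methods"],
                     ["intro", "methods", "results", "limits"])
```

The benchmark's second axis, uncertainty calibration, would compare the agent's own estimate of this score against the realized value.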
[LG-83] BOIL: Learning Environment Personalized Information
链接: https://arxiv.org/abs/2604.17137
作者: Rohan Patil,Henrik I. Christensen
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Navigating complex environments poses challenges for multi-agent systems, requiring efficient extraction of insights from limited information. In this paper, we introduce the Blackbox Oracle Information Learning (BOIL) process, a scalable solution for extracting valuable insights from the environment structure. Leveraging the Pagerank algorithm and common information maximization, BOIL facilitates the extraction of information to guide long-term agent behavior applicable to problems such as coverage, patrolling, and stochastic reachability. Through experiments, we demonstrate the efficacy of BOIL in generating strategy distributions conducive to improved performance over extended time horizons, surpassing heuristic approaches in complex environments.
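PageRank itself, which BOIL uses to score environment structure, is a short power iteration over the row-normalized adjacency; the three-node hub graph and damping value below are illustrative.

```python
import numpy as np

def pagerank(A, d=0.85, n_iter=100):
    """Power-iteration PageRank over adjacency A (row i links to j);
    dangling rows fall back to a uniform teleport distribution."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    P = np.divide(A, deg, out=np.full_like(A, 1.0 / n), where=deg > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        r = (1 - d) / n + d * (P.T @ r)
    return r

# a hub node (0) that the other two nodes link to
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
r = pagerank(A)    # r[0] is the largest score
```

In BOIL's setting the scores are not an end in themselves: they feed the information-maximization step that shapes long-horizon agent strategy distributions.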
[LG-84] Live LTL Progress Tracking: Towards Task-Based Exploration
链接: https://arxiv.org/abs/2604.17106
作者: Noel Brindise,Cedric Langbort,Melkior Ornik
类目: Machine Learning (cs.LG)
*备注: 40 pages
Abstract:Motivated by the challenge presented by non-Markovian objectives in reinforcement learning (RL), we present a novel framework to track and represent the progress of autonomous agents through complex, multi-stage tasks. Given a specification in finite linear temporal logic (LTL), the framework establishes a ‘tracking vector’ which updates at each time step in a trajectory rollout. The values of the vector represent the status of the specification as the trajectory develops, assigning true, false, or ‘open’ labels (where ‘open’ is used for indeterminate cases). Applied to an LTL formula tree, the tracking vector can be used to encode detailed information about how a task is executed over a trajectory, providing a potential tool for new performance metrics, diverse exploration, and reward shaping. In this paper, we formally present the framework and algorithm, collectively named Live LTL Progress Tracking, give a simple working example, and demonstrate avenues for its integration into RL models. Future work will apply the framework to problems such as task-space exploration and diverse solution-finding in RL.
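A toy three-valued tracker for two simple patterns, F goal ("eventually goal") and G safe ("always safe"), mirrors the true/false/open labels described above. The formulas and propositions are made up; the paper's algorithm handles arbitrary finite-LTL formula trees.

```python
def track(trace, props):
    """Three-valued progress labels over a finite, still-growing trace.
    'open' marks cases the prefix has not yet determined."""
    labels = {"F goal": "open", "G safe": "open"}
    for state in trace:
        if props["goal"](state) and labels["F goal"] == "open":
            labels["F goal"] = "true"     # F is settled once satisfied
        if not props["safe"](state):
            labels["G safe"] = "false"    # G is settled once violated
    return labels

props = {"goal": lambda s: s == "G",      # reached the goal cell
         "safe": lambda s: s != "X"}      # never entered the hazard cell
lbls = track(["a", "b", "G", "c"], props)
# F goal is settled true; G safe stays open until the rollout ends
```

Stacked over all subformulas of an LTL formula tree, these labels form the tracking vector the paper proposes as a reward-shaping and exploration signal.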
[LG-85] Tree of Concepts: Interpretable Continual Learners in Non-Stationary Clinical Domains
链接: https://arxiv.org/abs/2604.17089
作者: Dongkyu Cho,Xiyue Li,Samrachana Adhikari,Rumi Chunara
类目: Machine Learning (cs.LG)
*备注: 17 pages, 2 figures
Abstract:Continual learning aims to update models under distribution shift without forgetting, yet many high-stakes deployments, such as healthcare, also require interpretability. In practice, models that adapt well (e.g., deep networks) are often opaque, while models that are interpretable (e.g., decision trees) are brittle under shift, making it difficult to achieve both properties simultaneously. In response, we propose Tree of Concepts, an interpretable continual learning framework that uses a shallow decision tree to define a fixed, rule-based concept interface and trains a concept bottleneck model to predict these concepts from raw features. Continual updates act on the concept extractor and label head while keeping concept semantics stable over time, yielding explanations that do not drift across sequential updates. On multiple tabular healthcare benchmarks under continual learning protocols, our method achieves a stronger stability-plasticity trade-off than existing baselines, including replay-enhanced variants. Our results suggest that structured concept interfaces can support continual adaptation while preserving a consistent audit interface in non-stationary, high-stakes domains.
[LG-86] Reference-state System Reliability method for scalable uncertainty quantification of coherent systems
链接: https://arxiv.org/abs/2604.17066
作者: Ji-Eun Byun,Hyeuk Ryu,Junho Song
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注: 36 pages, 13 figures, under review at a peer-reviewed journal
Abstract:Coherent systems are representative of many practical applications, ranging from infrastructure networks to supply chains. Probabilistic evaluation of such systems remains challenging, however, because existing decomposition-based methods scale poorly as the number of components grows. To address this limitation, this study proposes the Reference-state System Reliability (RSR) method. Like existing approaches, RSR characterises the boundary between different system states using reference states in the component-state space. Where it departs from these methods is in how the state space is explored: rather than using reference states to decompose the space into disjoint hypercubes, RSR uses them to classify Monte Carlo samples, making computational cost significantly less sensitive to the number of reference states. To make this classification efficient, samples and reference states are stored as matrices and compared using batched matrix operations, allowing RSR to exploit the advances in high-throughput matrix computing driven by modern machine learning. We demonstrate that RSR evaluates the system-state probability of a graph with 119 nodes and 295 edges within 10 seconds, highlighting its potential for real-time risk assessment of large-scale systems. We further show that RSR scales to problems involving hundreds of thousands of reference states – well beyond the reach of existing methods – and extends naturally to multi-state systems. Nevertheless, when the number of boundary reference states grows exceedingly large, RSR’s convergence slows down, a limitation shared with existing reference-state-based approaches that motivates future research into learning-based representations of system-state boundaries.
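The batched classification step can be sketched with one broadcasted comparison: for a coherent system, a Monte Carlo sample is a known failure if it is component-wise dominated by any stored failure reference state. The component states below are synthetic.

```python
import numpy as np

def classify(samples, fail_refs):
    """Batched RSR-style classification: label a sample 'fail' if it is
    component-wise <= any known failure reference state (valid for
    coherent systems, where degrading components never helps)."""
    # (n_samples, 1, d) <= (1, n_refs, d) -> all over d, then any over refs
    dominated = (samples[:, None, :] <= fail_refs[None, :, :]).all(axis=2)
    return dominated.any(axis=1)

fail_refs = np.array([[0, 0, 1],
                      [1, 0, 0]])     # known failed component-state vectors
samples = np.array([[0, 0, 0],       # below (0,0,1): classified fail
                    [1, 1, 1],       # dominates both refs: not classified
                    [1, 0, 0]])      # equals a ref: classified fail
labels = classify(samples, fail_refs)
```

Because the whole comparison is a single dense tensor operation, adding reference states grows one matrix dimension rather than multiplying hypercube decompositions, which is the scalability claim above.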
[LG-87] When Spike Sparsity Does Not Translate to Deployed Cost: VS-WNO on Jetson Orin Nano
链接: https://arxiv.org/abs/2604.17040
作者: Jason Yoo,Shailesh Garg,Souvik Chakraborty,Syed Bahauddin Alam
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Neural and Evolutionary Computing (cs.NE)
*备注: 4 pages, 2 figures. Submitted to ICONS 2026 (under review)
Abstract:Spiking neural operators are appealing for neuromorphic edge computing because event-driven substrates can, in principle, translate sparse activity into lower latency and energy. Whether that advantage survives deployment on commodity edge-GPU software stacks, however, remains unclear. We study this question on a Jetson Orin Nano 8 GB using five pretrained variable-spiking wavelet neural operator (VS-WNO) checkpoints and five matched dense wavelet neural operator (WNO) checkpoints on the Darcy rectangular benchmark. On a reference-aligned path, VS-WNO exhibits substantial algorithmic sparsity, with mean spike rates decreasing from 54.26% at the first spiking layer to 18.15% at the fourth. On a deployment-style request path, however, this sparsity does not reduce deployed cost: VS-WNO reaches 59.6 ms latency and 228.0 mJ dynamic energy per inference, whereas dense WNO reaches 53.2 ms and 180.7 mJ, while also achieving slightly lower reference-path error (1.77% versus 1.81%). Nsight Systems indicates that the request path remains launch-dominated and dense rather than sparsity-aware: for VS-WNO, cudaLaunchKernel accounts for 81.6% of CUDA API time within the latency window, and dense convolution kernels account for 53.8% of GPU kernel time; dense WNO shows the same pattern. On this Jetson-class GPU stack, spike sparsity is measurable but does not reduce deployed cost because the runtime does not suppress dense work as spike activity decreases.
[LG-88] Convergence theory for Hermite approximations under adaptive coordinate transformations
链接: https://arxiv.org/abs/2604.16975
作者: Yahya Saleh
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Recent work has shown that parameterizing and optimizing coordinate transformations using normalizing flows, i.e., invertible neural networks, can significantly accelerate the convergence of spectral approximations. We present the first error estimates for approximating functions using Hermite expansions composed with adaptive coordinate transformations. Our analysis establishes an equivalence principle: approximating a function f in the span of the transformed basis is equivalent to approximating the pullback of f in the span of Hermite functions. This allows us to leverage the classical approximation theory of Hermite expansions to derive error estimates in transformed coordinates in terms of the regularity of the pullback. We present an example demonstrating how a nonlinear coordinate transformation can enhance the convergence of Hermite expansions. Focusing on smooth functions decaying along the real axis, we construct a monotone transport map that aligns the decay of the target function with the Hermite basis. This guarantees spectral convergence rates for the corresponding Hermite expansion. Our analysis provides theoretical insight into the convergence behavior of adaptive Hermite approximations based on normalizing flows, as recently explored in the computational quantum physics literature.
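The equivalence principle suggests a concrete recipe, sketched here with an assumed monotone map T = arcsinh (the map, target function, degree, and grids are all illustrative): to expand f in the transformed basis {He_k(T(x))}, fit plain Hermite polynomials to the pullback g(y) = f(T⁻¹(y)) and evaluate at T(x).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermefit, hermeval

# pullback sketch: f has slowly decaying tails; T = arcsinh compresses
# them so the pullback g = f o sinh is a smooth, well-behaved target
f = lambda x: 1.0 / (1.0 + x ** 2)
T, T_inv = np.arcsinh, np.sinh

y = np.linspace(-2.0, 2.0, 400)              # transformed-coordinate grid
coeffs = hermefit(y, f(T_inv(y)), deg=10)    # expand the pullback g
x = np.linspace(-3.0, 3.0, 200)
approx = hermeval(T(x), coeffs)              # evaluate in original coords
err = np.max(np.abs(approx - f(x)))          # small for this smooth pullback
```

In the paper's setting the map T is not fixed by hand but parameterized by a normalizing flow and optimized, with the error estimates stated in terms of the regularity of exactly this pullback.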
[LG-89] Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
链接: https://arxiv.org/abs/2604.16957
作者: Sai Vegasena
类目: Machine Learning (cs.LG)
*备注: 8 pages, 8 figures, 8 tables. Code: this https URL and this https URL
Abstract:We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac – a configuration impossible with all existing inference frameworks. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains identical top-1 token predictions to FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing that the attention scale factor – not model size – determines whether angular quantization schemes like PolarQuant succeed or fail, with Gemma 4’s attn_scale=1.0 amplifying directional error 25-100x more than Llama’s standard 1/sqrt(d) scaling.
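On-the-fly int4 KV-cache quantization can be sketched as symmetric per-row quantization with one floating-point scale per row; the layout and scale format here are illustrative, not the actual Metal kernel's packing.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-row int4 quantization of a KV-cache block:
    map values to integers in [-7, 7] with one scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.normal(size=(4, 64)).astype(np.float32)   # toy KV block
q, s = quantize_int4(kv)
err = np.max(np.abs(dequantize_int4(q, s) - kv))   # bounded by step/2
step = float(s.max())
```

The fused kernel's point is that the `dequantize_int4` step above never materializes: attention scores are computed directly against the int4 representation, which is where the memory and bandwidth savings come from.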
[LG-90] L1 Regularization Paths in Linear Models by Parametric Gaussian Message Passing
链接: https://arxiv.org/abs/2604.16949
作者: Yun-Peng Li,Hans-Andrea Loeliger
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
*备注:
Abstract:The paper considers the computation of L1 regularization paths in a state space setting, which includes L1 regularized Kalman smoothing, linear SVM, LASSO, and more. The paper proposes two new algorithms, which are duals of each other; the first algorithm applies to L1 regularization of independent variables while the second applies to L1 regularization of dependent variables. The heart of the proposed algorithms is parametric Gaussian message passing (i.e., Kalman-type forward-backward recursions) in the pertinent factor graphs. The proposed methods are broadly applicable, they (usually) require only matrix multiplications, and their complexity can be competitive with prior methods in some cases.
[LG-91] Covariance-Based Structural Equation Modeling in Small-Sample Settings with p > n
链接: https://arxiv.org/abs/2604.16894
作者: Hiroki Hasegawa,Aoba Tamura,Yukihiko Okada
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 31 pages, 7 figures and 7 tables
Abstract:Factor-based Structural Equation Modeling (SEM) relies on likelihood-based estimation assuming a nonsingular sample covariance matrix, which breaks down in small-sample settings with p > n. To address this, we propose a novel estimation principle that reformulates the covariance structure into self-covariance and cross-covariance components. The resulting framework defines a likelihood-based feasible set combined with a relative error constraint, enabling stable estimation of sign and direction in small-sample settings where p > n. Experiments on synthetic and real-world data show improved stability, particularly in recovering the sign and direction of structural parameters. These results extend covariance-based SEM to small-sample settings and provide practically useful directional information for decision-making.
[LG-92] Towards Fully Parameter-Free Stochastic Optimization: Grid Search with Self-Bounding Analysis
链接: https://arxiv.org/abs/2604.16888
作者: Yuheng Zhao,Yu-Hu Yan,Amit Attia,Tomer Koren,Lijun Zhang,Peng Zhao
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Parameter-free stochastic optimization aims to design algorithms that are agnostic to the underlying problem parameters while still achieving convergence rates competitive with optimally tuned methods. While some parameter-free methods do not require the specific values of the problem parameters, they still rely on prior knowledge, such as lower or upper bounds on them. We refer to such methods as "partially parameter-free". In this work, we target achieving "fully parameter-free" methods, i.e., the algorithmic inputs do not need to satisfy any unverifiable condition related to the true problem parameters. We propose a powerful and general grid search framework, named Grasp, with a novel self-bounding analysis technique that effectively determines the search ranges of parameters, in contrast to previous work. Our method demonstrates generality in: (i) the non-convex case, where we propose a fully parameter-free method that achieves near-optimal convergence rate, up to logarithmic factors; (ii) the convex case, where our parameter-free methods are competitive with strong performance in terms of acceleration and universality. Finally, we contribute a sharper guarantee for the model ensemble, a final step of the grid search framework, under interpolated variance characterization.
[LG-93] OC-Distill: Ontology-aware Contrastive Learning with Cross-Modal Distillation for ICU Risk Prediction
链接: https://arxiv.org/abs/2604.16878
作者: Zhongyuan Liang,Junhyung Jo,Hyang-Jung Lee,Sang Kyu Kim,Irene Y. Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Early prediction of severe clinical deterioration and remaining length of stay can enable timely intervention and better resource allocation in high-acuity settings such as the ICU. This has driven the development of machine learning models that leverage continuous streams of vital signs and other physiological signals for real-time risk prediction. Despite their promise, existing methods have important limitations. Contrastive pretraining treats all patients as equally strong negatives, failing to capture clinically meaningful similarity between patients with related diagnoses. Meanwhile, downstream fine-tuning typically ignores complementary modalities such as clinical notes, which provide rich contextual information unavailable in physiological signals alone. To address these challenges, we propose OC-Distill, a two-stage framework that leverages multimodal supervision during training while requiring only vital signs at inference. In the first stage, we introduce an ontology-aware contrastive objective that exploits the ICD hierarchy to quantify patient similarity and learn clinically grounded representations. In the second stage, we fine-tune the pretrained encoder via cross-modal knowledge distillation, transferring complementary information from clinical notes into the model. Across multiple ICU prediction tasks on MIMIC, OC-Distill demonstrates improved label efficiency and achieves state-of-the-art performance among methods that use only vital signs at inference.
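One simple way to turn a code hierarchy like ICD into a graded similarity weight, shown as an illustration (not necessarily the paper's exact measure): score a pair of patients by the shared prefix length of their diagnosis codes, so related diagnoses are soft negatives rather than full-strength ones.

```python
def icd_similarity(code_a, code_b):
    """Ontology-aware similarity from shared ICD-code prefix length,
    normalised by the longer code; a toy stand-in for hierarchy-based
    weighting of contrastive pairs."""
    shared = 0
    for ca, cb in zip(code_a, code_b):
        if ca != cb:
            break
        shared += 1
    return shared / max(len(code_a), len(code_b))

sim_close = icd_similarity("I21.0", "I21.4")   # two acute-MI subtypes
sim_far = icd_similarity("I21.0", "J45.9")     # cardiac vs. asthma
```

In a contrastive loss, such a weight would down-scale the repulsion between patients with related diagnoses, which is the failure mode of treating all patients as equally strong negatives.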
[LG-94] Untrained CNNs Match Backpropagation at V1: A Systematic RSA Comparison of Four Learning Rules Against Human fMRI
链接: https://arxiv.org/abs/2604.16875
作者: Nils Leutenegger
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 8 pages, 7 figures
Abstract:A central question in computational neuroscience is whether the learning rule used to train a neural network determines how well its internal representations align with those of the human visual cortex. We present a systematic comparison of four learning rules – backpropagation (BP), feedback alignment (FA), predictive coding (PC), and spike-timing-dependent plasticity (STDP) – applied to identical convolutional architectures and evaluated against human fMRI data from the THINGS-fMRI dataset (720 stimuli, 3 subjects) using Representational Similarity Analysis (RSA). Crucially, we include an untrained random-weights baseline that reveals the dominant role of architecture. We find that early visual alignment (V1/V2) is primarily architecture-driven: an untrained CNN achieves rho = 0.071, statistically indistinguishable from BP (rho = 0.072, p = 0.43). Learning rules only differentiate at higher visual areas: BP dominates at LOC/IT, and PC with local Hebbian updates achieves IT alignment statistically indistinguishable from BP (p = 0.18). FA consistently impairs representations below the random baseline at V1. Partial RSA confirms all effects survive pixel-similarity control. These results demonstrate that the relationship between learning rules and cortical alignment is region-specific: architecture determines early alignment, while supervised objectives drive late alignment.
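RSA as used above reduces to correlating representational dissimilarity matrices (RDMs): a minimal sketch with random "voxel" data, using ranked Pearson correlation as a stand-in for Spearman's rho to stay dependency-free.

```python
import numpy as np

def rdm(acts):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between activation patterns for each pair of stimuli."""
    return 1.0 - np.corrcoef(acts)

def rsa_score(acts_a, acts_b):
    """RSA alignment: Spearman rho between the upper triangles of two
    RDMs, computed as Pearson correlation of ranks (no ties assumed)."""
    iu = np.triu_indices(acts_a.shape[0], k=1)
    ra = np.argsort(np.argsort(rdm(acts_a)[iu]))
    rb = np.argsort(np.argsort(rdm(acts_b)[iu]))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
acts = rng.normal(size=(10, 50))            # 10 stimuli x 50 "voxels"
noisy = acts + 0.01 * rng.normal(size=acts.shape)
self_score = rsa_score(acts, noisy)         # near 1 for a tiny perturbation
```

The paper's comparisons plug model-layer activations and fMRI voxel responses into the two arguments, with partial RSA additionally regressing out pixel-level similarity.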
[LG-95] Learning to Trade Like an Expert: Cognitive Fine-Tuning for Stable Financial Reasoning in Language Models
链接: https://arxiv.org/abs/2604.16862
作者: Yuchen Pan,Soung Chang Liew
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures
Abstract:Recent deployments of large language models (LLMs) as autonomous trading agents raise questions about whether financial decision-making competence generalizes beyond specific market patterns and how it should be trained and evaluated in noisy markets lacking ground truth. We propose a structured framework for training and evaluating such models. Central to our approach is a curated, multiple-choice question (MCQ) dataset derived from classic textbooks and historical markets, verified by an AI committee, enriched with structured reasoning traces, and augmented to reduce shortcut learning. To evaluate whether performance on isolated MCQs generalizes to real-world trading, we introduce a two-stage protocol combining test-set evaluation with an MCQ-based chronological trading simulation. Extensive evaluations across market regimes provide statistically robust evidence that open models trained with our framework exhibit competitive, risk-aware behavior over time, outperform open-source baselines, and approach frontier-model performance at smaller scale. We release the dataset and evaluation framework to support further research.
[LG-96] Singularity Formation: Synergy in Theoretical, Numerical, and Machine Learning Approaches
链接: https://arxiv.org/abs/2604.16842
作者: Yixuan Wang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:
Abstract:This thesis develops numerical and theoretical approaches for understanding and analyzing singularity formation in Partial Differential Equations (PDEs). The singularity formation in the Navier-Stokes Equation (NSE) is famously challenging as one of the seven Clay Prize problems. Unlike simpler equations such as the Nonlinear Heat (NLH) or Keller-Segel (KS) equations, where formal asymptotics near blowup are better understood, the intrinsic complexity of NSE makes quantitative analytical treatment difficult, if not impossible, without numerical guidance. Building on numerical insights, we introduce a robust analytical framework to simplify and systematize pen-and-paper proofs for simpler singular PDEs. We present a novel approach based on enforcing vanishing modulation conditions for perturbations around approximate blowup profiles, complemented by singularly weighted energy estimates. We demonstrate the efficacy of our method on PDEs with complicated asymptotics, such as NLH and the Complex Ginzburg-Landau (CGL) equation, and address the open problem of singularity formation in the 3D KS equation with logistic damping. We develop and refine numerical approaches that facilitate deeper insights into singularity formation. We demonstrate that machine learning methods significantly enhance our capability to identify and characterize potential blowup solutions with high precision. We improve on existing Physics-Informed Neural Network (PINN) and Neural Operator (NO) frameworks. Moreover, we present a novel machine learning paradigm, the Kolmogorov-Arnold Network (KAN) architecture, whose interpretability and excellent scaling properties are achieved through learnable nonlinearities. 
[LG-97] Towards Deep Encrypted Training: Low-Latency, Memory-Efficient and High-Throughput Inference for Privacy-Preserving Neural Networks
链接: https://arxiv.org/abs/2604.16834
作者: Nges Brian Njungle,Eric Jahns,Michel A. Kinsy
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 Pages
Abstract:Privacy-preserving machine learning (PPML) has become increasingly important in applications where sensitive data must remain confidential. Homomorphic Encryption (HE) enables computation directly on encrypted data, allowing neural network inference without revealing raw inputs. While prior works have largely focused on inference over a single encrypted image, batch processing of encrypted inputs lags behind, despite being critical for high-throughput inference scenarios and training-oriented workloads. In this work, we address this gap by developing optimized algorithms for batched HE-friendly neural networks. We also introduce a pipeline architecture designed to maximize resource efficiency across different batch sizes. We implemented these algorithms and evaluated our work using HE-friendly ResNet-20 and ResNet-34 models on encrypted CIFAR-10 and CIFAR-100 datasets, respectively. For ResNet-20, our approach achieves an amortized inference time of 8.86 seconds per image when processing a batch of 512 encrypted images, with a peak memory usage of 98.96 GB. These results represent a 1.78x runtime improvement and a 3.74x reduction in memory usage compared to the state-of-the-art design. For the deeper ResNet-34 model, we achieve an amortized inference time of 28.14 seconds per image on a batch of 256 encrypted images using 246.78 GB of RAM.
[LG-98] RF-Inventory: A Large-Scale Dataset for Monotonic Inventory Estimation in Reach and Frequency Advertising SIGIR2026
链接: https://arxiv.org/abs/2604.16821
作者: Yunshan Peng,Ji Wu,Wentao Bai,Yunke Bai,Jinan Pang,Wenzheng Shu,Yanxiang Zeng,Xialong Liu,Peng Jiang
类目: Machine Learning (cs.LG)
*备注: Accepted by SIGIR 2026; 7 pages
Abstract:Reach and Frequency (RF) contract advertising is an important form of widely used brand advertising. Unlike performance advertising, RF contracts emphasize controllable delivery of UV and PV under given targeting, scheduling, and frequency control constraints. In practical systems, advertisers typically need to view the UV and PV change curves at different budget levels in real time when creating an RF contract. However, most existing publicly available advertising datasets are based on independent samples, lacking a characterization of the core structure of the “budget-performance curve” (including UV and PV) in RF advertising. This paper proposes and releases a large-scale RF contract inventory estimation dataset. This dataset uses the RF contract context consisting of “targeting-scheduling-frequency control” as the basic context, providing observations of UV and PV corresponding to multiple budget points within the same context, thus forming a complete budget-performance curve. The dataset explicitly includes a time-window-based frequency control mechanism (e.g., “no more than 3 times within 5 days”) and naturally satisfies the monotonicity and diminishing marginal returns characteristics in the budget and scheduling dimensions. We further derive the theoretical maximum exposure ceiling and use it as a consistency check to evaluate data quality and the feasibility of model predictions. Using this dataset, this paper defines two standardized benchmark tasks: single-point performance prediction and reconstruction of budget-performance curves, and provides a set of reproducible baseline methods and evaluation protocols. This dataset can support systematic research on problems such as structural constraint learning, monotonic regression, and curve consistency modeling in RF contracts. The code for our experiments can be found at this https URL.
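The monotonicity and diminishing-marginal-returns properties the dataset is said to satisfy can double as a simple consistency check on any budget-performance curve. A minimal sketch (the budget and UV values below are made up for illustration, not taken from the dataset):

```python
def is_monotone_concave(budgets, uv):
    """Check that a budget->UV curve is non-decreasing with
    diminishing marginal returns (non-increasing slopes)."""
    pairs = sorted(zip(budgets, uv))
    b = [p[0] for p in pairs]
    v = [p[1] for p in pairs]
    # marginal UV gained per unit budget between consecutive points
    slopes = [(v[i + 1] - v[i]) / (b[i + 1] - b[i]) for i in range(len(b) - 1)]
    monotone = all(s >= 0 for s in slopes)
    diminishing = all(slopes[i + 1] <= slopes[i] + 1e-9
                      for i in range(len(slopes) - 1))
    return monotone and diminishing

# a saturating reach curve passes; a non-monotone one fails
assert is_monotone_concave([1, 2, 4, 8], [100, 180, 300, 420])
assert not is_monotone_concave([1, 2, 3], [100, 90, 120])
```

The same check, together with the theoretical exposure ceiling mentioned above, could flag model predictions that violate the curve structure.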
[LG-99] Continuous Limits of Coupled Flows in Representation Learning
链接: https://arxiv.org/abs/2604.16801
作者: Zilin Li,Weiwei Xu,Xuchun Tong,Xuanbo Lu,Xuanqi Zhao
类目: Machine Learning (cs.LG)
*备注: Preprints
Abstract:While modern representation learning relies heavily on global error signals, decentralized algorithms driven by local interactions offer a fundamental distributed alternative. However, the macroscopic convergence properties of these discrete dynamics on continuous data manifolds remain theoretically unresolved, notoriously suffering from parameter explosion. We bridge this gap by formalizing decentralized learning as a coupled slow-fast dynamical system on Riemannian manifolds. First, using measure-theoretic limits, we prove that the discrete spatial transitions converge uniformly to an overdamped Langevin stochastic differential equation. Second, via the Itô-Poisson resolvent and a stochastic extension of LaSalle’s Invariance Principle, we establish that the representation weights unconditionally avoid divergence and align strictly with the principal eigenspace of the spatial measure. Finally, we construct a joint Lyapunov functional for the fully coupled spatial-parametric flow. This proves global dissipativity and demonstrates that orthogonally disentangled, linearly separable features emerge spontaneously at the stationary limit. Our framework bridges discrete algorithms with continuous stochastic analysis, providing a formal theoretical baseline for decentralized representation learning.
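The overdamped Langevin SDE that the discrete spatial transitions are shown to converge to can be simulated with a plain Euler-Maruyama scheme. A toy sketch for a quadratic potential U(x) = x^2/2 (step size, inverse temperature, and iteration count are illustrative choices, not values from the paper):

```python
import math
import random

def langevin_step(x, grad_u, dt, beta, rng):
    """One Euler-Maruyama step of the overdamped Langevin SDE
    dX = -grad U(X) dt + sqrt(2/beta) dW."""
    noise = rng.gauss(0.0, math.sqrt(2.0 * dt / beta))
    return x - grad_u(x) * dt + noise

rng = random.Random(0)
x = 5.0
for _ in range(5000):
    x = langevin_step(x, lambda z: z, dt=0.01, beta=10.0, rng=rng)
# for U(x) = x^2/2 the stationary law is N(0, 1/beta), so x settles near 0
assert abs(x) < 2.0
```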
[LG-100] LLM-Extracted Covariates for Clinical Causal Inference: Rethinking Integration Strategies
链接: https://arxiv.org/abs/2604.16763
作者: Lei Liu,Jialin Chen,Kathy Macropol
类目: Machine Learning (cs.LG)
*备注:
Abstract:Causal inference from electronic health records (EHR) is fundamentally limited by unmeasured confounding: critical clinical states such as frailty, goals of care, and mental status are documented in free-text notes but absent from structured data. Large language models can extract these latent confounders as interpretable, structured covariates, yet how to effectively integrate them into causal estimation pipelines has not been systematically studied. Using the MIMIC-IV database with 21,859 sepsis patients, we compare seven covariate-integration strategies for estimating the effect of early vasopressor initiation on 28-day mortality, spanning tabular-only baselines, traditional NLP representations, and three LLM-augmented approaches. A central finding is that not all integration strategies are equally effective: directly augmenting the propensity score model with LLM covariates achieves the best performance, while dual-caliper matching on text-derived categorical distances restricts the donor pool and degrades estimation. In semi-synthetic experiments with known ground-truth effects, LLM-augmented propensity scores reduce estimation bias from 0.0143 to 0.0003 relative to tabular-only methods, and this advantage persists under substantial simulated extraction error. On real data, incorporating LLM-extracted covariates reduces the estimated treatment effect from 0.055 to 0.027, directionally consistent with the CLOVERS randomized trial, and a doubly robust estimator yielding 0.019 confirms the robustness of this finding. Our results offer practical guidance on when and how text-derived covariates improve causal estimation in critical care.
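Whatever covariates feed the propensity model, the downstream step is a weighted effect estimate. As a hedged illustration of that step only, here is a self-normalized inverse-propensity-weighted ATE on toy data (the propensity scores are assumed given; the paper's LLM-augmented propensity model is not reproduced here):

```python
def ipw_ate(y, t, e):
    """Self-normalized inverse-propensity-weighted ATE estimate:
    y = outcomes, t = binary treatment, e = propensity scores P(T=1|X)."""
    w1 = [ti / ei for ti, ei in zip(t, e)]
    w0 = [(1 - ti) / (1 - ei) for ti, ei in zip(t, e)]
    mu1 = sum(wi * yi for wi, yi in zip(w1, y)) / sum(w1)
    mu0 = sum(wi * yi for wi, yi in zip(w0, y)) / sum(w0)
    return mu1 - mu0

# toy check: under randomized treatment (e = 0.5) the estimator
# reduces to a plain difference in group means
y = [1, 0, 1, 1, 0, 0]
t = [1, 1, 1, 0, 0, 0]
ate = ipw_ate(y, t, [0.5] * 6)
assert abs(ate - (2 / 3 - 1 / 3)) < 1e-9
```

Better propensity scores (e.g., from models augmented with text-derived confounders) change `e`, and hence the weights, without altering this estimator.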
[LG-101] Neuroscience Inspired Graph Operators Towards Edge-Deployable Virtual Sensing for Irregular Geometries
链接: https://arxiv.org/abs/2604.16722
作者: William Howes,Farid Ahmed,Kazuma Kobayashi,Souvik Chakraborty,Syed Bahauddin Alam
类目: Machine Learning (cs.LG)
*备注: 6 pages, 1 figure, 2 tables
Abstract:Predicting full-field physics through the real-time virtual sensing of engineering systems can enhance limited physical sensors but often requires sparse-to-dense reconstruction, complex multiphysics, and highly irregular geometries, as well as strict latency and energy constraints for edge deployability. Neural operators have been presented as a potential candidate for such applications, but few architectures exist that explicitly address power consumption. Spiking neuron integration can provide a potential solution when integrated on neuromorphic hardware, but existing neuron models result in severe performance degradation in regression-based virtual sensing. To address the performance concerns and edge constraints, we present the Variable Spiking Graph Neural Operator (VS-GNO), which integrates a sophisticated spectral-spatial convolutional analysis with a previously developed Variable Spiking Neuron (VSN) and an energy-error balance loss function. With a non-spiking L_2 error baseline of 0.4%, VS-GNO can provide a reconstruction error of 0.71% with 15% average spiking in its spectral-only form and 1.04% with 24.5% spiking in its entire form. These results position VS-GNO as a promising step towards energy-efficient, edge-deployable neural operators for real-time sparse-to-dense virtual sensing in complex, highly irregular engineering environments.
[LG-102] Chronax: A JAX Library for Univariate Statistical Forecasting and Conformal Inference
链接: https://arxiv.org/abs/2604.16719
作者: Xan Carey,Yash Deshmukh,Aileen Huang,Sunit Jadhav,Omkar Tekawade,Lorraine Yang,Anvesha Tiwary,Gerardo Riano,Amy Greenwald,Denizalp Goktas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time-series forecasting is central to many scientific and industrial domains, such as energy systems, climate modeling, finance, and retail. While forecasting methods have evolved from classical statistical models to automated and neural approaches, the surrounding software ecosystem remains anchored to the traditional Python numerical stack. Existing libraries rely on interpreter-driven execution and object-oriented abstractions, limiting composability, large-scale parallelism, and integration with modern differentiable and accelerator-oriented workflows. Meanwhile, today’s forecasting increasingly involves large collections of heterogeneous time series data, irregular covariates, and frequent retraining, placing new demands on scalability and execution efficiency. JAX offers an alternative paradigm to traditional stateful numerical computation frameworks, based on pure functions and program transformations such as just-in-time compilation and automatic vectorization, enabling end-to-end optimization across CPUs, GPUs, and TPUs. However, this modern paradigm has not yet been fully incorporated into the design of forecasting systems. We introduce Chronax, a JAX-native time-series forecasting library that rethinks forecasting abstractions around functional purity, composable transformations, and accelerator-ready execution. By representing preprocessing, modeling, and multi-horizon prediction as pure JAX functions, Chronax enables scalable multi-series forecasting, model-agnostic conformal uncertainty quantification, and seamless integration with modern machine learning and scientific computing pipelines.
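Chronax itself is JAX-native; as a library-free sketch of the functional idea, a forecaster written as a pure function can be mapped over a batch of series exactly the way `jax.vmap` would vectorize it (simple exponential smoothing here is an illustrative model, not Chronax's API):

```python
def ses_forecast(series, alpha=0.3, horizon=3):
    """Pure function: simple exponential smoothing, flat multi-step
    forecast. No hidden state -- the same inputs always give the same
    outputs, which is what makes jit/vmap-style transforms applicable."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

# "vmap" over a batch of series is just a map when functions are pure
batch = [[10.0, 12.0, 11.0], [100.0, 90.0, 95.0]]
forecasts = [ses_forecast(s) for s in batch]
assert len(forecasts) == 2 and len(forecasts[0]) == 3
```

In JAX proper, the list comprehension would become a single compiled, batched call over a padded array of series.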
[LG-103] How to Approximate Inference with Subtractive Mixture Models AISTATS2026
链接: https://arxiv.org/abs/2604.16714
作者: Lena Zellinger,Nicola Branchini,Lennert De Smet,Víctor Elvira,Nikolay Malkin,Antonio Vergari
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: Accepted version at AISTATS 2026
Abstract:Classical mixture models (MMs) are widely used tractable proposals for approximate inference settings such as variational inference (VI) and importance sampling (IS). Recently, mixture models with negative coefficients, called subtractive mixture models (SMMs), have been proposed as a potentially more expressive alternative. However, how to effectively use SMMs for VI and IS is still an open question as they do not provide latent variable semantics and therefore cannot use sampling schemes for classical MMs. In this work, we study how to circumvent this issue by designing several expectation estimators for IS and learning schemes for VI with SMMs, and we empirically evaluate them for distribution approximation. Finally, we discuss the additional challenges in estimation stability and learning efficiency that they carry and propose ways to overcome them. Code is available at: this https URL.
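One way around the missing latent-variable semantics is to importance-sample from the positive component and carry signed weights. A toy 1D sketch (mixture weights chosen by hand so the density stays non-negative; this illustrates the general idea, not the paper's estimators):

```python
import math
import random

def phi(x, s):
    """Gaussian pdf with mean 0 and standard deviation s."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2 * math.pi))

def p_unnorm(x):
    """Unnormalized subtractive mixture: a wide Gaussian minus a
    narrower one, with weights keeping the density non-negative."""
    return 1.0 * phi(x, 1.0) - 0.3 * phi(x, 0.5)

rng = random.Random(0)
num, den = 0.0, 0.0
for _ in range(100_000):
    x = rng.gauss(0.0, 1.0)          # proposal q = the positive component
    w = p_unnorm(x) / phi(x, 1.0)    # signed importance weight
    num += w * x * x
    den += w
estimate = num / den                  # self-normalized IS for E_p[X^2]
# closed form: (1 - 0.3 * 0.25) / (1 - 0.3) = 0.925 / 0.7
assert abs(estimate - 0.925 / 0.7) < 0.05
```

Here the signed weights stay bounded because the proposal dominates the subtracted component; the paper's concern is precisely the settings where such stability is not automatic.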
[LG-104] Surgical Repair of Insecure Code Generation in LLMs
链接: https://arxiv.org/abs/2604.16697
作者: Gustavo Sandoval,Brendan Dolan-Gavitt,Siddharth Garg
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Large language models write production code, and yet they routinely introduce well-known vulnerabilities. We show that this is not a knowledge deficit: the same models that generate insecure code correctly identify and explain the vulnerability when asked directly, a gap we call the Format-Reliability Gap. Mechanistic analysis reveals the cause: security representations are encoded from the earliest layers but remain computationally inert until the final layer, where format-compliance demands compete with them. Because the failure is localized to a single layer, per-vulnerability steering vectors reduce insecure generation by up to 74% with negligible overhead. The mechanism and the fix generalize across five models, three architecture families, and six vulnerability types, suggesting insecure code generation is an interpretability problem, not a training artifact.
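A steering-vector intervention of this kind amounts to adding a fixed direction to a layer's hidden state at generation time. A minimal sketch with hypothetical activations (real steering vectors would be extracted from model activations, e.g., as mean differences between secure and insecure generations, not hand-written):

```python
import math

def apply_steering(hidden, steer, scale=1.0):
    """Add a steering vector to a layer's hidden state."""
    return [h + scale * s for h, s in zip(hidden, steer)]

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

hidden = [0.2, -0.1, 0.4]
secure_dir = [1.0, 0.0, 0.0]   # hypothetical "secure code" direction
steered = apply_steering(hidden, secure_dir, scale=2.0)
# steering moves the state toward the secure direction
assert cos(steered, secure_dir) > cos(hidden, secure_dir)
```

Because the addition happens only at the identified layer, the overhead per token is one vector add, consistent with the "negligible overhead" claim.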
[LG-105] DARLING: Detection Augmented Reinforcement Learning with Non-Stationary Guarantees
链接: https://arxiv.org/abs/2604.16684
作者: Argyrios Gerogiannis,Yu-Han Huang,Venugopal V. Veeravalli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages, 5 figures
Abstract:We study model-free reinforcement learning (RL) in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. We focus on the piecewise-stationary (PS) setting, where both the reward and transition dynamics can change an arbitrary number of times. We propose Detection Augmented Reinforcement Learning (DARLING), a modular wrapper for PS-RL that applies to both tabular and linear MDPs, without knowledge of the changes. Under certain change-point separation and reachability conditions, DARLING improves the best available dynamic regret bounds in both settings and yields strong empirical performance. We further establish the first minimax lower bounds for PS-RL in tabular and linear MDPs, showing that DARLING is the first nearly optimal algorithm. Experiments on standard benchmarks demonstrate that DARLING consistently surpasses the state-of-the-art methods across diverse non-stationary scenarios.
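The "detection augmented" wrapper pattern pairs a change detector with a learner that restarts on alarm. As an illustration of the detection half only, a one-sided CUSUM on a scalar reward stream (drift and threshold values are illustrative; DARLING's actual test statistic is not reproduced here):

```python
def cusum_detect(stream, drift=0.5, threshold=5.0):
    """One-sided CUSUM: flag a change when the cumulative positive
    deviation from a reference mean exceeds `threshold`."""
    ref = stream[0]
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - ref - drift))
        if s > threshold:
            return i  # index at which a restart would be triggered
    return None

# mean jumps from 0 to 3 at step 20 -> detected shortly after
stream = [0.0] * 20 + [3.0] * 20
cp = cusum_detect(stream)
assert cp is not None and 20 <= cp <= 25
```

On alarm, the wrapped RL algorithm would simply be re-initialized, which is what makes the scheme modular across tabular and linear MDP learners.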
[LG-106] UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels ICLR
链接: https://arxiv.org/abs/2604.16678
作者: Hangke Sui,Yuqing Wang,Minh N Do
类目: Machine Learning (cs.LG)
*备注: 33 pages, 8 figures, 8 tables. Accepted by The Fourteenth International Conference on Learning Representations (ICLR) 2026
Abstract:Contrastive objectives power state-of-the-art multimodal models, but their training remains slow, relying on long stochastic optimization. We propose a Unified Framework for Efficient Contrastive Alignment via Kernels (UniCon), which spans linear and nonlinear encoders as well as one-to-one and many-to-many alignments. At its core, UniCon introduces the contrastive similarity weight matrix S(\gamma) , which enables closed-form global solutions that provably replace minibatch back-propagation with exact updates. Through the lens of reproducing kernel Hilbert spaces (RKHS), UniCon provides a kernelized perspective that unifies contrastive alignment and reveals its connection to spectral methods. To validate the theory, we conduct experiments on synthetic, unimodal, multimodal, and zero-shot tasks, demonstrating that UniCon achieves substantial efficiency gains while preserving generality and strong empirical performance.
[LG-107] FLARE: A Data-Efficient Surrogate for Predicting Displacement Fields in Directed Energy Deposition
链接: https://arxiv.org/abs/2604.16649
作者: Kittipong Thiamchaiboonthawee,Ghadi Nehme,Ram Mohan Telikicherla,Jiawei Tian,Balaji Jayaraman,Vikas Chandan,Dhanushkodi Mariappan,Faez Ahmed
类目: Machine Learning (cs.LG)
*备注: 14 pages, 7 figures
Abstract:Directed energy deposition (DED) produces complex thermo-mechanical responses that can lead to distortion and reduced dimensional accuracy of a manufactured part. Thermo-mechanical finite element simulations are widely used to estimate these effects, but their computational cost and the complexity of accurately capturing DED physics limit their use in design iteration and process optimization. This paper introduces FLARE (Field Prediction via Linear Affine Reconstruction in wEight-space), a data-efficient surrogate modeling framework for predicting post-cooling displacement fields in DED from geometric and process parameters. We develop a predefined-geometry DED simulation workflow using an open-source finite element framework and generate a dataset of simulations with varying geometry, laser power, and deposition velocity. Each simulation provides full-field displacement, stress, strain, and temperature data throughout the manufacturing process. FLARE encodes each simulation as an implicit neural field and regularizes the corresponding neural-network weights so that they follow the affine structure of the input parameter space. This enables prediction of unseen parameter combinations by reconstructing network weights through affine mixing of training examples. On this DED benchmark, the method shows improved accuracy compared to baseline methods in both in-distribution and extrapolation settings. Although the present study focuses on DED displacement prediction, the proposed affine weight-space reconstruction framework offers a promising approach for data-efficient surrogate modeling of physical fields.
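The affine weight-space reconstruction idea can be sketched in miniature: find affine coefficients in the input-parameter space, then apply the same coefficients to the trained networks' weights. With one scalar process parameter and two anchors this reduces to linear interpolation (the weight values and laser powers below are hypothetical):

```python
def affine_coeffs(p_train, p_new):
    """Affine mixing coefficients for a scalar parameter: with two
    anchors this is linear interpolation, coefficients summing to 1."""
    p0, p1 = p_train
    c1 = (p_new - p0) / (p1 - p0)
    return [1.0 - c1, c1]

def mix_weights(weight_sets, coeffs):
    """Reconstruct network weights as the affine combination of the
    training runs' weight vectors."""
    return [sum(c * ws[j] for c, ws in zip(coeffs, weight_sets))
            for j in range(len(weight_sets[0]))]

# two hypothetical trained weight vectors at laser powers 200 W and 400 W
w200 = [0.1, -0.4, 2.0]
w400 = [0.3, -0.2, 1.0]
c = affine_coeffs((200.0, 400.0), 300.0)   # unseen power 300 W
w300 = mix_weights([w200, w400], c)
assert abs(sum(c) - 1.0) < 1e-12
```

The mixed weights then parameterize an implicit neural field evaluated at query points; the regularization during training is what makes this affine structure hold.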
[LG-108] FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time
链接: https://arxiv.org/abs/2604.16648
作者: Montgomery Bohde,Hongxuan Liu,Mrunali Manjrekar,Magdalena Lederbauer,Shuiwang Ji,Runzhong Wang,Connor W. Coley
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:In this work, we present FRIGID, a framework with a novel diffusion language model that generates molecular structures conditioned on mass spectra via intermediate fingerprint representations and determined chemical formulae, training at the scale of hundreds of millions of unlabeled structures. We then demonstrate how forward fragmentation models enable inference-time scaling by identifying spectrum-inconsistent fragments and refining them through targeted remasking and denoising. While FRIGID already achieves strong performance with its diffusion base, inference-time scaling significantly improves its accuracy, surpassing 18% Top-1 accuracy on the challenging MassSpecGym benchmark and tripling the Top-1 accuracy of the leading methods on NPLIB1. Further empirical analyses show that FRIGID exhibits log-linear performance scaling with increasing inference-time compute, opening a promising new direction for continued improvements in de novo structural elucidation. FRIGID code is publicly available at this https URL
[LG-109] Lower Bounds and Proximally Anchored SGD for Non-Convex Minimization Under Unbounded Variance
链接: https://arxiv.org/abs/2604.16620
作者: Arda Fazla,Ege C. Kaya,Antesh Upadhyay,Abolfazl Hashemi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Analysis of Stochastic Gradient Descent (SGD) and its variants typically relies on the assumption of uniformly bounded variance, a condition that frequently fails in practical non-convex settings, such as neural network training, as well as in several elementary optimization settings. While several relaxations are explored in the literature, the Blum-Gladyshev (BG-0) condition, which permits the variance to grow quadratically with distance, has recently been shown to be the weakest condition. However, the study of the oracle complexity of stochastic first-order non-convex optimization under BG-0 has remained underexplored. In this paper, we address this gap and establish information-theoretic lower bounds, proving that finding an \epsilon -stationary point requires \Omega(\epsilon^{-6}) stochastic BG-0 oracle queries for smooth functions and \Omega(\epsilon^{-4}) queries under mean-square smoothness. These limits demonstrate an unavoidable degradation from classical bounded-variance complexities, i.e., \Omega(\epsilon^{-4}) and \Omega(\epsilon^{-3}) for smooth and mean-square smooth cases, respectively. To match these lower bounds, we consider Proximally Anchored STochastic Approximation (PASTA), a unified algorithmic framework that couples Halpern anchoring with Tikhonov regularization to dynamically mitigate the extra variance explosion term permitted by the BG-0 oracle. We prove that PASTA achieves minimax optimal complexities across numerous non-convex regimes, including standard smooth, mean-square smooth, weakly convex, star-convex, and Polyak-Lojasiewicz functions, entirely under an unbounded domain and unbounded stochastic gradients.
[LG-110] FedLLM: A Privacy-Preserving Federated Large Language Model for Explainable Traffic Flow Prediction
链接: https://arxiv.org/abs/2604.16612
作者: Seerat Kaur,Sukhjit Singh Sehra,Dariush Ebrahimi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Traffic prediction plays a central role in intelligent transportation systems (ITS) by supporting real-time decision-making, congestion management, and long-term planning. However, many existing approaches face practical limitations. Most spatio-temporal models are trained on centralized data, rely on numerical representations, and offer limited explainability. Recent Large Language Model (LLM) methods improve reasoning capabilities but typically assume centralized data availability and do not fully capture the distributed and heterogeneous nature of real-world traffic systems. To address these challenges, this study proposes FedLLM (Federated LLM), a privacy-preserving and distributed framework for explainable multi-horizon short-term traffic flow prediction (15-60 minutes). The framework introduces four key contributions: 1) a Composite Selection Score (CSS) for data-driven freeway selection that captures structural diversity across traffic regions; 2) a domain-adapted LLM fine-tuned on structured traffic prompts encoding spatial, temporal, and statistical context; 3) the FedLLM framework, which enables collaborative training across heterogeneous clients while exchanging only lightweight LoRA adapter parameters; and 4) a structured prompt representation that supports contextual reasoning and cross-region generalization. The FedLLM design allows each client to learn from local traffic patterns while contributing to a shared global model through efficient parameter exchange, reducing communication overhead and keeping data private. This setup supports learning under non-IID traffic distributions. Experimental results show that FedLLM achieves improved predictive performance over centralized baselines, while producing structured and explainable outputs. These findings highlight the potential of combining federated learning with domain-adapted LLMs for scalable, privacy-aware, and explainable traffic prediction.
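The communication pattern described, exchanging only LoRA adapter parameters, reduces on the server side to a sample-weighted average over adapter tensors. A minimal FedAvg-style sketch (adapter names, values, and client sizes are hypothetical):

```python
def fedavg_adapters(client_adapters, client_sizes):
    """Aggregate only the lightweight LoRA adapter parameters, weighted
    by each client's local sample count (the base model stays fixed)."""
    total = sum(client_sizes)
    keys = client_adapters[0].keys()
    return {
        k: [sum(n / total * a[k][j]
                for a, n in zip(client_adapters, client_sizes))
            for j in range(len(client_adapters[0][k]))]
        for k in keys
    }

# two hypothetical clients, one adapter tensor flattened to a list
a1 = {"lora_A": [1.0, 2.0]}
a2 = {"lora_A": [3.0, 4.0]}
merged = fedavg_adapters([a1, a2], client_sizes=[100, 300])
# weighted mean: 0.25 * [1, 2] + 0.75 * [3, 4]
assert merged["lora_A"] == [2.5, 3.5]
```

Because only the adapter dictionaries cross the network, the per-round payload is a small fraction of the full model, which is the source of the communication savings claimed above.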
[LG-111] SafeLM: Unified Privacy-Aware Optimization for Trustworthy Federated Large Language Models
链接: https://arxiv.org/abs/2604.16606
作者: Noor Islam S. Mohammad,Uluğ Bayazıt
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are increasingly deployed in high-stakes domains, yet a unified treatment of their overlapping safety challenges remains lacking. We present SafeLM, a framework that jointly addresses four pillars of LLM safety: privacy, security, misinformation, and adversarial robustness. SafeLM combines federated training with gradient smartification and Paillier encryption for privacy, integrates defenses against training and inference-time attacks, employs contrastive grounding with calibrated decoding to reduce hallucinations, and introduces alignment-aware binarized aggregation to enhance robustness while maintaining bounded reconstruction quality. Across benchmarks on factuality, toxicity, and membership inference, SafeLM achieves 98.0% harmful content detection accuracy, reduces communication by 96.9%, and lowers gradient inversion PSNR from 31.7 dB to 15.1 dB. Ablations show that each component contributes independently, whereas their integration yields a strong privacy-utility-efficiency trade-off for deploying trustworthy LLMs.
[LG-112] From User Recognition to Activity Counting: An Identity-Agnostic Approach to Multi-User WiFi Sensing
链接: https://arxiv.org/abs/2604.16572
作者: Kemal Bayik,Olayinka Ajayi,Daniel Roggen,Philip Birch
类目: Machine Learning (cs.LG)
*备注: 9 pages, 5 figures
Abstract:Wi-Fi Channel State Information (CSI) enables device-free human activity recognition, but existing multi-user approaches assume a fixed set of known users during both training and inference. This closed-set assumption limits deployment, as models trained on a specific user set degrade when applied to new individuals or environments. We reformulate multi-user activity recognition as activity counting, estimating how many users perform each activity type at a given time, without associating actions with specific individuals. We propose a pipeline that converts CSI measurements into spatial projections and extracts features using a pretrained convolutional backbone. Two formulations are evaluated on the WiMANS dataset: a conventional identity-dependent model that assigns activities to fixed user slots, and an identity-agnostic model that estimates scene-level activity composition through regression. Under standard evaluation, the identity-agnostic model achieves a mean absolute error of 0.1081 on a 0-5 count scale. Under unseen-user evaluation, the identity-dependent model’s macro-F1 drops from 80.38 to 32.61, while the identity-agnostic model’s counting error remains stable. Feature space analysis confirms that identity-agnostic representations are more user-invariant, which explains their stronger generalization. These results suggest that activity counting provides a more practical and generalizable alternative to identity-dependent formulations for multi-user WiFi sensing.
[LG-113] Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing
链接: https://arxiv.org/abs/2604.16558
作者: Zhixiong Yang,Long Jing,Yao Li,Shuli Cheng,Guoxuan Chi,Chenyu Wen
类目: Machine Learning (cs.LG)
*备注:
Abstract:AIGC has shown remarkable success in CV and NLP, and has recently demonstrated promising potential in the wireless domain. However, significant data imbalance exists across RF modalities, with abundant WiFi data but scarce mmWave and RFID data due to high acquisition cost. This makes it difficult to train high-quality generative models for these data-scarce modalities. In this work, we propose RF-CMG, a diffusion-based cross-modal generative method that leverages data-rich WiFi signals to synthesize high-fidelity RF data for scarce modalities including mmWave and RFID. The key insight of RF-CMG is to decouple cross-modal generation into high-frequency guidance and low-frequency constraint, which respectively learn high-frequency distribution from limited target modality data and preserve the underlying physical structure via low-frequency constraints during generation. On this basis, we introduce a Modality-Guided Embedding (MGE) module to steer the reverse diffusion trajectory toward the target high-frequency distribution, and a Low-Frequency Modality Consistency (LFMC) module to progressively enforce low-frequency constraints to suppress the accumulation of source-modality structural biases during inference, enabling high-quality target-modality generation. Performance comparison with several prevalent generative models demonstrates that RF-CMG achieves superior performance in synthesizing RFID and mmWave signals. We further showcase the effectiveness of the data generated by RF-CMG in gesture recognition tasks, and analyze the impact of the proportion of synthetic data on downstream performance.
[LG-114] Positive-Only Drifting Policy Optimization
链接: https://arxiv.org/abs/2604.16519
作者: Qi Zhang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 12 pages, 6 figures
Abstract:In the field of online reinforcement learning (RL), traditional Gaussian policies and flow-based methods are often constrained by their unimodal expressiveness, complex gradient clipping, or stringent trust-region requirements. Moreover, they all rely on post-hoc penalization of negative samples to correct erroneous actions. This paper introduces Positive-Only Drifting Policy Optimization (PODPO), a likelihood-free and gradient-clipping-free generative approach for online RL. By leveraging the drifting model, PODPO performs policy updates via advantage-weighted local contrastive drifting. Relying solely on positive-advantage samples, it elegantly steers actions toward high-return regions while exploiting the inherent local smoothness of the generative model to enable proactive error prevention. In doing so, PODPO opens a promising new pathway for generative policy learning in online settings.
[LG-115] Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms
链接: https://arxiv.org/abs/2604.16509
作者: Adithya V. Sastry,Bibek Poudel,Weizi Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Many robotic exploration algorithms rely on graph structures for frontier-based exploration and dynamic path planning. However, these graphs grow rapidly, accumulating redundant information and impacting performance. We present a transformer-based framework trained with Proximal Policy Optimization (PPO) to prune these graphs during exploration, limiting their growth and reducing the accumulation of excess information. The framework was evaluated on simulations of a robotic agent using Rapidly Exploring Random Trees (RRT) to carry out frontier-based exploration, where the learned policy reduces graph size by up to 96%. We find preliminary evidence that our framework learns to associate pruning decisions with exploration outcomes despite sparse, delayed reward signals. We also observe that while intelligent pruning achieves a lower rate of exploration compared to baselines, it yields the lowest standard deviation, producing the most consistent exploration across varied environments. To the best of our knowledge, these results are the first suggesting the viability of RL in sparsification of dynamic graphs used in robotic exploration algorithms.
[LG-116] Multi-Label Phase Diagram Prediction in Complex Alloys via Physics-Informed Graph Attention Networks
链接: https://arxiv.org/abs/2604.16468
作者: Eunjeong Park,Amrita Basak
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:Accurate phase equilibria are foundational to alloy design because they encode the underlying thermodynamics governing stability, transformations, and processing windows. However, while the CALculation of Phase Diagrams (CALPHAD) provides a rigorous thermodynamic framework, exploring multicomponent composition-temperature space remains computationally expensive and is typically limited to sparse sections. To enable rapid phase mapping and alloy screening, we propose a physics-informed graph attention network (GAT) that learns element-aware representations and couples them with thermodynamic constraints for multi-label phase-set prediction in the Ag-Bi-Cu-Sn alloy system. Using about 25,000 equilibrium states generated with pycalphad, each composition-temperature point is represented as a four-node element graph with atomic fractions and elemental descriptors as node features. The model combines graph attention, global pooling, and a multilayer perceptron to predict nine relevant phases. To improve physical consistency, we incorporate thermodynamic constraints, applied as training penalties or as an inference-time projection. Across six binary and three ternary subsystems, the baseline model achieves a macro-F1 score of 0.951 and 93.98% exact-set match, while physics-informed decoding improves robustness and raises exact-set accuracy to about 96% on dense in-domain grids. The surrogate also generalizes to an unseen ternary section with 99.32% exact-set accuracy and to a quaternary section at 700 °C with 91.78% accuracy. These results demonstrate that attention-based graph learning coupled with thermodynamic constraint enforcement provides an effective and physically consistent surrogate for high-resolution phase mapping and extrapolative alloy screening.
[LG-117] FairLogue: Evaluating Intersectional Fairness across Clinical Machine Learning Use Cases using the All of Us Research Program
链接: https://arxiv.org/abs/2604.16450
作者: Nick Souligne,Vignesh Subbian
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied across multiple clinical prediction tasks to evaluate disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of selective serotonin reuptake inhibitor associated bleeding events and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to evaluate whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership. These results highlight the importance of intersectional fairness auditing and demonstrate how FairLogue provides deeper insight into bias in clinical machine learning systems.
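The core of an intersectional audit is simply grouping a metric by the cross-product of attributes rather than by each attribute alone. A toy sketch with hypothetical records (true-positive rate as the metric) shows how an intersectional gap can exceed every single-axis gap, the pattern the abstract reports:

```python
from collections import defaultdict

# Hypothetical records: (race, gender, y_true, y_pred)
records = [
    ("A", "F", 1, 1), ("A", "F", 1, 0), ("A", "M", 1, 1), ("A", "M", 1, 1),
    ("B", "F", 1, 0), ("B", "F", 1, 0), ("B", "M", 1, 1), ("B", "M", 1, 0),
]

def tpr_by(key):
    # true-positive rate within each subgroup defined by `key`
    hits, total = defaultdict(int), defaultdict(int)
    for race, gender, y, yhat in records:
        if y == 1:
            k = key(race, gender)
            total[k] += 1
            hits[k] += yhat
    return {k: hits[k] / total[k] for k in total}

single_race   = tpr_by(lambda r, g: r)
single_gender = tpr_by(lambda r, g: g)
intersect     = tpr_by(lambda r, g: (r, g))

gap = lambda d: max(d.values()) - min(d.values())
print(gap(single_race), gap(single_gender), gap(intersect))  # 0.5 0.5 1.0
```

Here each single-axis gap is 0.5, but the (B, F) subgroup has TPR 0 against (A, M) at 1.0, so the intersectional gap is 1.0, invisible to either axis alone.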
[LG-118] FM-CAC: Carbon-Aware Control for Battery-Buffered Edge AI via Time-Series Foundation Models
链接: https://arxiv.org/abs/2604.16448
作者: Kang Yang,Walid A. Hanafy,Prashant Shenoy,Mani Srivastava
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:As edge AI deployments scale to billions of devices running always-on, real-time compound AI pipelines, they represent a massive and largely unmanaged source of energy consumption and carbon emissions. To reduce carbon emissions while maximizing Quality-of-Service (QoS), this paper proposes FM-CAC, a proactive carbon-aware control framework that leverages a battery as an active temporal buffer. By decoupling energy acquisition from energy consumption, FM-CAC can maximize the use of low-carbon energy, substantially reducing carbon emissions. At each control step, FM-CAC jointly optimizes the software pipeline variant, the hardware operating point, and the battery charging and discharging actions. To support this decision process, FM-CAC leverages edge-friendly Time-Series Foundation Models (TSFMs) for zero-shot carbon forecasting and integrates these forecasts into a dynamic programming solver with deferred cost attribution to prevent myopic battery depletion. Results show that FM-CAC reduces carbon emissions by up to 65.6% while maintaining near-maximum inference accuracy.
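The battery-as-temporal-buffer idea reduces to a backward dynamic program over the state of charge, given a carbon-intensity forecast. The sketch below uses a toy trace, a fixed load, and an integer state grid; it is illustrative of the DP step, not FM-CAC's actual solver:

```python
# Carbon-intensity forecast (gCO2/kWh per step), fixed 1 kWh load per step,
# and an integer battery state grid; all values are toy numbers.
carbon = [100, 20, 20, 100]
load, cap = 1, 2

INF = float("inf")
cost = [0.0] * (cap + 1)        # cost-to-go at the horizon is zero for any SOC
for c in reversed(carbon):      # backward induction over time steps
    new = [INF] * (cap + 1)
    for s in range(cap + 1):                 # s = state of charge entering the step
        for grid in range(cap + load + 1):   # kWh drawn from the grid this step
            s2 = s + grid - load             # battery absorbs surplus / covers deficit
            if 0 <= s2 <= cap:
                new[s] = min(new[s], grid * c + cost[s2])
    cost = new

best = cost[0]                  # start with an empty battery
print(best)                     # 160.0: pre-charge in the cheap steps
```

Without the battery the trace costs 100+20+20+100 = 240; the DP buys 3 kWh during the 20 g step and discharges through the expensive steps, paying 160, which is the decoupling of acquisition from consumption the abstract describes.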
[LG-119] Thermal-GEMs: Generalized Models for Building Thermal Dynamics
链接: https://arxiv.org/abs/2604.16443
作者: Felix Koch,Fabian Raisch,Benjamin Tischler
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: The 13th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation 2026
Abstract:Data-driven models for building thermal dynamics are a scalable approach for enabling energy-efficient operation through fault detection diagnosis or advanced control. To obtain accurate models, measurement data from a target building spanning months to years are required. Transfer Learning (TL) mitigates this challenge by employing pretrained models based on single or multiple source buildings. General multi-source TL models promise to outperform single-source TL, but alternative multi-source modeling architectures remain to be explored, and evaluation on real-world data is missing. Moreover, time series foundation models (TSFM) have emerged as candidates for the best-performing general models. Hence, we conduct a first, comprehensive assessment of general modeling approaches for building thermal dynamics, including multi-source TL and TSFMs. Our assessment includes ablations using four state-of-the-art multi-source TL architectures and evaluations on synthetic as well as real-world data. We demonstrate that multi-source TL models are highly effective in accurately modeling buildings in real-world applications, yielding up to 63% lower forecasting errors compared to single-source TL. Moreover, our results suggest a trade-off between multi-source TL models exclusively pretrained with building data and TSFMs pretrained with a multitude of different time series, revealing that data from 16-32 source buildings must be available over 1 year for pretraining multi-source TL models to consistently outperform TSFMs as evaluated using the mean absolute error. These findings provide practical guidance for selecting modeling strategies based on the number of source buildings available for pretraining multi-source TL models.
[LG-120] Fuzzy Encoding-Decoding to Improve Spiking Q-Learning Performance in Autonomous Driving
链接: https://arxiv.org/abs/2604.16436
作者: Aref Ghoreishee,Abhishek Mishra,Lifeng Zhou,John Walsh,Anup Das,Nagarajan Kandasamy
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:This paper develops an end-to-end fuzzy encoder-decoder architecture for enhancing vision-based multi-modal deep spiking Q-networks in autonomous driving. The method addresses two core limitations of spiking reinforcement learning: information loss stemming from the conversion of dense visual inputs into sparse spike trains, and the limited representational capacity of spike-based value functions, which often yields weakly discriminative Q-value estimates. The encoder introduces trainable fuzzy membership functions to generate expressive, population-based spike representations, and the decoder uses a lightweight neural decoder to reconstruct continuous Q-values from spiking outputs. Experiments on the HighwayEnv benchmark show that the proposed architecture substantially improves decision-making accuracy and closes the performance gap between spiking and non-spiking multi-modal Q-networks. The results highlight the potential of this framework for efficient and real-time autonomous driving with spiking neural networks.
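A minimal sketch of the fuzzy population-encoding idea: Gaussian membership functions tile the input range, and each neuron emits Bernoulli spikes at a rate given by its membership value. The paper's membership functions are trainable; fixed centers and width are used here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# K Gaussian membership functions tile [0, 1]; each neuron emits Bernoulli
# spikes with probability equal to its membership value for the input.
K, width = 8, 0.15
centers = np.linspace(0.0, 1.0, K)

def encode(x, steps=20):
    mu = np.exp(-((x - centers) ** 2) / (2 * width ** 2))  # memberships in (0, 1]
    return (rng.random((steps, K)) < mu).astype(int)       # population spike train

spikes = encode(0.4)
rates = spikes.mean(axis=0)
print(spikes.shape)        # (20, 8); neurons with centers near 0.4 fire most
```

Because several overlapping neurons carry graded information about the same input, this population code loses less information than a single-threshold spike conversion, which is the limitation the encoder targets.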
[LG-121] Functional Similarity Metric for Neural Networks: Overcoming Parametric Ambiguity via Activation Region Analysis
链接: https://arxiv.org/abs/2604.16426
作者: Kutomanov Hennadii
类目: Machine Learning (cs.LG)
*备注: 90 pages, 3 figures, 3 tables
Abstract:As modern deep learning architectures grow in complexity, representational ambiguity emerges as a critical barrier to their interpretability and reliable merging. For ReLU networks, identical functional mappings can be achieved through entirely different weight configurations due to algebraic symmetries: neuron permutation and positive diagonal scaling. Consequently, traditional parameter-based comparison methods exhibit extreme instability to slight weight perturbations during training. This paper proposes a mathematically grounded approach to constructing a stable canonical representation of neural networks and a robust functional similarity metric. We shift focus from comparing raw weights to analyzing the topology of neuron activation regions. The algorithm first eliminates scaling ambiguity via L2-normalization of weight vectors with subsequent layer compensation. Next, discrete approximations of activation regions are generated as binary functional signatures evaluated over a data sample. To overcome the computational bottleneck of comparing large binary vectors, we adapt Locality-Sensitive Hashing, specifically MinHash, providing a fast and statistically precise approximation of the Jaccard index. The final cross-network neuron matching is formulated as a linear sum assignment problem solved via the Hungarian algorithm. We demonstrate theoretically and experimentally that our metric mitigates the neuron “flickering” effect and exhibits exceptional robustness to minor weight perturbations. This framework provides a solid foundation for model merging, transfer learning, objective assessment during pruning, and Explainable AI paradigms.
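The MinHash step can be sketched directly: the fraction of matching per-permutation minima of two sets is an unbiased estimate of their Jaccard index. Set sizes and the hash count below are illustrative stand-ins for real activation signatures:

```python
import random

random.seed(0)

# Two activation signatures: the sets of probe inputs on which a neuron fires.
a = set(range(0, 600))
b = set(range(300, 900))
true_jaccard = len(a & b) / len(a | b)       # 300 / 900 = 1/3

# 256 random permutations of the probe universe; each signature keeps only
# the minimum permuted index of its members.
N_HASH, UNIVERSE = 256, 1000
perms = [random.sample(range(UNIVERSE), UNIVERSE) for _ in range(N_HASH)]

def minhash(members):
    return [min(perm[i] for i in members) for perm in perms]

sig_a, sig_b = minhash(a), minhash(b)
est = sum(x == y for x, y in zip(sig_a, sig_b)) / N_HASH
print(round(true_jaccard, 3), round(est, 3))  # estimate concentrates near 1/3
```

Comparing 256 integers per neuron instead of full binary vectors is what makes the subsequent Hungarian matching over all neuron pairs tractable.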
[LG-122] Method for Aggregating Unstructured Data Using Large Language Models ICMLC2026
链接: https://arxiv.org/abs/2604.16425
作者: Vsevolod Lazebnyi,Natalia Tereshkina,Maria Shabarina,Dmitriy Fedorov
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures. Preprint. Accepted for ICMLC 2026
Abstract:This paper presents a method for the automated collection and aggregation of unstructured data from diverse web sources, utilizing Large Language Models (LLMs). The primary challenge with existing techniques is their instability when the structure of webpages changes, their limited support for dynamically loaded content during information collection, and the requirement for labor-intensive manual design of data pre-processing pipelines. The proposed algorithm integrates hybrid web scraping (Goose3 for static pages and Selenium+WebDriver for dynamic ones), data storage in a non-relational MongoDB database management system (DBMS), and intelligent extraction and normalization of information using LLMs into a predetermined JSON schema. A key scientific contribution of this study is a two-stage verification process for the generated data, designed to eliminate potential hallucinations by comparing the embeddings of multiple LLM outputs obtained with different temperature parameter values, combined with formalized rules for monitoring data consistency and integrity. The experimental findings indicate a high level of accuracy in the completion of key fields, as well as the robustness of the proposed methodology to changes in web page structures. This makes it suitable for use in tasks such as news content aggregation, monitoring, and log analysis in near real-time mode, with the capacity to scale rapidly in terms of the number of sources.
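The embedding-comparison stage of the verification process reduces to an agreement test over repeated generations. The vectors and threshold below are hypothetical placeholders for real LLM-output embeddings, just to show the shape of the check:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def consistent(embeddings, threshold=0.9):
    # accept a generated field only if every pair of regenerations agrees
    return all(cosine(embeddings[i], embeddings[j]) >= threshold
               for i in range(len(embeddings))
               for j in range(i + 1, len(embeddings)))

# Toy "embeddings" of three generations at different temperatures:
# the first two agree, the third diverges (a would-be hallucination).
runs = [[0.9, 0.1, 0.0], [0.85, 0.15, 0.05], [0.0, 0.1, 0.95]]
print(consistent(runs[:2]), consistent(runs))  # True False
```

A field that survives this agreement test then passes to the formal consistency rules mentioned in the abstract; a field that fails it is regenerated or flagged.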
[LG-123] Cooperative Coevolution versus Monolithic Evolutionary Search for Semi-Supervised Tabular Classification
链接: https://arxiv.org/abs/2604.16412
作者: Jamal Toutouh
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Accepted to be presented during the Genetic and Evolutionary Computation Conference 2026. July 13–17, 2026. San José, Costa Rica
Abstract:This paper studies semi-supervised tabular classification in the extreme low-label regime using lightweight base learners. The paper proposes a cooperative coevolutionary method (CC-SSL) that evolves (i) two feature-subset views and (ii) a pseudo-labeling policy, and compares it to a matched monolithic evolutionary baseline (EA-SSL) and three lightweight SSL baselines. Experiments on 25 OpenML datasets with labeled fractions 1%,5%,10% evaluate test MacroF1 and accuracy, together with evolutionary and pseudo-label diagnostics. CC-SSL and EA-SSL achieve higher median test MacroF1 than the lightweight baselines, with the largest separations at 1% labeled data. Most CC-SSL vs. EA-SSL comparisons are statistical draws on final test performance. EA-SSL shows higher best-so-far fitness and higher diversity during search, while time-to-target is comparable and generations-to-target favors EA-SSL in several multiclass settings. Pseudo-label volume, ProbeDrop, and validation optimism show no significant differences between CC-SSL and EA-SSL under the shared protocol.
[LG-124] CGCMA: Conditionally-Gated Cross-Modal Attention for Event-Conditioned Asynchronous Fusion
链接: https://arxiv.org/abs/2604.16411
作者: Yunxiang Guo
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study asynchronous alignment, a first-class multimodal learning setting in which a dense primary stream must be fused with sporadic external context whose value depends on when it arrives. Unlike standard multimodal benchmarks that assume structural synchrony, this setting requires models to reason explicitly about freshness and trust. We focus on the event-conditioned case in which continuous market states are paired with delayed web intelligence, and we use high-frequency cryptocurrency markets only as a timestamped, high-noise stress test for this broader problem. We propose CGCMA (Conditionally-Gated Cross-Modal Attention), whose central design principle is to separate text-conditioned grounding from lag-aware trust control. Text first attends over price sequences to identify event-relevant market states, after which a conditional gate uses modality agreement, web features, and lag \tau_{\mathrm{lag}} to regulate residual injection and fall back toward unimodal prediction when external context is stale or contradictory. We introduce CMI (Crypto Market Intelligence), an asynchronous evaluation corpus with 27,914 real-news samples pairing high-frequency price sequences with lagged web intelligence. On the current short real-news corpus, CGCMA attains the highest mean downstream Sharpe ratio ( +0.449 \pm 0.257 ) among the evaluated baselines under a shared zero-cost threshold-trading evaluation on news-available bars. Additional controls show that the gain is not explained by web scalars alone and is not recovered by simple freshness heuristics. The resulting evidence supports problem validity and a promising asynchronous multimodal gain on this stress-test setting.
[LG-125] Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP
链接: https://arxiv.org/abs/2604.16410
作者: Ruize Xia
类目: Machine Learning (cs.LG)
*备注:
Abstract:CLIP adaptation can improve in-domain accuracy while degrading out-of-domain transfer, but comparisons between Full Fine-Tuning (Full FT) and LoRA are often confounded by different learning-rate conventions. We study how adaptation method and optimization scale jointly shape attention drift and transfer retention in CLIP using a controlled matched-learning-rate comparison of Full FT and LoRA. The completed matrix contains 80 runs on CLIP ViT-B/32 across EuroSAT and Oxford-IIIT Pets, spanning four shared learning rates ( 10^-6 , 5\times10^-6 , 10^-5 , 5\times10^-5 ) and five seeds, and evaluates attention-drift metrics, best validation accuracy, and adapter-aware CIFAR-100 zero-shot accuracy. Learning rate strongly modulates structural change: on EuroSAT, Full FT moves from mild entropy broadening at 10^-6 to marked contraction at 5\times10^-5 , whereas LoRA remains entropy-positive across the full matched grid. At matched learning rates, LoRA preserves substantially more zero-shot transfer than Full FT, averaging 45.13% versus 11.28% CIFAR-100 accuracy on EuroSAT and 58.01% versus 8.54% on Pets. Oxford-IIIT Pets also reveals a regime effect: low-learning-rate LoRA underfits in-domain, so method-only averages can obscure when LoRA becomes competitive. Supporting rollout, patch-to-patch, and CKA analyses are directionally consistent with the controlled matrix. Overall, matched-learning-rate evaluation materially changes the interpretation of Full FT versus LoRA, and attention drift is most useful as a descriptive diagnostic of representation preservation rather than a causal explanation of transfer behavior.
[LG-126] Decoding AI Tutor Effects for Educational Measurement: Temporal Multi-Outcome and Behavior-Cognitive Analysis
链接: https://arxiv.org/abs/2604.16366
作者: Yiyao Yang,Yasemin Gulbahar
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 25 Pages, 9 Figures
Abstract:Artificial intelligence (AI) tutors have become increasingly popular in learning environments. In this study, we propose an AI agent prototype framework for exploring AI-assisted learning with temporal interaction patterns, multiple outcomes analysis, and behavioral-cognitive learner profiling. Based on three research questions, this study aims to investigate whether early interaction patterns can predict later performance and trust, how multiple outcomes can be traded off with different AI tutor feedback conditions, and if learner profiles can be identified with behavioral and cognitive indicators. An AI tutor agent has been developed to provide various feedback forms to learners, including hints, explanations, examples, and code. A neural policy model and a stochastic simulation framework are used to produce artificial student-AI tutor interaction records, which include response time, attempts, hint requests, correctness, quiz results, improvement, satisfaction, and trust. Temporal features are used to predict later correctness and trust with early interaction patterns, and clustering methods are used to find learner profiles. The results showed that early interaction patterns were predictive of later performance and trust, that student behavior changed over time with AI-based tutoring, and that latent student profiles could be identified based on their behavioral and cognitive differences.
[LG-127] BASIS: Balanced Activation Sketching with Invariant Scalars for “Ghost Backpropagation”
链接: https://arxiv.org/abs/2604.16324
作者: Vladimer Khasia
类目: Machine Learning (cs.LG)
*备注:
Abstract:The activation memory required for exact backpropagation scales linearly with network depth, context length, and feature dimensionality, forming an O(L * BN ) spatial bottleneck (where B is the sequence-batch cardinality and N is the feature dimension). This constraint historically throttles the scaling of deep neural networks. While randomized automatic differentiation attempts to mitigate this, it historically suffers from catastrophic variance. In this paper, we introduce BASIS (Balanced Activation Sketching with Invariant Scalars), an efficient backpropagation algorithm that fully decouples activation memory from the batch and sequence dimensions. BASIS propagates the exact error signal (dX) to preserve flawless gradient flow, but computes the weight updates (dW) using massively compressed rank-R tensors. To solve the foundational instability of sketched gradients, we propose two novel mechanisms: Balanced Hashing, which strictly eliminates off-diagonal collision variance, and Invariant Scalars, a principled bias-variance tradeoff that deterministically preserves the exact continuous energy norm of the spatial geometry. Theoretically, BASIS reduces activation memory to O(L * RN ) and heavily decreases the backward pass matrix-multiplication footprint. Empirically, training a GPT architecture for 50,000 steps validates our theoretical guarantees: at R = 32, BASIS achieves parity with (and marginally outperforms) exact backpropagation validation loss (6.575 vs. 6.616), acting as an implicit regularizer. Remarkably, the stabilized magnitude trajectory allows the model to converge smoothly even under extreme spatial compression (R = 1), proving the extreme robustness of the estimator. The code is available at this https URL
[LG-128] Revisiting Active Sequential Prediction-Powered Mean Estimation ICLR2026
链接: https://arxiv.org/abs/2604.18569
作者: Maria-Eleni Sfyraki,Jun-Kun Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2026
Abstract:In this work, we revisit the problem of active sequential prediction-powered mean estimation, where at each round one must decide the query probability of the ground-truth label upon observing the covariates of a sample. Furthermore, if the label is not queried, the prediction from a machine learning model is used instead. Prior work proposed an elegant scheme that determines the query probability by combining an uncertainty-based suggestion with a constant probability that encodes a soft constraint on the query probability. We explored different values of the mixing parameter and observed an intriguing empirical pattern: the smallest confidence width tends to occur when the weight on the constant probability is close to one, thereby reducing the influence of the uncertainty-based component. Motivated by this observation, we develop a non-asymptotic analysis of the estimator and establish a data-dependent bound on its confidence interval. Our analysis further suggests that when a no-regret learning approach is used to determine the query probability and control this bound, the query probability converges to the constraint of the max value of the query probability when it is chosen obliviously to the current covariates. We also conduct simulations that corroborate these theoretical findings.
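The estimator under study combines model predictions with inverse-probability-weighted label corrections. A sketch with a constant query probability and synthetic, deliberately biased predictions (all numbers are illustrative):

```python
import random

random.seed(0)

n, pi = 20000, 0.3                # sample size, constant query probability
Y = [random.gauss(2.0, 1.0) for _ in range(n)]        # ground-truth labels
F = [y + 0.5 + random.gauss(0.0, 0.2) for y in Y]     # deliberately biased predictions

est = 0.0
for y, fy in zip(Y, F):
    queried = random.random() < pi                    # query the true label w.p. pi
    correction = (y - fy) / pi if queried else 0.0    # inverse-probability weighting
    est += (fy + correction) / n

true_mean = sum(Y) / n
print(round(est, 2), round(true_mean, 2))  # est tracks the true mean despite the +0.5 bias
```

The correction term has expectation equal to the model's bias, so the estimator stays unbiased for any query probability; what the choice of pi (and the uncertainty-based component the paper revisits) controls is the variance, and hence the confidence width.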
[LG-129] ConforNets: Latents-Based Conformational Control in OpenFold3
链接: https://arxiv.org/abs/2604.18559
作者: Minji Lee,Colin Kalicki,Minkyu Jeon,Aymen Qabel,Alisia Fadini,Mohammed AlQuraishi
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Models from the AlphaFold (AF) family reliably predict one dominant conformation for most well-ordered proteins but struggle to capture biologically relevant alternate states. Several efforts have focused on eliciting greater conformational variability through ad hoc inference-time perturbations of AF models or their inputs. Despite their progress, these approaches remain inefficient and fail to consistently recover major conformational modes. Here, we investigate both the optimal location and manner-of-operation for perturbing latent representations in the AF3 architecture. We distill our findings in ConforNets: channel-wise affine transforms of the pre-Pairformer pair latents. Unlike previous methods, ConforNets globally modulate AF3 representations, making them reusable across proteins. On unsupervised generation of alternate states, ConforNets achieve state-of-the-art success rates on all existing multi-state benchmarks. On the novel supervised task of conformational transfer, ConforNets trained on one source protein can induce a conserved conformational change across a protein family. Collectively, these results introduce a mechanism for conformational control in AF3-based models.
[LG-130] Duality for the Adversarial Total Variation
链接: https://arxiv.org/abs/2604.18540
作者: Leon Bungert,Lucas Schmitt
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Functional Analysis (math.FA); Optimization and Control (math.OC)
*备注: 39 pages
Abstract:Adversarial training of binary classifiers can be reformulated as regularized risk minimization involving a nonlocal total variation. Building on this perspective, we establish a characterization of the subdifferential of this total variation using duality techniques. To achieve this, we derive a dual representation of the nonlocal total variation and a related integration by parts formula, involving a nonlocal gradient and divergence. We provide such duality statements both in the space of continuous functions vanishing at infinity on proper metric spaces and for the space of essentially bounded functions on Euclidean domains. Furthermore, under some additional conditions we provide characterizations of the subdifferential in these settings.
[LG-131] Random Matrix Theory of Early-Stopped Gradient Flow: A Transient BBP Scenario
链接: https://arxiv.org/abs/2604.18450
作者: Florentin Coeurdoux,Grégoire Ferré,Jean-Philippe Bouchaud
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Empirical studies of trained models often report a transient regime in which signal is detectable in a finite gradient descent time window before overfitting dominates. We provide an analytically tractable random-matrix model that reproduces this phenomenon for gradient flow in a linear teacher–student setting. In this framework, learning occurs when an isolated eigenvalue separates from a noisy bulk, before eventually disappearing in the overfitting regime. The key ingredient is anisotropy in the input covariance, which induces fast and slow directions in the learning dynamics. In a two-block covariance model, we derive the full time-dependent bulk spectrum of the symmetrized weight matrix through a 2\times 2 Dyson equation, and we obtain an explicit outlier condition for a rank-one teacher via a rank-two determinant formula. This yields a transient Baik-Ben Arous-Péché (BBP) transition: depending on signal strength and covariance anisotropy, the teacher spike may never emerge, emerge and persist, or emerge only during an intermediate time interval before being reabsorbed into the bulk. We map the corresponding phase diagrams and validate the theory against finite-size simulations. Our results provide a minimal solvable mechanism for early stopping as a transient spectral effect driven by anisotropy and noise.
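The static BBP phenomenon that the paper makes time-dependent is easy to reproduce numerically: in a rank-one spiked Wigner model the top eigenvalue separates from the bulk edge at 2 only when the spike strength theta exceeds 1, landing near theta + 1/theta. Matrix size here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Spiked Wigner model: H(theta) = theta * v v^T + W, with W a Wigner matrix
# whose bulk spectrum fills [-2, 2].
n = 1000
W = rng.normal(size=(n, n))
W = (W + W.T) / np.sqrt(2 * n)          # symmetrize; entry variance 1/n
v = np.ones(n) / np.sqrt(n)             # unit-norm spike direction

tops = {}
for theta in (0.5, 2.0):
    tops[theta] = np.linalg.eigvalsh(W + theta * np.outer(v, v))[-1]
    print(theta, round(tops[theta], 2))  # ~2.0 below threshold, ~2.5 above
```

In the paper's gradient-flow setting, anisotropy makes the effective spike strength time-dependent, so this outlier can emerge and later be reabsorbed, which is the transient BBP scenario of the title.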
[LG-132] Spectral bandits for smooth graph functions ICML2014
链接: https://arxiv.org/abs/2604.18420
作者: Michal Valko,Rémi Munos,Branislav Kveton,Tomáš Kocák
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published in International Conference on Machine Learning (ICML 2014)
Abstract:Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
[LG-133] Overcoming Selection Bias in Statistical Studies With Amortized Bayesian Inference
链接: https://arxiv.org/abs/2604.18319
作者: Jonas Arruda,Sophie Chervet,Paula Staudt,Andreas Wieser,Michael Hoelscher,Isabelle Sermet-Gaudelus,Nadine Binder,Lulla Opatowski,Jan Hasenauer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities of interest, leading to systematic distortions in estimation and uncertainty quantification. For example, in epidemiological or survey settings, individuals with certain outcomes may be more likely to be included, resulting in biased prevalence estimates with potentially substantial downstream impact. Classical corrections, such as inverse-probability weighting or explicit likelihood-based models of the selection process, rely on tractable likelihoods, which limits their applicability in complex stochastic models with latent dynamics or high-dimensional structure. Simulation-based inference enables Bayesian analysis without tractable likelihoods but typically assumes missingness at random and thus fails when selection depends on unobserved outcomes or covariates. Here, we develop a bias-aware simulation-based inference framework that explicitly incorporates selection into neural posterior estimation. By embedding the selection mechanism directly into the generative simulator, the approach enables amortized Bayesian inference without requiring tractable likelihoods. This recasting of selection bias as part of the simulation process allows us to both obtain debiased estimates and explicitly test for the presence of bias. The framework integrates diagnostics to detect discrepancies between simulated and observed data and to assess posterior calibration. The method recovers well-calibrated posterior distributions across three statistical applications with diverse selection mechanisms, including settings in which likelihood-based approaches yield biased estimates. These results recast the correction of selection bias as a simulation problem and establish simulation-based inference as a practical and testable strategy for parameter estimation under selection bias.
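The mechanism is easy to see in a scalar toy model: when inclusion probability depends on the outcome, the naive prevalence is biased, but embedding the (here, known) selection probabilities in the generative model lets the bias be inverted. All numbers are hypothetical, and the closed-form inversion below stands in for what the paper does with amortized neural posterior estimation:

```python
import random

random.seed(0)

p_true = 0.1                      # true prevalence
p_sel = {1: 0.9, 0: 0.3}          # inclusion probability given the outcome

sample = []
for _ in range(200_000):
    y = 1 if random.random() < p_true else 0
    if random.random() < p_sel[y]:        # outcome-dependent selection
        sample.append(y)

naive = sum(sample) / len(sample)

# Invert the selection model: observed = p*0.9 / (p*0.9 + (1-p)*0.3); solve for p.
debiased = naive * p_sel[0] / (p_sel[1] - naive * (p_sel[1] - p_sel[0]))
print(round(naive, 3), round(debiased, 3))   # ~0.25 (biased) vs ~0.10 (corrected)
```

The framework's contribution is to make this inversion implicit: the simulator generates selected samples, so the learned posterior automatically accounts for the selection mechanism even when no closed form exists.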
[LG-134] Predictive Modeling of Natural Medicinal Compounds for Alzheimer Disease Using Cheminformatics
链接: https://arxiv.org/abs/2604.18316
作者: Hafiza Syeda Yusra Tirmizi,Syed Ibad Hasnain,Muhammad Faris,Rabail Khowaja,Saad Abdullah
类目: Other Quantitative Biology (q-bio.OT); Machine Learning (cs.LG)
*备注: Medicinteknikdagarna 2025
Abstract:The most common cause of dementia is Alzheimer disease, a progressive neurodegenerative disorder affecting older adults that gradually impairs memory, cognition, and behavior. It is characterized by the accumulation of abnormal proteins in the brain, including amyloid-beta plaques and neurofibrillary tangles of tau protein, which disrupt neuronal communication and lead to neuronal death. Early manifestations typically include mild memory impairment and reduced ability to acquire new information. As the disease progresses, patients experience severe cognitive decline, loss of independence, and significant personality and behavioral changes. Although the exact etiology of Alzheimer disease remains unclear, factors such as age, genetic predisposition, lifestyle, and cardiovascular health contribute to its development. While no definitive cure exists, early diagnosis, pharmacological interventions, and supportive care can slow progression and improve quality of life. This study presents a predictive cheminformatics-based model for identifying natural medicinal compounds with potential therapeutic efficacy against Alzheimer disease. The model functions as a drug screening system utilizing molecular descriptors and machine learning to detect anti-Alzheimer activity. More than 7,000 compounds from ChEBI, SynSysNet, and INDOFINE were preprocessed using Open Babel and analyzed with Dragon descriptors. A Random Forest classifier trained on approved treatments achieved moderate performance, with precision of 0.5970 and recall of 0.6590, identifying 73 candidate compounds. Key descriptors included atomic polarizability, bond multiplicity, and non-hydrogen bond counts. These findings demonstrate the value of cheminformatics in early-stage drug discovery for Alzheimer disease.
[LG-135] Symmetry Guarantees Statistic Recovery in Variational Inference
链接: https://arxiv.org/abs/2604.18310
作者: Daniel Marks,Dario Paccagnan,Mark van der Wilk
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 19 pages, 2 figures
Abstract:Variational inference (VI) is a central tool in modern machine learning, used to approximate an intractable target density by optimising over a tractable family of distributions. As the variational family cannot typically represent the target exactly, guarantees on the quality of the resulting approximation are crucial for understanding which of its properties VI can faithfully capture. Recent work has identified instances in which symmetries of the target and the variational family enable the recovery of certain statistics, even under model misspecification. However, these guarantees are inherently problem-specific and offer little insight into the fundamental mechanism by which symmetry forces statistic recovery. In this paper, we overcome this limitation by developing a general theory of symmetry-induced statistic recovery in variational inference. First, we characterise when variational minimisers inherit the symmetries of the target and establish conditions under which these pin down identifiable statistics. Second, we unify existing results by showing that previously known statistic recovery guarantees in location-scale families arise as special cases of our theory. Third, we apply our framework to distributions on the sphere to obtain novel guarantees for directional statistics in von Mises-Fisher families. Together, these results provide a modular blueprint for deriving new recovery guarantees for VI in a broad range of symmetry settings.
[LG-136] Block-encodings as programming abstractions: The Eclipse Qrisp BlockEncoding Interface
链接: https://arxiv.org/abs/2604.18276
作者: Matic Petrič,René Zander
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Mathematical Software (cs.MS); Programming Languages (cs.PL)
*备注: 11 pages
Abstract:Block-encoding is a foundational technique in modern quantum algorithms, enabling the implementation of non-unitary operations by embedding them into larger unitary matrices. While theoretically powerful and essential for advanced protocols like Quantum Singular Value Transformation (QSVT) and Quantum Signal Processing (QSP), the generation of compilable implementations of block-encodings poses a formidable challenge. This work presents the BlockEncoding interface within the Eclipse Qrisp framework, establishing block-encodings as a high-level programming abstraction accessible to a broad scientific audience. Serving as both a technical framework introduction and a hands-on tutorial, this paper explicitly details key underlying concepts abstracted away by the interface, such as block-encoding construction and qubitization, and their practical integration into methods like the Childs-Kothari-Somma (CKS) algorithm. We outline the interface’s software architecture, encompassing constructors, core utilities, arithmetic composition, and algorithmic applications such as matrix inversion, polynomial filtering, and Hamiltonian simulation. Through code examples, we demonstrate how this interface simplifies both the practical realization of advanced quantum algorithms and their associated resource estimation.
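The core construction the interface abstracts away, embedding a subnormalized matrix into a larger unitary, can be illustrated numerically. The following NumPy sketch is not Qrisp's actual API (function names here are our own); it builds the standard unitary dilation U = [[A/s, sqrt(I-AA†)], [sqrt(I-A†A), -A†/s]] for a matrix A with spectral norm s:

```python
import numpy as np

def psd_sqrt(M):
    # matrix square root of a Hermitian PSD matrix via eigendecomposition
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues from roundoff
    return (V * np.sqrt(w)) @ V.conj().T

def block_encode(A):
    """Return (U, s) where U is unitary and its top-left block equals A / s.

    This is the textbook unitary dilation of a subnormalized matrix; it assumes
    A is square and nonzero (s = 0 is not handled in this sketch).
    """
    A = np.asarray(A, dtype=complex)
    n = A.shape[0]
    s = np.linalg.norm(A, 2)          # spectral norm, so ||A/s|| <= 1
    Ah = A / s
    I = np.eye(n)
    B = psd_sqrt(I - Ah @ Ah.conj().T)
    C = psd_sqrt(I - Ah.conj().T @ Ah)
    U = np.block([[Ah, B], [C, -Ah.conj().T]])
    return U, s
```

Measuring the top-left block of U and multiplying by s recovers the action of the (generally non-unitary) A, which is exactly the mechanism QSVT and QSP build on.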
[LG-137] Incremental learning for audio classification with Hebbian Deep Neural Networks ICASSP2026
链接: https://arxiv.org/abs/2604.18270
作者: Riccardo Casciotti,Francesco De Santis,Alberto Antonietti,Annamaria Mesaros
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: ICASSP 2026
Abstract:The ability of humans for lifelong learning is an inspiration for deep learning methods and in particular for continual learning. In this work, we apply Hebbian learning, a biologically inspired learning process, to sound classification. We propose a kernel plasticity approach that selectively modulates network kernels during incremental learning, acting on selected kernels to learn new information and on others to retain previous knowledge. Using the ESC-50 dataset, the proposed method achieves 76.3% overall accuracy over five incremental steps, outperforming a baseline without kernel plasticity (68.7%) and demonstrating significantly greater stability across tasks.
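The selective kernel modulation described above can be sketched under the assumption that plasticity is realized by masking parameter updates; the function name below is illustrative, not from the paper's code:

```python
import numpy as np

def kernel_plasticity_update(kernels, grads, plastic, lr=0.01):
    """Apply a gradient step only to kernels marked plastic for the new task.

    Frozen kernels (plastic == False) keep their weights, retaining knowledge
    from earlier incremental steps; plastic kernels absorb the new information.
    """
    updated = kernels.copy()
    updated[plastic] -= lr * grads[plastic]
    return updated
```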
[LG-138] DeepRitzSplit Neural Operator for Phase-Field Models via Energy Splitting
链接: https://arxiv.org/abs/2604.18261
作者: Chih-Kang Huang,Ludovick Gagnon,Miha Založnik,Benoît Appolaire
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:The multi-scale and non-linear nature of phase-field models of solidification requires fine spatial and temporal discretization, leading to long computation times. This could be overcome with artificial-intelligence approaches. Surrogate models based on neural operators could have a lower computational cost than conventional numerical discretization methods. We propose a new neural operator approach that bridges classical convex-concave splitting schemes with physics-informed learning to accelerate the simulation of phase-field models. It consists of a Deep Ritz method, where a neural operator is trained to approximate a variational formulation of the phase-field model. By training the neural operator with an energy-splitting variational formulation, we enforce the energy dissipation property of the underlying models. We further introduce a custom Reaction-Diffusion Neural Operator (RDNO) architecture, adapted to the operators of the model equations. We successfully apply the deep learning approach to the isotropic Allen-Cahn equation and to anisotropic dendritic growth simulation. We demonstrate that our physically-informed training provides better generalization in out-of-distribution evaluations than data-driven training, while achieving faster inference than traditional Fourier spectral methods.
[LG-139] Horospherical Depth and Busemann Median on Hadamard Manifolds
链接: https://arxiv.org/abs/2604.18242
作者: Yangdi Jiang,Xiaotian Chang,Cyrus Mostajeran
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 52 pages, 10 figures
Abstract:We introduce the horospherical depth, an intrinsic notion of statistical depth on Hadamard manifolds, and define the Busemann median as the set of its maximizers. The construction exploits the fact that the linear functionals appearing in Tukey’s half-space depth are themselves limits of renormalized distance functions; on a Hadamard manifold the same limiting procedure produces Busemann functions, whose sublevel sets are horoballs, the intrinsic replacements for halfspaces. The resulting depth is parametrized by the visual boundary, is isometry-equivariant, and requires neither tangent-space linearization nor a chosen base point. For arbitrary Hadamard manifolds, we prove that the depth regions are nested and geodesically convex, that a centerpoint of depth at least 1/(d+1) exists, and hence that the Busemann median exists for every Borel probability measure. Under strictly negative sectional curvature and mild regularity assumptions, the depth is strictly quasi-concave and the median is unique. We also establish robustness: the depth is stable under total-variation perturbations, and under contamination escaping to infinity the limiting median depends on the escape direction but not on how far the contaminating mass has moved along the geodesic ray, in contrast with the Fréchet mean. Finally, we establish uniform consistency of the sample depth and convergence of sample depth regions and sample Busemann medians; on symmetric spaces of noncompact type, the argument proceeds through a VC analysis of upper horospherical halfspaces, while on general Hadamard manifolds it follows from a compactness argument under a mild non-atomicity assumption.
[LG-140] Centre manifold theorem for maps along manifolds of fixed points
链接: https://arxiv.org/abs/2604.18202
作者: Lachlan Ewen MacDonald
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注: 28 pages, comments welcome
Abstract:We prove a centre manifold theorem for a map along a manifold-with-boundary of fixed points, and provide an application to the study of gradient descent with large step size on two-layer matrix factorisation problems.
[LG-141] mlr3torch: A Deep Learning Framework in R based on mlr3 and torch
链接: https://arxiv.org/abs/2604.18152
作者: Sebastian Fischer,Lukas Burk,Carson Zhang,Bernd Bischl,Martin Binder
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Deep learning (DL) has become a cornerstone of modern machine learning (ML) praxis. We introduce the R package mlr3torch, which is an extensible DL framework for the mlr3 ecosystem. It is built upon the torch package, and simplifies the definition, training, and evaluation of neural networks for both tabular data and generic tensors (e.g., images) for classification and regression. The package implements predefined architectures, and torch models can easily be converted to mlr3 learners. It also allows users to define neural networks as graphs. This representation is based on the graph language defined in mlr3pipelines and allows users to define the entire modeling workflow, including preprocessing, data augmentation, and network architecture, in a single graph. Through its integration into the mlr3 ecosystem, the package allows for convenient resampling, benchmarking, preprocessing, and more. We explain the package’s design and features and show how to customize and extend it to new problems. Furthermore, we demonstrate the package’s capabilities using three use cases, namely hyperparameter tuning, fine-tuning, and defining architectures for multimodal data. Finally, we present some runtime benchmarks.
[LG-142] Distributional Off-Policy Evaluation with Deep Quantile Process Regression
链接: https://arxiv.org/abs/2604.18143
作者: Qi Kuang,Chao Wang,Yuling Jiao,Fan Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:This paper investigates the off-policy evaluation (OPE) problem from a distributional perspective. Rather than focusing solely on the expectation of the total return, as in most existing OPE methods, we aim to estimate the entire return distribution. To this end, we introduce a quantile-based approach for OPE using deep quantile process regression, presenting a novel algorithm called Deep Quantile Process regression-based Off-Policy Evaluation (DQPOPE). We provide new theoretical insights into the deep quantile process regression technique, extending existing approaches that estimate discrete quantiles to estimate a continuous quantile function. A key contribution of our work is the rigorous sample complexity analysis for distributional OPE with deep neural networks, bridging theoretical analysis with practical algorithmic implementations. We show that DQPOPE achieves statistical advantages by estimating the full return distribution using the same sample size required to estimate a single policy value using conventional methods. Empirical studies further show that DQPOPE provides significantly more precise and robust policy value estimates than standard methods, thereby enhancing the practical applicability and effectiveness of distributional reinforcement learning approaches.
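Quantile regression underlies the approach: the basic ingredient is the pinball (check) loss, whose minimizer over a constant is the tau-quantile. A self-contained sketch of this building block (illustrative only, not the DQPOPE algorithm itself):

```python
import numpy as np

def pinball_loss(y, q, tau):
    # check loss: penalizes under-prediction with weight tau, over-prediction with 1 - tau
    d = y - q
    return np.mean(np.maximum(tau * d, (tau - 1.0) * d))

def fit_quantile(y, tau, lr=0.05, steps=2000):
    """Fit a constant tau-quantile by subgradient descent on the pinball loss."""
    q = float(np.mean(y))
    for _ in range(steps):
        # subgradient: -tau where y > q, (1 - tau) where y < q
        g = np.mean(np.where(y > q, -tau, 1.0 - tau))
        q -= lr * g
    return q
```

The stationarity condition of this loss forces P(y < q) = tau, which is why stacking it across a grid of tau values recovers an entire quantile function, and hence a full return distribution, rather than a single expectation.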
[LG-143] Boltzmann Machine Learning with a Parallel Persistent Markov chain Monte Carlo method for Estimating Evolutionary Fields and Couplings from a Protein Multiple Sequence Alignment
链接: https://arxiv.org/abs/2604.18022
作者: Sanzo Miyazawa
类目: Biomolecules (q-bio.BM); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: A manuscript of 11 pages including 3 figures and 3 tables, and a supplementary material of 9 pages including 8 figures. The program and multiple sequence alignments employed here are available from this https URL and this https URL
Abstract:The inverse Potts problem, estimating evolutionary single-site fields and pairwise couplings in homologous protein sequences from the single-site and pairwise amino acid frequencies observed in their multiple sequence alignment, remains a useful method in studies of protein structure and evolution. Since reproducibility of the fields and couplings is the most important requirement, the Boltzmann machine method is employed here, although it is computationally intensive. To reduce the computational time required for the Boltzmann machine, a parallel, persistent Markov chain Monte Carlo method is employed to estimate the single-site and pairwise marginal distributions in each learning step. Stochastic gradient descent methods are also used to reduce the computational time of each learning step. Another problem is how to adjust the hyperparameters; there are two regularization parameters, one for the evolutionary fields and one for the couplings. The precision of contact residue pair prediction is often used to tune hyperparameters, but it is not sensitive to these regularization parameters. Here, they are instead adjusted so that the fields and couplings satisfy a specific condition appropriate for protein conformations. The method has been applied to eight protein families.
[LG-144] The Umwelt Representation Hypothesis: Rethinking Universality
链接: https://arxiv.org/abs/2604.17960
作者: Victoria Bosch,Rowan Sommers,Adrien Doerig,Tim C Kietzmann
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: preprint v1
Abstract:Recent studies reveal striking representational alignment between artificial neural networks (ANNs) and biological brains, leading to proposals that all sufficiently capable systems converge on universal representations of reality. Here, we argue that this claim of Universality is premature. We introduce the Umwelt Representation Hypothesis (URH), proposing that alignment arises not from convergence toward a single global optimum, but from overlap in ecological constraints under which systems develop. We review empirical evidence showing that representational differences between species, individuals, and ANNs are systematic and adaptive, which is difficult to reconcile with Universality. Finally, we reframe ANN model comparison as a method for mapping clusters of alignment in ecological constraint space rather than searching for a single optimal world model.
[LG-145] Complex normalizing flows can be information Kähler-Ricci flows
链接: https://arxiv.org/abs/2604.17954
作者: Andrew Gracyk
类目: Differential Geometry (math.DG); Machine Learning (cs.LG)
*备注: First version
Abstract:We develop interconnections between the complex normalizing flow for data drawn from Borel probability measures on the twofold realification of the complex manifold and the Kähler-Ricci flow. The complex normalizing flow relates the initial and target realified densities under the complex change of variables, necessitating the log determinant of the Wirtinger Jacobian. The Ricci curvature of a Kähler manifold is the second order mixed Wirtinger partial derivative of the log of the local density of the volume form. Therefore, we reconcile these two facts by drawing forth the connection that the log determinant used in the complex normalizing flow matches the Ricci curvature term under differentiation and conditions. The log density under the normalizing flow is kindred to a spatial Fisher information metric under a holomorphic pullback and a Bayesian perspective to the parameter, thus under the continuum limit the log likelihood matches a Fisher metric, recovering the Kähler-Ricci flow up to expectation. Using this framework, we establish other relevant results, attempting to bridge the statistical and ordinary behaviors of the complex normalizing flow to the geometric features of the Kähler-Ricci flow.
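The change-of-variables log-determinant at the heart of any normalizing flow can be checked concretely in the simpler real, affine case (rather than the paper's complex Wirtinger setting); the function names below are our own:

```python
import numpy as np

def standard_normal_logpdf(x):
    return -0.5 * (x @ x) - 0.5 * x.size * np.log(2.0 * np.pi)

def flow_log_density(y, A, b):
    """Log density of y under the affine flow y = A x + b of a standard normal base.

    Change of variables: log p_Y(y) = log p_X(x) - log|det A|, x = A^{-1}(y - b).
    np.linalg.slogdet already returns log|det A|, so the sign is handled.
    """
    x = np.linalg.solve(A, y - b)
    _, logabsdet = np.linalg.slogdet(A)
    return standard_normal_logpdf(x) - logabsdet
```

For this affine case the pushforward is exactly N(b, A Aᵀ), so the flow density can be verified against the closed-form Gaussian; in the paper's setting the analogous log-determinant term (of the Wirtinger Jacobian) is what connects the flow to the Ricci curvature of the Kähler metric.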
[LG-146] Improving reproducibility by controlling random seed stability in machine learning based estimation via bagging
链接: https://arxiv.org/abs/2604.17694
作者: Nicholas Williams,Alejandro Schuler
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Predictions from machine learning algorithms can vary across random seeds, inducing instability in downstream debiased machine learning estimators. We formalize random seed stability via a concentration condition and prove that subbagging guarantees stability for any bounded-outcome regression algorithm. We introduce a new cross-fitting procedure, adaptive cross-bagging, which simultaneously eliminates seed dependence from both nuisance estimation and sample splitting in debiased machine learning. Numerical experiments confirm that the method achieves the targeted level of stability whereas alternatives do not. Our method incurs a small computational penalty relative to standard practice whereas alternative methods incur large penalties.
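The stabilizing effect of subbagging on seed-dependent estimates can be demonstrated with a toy randomized estimator (a sketch of the general mechanism, not the paper's adaptive cross-bagging procedure):

```python
import numpy as np

def subbag_estimate(y, n_sub, n_bags, seed):
    """Average a subsample-based estimate over n_bags random subsamples.

    With n_bags = 1 the estimate inherits the full seed-to-seed variability of a
    single random subsample; averaging many subsamples (subbagging) concentrates
    the estimate and damps its dependence on the random seed.
    """
    rng = np.random.default_rng(seed)
    estimates = [np.mean(rng.choice(y, size=n_sub, replace=False))
                 for _ in range(n_bags)]
    return float(np.mean(estimates))
```

Running the estimator across many seeds and comparing the spread with and without bagging makes the concentration condition in the paper tangible: the bagged estimate varies far less from seed to seed.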
[LG-147] StrEBM: A Structured Latent Energy-Based Model for Blind Source Separation
链接: https://arxiv.org/abs/2604.17381
作者: Yuan-Hao Wei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes StrEBM, a structured latent energy-based model for source-wise structured representation learning. The framework is motivated by a broader goal of promoting identifiable and decoupled latent organization by assigning different latent dimensions their own learnable structural biases, rather than constraining the entire latent representation with a single shared energy. In this sense, blind source separation is adopted here as a concrete and verifiable testbed, through which the evolution of latent dimensions toward distinct underlying components can be directly examined. In the proposed framework, latent trajectories are optimized directly together with an observation-generation map and source-wise structural parameters. Each latent dimension is associated with its own energy-based formulation, allowing different latent components to gradually evolve toward distinct source-like roles during training. In the present study, this source-wise energy design is instantiated using Gaussian-process-inspired energies with learnable length-scales, but the framework itself is not restricted to Gaussian processes and is intended as a more general structured latent EBM formulation. Experiments on synthetic multichannel signals under linear and nonlinear mixing settings show that the proposed model can recover source components effectively, providing an initial empirical validation of the framework. At the same time, the study reveals important optimization characteristics, including slow late-stage convergence and reduced stability under nonlinear observation mappings. These findings not only clarify the practical behavior of the current GP-based instantiation, but also establish a basis for future investigation of richer source-wise energy families and more robust nonlinear optimization strategies.
[LG-148] Leveraging Kernel Symmetry for Joint Compression and Error Mitigation in Edge Model Transfer
链接: https://arxiv.org/abs/2604.17371
作者: Anis Hamadouche,Mathini Sellathurai
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:This paper investigates communication-efficient neural network transmission by exploiting structured symmetry constraints in convolutional kernels. Instead of transmitting all model parameters, we propose a degrees-of-freedom (DoF) based codec that sends only the unique coefficients implied by a chosen symmetry group, enabling deterministic reconstruction of the full weight tensor at the receiver. The proposed framework is evaluated under quantization and noisy channel conditions across multiple symmetry patterns, signal-to-noise ratios, and bit-widths. To improve robustness against transmission impairments, a projection step is further applied at the receiver to enforce consistency with the symmetry-invariant subspace, effectively denoising corrupted parameters. Experimental results on MNIST and CIFAR-10 using a DeepCNN architecture demonstrate that DoF-based transmission achieves substantial bandwidth reduction while preserving significantly higher accuracy than pruning-based baselines, which often suffer catastrophic degradation. Among the tested symmetries, central-skew symmetry consistently provides the best accuracy-compression tradeoff, confirming that structured redundancy can be leveraged for reliable and efficient neural model delivery over constrained links.
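The DoF idea, transmitting only the free coefficients implied by a symmetry and reconstructing (plus projecting) at the receiver, can be sketched for centrosymmetric kernels. Note the paper's best-performing pattern is the central-skew variant; this simplified sketch uses the even (non-skew) symmetry, and the function names are our own:

```python
import numpy as np

def project_centro(K):
    # orthogonal projection onto the centrosymmetric subspace: K == flip(K) on both axes
    return 0.5 * (K + K[::-1, ::-1])

def encode_centro(K):
    # a centrosymmetric k x k kernel has only ceil(k*k / 2) free coefficients
    flat = K.ravel()
    m = (flat.size + 1) // 2
    return flat[:m].copy()

def decode_centro(coeffs, k):
    # deterministic reconstruction at the receiver: flat[i] == flat[k*k - 1 - i]
    N = k * k
    m = coeffs.size
    tail = coeffs[:N - m][::-1]
    return np.concatenate([coeffs, tail]).reshape(k, k)
```

Because `project_centro` is an orthogonal projection, applying it to a noisy received kernel can only shrink the error relative to the true symmetric kernel, which is the denoising effect the abstract describes.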
[LG-149] SPaRSe-TIME: Saliency-Projected Low-Rank Temporal Modeling for Efficient and Interpretable Time Series Prediction
链接: https://arxiv.org/abs/2604.17350
作者: K. A. Shahriar
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: N.A
Abstract:Time series forecasting is traditionally dominated by sequence-based architectures such as recurrent neural networks and attention mechanisms, which process all time steps uniformly and often incur substantial computational cost. However, real-world temporal signals typically exhibit heterogeneous structure, where informative patterns are sparsely distributed and interspersed with redundant observations. This work introduces \textbfSPaRSe-TIME, a structured and computationally efficient framework that models time series through a decomposition into three complementary components: saliency, memory, and trend. The proposed approach reformulates temporal modeling as a projection onto informative subspaces, where saliency acts as a data-dependent sparsification operator, memory captures dominant low-rank temporal patterns, and trend encodes low-frequency dynamics. These components are integrated through a lightweight, adaptive mapping that enables simplified, selective, and interpretable temporal reasoning. Extensive experiments on diverse real-world datasets demonstrate that SPaRSe-TIME achieves competitive predictive performance compared to recurrent and attention-based architectures, while significantly reducing computational complexity. The model is particularly effective in structured time series with clear temporal components and provides explicit interpretability through component-wise contributions. Furthermore, analysis reveals both the strengths and limitations of decomposition-based modeling, highlighting challenges in highly stochastic and complex multivariate settings. Overall, SPaRSe-TIME offers a principled alternative to monolithic sequence models, bridging efficiency, interpretability, and performance, and providing a scalable framework for time series learning.
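One plausible reading of the decomposition can be sketched under our own simplifying assumptions: a moving-average trend and a top-k residual saliency mask (the paper's actual operators, including the low-rank memory component, are not reproduced here):

```python
import numpy as np

def decompose(x, window=5, k=10):
    """Split a 1-D signal into a low-frequency trend and a sparse saliency part.

    trend: centered moving average (low-frequency dynamics); saliency: a
    data-dependent sparsification keeping the k largest-magnitude residuals.
    """
    pad = window // 2
    xp = np.pad(x, pad, mode='edge')
    trend = np.convolve(xp, np.ones(window) / window, mode='valid')
    resid = x - trend
    mask = np.zeros(x.size, dtype=bool)
    mask[np.argsort(np.abs(resid))[-k:]] = True
    return trend, resid * mask, mask
```

The point of such a split is the one the abstract makes: informative patterns are sparsely distributed, so a model only needs to reason selectively over the salient residuals on top of a cheap trend estimate.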
[LG-150] PAC-Bayes Bounds for Gibbs Posteriors via Singular Learning Theory
链接: https://arxiv.org/abs/2604.17219
作者: Chenyang Wang,Yun Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We derive explicit non-asymptotic PAC-Bayes generalization bounds for Gibbs posteriors, that is, data-dependent distributions over model parameters obtained by exponentially tilting a prior with the empirical risk. Unlike classical worst-case complexity bounds based on uniform laws of large numbers, which require explicit control of the model space in terms of metric entropy (integrals), our analysis yields posterior-averaged risk bounds that can be applied to overparameterized models and adapt to the data structure and the intrinsic model complexity. The bound involves a marginal-type integral over the parameter space, which we analyze using tools from singular learning theory to obtain explicit and practically meaningful characterizations of the posterior risk. Applications to low-rank matrix completion and ReLU neural network regression and classification show that the resulting bounds are analytically tractable and substantially tighter than classical complexity-based bounds. Our results highlight the potential of PAC-Bayes analysis for precise finite-sample generalization guarantees in modern overparameterized and singular models.
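For reference, the Gibbs posterior at the center of the paper is the standard exponential tilting of a prior π by the empirical risk (with inverse temperature β and sample size n):

```latex
% Gibbs posterior: exponential tilting of the prior by the empirical risk
\pi_\beta(\theta \mid D_n)
  = \frac{\exp\{-\beta n \widehat{R}_n(\theta)\}\, \pi(\theta)}
         {\int \exp\{-\beta n \widehat{R}_n(\theta')\}\, \pi(\theta')\, d\theta'},
\qquad
\widehat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; z_i).
```

The marginal-type integral in the denominator is exactly the object the paper analyzes with singular learning theory to characterize the posterior risk.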
[LG-151] Forecast Sports Outcomes under Efficient Market Hypothesis: Theoretical and Experimental Analysis of Odds-Only and Generalised Linear Models
链接: https://arxiv.org/abs/2604.17194
作者: Kaito Goto,Naoya Takeishi,Takehisa Yairi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Converting betting odds into accurate outcome probabilities is a fundamental challenge in order to use betting odds as a benchmark for sports forecasting and market efficiency analysis. In this study, we propose two methods to overcome the limitations of existing conversion methods. Firstly, we propose an odds-only method to convert betting odds to probabilities without using historical data for model fitting. While existing odds-only methods, such as Multiplicative, Shin, and Power exist, they do not adjust for biases or relationships we found in our betting odds dataset, which consists of 90014 football matches across five different bookmakers. To overcome these limitations, our proposed Odds-Only-Equal-Profitability-Confidence (OO-EPC) method aligns with the bookmakers’ pricing objectives of having equal confidence in profitability for each outcome. We provide empirical evidence from our betting odds dataset that, for the majority of bookmakers, our proposed OO-EPC method outperforms the existing odds-only methods. Beyond controlled experiments, we applied the OO-EPC method under real-world uncertainty by using it for six iterations of an annual basketball outcome forecasting competition. Secondly, we propose a generalised linear model that utilises historical data for model fitting and then converts betting odds to probabilities. Existing generalised linear models attempt to capture relationships that the Efficient Market Hypothesis already captures. To overcome this shortcoming, our proposed Favourite-Longshot-Bias-Adjusted Generalised Linear Model (FL-GLM) fits just one parameter to capture the favourite-longshot bias, providing a more interpretable alternative. We provide empirical evidence from historical football matches where, for all bookmakers, our proposed FL-GLM outperforms the existing multinomial and logistic generalised linear models.
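As a point of reference, the existing Multiplicative method that the paper benchmarks against simply rescales inverse decimal odds to remove the bookmaker's overround; the proposed OO-EPC and FL-GLM methods themselves are not reproduced here:

```python
def multiplicative_probs(odds):
    """Convert decimal betting odds to outcome probabilities (Multiplicative method).

    Inverse odds sum to more than 1 (the overround is the bookmaker's margin);
    simple rescaling removes it, but ignores the biases the paper's OO-EPC and
    FL-GLM methods are designed to correct.
    """
    inv = [1.0 / o for o in odds]
    total = sum(inv)
    return [p / total for p in inv]
```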
[LG-152] The Virtue of Sparsity in Complexity
链接: https://arxiv.org/abs/2604.17166
作者: Nima Afsharhajari,Jonathan Yu-Meng Li
类目: General Finance (q-fin.GN); Machine Learning (cs.LG); Econometrics (econ.EM); Portfolio Management (q-fin.PM); Pricing of Securities (q-fin.PR)
*备注:
Abstract:Sparsity or complexity? In modern high-dimensional asset pricing, these are often viewed as competing principles: richer feature spaces appear to favor complexity, while economic intuition has long favored parsimony. We show that this tension is misplaced. We distinguish capacity sparsity (the dimensionality of the candidate feature space) from factor sparsity (the parsimonious structure of priced risks) and argue that the two are complements: expanding capacity enables the discovery of factor sparsity. Revisiting the benchmark empirical design of Didisheim et al. (2025) and pushing it to higher complexity regimes, we show that nonlinear feature expansions combined with basis pursuit yield portfolios whose out-of-sample performance dominates ridgeless benchmarks beyond a critical complexity threshold. The evidence shows that the gains from complexity arise not from retaining more factors, but from enlarging the space from which a sparse structure of priced risks can be identified. The virtue of complexity in asset pricing operates through factor sparsity.
[LG-153] FlowRefiner: Flow Matching-Based Iterative Refinement for 3D Turbulent Flow Simulation
链接: https://arxiv.org/abs/2604.17149
作者: Yilong Dai,Yiming Sun,Yiheng Chen,Shengyu Chen,Xiaowei Jia,Runlong Yu
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:Accurate autoregressive prediction of 3D turbulent flows remains challenging for neural PDE solvers, as small errors in fine-scale structures can accumulate rapidly over rollout. In this paper, we propose FlowRefiner, a flow matching-based iterative refinement framework for 3D turbulent flow simulation. The method replaces stochastic denoising refinement with deterministic ODE-based correction, uses a unified velocity-field regression objective across all refinement stages, and introduces a decoupled sigma schedule that fixes the noise range independently of refinement depth. These design choices yield stable and effective refinement in the small-noise regime. Experiments on large-scale 3D turbulence with rich multi-scale structures show that FlowRefiner achieves state-of-the-art autoregressive prediction accuracy and strong physical consistency. Although developed for turbulent flow simulation, the proposed framework is broadly applicable to iterative refinement problems in scientific modeling.
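The deterministic ODE-based correction can be illustrated with Euler integration of a toy velocity field, here the linear-interpolation conditional field v(x, t) = (x_target - x) / (1 - t); in the real system this field is a learned neural network, and the variable names below are our own:

```python
import numpy as np

def ode_refine(x0, velocity, n_steps=100, t_end=0.99):
    """Deterministic Euler integration of a velocity field.

    This is the flow-matching replacement for stochastic denoising refinement:
    the corrupted state x0 is corrected by following the learned field's ODE.
    """
    x = np.array(x0, dtype=float)
    ts = np.linspace(0.0, t_end, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0)
    return x
```

For the toy field, integrating toward t = 1 drives the state onto the target, which mirrors how each refinement stage regresses a single velocity field and integrates it deterministically.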
[LG-154] Negative Momentum for Convex-Concave Optimization
链接: https://arxiv.org/abs/2604.17145
作者: Henry Shugart,Shuyi Wang,Jason M. Altschuler
类目: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:This paper revisits momentum in the context of min-max optimization. Momentum is a celebrated mechanism for accelerating gradient dynamics in settings like convex minimization, but its direct use in min-max optimization makes gradient dynamics diverge. Surprisingly, Gidel et al. 2019 showed that negative momentum can help fix convergence. However, despite these promising initial results and progress since, the power of momentum remains unclear for min-max optimization in two key ways. (1) Generality: is global convergence possible for the foundational setting of convex-concave optimization? This is the direct analog of convex minimization and is a standard testing ground for min-max algorithms. (2) Fast convergence: is accelerated convergence possible for strongly-convex-strongly-concave optimization (the only non-linear setting where global convergence is known)? Recent work has even argued that this is impossible. We answer both these questions in the affirmative. Together, these results put negative momentum on more equal footing with competitor algorithms, and show that negative momentum enables convergence significantly faster and more generally than was known possible.
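The phenomenon is easy to reproduce on the scalar bilinear game f(x, y) = xy: alternating gradient descent-ascent merely cycles around the equilibrium, while adding negative momentum (beta = -1/2, the value used by Gidel et al. 2019) makes it converge. A minimal sketch:

```python
def alternating_gda(eta, beta, n_steps, x0=1.0, y0=1.0):
    """Alternating GDA on f(x, y) = x * y with heavy-ball momentum beta.

    beta = 0 cycles on bilinear games (unit-modulus eigenvalues); a negative
    beta pulls the eigenvalues inside the unit circle and yields convergence.
    """
    x, y = x0, y0
    x_prev, y_prev = x0, y0
    for _ in range(n_steps):
        x_next = x - eta * y + beta * (x - x_prev)
        y_next = y + eta * x_next + beta * (y - y_prev)  # uses the fresh x
        x_prev, y_prev = x, y
        x, y = x_next, y_next
    return (x * x + y * y) ** 0.5   # distance to the equilibrium (0, 0)
```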
[LG-155] Automated Classification of Plasma Regions at Mars Using Machine Learning
链接: https://arxiv.org/abs/2604.17131
作者: Yilan Qin,Chuanfei Dong,Hongyang Zhou,Chi Zhang,Kaichun Xu,Jiawei Gao,Simin Shekarpaz,Xinmin Li,Liang Wang
类目: Space Physics (physics.space-ph); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG); Plasma Physics (physics.plasm-ph)
*备注: 14 pages, 4 figures
Abstract:The plasma environment around Mars is highly variable because it is strongly influenced by the solar wind. Accurate identification of plasma regions around Mars is important for the community studying solar wind-Mars interactions, region-specific plasma processes, and atmospheric escape. In this study, we develop a machine-learning-based classifier to automatically identify three key plasma regions–solar wind, magnetosheath, and induced magnetosphere–using only ion omnidirectional energy spectra measured by the MAVEN Solar Wind Ion Analyzer (SWIA). Two neural network architectures are evaluated: a multilayer perceptron (MLP) and a convolutional neural network (CNN) that incorporates short temporal sequences. Our results show that the CNN can reliably distinguish the three plasma regions, whereas the MLP struggles to separate the solar wind and magnetosheath. Therefore, the CNN-based approach provides an efficient and accurate framework for large-scale plasma region identification at Mars and can be readily applied to future planetary missions.
[LG-156] A proposal for PU classification under Non-SCAR using clustering and logistic model
链接: https://arxiv.org/abs/2604.17130
作者: Konrad Furmanczyk,Kacper Paczutkowski
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages, 2 figures, MDAI 25
Abstract:The present study aims to investigate a cluster cleaning algorithm that is both computationally simple and capable of solving the PU (positive-unlabeled) classification problem when the SCAR (selected completely at random) condition is unsatisfied. A secondary objective is to determine the robustness of the LassoJoint method to perturbations of the SCAR condition. In the first step of our algorithm, we obtain cleaning labels from 2-means clustering. Subsequently, we perform logistic regression on the cleaned data, assigning positive labels from the cleaning algorithm together with the additional true positive observations; the remaining observations are assigned the negative label. The proposed algorithm is evaluated on 11 real data sets from machine learning repositories and a synthetic set. The findings demonstrate the efficacy of the clustering algorithm in scenarios where the SCAR condition is violated and further underscore the moderate robustness of the LassoJoint algorithm in this context.
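A NumPy-only sketch of the two-step idea, 2-means cleaning followed by logistic regression on the relabeled data, is shown below. It is simplified relative to the paper (plain Lloyd-style 2-means, gradient-descent logistic fit) and all helper names are hypothetical:

```python
import numpy as np

def two_means(X, n_iter=50, seed=0):
    # step 1: cluster into two groups to obtain cleaning labels
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def fit_logistic(X, y, lr=0.5, n_steps=500):
    # step 2: logistic regression (with bias) on the cleaned labels
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def pu_cluster_clean(X, s, seed=0):
    """PU pipeline: s marks the observed positives; everything else is unlabeled.

    The cluster holding more observed positives is treated as positive, then
    merged with the known positives before the logistic fit.
    """
    labels = two_means(X, seed=seed)
    pos_cluster = 1 if s[labels == 1].sum() >= s[labels == 0].sum() else 0
    cleaned = ((labels == pos_cluster) | (s == 1)).astype(float)
    return fit_logistic(X, cleaned), cleaned
```

Because the cleaning labels come from clustering rather than from the labeling mechanism, the pipeline does not rely on the SCAR assumption, which is the point the experiments above test.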
[LG-157] Trajectory-Restricted Optimization Conditions and Geometry-Aware Linear Convergence
链接: https://arxiv.org/abs/2604.17067
作者: Faris Chaudhry,Anthea Monod,Keisuke Yano
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 37 pages, 2 figures
Abstract:Linear convergence of first-order methods is typically characterized by global optimization conditions whose constants reflect worst-case geometry of the ambient space. In high-dimensional or structured problems, these global constants can be arbitrarily conservative and fail to capture the geometry actually encountered by optimization trajectories. In this paper, we develop a trajectory-restricted framework for linear convergence based on localized geometric regularity. We introduce restricted variants of the Polyak–Łojasiewicz inequality, error bound, and quadratic growth conditions that are required to hold only on subsets of the domain. We show that classical convergence guarantees extend under these localized conditions, and in key cases, we develop new arguments that yield explicit relationships between the corresponding constants. The resulting rates are governed by geometric quantities associated with the regions traversed by the algorithm. For polyhedral composite problems, we prove that convergence is controlled by restricted Hoffman constants corresponding to the active polyhedral faces visited along the trajectory. Once the iterates enter a well-conditioned face, the effective condition number improves accordingly. Our work provides a geometric quantification for fast local convergence after active-set or manifold identification and more broadly suggests that linear convergence is fundamentally governed by the geometry of the subsets explored by the algorithm, rather than by worst-case global conditioning.
[LG-158] E2E-WAVE: End-to-End Learned Waveform Generation for Underwater Video Multicasting
链接: https://arxiv.org/abs/2604.17047
作者: Khizar Anjum,Tingcong Jiang,Dario Pompili
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Accepted to the 22nd Annual IEEE International Conference on Sensing, Communication, and Networking (SECON 2026)
Abstract:We present E2E-WAVE, the first end-to-end learned waveform generation system for underwater video multicasting. Acoustic channels exhibit 20–46% bit error rates where forward error correction becomes counterproductive: LDPC increases rather than decreases errors beyond its decoding threshold. E2E-WAVE addresses this by embedding semantic similarity directly into physical layer waveforms: when decoding errors are unavoidable, the system steers them toward semantically similar tokens rather than arbitrary corruption. Combining VideoGPT tokenization (1024x compression) with a trainable waveform bank and fully differentiable OFDM transmission, E2E-WAVE achieves +5 dB (19.26%) PSNR and +0.10 (14.28%) SSIM over the strongest FEC-protected baseline in the less challenging underwater channel (NOF1) while delivering real-time 16 FPS video at 128x128 resolution over 2.3 kbps channels, a rate regime impossible for conventional digital modulation. The performance gap only increases in harsher channels (BCH1, NCS1). Trained on a single channel, E2E-WAVE generalizes to unseen underwater environments without retraining, while HEVC fails at sub-5 kbps rates and SoftCast's AWGN assumptions collapse on frequency-selective channels.
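The +5 dB PSNR figure quoted above is easiest to interpret via the metric's definition. Below is a minimal numpy sketch of standard PSNR; this is the textbook metric, not the paper's evaluation harness, and the noise levels are arbitrary stand-ins for two channel conditions.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB for images with pixel range [0, max_val]."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(128, 128)).astype(float)
noisy = frame + rng.normal(0, 8.0, size=frame.shape)     # milder channel noise
noisier = frame + rng.normal(0, 25.0, size=frame.shape)  # harsher channel noise

# A +5 dB PSNR gap corresponds to roughly 10**(5/10) ~= 3.2x lower MSE.
gain_db = psnr(frame, noisy) - psnr(frame, noisier)
```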
[LG-159] Neighbor Embedding for High-Dimensional Sparse Poisson Data
链接: https://arxiv.org/abs/2604.16932
作者: Noga Mudrik,Adam S. Charles
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Across many scientific fields, measurements often represent the number of times an event occurs. For example, a document can be represented by word occurrence counts, neural activity by spike counts per time window, or online communication by daily email counts. These measurements yield high-dimensional count data that often approximate a Poisson distribution, frequently with low rates that produce substantial sparsity and complicate downstream analysis. A useful approach is to embed the data into a low-dimensional space that preserves meaningful structure, commonly termed dimensionality reduction. Yet existing dimensionality reduction methods, including both linear (e.g., PCA) and nonlinear approaches (e.g., t-SNE), often assume continuous Euclidean geometry, thereby misaligning with the discrete, sparse nature of low-rate count data. Here, we propose p-SNE (Poisson Stochastic Neighbor Embedding), a nonlinear neighbor embedding method designed around the Poisson structure of count data, using KL divergence between Poisson distributions to measure pairwise dissimilarity and Hellinger distance to optimize the embedding. We test p-SNE on synthetic Poisson data and demonstrate its ability to recover meaningful structure in real-world count datasets, including weekday patterns in email communication, research area clusters in OpenReview papers, and temporal drift and stimulus gradients in neural spike recordings.
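The two Poisson dissimilarities named in the abstract have closed forms, which is what makes them cheap to evaluate pairwise. The sketch below implements those standard closed forms; the coordinate-wise sum over count vectors and the eps smoothing for zero counts are our illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def poisson_kl(lam1, lam2):
    """KL divergence KL(Pois(lam1) || Pois(lam2)), closed form."""
    return lam1 * np.log(lam1 / lam2) + lam2 - lam1

def poisson_hellinger2(lam1, lam2):
    """Squared Hellinger distance between two Poisson distributions, closed form."""
    return 1.0 - np.exp(-0.5 * (np.sqrt(lam1) - np.sqrt(lam2)) ** 2)

def count_dissimilarity(x, y, eps=1e-3):
    """Sum the per-coordinate KL terms between two sparse count vectors.

    eps keeps log() finite at zero rates; it is our assumption, not the paper's.
    """
    lx, ly = np.asarray(x, float) + eps, np.asarray(y, float) + eps
    return poisson_kl(lx, ly).sum()

d_near = count_dissimilarity([3, 0, 1], [2, 0, 1])
d_far = count_dissimilarity([3, 0, 1], [0, 5, 0])
```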
[LG-160] Extraction of informative statistical features in the problem of forecasting time series generated by Itô-type processes
链接: https://arxiv.org/abs/2604.16865
作者: Victor Korolev,Mikhail Ivanov,Tatiana Kukanova,Artyom Rukavitsa,Alexander Vakshin,Peter Solomonov,Alexander Zeifman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:In this paper, we consider the problem of extracting the most informative features from time series regarded as observed values of stochastic processes satisfying Itô stochastic differential equations with unknown random drift and diffusion coefficients. We do not draw on any additional information and use only the information contained in the time series itself. Therefore, as additional features, we use the parameters of statistically adjusted mixture-type models of the observed regularities in the behavior of the time series. Several algorithms for constructing these parameters are discussed. These algorithms are based on statistical reconstruction of the coefficients, which, in turn, is based on statistical separation of normal mixtures. We obtain two types of parameters by techniques of uniform and non-uniform statistical reconstruction of the coefficients of the underlying Itô process. The reconstructed coefficients obtained by the uniform techniques do not depend on the current value of the process, while the non-uniform techniques reconstruct the coefficients taking into account their dependence on the value of the process. In effect, the non-uniform techniques used in this paper represent a stochastic analog of the Taylor expansion for the time series. The efficiency of the obtained additional features is compared by using them in autoregressive algorithms for time series prediction. In order to obtain clean conclusions not affected by unwanted factors, such as a particular choice of neural-network architecture for prediction, we used only simple autoregressive algorithms. We show that the use of the additional statistical features improves the predictions.
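As a concrete instance of the setting above, the sketch below simulates an Itô process with the Euler–Maruyama scheme and performs crude coefficient reconstructions from the increments alone. The mean-reverting drift and constant diffusion are illustrative choices, not the paper's models, and the estimators are textbook method-of-moments stand-ins for the "uniform" and "non-uniform" reconstructions, not the mixture-separation machinery described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
DT = 0.01

def simulate(x0=1.0, n=5000, theta=0.5, sigma=0.3):
    """Euler-Maruyama simulation of the Ito SDE dX = -theta*X dt + sigma dW."""
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        drift = -theta * x[t - 1]                     # state-dependent drift a(x)
        x[t] = x[t - 1] + drift * DT + sigma * rng.normal(0.0, np.sqrt(DT))
    return x

series = simulate()
dx = np.diff(series)

# "Uniform" reconstruction ignores the current state entirely: one number
# summarising the diffusion magnitude over the whole trajectory.
sigma_hat = np.sqrt(np.mean(dx ** 2) / DT)

# A first state-dependent ("non-uniform") refinement regresses dX on X,
# recovering the drift's linear dependence on the process value.
theta_hat = -np.sum(series[:-1] * dx) / (np.sum(series[:-1] ** 2) * DT)
```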
[LG-161] Scalable Quantum Error Mitigation with Physically Informed Graph Neural Networks
链接: https://arxiv.org/abs/2604.16815
作者: Huaxin Wang,Xinge Wu,Jiajun Liu,Ruiqing He,Jiandong Shang,Hengliang Guo,Qiang Chen
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Quantum error mitigation (QEM) provides a practical route for estimating reliable observables on noisy intermediate-scale quantum (NISQ) devices. Traditional QEM strategies, including zero-noise extrapolation (ZNE) and Clifford data regression (CDR), rely on noise scaling or global regression, and their performance is constrained by the exponential growth of the system degrees of freedom. We construct a graph-enhanced mitigation (GEM) framework, which incorporates physical information into the model representation. In this work, quantum circuits are encoded as attributed graphs. Hardware-level physical information is mapped to node and edge features: local noise parameters such as calibration parameters T_1 , T_2 , and readout errors are encoded at nodes, while coupling-related information such as two-qubit gate errors is encoded as edge features. Graph neural networks are used to model how errors propagate along the physical coupling structure and build up into non-local correlations. This allows the model to capture local interactions and part of the resulting non-local correlations across qubits. A dual-branch affine correction is applied to maintain consistency with physical constraints. Experiments on 10-qubit and 16-qubit random circuits executed on superconducting quantum processors show that GEM provides a level of accuracy comparable to CDR at small scales, while yielding lower mean absolute error and improved stability in zero-shot transfer to larger systems. Results of the traditional QEM strategy indicate that global regression methods remain effective in low-dimensional settings but become less reliable as system degrees of freedom grow. In contrast, GEM makes use of local physical structures to show better scalability and generalization, while preserving the overall error propagation patterns. This work provides a practical scalable approach to QEM for NISQ devices.
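The node/edge encoding described above is simple to picture. The sketch below builds a toy attributed graph for a three-qubit line and performs one hand-written neighbour aggregation as a stand-in for a GNN message-passing step; all calibration numbers are invented for illustration, not taken from any device.

```python
# Device layout as an attributed graph: local calibration data on nodes,
# coupling-related errors on edges, as described in the abstract.
node_features = {
    # qubit: (T1 in us, T2 in us, readout error)
    0: (110.0, 85.0, 0.012),
    1: (95.0, 70.0, 0.021),
    2: (120.0, 90.0, 0.015),
}
edge_features = {
    # coupled pair: two-qubit gate error
    (0, 1): 0.008,
    (1, 2): 0.011,
}

def aggregate(node_features, edge_features):
    """One round of message passing: each qubit sums its incident edge errors.

    A toy stand-in for how a GNN propagates noise along the coupling map.
    """
    agg = {q: 0.0 for q in node_features}
    for (u, v), err in edge_features.items():
        agg[u] += err
        agg[v] += err
    return agg

neighbour_error = aggregate(node_features, edge_features)
```

The middle qubit accumulates error contributions from both couplers, which is the kind of structure-dependent signal the GNN branch can exploit and a global regression cannot.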
[LG-162] A Mechanism Study of Delayed Loss Spikes in Batch-Normalized Linear Models
链接: https://arxiv.org/abs/2604.16809
作者: Peifeng Gao,Wenyi Fang,Yang Zheng,Difan Zou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Delayed loss spikes have been reported in neural-network training, but existing theory mainly explains earlier non-monotone behavior caused by overly large fixed learning rates. We study one stylized hypothesis: normalization can postpone instability by gradually increasing the effective learning rate during otherwise stable descent. To test this hypothesis at theorem level, we analyze batch-normalized linear models. Our flagship result concerns whitened square-loss linear regression, where we derive explicit no-rising-edge and delayed-onset conditions, bound the waiting time to directional onset, and show that the rising edge self-stabilizes within finitely many iterations. Combined with a square-loss decomposition, this yields a concrete delayed-spike mechanism in the whitened regime. For logistic regression, under highly restrictive active-margin assumptions, we prove only a supporting finite-horizon directional precursor in a knife-edge regime, with an optional appendix-only loss lower bound under an extra non-degeneracy condition. The paper should therefore be read as a stylized mechanism study rather than a general explanation of neural-network loss spikes. Within that scope, the results isolate one concrete delayed-instability pathway induced by batch normalization.
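The "gradually increasing effective learning rate" mechanism rests on a standard property of batch normalization: the loss is invariant to rescaling the pre-BN weights, so the gradient norm scales as 1/c when the weights are scaled by c, making the effective step size behave like eta/||w||^2. The numerical check below illustrates that generic scale-invariance property only; it is not the paper's model or proof, and the data and model shape are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = rng.normal(size=64)

def bn_linear_loss(w, gamma=1.0):
    """Squared loss of a batch-normalized linear model: gamma * BN(X @ w)."""
    z = X @ w
    z = (z - z.mean()) / z.std()
    return np.mean((gamma * z - y) ** 2)

def num_grad(f, w, h=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

w = rng.normal(size=3)
c = 3.0
loss_ratio = bn_linear_loss(c * w) / bn_linear_loss(w)   # ~1: scale invariance
grad_ratio = (np.linalg.norm(num_grad(bn_linear_loss, c * w))
              / np.linalg.norm(num_grad(bn_linear_loss, w)))  # ~1/c
```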
[LG-163] Q-SINDy: Quantum-Kernel Sparse Identification of Nonlinear Dynamics with Provable Coefficient Debiasing
链接: https://arxiv.org/abs/2604.16779
作者: Samrendra Roy,Syed Bahauddin Alam
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Quantum feature maps offer expressive embeddings for classical learning tasks, and augmenting sparse identification of nonlinear dynamics (SINDy) with such features is a natural but unexplored direction. We introduce Q-SINDy, a quantum-kernel-augmented SINDy framework, and identify a specific failure mode that arises: coefficient cannibalization, in which quantum features absorb coefficient mass that rightfully belongs to the polynomial basis, corrupting equation recovery. We derive the exact cannibalization-bias formula Delta xi_P = (P^T P)^-1 P^T Q xi_Q and prove that orthogonalizing quantum features against the polynomial column space at fit time eliminates this bias exactly. The claim is verified numerically to machine precision (10^-12) on multiple systems. Empirically, across six canonical dynamical systems (Duffing, Van der Pol, Lorenz, Lotka-Volterra, cubic oscillator, Rossler) and three quantum feature map architectures (ZZ-angle encoding, IQP, data re-uploading), orthogonalized Q-SINDy consistently matches vanilla SINDy’s structural recovery while uncorrected augmentation degrades true-positive rates by up to 100%. A refined dynamics-aware diagnostic, R^2_Q for X-dot, predicts cannibalization severity with statistical significance (Pearson r=0.70, p=0.023). An RBF classical-kernel control across 20 hyperparameter configurations fails more severely than any quantum variant, ruling out feature count as the cause. Orthogonalization remains robust under depolarizing hardware noise up to 2% per gate, and the framework extends without modification to Burgers’ equation.
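The bias formula quoted in the abstract is easy to check numerically. The sketch below evaluates Delta xi_P = (P^T P)^-1 P^T Q xi_Q on synthetic feature matrices (random stand-ins, not actual quantum feature maps) and confirms that projecting Q onto the orthogonal complement of col(P) drives the bias to machine precision, matching the paper's claim.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 5, 3
P = rng.normal(size=(n, p))                                # polynomial library columns
Q = rng.normal(size=(n, q)) + P @ rng.normal(size=(p, q))  # features correlated with P
xi_Q = rng.normal(size=q)                                  # coefficient mass on Q

# Cannibalization bias from the abstract: Delta xi_P = (P^T P)^-1 P^T Q xi_Q.
bias = np.linalg.solve(P.T @ P, P.T @ Q @ xi_Q)

# Orthogonalize Q against the polynomial column space at fit time;
# P^T Q_perp = 0, so the same formula now yields (numerically) zero bias.
Q_perp = Q - P @ np.linalg.solve(P.T @ P, P.T @ Q)
bias_orth = np.linalg.solve(P.T @ P, P.T @ Q_perp @ xi_Q)
```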
[LG-164] Fairness Constraints in High-Dimensional Generalized Linear Models
链接: https://arxiv.org/abs/2604.16610
作者: Yixiao Lin,James Booth
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning models often inherit biases from historical data, raising critical concerns about fairness and accountability. Conventional fairness interventions typically require access to sensitive attributes like gender or race, but privacy and legal restrictions frequently limit their use. To address this challenge, we propose a framework that infers sensitive attributes from auxiliary features and integrates fairness constraints into model training. Our approach mitigates bias while preserving predictive accuracy, offering a practical solution for fairness-aware learning. Empirical evaluations validate its effectiveness, contributing to the advancement of more equitable algorithmic decision-making.
[LG-165] Horizon-Aware Forecasting of Passenger Assistance Demand for Rail Station Workforce Planning
链接: https://arxiv.org/abs/2604.16464
作者: Michael Sheehan,Irina Timoshenko
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 26 pages, 6 figures, 3 tables
Abstract:Passenger assistance services are essential for accessible rail travel, yet demand varies substantially across stations and over time, creating challenges for workforce planning and staff rostering. This paper presents a data-driven decision support framework for forecasting station-level passenger assistance demand and translating forecasts into workforce plans. The forecasting component applies a horizon-aware Prophet modelling approach using multi-source operational data, while the planning component maps demand forecasts to staffing requirements under service and operational constraints through an interpretable red-amber-green risk framework. The approach has been implemented within a production-grade system to support routine planning and staffing decisions across LNER-managed stations. Results demonstrate improved forecast accuracy relative to year-on-year baseline methods, with absolute error reduced by up to 76.9%, and show that forecast-informed staffing is associated with an approximate 50% reduction in failed passenger assistance deliveries attributable to staff availability. These findings highlight the value of integrating interpretable forecasting with operational work.
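The red-amber-green mapping from forecast demand to staffing risk can be sketched directly. The utilisation thresholds and the assists-per-staff-hour capacity figure below are invented placeholders, not values from the paper or LNER's production system.

```python
# Capacity assumption (hypothetical): assists one staff member covers per hour.
ASSISTS_PER_STAFF_HOUR = 4.0

def staffing_rag(forecast_assists, staff_on_shift):
    """Map an hourly assistance forecast to a red-amber-green staffing signal."""
    capacity = staff_on_shift * ASSISTS_PER_STAFF_HOUR
    utilisation = forecast_assists / capacity
    if utilisation < 0.7:    # comfortable headroom
        return "green"
    if utilisation < 1.0:    # tight but deliverable
        return "amber"
    return "red"             # forecast demand exceeds capacity

status = [staffing_rag(f, staff_on_shift=2) for f in (3.0, 6.5, 9.0)]
```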
[LG-166] Modelling Gas-Phase Reaction Kinetics with Guided Particle Diffusion Sampling
链接: https://arxiv.org/abs/2604.16461
作者: Andrew Millard,Zheng Zhao,Henrik Pedersen
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:
Abstract:Physics-guided sampling with diffusion priors has recently shown strong performance in solving complex systems of partial differential equations (PDEs) from sparse observations. However, these methods are typically evaluated on benchmark problems that do not fully demonstrate their ability to generate temporally consistent solutions of time-dependent PDEs, often focusing instead on reconstructing a single snapshot. In this work, we apply these methods to gas-phase reaction kinetics problems governed by the advection-reaction-diffusion (ARD) equation, providing a setting that more closely reflects realistic laboratory experiments. We demonstrate that guided sampling can be used to reconstruct full spatiotemporal trajectories, rather than isolated states. Furthermore, we show that these methods generalise to previously unseen parameter regimes, highlighting their potential for real-world applications.
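The governing ARD equation can be made concrete with a few lines of explicit finite differences. The logistic reaction term and all coefficients below are illustrative (real gas-phase kinetics involve coupled species), and the scheme is a generic upwind/central discretisation on a periodic domain, not the solver used to generate the paper's data.

```python
import numpy as np

def ard_step(u, dx, dt, D=0.1, v=0.5, r=1.0):
    """One explicit step of u_t = D*u_xx - v*u_x + r*u*(1-u) on a periodic grid."""
    u_xx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx ** 2  # central diffusion
    u_x = (u - np.roll(u, 1)) / dx                             # upwind for v > 0
    return u + dt * (D * u_xx - v * u_x + r * u * (1 - u))

nx, dx, dt = 100, 0.1, 0.005   # dt chosen within the explicit stability limits
u0 = np.exp(-((np.arange(nx) * dx - 2.0) ** 2))   # Gaussian initial pulse
u = u0.copy()
for _ in range(200):
    u = ard_step(u, dx, dt)
```

Advection and diffusion conserve the total amount on the periodic grid, so any growth in the integral of u comes from the reaction term alone.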