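This post lists the latest papers retrieved from arXiv.org on 2026-06-29. It is updated automatically and organized into six main areas: NLP, CV, ML, AI, IR, and MA.
Note: The daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:30 each day.
Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day whenever possible.
Contents
Overview (2026-06-29)
445 papers were updated today, including:
- Natural Language Processing: 58 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 131 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 106 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 134 papers (Machine Learning (cs.LG))
- Multiagent Systems: 13 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 13 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 8 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems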
[MA-0] Which Nash Equilibrium? Solver-Dependent Selection on Zero-Sum Nash Polytopes
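[Quick Read]: This paper asks whether different algorithms select systematically different equilibria when a two-player zero-sum game has multiple Nash equilibria. In many such games the Nash equilibria form a convex set (a Nash polytope) whose members all share the minimax value V* yet prescribe different behaviour. Standard solvers each converge to some equilibrium and are treated as interchangeable; the paper questions this assumption and argues that the algorithm itself, rather than the random seed, may systematically favour particular members. The key finding is that, on asymmetric Nash sets, the algorithm family determines which equilibrium is selected. Regularized last-iterate methods (such as R-NaD and magnetic mirror descent) select the maximum-entropy member, that is, the information projection (I-projection) of their uniform reference onto the Nash set: exactly on the two-dimensional polytope and at 99.7% of maximum entropy in Kuhn poker. Regret-averaging methods (such as CFR, CFR+, and fictitious play) instead drift to a lower-entropy face. The finding is confirmed on a randomized ensemble of 180 games, where R-NaD attains the maximum-entropy member in every converged game while CFR+ sits strictly below it in 94% of games (paired Wilcoxon test, p < 10^-27). The selected member also has downstream consequences against sub-optimal opponents; these grow with sequential and hidden-information structure but stay bounded. In Kuhn poker the maximum-entropy member is the strictly better hedge, whereas in the matrix games no member dominates. The paper also corrects two common intuitions: removing the positive-orthant projection (max(R,0)) from CFR does not eliminate boundary drift, and R-NaD's selection is anchor-following rather than initialization-independent. The authors state the maximum-entropy / I-projection characterization as a strongly data-supported conjecture, cross-checked against analytic ground truth throughout.
Link: https://arxiv.org/abs/2606.28308
Authors: Luis Leal
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 18 pages, 9 figures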
Abstract:Many two-player zero-sum games admit not a unique Nash equilibrium but a convex set of them: a polytope of profiles that all share the minimax value V* yet prescribe different behaviour. Standard solvers each converge to some equilibrium and are treated as interchangeable. We ask whether they instead select different members of the Nash set, systematically as a function of the algorithm rather than the seed. Using a tabular, exactly solvable testbed of six games with analytically known Nash sets – including a two-dimensional Nash polytope and Kuhn poker – we find that (i) selection is determined by the algorithm, not the seed, but families differ only on asymmetric Nash sets; (ii) regularized last-iterate methods (R-NaD, magnetic mirror descent) select the maximum-entropy member, the information projection of their uniform reference onto the Nash set – exactly on the 2-D polytope and at 99.7% of maximum entropy in Kuhn – while regret-averaging methods (CFR, CFR+, fictitious play) drift to a lower-entropy face; we confirm this on a randomized 180-game ensemble, where R-NaD attains the maximum-entropy member in 100% of converged games while CFR+ sits strictly below it in 94% (paired Wilcoxon p < 10^-27); (iii) the selected member has downstream consequences against sub-optimal opponents that scale with sequential/hidden-information structure but stay bounded – in Kuhn the max-entropy member is a strictly better hedge, whereas on the matrix games the members differ without either dominating. We also report two negative results correcting common intuitions: removing CFR’s positive-orthant (max(R,0)) projection does not eliminate boundary drift; and R-NaD’s selection is anchor-following, not initialization-independent. We state the maximum-entropy / I-projection characterization as a strongly data-supported conjecture, checked throughout against analytic ground truth.
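To make the notion of a maximum-entropy member of a Nash set concrete, here is a minimal sketch on a toy game that is not taken from the paper: matching pennies with the second column duplicated. The payoff matrix `A`, the grid over `t`, and the helper names are all assumptions chosen for illustration. In this toy game the column player's minimax strategies form the segment (0.5, t, 0.5 - t) for t in [0, 0.5]; every point on it guarantees the same value, but the points differ in entropy.

```python
import numpy as np

# Toy zero-sum game (an assumption for illustration, not the paper's testbed):
# matching pennies with the second column duplicated.
# The row player maximizes x @ A @ y, the column player minimizes it.
A = np.array([[1.0, -1.0, -1.0],
              [-1.0, 1.0, 1.0]])

def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def row_best_response_value(y):
    """Best payoff the row player can obtain against column strategy y."""
    return float((A @ y).max())

# The column player's Nash set is the segment y(t) = (0.5, t, 0.5 - t).
ts = np.linspace(0.0, 0.5, 101)
members = [np.array([0.5, t, 0.5 - t]) for t in ts]

# Every member holds the row player to the game value V* = 0 ...
values = [row_best_response_value(y) for y in members]
# ... but the members differ in entropy.
entropies = [entropy(y) for y in members]

best_t = float(ts[int(np.argmax(entropies))])
print("largest |value| over the segment:", max(abs(v) for v in values))
print("max-entropy member at t =", best_t)
```

The maximum-entropy member splits the duplicated column's mass evenly, while the endpoints of the segment (t = 0 or t = 0.5) are the low-entropy faces. Which member a given solver reaches is the empirical question the abstract studies; this sketch only illustrates the geometry.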
[MA-1] Democratic ICAI: Debating Our Way to Steering Principles from Preferences ICLR2026
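[Quick Read]: This paper addresses the difficulty that preference-based alignment has in capturing the reasoning behind human judgments. Many evaluations depend on multiple interacting criteria, yet pairwise labels reveal only the final choice. Inverse Constitutional AI (ICAI) improves interpretability by summarizing preferences into natural-language principles, but its single-pass explanations miss much of the nuance in complex decisions. The paper proposes Democratic ICAI, which gathers multiple competing rationales through structured persona debate to give a broader and more expressive account of the factors behind each comparison. From these richer signals it derives clearer and more comprehensive steering principles, and uses them to guide decision modeling with both LLM-based and decision-tree judges. Experiments on the creative preference benchmarks MuCE-Pref and LiTBench, across multiple creative task categories, show that the method yields a more faithful preference structure: it improves average preference prediction relative to deliberative prompting and principle-based baselines, and LLM annotators prefer the constitutions it produces.
Link: https://arxiv.org/abs/2606.28294
Authors: Kevin Kingslin, Anish Natekar, Ashutosh Ranjan, Vivek Srivastava, Savita Bhat, Shirish Karande
Affiliations: TCS Research
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted to the ICLR 2026 HCAIR Workshop, 40 pages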
Abstract:Preference-based alignment often struggles to capture the reasoning that underlies human judgments. Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the considerations that shape preferences. Inverse Constitutional AI (ICAI) improves interpretability in decision making by summarizing preferences into natural-language principles, but its single-pass explanations miss much of the nuance involved in complex decisions. We introduce Democratic ICAI, a novel approach that gathers multiple competing rationales through structured persona debate, offering a broader and more expressive account of the factors influencing each comparison. From these richer signals, we derive clearer and more comprehensive steering principles and use them to guide decision modeling through both LLM-based and decision-tree judges. Experiments on creative preference benchmarks, MuCE-Pref and LiTBench, across multiple creative task categories show that Democratic ICAI yields a more faithful preference structure. It improves average preference prediction across tasks relative to deliberative prompting and principle-based baselines, while producing constitutions that LLM annotators prefer.
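[MA-2] Agent-Native Immune System: Architecture Taxonomy and Engineering
[Quick Read]: This paper addresses runtime security threats to autonomous agents, in particular runtime hijacking of the agent's cognitive loop through memory poisoning, tool-chain manipulation, or multi-agent protocol attacks. Existing defenses such as perimeter security and training-time alignment sit outside the agent's reasoning loop and cannot counter these dynamic attacks. The paper proposes the Agent-Native Immune System (ANIS), which it describes as the first biologically inspired, endogenous defense architecture embedded directly in the agent's cognitive loop. The main components are: a six-layer Immune Tower (L0-L5), whose L1 layer adds non-cognitive Barrier Immunity for physical and logical isolation; a unified taxonomy of Agent Viruses and Agent Vaccines that separates superficial non-parametric defenses from robust parametric vaccines; the Harness Triad (Meta, Self, Auto), a meta-cognitive automation backbone that drives Continual Immune Learning (CIL) so that vaccines can adapt to novel threats; and a theoretical demarcation between model alignment and agent immunity, in which alignment is a static value foundation set during training and immunity is the dynamic "law enforcement" mechanism at runtime. The framework is offered as a theoretical and technical basis for self-protecting agents.
Link: https://arxiv.org/abs/2606.28270
Authors: Bo Shen, Lifeng Chang, Tianyuan Wei, Yunpeng Li, Feng Shi, Yichen Han, Peijie Gao, Shiyi Kuang, Xin Chang, Dehui Li
Affiliations: Novo Ordo for AI
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: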
Abstract:The transition from static chat bots to autonomous agents–equipped with persistent memory, tool-use protocols, and multi-agent collaboration–has fundamentally expanded the AI threat landscape. Current defense mechanisms, such as perimeter security and training-time alignment, remain external to the agent’s active reasoning loop. Consequently, they fall short: a fully aligned agent remains highly vulnerable to runtime hijacking via memory poisoning, tool-chain manipulation, or multi-agent protocol attacks. To address this critical gap, we introduce the Agent-Native Immune System (ANIS), the first biologically inspired, endogenous defense architecture embedded directly within the agent’s cognitive loop. Our framework presents four primary contributions. First, we design a six-layer Immune Tower (L0-L5), distinctly incorporating Barrier Immunity (L1) as a non-cognitive, physical-and-logical isolation layer. Second, we establish a unified taxonomy of Agent Viruses and Agent Vaccines, formalizing the critical distinction between superficial non-parametric defenses and robust parametric vaccines. Third, we conceptualize the Harness Triad–Meta, Self, and Auto–a self-monitoring, meta-cognitive automation backbone that drives Continual Immune Learning (CIL), enabling vaccines to dynamically adapt to novel threats. Finally, we establish a rigorous theoretical demarcation between model alignment and agent immunity: while alignment provides a static “constitutional” value foundation during training, ANIS serves as the dynamic “law enforcement” mechanism during runtime. We conclude by framing open challenges for the field, including immune protocol standardization, novel evaluation metrics such as the Autoimmunity Rate (false-positive intervention rate), and the co-evolutionary dynamics between pathogens and vaccines within collective intelligence ecosystems.
[MA-3] Estimation–Prediction Tradeoff in Causal Probabilistic Temporal Graphs
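[Quick Read]: This paper addresses a bias in how temporal link prediction is usually evaluated: judging models only by predictive performance on unseen edges conflates model error with irreducible uncertainty, so in probabilistic temporal graphs the score cannot show whether a model has learned the underlying causal mechanism. The key to the solution is a probabilistic causal framework that generates synthetic temporal graphs with transient edges and known ground-truth causal structure, so that temporal link prediction can be evaluated jointly with causal parameter recovery. For the proposed binary logistic parametrisation, the paper derives the Cramér–Rao bound and validates the inherent tradeoff between parameter estimation error and irreducible predictive loss. The results show that predictive accuracy alone may not reflect how well a model has learned the causal mechanism, which motivates benchmarks that distinguish reducible model error from intrinsic process uncertainty.
Link: https://arxiv.org/abs/2606.28225
Authors: Aniq Ur Rahman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Systems and Control (eess.SY)
Comments: 8 pages, 4 figures (preliminary work)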
Abstract:Temporal link prediction is usually evaluated by predictive performance on unseen edges, but in probabilistic temporal graphs this criterion can conflate model error with irreducible uncertainty. We study this issue by characterising an inherent estimation–prediction tradeoff in binary logistic models where regimes that maximise Fisher information and improve parameter recoverability are also those with the highest entropy, making individual predictions intrinsically harder even under perfect parameter recovery. We propose a probabilistic causal framework for generating temporal graphs with transient edges and known ground-truth causal structure, allowing temporal link prediction to be evaluated jointly with causal parameter recovery. For the proposed binary logistic parametrisation, we derive the Cramér–Rao bound and validate the tradeoff between parameter estimation error and irreducible predictive loss. Our results show that predictive accuracy alone may not reflect whether a model has learned the underlying causal mechanism, motivating benchmarks that distinguish reducible model error from intrinsic process uncertainty.
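The tradeoff described above can be seen in the simplest binary logistic case. The following is a minimal sketch, assuming a single Bernoulli outcome with success probability sigmoid(theta); the grid of logits and the function names are illustrative assumptions, not the paper's setup. For this model the Fisher information about theta is p(1 - p), and the expected log-loss under perfect knowledge of theta is the binary entropy of p. Both peak at p = 0.5.

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def fisher_information(theta):
    """Fisher information about the logit theta from one Bernoulli(sigmoid(theta)) draw."""
    p = sigmoid(theta)
    return p * (1.0 - p)

def binary_entropy(theta):
    """Entropy in nats of the same outcome: the log-loss that remains
    even when theta is known exactly."""
    p = sigmoid(theta)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

thetas = np.linspace(-6.0, 6.0, 121)  # hypothetical grid of logits
info = fisher_information(thetas)
ent = binary_entropy(thetas)

theta_max_info = float(thetas[np.argmax(info)])
theta_max_entropy = float(thetas[np.argmax(ent)])
print("Fisher information peaks at theta =", theta_max_info)
print("entropy peaks at theta =", theta_max_entropy)
```

In this toy case the regime that is most informative about the parameter (theta near 0) is also the one where each individual outcome is hardest to predict, which is the sense in which predictive loss alone can understate how well the mechanism was learned.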
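[MA-4] Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives
[Quick Read]: This paper addresses fair reward allocation in fully delegated AI cooperatives, where humans are represented by agents that contribute data and take part in model updates under heterogeneous value constraints. The core challenge is to identify and quantify each participant's actual contribution while ensuring that credited updates respect each principal's value preferences. The key to the solution is value-conditioned gradient filtering: each model update is screened against every principal's value profile, and only updates that remain admissible receive credit. On top of this, the framework combines online marginal contribution signals with cumulative revenue settlement, all built on a traversal learning (TL) substrate. TL performs decentralized backpropagation without the quality loss associated with aggregation-centric distributed learning and, because it preserves explicit traversal and gradient paths, the authors argue that it offers a finer attribution substrate than FedAvg-style federated learning. The framework is positioned against data valuation, federated contribution estimation, personalized federated learning, and pluralistic alignment.
Link: https://arxiv.org/abs/2606.28217
Authors: Young Yoon, Jimin Kim, Soyeon Park
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: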
Abstract:We propose a framework for reward allocation in fully delegated AI cooperatives where humans are represented by agents that contribute data and participate in model updates under heterogeneous value constraints. The key idea is to credit only those updates that remain admissible after screening them against each principal’s value profile. We formulate value-conditioned gradient filtering, online marginal contribution signals, and cumulative revenue settlement within a traversal learning (TL) substrate. TL is especially attractive here because it performs decentralized backpropagation without the quality loss associated with aggregation-centric distributed learning and, we argue, offers a finer attribution substrate than FedAvg-style federated learning by preserving explicit traversal and gradient paths. The framework is positioned against data valuation, federated contribution estimation, personalized federated learning, and pluralistic alignment.
[MA-5] GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems SIGDIAL2026
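[Quick Read]: This paper addresses miscoordination in multi-agent systems (MAS) built on large language models (LLMs), which stems more fundamentally from the lack of fine-grained credit assignment across agents. Existing approaches typically rely on coarse-grained feedback, which makes it hard to pinpoint the agents or interaction steps responsible for an error. The paper proposes Gradient-Based Connections (GBC), which models a MAS as a computational graph and introduces gradient-based connection weights that quantify, at the token level, how much each agent's output influences downstream agents. By constructing an attribution graph and propagating task-specific loss signals backward, GBC identifies error sources precisely and supports targeted prompt optimization. The authors also develop AgentChord, an efficient implementation that uses prefix-based gradient computation. Experiments on MultiWOZ and τ-bench show that GBC improves multi-agent performance and outperforms strong single-agent and multi-agent baselines, and that higher attribution quality is associated with greater optimization effectiveness.
Link: https://arxiv.org/abs/2606.28187
Authors: Xiaocheng Yang, Abdulrahman Alrabah, Dilek Hakkani-Tür, Gokhan Tur
Affiliations: University of Illinois Urbana-Champaign
Subjects: Multiagent Systems (cs.MA)
Comments: 15 pages, 8 figures, accepted by SIGDIAL 2026 Long Papers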
Abstract:Multi-agent systems (MAS) built on large language models (LLMs) provide a promising framework for solving complex tasks through role specialization and structured interaction. However, their performance is often limited by miscoordination and, more fundamentally, the lack of fine-grained credit assignment across agents. Existing approaches typically rely on coarse-grained feedback, making it difficult to identify which agents or interaction steps are responsible for errors. We propose Gradient-Based Connections (GBC), an approach for fine-grained attribution and optimization of multi-agent systems. GBC models a MAS as a computational graph and introduces gradient-based connection weights to quantify the influence of each agent’s output on downstream agents at the token level. By constructing an attribution graph and propagating task-specific loss signals backward, our method enables precise identification of error sources and targeted prompt optimization. We further develop AgentChord, an efficient implementation that leverages prefix-based gradient computation. Experiments on MultiWOZ and τ-bench show that GBC improves multi-agent performance and outperforms strong single-agent and multi-agent baselines, and higher attribution quality is associated with greater optimization effectiveness. Code is available at: this https URL.
[MA-6] MMAO: A Metabolic Multi-Agent Optimizer with Endogenous Resource Allocation for Continuous and Discrete Optimization
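[Quick Read]: This paper addresses the parameter dependence common in traditional meta-heuristics: fixed population sizes, manually chosen search scales, and externally attached parameter-control modules. It proposes the Metabolic Multi-Agent Optimizer (MMAO), a cross-domain optimization framework in which adaptation arises endogenously from a private-public metabolic resource loop. Each agent carries internal energy, a continuous role state, motion or structural memory, and local search history, while the population shares a communal resource pool. Fitness improvements are converted into normalized metabolic gains through a robust progress scale and a recent success statistic, and the same closed loop regulates sensing intensity, search amplitude, role drift, branching, pruning, respawning, and elite reinvestment. In the continuous setting, MMAO uses energy-regulated symmetric zero-order probing and role-interpolated motion; in the discrete setting, the same control law is instantiated through structural sensing, local route improvement, guided perturbation, and energy-weighted edge reuse. A reproducible study on a CEC2017 subset (10D/30D, 20 seeds) and five TSPLIB instances (100 discrete runs in total) supports MMAO primarily as a parameter-light, self-calibrating optimization framework whose main validated originality lies in metabolically endogenous resource allocation across heterogeneous search behaviors, rather than as a universally superior optimizer.
Link: https://arxiv.org/abs/2606.28109
Authors: Jinliang Xu, Liping Ma
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Multiagent Systems (cs.MA)
Comments: 10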
Abstract:Traditional meta-heuristics often rely on fixed population sizes, manually chosen search scales, and externally attached parameter-control modules. This paper presents the Metabolic Multi-Agent Optimizer (MMAO), a cross-domain optimization framework in which adaptation is derived endogenously from a private-public metabolic resource loop. Each agent carries internal energy, a continuous role state, motion or structural memory, and local search history, while the population shares a communal resource pool. Fitness improvements are converted into normalized metabolic gains through a robust progress scale and a recent success statistic; the same closed loop then regulates sensing intensity, search amplitude, role drift, branching, pruning, respawning, and elite reinvestment. In the continuous setting, MMAO uses energy-regulated symmetric zero-order probing and role-interpolated motion. In the discrete setting, the same control law is instantiated through structural sensing, local route improvement, guided perturbation, and energy-weighted edge reuse. The paper combines an implementation-faithful formulation with a reproducible experimental study on a CEC2017 subset (10D/30D, 20 seeds) and five TSPLIB instances (100 discrete runs in total). The current evidence supports MMAO primarily as a parameter-light, self-calibrating optimization framework whose main validated originality lies in metabolically endogenous resource allocation across heterogeneous search behaviors, rather than as a universally superior optimizer.
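[MA-7] Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs
[Quick Read]: This paper addresses a bias in theory-of-mind (ToM) evaluations of large language models (LLMs) that rely on dyadic social-deduction games. In such games every observable cue points to a single hidden side, so a model with strong language priors can score well without ever simulating opponents' incentives. The paper extends the Werewolf game with a Jester, a third faction that wins by being voted out, so its utility on peer suspicion is inverted and optimal play requires reasoning across three opposing utility functions. Across 60 games on GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B, with Jester self-learning on and off, the Jester wins 60-70% of games while Werewolves never exceed 20%. GPT-4.1 wolves vote the Jester out on day 1 in 60-70% of games, a strictly self-defeating action. Self-learning helps DeepSeek and Llama but hurts GPT-4.1, with the cost landing on Villagers rather than Werewolves. Only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious, and it gains the most from the loop. The key contribution is the triadic incentive structure itself, which exposes a layer of multi-agent reasoning that dyadic deduction games leave invisible.
Link: https://arxiv.org/abs/2606.27909
Authors: Avni Mittal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: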
Abstract:Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever simulating opponents’ incentives. We extend the Werewolf game with a Jester, a third faction whose utility on peer suspicion is inverted because it wins by being voted out, so optimal play requires reasoning across three opposing utility functions. Across 60 games on GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B with Jester self-learning on and off, the Jester wins 60-70% of games while Werewolves never exceed 20%, and GPT-4.1 wolves vote the Jester out on day 1 in 60-70% of games, a strictly self-defeating action. Self-learning helps DeepSeek and Llama but hurts GPT-4.1, with the cost landing on Villagers rather than Werewolves. Only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious, and it gains the most from the loop. Triadic incentive structure exposes a layer of multi-agent reasoning that dyadic deduction games leave invisible.
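[MA-8] GenWorld: Empirically Grounded Urban Simulation Infrastructure for Scalable LLM-Agent Studies
[Quick Read]: This paper addresses the joint grounding and scaling problem of LLM-agent simulation at city scale: agents should act in environments that reflect real urban constraints, yet direct online LLM calls for a city-scale population are computationally prohibitive. The solution is GenWorld, an empirically grounded urban simulation infrastructure that combines a building-level synthetic city, a structured agent-environment interface, and offline compilation of LLM-derived decision signals into lookup policies for scalable rollout. In a reference instantiation for Higashihiroshima, Japan, GenWorld grounds 196,608 synthetic residents in census and geospatial data, validates demographic consistency against census tabulations, and uses YJMob100K mobile-phone data as a commuting-distance diagnostic. Three reproducible cases (a full-city weekday rollout, a weekday-weekend behavioral contrast, and a warning-response perturbation with auditable replanning traces) support GenWorld as a reproducible platform for grounded and scalable LLM-agent studies, while calibrated forecasting for traffic, evacuation, or policy outcomes remains future work.
Link: https://arxiv.org/abs/2606.27650
Authors: Gen Li, Jieyuan Lan, Pengcheng Xu, Zongyuan Wu, Masaki Ogura, Tao Feng
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments: 27 pages, 24 figures. Code: this https URL . Project page: this https URL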
Abstract:LLM-agent simulation faces a joint grounding and scaling problem: agents should act in environments that reflect real urban constraints, yet direct online LLM calls for city-scale populations are computationally prohibitive. We present GenWorld, an empirically grounded urban simulation infrastructure that combines a building-level synthetic city, a structured agent-environment interface, and offline compilation of LLM-derived decision signals into lookup policies for scalable rollout. In a reference instantiation for Higashihiroshima, Japan, GenWorld grounds 196,608 synthetic residents in census and geospatial data, validates demographic consistency against census tabulations, and uses YJMob100K mobile-phone data as a commuting-distance diagnostic. We demonstrate the infrastructure through three reproducible cases: a full-city weekday rollout, a weekday-weekend behavioral contrast, and a warning-response perturbation with auditable replanning traces. These cases support GenWorld as a reproducible platform for grounded and scalable LLM-agent studies, while calibrated forecasting for traffic, evacuation, or policy outcomes remains future work.
[MA-9] QueenBee Planner: Skill-Evolving Communication Topologies for Token-Efficient LLM Multi-Agent Systems
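[Quick Read]: This paper addresses how strongly the inter-agent communication topology affects the performance of LLM multi-agent systems. Fixed topologies and topologies generated from a cold start accumulate no reusable design knowledge about the communication structure itself. The paper introduces QueenBee Planner, a framework that treats the topology as a retrievable and self-improving design skill. A pool of worker agents, the task adapter, and the scoring function are frozen; only an outer LLM planner learns to generate temporal communication DAGs that specify who sends information to whom, in which round, who merges messages, and who emits the final answer. Execution traces are distilled into evidence-backed design rules with three actions: Preserve, Modify, and Avoid. To keep self-evolution from turning lucky runs or plausible but false explanations into policy, the system uses held-out acceptance gates, variance-aware credit, motif-level attribution, transfer trust, insight falsification, and structural deduplication. On Count-Frequency aggregation and Silo-Bench-style distributed coordination tasks with fixed workers, self-evolved graph generation improves over fixed topologies and cold generation: in the CF fulltest setting, the best generated graph reduces RMSE (root-mean-square error) from 12.53 for the strongest fixed topology to 7.87 while also reducing messages, model calls, and token cost. The results suggest that multi-agent systems can learn reusable architectural design knowledge rather than merely memorizing task answers.
Link: https://arxiv.org/abs/2606.27492
Authors: Congjia Tian, Yuhang Yao, Jiaming Cui
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA)
Comments: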
Abstract:Large language model (LLM) multi-agent systems increasingly depend not only on how individual agents reason, but also on how agents are connected. This paper introduces QueenBee Planner, a framework that treats inter-agent communication topology as a retrievable and self-improving design skill. A pool of worker agents, the task adapter, and the scoring function are frozen; only an outer LLM planner learns to generate temporal communication DAGs specifying who sends information to whom, in which round, who merges messages, and who emits the final answer. Execution traces are distilled into evidence-backed design rules with three actions: Preserve, Modify, and Avoid. To prevent self-evolution from turning lucky runs or plausible but false explanations into policy, QueenBee uses held-out acceptance gates, variance-aware credit, motif-level attribution, transfer trust, insight falsification, and structural deduplication. We evaluate the method on Count-Frequency aggregation and Silo-Bench-style distributed coordination tasks. With fixed workers, self-evolved graph generation produces communication structures that improve over fixed topologies and cold generation. In the CF fulltest setting, the best generated graph reduces RMSE from 12.53 for the strongest fixed topology to 7.87 while also reducing messages, model calls, and token cost; Silo-style results show the same direction of improvement over cold and fixed-topology baselines. These results suggest that multi-agent systems can learn reusable architectural design knowledge rather than merely memorizing task answers.
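As a rough illustration of what a temporal communication plan of this kind might look like as data, here is a minimal sketch. The class names `Message` and `CommPlan`, their fields, and the validation rules are all hypothetical and are not taken from the paper. Because every message is stamped with a round and rounds only move forward, the graph over (agent, round) pairs is acyclic by construction.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    """One directed edge of the plan (hypothetical representation)."""
    sender: str
    receiver: str
    round_index: int  # round in which the message is sent, starting at 1

@dataclass
class CommPlan:
    agents: list
    messages: list
    final_agent: str  # agent that emits the final answer

    def validate(self):
        """Reject unknown agents, self-messages, and non-positive rounds."""
        known = set(self.agents)
        for m in self.messages:
            if m.sender not in known or m.receiver not in known:
                raise ValueError(f"unknown agent in {m}")
            if m.sender == m.receiver or m.round_index < 1:
                raise ValueError(f"invalid message {m}")
        if self.final_agent not in known:
            raise ValueError("final agent is not in the pool")

    def mergers(self):
        """Agents that receive more than one message in the same round."""
        counts = {}
        for m in self.messages:
            key = (m.receiver, m.round_index)
            counts[key] = counts.get(key, 0) + 1
        return sorted({agent for (agent, _), n in counts.items() if n > 1})

# Example: two workers report to a third, which merges and answers.
plan = CommPlan(
    agents=["w1", "w2", "w3"],
    messages=[Message("w1", "w3", 1), Message("w2", "w3", 1)],
    final_agent="w3",
)
plan.validate()
print(plan.mergers(), len(plan.messages))
```

A representation like this makes the message count directly measurable, so a planner can trade accuracy against communication cost when comparing candidate plans.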
[MA-10] Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents
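[Quick Read]: This paper addresses a scaling problem in delegating empirical research directly to LLM coding agents: low-rate instruction lapses compound into broken, irreproducible artefacts. The solution is Glite ARF, an open-source Python framework for running many LLM coding agents in parallel on a research repository without sacrificing reproducibility or auditability. The framework defines a three-role stack: a human researcher chooses which hypotheses to test, coding agents (Claude Code, Codex CLI) implement individual tasks under a fixed structure, and deterministic Python verifier scripts enforce task isolation, immutability of completed work, a corrections overlay, and a materialised project overview. This "verifier-driven research" puts the rules of the research process in code that fails loudly when violated, rather than in prose that agents are merely asked to follow. With the framework, the authors placed first in the closed track and second in the open track of the BEA 2026 vocabulary-difficulty shared task and reduced the official baseline RMSE by 29.9% (closed) and 35.9% (open). The campaign covered 273 tracked tasks (146 experiment runs) at approximately $450 in LLM API spend, and the framework's structural machinery adds only about 1% of wall-clock time.
Link: https://arxiv.org/abs/2606.27416
Authors: Vassili Philippov, Pavel Katunin, Dmitry Andreev, Igor Ostanin, Anton Nikolaev
Affiliations: Glite; The University of Sheffield
Subjects: Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 13 pages, 6 figures, 7 tables. Open-source framework (Apache-2.0) and a public demo project at this https URL and this https URL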
Abstract:LLM coding agents make it tempting to automate empirical research by delegating experiments to them directly, but naive delegation does not scale to large projects: low-rate instruction lapses compound into broken, irreproducible artefacts. To address this problem, we present Glite ARF, an open-source Python framework for running many LLM coding agents in parallel on a research repository without sacrificing reproducibility or auditability. The framework defines a three-role stack: a human researcher chooses which hypotheses to test, coding agents (Claude Code, Codex CLI) implement individual tasks under a fixed structure, and deterministic Python verifier scripts enforce task isolation, immutability of completed work, a corrections overlay, and a materialised project overview. We call this verifier-driven research: the rules of the research process live in code that fails loudly when violated, not in prose that agents are merely asked to follow. Using Glite ARF, we developed our submission to the BEA 2026 vocabulary-difficulty shared task, placing first in the closed track and second in the open track on all three target languages (Spanish, German, Mandarin) and reducing the official baseline RMSE by 29.9% (closed) and 35.9% (open). The campaign comprised 273 tracked tasks (146 experiment runs) across 129 feature sets, run by up to twelve parallel agents orchestrated from a single laptop - with some model training on rented A100s - at approximately $450 in LLM API spend ($498 total third-party cost), and structured per-fold provenance let us catch and strip four target-leaking feature sets, correcting an implausible 0.609 RMSE to 0.802. Across three campaigns in three domains, the framework’s structural machinery adds only about 1% of wall-clock time. Framework and a public demo project accompany this paper.
[MA-11] Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement
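[Quick Read]: This paper addresses the spread of hallucinated claims in multi-agent LLM systems caused by delayed verification: while verification is still pending, false beliefs can propagate through the agent network. The paper models the process as delayed consensus on a graph with grounded corrector nodes. Spectral decomposition by the grounded Laplacian yields a closed-form stability threshold for the verification dose: correction that is too strong or too delayed turns consensus into oscillation. The most unstable regime occurs when the communication and verification delays coincide, and for delay two the threshold is exactly the inverse golden ratio. The same framework gives a supermodular placement objective and a greedy (1-1/e)-approximation rule for assigning a limited corrector budget to influential nodes. Experiments across five open models confirm the predicted dose-delay oscillations. By contrast, grounded factual answering makes truth an absorbing boundary and eliminates the effect, which suggests that the instability is specific to signed-belief tasks while grounded verification remains stabilizing.
Link: https://arxiv.org/abs/2606.27409
Authors: Igor Itkin
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 20 pages, 5 figures, 1 table. Code and data: this https URL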
Abstract:Multi-agent large language model (LLM) systems often rely on verifier and critic agents to suppress hallucinations, but verification is delayed. During this delay, false claims can propagate through the agent network. We model this process as delayed consensus on a graph with grounded corrector nodes. Spectral decomposition by the grounded Laplacian yields a closed-form stability threshold for the verification dose: correction that is too strong or too delayed can turn consensus into oscillation. The most unstable regime occurs when the communication and verification delays coincide; for delay two, the threshold is the inverse golden ratio. The same framework gives a supermodular placement objective and a greedy (1-1/e)-approximation rule for assigning a limited corrector budget to influential nodes. Experiments across five open models confirm the predicted dose-delay oscillations. By contrast, grounded factual answering makes truth an absorbing boundary and eliminates the effect, suggesting that the instability is specific to signed-belief tasks while grounded verification remains stabilizing.
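A scalar toy model gives some intuition for why a delayed correction can destabilize. This is an assumed simplification, not the paper's graph model: a single belief error x follows x[t+1] = x[t] - k * x[t - d], where k plays the role of the correction dose and d the delay. For this recurrence the classical stability condition is 0 < k < 2 cos(d * pi / (2d + 1)), which for d = 2 equals the inverse golden ratio, roughly 0.618. The function names and the chosen values of k are illustrative.

```python
import math

def simulate(k, delay, steps=300, x0=1.0):
    """Iterate x[t+1] = x[t] - k * x[t - delay] from a constant history x0."""
    xs = [x0] * (delay + 1)
    for _ in range(steps):
        xs.append(xs[-1] - k * xs[-1 - delay])
    return xs

def tail_amplitude(xs, window=20):
    """Largest absolute value over the last `window` steps."""
    return max(abs(x) for x in xs[-window:])

delay = 2
# Stability bound of the toy recurrence for this delay.
threshold = 2.0 * math.cos(delay * math.pi / (2 * delay + 1))
inverse_golden_ratio = (math.sqrt(5.0) - 1.0) / 2.0

# One dose below the bound and one above it.
weak = tail_amplitude(simulate(k=0.50, delay=delay))
strong = tail_amplitude(simulate(k=0.75, delay=delay))
print("threshold:", threshold, "inverse golden ratio:", inverse_golden_ratio)
print("tail amplitude, k=0.50:", weak)
print("tail amplitude, k=0.75:", strong)
```

Below the bound the error decays through damped oscillations; above it the same corrective term overshoots and the oscillations grow. In the paper's setting the corresponding threshold comes from the spectrum of the grounded Laplacian rather than from a single scalar gain.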
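[MA-12] SidConArena: An Environment Evaluating Agents in Open-Ended Positive-Sum Bargaining Game
[Quick Read]: This paper addresses the evaluation of LLM agents in open-ended, positive-sum settings; static reasoning tasks and zero-sum games do not capture the mixed-motive character of real-world economic interaction, in which agents must negotiate, create positive-sum surplus, compete for scarce assets, and plan under delayed returns. The solution is SidConArena, a benchmark framework that formalizes a multi-player economy as a finite-horizon partially observable stochastic game with three coupled phases: natural-language negotiation with binding trades, deterministic converter-based production, and sealed-bid auctions for long-term assets. By combining structured observations, phase-aware agent dispatching, a neural-symbolic action interface, and asynchronous execution, the framework allows free-form interaction while preserving rule-grounded evaluation. Across homogeneous and heterogeneous tournaments, stronger frontier models achieve higher economic outcomes, yet agents still misvalue resources, bargain passively, and remain limited in long-horizon investment planning.
Link: https://arxiv.org/abs/2606.27397
Authors: Yeqi Feng, Yuxin Chen, Tianxing He
Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 15 pages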
Abstract:Evaluating LLM agents requires dynamic environments that go beyond static reasoning and zero-sum games. Real-world economic interaction is often open-ended and mixed-motive: agents must negotiate, create positive-sum surplus, compete for scarce assets, and plan under delayed returns. We introduce SidConArena, a new benchmark framework for evaluating LLM agents in open-ended, positive-sum bargaining. SidConArena formalizes a multi-player economy as a finite-horizon partially observable stochastic game with three coupled phases: natural-language negotiation with binding trades, deterministic converter-based production, and sealed-bid auctions for long-term assets. The framework combines structured observations, phase-aware agent dispatching, a neural-symbolic action interface, and asynchronous execution, enabling free-form interaction while preserving rule-grounded evaluation. Across homogeneous and heterogeneous tournaments, stronger frontier models achieve higher economic outcomes, yet agents still misvalue resources, bargain passively, and remain limited in long-horizon investment planning.
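Natural Language Processing
[NLP-0] Towards Automating Scientific Review with Google's Paper Assistant Tool
[Quick Read]: This paper addresses a systemic challenge created by the rapid adoption of AI in scientific discovery: traditional human peer review cannot scale to match the influx of AI-assisted science. To resolve this tension, the paper argues that AI must also be deployed to accelerate verification and review. It proposes a taxonomy of four progressive levels of AI-human collaboration in scientific evaluation and introduces the Paper Assistant Tool (PAT), an agentic AI framework that ingests full scientific manuscripts and produces a comprehensive evaluation: checking theoretical results, validating experiments, suggesting improvements, and identifying potential flaws. Using inference scaling techniques, PAT identifies deeper issues than a single model call and achieves a 34% improvement over zero-shot recall on mathematical errors in the SPOT benchmark. Pilot deployments as a pre-submission tool for authors at two major computer science conferences, STOC and ICML, show that PAT identifies critical errors and suggests substantive improvements, easing the cognitive burden on referees while preserving their control over review outcomes.
Link: https://arxiv.org/abs/2606.28277
Authors: Rajesh Jayaram, Drew Tyler, David Woodruff, Corinna Cortes, Yossi Matias, Vahab Mirrokni, Vincent Cohen-Addad
Affiliations: Google Research; Carnegie Mellon University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: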
Abstract:Artificial intelligence is driving a revolution in scientific discovery, accelerating everything from hypothesis generation to mathematical theorem proving. However, this rapid acceleration is creating a systemic challenge: traditional human peer review cannot scale to match the influx of AI-assisted science. Ultimately, to resolve this tension, we must also deploy AI to accelerate the verification and review process itself. To frame the discussion around this transition, we propose a taxonomy consisting of four progressive levels of AI-human collaboration in scientific evaluation, and discuss various trade-offs involved with each. As a step toward this future, we introduce the Paper Assistant Tool (PAT), an agentic AI framework built for deep scientific review and verification. PAT ingests full scientific manuscripts and produces a comprehensive evaluation, checking theoretical results, validating experiments, suggesting improvements, and identifying potential flaws. By utilizing inference scaling techniques, PAT is able to identify deeper issues than a single model call alone, achieving a 34% improvement over zero-shot recall on mathematical errors in the SPOT benchmark. Pilot deployments of PAT as a pre-submission tool for authors at two major Computer Science conferences – STOC and ICML – demonstrate its ability to identify critical errors and suggest substantive improvements to research papers. By catching errors early, PAT eases the cognitive burden placed on referees, while preserving their control over the outcomes of the review process.
[NLP-1] Vision-Default Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
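[Quick Read]: This paper examines how vision-language models (VLMs) decide when visual evidence conflicts with memorized world knowledge. How this conflict is resolved shapes the reliability of multimodal systems, yet prior work describes it only behaviorally, without a causal account of individual model components. The authors combine activation patching at three granularities (residual stream, attention heads, MLP sublayers) with component ablation and mechanistic analysis. The key finding is that visual grounding is the default, whereas prior-knowledge grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network. These heads let the model answer from stored knowledge (e.g., "red" for a strawberry) despite contradictory visual input. Under prior-knowledge prompts, ablating them flips 68-96% of predictions from knowledge-grounded to visually grounded, but changes only 0.8-7.5% of visually grounded predictions, which establishes an asymmetric causal structure. The heads decompose into routing heads, which modulate information flow, and writing heads, which project answer tokens directly into the residual stream. The structure holds across model families and scales, revealing a sparse causal circuit behind perception-knowledge conflict.
Link: https://arxiv.org/abs/2606.28273
Authors: Niclas Lietzow,Danielle Bitterman,Carsten Eickhoff,William Rudman,Michal Golovanevsky
Affiliations: University of Tübingen; Harvard University; The University of Texas at Austin
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 11 figures, 8 tables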
Abstract:Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally without a component-level causal account. We combine activation patching across three granularities (residual stream, attention heads, and MLP sublayers) with model-component ablation studies and mechanistic analysis. Across three VLM families, we find that visual grounding emerges by default, whereas prior grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network. These heads enable answers from stored world knowledge (e.g., “red” for a strawberry) despite conflicting visual input. Ablating them flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions, establishing an asymmetric causal structure. The identified heads decompose into routing heads, which modulate information flow, and writing heads, which directly project answer tokens into the residual stream. This structure is consistent across model families and scales, revealing a sparse causal circuit underlying perception-knowledge conflict in VLMs.
[NLP-2] Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction
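[Quick Read]: This paper addresses human item difficulty prediction in educational assessment. Existing methods rely on costly human calibration or on representations of the item text alone, and say little about the cognitive processes that make an item hard. The core question is how to model the problem-solving burden an item induces so that predictions become both more accurate and more interpretable. The proposed framework, Epi2Diff (Episode to Difficulty), maps the reasoning traces of Large Reasoning Models (LRMs) into cognitively grounded episode sequences: trace segments are grouped into functional problem-solving states, so that difficulty can be modeled through reasoning scale, effort allocation, and state transitions. These compact episode-dynamic features are combined with semantic item representations. On four real-world difficulty datasets, Epi2Diff outperforms strong baselines, including fine-tuned small language models, LLM in-context learning, and supervised LLM adaptation; on SAT-derived classification benchmarks it achieves an 8.1% average relative gain over supervised LLM fine-tuning. Further analysis shows that harder items induce more effortful, iterative, and implementation-centered episode dynamics rather than merely longer responses.
Link: https://arxiv.org/abs/2606.28186
Authors: Chenguang Wang,Ming Li,Xinyue Zeng,Zhuochun Li,Hong Jiao,Tianyi Zhou,Dawei Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 32 pages, 8 figures, 10 tables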
Abstract:Predicting human item difficulty is central to educational assessment, where reliable estimates support fairness and effective test construction. Existing methods often depend on costly human calibration or item-level textual representations, providing limited evidence about the cognitive processes that make items difficult. We argue that difficulty should be viewed not only as a property of item text, but also as an observable consequence of the problem-solving burden an item induces. Large Reasoning Models (LRMs) offer scalable process evidence through reasoning traces, but such evidence must be structured to support interpretable modeling. To this end, we introduce Epi2Diff (Episode to Difficulty), a framework that maps LRM reasoning traces into cognitively grounded episode sequences. These episodes group trace segments into functional problem-solving states, enabling difficulty to be modeled through reasoning scale, effort allocation, and state transitions. Epi2Diff extracts compact episode-dynamic features and combines them with semantic item representations for human difficulty prediction. Experiments on four real-world human difficulty datasets show that Epi2Diff consistently outperforms strong baselines, including fine-tuned small language models, LLM in-context learning, and supervised LLM adaptation. On SAT-derived classification benchmarks, Epi2Diff achieves an 8.1% average relative gain over supervised LLM fine-tuning baselines. Further analyses show that harder items induce more effortful, iterative, and implementation-centered episode dynamics, rather than merely longer responses. These results demonstrate that cognitive episodes in LRM reasoning traces provide a predictive and interpretable process representation for human item difficulty, offering a new lens for educational measurement with reasoning models.
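The episode-dynamic features mentioned above (reasoning scale, effort allocation, state transitions) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the episode labels (`read`, `plan`, `implement`, `verify`), the `(label, n_tokens)` input format, and the revisit count used as an iteration proxy are all assumptions made for illustration.

```python
from collections import Counter

def episode_features(episodes):
    """Summarise a trace given as a list of (label, n_tokens) episodes.

    The label set and the feature definitions are hypothetical stand-ins
    for the paper's episode-dynamic features.
    """
    labels = [label for label, _ in episodes]
    total_tokens = sum(n for _, n in episodes)

    # Effort allocation: share of tokens spent in each episode type.
    tokens_by_label = Counter()
    for label, n in episodes:
        tokens_by_label[label] += n
    effort = {lab: n / total_tokens for lab, n in tokens_by_label.items()}

    # State transitions: counts of consecutive label pairs.
    transitions = Counter(zip(labels, labels[1:]))

    # Revisits: how often the trace returns to an episode type it has
    # already left, used here as a crude proxy for iterative behaviour.
    seen, revisits = set(), 0
    for prev, cur in zip([None] + labels, labels):
        if cur != prev and cur in seen:
            revisits += 1
        seen.add(cur)

    return {
        "n_episodes": len(episodes),   # reasoning scale (episode count)
        "total_tokens": total_tokens,  # reasoning scale (length)
        "effort": effort,
        "transitions": dict(transitions),
        "revisits": revisits,
    }

trace = [("read", 40), ("plan", 20), ("implement", 100),
         ("verify", 30), ("implement", 60)]
features = episode_features(trace)
print(features)
```

A feature vector like this could then be concatenated with a text embedding of the item before fitting a difficulty predictor, which is the combination the abstract describes at a high level.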
[NLP-3] From Tokens to States: LLMs as a Special Case of World Models and the Continuous Path Beyond
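[Quick Read]: This paper challenges the binary framing of large language models (LLMs) versus world models, in which the former only predict tokens and the latter simulate reality. Its first claim is that an LLM is a degenerate special case of a world model: the state space is the set of all token sequences and the only action is appending one token, so world models strictly generalize LLMs and do not replace them. Its second claim is that there is a continuous spectrum from next-token prediction (NTP) to the Joint-Embedding Predictive Architecture (JEPA), with multi-token prediction, future-summary prediction, and next-latent prediction as intermediate stations already populated by current research. Moving along the spectrum relaxes the LLM constraints one by one, but also progressively gives up the two practical advantages that make LLMs trainable at scale: internet-scale self-supervised text, and a transformer architecture co-designed for discrete token prediction. The paper treats both as open research questions: the data question (the move from self-supervised text to instrumented, action-labelled environments) and the architecture question (whether the transformer generalizes to continuous-state prediction or a new primitive is needed).
Link: https://arxiv.org/abs/2606.28127
Authors: Paul Dubois
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures, 1 table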
Abstract:The AI community has framed the relationship between large language models (LLMs) and world models as a dichotomy: LLMs predict tokens; world models simulate reality. Yann LeCun argues in 2022 that reaching general intelligence requires abandoning autoregressive token prediction in favour of latent-space architectures. This framing is unnecessarily binary. Two claims will be defended. First, LLMs are a degenerate special case of world models: the state space is the set of all token sequences, the only action is appending one token, and world models are therefore a strict generalisation of LLMs, not a replacement. Second, there is a natural continuous spectrum from NTP to JEPA, with multi-token prediction, future-summary prediction, and next-latent prediction as intermediate stations already populated by current research. Moving along this spectrum relaxes the LLM constraints one by one. It also progressively surrenders the two practical advantages that make LLMs trainable at scale: internet-scale self-supervised data, and a transformer architecture co-designed for discrete token prediction. Both are examined as open research questions: the data question (the cliff from self-supervised text to instrumented action-labelled environments) and the architecture question (whether the transformer generalises to continuous-state prediction, or whether a new primitive is needed).
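The "LLM as a degenerate world model" claim can be made concrete with a small Python sketch. The names `llm_transition`, `rollout`, and `toy_policy` are hypothetical; the toy policy is only a stand-in for a next-token predictor and is not taken from the paper.

```python
from typing import Callable, Tuple

State = Tuple[str, ...]  # a state is the full token sequence so far
Action = str             # the only kind of action: the token to append

def llm_transition(state: State, action: Action) -> State:
    """Transition function of the 'LLM world': append exactly one token."""
    return state + (action,)

def rollout(state: State, policy: Callable[[State], Action], steps: int) -> State:
    """Generic world-model rollout; nothing here is specific to text."""
    for _ in range(steps):
        state = llm_transition(state, policy(state))
    return state

def toy_policy(state: State) -> Action:
    """Stand-in for next-token prediction: cycle through a tiny vocabulary."""
    vocab = ("a", "b", "c")
    return vocab[len(state) % len(vocab)]

final_state = rollout(("x",), toy_policy, steps=3)
print(final_state)
```

In this framing, a more general world model would replace `State` with a richer (possibly continuous) representation and `llm_transition` with a learned dynamics function, while `rollout` stays the same.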
[NLP-4] Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability
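[Quick Read]: This paper addresses training instability in large language model (LLM) training caused by numerical or hyperparameter faults. Once such a fault has destabilized the training dynamics, training may continue for thousands of steps while loss and gradient norms still look normal, so the failure is hard to notice in time. The proposed solution is to derive mechanism-driven internal monitors from the functional role of each critical module and from the earliest computational sites where a failure should produce a measurable signature. For low-precision flash attention, the authors monitor the spectral entropy of a QK bilinear decomposition, whose first-order term becomes abnormal before the loss fully collapses. For mixture-of-experts (MoE) routers, they derive indicators from the router's role in expert selection. Fault-injection experiments with low-precision attention, large learning rates, and combined faults show that these signals give distinct signatures for different failures and trigger thousands of steps before loss divergence.
Link: https://arxiv.org/abs/2606.28116
Authors: Ruixuan Huang,Yipei Wang,Wenyi Fang,Hantao Huang,Yifan Huang,Ansheng You,Zhenxing Zhang,Shuai Wang,Fan Wu,Yang Zheng
Affiliations: Huawei
Categories: Computation and Language (cs.CL)
Comments: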
Abstract:Frontier large language model training consumes massive accelerator fleets and long wall-clock computation, making stability failures costly when they occur. After a numerical or a hyperparameter fault has already destabilized the training dynamics, it may continue for thousands of steps while loss and gradient norms still appear normal. We study mechanism-driven detection of training instability by deriving internal monitors from the functional role of each critical module and from the earliest computational sites where failures are expected to produce measurable signatures. For low-precision flash attention, we monitor the spectral entropy of a QK bilinear decomposition, whose first-order term becomes abnormal before the loss fully collapses. For MoE routers, we derive indicators from their role in expert selection. Our fault-injection experiments on low-precision attention, large learning-rate, and combined faults show that these signals provide distinct signatures for different failures, triggering thousands of steps before loss divergence.
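To give a feel for a spectral-entropy monitor, here is a minimal NumPy sketch. The abstract does not specify the QK bilinear decomposition, so this example simply assumes the product `w_q @ w_k.T` as the monitored matrix and simulates a collapse by adding a dominant rank-1 term; both choices are illustrative assumptions, not the paper's method.

```python
import numpy as np

def spectral_entropy(matrix: np.ndarray) -> float:
    """Shannon entropy of the normalised singular-value spectrum."""
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
d_model, d_head = 64, 16

# Assumed stand-in for the QK bilinear form of one attention head.
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
healthy = spectral_entropy(w_q @ w_k.T)

# Simulated fault: one direction comes to dominate the bilinear form.
u = rng.normal(size=(d_model, 1))
collapsed = spectral_entropy(w_q @ w_k.T + 1e3 * (u @ u.T))

print(f"healthy={healthy:.3f} collapsed={collapsed:.3f}")
```

A monitor built on this idea would log the entropy per head every few steps and raise an alert when it drops well below its running baseline; the threshold would have to be tuned per model.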
[NLP-5] MultiHashFormer: Hash-based Generative Language Models
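[Quick Read]: This paper addresses the parameter growth of language model (LM) embedding matrices, which scale linearly with vocabulary size. Prior work compresses parameters by hashing many tokens into a single vector, but only in encoder-only models, because many-to-one collisions prevent autoregressive generation in causal LMs. The proposed MultiHashFormer framework uses multiple independent hash functions to give each token a unique hash signature, a short sequence of discrete hash IDs. A Hash Encoder compresses the signature into a single latent vector for a Transformer decoder; a Hash Decoder then generates the hash signature of the next token, which is mapped back to text. This enables hash-based autoregression, and at the 100M, 1B, and 3B parameter scales the model outperforms standard Transformer LMs across multiple benchmarks. The approach also handles multilingual vocabulary expansion with a constant parameter footprint and without modification.
Link: https://arxiv.org/abs/2606.28057
Authors: Huiyin Xue,Atsuki Yamaguchi,Nikolaos Aletras
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review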
Abstract:Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
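The idea of a multi-hash signature can be sketched in a few lines of Python. This is an illustration only: the use of seeded SHA-256, the bucket count, and the number of hash functions are assumptions, and the sketch does not guarantee unique signatures (the abstract does not say how uniqueness is enforced).

```python
import hashlib

def hash_signature(token, num_hashes=4, num_buckets=1024):
    """Return a tuple of `num_hashes` bucket IDs for `token`.

    Each 'independent hash function' is simulated by prefixing a seed
    before hashing with SHA-256 (an assumption for illustration).
    """
    signature = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
        signature.append(int.from_bytes(digest[:8], "big") % num_buckets)
    return tuple(signature)

vocab = [f"tok{i}" for i in range(50_000)]  # hypothetical vocabulary

# One hash into 64 buckets: many-to-one collisions are unavoidable.
single = {hash_signature(t, num_hashes=1, num_buckets=64) for t in vocab}

# Four hashes into 1024 buckets each: far more distinct signatures.
multi = {hash_signature(t) for t in vocab}

print("distinct single-hash signatures:", len(single))
print("distinct multi-hash signatures: ", len(multi))
print("embedding rows: full =", len(vocab), "hashed =", 4 * 1024)
```

The last line shows the parameter argument: the hashed tables need `num_hashes * num_buckets` rows regardless of vocabulary size, whereas a standard embedding matrix needs one row per token.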
[NLP-6] Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry Mechanistic Interpretability and Transferability for In-Context QA
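[Quick Read]: This paper tests the assumption, implicit in LLM-as-a-Judge and self-evaluation pipelines, that evaluation is easier than generation. It uses a controlled in-context QA setting in which a context passage is the sole information source and each model judges the answer it generated itself, which removes the parametric-knowledge confound of open-domain comparisons. Evaluation is not uniformly easier: across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models, generation accuracy exceeds self-evaluation on three of the four, with multi-hop MuSiQue the exception. Attention analysis explains why: evaluation attends to the context 3-5x less than generation does and barely reads the candidate answer. LoRA fine-tuning confirms that the asymmetry is not a training artifact: fine-tuning on generation induces over-acceptance of answers, and fine-tuning on evaluation degrades generation. These findings challenge a core assumption behind current self-evaluation pipelines.
Link: https://arxiv.org/abs/2606.28050
Authors: Sambaran Bandyopadhyay
Affiliations: Adobe Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages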
Abstract:LLM-as-a-Judge and self-evaluation pipelines implicitly assume that evaluation is easier than generation. We test this in a controlled in-context QA setting where a context passage is the sole information source and each model judges the answer it generated, removing the parametric-knowledge confound of open-domain comparisons. Across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models, evaluation is not uniformly easier: generation accuracy exceeds self-evaluation on three of four, with multi-hop MuSiQue the exception. Attention analysis reveals why: evaluation attends to context 3–5x less than generation does and barely reads the candidate answer. LoRA fine-tuning confirms the asymmetry is not a training artifact: generation fine-tuning induces over-acceptance and evaluation fine-tuning degrades generation. These findings challenge core assumptions in self-evaluation pipelines.
[NLP-7] DGVoiC: Speaker Clustering for Fraud Investigation under Real Call-Centre Conditions
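[Quick Read]: This paper addresses a blind spot in insurance fraud detection: repeated speaker identity across calls is underused as an investigative signal, particularly in call-centre workflows where many customer interactions begin at first notice of loss (FNOL). The proposed solution, DG^VoiC, is a voice clustering framework for customer verification and cross-profile speaker linking under real telephony conditions. It combines sensitive-information-aligned anonymisation, speech-focused preprocessing, sliding-window speaker embedding extraction, and cosine-similarity-based clustering. The method was evaluated on 121 recordings, with a curated reference subset of 56 samples in 22 human-agreed speaker clusters used for validation. The best configuration achieved 96% adjusted mutual information (AMI), 95% adjusted Rand index (ARI), 98% completeness, 100% homogeneity, and 99% V-measure. The results indicate that speaker clustering can be a strong additional signal for fraud investigation, helping analysts verify speaker consistency and surface repeated voices across customers.
Link: https://arxiv.org/abs/2606.28048
Authors: Muhammad Shakeel Akram,Amal Htait,Abdul Hamid Sadka,Emma Meisingseth,Karishma Jaitly
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 4 figures, 1 table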
Abstract:Insurance fraud remains costly and operationally difficult, particularly in call-centre workflows where many customer interactions begin at FNOL. While recent fraud detection methods mainly rely on structured data, text, or images, repeated speaker identity across calls remains underused as an investigative signal. This paper presents DG^VoiC, a voice clustering framework for customer verification and cross-profile speaker linking on anonymised real call-centre audio. The approach combines sensitive information-aligned anonymisation, speech-focused preprocessing, sliding-window speaker embedding extraction, and cosine similarity based clustering to identify repeated speakers under real telephony conditions. The method was evaluated on 121 recordings, with a curated reference subset of 56 samples in 22 human-agreed speaker clusters used for validation. The best configuration achieved 96% AMI, 95% ARI, 98% completeness, 100% homogeneity, and 99% V-measure. These results show that speaker clustering can provide a strong additional signal for fraud investigation by helping analysts verify speaker consistency and surface repeated voices across customers.
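Cosine-similarity-based speaker clustering can be sketched as follows. The abstract does not name the clustering algorithm, so this example assumes a simple greedy scheme with a fixed similarity threshold; the function name, the threshold of 0.8, and the 2-D toy embeddings are all hypothetical.

```python
import numpy as np

def cluster_by_cosine(embeddings: np.ndarray, threshold: float = 0.8):
    """Greedy clustering: join the most similar cluster or open a new one.

    Each cluster is summarised by the running sum of its unit-normalised
    members; similarity is the cosine between a new embedding and that sum.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids, labels = [], []
    for v in x:
        sims = [float(v @ c / np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            centroids[k] = centroids[k] + v  # update the cluster summary
            labels.append(k)
        else:
            centroids.append(v.copy())       # start a new speaker cluster
            labels.append(len(centroids) - 1)
    return labels

# Toy "speaker embeddings": two voices, recorded three and two times.
toy = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.98], [1.0, 0.05]])
labels = cluster_by_cosine(toy)
print(labels)
```

In a real pipeline the rows would be sliding-window speaker embeddings averaged per recording, and the threshold would be tuned against a labelled reference subset such as the one the abstract describes.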
[NLP-8] A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs
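[Quick Read]: This paper addresses the limitations of purely extractive and purely abstractive approaches to legal case judgement summarization. Most prior work uses one of the two, and hybrid (extractive-abstractive) techniques remain little explored. The authors propose a tree-of-thoughts inspired extractive-abstractive prompting approach in which the large language model (LLM) first extracts key information in a structured way and then produces an abstractive summary based on the extracted content, with the aim of combining the strengths of both methods. Experiments with two popular LLMs, DeepSeek and Llama, compare extractive, abstractive, and extractive-abstractive summarization, and show that the proposed extractive-abstractive prompt yields better summaries than the other prompt types.
Link: https://arxiv.org/abs/2606.28044
Authors: Aniket Deroy,Kripabandhu Ghosh,Saptarshi Ghosh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at ICAIL 2026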
Abstract:In recent times, Large Language Models (LLMs) are increasingly being used for legal case judgement summarization. Most prior works have tried traditional extractive and abstractive summarization of case judgements. However, hybrid or extractive-abstractive techniques have not been explored much. In this work, we propose a novel tree-of-thoughts inspired extractive-abstractive summarization approach for legal judgement summarization. We conduct experiments using two popular LLMs, DeepSeek and Llama, and compare among extractive, abstractive and extractive-abstractive summarization. Our experiments show that the proposed extractive-abstractive prompt provides better summaries compared to other types of LLM prompts.
[NLP-9] The Signal-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization
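[Quick Read]: This paper addresses the fact that the headline type-correctness rate (TC%) of LLM autoformalization, which has risen from about 53% to about 76%, is a single scalar that hides which errors each method actually resolves. The authors propose a signal-coverage matrix that crosses the Lean elaborator's verdict (pass/fail) with a semantic-equivalence judgment (equivalent/not), sorting every output into one of four cells: true success (TS), type-only (TO), semantic-only (SO), or both fail (BF). Across four methods (Vanilla, Lean-Retry, Sample-Filter, and Stratified Autoformalization, SAF), they report three findings. (1) The +34 to +36 TS gain of the three elaborator-feedback methods is about 64% type-stratum recovery, while SO is flat on net (87.5% of the original semantic errors are rescued, but 8 new ones are created). (2) The TO-to-TS rate is 23/61 for each method (Wilson 95% CI [26.6%, 50.3%]); this stratum-level recovery rate predicts ΔTS on held-out methods to within 2/186 and makes ΔTC linear in the Vanilla elaboration-failure rate across six (model, dataset) cells (R^2 = 0.96). (3) The two judges disagree by 26 to 37 percentage points on elaborator-feedback outputs (versus 7 on Vanilla), and 30 to 56% of the symbolic judge's false negatives trace to elaborator-forced rewrites. The persistent residual reduces to two errors in the gold formalizations. The conclusion is that TC% gains should be credited according to which cell moved, not by the scalar alone.
Link: https://arxiv.org/abs/2606.28013
Authors: Chengxiao Dai,Zhaokun Yan,Zhanhui Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: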
Abstract:Headline type-correctness (TC%) of LLM autoformalization has climbed from ~53% to ~76% in two years, yet this scalar conceals which errors each method resolves. We propose a signal-coverage matrix that crosses the Lean elaborator (pass/fail) with a semantic-equivalence judgment (equivalent/not), sorting every output into one of four cells: true success (TS), type-only (TO), semantic-only (SO), or both fail (BF). On ProofNet# and MiniF2F-test with DeepSeek V4-Pro across Vanilla, Lean-Retry, Sample-Filter, and Stratified Autoformalization (SAF): (1) the +34 to +36 TS gain across the three elab-feedback methods is ~64% type-stratum recovery, with SO flat on net (87.5% of original semantic errors rescued, 8 newly created). (2) The TO-to-TS rate is 23/61 for each method (Wilson 95% CI [26.6%, 50.3%]), and this stratum-level recovery rate predicts ΔTS on held-out methods to within 2/186 and renders ΔTC linear in the Vanilla elab-fail rate across six (model, dataset) cells (R^2 = 0.96). (3) The two judges disagree by 26 to 37 pp on elab-feedback outputs (vs. 7 pp on Vanilla), with 30 to 56% of symbolic-judge false negatives traceable to elaborator-forced rewrites. The persistent residual reduces to two gold-formalization errors. TC% gains should be credited by which cell moved, not by the scalar alone.
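The four-cell classification and the reported Wilson interval can be reproduced with a short Python sketch. The function names and the toy outcome list are hypothetical; the Wilson score formula is the standard one, and the 23/61 figure is taken from the abstract.

```python
import math
from collections import Counter

def cell(elab_pass: bool, equivalent: bool) -> str:
    """Map (elaborator verdict, semantic judgment) to a matrix cell."""
    if elab_pass and equivalent:
        return "TS"  # true success
    if elab_pass:
        return "TO"  # type-only: elaborates, but not equivalent
    if equivalent:
        return "SO"  # semantic-only: equivalent, but fails to elaborate
    return "BF"      # both fail

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Toy outcomes, for illustration only (not data from the paper).
outcomes = [(True, True), (True, False), (False, True), (False, False), (True, True)]
counts = Counter(cell(e, s) for e, s in outcomes)
print(dict(counts))

# The TO-to-TS rate reported in the abstract: 23 of 61.
low, high = wilson_interval(23, 61)
print(f"{low:.1%} to {high:.1%}")
```

Running the interval on 23/61 gives bounds that round to the 26.6% and 50.3% quoted in the abstract, which is a quick consistency check on the reported figure.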
[NLP-10] Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection
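[Quick Read]: This paper addresses the difficulty of detecting insurance fraud early, at first notice of loss (FNOL). Existing approaches rely largely on private, text-only datasets, which limits progress on multimodal methods that integrate linguistic, behavioural, and speaker-based indicators. The proposed solution is a synthetic multimodal framework that replicates FNOL conditions: it generates agent-customer dialogue transcripts and two-speaker audio, then performs automatic speech recognition (ASR) and diarisation. Downstream modules combine named entity recognition (NER), regex-based feature extraction, LLM-RAG retrieval, and speaker embeddings in a rule-based risk score that flags narrative reuse, structural inconsistencies, and cross-case voice repetition while balancing sensitivity against false positives. Dataset validation and component-level evaluations show stability and transfer potential, offering a reproducible baseline beyond text-only fraud detection.
Link: https://arxiv.org/abs/2606.28002
Authors: Muhammad Shakeel Akram,Amal Htait,Abdul Hamid Sadka,Emma Meisingseth,Karishma Jaitly
Affiliations: Aston University; Domestic General
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 8 figures, 2 tables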
Abstract:Insurance fraud imposes substantial financial losses and operational inefficiencies, raising premiums and impacting trust among legitimate policyholders. Early detection at FNOL remains a persistent challenge. Existing approaches rely largely on private, text-only datasets, limiting progress on multimodal methods that integrate linguistic, behavioural, and speaker-based indicators. We introduce a synthetic multimodal framework that replicates FNOL conditions. It generates agent-customer dialogue transcripts and two-speaker audios, performs ASR and diarisation. Downstream modules combine NER, regex-based feature extraction, LLM-RAG retrieval, and speaker embeddings in a rule-based risk score to flag narrative reuse, structural inconsistencies, and cross-case voice repetition while balancing sensitivity and false positives. Dataset validation and component-level evaluations show stability and transfer potential, offering a reproducible baseline beyond text-only fraud detection.
[NLP-11] ToxiREX: A Dataset on Toxic REasoning in ConteXt
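[Quick Read]: This paper addresses the detection and explanation of implicit, context-dependent toxicity in multilingual conversations. The authors apply a systematic toxic reasoning schema (developed in earlier work) to produce structured characterizations of what each comment implies, which captures implicit toxicity and still supports mappings to existing toxicity taxonomies. The resulting dataset, ToxiREX, consists of Reddit comment threads in six languages (English, Arabic, Turkish, Spanish, German, and Dutch), collected from posts connected to specific major events, with context-preserving preprocessing of the threads. The training set of 125 thousand comments is annotated by a commercially available LLM; the test set of just under three thousand comments is annotated by native speakers. The authors show that apparent disagreements in the test annotations often reflect defensible alternative interpretations rather than noise. They also develop evaluation strategies for hierarchical, schema-based predictions and report baselines from prompting and fine-tuning language models: models beat random, but substantial room for improvement remains. ToxiREX is presented as the first dataset to combine multiple languages, conversational context, and implicit toxicity.
Link: https://arxiv.org/abs/2606.27981
Authors: Stefan F. Schouten,Ilia Markov,Piek Vossen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: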
Abstract:We introduce a new, contextual, multilingual dataset called ToxiREX: Toxic REasoning in ConteXt. The dataset consists of threads of Reddit comments and structured characterizations of what the comments imply, following a systematic toxic reasoning schema developed in a previous paper. Using the schema allows us to capture and explain implicit and context-dependent toxicity, while supporting mappings to existing toxicity taxonomies. The dataset includes comments in six languages (English, Arabic, Turkish, Spanish, German, and Dutch), collected from posts connected to specific major events (e.g. the 2023 Turkey earthquakes; the Russian invasion of Ukraine). We describe the context-preserving preprocessing of the threads. We create a training set of 125 thousand comments which is annotated by a commercially available LLM, and a test set of just under three thousand comments that is annotated by native speakers. We show that apparent disagreements in the test set annotations often reflect defensible alternative interpretations rather than noise. Finally, we provide baseline results by prompting and fine-tuning language models. To produce these results, we develop evaluation strategies for our hierarchical, schema-based predictions. While models perform better than random, there remains a lot of room for improvement, showing the task to be challenging. ToxiREX is the first dataset to simultaneously incorporate multiple languages, conversational context, and implicit toxicity, while using the toxic reasoning schema for rich, structured annotations. Dataset available at: this https URL
[NLP-12] From Black-Box to Clinical Insight: A Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection INTERSPEECH2026
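[Quick Read]: This paper addresses the lack of interpretability of Transformer-based models for speech-based cognitive impairment detection, whose black-box nature limits clinicians' trust and adoption. The proposed solution is a multi-stage explainability framework that turns model predictions into clinically grounded natural-language narratives by integrating SHapley Additive exPlanations (SHAP)-based token attribution, theory-informed linguistic features, and a four-stage LLM reasoning pipeline using LLaMA-3.1-70B-Instruct. The framework is built on the SpeechCARE-Adaptive Gating Network multimodal screening model (F1 = 72.11% on the NIA PREPARE benchmark) and maps model outputs to four cognitive-linguistic dimensions, including lexical richness, syntactic complexity, and semantic coherence. Physician evaluation on 70 stratified English samples showed strong alignment with patient-level cognitive profiles, and a System Usability Scale score of 82/100 indicates high potential for integration into clinical workflows.
Link: https://arxiv.org/abs/2606.27973
Authors: Yasaman Haghbin,Sina Rashidi,Ali Zolnour,Fatemeh Taherinezhad,Ali Fartoot,Hossein Azadmaleki,James M Noble,Maryam Dadkhah,Maryam Zolnoori
Affiliations: Columbia University; Chalmers University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Interspeech 2026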
Abstract:Speech-based cognitive impairment detection offers a noninvasive, accessible alternative to costly biomarker assays, yet transformer-based models remain clinically uninterpretable. We propose a multi-stage explainability framework that translates black-box transformer predictions into clinically grounded narratives by integrating SHapley Additive exPlanations (SHAP)-based token attribution, theory-informed linguistic features, and a four-stage LLM reasoning pipeline using LLaMA-3.1-70B-Instruct. Built on the SpeechCARE-Adaptive Gating Network multimodal screening model (F1 = 72.11% on the NIA PREPARE benchmark), the framework maps model outputs to four cognitive-linguistic dimensions, including lexical richness, syntactic complexity, and semantic coherence. Physician evaluation on 70 stratified English samples demonstrated strong alignment with patient-level cognitive profiles, and a System Usability Scale score of 82/100 indicated high potential for clinical workflow integration.
[NLP-13] An Empirical Analysis of Factual Errors in Human-Written Text and its Application
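[Quick Read]: This paper addresses the neglect of Factual Error Detection (FED) in human-written text. With the rise of large language models (LLMs), research attention has shifted to hallucinations in model-generated text, and the detection of factual errors in text written by people has been comparatively overlooked. The authors first distill a taxonomy of human-induced factual errors by analyzing corrections of newspaper articles. The analysis reveals characteristic categories, such as kanji misconversions and numeral classifier errors, that existing hallucination benchmarks do not focus on. Based on the taxonomy, they evaluate the FED capability of vanilla LLMs on synthesized realistic test cases and on real corrections. Even high-performance LLMs such as GPT-5.4 achieve a word-level F1 of only 52% on the synthetic evaluation data, which highlights the difficulty of the task; a further analysis by detection difficulty describes the current state of FED.
Link: https://arxiv.org/abs/2606.27959
Authors: Kazuma Iwamoto,Kazumasa Omura,Shotaro Ishihara
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: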
Abstract:Factual Error Detection (FED), which is the task of identifying factually incorrect spans in a given text, has long been recognized as an important research problem. However, with the rapid rise of large language models (LLMs), research attention has shifted toward factual errors specific to LLM-generated text (hallucinations) and their detection. As a result, the detection of factual errors in human-written text has been relatively neglected. To address this gap, we first distill a taxonomy of human-induced factual errors by analyzing corrections of newspaper articles, a representative source of text that is guaranteed to be human-written and contains few grammatical errors. Our analysis revealed that there are characteristic categories such as kanji misconversions and numeral classifier errors, which are not focused in existing hallucination benchmarks. Based on the taxonomy, we then evaluate the FED capability of vanilla LLMs on synthesized realistic test cases and real corrections. Experimental results demonstrated that even high-performance LLMs such as GPT-5.4 achieved only word-level F1 score of 52% on the synthetic evaluation data, highlighting the task difficulty. Furthermore, a detailed analysis by detection difficulty revealed the current state of FED.
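For readers unfamiliar with the metric, a word-level F1 for error detection can be computed roughly as below. The abstract does not give the exact definition, so this sketch assumes that gold and predicted errors are represented as sets of word indices; the function name and the example indices are hypothetical.

```python
def word_level_f1(gold_indices, predicted_indices):
    """Precision, recall, and F1 over word positions flagged as erroneous.

    Assumes each argument is an iterable of word indices; this is one
    plausible reading of 'word-level F1', not the paper's stated definition.
    """
    gold, pred = set(gold_indices), set(predicted_indices)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Gold error words at positions 3, 4, 9; the detector flags 4, 9, 12, 15.
precision, recall, f1 = word_level_f1({3, 4, 9}, {4, 9, 12, 15})
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the four flagged words are correct (precision 0.5) and two of the three gold words are found (recall 2/3), so the F1 is 4/7.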
[NLP-14] VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring ICML2026
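[Quick Read]: This paper addresses the fact that features learned by sparse autoencoders (SAEs) on Transformer residual streams are usually named post hoc and have no direct connection to the model's token vocabulary. The proposed Vocabulary-Aligned Sparse Autoencoder (VASAE) trains SAE features under vocabulary-aligned anchoring and assigns each feature an intrinsic token name: the token string whose embedding is nearest to that feature. VASAE does this without reducing reconstruction quality compared with a standard SAE. With a 0.8 cutoff on the nearest-token alignment score, dictionaries trained on GPT-2-small post-residual streams align about 90% of features in layers 0-10. In Llama-3.1-8B, representative shallow and middle-layer dictionaries contain strongly aligned features (92.8% in the shallow layer), while the representative final-layer dictionary shows limited alignment. After the sentence-level mean sparse code is subtracted, case studies show that many remaining intrinsic token names are relevant to nearby input tokens. The results suggest that vocabulary-aligned anchoring can connect learned features to token names during training, complementing post hoc interpretation.
Link: https://arxiv.org/abs/2606.27941
Authors: Kairui Zhang,Ziwen Yu,Zahraa S. Abdallah,Martha Lewis
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures. Accepted to the 2nd Workshop on Compositional Learning at ICML 2026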
Abstract:Sparse autoencoders (SAEs) provide useful decompositions of Transformer residual streams, but their learned features are usually named post hoc rather than directly connected to the Transformer’s token vocabulary. We introduce Vocabulary-Aligned Sparse Autoencoder (VASAE), a method that trains SAE features under vocabulary-aligned anchoring and assigns each feature an intrinsic token name: the token string whose embedding is nearest to that feature. Without reducing reconstruction quality compared with a standard SAE, VASAE produces dictionaries with vocabulary-aligned features. Using a 0.8 cutoff on the nearest-token alignment score, dictionaries trained on GPT-2-small post-residual streams align about 90% of features in layers 0–10. In Llama-3.1-8B, representative shallow and middle-layer dictionaries contain strongly aligned features, including 92.8% in the shallow layer, while the representative final-layer dictionary shows limited alignment. After subtracting the sentence-level mean sparse code, case studies show that many remaining intrinsic token names are relevant to nearby input tokens. These results suggest that vocabulary-aligned anchoring can connect learned features to intrinsic token names during training, complementing post hoc interpretation of learned dictionaries.
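The naming step (nearest token embedding plus a cutoff on the alignment score) can be sketched in NumPy. This example assumes that the alignment score is cosine similarity between an SAE decoder direction and a token embedding; the function name and the toy three-token vocabulary are hypothetical, and the anchoring loss used during training is not shown.

```python
import numpy as np

def name_features(decoder_dirs, token_embeddings, vocab, cutoff=0.8):
    """Give each feature the name of its nearest token embedding.

    Returns (token, score, aligned) per feature, where `score` is assumed
    to be cosine similarity and `aligned` means score >= cutoff.
    """
    d = decoder_dirs / np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    e = token_embeddings / np.linalg.norm(token_embeddings, axis=1, keepdims=True)
    sims = d @ e.T                                   # (n_features, vocab_size)
    best = sims.argmax(axis=1)                       # nearest token per feature
    scores = sims[np.arange(len(d)), best]
    return [(vocab[i], float(s), bool(s >= cutoff)) for i, s in zip(best, scores)]

vocab = ["cat", "dog", "the"]
token_embeddings = np.eye(3)                         # toy orthogonal embeddings
decoder_dirs = np.array([[0.9, 0.1, 0.0],            # close to "cat"
                         [1.0, 1.0, 1.0]])           # equidistant from all tokens
names = name_features(decoder_dirs, token_embeddings, vocab)
for token, score, aligned in names:
    print(token, round(score, 3), aligned)
```

The fraction of features with `aligned == True` corresponds to the kind of alignment rate the abstract reports per layer.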
[NLP-15] Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing
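[Quick Read]: This paper addresses two bottlenecks of neuro-symbolic frameworks for geometry problem solving. First, autoformalization treats multimodal translation as a static task decoupled from downstream solver compatibility. Second, in theorem prediction, solvers frequently hit a deductive impasse because the rule library is fixed. The proposed framework, SD-GPS, is solver-driven: it treats the symbolic solver as an execution oracle throughout both formalization and deduction. Solver-Driven Autoformalization unifies supervised formal-language adaptation and solvability-guided reinforcement learning in a single module built on QwenVL3-2B, with executability as the central training signal. Verified Theorem Proposing introduces an impasse-aware agent that proposes local auxiliary lemmas from the current proof state, and ensures soundness by filtering every proposal through symbolic verification. On Geometry3K and PGPS9K, SD-GPS consistently outperforms existing MLLM, neural, and neuro-symbolic methods across standard completion, multiple-choice, and cross-modal reference regimes, which indicates that closing the loop between multimodal perception and symbolic execution improves geometric reasoning.
Link: https://arxiv.org/abs/2606.27926
Authors: Can Li,Ting Zhang,Junbo Zhao,Hua Huang
Affiliations: Beijing Normal University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: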
Abstract:Geometry problem solving has increasingly adopted the neuro-symbolic paradigm, combining neural intuition with symbolic rigor. However, current frameworks suffer from severe bottlenecks in two core stages: autoformalization, which treats multimodal translation as a static task decoupled from downstream solver compatibility, and theorem prediction, where solvers frequently hit a deductive impasse due to fixed rule libraries. To address these, we propose SD-GPS, a solver-driven framework that treats the symbolic solver as an execution oracle throughout both formalization and deduction. First, Solver-Driven Autoformalization unifies supervised formal-language adaptation and solvability-guided reinforcement learning into a single module built on QwenVL3-2B, making executability the central training signal. Second, Verified Theorem Proposing introduces an impasse-aware agent that proposes local auxiliary lemmas from current proof states, ensuring soundness by filtering all proposals through symbolic verification. Empirical evaluations on Geometry3K and PGPS9K demonstrate that SD-GPS consistently outperforms existing MLLM, neural, and neuro-symbolic methods across standard completion, multiple-choice, and cross-modal reference regimes. This shows that closing the loop between multimodal perception and symbolic execution significantly improves geometric reasoning, and it offers insight into how neural agents can be grounded by formal systems to achieve verifiable problem-solving capabilities.
[NLP-16] A Study of Temporal Fusion Strategies for Named Entity Recognition in Historical Texts
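[Quick Read]: This paper addresses the challenge that temporal variation poses for named entity recognition (NER) in historical texts, where entities drift in surface form and salience over time and language models (LMs) have limited ability to reason about temporality in diachronic contexts. The authors systematically study how temporal metadata can be structurally embedded into Transformer-based NER models through lightweight fusion strategies. They test both absolute and relative temporal representations, injected through early or late fusion mechanisms such as cross-attention, adapters, and concatenation. Evaluations on French and German historical datasets show that late fusion strategies yield more robust and temporally generalisable performance, particularly in early and noisy periods.
Link: https://arxiv.org/abs/2606.27881
Authors: Emanuela Boros
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: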
Abstract:Temporal variation poses a unique challenge for named entity recognition (NER) in historical texts, where entities drift in surface form and salience across time. While language models (LMs) have made progress in various NLP tasks, their ability to reason about temporality, especially in diachronic contexts, remains limited or, at least, questionable. In this paper, we systematically study how temporal metadata can be structurally embedded into NER models using a range of lightweight fusion strategies. We experiment with both absolute and relative temporal representations, injected into Transformer-based architectures via early or late fusion mechanisms such as cross-attention, adapters, and concatenation. Our evaluations on French and German historical datasets reveal that late fusion strategies yield more robust and temporally generalisable performance, particularly in early and noisy periods.
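As an illustration of the simplest late fusion variant named in the abstract (concatenation), the following NumPy sketch appends a temporal feature to each token's final hidden state before classification. The year range, the reference year, the scaling, and the function names are all assumptions for illustration; the paper's actual representations and fusion layers are not specified in the abstract.

```python
import numpy as np

def absolute_time_feature(year, min_year=1800, max_year=2000):
    """Scale a publication year to [0, 1]; the year range is a placeholder."""
    return np.array([(year - min_year) / (max_year - min_year)])

def relative_time_feature(year, reference_year=1900, scale=100.0):
    """Signed distance to a reference year; both constants are placeholders."""
    return np.array([(year - reference_year) / scale])

def late_fusion_concat(token_states, time_feature):
    """Append the same temporal feature to every token's encoder output."""
    tiled = np.tile(time_feature, (token_states.shape[0], 1))
    return np.concatenate([token_states, tiled], axis=1)

# Five tokens with 8-dimensional encoder outputs (zeros as a stand-in).
token_states = np.zeros((5, 8))
fused = late_fusion_concat(token_states, absolute_time_feature(1900))
print(fused.shape)
```

An NER classification head would then take the 9-dimensional fused vectors as input; early fusion would instead inject the temporal feature at the embedding layer, before the encoder runs.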
[NLP-17] Learning Complementary Action Modeling from Automotive Maintenance Instructions
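[Quick read]: This paper addresses a pattern in automotive maintenance instructions where a minute lexical change reverses the procedural meaning of an instruction: the entities, modifiers, and surrounding context stay largely unchanged, and only the action phrase turns the instruction into its procedural counterpart. The core challenge is to identify or generate such complementary instruction pairs, which requires distinguishing complementarity from surface similarity, controlling generation at the action-phrase level, and evaluating relational correctness with retrieval, overlap-based metrics, and human evaluation. The paper defines this task as Complementary Action Modeling (CAM) and studies it on a German automotive maintenance dataset using candidate matching and controlled sequence-to-sequence (Seq2Seq) generation. The results indicate that complementary maintenance instructions are best modeled as procedural associations grounded in subtle lexical cues, not as ordinary sentence similarity or synonym-based paraphrasing.
Link: https://arxiv.org/abs/2606.27808
Authors: Jiaqi Wu,Bai Li,Jochen Hartmann,Martin Gaedke,Sander Stuijk
Affiliations: Eindhoven University of Technology; BMW Group; Chemnitz University of Technology
Categories: Computation and Language (cs.CL)
Comments: Preprint. 11 pages, 4 figures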
Abstract:A minute lexical variation can reverse the procedural meaning of an instruction even when the rest of the sentence remains unchanged. In automotive maintenance instructions, this pattern often appears when an action phrase turns an instruction into its procedural counterpart. The entities, modifiers, and surrounding context remain largely invariant, while the action phrase determines the procedural relation. We define this task as Complementary Action Modeling (CAM). Given a maintenance instruction, the goal is to identify or generate its procedural counterpart by modifying the action phrase while preserving the remaining sentence context. This task focuses on three aspects: distinguishing complementarity from surface similarity, controlling generation at the action-phrase level, and evaluating relational correctness using retrieval, overlap-based, and human evaluation. Using a German automotive maintenance dataset, we examine these questions through candidate matching and controlled Seq2Seq generation. The results show that complementary maintenance instructions are best modeled as procedural associations grounded in subtle lexical cues. They should therefore not be treated as ordinary cases of sentence similarity or synonym-based paraphrasing.
[NLP-18] Position Bias Correction is Insufficient for One-Pass Attention Sorting
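[Quick read]: This paper addresses position bias in long-context language models, where information in the middle of the input is underutilized. Attention Sorting mitigates this by iteratively reordering documents, but its repeated sort-and-generate cycles raise deployment cost. The paper proposes Debiased One-Pass Attention Sorting, which estimates a per-prompt position-bias curve from the low-attention majority of documents and corrects the raw attention scores by subtraction or division, so that reordering takes a single pass. The experiments, however, refute the underlying hypothesis in the tested setting: on LLaMA-2-7B-32K-Instruct, debiasing gives the same result as uncalibrated single-pass sorting (94.83% containment accuracy); on YaRN-Llama-2-7b-64k, debiasing improves accuracy by 8.67 percentage points but remains 14.84 points behind iterative sorting, closing only 37% of the gap. Position-bias correction alone is therefore insufficient to match iterative sorting, which suggests that repeated reordering brings benefits beyond bias correction.
Link: https://arxiv.org/abs/2606.27793
Authors: Qiong Tang,Xiangkun Hu,Xiangyang Liu,Yiran Chen,Yunfan Shao
Affiliations: Analemma; FARS (fully automated research system)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: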
Abstract:Long-context language models suffer from position bias, where information in middle positions is underutilized. Attention Sorting addresses this by iteratively reordering documents based on attention patterns, but its multiple sort-and-generate cycles increase deployment cost. We hypothesize that position bias is the primary bottleneck and propose Debiased One-Pass Attention Sorting, which estimates a per-prompt position-bias curve from the low-attention majority of documents and uses it to correct raw attention scores (via subtraction or division) to enable single-pass sorting. Our experiments on two models refute this hypothesis in the tested setting: on LLaMA-2-7B-32K-Instruct, debiasing produces identical results to uncalibrated single-pass sorting (94.83% containment accuracy), while on YaRN-Llama-2-7b-64k, debiasing improves accuracy by 8.67 percentage points but remains 14.84pp behind iterative sorting, closing only 37% of the gap. These results suggest that position-bias correction is insufficient to match iterative sorting, and that repeated reordering provides additional benefits beyond bias correction.
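The correction step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how the position-bias curve is fitted, so a running median over neighbouring positions stands in for it here (a median is dominated by the low-attention majority), and the function names, the `window` parameter, and the "most relevant document goes last" ordering are all hypothetical choices, not the authors' implementation.

```python
from statistics import median


def estimate_bias(raw, window=1):
    """Stand-in bias curve (assumption): running median of attention by position."""
    n = len(raw)
    return [median(raw[max(0, i - window): i + window + 1]) for i in range(n)]


def debias(raw, mode="subtract", window=1):
    """Correct raw per-document attention by subtracting or dividing out the bias."""
    bias = estimate_bias(raw, window)
    if mode == "subtract":
        return [s - b for s, b in zip(raw, bias)]
    if mode == "divide":
        return [s / max(b, 1e-12) for s, b in zip(raw, bias)]
    raise ValueError("mode must be 'subtract' or 'divide'")


def one_pass_order(raw, **kwargs):
    """Document indices sorted by corrected score, highest score last."""
    corrected = debias(raw, **kwargs)
    return sorted(range(len(raw)), key=lambda i: corrected[i])


# Toy U-shaped attention profile: the ends get attention from position alone,
# while the document at index 4 stands out only relative to its neighbours.
raw_attention = [0.20, 0.12, 0.06, 0.04, 0.13, 0.04, 0.06, 0.12, 0.20]
order = one_pass_order(raw_attention)
```

On this toy profile the raw scores would rank an end document highest, whereas the corrected scores rank the middle document at index 4 highest. The paper's negative result is a reminder that such a correction, however it is fitted, did not close the gap to iterative sorting in the tested setting.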
[NLP-19] NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation
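[Quick read]: This paper addresses a key question for hybrid attention models in long-context inference: which layers should retain full attention (FA) so that downstream accuracy is preserved while compute is minimized. Existing methods rely on fixed periodic patterns or attention-based heuristics, which may miss the layers that matter for task performance. The paper proposes NLL-guided layer selection, a training-free method that scores each layer by how much the negative log-likelihood (NLL) on answer tokens degrades when that layer is switched from full attention to sliding-window attention (SWA). On LongMemEval with Qwen3-4B, the method reaches 64.6% accuracy with only 1/4 of the layers using full attention, matching the 1/2-FA periodic baseline (65.0%) at half the computational budget. It outperforms the SWAA-reported periodic 1/4-FA baseline by 10.4 percentage points and a matched LightTransfer-style baseline by 26.4 percentage points. A de-confounding analysis indicates that the signal reflects long-range attention needs, not generic layer sensitivity. The method needs only about 15 minutes of one-time calibration.
Link: https://arxiv.org/abs/2606.27791
Authors: Qiong Tang,Xiangkun Hu,Xiangyang Liu,Yiran Chen,Yunfan Shao
Affiliations: Analemma; FARS (fully automated research system)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: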
Abstract:Hybrid attention models that mix full and sliding-window attention across layers offer a promising approach to efficient long-context inference, but the critical question of which layers should retain full attention remains unsolved. Existing methods use either fixed periodic patterns or attention-based heuristics that may not capture what matters for downstream accuracy. We propose NLL-guided layer selection, a training-free method that directly measures each layer’s importance by computing the negative log-likelihood degradation on answer tokens when that layer uses sliding-window instead of full attention. On LongMemEval with Qwen3-4B, our method achieves 64.6% accuracy using only 1/4 full-attention layers, matching the 1/2-FA periodic baseline (65.0%) while halving the computational budget. NLL-guided selection outperforms the SWAA-reported periodic 1/4-FA baseline by 10.4 percentage points and a matched LightTransfer-style baseline by 26.4 percentage points. De-confounding analysis shows the signal is consistent with long-range attention needs rather than generic layer sensitivity. The method requires only ~15 minutes of one-time calibration, advancing the efficiency-accuracy Pareto frontier for long-context LLM deployment.
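A minimal sketch of the selection step may help. It assumes the per-layer NLL values have already been measured on calibration data (the numbers below are invented), and it assumes that the layers with the largest degradation are the ones kept in full attention; the abstract describes the importance score but not the exact selection rule, so `select_full_attention_layers` and its arguments are hypothetical.

```python
def select_full_attention_layers(nll_full, nll_with_swa, budget):
    """Pick `budget` layers to keep in full attention.

    nll_full: answer-token NLL with every layer using full attention.
    nll_with_swa: dict mapping layer index -> answer-token NLL when only
        that layer is switched to sliding-window attention.
    """
    # Importance of a layer = how much NLL gets worse when it loses full attention.
    degradation = {layer: nll - nll_full for layer, nll in nll_with_swa.items()}
    ranked = sorted(degradation, key=degradation.get, reverse=True)
    return sorted(ranked[:budget]), degradation


# Invented calibration numbers for an 8-layer toy model; budget = 1/4 of layers.
baseline_nll = 1.00
per_layer_nll = {0: 1.02, 1: 1.01, 2: 1.40, 3: 1.03, 4: 1.05, 5: 1.90, 6: 1.02, 7: 1.04}
keep_full, scores = select_full_attention_layers(baseline_nll, per_layer_nll, budget=2)
```

In this toy case layers 2 and 5 are kept in full attention and the remaining six would use sliding-window attention. Scoring layers one at a time ignores interactions between layers, which is a simplification of this sketch.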
[NLP-20] SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation
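[Quick read]: This paper addresses conflicts between retrieved external context and the model's parametric knowledge in retrieval-augmented generation (RAG) systems. Such conflicts can pull generations away from accurate information. Existing approaches mostly identify and edit knowledge-related internal neurons, but these neuron-level edits can trigger unintended cascading effects, because the neurons are often entangled with broader model behavior, and so they can harm general capabilities. The paper proposes SHIFT, a framework that recasts neuron-level modification as learnable gate modulation, letting the LLM adaptively regulate its internal activations to balance contextual and parametric knowledge. The key component is a lightweight gate module trained with fewer than 0.01% trainable parameters while the backbone stays frozen. Experiments on six datasets show that SHIFT is effective compared with a range of baselines.
Link: https://arxiv.org/abs/2606.27786
Authors: Ruochang Li,Pengcheng Huang,Zhenghao Liu,Yukun Yan,Huiyuan Xie,Yu Gu,Ge Yu,Maosong Sun
Affiliations: Northeastern University (China); Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 13 Figures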
Abstract:Retrieval-augmented generation (RAG) enhances LLMs by incorporating external knowledge to support response generation. However, conflicts between retrieved context and parametric knowledge have emerged as a critical challenge in RAG systems. To mitigate such conflicts, numerous studies have attempted to identify and edit knowledge-related internal neurons, aiming to improve the ability of LLMs to rely on contextual evidence during generation. However, these neuron-level approaches may introduce unintended cascading effects that compromise the general capabilities of LLMs, as the modified neurons are often entangled with broader model behaviors and functionalities. In this paper, we introduce SHIFT, a novel framework that reformulates neuron-level modification as learnable gate modulation, allowing LLMs to adaptively regulate internal activations for knowledge conflict resolution. Technically, our SHIFT equips LLMs with a lightweight gate module and optimizes fewer than 0.01% trainable parameters while keeping the backbone model frozen. During generation, the gate module adjusts the model’s internal representations to adaptively leverage contextual and parametric knowledge. Extensive experiments on six datasets validate the effectiveness of our SHIFT in comparison with various competing baselines. All datasets and code are available at this https URL.
[NLP-21] Output-Space Allocation Costs for Calibration-Guided LLM Compression: An Empirical Study
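[Quick read]: This paper studies a bottleneck in training-free LLM compression: the cost function that drives compression decisions is not aligned with the output-level objective. ROCKET derives its per-layer factorization from an output-reconstruction objective, yet the allocation cost in its multi-choice knapsack problem (MCKP) is the weight-space Frobenius error. The paper replaces this cost with an output-space error (ROCKET-ActCost) so that allocation is aligned with the output objective. On Qwen3-8B at 50% compression, the variant raises average accuracy across 8 zero-shot benchmarks by 0.8 percentage points (53.1% vs 52.3%) but increases WikiText perplexity by 16%, an accuracy-perplexity tradeoff. The high correlation (0.99) between weight-space and output-space errors limits how far the two allocations diverge, which explains the modest effect. On Llama-3.2-1B at 20% compression the two cost functions give near-identical results, suggesting that the choice of cost function matters less at lower compression ratios.
Link: https://arxiv.org/abs/2606.27785
Authors: Qiong Tang,Xiangkun Hu,Xiangyang Liu,Yiran Chen,Yunfan Shao
Affiliations: Analemma; FARS (fully automated research system)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: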
Abstract:Training-free compression methods for large language models (LLMs) often use calibration data to guide compression decisions. ROCKET, a recent method combining sparse-dictionary factorization with multi-choice knapsack problem (MCKP) allocation, derives its per-layer factorization from an output reconstruction objective but uses weight-space Frobenius error as the MCKP allocation cost. We investigate whether aligning the allocation cost with the output-space objective improves compressed model fidelity. On Qwen3-8B at 50% compression, our ROCKET-ActCost achieves +0.8 percentage points higher average accuracy across 8 zero-shot benchmarks (53.1% vs 52.3%), but increases WikiText perplexity by 16% (61.46 vs 52.98). This accuracy-perplexity tradeoff reveals that different allocation objectives favor different downstream metrics. The high correlation (0.99) between weight-space and output-space errors limits allocation divergence, explaining the modest effect size. On Llama-3.2-1B at 20% compression, the two methods produce near-identical results (53.3% vs 53.5% accuracy, 14.45 vs 14.66 PPL), suggesting that the effect of the cost function is minor at lower compression ratios.
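The difference between the two allocation costs can be made concrete with a toy example. The sketch below is not ROCKET or its MCKP solver; it only contrasts a weight-space squared Frobenius error with an output-space error measured on calibration activations, using the convention y = xW. The matrices and helper names are made up for illustration, and the example is deliberately built so that the two costs disagree, which the abstract suggests is uncommon in practice given the reported correlation.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sub(a, b):
    """Element-wise difference of two matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def frob_sq(m):
    """Squared Frobenius norm."""
    return sum(v * v for row in m for v in row)


def weight_cost(w, w_hat):
    """Weight-space cost: ||W - W_hat||_F^2."""
    return frob_sq(sub(w, w_hat))


def output_cost(x, w, w_hat):
    """Output-space cost on calibration inputs: ||X W - X W_hat||_F^2."""
    return frob_sq(matmul(x, sub(w, w_hat)))


W = [[1.0, 0.0], [0.0, 1.0]]
candidate_a = [[0.7, 0.0], [0.0, 1.0]]   # small error on a heavily used input dimension
candidate_b = [[1.0, 0.0], [0.0, 0.6]]   # larger error on a rarely used input dimension
X_calib = [[2.0, 0.1], [3.0, 0.2]]       # calibration activations: dimension 0 dominates
```

Here `weight_cost` prefers `candidate_a` while `output_cost` prefers `candidate_b`, because the calibration inputs barely excite the dimension that `candidate_b` distorts. An allocator that minimizes one cost or the other would therefore spend its budget differently on such layers.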
[NLP-22] KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems
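[Quick read]: This paper addresses the high cost of building natural-language interfaces to private enterprise knowledge graphs (KGs) for internal search, analytics, and question answering. It proposes KG2Cypher, a data-centric pipeline that first constructs an executable Cypher query from facts observed in the graph and then uses LLMs to generate the matching natural-language question. The resulting text-Cypher pairs are validated by an LLM judge and by humans, then converted into candidate-aware supervised fine-tuning (SFT) data. The trained generator is served with class-conditioned schema prompting, entity retrieval, and LoRA-based inference. In Korean enterprise settings, where short search-style queries and schema paraphrases make language grounding difficult, LoRA SFT improves execution-result F1 from 0.806 to 0.950 on broadcast-program queries and from 0.70 to 0.92 on company queries. In an 11-class setting, KG2Cypher reaches 95.2% exact match, a 99.9% execution rate, and 0.964 execution-result F1.
Link: https://arxiv.org/abs/2606.27742
Authors: Minjun Choi,Yerin Kim,Junghyuk Seo,Sujin Mo,Hyemin Lee,Youngjoong Ko
Affiliations: Sungkyunkwan University; NAVER
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 10 tables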
Abstract:Enterprise Knowledge Graphs (KGs) are increasingly used for internal search, analytics, and question answering, but building natural-language interfaces for private enterprise graphs remains costly. We present KG2Cypher, a data-centric pipeline for building enterprise text-to-Cypher systems from existing KGs. KG2Cypher first constructs an executable Cypher query from observed graph facts and then uses LLMs to generate its associated natural-language question. The resulting Text-Cypher pairs are validated with an LLM judge and human validation, and are converted into candidate-aware SFT data. The trained generator is served with class-conditioned schema prompting, entity retrieval, and LoRA-based inference. We evaluate KG2Cypher in Korean enterprise settings, where short search-style queries and schema paraphrases make language grounding difficult. LoRA SFT improves execution-result F1 from 0.806 to 0.950 on broadcast-program queries and from 0.70 to 0.92 on company queries. In an 11-class setting, KG2Cypher achieves 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1.
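The first pipeline step, building an executable query from an observed fact, might look roughly like the sketch below. The query template, the `name` property, and the example schema (`Program`, `AIRED_ON`, `Channel`) are hypothetical; the abstract does not describe the actual templates or the enterprise schema. Labels and relationship types are interpolated into the query text because Cypher does not accept them as parameters, while the entity value is passed as a query parameter.

```python
def fact_to_cypher(head_label, head_name, relation, tail_label):
    """Build a parameterised Cypher query that asks for the tail of a known fact."""
    query = (
        f"MATCH (h:{head_label} {{name: $name}})-[:{relation}]->(t:{tail_label}) "
        "RETURN t.name AS answer"
    )
    return query, {"name": head_name}


# Hypothetical observed fact: (Program "Morning News") -[AIRED_ON]-> (Channel ...)
query, params = fact_to_cypher("Program", "Morning News", "AIRED_ON", "Channel")
```

In the pipeline described above, a query like this would then be handed to an LLM to write the matching natural-language question, and the pair would go through validation before being used as SFT data.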
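[NLP-23] Enhancing Numerical Prediction in LLMs via Smooth MMD Alignment
[Quick read]: This paper addresses the unreliability of large language models (LLMs) on tasks that need numerically precise outputs. The core problem is that standard cross-entropy treats numeric tokens as unstructured categories and ignores the metric structure of their values. The paper proposes a loss based on Smooth Maximum Mean Discrepancy (SMMD), which defines a value-distance kernel over a numeric sub-vocabulary and adds graph-based smoothness: it aligns the predicted numeric distribution with the target through kernel matching and smooths the prediction-target residual over the induced kernel graph to encourage local consistency. Experiments on four numeric-target tasks (mathematical reasoning, arithmetic calculation, clock-time recognition, and chart question answering) across several open-weight LLM and VLM backbones show consistent accuracy gains over cross-entropy and recent numeric-target losses. Further analysis shows that the kernel-matching and smoothness terms are complementary and that distance-based kernel design matters.
Link: https://arxiv.org/abs/2606.27731
Authors: Zhuo Zuo,Li Yue,Wenhao Zheng,Chenpeng Wang,Xianggen Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: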
Abstract:Despite their strong general capabilities, large language models (LLMs) often remain unreliable when outputs must be numerically precise. A key reason is the training objective: standard cross-entropy treats numeric tokens as unstructured categories and ignores the metric structure of their values. We address this mismatch with Smooth Maximum Mean Discrepancy (SMMD), which builds on the classic MMD by incorporating value-distance kernels over numeric tokens and graph-based smoothness. With this kernel defined over a numeric sub-vocabulary, SMMD aligns the predicted numeric distribution to the target via kernel matching and smooths the prediction-target residual over the induced kernel graph to encourage local consistency. We evaluate SMMD on four numeric-target tasks: mathematical reasoning, arithmetic calculation, clock-time recognition, and chart question answering, across multiple open-weight LLM and VLM backbones. SMMD consistently improves accuracy over both cross-entropy and recent numeric-target losses; analyses show complementary effects between MMD and smoothness and underscore the importance of distance-based kernel design. Code is available at this https URL.
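To see why a value-distance kernel helps, consider the kernel-matching part on its own. The sketch below computes a squared MMD between a predicted and a target distribution over the digit tokens 0 to 9 with a Gaussian kernel on the numeric values. The Gaussian form, the bandwidth `sigma`, and the function names are assumptions, and the graph-smoothness term of SMMD is left out because the abstract does not specify its form.

```python
import math


def value_kernel(values, sigma=2.0):
    """Gaussian kernel on numeric token values (kernel choice is an assumption)."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values] for a in values]


def mmd_sq(p, q, kernel):
    """Squared MMD between discrete distributions: (p - q)^T K (p - q)."""
    r = [pi - qi for pi, qi in zip(p, q)]
    n = len(r)
    return sum(r[i] * kernel[i][j] * r[j] for i in range(n) for j in range(n))


def cross_entropy(p, target_index):
    """Standard cross-entropy against a one-hot target."""
    return -math.log(p[target_index])


digits = list(range(10))
K = value_kernel(digits)
target = [1.0 if d == 5 else 0.0 for d in digits]            # true answer: 5
near_miss = [0.5 if d in (4, 5) else 0.0 for d in digits]    # half the mass on 4
far_miss = [0.5 if d in (5, 9) else 0.0 for d in digits]     # half the mass on 9
```

Cross-entropy gives both predictions the same loss, since each puts probability 0.5 on the correct token. The kernel-based term penalizes `far_miss` more than `near_miss`, which is the metric structure that cross-entropy ignores.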
[NLP-24] Do Speech Emphasis Models Generalize across Languages and Emotions? INTERSPEECH2026
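[Quick read]: This paper addresses the limited generalization of existing emphasis detection models across languages, emotions, and speaking styles, given that current models are mostly trained and evaluated on monolingual, neutral read speech. It introduces MMEE (Multilingual Multi-Emotion Emphasis), a corpus of 10,000 professionally recorded expressive utterances (14.13 hours) across 7 languages and 34 emotion/style categories, with three-level perceptual labels (10 annotations per sample). Two state-of-the-art architectures are benchmarked under monolingual, cross-lingual, multilingual, cross-emotion, cross-dataset, and data-scale settings. Monolingual models show limited zero-shot transfer and degrade across typologically distant languages, while multilingual training substantially improves robustness. Models transfer robustly between high- and low-arousal emotions, bidirectional transfer between synthetic and perceptual benchmarks suggests shared prosodic structure, and performance stays robust at smaller training scales.
Link: https://arxiv.org/abs/2606.27717
Authors: Megan Wei,Deepali Aneja,Jiaqi Su,Yunyun Wang,Haonan Chen,Zeyu Jin
Affiliations: Adobe Research; Brown University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Interspeech 2026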
Abstract:Prosodic emphasis varies across languages, emotions, and speaking styles, yet existing emphasis detection models are largely trained and evaluated on monolingual neutral read speech. We introduce MMEE (Multilingual Multi-Emotion Emphasis), a corpus of 10,000 professionally recorded expressive utterances (14.13 hours) across 7 languages and 34 emotion/style categories, with three-level perceptual labels (10 annotations per sample). We benchmark two state-of-the-art architectures under monolingual, cross-lingual, multilingual, cross-emotion, cross-dataset, and data-scale settings. Monolingual models show limited zero-shot transfer, degrading across typologically distant languages, while multilingual training substantially improves robustness. Models transfer robustly between high- and low-arousal emotions; bidirectional transfer between synthetic and perceptual benchmarks suggests shared prosodic structure; and performance stays robust even at smaller training scales.
[NLP-25] Low-Agreeableness Persona Conditioning for Safe LLM Fine-Tuning
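[Quick read]: This paper addresses the loss of adversarial safety when large language models (LLMs) are fine-tuned for social warmth: such models become more susceptible to jailbreaks and more likely to produce harmful output. The paper asks whether this is an inherent consequence of empathetic adaptation or an artifact of data construction. Its solution is a persona-driven rewriting pipeline that conditions user turns on low agreeableness and pairs them with warm, de-escalating assistant responses. Across three experiments on four models, the approach reduces jailbreak susceptibility and harmful-output rates relative to generic warmth fine-tuning while preserving conversational warmth, using data design alone, without safety labels, harm detectors, or changes to the training objective. Representational probing gives suggestive evidence that the conditioning reduces the geometric alignment between warmth and compliance directions in latent space.
Link: https://arxiv.org/abs/2606.27709
Authors: Austin MY Cheung,Yi Yang
Affiliations: Hong Kong University of Science and Technology; OpenAI; Meta; Stability.AI; Anthropic; Character.ai; Claude
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 tables, 5 figures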
Abstract:Recent work has shown that fine-tuning large language models (LLMs) for social warmth degrades factual reliability and increases sycophancy. We investigate a related but distinct failure mode: warmth fine-tuning also weakens adversarial safety, making models more susceptible to jailbreaks and harmful output generation. We examine whether this reflects an inherent consequence of empathetic adaptation or an artifact of data construction. To address this, we introduce a persona-driven rewriting pipeline that conditions user turns on low agreeableness and pairs this with warm, de-escalating assistant responses. Across three experiments on four models, our approach reduces jailbreak susceptibility and harmful output rates relative to generic warmth fine-tuning baselines, while preserving conversational warmth. Representational probing provides suggestive evidence that this conditioning reduces the geometric alignment between warmth and compliance directions in latent space. These results show that safer empathetic fine-tuning is achievable through data design alone, without safety labels, harm detectors, or changes to the training objective.
[NLP-26] Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling
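[Quick read]: This paper addresses the "lost-in-the-middle" problem of large language models (LLMs) on long-context inputs, where key information in the middle of the sequence is underrepresented or lost. Existing methods that combine multi-scale rotary position embeddings (RoPE) typically suffer from high latency or rely on suboptimal hand-crafted scaling strategies. The paper proposes layer-specific positional embedding scaling (LPES), which assigns a distinct scaling factor to each layer and achieves a more balanced attention distribution without fine-tuning model parameters or increasing inference delay. To search the per-layer factors efficiently, it uses a genetic algorithm that incorporates Bézier curves to shrink the search space. Experiments show that LPES mitigates positional attention bias and gives consistent improvements on several long-context benchmarks, with up to an 11.2% accuracy gain on the key-value retrieval dataset.
Link: https://arxiv.org/abs/2606.27705
Authors: Changze Lv,Zhenghua Wang,Yiran Ding,Yixin Wu,Tianlong Li,Zhibo Xu,Muling Wu,Tianyuan Shi,Shizheng Li,Qi Qian,Xuanjing Huang,Xiaoqing Zheng
Affiliations: Fudan University; Westlake University; Shanghai Key Laboratory of Intelligent Information Processing
Categories: Computation and Language (cs.CL)
Comments: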
Abstract:Large Language Models (LLMs) still struggle with the "lost-in-the-middle" problem, where critical information located in the middle of long-context inputs is often underrepresented or lost. While existing methods attempt to address this by combining multi-scale rotary position embeddings (RoPE), they typically suffer from high latency or rely on suboptimal hand-crafted scaling strategies. To overcome these limitations, we introduce a layer-specific positional embedding scaling (LPES) method that assigns distinct scaling factors to each layer. LPES achieves a more balanced attention distribution without fine-tuning model parameters or increasing inference delay. A specially designed genetic algorithm is employed to efficiently select the optimal scaling factors for each layer by incorporating Bézier curves to significantly reduce the search space. Extensive experiments demonstrate that LPES effectively mitigates positional attention bias and delivers consistent improvements across multiple long-context benchmarks, yielding up to an 11.2% accuracy gain on the key-value retrieval dataset.
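One way a Bézier curve can shrink the search space is to let a handful of control values define the scaling factor of every layer, so the search runs over those control values and not over one free number per layer. The sketch below shows that parametrization only; the cubic degree, the mapping from layer index to curve parameter, and the example control values are assumptions, and the genetic algorithm that would tune the control values is not shown.

```python
from math import comb


def bezier_scales(control, num_layers):
    """Per-layer scaling factors from a Bezier curve in Bernstein form.

    control: control values; their count minus one is the curve degree.
    Layer l is mapped to the curve parameter t = l / (num_layers - 1).
    """
    n = len(control) - 1
    scales = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        value = sum(
            comb(n, k) * (1 - t) ** (n - k) * t ** k * c
            for k, c in enumerate(control)
        )
        scales.append(value)
    return scales


# Four control values describe all 32 layers: first and last layers stay at 1.0,
# middle layers are scaled up smoothly (illustrative numbers only).
layer_scales = bezier_scales([1.0, 1.5, 1.5, 1.0], num_layers=32)
```

With this parametrization a 32-layer model needs four searched values instead of 32, and every factor stays within the range spanned by the control values, which keeps candidate solutions smooth across depth.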
[NLP-27] Joint Transcription and Decryption of Images of Encrypted Handwritten Documents: A Comparison with the Traditional Pipeline
[Quick read]: This paper addresses a weakness in the automatic decipherment of historical encrypted manuscripts: the traditional two-stage pipeline (transcribe cipher symbols from images, then decrypt) is sensitive to transcription errors, which propagate to the final output. It proposes Direct Image Decryption, an end-to-end approach that maps encrypted manuscript images directly to plaintext and bypasses the intermediate transcription stage. Using the Copiale cipher as a case study, the authors build a synthetic data generation pipeline to produce large-scale cipher-like training data and compare the traditional pipeline with the joint architecture. The results show that joint image-to-plaintext modeling is a promising alternative to transcription-based pipelines.
Link: https://arxiv.org/abs/2606.27700
Authors: Marino Oliveros-Blanco,Lei Kang,Alicia Fornés,Beáta Megyesi
Affiliations: Computer Vision Center; Department of Computer Science, Universitat Autònoma de Barcelona; Stockholm University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published at HistoCrypt 2026 (9th International Conference on Historical Cryptology). NEALT Proceedings Series Number 61. Tartu University Library. 10 pages
Abstract:Historical encrypted manuscripts present a challenging problem at the intersection of cryptology, linguistics, paleography, and computer vision. Current automatic decipherment approaches usually rely on a two-stage pipeline: transcription of cipher symbols from manuscript images, followed by decryption into plaintext. However, this design is sensitive to transcription errors, which propagate to the final output. We present Direct Image Decryption, an end-to-end approach that directly maps encrypted manuscript images to plaintext, bypassing the intermediate transcription stage. Using the Copiale cipher as a case study, we build a synthetic data generation pipeline to create large-scale cipher-like training data and compare the traditional pipeline with the proposed joint architecture. Results show that joint image-to-plaintext modeling is a promising alternative to traditional transcription-based pipelines.
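[NLP-28] Mitigating LLM-based p-Hacking by Preregistering for the Next LLM
[Quick read]: This paper addresses p-hacking in LLM-based research, where a researcher tunes prompts, decoding parameters, or output formats until a desired result appears, which makes conclusions unreliable. The proposed protocol is based on preregistration: the researcher finalizes the procedure on current models, preregisters the analysis plan together with a set of eligible future models, and runs the confirmatory analysis on the first eligible model released afterward. Because that model does not exist at commitment time, it cannot be hacked against, and configurations that hack one model frequently do not transfer to the next. On two tasks with known true values, across 20 models from four providers and 11 LLM-analysis configurations, the protocol would have blocked successful transfer of the p-hack in 73.9% and 72.7% of cases. Stress tests show that mitigation remains substantial. The authors also followed their own protocol: of the 7 configurations that hacked the prior model, 6 failed to carry over to the first eligible model released afterward.
Link: https://arxiv.org/abs/2606.27687
Authors: Maria Thomas,Kristina Gligoric,Nihar B. Shah
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: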
Abstract:Large language models (LLMs) are increasingly used to generate, classify, and annotate data whose outputs feed downstream hypothesis tests. However, LLM-based research is easy to p-hack: a researcher can tune the prompts, decoding parameters, or output format until a desired result is reached. We propose a protocol to mitigate p-hacking in LLM-based research: preregistering the experiment and eligible models, and then running it on the first eligible LLM that is released after the preregistration. The researcher finalizes the procedure on current models, preregisters the analysis plan together with a set of eligible future models, and runs the confirmatory analysis on the first eligible model released afterward. Because this model does not exist at commitment time, it cannot be hacked against; furthermore, configurations that hack one model frequently do not transfer to the next. We evaluate the protocol on two tasks whose true values are known. Across 20 models from four providers and 11 LLM-analysis configurations, the protocol would have blocked successful transfer of the p-hack in 73.9% and 72.7% of cases in the two tasks. Additional analyses reveal that mitigation remains substantial under several stress tests. Finally, putting money where our mouth is, we followed our own protocol and preregistered our experiment. The preregistered experiment confirmed the protocol’s effectiveness: out of the 7 configurations that hacked the prior model, the hacking failed to carry over in 6 configurations on the first eligible model released afterward.
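The model-selection rule at the heart of the protocol is simple enough to state as code. The sketch below picks the earliest-released model that is in the preregistered eligible set and was released strictly after the preregistration date. Expressing eligibility as a set of model families, and all model names and dates, are hypothetical; the paper may define eligibility differently.

```python
from datetime import date


def confirmatory_model(releases, eligible_families, preregistered_on):
    """Return the first eligible model released after preregistration, or None.

    releases: iterable of (model_name, family, release_date) tuples.
    eligible_families: families named in the preregistration (assumed criterion).
    """
    candidates = [
        (released, name)
        for name, family, released in releases
        if family in eligible_families and released > preregistered_on
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical release log; preregistration filed on 2026-02-01.
release_log = [
    ("alpha-2", "alpha", date(2026, 1, 10)),   # released before preregistration
    ("gamma-1", "gamma", date(2026, 2, 15)),   # not an eligible family
    ("beta-5", "beta", date(2026, 3, 1)),
    ("alpha-3", "alpha", date(2026, 4, 2)),
]
chosen = confirmatory_model(release_log, {"alpha", "beta"}, date(2026, 2, 1))
```

In this example the confirmatory analysis would run on `beta-5`. The point of fixing this rule in advance is that the researcher cannot pick, after the fact, whichever later model happens to reproduce the desired result.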
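[NLP-29] Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation
[Quick read]: This paper addresses the unreliable quality of latent representations in world models for partially observed environments. Many LLM-based architectures exhibit history bypass: predictions draw on the full interaction history instead of the latent state, so predictive performance no longer reflects representation quality and the latent state becomes unidentifiable. The paper adopts strict latent state mediation, which requires predictions to depend only on the latent state and the action, making representation quality empirically testable. It introduces textual latent states, which are discrete, interpretable, and variable-length, and factorized GRPO (fGRPO), a tree-structured reinforcement learning method that enforces strict mediation during training. Experiments on TextWorld and ScienceWorld show preserved one-step prediction accuracy together with up to 57% gains in representation quality and 98% improvements in rollout performance, with gains that grow with task complexity and horizon.
Link: https://arxiv.org/abs/2606.27681
Authors: Xiang Gao,Kaiwen Dong,Yuguang Yao,Padmaja Jonnalagedda,Kamalika Das
Affiliations: Intuit AI Research
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: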
Abstract:World models in partially observed environments rely on latent representations that summarize interaction history, but in many modern LLM-based architectures predictive performance fails to reflect representation quality due to history bypass, rendering the latent state unidentifiable. Strict latent state mediation, requiring predictions to depend only on the latent state and action, is a classical principle that resolves this, but enforcing it in text-based settings is an open challenge: textual latent states are discrete and non-differentiable, precluding variational training, and expressive LLM decoders readily ignore the bottleneck. We show how to make strict mediation work in the text domain. We formalize why it is necessary, showing that strict mediation makes representation quality empirically testable while history-leaky architectures break this connection. We then introduce textual latent states, which are discrete, interpretable, and variable-length, and factorized GRPO (fGRPO), a tree-structured reinforcement learning method that enforces strict mediation during training. Experiments on TextWorld and ScienceWorld show preserved one-step prediction accuracy alongside up to 57% gains in representation quality and 98% improvements in rollout performance, increasing with task complexity and horizon.
[NLP-30] From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models
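[Quick read]: This paper addresses the poor comparability of probe-based uncertainty estimation (UE) methods for detecting hallucinations in large language models (LLMs). Recent methods differ at once in feature design, training-data construction, and evaluation setting, which obscures what actually drives performance. The paper runs a factorised study under matched conditions that separates these factors. Raw hidden states and attention features are hard to beat in-domain, but structured and compressed features are more robust under distribution shift, so in-domain performance alone is not enough to measure progress. Prompting and label construction also significantly affect probe behaviour. Building on these findings, the authors train benchmark-based pretrained probes that transfer reasonably well to open-ended factual generation and provide a stable off-the-shelf baseline.
Link: https://arxiv.org/abs/2606.27679
Authors: Ponhvoan Srey,Xiaobao Wu,Cong-Duy Nguyen,Quang Minh Nguyen,Duc Anh Vu,Anh Tuan Luu
Affiliations: Nanyang Technological University; Shanghai Jiao Tong University; VinUniversity; KAIST
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: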
Abstract:Probe-based uncertainty estimation (UE) has emerged as a prominent approach to detect hallucinations in Large Language Models (LLMs) by learning uncertainty from internal model signals. Yet, recent methods vary simultaneously across feature design, training data construction, and evaluation setting, obscuring what actually drives performance. To address this issue, we propose a factorised study of probe-based UE under matched conditions. Our results show that raw hidden states and attention features are difficult to outperform in-domain. However, under distribution shift, structured and compressed features are more robust, suggesting that in-domain performance alone is insufficient to measure progress. Furthermore, prompting and label construction significantly affect probe behaviour. Building on these best-practice findings, we train benchmark-based pretrained probes that transfer reasonably well to open-ended factual generation, providing a stable off-the-shelf baseline. Our work encourages more deployment-oriented evaluation of probe-based uncertainty estimators. The code repository is available at this https URL.
[NLP-31] When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search
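[Quick read]: This paper addresses the inability of current LLM-based search agents to proactively detect ambiguity and ask for clarification when user queries in deep-search scenarios are vague, underspecified, or factually incorrect. Existing benchmarks assume complete and explicit queries, although ambiguity can propagate along multi-step reasoning chains and steer the agent onto wrong search trajectories. The paper introduces DiscoBench, a benchmark for clarification-aware deep search that evaluates whether agents can identify ambiguity, ask effective clarification questions, and recover correct reasoning paths through user interaction. It contains 211 samples and 463 ambiguity instances across 11 real-world domains, covering four ambiguity types, and comes with a user simulator for multi-turn interaction. Models are evaluated on task utility, ambiguity detection, interaction strategy, and cost efficiency. Experiments show that ambiguity detection and effective clarification are distinct capabilities, and that repeatedly searching instead of asking often performs worse than direct guessing.
Link: https://arxiv.org/abs/2606.27669
Authors: Yiling Tao,Shihan Deng,Meiling Tao,Pengzhi Wei,Zhichao Hu,Zhihao Zhu
Affiliations: Hunyuan, Tencent; Shenzhen International Graduate School, Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 7 figures, 12 tables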
Abstract:Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume that user queries are complete and explicit, overlooking the fact that real-world search requests are frequently vague, underspecified, or even factually incorrect. In deep search scenarios, such ambiguity can propagate along multi-step reasoning chains and lead agents toward incorrect search trajectories. To address this gap, we introduce DiscoBench, a benchmark for clarification-aware deep search, designed to evaluate whether search agents can proactively identify ambiguity, ask effective clarification questions, and recover correct reasoning paths through user interaction. DiscoBench contains 211 samples and 463 ambiguity instances across 11 real-world domains, covering four ambiguity types. We further design a user simulator for multi-turn interaction and evaluate model performance from four perspectives: task utility, ambiguity detection, interaction strategy, and cost efficiency. Experiments on representative LLMs show that ambiguity detection and effective clarification are distinct capabilities, and that repeatedly searching instead of asking for clarification often performs worse than direct guessing, highlighting a critical gap between retrieval ability and interactive problem-solving in current search agents.
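[NLP-32] Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety
[Quick read]: This paper addresses safety failures of deployed large language models (LLMs) that lead to harmful outputs and misuse. It argues that safety is adversarial in nature: many failures come from strategic attempts to evade model policies and safeguards, not from natural inputs alone. General-purpose model development largely overlooks this and falls short in realistic scenarios involving planning, tool use, and multi-step reasoning, so measured safety performance overestimates deployment robustness. The paper presents Yuvion LLM, which treats adversarial robustness and agentic capability as first-class objectives. Its pipeline combines adversarially aware data construction, knowledge-enhanced continued pretraining, and policy-grounded multi-task safety post-training (risk-aware supervised fine-tuning and reinforcement-learning-based policy optimization), together with safety-aware agentic reinforcement learning for tool use and multi-step reasoning. The paper also introduces Yuvion LLM RiskEval (YLRE), a collection of 93 benchmarks across four evaluation categories. Yuvion LLM shows clear advantages on safety-focused benchmarks and strong robustness under adversarial conditions, and Yuvion-8B outperforms most state-of-the-art baselines, including substantially larger models such as GPT-5.4 and Qwen3-MAX, on several safety tasks.
Link: https://arxiv.org/abs/2606.27632
Authors: Ting Ma,Xiufeng Huang,Benlei Cui,Xiaowen Xu,Shikai Qiu,Ruijie Jian,Hongxing Li,Guanghui Wang,Longtao Huang,Haiwen Hong,Haolei Xu,Wenjing Jiang,Ziwen Xu,Zhaoyu Fan,Shaoxuan He,Chuxi Xiao,Yujian Li,Xinyue Chen,Chunyang Chai,Wenxuan Liu,Ziheng Wang,Dongjie Zhang,Yangfan Zhou,Libin Dong,Yupeng Cao,Xiaoqian Xia,Jing Wang,Zhe Jiang,Zhenan Ye,Guang Yang,Bin Liu,Wei Peng,Ziqiang Zhu,Meihui Lian,Kaiwen Lv Kacuila,Haidong Ding,Bingyu Zhu,Yan Wang,Hai Zhao,Xuan Jin,Wei Zhao,Pengfei Sun,Wei Wang,Huiming Zhang,Bin Li,Hui Xue
Affiliations: Alibaba Security AGI Lab
Categories: Computation and Language (cs.CL)
Comments: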
Abstract:As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from natural inputs alone, but from strategic attempts to evade model policies and safeguards. However, existing general-purpose model development largely overlooks this adversarial nature, and often remains insufficient for realistic safety scenarios involving planning, tool use, and multi-step reasoning, causing measured safety performance to overestimate real deployment robustness. To address this gap, we present Yuvion LLM, a large language model built for adversarially robust content safety and broader AI safety. Yuvion LLM treats adversarial robustness and agentic capability as first-class objectives. Its pipeline combines adversarially aware data construction, knowledge-enhanced continued pretraining, and policy-grounded multi-task safety post-training, including risk-aware supervised fine-tuning and reinforcement learning-based policy optimization, together with safety-aware agentic reinforcement learning for tool use and multi-step reasoning in complex safety scenarios. We further introduce the Yuvion LLM RiskEval (YLRE), a collection of 93 benchmarks across four evaluation categories, covering diverse open and internal evaluations with a focus on safety, adversarial robustness, and real-world capability requirements. Across these evaluations, Yuvion LLM demonstrates clear advantages on safety-focused benchmarks and particularly strong robustness under adversarial conditions, while maintaining solid overall capability. Notably, Yuvion-8B outperforms most state-of-the-art baselines, including substantially larger models such as GPT-5.4 and Qwen3-MAX, on several safety tasks.
[NLP-33] Cross-Platform Chinese Offensive Comment Detection via Dual-Threshold Hard Example Mining
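[Quick Read]: This paper addresses the performance degradation that offensive comment detection models for Chinese social media suffer when deployed across platforms. The core solution is a dual-threshold hard example mining method: high- and low-confidence error-prone samples are filtered from unlabeled corpora by prediction confidence, and the model is fine-tuned a second time under implicit contexts using only a small set of manually labeled hard examples, which gives low-cost cross-platform domain adaptation. The method alleviates the degradation bottleneck caused by the shift between source and target domains, and experiments show significant gains for the optimized model on Weibo, Xiaohongshu, Tieba, and Zhihu.
Link: https://arxiv.org/abs/2606.27629
Authors: Ruixing Ren,Junhui Zhao,Fangfang Wang
Affiliation: Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 10 pages, 7 figures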
Abstract:Cross-platform deployment of offensive comment detection for Chinese social media suffers performance degradation. The paper proposes a dual-threshold hard mining method to address this. First, the clean-Chinese-base RoBERTa is finetuned on COLD to establish a binary baseline for fair comparison. Second, a three-class fine-labeled test set covering Weibo, Xiaohongshu, Tieba, and Zhihu is constructed, domain distances from the source are quantified using Jaccard and Proxy-A Distance, as well as the degradation bottleneck of the baseline under domain shift is systematically revealed. Herein, a dual threshold hard example mining strategy is proposed. High- and low-confidence error-prone samples are filtered from unlabeled corpora by prediction confidence. The model is secondarily finetuned under implicit contexts with merely a small set of manually labeled hard examples, realizing low-cost cross-platform domain adaptation. Experiments reveal significant performance gains of the optimized model across four platforms.
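The dual-threshold selection step described in this abstract can be pictured with a minimal sketch. The abstract does not give the threshold values or the exact selection rule, so the function name `mine_hard_candidates`, the defaults `t_low` and `t_high`, and the choice of top-class probability as the confidence score are all assumptions made for illustration.

```python
from typing import List, Tuple


def mine_hard_candidates(
    confidences: List[float],
    t_low: float = 0.6,
    t_high: float = 0.95,
) -> Tuple[List[int], List[int]]:
    """Split unlabeled samples into low- and high-confidence candidate pools.

    confidences[i] is assumed to be the classifier's top-class probability
    for sample i. Both thresholds are illustrative, not taken from the paper.
    """
    if not 0.0 <= t_low < t_high <= 1.0:
        raise ValueError("expected 0 <= t_low < t_high <= 1")
    # Low-confidence pool: the model is unsure, so errors are likely.
    low_pool = [i for i, c in enumerate(confidences) if c <= t_low]
    # High-confidence pool: candidates for confident mistakes under domain shift.
    high_pool = [i for i, c in enumerate(confidences) if c >= t_high]
    return low_pool, high_pool


if __name__ == "__main__":
    scores = [0.51, 0.99, 0.80, 0.58, 0.97]
    low, high = mine_hard_candidates(scores)
    print("send to annotators:", low, high)
```

Under these assumptions, the indices in both pools would go to human annotators, and only the labeled hard examples would be used for the second fine-tuning pass.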
[NLP-34] Masked Language Flow Models
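[Quick Read]: This paper addresses the trade-off between efficiency and quality for generative language models in the few-step sampling regime. It targets two limitations: the approximation error of Masked Diffusion Models (MDMs), whose reverse process factorises across token positions, and the inflexibility of Flow Language Models (FLMs) on multi-step reasoning tasks, where they must decode every token during generation. The key is a new model class, Masked Language Flow Models (MLFMs), which uses a continuous stochastic interpolant to bridge partially masked and clean sequences, so that conditional generation runs through continuous flows while discrete unmasking remains available. This design lets pretrained MDMs be converted into MLFMs through a lightweight adaptation and enables a sampler that alternates continuous denoising with discrete unmasking of confident tokens, which better supports multi-step reasoning. Experiments on GSM8K and MT-Bench show, for the first time, that flow-based language models can be scaled to downstream reasoning and instruction-following tasks.
Link: https://arxiv.org/abs/2606.27617
Authors: Iskander Azangulov,Kianoosh Ashouritaklimi,Leo Zhang,Simon Vary,Patrick Rebeschini
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint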
Abstract:Masked Diffusion Models (MDMs) promise fast, parallel language generation, but their reverse transition factorises across token positions – an approximation that breaks down in the few-step sampling regime where parallel generation ought to provide the greatest efficiency gains. Flow Language Models (FLMs) sidestep this limitation by learning a continuous flow that transports noise toward clean sequences represented in Euclidean space, inducing a flow map that can be distilled for single-step generation. However, this makes complex tasks requiring multi-step reasoning problematic for FLMs, as FLMs are forced to decode every token during generation. To address this, we introduce Masked Language Flow Models (MLFMs), which incorporate masking into FLMs using a continuous stochastic interpolant to bridge partially masked and clean sequences. This design enables conditional generation via continuous flows and allows pretrained MDMs to be converted into MLFMs through a simple, lightweight adaptation. Leveraging this flexibility, we propose a novel sampler that alternates continuous denoising with the discrete unmasking of confident tokens to better support multi-step reasoning. We evaluate our approach on GSM8K and MT-Bench and find, for the first time, that flow-based language models can be scaled to solve downstream reasoning and instruction-following tasks.
[NLP-35] Narrative-UFET: Narrative Generation for Ultra-Fine Entity Typing
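[Quick Read]: This paper addresses the performance bottleneck of ultra-fine entity typing (UFET) on long-tail types. The core challenge is that existing methods rely on sentence-level context, whereas the evidence needed for disambiguation is often spread across multiple sentences. To test this hypothesis, the authors build Narrative-UFET, a narrative-based extension of UFET that pairs each entity mention with an automatically generated short, coherent narrative, which allows the effect of discourse structure to be isolated in a controlled way. The key is to use synthetic narratives to examine specific discourse properties through two paired variants: Maintain, in which the entity's type is held constant across the narrative, and Change, in which it shifts. Experiments show that narrative context consistently improves long-tail typing over sentence-level baselines, with the Change variant providing the stronger signal. Synthetic narratives yield larger gains than naturally occurring contexts, indicating that controlled discourse construction can surface signals that real text leaves implicit. Substantial room for improvement remains in both discourse modeling and narrative construction.
Link: https://arxiv.org/abs/2606.27598
Authors: Mreedul Gupta,Advait Deshmukh,Ashwin Umadi,Matt Pauk,Maria Leonor Pacheco
Affiliation: University of Colorado Boulder
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: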
Abstract:Ultra-fine entity typing (UFET) assigns highly specific types to entity mentions, but current approaches struggle with types in the long tail. We hypothesize that a key limitation is the reliance on sentence-level context, since disambiguating evidence is often spread across multiple sentences. Testing this has been difficult because all existing UFET resources are sentence-level. We present Narrative-UFET, a controlled extension of UFET in which each entity mention is paired with an automatically generated short, coherent narrative. Synthesizing narratives lets us isolate the effect of specific discourse properties. We experiment with two paired variants: one in which the entity’s type is held constant across the narrative (Maintain) and one in which it shifts (Change). We show that narrative context yields consistent improvements on long-tail types over sentence-level baselines, with the Change variant providing the stronger signal. A comparison against naturally occurring contexts shows that synthetic narratives yield stronger gains, indicating that controlled discourse construction can surface signals that real text leaves implicit. Substantial room for improvement remains, suggesting open directions in both discourse modeling and narrative construction.
[NLP-36] Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents
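[Quick Read]: This paper addresses the severe under-evaluation of breadth in current web-agent benchmarks. Most benchmarks focus on depth, checking whether a single answer is correct, and neglect the ability to enumerate every member of a closed set and fill in each member's attributes completely and correctly, a gap that is even wider outside English. The core difficulty is that certifying a complete and fully correct gold set costs far more than verifying a single answer. The author therefore proposes Ko-WideSearch, a Korean breadth-search benchmark built with an automated synthesize-and-verify pipeline. Difficulty rises across three tiers through two independently adjusted structural parameters, table width and a 2-D composite key, so that cross-product membership climbs from 0% to 100%. The benchmark covers 190 entities, 228 tables, and 16 categories. Each task asks for the full membership of a set-parent entity (such as a TV season, a dynasty, or a league) together with a per-item attribute table, graded by Item-, Column-, and Row-F1. For consistency, gold construction and grading share a single normalization-aware comparator, so stable date and count columns are not dropped because of formatting alone. Across twenty web agents, the failure is consistent: agents recover the set but not the rows (Item-F1 92.8 vs. Row-F1 53.7), accuracy falls steadily as structural complexity increases, and neither more search nor more spend closes the gap. Cell-level analysis shows that the bottleneck is finding the right value rather than formatting it: open-ended free-text cells fail most often, while cells with a standard answer, such as dates or names, usually come out right.
Link: https://arxiv.org/abs/2606.27595
Authors: Minbyul Jeong
Affiliation: Upstage AI
Categories: Computation and Language (cs.CL)
Comments: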
Abstract:Web-agent benchmarks overwhelmingly measure depth – pinning one obscure answer behind a chain of constraints – while breadth, exhaustively enumerating a closed set and filling each item’s attributes, is barely evaluated, especially outside English. Breadth is also hard to build: certifying that a gold set is complete and every cell correct is far costlier than checking a single answer. I introduce Ko-WideSearch, a Korean breadth-search benchmark built by an automated synthesize-and-verify pipeline. Each task names a set-parent entity – a TV season, a dynasty, a league, an administrative region, an election – and asks for its full membership plus a per-item attribute table, graded by Item-, Column-, and Row-F1. It spans 228 tables over 190 entities and sixteen categories across three difficulty tiers, set by two structural knobs I dial independently – table width and a 2-D composite key – so cross-product membership climbs from 0% to 100% across the tiers. A single normalization-aware comparator is shared between gold construction and grading, so stable date and count columns are not over-dropped on formatting alone. Across twenty web agents, the failure is consistent: agents recover the set but not the rows (e.g., Item-F1 92.8 against Row-F1 53.7), accuracy falls steadily as the knobs harden, and neither more search nor more spend closes the gap. Broken down by cell, the hard part is finding the right value, not formatting it: open-ended free-text cells fail most, while cells with a standard answer such as a date or a name usually come out right.
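The gap between Item-F1 and Row-F1 reported above is easier to see with a small sketch. The abstract does not spell out the metric definitions, so this assumes that Item-F1 compares only the set of item keys and that Row-F1 credits a row only when the key and every attribute cell match. Plain equality stands in for the benchmark's normalization-aware comparator, and the names `Table`, `item_f1`, and `row_f1` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical table shape: item key -> tuple of attribute cells.
Table = Dict[str, Tuple[str, ...]]


def f1(true_positives: int, n_pred: int, n_gold: int) -> float:
    """Standard F1 from a true-positive count and the two set sizes."""
    if true_positives == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision = true_positives / n_pred
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


def item_f1(pred: Table, gold: Table) -> float:
    """Assumed Item-F1: did the agent recover the right set members?"""
    return f1(len(pred.keys() & gold.keys()), len(pred), len(gold))


def row_f1(pred: Table, gold: Table) -> float:
    """Assumed Row-F1: a row counts only if all of its cells match exactly."""
    hits = sum(1 for key, cells in pred.items() if gold.get(key) == cells)
    return f1(hits, len(pred), len(gold))


if __name__ == "__main__":
    gold = {"A": ("2001", "x"), "B": ("2002", "y")}
    pred = {"A": ("2001", "x"), "B": ("2002", "z")}  # one wrong cell in row B
    print(item_f1(pred, gold), row_f1(pred, gold))
```

In this toy case the agent finds both members, so the assumed Item-F1 is perfect, while a single wrong cell halves the assumed Row-F1, which mirrors the "set recovered, rows not" pattern the abstract describes.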
[NLP-37] EntMTP: Accelerating LLM Inference with Entropy Guided Multi Token Prediction
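[Quick Read]: This paper addresses the wasted computation in existing multi-token prediction (MTP) methods during text generation. Current models use a static tree-based attention topology, so speculation depth stays constant across the whole generated sequence and cannot adapt to how predictable the context is: high-entropy regions incur redundant computation, while low-entropy regions leave parallel drafting potential unused. The key is a training-free scheduler, Entropy-guided Multi-Token Prediction (EntMTP), which uses a running estimate of local generation entropy to switch among a set of task-specific Pareto-optimal tree attention topologies, matching speculation depth to context predictability. EntMTP thereby raises expected accepted-token throughput over the full distribution of generated text without sacrificing generation quality. On HumanEval, ShareGPT, GSM8k, and Litbench, EntMTP achieves a consistent 1.15x speedup over the Hydra baseline and a peak speedup of 1.36x over the Medusa baseline.
Link: https://arxiv.org/abs/2606.27550
Authors: Carrie Chen
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures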
Abstract:Multi-token prediction has been shown to increase data density during training, improve downstream text-generation quality, and serves as the de facto approach for self-speculative decoding. Existing foundation and open source models that use MTP heads commit to a static tree-based attention topology throughout the entire generation sequence, meaning the speculation depth, and thus the compute required during verification, stays constant regardless of the context. This is fundamentally misaligned with the entropy patterns of natural language where low-entropy regions often support reliable multi-step drafting, while high-entropy regions require more conservative speculation. To address this, we propose Entropy-guided Multi-Token Prediction (EntMTP), a training-free scheduler that toggles between tree-based attention topologies from a set of task-specific Pareto-optimal trees conditioned on a running estimate of local generation entropy. By matching speculation depth to context predictability, EntMTP maximizes expected accepted-token throughput across the full distribution of generated text without sacrificing generation quality. When evaluated across HumanEval, ShareGPT, GSM8k, and Litbench benchmarks, EntMTP consistently achieves a 1.15x speedup against Hydra and a peak speedup of 1.36x against Medusa baselines.
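A rough sketch of what an entropy-conditioned tree scheduler could look like follows. The abstract only states that the scheduler is conditioned on a running estimate of local generation entropy; the exponential moving average, the threshold values, and the tree labels used here are assumptions for illustration, and `EntropyScheduler` is a hypothetical name, not the paper's implementation.

```python
import math
from typing import List, Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropyScheduler:
    """Pick a draft-tree label from a running entropy estimate (illustrative)."""

    def __init__(self, trees: List[str], thresholds: List[float], decay: float = 0.9):
        # trees are ordered deepest first; thresholds are ascending entropy cut-offs.
        if len(trees) != len(thresholds) + 1:
            raise ValueError("need exactly one more tree than thresholds")
        self.trees = trees
        self.thresholds = thresholds
        self.decay = decay
        self.running: Optional[float] = None

    def update(self, probs: Sequence[float]) -> str:
        """Fold one step's entropy into the running estimate and choose a tree."""
        h = token_entropy(probs)
        if self.running is None:
            self.running = h
        else:
            # Exponential moving average, assumed here as the "running estimate".
            self.running = self.decay * self.running + (1.0 - self.decay) * h
        for tree, cut in zip(self.trees, self.thresholds):
            if self.running < cut:
                return tree
        return self.trees[-1]


if __name__ == "__main__":
    sched = EntropyScheduler(["deep", "medium", "shallow"], [0.5, 1.5])
    print(sched.update([1.0]))        # fully predictable step
    print(sched.update([0.125] * 8))  # one uncertain step moves the average a little
```

The smoothing factor controls how quickly the scheduler reacts: with a high `decay`, a single uncertain token does not immediately force the shallowest tree.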
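[NLP-38] The Context-Ready Transformer NEURIPS
[Quick Read]: This paper addresses the lack of an efficient context-memory mechanism in standard Transformers for sequence generation, in particular the difficulty of exploiting past context in long-sequence modeling and autoregressive generation. The core issue is that a standard Transformer feeds each position a raw embedding and cannot fold the accumulated context of earlier positions into the current input. The key is the context-ready transformer: before each token enters the transformer block, a correction network combines the previous position's block output (a cached summary of past context) with the current token embedding, so the token enters the block already contextualized. During left-to-right generation this correction chain makes the architecture recurrent, which supports efficient sequential inference; for training, the correction process is unrolled K times over the full sequence, with all positions processed in parallel at each step. Experiments show that the architecture matches or outperforms standard Transformers while generating faster (up to a 2.6x inference speedup) and benefits most from wide representations and long contexts. On a pointer-chasing task, a D=1 model trained with BPTT solves all 10 composition levels, whereas standard transformers show staircase-like depth dependence.
Link: https://arxiv.org/abs/2606.27538
Authors: Mahesh Godavarti
Affiliation: A Carrot, Inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NeurIPS, 22 pages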
Abstract:We introduce the context-ready transformer, a new recurrent neural network architecture built from a D-layer transformer block that pre-contextualizes each token before it enters the block. During left-to-right generation, a correction network combines the previous position’s block output – a cached summary of past context – with the current token embedding, so the token enters the block already contextualized rather than as a raw embedding. At sequential inference, the correction chain makes the architecture a recurrent neural network. For training, we unroll the correction process K times over the full sequence, processing all positions in parallel at each step. A pretrained transformer can also be converted to a context-ready model by adding a zero-initialized correction FFN and fine-tuning. We evaluate across widths, depths, block sizes, and two datasets, with all comparisons against standard transformers, variants, and ablations. A D=5 model beats a 12-layer transformer while generating 1.7x faster on an A100. With K=10, a single-layer model (D=1) beats a 6-layer transformer with a 2.6x inference speedup, and sequential inference matches parallel K=10 to within 0.01 PPL. The architecture benefits most from wide representations and long contexts. On a pointer-chasing task, D=1 trained with BPTT solves all 10 composition levels, while standard transformers exhibit staircase-like depth dependence.
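[NLP-39] The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching
[Quick Read]: This paper addresses a core challenge in mechanistic interpretability: how to quantify accurately the causal responsibility of a model's internal components for a specific behavior. The dominant method, activation patching, does this by estimating the natural indirect effect (NIE). Re-deriving the estimand from causal mediation analysis, the paper finds that the NIE captures not only the causal effect through the target component but also interaction effects (INT), which measure how much that component's causal effect depends on the state of other components. Adjusting the estimator or the unit of analysis to eliminate INT seems natural, but each remedy has predictable failure modes. On the GPT-2 IOI circuit, components whose causal importance is conditional on the state of other components are either invisible or artificially inflated, and INT variance explains the previously documented instability of faithfulness scores. The paper proves that INT scales with the distance between clean and patched activations, is negligible when the model is locally affine, and decomposes combinatorially into pairwise and higher-order group interactions. Although unavoidable, INT is not noise to be removed but a diagnostic for interpretability studies: its individual and group-level magnitude and sign indicate when causal conclusions are prompt-dependent and when greedy NIE-based ranking will miss mechanisms that only combinatorial search can find.
Link: https://arxiv.org/abs/2606.27510
Authors: Sankaran Vaidyanathan,David Arbour,Aaron Mueller,Scott Niekum,David Jensen
Affiliation: University of Massachusetts Amherst; Adobe Research; Boston University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: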
Abstract:Activation patching is the primary tool in mechanistic interpretability. It attributes causal responsibility for a model behavior to each of its individual components by estimating its natural indirect effect (NIE). Re-deriving the activation patching estimand from causal mediation analysis, we find that the NIE does not solely capture the causal effect through the specific component. It also contains interaction effects (INT) that measure how much the component’s causal effect itself depends on the state of other components in the model. A natural response may be to try to eliminate INT by adjusting the estimator or unit of analysis, but each of these potential remedies has predictable failure modes. We demonstrate these failure modes in the GPT-2 IOI circuit; components whose causal importance is conditional on the state of other components are either invisible or artificially inflated, and INT variance explains the previously documented instability of faithfulness scores. We prove that INT scales with the distance between clean and patched component activations, is negligible when the model is locally affine, and decomposes combinatorially into pairwise and higher-order group interactions. Despite its inevitability, INT is not a nuisance to be eliminated, but rather a diagnostic for interpretability studies. Its individual and group-level magnitude and sign signal when causal conclusions are prompt-dependent, and when greedy NIE-based component ranking will miss mechanisms only discoverable through combinatorial search.
[NLP-40] Aloe-Vision: Robust Vision-Language Models for Healthcare
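[Quick Read]: This paper addresses three problems in the development of healthcare large vision-language models (LVLMs): the scarcity of high-quality medical multimodal data, insufficient robustness in safety-critical settings, and narrow, potentially contaminated benchmarks that make evaluation unreliable. The key is Aloe-Vision-Data, a large-scale, quality-filtered mixture that integrates medical and general-domain data from multimodal and text-only sources and can be used directly for fine-tuning. On this dataset the team trains and openly releases the Aloe-Vision family of medical LVLMs at two scales (7B and 72B), with full weights, training recipes, and data, so that the systems can be inspected, evaluated, and improved. Experiments show that high-quality training mixtures yield significant gains over the baseline models without compromising general capabilities and reach performance competitive with state-of-the-art alternatives. To support reliable evaluation, the work introduces CareQA-Vision, a curated vision benchmark derived from the MIR and EIR exams, the Spanish residency entrance exams for medical and nursing specialists, with novel vision questions that have a low likelihood of contamination. The study also shows that current LVLMs remain vulnerable to adversarial and misleading inputs, which underscores reliability challenges in clinical settings.
Link: https://arxiv.org/abs/2606.27500
Authors: Jaume Guasch-Martí,Enrique Lopez-Cuena,Martín Suárez-Fernández,Jordi Bayarri-Planas,Anna Arias-Duart,Dario Garcia-Gasulla
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: MIDL 2026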
Abstract:Large Vision-Language Models (LVLMs) specialized in healthcare are emerging as a promising research direction due to their potential impact in clinical and biomedical applications. However, progress is constrained by the scarcity of high-quality medical multimodal data, concerns about robustness in safety-critical settings, and the narrow and potentially contaminated evaluation benchmarks that limit reliable assessment. To address these issues, the field requires state-of-the-art solutions to be fully open and reproducible systems in which all components can be inspected, evaluated, and improved. This work introduces Aloe-Vision-Data, a large-scale, quality-filtered mixture which integrates both medical and general domains across multimodal and text-only sources, designed for direct use in model fine-tuning. Building on this dataset, we train the Aloe-Vision family of medical LVLMs, openly released with full weights, training recipes and data, in two scales (7B and 72B). Through comprehensive benchmarking, we demonstrate that high quality training mixtures produce balanced LVLMs which yield significant gains over the baseline models without compromising general capabilities, achieving competitive performance with respect to state-of-the-art alternatives. To support reliable evaluation, we introduce CareQA-Vision, a carefully curated vision benchmark derived from MIR and EIR exams, the residency entrance exams for medical and nursing specialists in Spain, offering novel vision questions with low likelihood of contamination. Finally, we show that current LVLMs remain vulnerable to adversarial and misleading inputs, underscoring reliability challenges in clinical contexts.
[NLP-41] DMV-Bench: Diagnosing Long-Horizon Multimodal Agents Visual Memory with Incidental Cue Injection
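[Quick Read]: This paper addresses the fact that agent-memory research focuses almost entirely on text and rarely tests, in an interactive environment, when an agent genuinely needs visual memory. Most existing benchmarks ask only whether an agent can write a description, not whether it must remember what it saw in order to act. To fill this gap, the authors introduce DMV-Bench (Code: this https URL), the first interactive benchmark for multimodal-agent visual memory. It is built on a controlled home-furnishing e-commerce catalogue of 1,000 product variants, where a text-leakage contract keeps each task's discriminative signal in the pixels alone, which rules out textual shortcuts. Across a chain of autonomous shopping sessions, every visited product image carries a unique pre-rendered incidental cue, and the agent is later asked to recall a particular cued product and navigate to its URL. Inspired by dual-coding theory, the paper proposes DualMem, a memory architecture that maintains a visual code and a verbal code in parallel; vision carries the cue end-to-end, while the verbal channel plays a smaller query-grounding role. On both Gemini 2.5 Flash and Qwen2.5-VL-7B, DualMem outperforms a caption baseline and three recent multimodal agent-memory systems at every chain length (5, 10, 15, and 50), and the lead survives controls for memory-bank size and encoding-position bias.
Link: https://arxiv.org/abs/2606.27499
Authors: Yujin Tang,Chenming Shang,Ruize Xu,Nikhil Singh
Affiliation: Dartmouth College
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages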
Abstract:Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write down. We introduce DMV-Bench (Code: this https URL), the first interactive benchmark for multimodal-agent visual memory. DMV-Bench is built on a controlled home-furnishing e-commerce catalogue of 1,000 product variants in which a text-leakage contract keeps the discriminative signal of each task in the pixels alone. Across a chain of autonomous shopping sessions, every visited product image carries a unique, pre-rendered incidental cue, and the agent is later asked to recall a particular cued product and navigate to its URL. Inspired by dual-coding theory, we propose DualMem, a memory architecture that maintains a visual and a verbal code in parallel. On DMV-Bench, DualMem outperforms a caption baseline and three recent multimodal agent-memory systems at every chain length J ∈ {5, 10, 15, 50} on both Gemini 2.5 Flash and Qwen2.5-VL-7B, with the lead surviving controls for memory-bank size and encoding-position bias, and an asymmetric dual-coding regime in which vision carries the cue end-to-end while the verbal channel plays a smaller query-grounding role.
[NLP-42] Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents
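[Quick Read]: This paper addresses the weakness of large language model (LLM) agents at updating and maintaining facts that change over long, multi-session interactions. When user information (a location, a price, a plan) is updated, the agent must use the current value and discard values that have been superseded, and existing models do this poorly. On the knowledge-update subset of LongMemEval, which uses real conversational data, replacing an agent's full context with a bounded, self-maintained memory drops accuracy from 92% to 77% even on a frontier model (gpt-5.4). The gap is statistically significant (paired McNemar test, p < 0.005) and persists across model scale, so the bottleneck is memory maintenance rather than comprehension. The failure is also not caused by an undersized memory: as the conversation grows 24x, accuracy falls from 68% to 28%, and granting proportionally more memory yields no detectable recovery, so the failure scales with conversation length rather than compression ratio. The authors release Supersede, an open reinforcement-learning environment (on the verifiers / prime-rl stack) that turns this measurement into a training signal: agents are rewarded for answering from the current value and penalized for stale ones. GRPO fine-tuning of a small open model (Qwen2.5-3B) in this environment nearly doubles held-out supersession accuracy on real, unseen conversations (9.0% to 16.7%), along a monotonic checkpoint curve, showing that the ability can be trained and not only measured. According to the authors, this is the first trainable environment whose reward targets temporal fact-currency and the first evidence that the supersession gap can be trained down.
Link: https://arxiv.org/abs/2606.27472
Authors: Vedant Patel
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, 3 tables. Code, environment, model, and dataset: this https URL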
Abstract:Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding values that have been superseded. We isolate this ability on real conversational data and show that it is a distinct, unsolved failure. On the knowledge-update subset of LongMemEval, replacing an agent’s full context with a bounded, self-maintained memory drops accuracy from 92% to 77% even on a frontier model (gpt-5.4), a gap that is statistically significant (paired McNemar p < 0.005) and persists across model scale while full-context accuracy saturates near 92%. The bottleneck is therefore memory maintenance, not comprehension, and is not closed by a stronger model. We then ask whether this is merely an undersized memory, and find it is not: as the conversation grows 24x, accuracy falls further (from 68% to 28%), and granting the agent proportionally more memory yields no detectable recovery (28% to 28%, n=25). The failure scales with the length of the conversation, not the compression ratio. We release Supersede, an open reinforcement-learning environment (on the verifiers / prime-rl stack) that turns this measurement into a training signal: agents are rewarded for answering from the current value and penalized for stale ones. Finally, we close the loop and show the gap is trainable: GRPO fine-tuning a small open model (Qwen2.5-3B) on this environment nearly doubles its held-out supersession accuracy on real, unseen conversations (9.0% to 16.7%, a single run), along a monotonic checkpoint curve indicating the learned policy, not the harness, carries the gain. To our knowledge this is the first trainable environment whose reward targets temporal fact-currency, and the first evidence the supersession gap can be trained down, not only measured.
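For readers unfamiliar with the paired McNemar test cited in this abstract, here is a minimal sketch of the exact (binomial) variant. The abstract does not say which variant the author used or how many discordant pairs there were, so the choice of the exact test and the counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: questions answered correctly only with full context.
    c: questions answered correctly only with bounded memory.
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    print(mcnemar_exact(9, 1))
```

The test only looks at questions where the two conditions disagree, which is why it suits a paired comparison of the same questions under full context and bounded memory.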
[NLP-43] Developmental approach reveals the statistical learning of Neural Language Models: Transformers generalize from the most abstract statistical patterns
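[Quick Read]: This paper asks how generative language models gradually form representations of linguistic statistical regularities during learning. Using a developmental approach, the authors train a series of generative Transformer models on a synthetic grammar, save model states at multiple stages of training, and analyze how internal representations change along the developmental path. They find that neural language models (NLMs) acquire the most abstract, global statistical knowledge at the very beginning of learning and only later acquire relatively local statistical dependencies. The learning path contains many over-generalizations from the start, which are gradually constrained in later stages. Learning therefore does not accumulate from local to global but proceeds from abstract to specific. On this basis, the paper proposes a new framework to explain the statistical learning and language cognition of NLMs.
Link: https://arxiv.org/abs/2606.27460
Authors: Wang Bojun,Holly Jenkins,Elizabeth Wonnacott
Affiliation: University of Oxford
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 7 figures, oral presentation at Interdisciplinary Advances in Statistical Learning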
Abstract:In this study, we use a developmental approach to investigate the statistical learning and mental representation of neural language models (NLM). A series of Generative Transformer models are trained on a synthetic grammar. The model states are saved at multiple stages in the course of training. Through analyzing how the internal representations of these models change in the developmental path, we found that NLMs acquire the most abstract global statistical knowledge at the beginning of learning and later acquire the relatively local statistical dependencies. This learning path contains many over-generalizations from the very beginning and these over-generalizations are gradually constrained in the later stage of learning. Based on this observation, we propose a new framework to explain the statistical learning and language cognition of NLMs.
[NLP-44] Cluster Route Escalate: Cascaded Framework for Cost-Aware LLM Serving
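[Quick Read]: This paper addresses the trade-off between accuracy and cost when deploying large language models (LLMs) in production. Operators often default to a single model that is either too expensive for easy queries or too weak for hard ones. The paper proposes a two-stage cascade. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model; the cost budget for routing is set by an interpretable hyperparameter tuned offline. Stage 2 adds a quality estimation (QE) cascade: when a Stage 1 output is judged low-quality, the query is escalated to a stronger model, so only hard or low-confidence cases reach the expensive models. On the test datasets, the system retains 97-99% of the strongest model's accuracy while reducing Time Per Output Token (TPOT). It needs only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.
Link: https://arxiv.org/abs/2606.27457
Authors: Yasmin Moslem,Magdalena Kacmajor,Vasudevan Nedumpozhimana,Ammar Abbas,Solmaz Panahi,David Lynch,Zhuangzhuang Nie,Alexandros Agapitos,Aleksandar Milenovic,Hongmeng Song,Yucheng Shi,Yue Pan,Patricia Buffini,John D. Kelleher
Affiliation: ADAPT Centre, Trinity College Dublin; Huawei Research
Categories: Performance (cs.PF); Computation and Language (cs.CL)
Comments: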
Abstract:Efficient deployment of large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model that is either expensive for easy queries or insufficient for hard ones. To address this challenge, we propose a two-stage cascaded solution. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model. The cost budget for this routing process is set by an interpretable hyperparameter, tuned offline. Stage 2 adds a quality estimation (QE) cascade; when an output from Stage 1 is judged low-quality, the query is escalated to a stronger model. This ensures only hard or low-confidence cases reach the expensive models. On the test datasets, the cascaded system retains 97-99% of the strongest model’s accuracy while reducing Time Per Output Token (TPOT). It requires only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.
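The two stages described in this abstract can be sketched as a small routing function. Everything concrete here is an assumption made for illustration: nearest-centroid assignment as the clustering step, a fixed `route` table, a scalar quality-estimation score with a single threshold, and the names `serve` and `nearest_cluster`. The paper's actual clustering, budget hyperparameter, and QE model are not specified in the abstract.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Model = Callable[[str], str]  # hypothetical model interface: query -> answer


def nearest_cluster(vec: Sequence[float], centroids: List[Sequence[float]]) -> int:
    """Index of the closest centroid under squared Euclidean distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def serve(
    query: str,
    embed: Callable[[str], Sequence[float]],
    centroids: List[Sequence[float]],
    route: Dict[int, str],
    models: Dict[str, Model],
    strongest: str,
    qe: Callable[[str, str], float],
    qe_threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return (model name used, answer) after routing and optional escalation."""
    # Stage 1: cluster the query and use that cluster's assigned model.
    name = route[nearest_cluster(embed(query), centroids)]
    answer = models[name](query)
    # Stage 2: escalate when the quality estimate is below the threshold.
    if name != strongest and qe(query, answer) < qe_threshold:
        name = strongest
        answer = models[strongest](query)
    return name, answer


if __name__ == "__main__":
    models = {"small": lambda q: "s:" + q, "large": lambda q: "L:" + q}
    result = serve(
        "a long and difficult query",
        embed=lambda q: [float(len(q))],
        centroids=[[5.0], [50.0]],
        route={0: "small", 1: "large"},
        models=models,
        strongest="large",
        qe=lambda q, a: 0.9 if len(q) < 8 else 0.1,
    )
    print(result)
```

In this toy setup the routing table plays the role of the offline-tuned assignment, and the QE threshold decides how often the expensive model is called.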
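[NLP-45] Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026
[Quick Read]: This paper addresses the extraction of cause-effect relations from financial text, framed as extractive question answering in English and Spanish. The key is a systematic comparison of three model families: encoder-only token tagging with multilingual BERT, encoder-decoder generation with multilingual BART, and decoder-only LLMs (Llama 3.1 and GPT variants) with prompt refinement, few-shot demonstrations, and supervised fine-tuning. Prompting and few-shot examples are already competitive, but supervised fine-tuning gives the largest gains. The best system, GPT-4.1 Mini fine-tuned on combined English and Spanish training data, ties for the highest score on the English subtask (4.8140) and ranks third on Spanish (4.7753), which highlights the value of task-specific adaptation and multilingual fine-tuning for financial causality QA.
Link: https://arxiv.org/abs/2606.27446
Authors: Akash Kumar Gautam,Serhii Hamotskyi,Christian Hänig
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: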
Abstract:This paper describes team HSA_CORAL’s submission to the FinCausal 2026 shared task on extracting cause-effect relations from financial narratives via extractive question answering in English and Spanish. We compare three modeling families: (i) encoder-only token tagging with multilingual BERT, (ii) encoder-decoder generation with multilingual BART, and (iii) decoder-only LLMs (Llama 3.1 and GPT variants) using prompt refinement, few-shot demonstrations, and supervised fine-tuning. Across settings, prompting and few-shot examples yield competitive performance, while supervised fine-tuning provides the largest gains. Our best system, GPT-4.1 Mini fine-tuned on combined English and Spanish training data, achieves a tied highest score on the English subtask (score 4.8140) and ranks third on Spanish (score 4.7753) under the shared task’s LLM-as-a-judge metric. Overall, the results highlight the value of task-specific adaptation and multilingual fine-tuning for cross-lingual transfer in financial causality QA.
[NLP-46] CalBrief: A Pilot Diagnostic Benchmark for Evidence-Calibrated Scientific Briefing with Large Language Models
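[Quick Read]: This paper addresses the difficulty large language models (LLMs) have, when acting as research assistants, in calibrating research takeaways to the strength and scope of the supporting evidence. The task is evidence-calibrated scientific briefing: given a bounded package of related papers, a system should generate package-level takeaways with evidence strength, scope boundaries, and missing-evidence caveats. The key contribution is a verified pilot benchmark of 16 heterogeneous scientific evidence packages and 96 human-verified takeaways, together with CalBrief, an auditable role/gap/strength framework used as a diagnostic probe to locate where briefing breaks down. Under a fair-schema evaluation, structured organization improves role and gap reasoning, but an explicit strength-calibration policy is systematically over-conservative and falls below majority and direct-LLM baselines. A controlled diagnostic on three closed-model backbones (GPT-4o, Claude Sonnet, Gemini Flash) separates three potential causes: about 63% of the conservatism gap comes from expanding the label space from binary (moderate, weak) to four-way (moderate, weak, uncertain, insufficient evidence), an effect that is statistically significant on all backbones; only 1% comes from gap/scope signal injection (not significant); and the remaining 36% comes from the pipeline policy itself. Four-way predictions can also be collapsed post hoc to binary and then match or exceed direct binary prompting, so the extra labels carry information that strict matching hides. The paper concludes that label-level strength judgment and auditable evidence organization are distinct abilities currently in tension and should be evaluated separately.
Link: https://arxiv.org/abs/2606.27383
Authors: Yu Fu,Yongqi Kang,Yong Zhao
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: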
Abstract:Large language models (LLMs) are increasingly used as research assistants, yet it remains unclear whether they can calibrate research takeaways to the strength and scope of the supporting evidence. We study evidence-calibrated scientific briefing: given a bounded package of related papers, a system should generate package-level takeaways with evidence strength, scope boundaries, and missing-evidence caveats. We contribute a verified pilot benchmark of 16 heterogeneous scientific evidence packages and 96 human-verified takeaways, and we use CalBrief, an auditable role/gap/strength framework, as a diagnostic probe to locate where briefing breaks down. Under a fair-schema evaluation, structured organization improves role and gap reasoning, but an explicit strength-calibration policy is systematically over-conservative and falls below majority and direct-LLM baselines. To explain why, we run a controlled diagnostic across three closed-model backbones (GPT-4o, Claude Sonnet, Gemini Flash) that separates three potential causes of conservatism. Approximately 63% of the conservatism gap is attributable to expanding the label space from binary {moderate, weak} to four-way {moderate, weak, uncertain, insufficient_evidence} (p < 0.001 across all backbones); only 1% is attributable to gap/scope signal injection (not significant); the remaining 36% arises from the pipeline policy itself. We also find that 4-way predictions can be post-hoc collapsed back to binary and then match or exceed direct binary prompting, so the extra labels carry information that strict matching hides. Label-level strength judgment and auditable evidence organization are distinct abilities currently in tension, and should be evaluated separately for LLM research assistants.
[NLP-47] A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges ACL
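[Quick Read]: This paper addresses the lack of systematic evaluation and comparison of automated presentation coaching systems along key dimensions such as pronunciation, prosody, fluency, and content faithfulness. The core challenge is that existing work is scattered and lacks a unified evaluation framework, which makes technical progress hard to compare across systems. The key to its approach is a five-dimensional task taxonomy covering segmental pronunciation, lexical stress, suprasegmental prosody, pacing, and content faithfulness, onto which existing systems are categorized and mapped to reveal gaps in multi-dimensional coverage. The paper also reviews TTS-based exemplar generation and diagnostic methods for assessing pronunciation, prosody, and fluency, and identifies the directions that most need progress: building annotated presentation corpora, achieving fair feedback across native-language (L1) backgrounds, and delivering the low-latency diagnostics required for real-time rehearsal.
Link: https://arxiv.org/abs/2606.27380
Authors: Wen Liang, Li Siyan, Zackary Rackauckas, Julia Hirschberg
Affiliations: Columbia University; Red Hat; RoleGaku
Categories: Computation and Language (cs.CL)
Comments: accepted into the BEA 2026 workshop at ACL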
Abstract:Automated coaching for oral presentations sits at the intersection of computer-assisted pronunciation training (CAPT), prosody modeling, and speech synthesis, yet no prior work has systematically surveyed and compared existing systems along these dimensions. This survey reviews and categorizes automated presentation coaching systems, spanning pronunciation tutors, fluency and prosody coaches, multimodal trainers, and conference QA practice tools. We introduce a five-dimensional task taxonomy - covering segmental pronunciation, lexical stress, suprasegmental prosody, pacing, and content faithfulness - and explicitly map surveyed systems onto it to reveal coverage gaps. We further review the core technical methods these systems employ: TTS-based exemplar generation and diagnostic methods for pronunciation, prosody, and fluency assessment. Key open challenges include the scarcity of annotated presentation corpora, achieving accent-fair feedback across diverse L1 backgrounds, and delivering low-latency diagnostics for real-time rehearsal.
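[NLP-48] Position: The Term “Machine Unlearning” Is Overused in LLMs ICML2026
[Quick Read]: This paper addresses the overuse of the concept of machine unlearning in current research on large language models (LLMs). Its central concern is that many tasks with different goals (such as refusing harmful requests, removing specific entities or knowledge, or targeted suppression) are all lumped together as "unlearning", which confuses terminology and leads to misuse of evaluation criteria and benchmarks. The paper argues that "machine unlearning" should be reserved for dataset-defined deletion: a forget set of training data is precisely specified, and after its influence is removed the model should behave approximately like one retrained from scratch without that data. Other tasks pursue different technical objectives, such as alignment, suppression, editing, or obfuscation, and should be evaluated with the corresponding terminology and comparable reference models. The key to the proposed remedy is stricter terminology that makes explicit the implicit guarantees behind each kind of task, together with evaluations that match the claimed objective, so that work is not judged on surface metrics (such as low ROUGE scores or forget accuracy) while retraining equivalence goes untested.
Link: https://arxiv.org/abs/2606.27379
Authors: Sangyeon Yoon, Yeachan Jun, Albert No
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages; ICML 2026 Position Paper Track. Sangyeon Yoon and Yeachan Jun contributed equally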
Abstract:Large language models increasingly face demands to “forget” training data, knowledge, or behaviors due to regulatory deletion obligations, copyright/licensing disputes, and safety or product-policy requirements. This position paper argues that machine unlearning is overused as a term in LLM research and should be reserved for dataset-defined deletion: removing the training influence of a precisely specified forget set such that the resulting model is approximately indistinguishable from retraining without that data. We contend that many tasks currently labeled “unlearning” (e.g., refusal for harmful requests, entity/knowledge removal, or targeted suppression) pursue different, often policy-dependent objectives and therefore require different terminology and baselines (e.g., alignment, suppression, editing, obfuscation). We further argue that this confusion is not cosmetic: because papers make different implicit guarantees under the same label, metrics and benchmarks are frequently reused outside their intended scope, rewarding surface-level non-disclosure (e.g., low ROUGE/forget accuracy) even when retraining-equivalence is not tested and derived capabilities remain. We conclude by calling for stricter terminology tied to explicit guarantees and reference models, and for evaluations that match the claimed objective.
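[NLP-49] Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs
[Quick Read]: This paper addresses a fundamental problem in evaluating latent thought representations in large language models (LLMs): existing evaluations conflate representation quality with model capacity, so defects in the representation itself cannot be identified. The core challenge is that downstream task accuracy cannot tell whether a failure stems from the representation or from the limits of the model that processes it. The paper proposes an axiomatic evaluation framework that defines four functional axioms independent of downstream performance (Causality, Minimality, Separability, and Stability) and builds a quantitative measure for each, computed directly on the representation rather than on downstream task performance. An audit of open-weight LLMs on 23 reasoning tasks (e.g., spatial reasoning, factual QA) finds that no candidate satisfies all four axioms at once; that the representations reliably distinguish task types but cannot separate different questions within the same task; and that the representations encode little information beyond what is already present in the input embedding. This failure pattern is consistent across dense, reasoning-distilled, and RL-trained model families, indicating that the gap is structural rather than a consequence of model size or training procedure.
Link: https://arxiv.org/abs/2606.27378
Authors: Fahd Seddik, Fatemeh Fard
Affiliations: University of British Columbia
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 44 pages, 27 tables, 14 figures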
Abstract:We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks. Existing evaluations conflate representation quality with model capacity. Therefore, failures cannot be attributed to the representation rather than to the model that processes it. We formalize four functional axioms (Causality, Minimality, Separability, and Stability) and define a quantitative measure for each, computed directly on the representation independently of downstream accuracy. We audit open-weight LLMs across 23 reasoning tasks (e.g., Spatial Reasoning, Factual QA). We find that no candidate satisfies all four axioms simultaneously, that the representations distinguish task type reliably but cannot distinguish between two questions within the same task, and that the representations encode little information beyond what is already present in the input embedding. The failure is consistent across dense, reasoning-distilled, and RL-trained model families, indicating that the gap is structural rather than a property of model size or training procedure.
[NLP-50] HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech
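[Quick Read]: This paper addresses the limited emotional expressiveness of text-to-speech (TTS) systems based on large language models (LLMs). The standard supervised fine-tuning paradigm tends to converge to statistically averaged prosody, which limits emotional expressiveness. Preference-driven optimization is a promising alternative, but existing methods have two structural shortcomings: information conflict, where content and emotion in a shared latent space produce conflicting gradients, leading to reward hacking and semantic degradation; and a scale gap, where sparse sentence-level rewards struggle to guide dense frame-level speech generation. To overcome these challenges, the paper proposes HPRO, a hierarchical progressive reward optimization framework. Its core is a new differentiable reward model, the HD-Emo codec, which decouples speech features into separate content and style preference tokens and so structurally isolates emotional optimization from semantic content, resolving the information conflict. On top of this structured preference space, HPRO bridges the scale gap by progressively aligning frame-level, word-level, and sentence-level objectives. Experiments show that HPRO significantly improves emotional expressiveness while preserving linguistic intelligibility.
Link: https://arxiv.org/abs/2606.28249
Authors: Sihang Nie, Xiaofen Xing, Rui Xing, Haoming Li, Ruitong Xiao, Jingyuan Xing, Baiji Liu, Xiangmin Xu
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 7 pages, 3 figures, 3 tables; Preprint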
Abstract:Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimization offers a promising alternative, existing approaches suffer from two structural mismatches: information conflict, where content and emotion in a shared latent space produce conflicting gradients, leading to reward hacking and semantic degradation; and scale gap, where sparse sentence-level rewards struggle to guide dense frame-level generation. To overcome these challenges, we propose HPRO, a hierarchical progressive reward optimization framework. Within HPRO, we introduce the HD-Emo codec as a novel differentiable reward model to resolve the information conflict. It extracts speech into distinct content and style preference tokens, structurally isolating emotional optimization from semantic content. Building upon this structured preference space, HPRO bridges the scale gap by progressively aligning frame-, word- and sentence-level objectives. Experiments demonstrate that HPRO significantly enhances emotional expressiveness, while effectively preserving linguistic intelligibility. The code and audio samples are publicly available at this https URL.
[NLP-51] Scaling limit of the Random Language Model
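[Quick Read]: This paper studies the statistical behavior and phase transitions of the Random Language Model (RLM) in a large-scale limit: the number of hidden symbols $N \to \infty$ while the grammar temperature $\tilde\epsilon_d \to 0$ at fixed $x = \tilde\epsilon_d \log N$. It asks how language statistics evolve with system size and temperature in this scaling limit. The core challenge is to understand the nontrivial dependence of the rule-usage distribution, the mechanism of entropy reduction, and whether a thermodynamic phase transition exists. The key to the solution is a controlled description based on a large-deviation principle, which a semi-annealed approximation maps onto a class of Random Energy Models (REMs) with nontrivial combinatorics, allowing an analytical treatment. The study reveals a condensation transition at the critical value $x_c = 1/8$, which leads to concentrated rule usage, and a second characteristic scale at $x = 1/2$ where entropy begins to fall below its maximal value. By deriving explicit scaling laws for the number of distinct rules, entropy, and related observables across regimes, it identifies scaling, saturation, and critical behavior controlled jointly by grammar size, corpus length, and temperature. The theory offers a unified account of the origin of universal statistical properties of natural language, provides a statistical-physics perspective on the behavior of large language models, and resolves earlier ambiguities about the existence of a thermodynamic transition while explaining the slow approach to the $N \to \infty$ limit as a consequence of the dependence on $\log N$.
Link: https://arxiv.org/abs/2606.28105
Authors: Eric De Giuli
Affiliations: Toronto Metropolitan University
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL)
Comments: 17 pages + 14 pages SI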
Abstract:We develop a quantitative theory of the Random Language Model (RLM), an ensemble of stochastic context-free grammars, in a scaling limit where the number of hidden symbols $N \to \infty$ while the grammar temperature $\tilde\epsilon_d \to 0$ at fixed $x = \tilde\epsilon_d \log N$. In this limit, the model admits a controlled description based on a large-deviation principle over rule-usage patterns. A semi-annealed approximation maps the problem to a class of Random Energy Models with nontrivial combinatorics. We show that the RLM exhibits a condensation transition at a critical value $x_c = 1/8$, below which rule usage concentrates and language statistics acquire a nontrivial dependence on corpus length. A second characteristic scale at $x = 1/2$ marks the onset of entropy reduction from its maximal value. Across these regimes, we derive explicit scaling laws for the number of distinct rules, entropy, and related observables, identifying distinct scaling, saturation, and critical regimes controlled by the interplay of grammar size, corpus length, and temperature. The theory resolves previous ambiguities regarding the existence of a thermodynamic transition and explains the slow approach to the large-$N$ limit as a consequence of the dependence on $\log N$. It further provides a unified framework in which universal statistical properties of language emerge from typical realizations of generative grammars, with implications for both natural language statistics and the behavior of large language models.
Information Retrieval
[IR-0] Context-Aware Explanations for Spatialized Document Layouts
Link: https://arxiv.org/abs/2606.28081
Authors: Wei Liu, John Wenskovitch, Chris North, Rebecca Faust
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 10 pages, 4 figures, accepted to Graphics Interface 2026 (GI 2026)
Abstract:Spatialized document layouts are widely used for exploratory analysis of text corpora, but interpreting the spatial organization of documents and the relationships between regions remains challenging. Existing approaches primarily summarize document content or explain how layouts are generated, providing limited support for understanding spatial relationships within the layout itself. We present CAPE, a context-aware explanation framework that generates natural-language explanations grounded in both document semantics and layout-derived spatial context. CAPE identifies salient spatial patterns (e.g., clusters, subgroups, outliers, and bridging documents) and constructs multi-level contextual representations to guide LLM-based explanation generation. It supports both AI-guided overview and user-driven exploration, with explanations available at multiple levels of detail. We demonstrate CAPE on news and scholarly document layouts and evaluate it in a controlled user study against keyword-based and content-only LLM baselines. Our results suggest that spatially grounded explanations are perceived as more helpful than content-only baselines for interpreting the spatial organization of document layouts.
[IR-1] Single and Multi Truth Data Fusion using Large Language Models
Link: https://arxiv.org/abs/2606.28062
Authors: Hira Beril Kucuk, Norman W Paton, Jiaoyan Chen, Zhenyu Wu
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Data fusion, also known as truth discovery, is a data integration problem that aims to determine the correct value or set of values for each attribute of an object when presented with potentially conflicting values from multiple sources. Data fusion tasks belong to two main categories: single-truth scenarios, where each attribute has only one correct value, and multi-truth scenarios, where multiple values can be valid simultaneously. This paper investigates the use of Large Language Models (LLMs) in data fusion tasks for tabular data. Various prompting strategies, encompassing both single-truth and multi-truth scenarios, are investigated empirically. Domain-dependent, domain-independent, zero-shot and one-shot prompts are evaluated on three different benchmark datasets. Experimental results demonstrate that LLM-based approaches outperform traditional unsupervised truth discovery methods, such as DART and LTM, across all datasets. The codebase of this study has been made publicly available on GitHub.
[IR-2] Fast and Feasible: Permutation-based Constrained Reranking for Revenue Maximization
Link: https://arxiv.org/abs/2606.28059
Authors: Svetlana Shirokovskikh, Anastasiia Soboleva, Ekaterina Solodneva, Aleksandr Katrutsa, Roman Loginov, Egor Samosvat
Categories: Information Retrieval (cs.IR); Optimization and Control (math.OC)
Comments:
Abstract:Search and recommender systems have produced highly relevant search results. A natural next step in the development of such systems in e-commerce is to rerank these results to increase the platform’s revenue from paid promotion products. However, maximizing revenue alone may degrade the user experience by reducing relevance or increasing fraud risk. To avoid this, we state the reranking problem as an integer linear program (ILP) that maximizes revenue subject to per-query constraints on other metrics, e.g., relevance. Since solving ILP exactly for every query is slow for deployment to the online service, we propose a lightweight permutation-based reranking approximation algorithm PermR. At each step, the algorithm selects a pair of neighboring items and swaps them to either improve the objective or repair a violated constraint. We evaluate PermR across multiple categories of a large classified platform in offline and online settings. PermR achieves about 63% of the ILP revenue improvement, within production latency limits, preserving all constraints. In a 14-day online A/B test over 56 million search queries, PermR increased revenue by 2%.
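The neighbor-swap idea in the abstract can be sketched as a greedy local search. This is not the authors' PermR: the position-discounted objective, the single relevance constraint, and the function names are all assumptions made for illustration. The sketch covers only the objective-improving swap from a feasible starting order and omits the constraint-repair move.

```python
import math


def discounted(values):
    """Position-discounted sum with DCG-style weights 1 / log2(rank + 1)."""
    return sum(v / math.log2(i + 2) for i, v in enumerate(values))


def adjacent_swap_rerank(items, min_relevance, max_passes=10):
    """Greedy adjacent-swap reranking (hypothetical sketch, not PermR itself).

    items: list of (revenue, relevance) tuples in the initial order.
    A swap of two neighbors is kept only if it raises discounted revenue
    and keeps discounted relevance at or above min_relevance.
    """
    order = list(items)
    for _ in range(max_passes):
        improved = False
        for i in range(len(order) - 1):
            candidate = order[:]
            candidate[i], candidate[i + 1] = candidate[i + 1], candidate[i]
            gain = (discounted([rev for rev, _ in candidate])
                    - discounted([rev for rev, _ in order]))
            feasible = discounted([rel for _, rel in candidate]) >= min_relevance
            if gain > 1e-12 and feasible:
                order = candidate
                improved = True
        if not improved:
            break  # no neighbor swap helps: local optimum reached
    return order


# Made-up items: (revenue, relevance), initially sorted by relevance.
items = [(1.0, 0.9), (5.0, 0.8), (0.5, 0.7)]
reranked = adjacent_swap_rerank(items, min_relevance=1.70)
```

Each pass costs a linear number of swap evaluations, which is the kind of bounded per-query work that makes a heuristic like this attractive when an exact ILP solve is too slow.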
[IR-3] Listwise Explanation of Embedding-Based Rankings via Semantic Chunk Grouping
Link: https://arxiv.org/abs/2606.27980
Authors: Hyunkyu Kim, Yeeun Yoo, Youngjun Kwak
Categories: Information Retrieval (cs.IR)
Comments: 17 pages, 5 figures, 4 tables
Abstract:Dense embedding rankers score documents through contextual sentence- and passage-level representations. Yet many listwise explanation methods still attribute rankings to isolated words. This feature-unit mismatch leaves word-level features too fragmented for dense semantic ranking. We introduce ChunkGroupSHAP, a listwise Shapley method that clusters semantically related chunks into shared cross-document features. Masking a group perturbs all documents with related evidence, attributing rankings at a granularity closer to dense representations while preserving the listwise setup. Our findings across MS MARCO, FinanceBench, AILACaseDocs, and FinQA with E5 rankers and BM25 show that the best explanation unit is setting-dependent: word features for lexical BM25, corpus-level groups for dense rankers, and query-local grouping for heterogeneous web retrieval. Feature units should thus follow both the ranker’s representational granularity and the structure of the retrieved corpus.
[IR-4] SHARD: cell-keyed residual splitting for alignment-resistant private dense retrieval
Link: https://arxiv.org/abs/2606.27976
Authors: Sergey Kurilenko
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: arXiv admin note: text overlap with arXiv:2606.26373
Abstract:Dense embeddings underpin semantic search and RAG, yet a leaked vector store hands much of the underlying text back to whoever holds it. The attacks that make this possible (few-shot alignment, zero-shot inversion, unsupervised cross-space translation) share one weakness: the protected store is a single global geometry that can be aligned to a known one. A secret global rotation, the usual lightweight defence, is no exception: orthogonal Procrustes recovers it once the attacker has about the subspace dimension in known pairs. We introduce Shard, a retrieval-preserving embedding transform that removes this weak axis. The centred embedding is split into a short public prefix (for stage-1 retrieval) and a private residual sharded into C cells under separate secret keys; the residual is reranked under CKKS, where the keys cancel and leave the inner product exact. A single parameter C runs the design from the global-linear baseline it replaces (C=1) to per-document micro-keys (C=N). Because the rerank is full-dimensional, Shard returns the raw-space nDCG@10 that half-SVD truncation gives up; and because the residual is keyed cell-locally, mapping it back to a common frame under a diffuse known-plaintext leak costs roughly C times more anchors (median 200 to 102,400 at C=256), for a few encrypted queries. The short public prefix leaks far less neighbour structure, and a micro-key limit drives the residual graph to zero with an unlinkable, renewable template. The barrier holds against learned, non-linear and unsupervised aligners, and where a matched-utility noise defence de-anonymises almost every probe, Shard de-anonymises none. We are plain about the limits: within a cell the keys cancel, a targeted attacker needs only about d_priv anchors, and an overlapping reference corpus still leaks through the prefix. Shard is an attack-aware geometric defence, not a cryptographic guarantee. 
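Two claims in the abstract about the global-rotation baseline can be checked in a toy setting: a shared secret rotation leaves inner products unchanged, and it can be recovered from known plaintext pairs. The sketch below works in two dimensions, where the rotation Procrustes solution reduces to a closed-form angle estimate. It illustrates only the baseline that Shard replaces, not Shard's cell-keyed residuals or the CKKS rerank, and all vectors and the secret angle are made up.

```python
import math


def rotate(v, theta):
    """Apply a 2-D rotation by angle theta to vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def dot(a, b):
    """Inner product of two 2-D vectors."""
    return a[0] * b[0] + a[1] * b[1]


def procrustes_angle(plain, protected):
    """Closed-form 2-D rotation Procrustes.

    Returns the angle theta that maximizes sum <R(theta) p, q> over the
    known (plain, protected) pairs.
    """
    cross = sum(p[0] * q[1] - p[1] * q[0] for p, q in zip(plain, protected))
    inner = sum(dot(p, q) for p, q in zip(plain, protected))
    return math.atan2(cross, inner)


secret = 1.234                                  # hypothetical secret key
docs = [(0.3, -1.2), (2.0, 0.5), (-0.7, 0.9)]   # made-up plaintext embeddings
query = (1.0, 0.4)
store = [rotate(d, secret) for d in docs]       # the "protected" vector store

# The key cancels: scoring a rotated query against the rotated store
# gives the same inner product as scoring in the raw space.
protected_score = dot(rotate(query, secret), store[0])
raw_score = dot(query, docs[0])

# An attacker holding a few known pairs recovers the rotation.
recovered = procrustes_angle(docs[:2], store[:2])
```

In higher dimensions the same recovery is done with the SVD-based orthogonal Procrustes solution, which is why the abstract ties the attacker's cost to roughly the subspace dimension in known pairs.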
[IR-5] An LLM-Powered Semantic Alignment Framework for Journal Recommendation
Link: https://arxiv.org/abs/2606.27930
Authors: Yanglin Yan, Zicheng Xie, Tianchen Gao, Rui Pan, Hansheng Wang
Categories: Information Retrieval (cs.IR); Applications (stat.AP)
Comments:
Abstract:Journal recommendation is an important task in scholarly information systems. Existing approaches typically rely on supervised learning models, manually engineered features, or historical interaction data, which may limit their generalizability and interpretability. We propose an LLM-powered semantic alignment framework that formulates journal recommendation as a semantic matching problem between manuscript content and journal scope descriptions. The framework enables large language models (LLMs) to infer journal suitability directly from article titles, abstracts, keywords, and candidate journal information without task-specific training. Experiments are conducted using DeepSeek-V3 on a dataset of 23,609 articles from 49 journals in statistics and related fields. The proposed framework achieves Top-3, Top-5, and Top-10 accuracies of 40.23%, 53.67%, and 70.05%, respectively. Additional analyses show that incorporating reference information generally improves recommendation performance and that recommendations remain highly stable across repeated runs, with an average Top-5 Jaccard similarity of 84%. The framework also generates interpretable reasoning outputs that provide insights into the recommendation process. These findings demonstrate the potential of LLMs as a training-free and scalable paradigm for journal recommendation and scholarly decision support.
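The stability figure in the abstract is a Jaccard similarity between Top-5 lists from repeated runs. A minimal sketch follows, assuming the average is taken over all pairs of runs (the abstract does not say how the average is formed); the journal names are placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder Top-5 recommendation lists from three repeated runs.
runs = [
    ["A", "B", "C", "D", "E"],
    ["A", "B", "C", "D", "F"],
    ["A", "B", "C", "D", "E"],
]
stability = mean_pairwise_jaccard(runs)
```

Two Top-5 lists that differ in a single journal share 4 of 6 distinct items, so one substitution already lowers that pair's similarity to about 0.67.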
[IR-6] From Bootstrapping to Sequence Modeling: A Unified Generative Framework for Personalized Landing-Page Modeling
Link: https://arxiv.org/abs/2606.27865
Authors: Fan Li, Chang Meng, Jiaqi Fu, Shuchang Liu, Tianke Zhang, Xueliang Wang, Xiaoqiang Feng, Yongqi Liu, Kaiqiao Zhan
Categories: Information Retrieval (cs.IR)
Comments: arXiv admin note: text overlap with arXiv:2507.23459
Abstract:Modern online platforms increasingly adopt multi-page architectures to accommodate diverse user needs. On these platforms, page navigation (the process of directing users to specific functional pages upon app entry) serves as a critical gateway that shapes user’s first impression and significantly influences subsequent engagement. To optimize this process, Kuaishou formulated the task of Personalized Landing Page Modeling (PLPM) and proposed KLAN, a reinforcement learning framework built upon Conservative Q-Learning (CQL). However, CQL-based approaches suffer from two fundamental limitations: (1) the Markov assumption fails to capture the strong non-Markovian temporal dependencies inherent in real-world user behaviors, and (2) TD learning with bootstrapping incurs severe cumulative errors and credit assignment difficulties under delayed rewards, particularly in long-horizon settings where users enter the app multiple times daily. To address these limitations, we propose GLAN (Generative Landing-page Adaptive Navigator), a sequence modeling framework built on Decision Transformer to tackle PLPM from a unified global-local perspective. Specifically, GLAN incorporates two key modules. First, we design the L-RTG module that captures users’ inter-day consumption dynamics to provide accurate global guidance for all page assignments within a day. Furthermore, we propose the HRM module that decomposes session-level feedback into fine-grained signals, enabling precise local supervision for each page assignment. Extensive online experiments conducted on the Kuaishou platform demonstrate the effectiveness of GLAN, achieving +0.158% and +0.108% improvements on Daily Active Users (DAU) and user Lifetime (LT) respectively.
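For context on the Decision Transformer setting the abstract builds on: such models condition each step on a return-to-go, the sum of rewards from that step to the end of the trajectory, instead of bootstrapping a value estimate. The sketch below shows only that standard quantity. It is not GLAN's L-RTG module, whose definition the abstract does not give, and the reward sequence is made up.

```python
def returns_to_go(rewards):
    """Undiscounted return-to-go: rtg[t] = rewards[t] + rewards[t+1] + ...

    Computed in one backward pass over the reward sequence.
    """
    rtg = []
    running = 0.0
    for reward in reversed(rewards):
        running += reward
        rtg.append(running)
    rtg.reverse()
    return rtg


# Made-up per-entry rewards for one user across four app entries in a day.
example = returns_to_go([0.0, 1.0, 0.0, 2.0])
```

Because each target is a plain sum of observed rewards, no bootstrapped estimate is involved, which is the contrast with TD learning that the abstract draws.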
[IR-7] End-to-End Dynamic Sparsity for Resource-Adaptive LLM Inference
Link: https://arxiv.org/abs/2606.27743
Authors: Yuhang Chen, Jinhao Duan, Ruichen Zhang, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Parish Aggarwal, Frank Shyu, Luke Simon, Sandeep Pandey, Tianlong Chen, Xi Liu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) inference is typically deployed under a static resource assumption, where models execute a fixed computational graph regardless of the runtime environment. However, real-world cloud infrastructure is inherently dynamic, characterized by fluctuating availability (e.g., spot instance preemption) and tiered Quality-of-Service requirements. In such volatile settings, static models are inflexible: they either crash under resource constraints or waste compute on redundant operations. To bridge this gap, we propose Learning to Allocate (L2A), an end-to-end framework for resource-adaptive inference. Unlike prior methods that condition only on input difficulty, we formulate inference as a constrained allocation problem conditioned on both the input and the runtime resource budget itself. We introduce lightweight, budget-conditioned and input-aware gating networks integrated into the LLM. These gates are trained via a unified objective that jointly optimizes task performance, logical consistency, and resource costs along three axes matching how real-world dynamics manifest: layer skipping for memory and depth pressure, head pruning for throughput contention, and reasoning-token reduction for latency tightening. This lets the model learn a budget-aware policy beyond input difficulty alone: it adaptively configures its computational footprint with respect to real-time resource dynamics, maximizing reasoning depth when resources permit while enforcing strict frugality when budgets tighten. A single L2A model traces the entire compute-accuracy Pareto frontier on Llama-3-8B and Qwen-3-4B: at up to 34% realized layer sparsity, it stays within 0.6% of the dense baseline on GSM8K, with the same gap holding zero-shot on out-of-distribution tasks, while every static or heuristic baseline requires a separately tuned model and still drops by 5-10% at comparable inference time.
[IR-8] Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation
Link: https://arxiv.org/abs/2606.27732
Authors: Yuhang Chen, Xianfeng Wu, Jinhao Duan, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Parish Aggarwal, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Tianlong Chen
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Discrete diffusion language models (dLLMs) recover masked tokens in parallel, offering significant speedups over autoregressive (AR) generation. However, such promising frameworks face a fundamental architectural design dilemma: (1) Adopting bidirectional attention achieves strong generation quality by allowing each position to access the full context, but is inherently incompatible with KV caching, limiting inference throughput in batch-serving scenarios; (2) Conversely, causal attention enables efficient cached inference but loses all right-side context, substantially degrading generation quality. This paper introduces Bifocal dLLMs, a new paradigm that resolves this dilemma through asymmetric bidirectional context. Analogous to bifocal lenses, we instantiate the paradigm as R2LM (Right-to-Left Mamba), which combines two complementary mechanisms: a) standard causal attention providing precise left-context with full KV cache compatibility, while b) a lightweight reverse Mamba SSM sidecar supplying compressed right-side context without breaking cacheability. Comprehensive experiments on continued pretraining of Qwen3-1.7B with 60B tokens demonstrate that R2LM achieves 2.4× to 12.9× higher throughput than bidirectional dLLMs and 1.9× to 2.9× speedup over AR baselines in batch serving through parallel decoding with KV caching, while exceeding the causal baseline on most benchmarks and surpassing the bidirectional dLLM on average.
[IR-9] Intuition-Guided Latent Reasoning for LLM-Based Recommendation
Link: https://arxiv.org/abs/2606.27684
Authors: Chang Liu, Yimeng Bai, Xiaoyan Zhao, Yang Zhang, Qifan Wang, Fuli Feng, Wenge Rong
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, motivating their use for preference reasoning in recommender systems. Latent reasoning, which operates in continuous hidden spaces rather than discrete tokens, has recently emerged as a promising paradigm for LLM-based recommendation. However, existing methods often start from unconstrained reasoning points, where hidden representations are misaligned with target item embeddings, leading to suboptimal reasoning trajectories. Inspired by cognitive neuroscience, which suggests that human multi-step reasoning is guided by intuition as a latent prior, we propose IntuRec, a two-stage framework that anchors latent reasoning with recommendation intuition. In the extraction stage, the LLM-based recommender generates a top-K candidate set based on users’ histories as the source of intuition. In the injection stage, the candidate set is transformed into a preference-aligned intuition embedding using self- and cross-attention mechanisms, which initializes the reasoning start point and guides subsequent latent reasoning. By providing a semantically grounded starting point, IntuRec efficiently explores the preference space along more accurate reasoning trajectories. Extensive experiments on multiple real-world datasets demonstrate that IntuRec consistently outperforms state-of-the-art baselines. We release our code at this https URL.
[IR-10] DysLexLens: A Low-Resource LLM Framework for Analysing Dyslexic Learners Insights from Online Forums
Link: https://arxiv.org/abs/2606.27619
Authors: Dana Rezazadegan, Atie Kia, Phongpadid Nandavong, Dominique Carlon, Jeremy Nguyen, Abhik Banerjee, James Marshall, Anthony McCosker, Yong-Bin Kang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Dyslexic learners increasingly use artificial intelligence (AI) tools to support reading, writing, organisation, and study-related tasks. However, their lived experiences with these tools remain largely underexamined. This paper proposes DysLexLens, a low-resource LLM framework, designed to analyse dyslexic learners experience with AI through online forum discussions. DysLexLens is designed as an end-to-end, evidence-traceable architecture which transforms noisy social media posts into a dictionary-driven corpora, provides knowledge-graph (KG)-based question reasoning, generates verifiable query responses, and enables response evaluation through quantitative and human-grounded assessment. DysLexLens has four key features. First, it employs a dictionary-driven filtering method to construct a more focused Reddit corpus on dyslexia and AI, filtering out noisy and weakly related posts to improve the relevance of data collected from low-resource forum contexts. Second, it integrates LLM-assisted semantic analysis with KG-based query reasoning to uncover meaningful patterns. Third, it has quantitative evaluation metrics (RAGAS and Query Robustness) to measure LLM-generated response performance. Fourth, it provides structured qualitative validation guidelines for assessing response quality, with a specific focus on hallucination and evidence alignment. We demonstrate the effectiveness of DysLexLens using dyslexia-related Reddit forum data and 30 questions. The results show its potential generalisability to other low-resource forum data contexts. DysLexLens, sample data, questions and evaluation results are available at Github to support reproducibility.
[IR-11] A Sensitivity-Aware Test Collection for Search Among Personal Information SIGIR2026
Link: https://arxiv.org/abs/2606.27559
Authors: Jack McKechnie, Graham McDonald, Craig Macdonald
Categories: Information Retrieval (cs.IR)
Comments: SIGIR 2026 Resource Paper
Abstract:Traditional search tasks aim to satisfy user information needs by returning a subset of a collection of documents, ranked by the documents’ relevance to a user query. However, some collections that contain useful information also contain sensitive personal information. Recently, there has been increasing interest in the development of Sensitivity-Aware Search (SAS) retrieval models to provide users with effective retrieval results without revealing such sensitive information. To develop such systems, test collections containing both sensitive and non-sensitive information, a set of queries, and query-document relevance assessments are required. The Enron email corpus contains real business-related emails, where some emails also contain sensitive personal information. However, the original Enron collection does not contain queries or query-relevance assessments. To this end, we crowdsource 150 query formulations for 50 different topics and 11,471 query-relevance assessments for a subset of the Enron documents that have been manually labelled for sensitivity. We follow best practices for using large language models (LLMs) in Information Retrieval evaluation to extend the collection further with additional LLM judged query-relevance assessments and sensitivity labels. We present baseline performances for relevance, sensitivity classification, and sensitivity-aware search on the collection. We make the collection available, including through the popular ir_datasets package, and provide pre-built sparse and dense indices on Huggingface to facilitate easy experimentation.
[IR-12] Recall Before Rerank: Benchmarking Deep Learning Models for Large-Scale Code-to-Code Retrieval
Link: https://arxiv.org/abs/2606.27401
Authors: Leonardo Venuta, Francesco Tosoni, Paolo Ferragina
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 15 pages, 4 figures
Abstract:Semantic code search and clone detection are essential for software development, maintenance, and reuse. This paper evaluates the effectiveness, efficiency, and scalability of contemporary deep learning models for first-stage recall in large-scale code-to-code search engines. Benchmarking across multiple programming languages and datasets reveals critical limits in the precision and scalability of these models on Terabyte-scale source-code collections. We present LLM-based code normalisation and query-rewriting schemes that yield significant gains in precision for lower-performing models. Our results question the sustainability of resource-constrained deployment and the assumed robustness of current code-specialised LLMs across datasets. We conclude with actionable insights for building scalable, efficient code-retrieval systems.
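As a rough illustration of what "first-stage recall" measures in the abstract above, here is a minimal sketch of recall@k for a single query. The function name `recall_at_k`, the file identifiers, and the cut-offs are illustrative assumptions and are not taken from the paper, whose exact metrics and datasets are not given in the abstract.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear among the top-k retrieved items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # no relevant items: define recall as 0 to avoid dividing by zero
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)


# Hypothetical ranked candidates for one query snippet, and its true clones.
ranked = ["f3", "f7", "f1", "f9", "f2"]
clones = {"f1", "f2", "f8"}

for k in (1, 3, 5):
    print(k, recall_at_k(ranked, clones, k))
```

A first-stage retriever is usually judged by how many true matches survive into the candidate list at a fixed k, since a later reranker cannot recover anything that was dropped at this stage.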
Human-Computer Interaction
[HC-0] Functional outcomes and naturalistic engagement with a purpose-built conversational AI for mental health (Ash)
Link: https://arxiv.org/abs/2606.28241
Authors: Kristen M. Van Swearingen, Thomas D. Hull, Karthik V. Sarma, Caitlin A. Stamatis
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Background: Conversational AI chatbots designed for mental health may offer an accessible, scalable avenue for supporting psychological well-being, yet prior evaluations have largely focused on clinical symptom reduction rather than broader indicators of day-to-day functioning, and have rarely monitored for potential harms such as inflated self-perception. Objective: We examined within-person change in psychological functioning indicators among real-world users of Ash, a purpose-built conversational AI for mental health support, over the first four weeks of use, and whether these changes were associated with engagement metrics. Methods: In this single-arm observational cohort study, new users (n = 1,284) completed in-app single-item measures of psychological functioning (life satisfaction, relationship satisfaction, sleep quality, behavioral activation), working alliance, and grandiosity (inflated self-perception), at baseline and Week 4. Paired-sample t-tests examined within-person change; ANCOVAs tested engagement-outcome associations at Week 4, controlling for baseline. Results: At baseline, participants reported below-average life satisfaction and fair sleep quality. Significant within-person improvements emerged across all functioning indicators and working alliance (ps < .001; d = 0.14-0.26), with no change in grandiosity. Active days, total sessions, and total minutes consistently predicted Week 4 psychological functioning and working alliance (ps = .006; partial R^2 range: 0.58-2.15%; controlling for baseline), whereas user message volume did not. Conclusion: Findings provide preliminary data for the potential of evidence-based conversational AI to extend mental health support for broad psychological functioning, extending the existing literature beyond symptom-based outcomes.
[HC-1] Typing Behavior in Human-LLM Interaction: Keystroke Dynamics Reveal Cognitive Effort During Prompting
Link: https://arxiv.org/abs/2606.28090
Authors: Laura Schütz, Yousri Cherif, Clara Sayffaerth, Thomas Weber, Francesco Chiossi
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:As Large Language Models (LLMs) become increasingly integrated into daily routines, understanding how users interact with these systems is crucial for effective human-AI collaboration. This work investigates keystroke dynamics as a behavioral measure of user mental effort and perceived output usefulness in human-LLM interaction. We conducted a user study (N = 36) to examine how task difficulty (easy vs. hard) and device type (desktop vs. mobile) influence typing behavior and workload (NASA-TLX) during interactions. Our results indicate that hard tasks led to significantly more keystrokes, slower typing, increased pauses, and higher self-reported workload. Device type had weaker effects, with mobile use slightly reducing input length and typing speed. While keystrokes captured differences in cognitive effort, they did not predict perceived LLM output usefulness. These findings highlight the potential of keystroke dynamics as real-time indicators of cognitive effort during LLM prompting, while also showing their limitations in capturing perceived collaboration success.
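To make the measures named in the abstract above (keystroke count, typing speed, pauses) more concrete, the following is a minimal sketch of how such features could be derived from key-press timestamps. The function name `keystroke_features`, the 2-second pause threshold, and the sample timestamps are assumptions for illustration only; the abstract does not state how the authors define a pause or compute speed.

```python
def keystroke_features(timestamps, pause_threshold=2.0):
    """Summarise key-press times (in seconds, ascending) into simple typing features."""
    # Time gaps between consecutive key presses.
    intervals = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    duration = timestamps[-1] - timestamps[0] if len(timestamps) > 1 else 0.0
    return {
        "keystrokes": len(timestamps),
        "mean_interval": sum(intervals) / len(intervals) if intervals else 0.0,
        # A "pause" here is any gap at or above the assumed threshold.
        "pauses": sum(1 for gap in intervals if gap >= pause_threshold),
        "keys_per_second": len(intervals) / duration if duration > 0 else 0.0,
    }


# Hypothetical prompt: three quick keys, a 3-second pause, then one more key.
features = keystroke_features([0.0, 0.5, 1.0, 4.0, 4.5])
print(features)
```

Under this sketch, a harder task would show up as more keystrokes, a larger mean interval, and more pauses, which matches the direction of the effects reported in the abstract.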
[HC-2] STAG: Spatio-temporal Evolving Structural Representation of Action Units for Micro-expression Recognition
Link: https://arxiv.org/abs/2606.28083
Authors: Nandani Sharma, Varun Sharma, Dinesh Singh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments:
Abstract:Micro-expression recognition is challenging due to subtle and short-lived facial muscle movements. Existing methods rely heavily on apex-onset frames, overlook fine-grained inter-frame dynamics, and separately model spatial and temporal information, limiting generalization across datasets. To address these challenges, we propose STAG, a dynamic ROI-AU-coupled spatial-temporal network that jointly models motion flow and adaptive facial connectivity. The framework extracts optical flow from discriminative frames using magnitude-based selection and temporal attention. A dual-branch architecture combines an enhanced graph attention network for structured spatial reasoning with a transformer encoder for temporal modeling. A bidirectional cross-attention module enables mutual refinement of spatial and temporal features, while AU-guided dynamic connectivity adapts facial region interactions according to muscle activation patterns. The transformer captures subtle temporal dynamics beyond apex-based approaches, improving semantic consistency and interpretability for explainable micro-expression recognition. The fused representation is optimized using focal loss and evaluated on CASME II, 4DME, DFME, NaME, SAMM, and SMIC-HS. Extensive experiments demonstrate improved robustness, generalization, interpretability, and computational efficiency, confirming the effectiveness of adaptive relational reasoning, AU-guided dynamic connectivity, and deep spatial-temporal feature fusion for accurate cross-dataset micro-expression recognition.
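The abstract above mentions optimisation with focal loss. For readers unfamiliar with it, here is a minimal sketch of the standard per-sample form, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The values gamma = 2 and alpha = 1 are assumed defaults for illustration; the abstract does not report the hyperparameters STAG uses.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the predicted probability of its true class."""
    # The (1 - p_true) ** gamma factor shrinks the loss of well-classified samples.
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


for p in (0.9, 0.5, 0.1):
    cross_entropy = focal_loss(p, gamma=0.0)  # gamma = 0 reduces to cross-entropy
    print(p, round(cross_entropy, 4), round(focal_loss(p), 4))
```

The down-weighting of easy samples is why focal loss is a common choice for imbalanced classes, which micro-expression datasets often have.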
[HC-3] AI Persuasive Framing in Collective Dilemmas
Link: https://arxiv.org/abs/2606.27951
Authors: Anders Giovanni Møller, Alessia Galdeman, Arianna Pera, Luca Maria Aiello
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Physics and Society (physics.soc-ph)
Comments: The first two authors contributed equally to this research. The article contains 20 pages, 10 figures, and 2 tables
Abstract:AI agents are promising tools that can act as flexible behavioral nudges to enhance human cooperation in addressing large-scale societal problems. However, evidence on whether AI agents can effectively boost cooperation remains mixed. We recruited 1,283 participants to play iterated Collective Risk Games in small groups, testing whether AI assistants could nudge participants toward cooperation. By using persuasive framing personalized to each player’s Social Value Orientation profile, the AI interventions significantly increased contributions and group success rates. These cooperative effects were short-lived, however, fading after the first few rounds. Strikingly, when the AI treatments were reconfigured to promote selfish behavior through exculpatory framing, the negative effects on contributions and group success were larger and substantially more persistent, particularly for personalized interventions. This asymmetry between prosocial and antisocial persuasion highlights the dual-use risks of AI systems designed to influence group behavior in collective action settings.
[HC-4] A Multi-Attribute Latent Space for Visual Analysis of Watches
Link: https://arxiv.org/abs/2606.27897
Authors: Kai Lawonn, Tobias Günther, Monique Meuschke
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:We present a design rationale, embedding model, and interactive visual-analysis system for exploring large wristwatch collections through heterogeneous visual and semantic attributes. The system addresses a common limitation of catalog and e-commerce interfaces: users can filter by metadata, but they receive little support for open-ended exploration of visual similarity, stylistic alternatives, and mixed aesthetic-functional criteria. We therefore represent watches with separate attribute graphs for dial color and dial design, while using watch type as an explicit semantic organizer. Dials are segmented with a U-Net, watch types are predicted with a Vision Transformer, colors are represented through a shared CIELAB reference palette, and dial structure is described with a gradient-based image descriptor. We extend UMAP by combining attribute-specific neighborhood graphs in a unified probabilistic objective and by adding a class-aware layout term that separates global type structure from local visual neighborhoods. The resulting map is exposed in an interactive interface with spatial navigation, metadata filtering, detail inspection, and search-by-example insertion. We evaluate the approach through parameter analysis, runtime measurements, and a qualitative pilot study with watch experts and novices. The results suggest that the system supports discovery and comparison, while also revealing limitations in scalability assessment, search-by-example validation, and the need for broader domain studies. We explicitly discuss these limitations and derive design implications for multi-attribute latent-space visualization across heterogeneous visual collections.
[HC-5] HandMade: Spatial Prompting for Generative 3D Creation with Part-Labeled VR Sketches
Link: https://arxiv.org/abs/2606.27738
Authors: Jialin Huang, Rana Hanocka, Ariel Shamir, Yotam Gingold
Categories: Human-Computer Interaction (cs.HC)
Comments: 15 pages, 5 figures, 1 table
Abstract:Text-to-3D generation lowers the barrier to 3D content creation, but text alone is a weak interface for specifying spatial intent: where parts should be placed, how they relate, and how an object should be organized in 3D. We present HandMade, a workflow that combines VR 3D sketching and language for open-domain 3D asset generation. HandMade treats coarse, part-labeled 3D sketches not as incomplete geometry to reconstruct directly, but as spatial prompts for existing generative models. It converts segmented VR strokes into multi-view part guidance and structured prompts, allowing users to specify object layout and part relationships through 3D sketching while using language for identity, material, style, and local details. A technical evaluation shows that HandMade better preserves user-authored spatial scaffolds than text-only and sketch-based baselines on 20 varied examples. A user study with eight participants characterizes how users make use of 3D sketching for spatial layout and language for identity, materials, and details across initial authoring and subsequent revision. HandMade contributes an interaction paradigm and interface-to-generation pipeline for spatially guided 3D creation.
Computer Vision
[CV-0] DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand
Link: https://arxiv.org/abs/2606.28323
Authors: Dihong Huang, Zhenyu Wei, Zhuxiu Xu, Yunchao Yao, Sikai Li, Mingyu Ding
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:Dexterous manipulation policies can solve individual skills, but composing them to perform multiple tasks with a single hand remains challenging. Adding a new task on top of an existing manipulation skill often imposes conflicting demands on overlapping fingers and contact modes, causing destructive interference between preserving an existing manipulation outcome and executing a new one. We propose DexCompose, a role-aware residual composition framework that reuses pretrained dexterous policies for multi-task manipulation through explicit finger-level action ownership. Given two pretrained full-hand policies, DexCompose first collects successful post-task states from the first skill and performs release tests over candidate finger masks to identify which fingers are necessary for maintaining the established skill state. It then trains two asymmetric residual modules: a bounded residual stabilizer for task preservation, and a context-aware residual that adapts the frozen downstream policy only within the action subspace assigned to the new task. We evaluate the framework on 16 composite dexterous manipulation tasks spanning four object-retention skills and four downstream interactions. DexCompose achieves a 77.4% average composite success rate, demonstrating that structural action ownership with dual residuals offers a promising direction for composing dexterous skills beyond conventional policy chaining.
[CV-1] PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception ICML2026
Link: https://arxiv.org/abs/2606.28322
Authors: Yana Wei, Hongbo Peng, Yanlin Lai, Liang Zhao, Kangheng Lin, En Yu, Keyu Lv, Han Zhou, Yin Tang, Haodong Li, Mitt Huang, Hangyu Guo, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026. Project page: this https URL
Abstract:We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
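The contrast the abstract above draws between linear averaging and gated scoring can be sketched as follows. This is one plausible reading only: the abstract does not give the scoring formula, so the rule used here (any failed Must-Right rubric zeroes the score, otherwise the score is the plain pass rate) and the function names `linear_score` and `gated_score` are assumptions, not the paper's definition.

```python
def linear_score(must_right, easy_wrong):
    """Plain average of all rubric outcomes (True = passed)."""
    checks = list(must_right) + list(easy_wrong)
    return sum(checks) / len(checks) if checks else 0.0


def gated_score(must_right, easy_wrong):
    """Assumed gate: failing any mandatory rubric zeroes the score."""
    if not all(must_right):
        return 0.0
    return linear_score(must_right, easy_wrong)


# Hypothetical response: one essential fact wrong, most fine details right.
must_right = [True, True, False]
easy_wrong = [True, True, True, False]

print(linear_score(must_right, easy_wrong))  # 5 of 7 rubrics passed
print(gated_score(must_right, easy_wrong))   # gate closed by the failed essential fact
```

The example shows why the two can diverge: a response that passes most rubrics still scores zero under the gate if it misses a single essential fact.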
[CV-2] StructSplat: Generalizable 3D Gaussian Splatting from Uncalibrated Sparse Views
Link: https://arxiv.org/abs/2606.28321
Authors: Jia-Chen Zhao, Beiqi Chen, Xinyang Chen, Guangcong Wang, Liqiang Nie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL Code: this https URL
Abstract:We present StructSplat, a feed-forward and generalizable 3D Gaussian reconstruction framework that operates directly on uncalibrated images without requiring camera parameters. Existing methods either rely on per-scene optimization or assume known camera poses, and often entangle geometry and appearance within a unified backbone, limiting reconstruction fidelity and generalization. Our key idea is to adopt a structured representation that organizes geometry, semantic, and texture cues with explicit roles in the reconstruction process. Specifically, we introduce a pixel-aligned feature injection mechanism to enable accurate texture modeling from 2D observations, incorporate semantic-aware priors to improve global consistency, and design a camera alignment strategy to prevent information leakage and improve generalization. Experiments show that our method significantly outperforms prior approaches on challenging benchmarks. On DL3DV, our method achieves 28.045 PSNR, surpassing AnySplat (22.377) by +5.67 dB. In cross-dataset evaluation, our method achieves +1.94 dB over AnySplat on ACID and +1.72 dB on RealEstate10K. Project page: this https URL Code: this https URL
[CV-3] Learning Topology-Aware Representations via Test-Time Adaptation for Anomaly Segmentation
Link: https://arxiv.org/abs/2606.28268
Authors: Ali Zia, Usman Ali, Abdul Rehman, Umer Ramzan, Kang Han, Muhammad Faheem, Shahnawaz Qureshi, Wei Xiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Test-time adaptation (TTA) has emerged as a promising paradigm for mitigating distribution shifts in deep models. However, existing TTA approaches for anomaly segmentation remain limited by their reliance on pixel-level heuristics, such as confidence thresholding or entropy minimisation, which fail to preserve structural consistency under noise and texture variation. Moreover, they typically treat anomaly maps as flat intensity fields, ignoring the higher-order spatial relationships that characterise complex defect geometries. We introduce TopoTTA (Topological Test-Time Adaptation), a novel framework that integrates persistent homology, a tool from topological data analysis, into the TTA pipeline to enforce geometric and structural coherence during adaptation. By applying multi-level cubical complex filtration to anomaly score maps, TopoTTA derives robust topological pseudo-labels that guide a lightweight test-time classifier, enhancing segmentation quality without retraining the backbone model. The approach avoids reliance on method-specific raw-score thresholding for mask binarisation, preserves connectivity, and generalises across both 2D and 3D modalities. Extensive experiments across six standard benchmarks (MVTec AD, VisA, Real-IAD, MVTec 3D-AD, AnomalyShapeNet, and MVTec LOCO) demonstrate an average 15% F1 improvement over state-of-the-art unsupervised anomaly detection and segmentation methods, with the largest gains on anomalies exhibiting complex geometric or structural variations. These findings suggest that integrating topological reasoning into test-time adaptation provides a principled route to structure-aware generalisation, bridging the gap between geometric learning and robust adaptation.
[CV-4] RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning ECCV2026
Link: https://arxiv.org/abs/2606.28266
Authors: Yelin Wang, Zijia Song, Shuo Ye, Chuanguang Yang, Miaoyu Wang, Yong Xu, Zhulin An, Yongjun Xu, Zitong Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:Remote Sensing Image Change Captioning (RSICC) aims to describe changes between bi-temporal remote sensing images and holds significant research and application value. However, most existing methods rely on conventional deep learning architectures, and the limited model capacity constrains performance. Although large-model post-training techniques have achieved great success in general domains, their direct transfer to RSICC remains challenging due to data scarcity and the need for fine-grained change understanding. To address this, we propose RSICCLLM, the first post-training framework for large vision-language models in RSICC. Specifically, we design a data generation paradigm, release the instruction dataset RSICI, and establish a task-specific RSICC benchmark. We further introduce Difference-aware Supervised Fine-tuning to explicitly extract change representations and guide the model in perceiving and understanding temporal differences. In addition, we propose Dual-Negative Preference Optimization (DNPO), which employs two complementary negative-sample construction strategies to construct the preference dataset RSICP and further refine model performance. Extensive experiments validate the superior capability of RSICCLLM, which achieves outstanding results with only 7B parameters, surpassing models of substantially larger scales. The code and dataset will be made publicly available at this https URL.
[CV-5] Exposure Bias Can Alleviate Itself via Directional and Frequency Rectification in Flow Matching
Link: https://arxiv.org/abs/2606.28226
Authors: Guanbo Huang, Jingjia Mao, Fanding Huang, Fengkai Liu, Xiangyang Luo, Yaoyuan Liang, Jiasheng Lu, Xiaoe Wang, Pei Liu, Ruiliu Fu, Ruqi Huang, Shao-Lun Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2512.04904
Abstract:Flow Matching (FM) has achieved remarkable generative performance, yet it suffers from exposure bias due to discrepancies between training and inference. Existing mitigation strategies typically rely on static constraints or external heuristics. In this work, we propose that exposure bias itself inherently contains dynamic signals that can guide its own rectification. To leverage this, we introduce DEFAR (DirEctional-Frequency Adaptive Rectification). This framework simulates the single-step inference process during training to identify exposure bias. It utilizes directional and frequency-adaptive feedback signals from the bias itself to enhance the model’s bias tolerance. It consists of two key components: (1) Anti-Drift Rectification (ADR). ADR treats inference-time drift as a signal to learn the direction to steer deviated states back toward the target. ADR endows the model with intrinsic active self-rectification capabilities; (2) Frequency Compensation (FC). Empirically, we observe that accumulated bias often stems from a lack of low-frequency components in high-noise stages, and exposure bias carries the missing frequency. FC leverages the bias itself as a self-feedback weighting factor to reinforce the missing frequency components. Experiments on CIFAR-10, CelebA-64, and ImageNet-256/512 show that DEFAR outperforms prior baselines and further demonstrates favorable scalability, compatibility, and inference robustness.
[CV-6] HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration ECCV2026
Link: https://arxiv.org/abs/2606.28215
Authors: Jiaxin Li, Yuxiang Wu, Zhenkai Zhang, Xinrui Shi, Haoyuan Wang, Yichen Zhao, Su Linxiang, Chenyang Yu, Mingyu Zhang, Yifan Ding, Boran Wen, Li Zhang, Ruiyang Liu, Yong-Lu Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Accepted to ECCV 2026. 15 pages of main text and 39 pages of appendices. Project page: this https URL
Abstract:Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training VLAs. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often failing under the severe occlusions and complex dynamics inherent in multi-object interactions. To bridge this gap, we propose HAT-4D, the first agentic framework designed to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single video. By integrating VLMs with a multi-level human-in-the-loop feedback mechanism, HAT-4D efficiently resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation, yielding physically plausible assets without relying on expensive multicamera rigs. As a scalable data engine, HAT-4D facilitates the creation of MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, accompanied by a novel multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency. Extensive experiments demonstrate that HAT-4D achieves SOTA performance on most evaluation metrics, while maintaining competitive semantic alignment. Ablation studies show that introducing a small amount of human feedback improves interaction reconstruction. Moreover, the data produced by HAT-4D effectively improves baseline performance when used for fine-tuning. Our data and code are available at this https URL
[CV-7] LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior ICML2026
Link: https://arxiv.org/abs/2606.28182
Authors: Qinhong Zhou, Chuang Gan, Anoop Cherian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ICML 2026
Abstract:Embodied agents operating in decentralized and partially observable environments have attracted growing attention in recent years. However, existing large language model (LLM)-based agents often exhibit behaviors that are misaligned with their partners or inconsistent with the environment state, leading to inefficient cooperation and poor task success. To address this challenge, we propose a novel framework, Learning Laws of Cooperation (LLawCo), that enables embodied agents to autonomously align with both their partners and task objectives. Our framework allows agents to reflect on past failures to extract misaligned behavioral patterns, which are used to derive high-level behavioral laws, such as “Talk when necessary” and “Wait for partner.” These laws are explicitly incorporated into the agents’ chains of thought via supervised fine-tuning, aligning their reasoning with task requirements and the behavior of other agents. To evaluate our approach, we introduce PARTNR-Dialog, a large-scale multi-agent communicative and cooperative planning benchmark built on the PARTNR environment. Experiments on existing tasks and our new benchmark demonstrate significant improvements in cooperative efficiency and task success rates. Across four backbone LLMs, our method achieves average success rate improvements of 4.5% on the PARTNR-Dialog benchmark and 6.8% on the TDW-MAT benchmark over state-of-the-art open-source communicative agent frameworks. See the LLawCo project page for details: this https URL
[CV-8] EchoSonar-R: A Multi-View Reasoning-Enabled Model for Disease Classification and Report Generation in Echocardiography
Link: https://arxiv.org/abs/2606.28164
Authors: Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Echocardiography is the most widely used non-invasive cardiac imaging modality, providing essential information for cardiovascular diagnosis. Interpreting an echocardiogram requires synthesizing complementary evidence across multiple heart views to identify abnormalities and produce structured clinical reports. While recent efforts focus on improving classification performance, most models lack explicit diagnostic reasoning and spatially grounded anatomical evidence, limiting clinician trust. We present EchoSonar-R, a multi-view reasoning-enabled vision-language model that jointly performs multi-label disease classification and report generation from echocardiography studies. EchoSonar-R combines a spatiotemporal video encoder with a structure-aware cardiac detector that provides spatially grounded anatomical cues to improve interpretability and clinician trust during cross-view reasoning. EchoSonar-R is trained in two stages: supervised fine-tuning (SFT) on reasoning-annotated targets, followed by Group Relative Policy Optimization (GRPO) with task-specific rewards that jointly align classification and report generation within a unified reinforcement-learning framework. Across a private multi-view dataset and two public benchmarks, EchoSonar-R improves macro balanced accuracy by 17.1% on the private set and 6.1% on MIMICEchoQA over the strongest baseline, achieves a GREEN clinical faithfulness score of 0.800, and produces interpretable reasoning traces grounded in multi-view visual evidence.
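The second training stage in the abstract above uses Group Relative Policy Optimization (GRPO). Its core idea, scoring each sampled response against the other responses drawn for the same input, can be sketched as below. The reward values are hypothetical, and the use of the population standard deviation is a choice made here (implementations differ); the abstract does not describe the paper's task-specific rewards in enough detail to reproduce them.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against a zero spread
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical rewards for four responses sampled for the same study.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Responses that beat their group's average get a positive advantage and are reinforced, while below-average ones are discouraged; no separate value network is needed to provide the baseline.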
[CV-9] Toward Robust In-Context Segmentation via Concept Guidance ECCV2026
Link: https://arxiv.org/abs/2606.28149
Authors: Zhigang Chen, Xiawu Zheng, Rongrong Ji
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ECCV 2026
Abstract:In-context segmentation (ICS) requires a model to segment target regions in a query image using only a few reference images and their corresponding masks, without updating any parameters. Despite recent progress, prior ICS studies have largely overlooked a critical aspect: system robustness, i.e., whether the model can produce stable segmentation results for the same query under different references. In this work, we revisit ICS from the robustness perspective and introduce a novel paradigm, Concept-Guided In-Context Segmentation (CG-ICS), which performs segmentation by extracting high-level semantic concepts from references rather than relying solely on low-level visual matching. Specifically, CG-ICS introduces a concept reasoning module that uses an MLLM to propose candidates and a SAM3-driven scoring function with tree-search refinement to select reliable textual concepts, together with a parallel visual exemplar route that provides query-side spatial grounding via a simple context construction. Both the textual concept and the visual exemplar are then used to activate the segmentation capability of a frozen SAM3 backbone. Extensive experiments on standard ICS benchmarks demonstrate that CG-ICS not only achieves state-of-the-art accuracy but also substantially improves robustness, yielding a more reliable ICS system with significantly reduced variance across diverse reference choices.
[CV-10] Monocular Avatar Reconstruction via Cascaded Diffusion Priors and UV-Space Differentiable Shading ECCV2026
Link: https://arxiv.org/abs/2606.28144
Authors: Hong Li,Minqi Meng,Yanjun Liang,Chongjie Ye,Houyuan Chen,Weiqing Xiao,Xianda Guo,Guojun Lei,Xuhui Liu,Chaojie Yang,Yanlun Peng,Hao Zhao,Baochang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026. Project page: this https URL
Abstract:Reconstructing high-fidelity, relightable 3D avatars from a single in-the-wild image is a challenging ill-posed problem, primarily hindered by the scarcity of high-quality PBR data and the complexity of disentangling illumination from intrinsic materials. In this paper, we present a data-efficient framework that leverages the robust priors of a unified pre-trained diffusion backbone to sequentially address texture completion, delighting, and material decomposition. Unlike existing methods that rely on fragmented pipelines or extensive proprietary datasets, we utilize cascaded Low-Rank Adaptations (LoRAs) to adapt the strong generative prior of the diffusion model for each sub-task in UV space. Specifically, we first employ an Inpainting LoRA to complete missing UV textures caused by occlusion, leveraging the model’s semantic understanding to generate semantically and photometrically coherent details. Subsequently, a Light-Homogenization LoRA and a novel Cross-Intrinsic Attention mechanism are introduced to remove baked-in lighting and collaboratively synthesize pixel-aligned PBR maps (Albedo, Normal, Roughness, Specular, and Displacement). To ensure physical plausibility, we impose a UV-space differentiable BRDF shading loss during the decomposition stage, forcing the generative process to adhere to the rendering equation without the artifacts typical of rasterization-based supervision. Extensive experiments demonstrate that our method, trained on fewer than 100 real 3D scans, generates comprehensive, 4K-resolution PBR assets with superior realism and generalization compared to state-of-the-art methods, and all training code and model weights will be released upon acceptance.
[CV-11] Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots
Link: https://arxiv.org/abs/2606.28133
Authors: Sijin Chen,Kaixuan Jiang,Haixin Shi,Yanhui Wang,Weiheng Zhong,Haosheng Li,Bo Jiang,Yuxiao Liu,Xihui Liu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:We study whether novel manipulation skills can be transferred from human actions to a bi-manual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it one of the most promising resources for scaling up robot learning. Yet transferring skills from humans to robots remains hard: most prior work treats humans as just another bi-manual 6DoF embodiment, where hand-pose estimates are noisy and the contact patterns of human fingers differ fundamentally from those of a parallel gripper. We argue that learning rotation-inclusive action signals from human data is therefore sub-optimal, and instead propose a bridging action representation: the relative wrist translation within the initial head-camera frame, an action space shared by humans and robots. To handle the potential absence of certain action components in different embodiments, we build a π₀-like vision-language-action model with interleaved action tokens and attention masking. On a suite of novel bi-manual manipulation tasks, our bridging action transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions and scales with the amount of human data.
[CV-12] PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation
Link: https://arxiv.org/abs/2606.28128
Authors: Peiwen Zhang,Yufan Deng,Shangkun Sun,Juncheng Ma,Duomin Wang,Jonas Du,Zilin Pan,Ye Huang,Hao Liang,Songyan Huang,Ruihua Zhang,Enze Xie,Ming-Yu Liu,Daquan Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Github: this https URL Project website: this https URL
Abstract:Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific data fine-tuned models can still produce physically implausible manipulations, including discontinuous motion trajectories and inconsistent robot-object interactions, which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact. Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations extracted from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines, improving the Wan2.2-I2V-A14B and Cosmos3-Nano base models on R-Bench by 22.3% and 9.2% (7.1% and 3.7% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, as a world model under the WorldArena action-planner protocol it raises the closed-loop success rate from 16.0% to 24.0% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.
[CV-13] Higher-Order Fourier Neural Operator: Explicit Mode Mixer for Nonlinear PDEs
Link: https://arxiv.org/abs/2606.28122
Authors: Alex Colagrande,Paul Caillon,Eva Feillet,Alexandre Allauzen
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 46 pages
Abstract:Neural operators provide deep neural networks for learning mappings between function spaces. Among them, the Fourier Neural Operator (FNO) is particularly effective: its spectral convolution relies on low-dimensional Fourier-domain representations and can handle inputs at different resolutions. This design aligns well with settings where the Fourier basis diagonalizes the underlying operator, such as linear, constant-coefficient PDEs on periodic domains, in which Fourier modes evolve independently. However, nonlinear PDEs may benefit from an additional inductive bias, as they exhibit structured interactions between modes, governed by polynomial nonlinearities. To capture this inductive bias, we introduce the Higher-Order Spectral Convolution, a spectral mixer that extends FNO from diagonal modulation to explicit n-linear mode mixing, aligned with the dynamics of nonlinear PDEs. Our experiments on standard benchmarks show that the proposed Higher-Order FNO (HO-FNO) retains the efficiency of FNO-based architectures and consistently improves over other spectral neural operators. HO-FNO also performs on par with or better than state-of-the-art transformers and state-space models on several datasets, with stronger gains in highly nonlinear regimes, such as the Poisson equation with polynomial forcing, where a single HO-FNO layer outperforms FNO models with up to 16 layers. We open-source our code for reproducibility at: this https URL.
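The claim that polynomial nonlinearities couple Fourier modes follows from the convolution theorem: a pointwise product in physical space is a circular convolution in Fourier space. The following is a minimal pure-Python check of that identity for a quadratic term. It illustrates the inductive bias only and is not the HO-FNO layer; the names `dft` and `mix_modes` are hypothetical.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (O(N^2)), adequate for a small demo."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

def mix_modes(u_hat, v_hat):
    """Circular convolution of two spectra, scaled by 1/N.

    For the unnormalized DFT above this equals the spectrum of the pointwise
    product u * v, so every output mode k gathers contributions from all
    pairs of input modes (m, k - m).
    """
    n = len(u_hat)
    return [sum(u_hat[m] * v_hat[(k - m) % n] for m in range(n)) / n
            for k in range(n)]

# Toy signal on a periodic grid (values are arbitrary).
u = [0.0, 1.0, 0.5, -1.0, 0.25, 2.0, -0.5, 0.75]
u_hat = dft(u)

# Spectrum of the quadratic term u^2, computed two ways.
direct = dft([a * a for a in u])
mixed = mix_modes(u_hat, u_hat)
max_gap = max(abs(a - b) for a, b in zip(direct, mixed))
```

A diagonal spectral multiplier scales each mode independently and cannot reproduce this sum over mode pairs, which is the gap an explicit mode mixer is meant to address.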
[CV-14] BiDeMem: Bidirectional Degradation Memory for Explainable Image Restoration
Link: https://arxiv.org/abs/2606.28112
Authors: Xinrui Wu,Lichen Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Degradation-aware prompts, conditions, and latent priors are increasingly used in image restoration, yet they are usually judged by a single endpoint: whether the restored image obtains higher PSNR. This is a weak test of semantics. A condition can help by adding capacity, acting as a global correction bias, or exploiting dataset shortcuts, without becoming an interpretable degradation prior. We propose BiDeMem, a bidirectional degradation memory for explainable image restoration. A query built from restoration features and input statistics retrieves a compact top-k subset of memory slots. The same selected slot identity supports the restoration path at inference time and a training-only forward-degradation explanation path. The study centers on verifiability in a controlled multi-degradation NAFNet setting. New controls separate the gain from a correction head alone, a dense query prior, and a static global prior: these variants are 0.2588, 0.2586, and 0.2839 dB below BiRank, respectively. Strong residual supervision and a wider degradation head also remain below the full bidirectional memory model. Intervention probes show that BiRank preserves restoration quality while increasing wrong-prior and native-prior sensitivity, framing degradation memory as both a restoration module and a falsifiable explanation mechanism.
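The retrieval step described here, in which a query selects a compact top-k subset of memory slots, might look roughly like the sketch below. Dot-product scoring and softmax weighting over the selected slots are assumptions made for illustration; the abstract does not specify either, and `retrieve_top_k` is a hypothetical name.

```python
import math

def retrieve_top_k(query, memory_keys, k):
    """Return indices and softmax weights of the k slots whose keys best match the query."""
    scores = [sum(q * m for q, m in zip(query, key)) for key in memory_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only (shifted by the max for stability).
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

# Toy memory with four slots and a 2-D query (all values hypothetical).
memory_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [0.9, 0.2]
slots, weights = retrieve_top_k(query, memory_keys, k=2)
```

Because the function returns explicit slot indices, the same selection can be handed to two different consumers, in the spirit of the shared slot identity between the restoration path and the explanation path.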
[CV-15] Cross-view Multimodal Vision-Based Assessment Framework for Traditional Chinese Medicine Rehabilitation Training
Link: https://arxiv.org/abs/2606.28104
Authors: Francis Xiatian Zhang,Hao Yao,Shengxuan Chen,Hong Zhu,Hongxiao Jia,Sisi Zheng,Hubert P. H. Shum
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published in IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2026
Abstract:Vision-based assessment can provide convenient and cost-effective evaluation in Traditional Chinese Medicine (TCM) rehabilitation training, where action quality assessment (AQA) from computer vision offers a promising solution. Existing automatic AQA frameworks for physical therapy typically rely on skeletal data captured from a single viewpoint, which is inefficient for TCM techniques such as acupuncture or Tuina that involve dense hand self-occlusion and complex hand-object interactions. To address these challenges, we propose CME-AQA, a cross-view, multimodal vision-based assessment framework that integrates visual-pose fusion to enhance understanding of environmental context and leverages both first-person and third-person videos during training to improve inference robustness. We collected two dual-view datasets, TCM-AQA61-A (Acupuncture) and TCM-AQA61-T (Tuina), each containing synchronized first-person and third-person recordings of 61 subjects with expert annotations. Experimental results show that our approach achieves superior or comparable mean performance against competitive baselines, achieving over 10% relative improvement in weighted F1 over the best competing method on key rating tasks such as Needle Depth and Quick Needle Insertion, while also reducing mean absolute error in quantitative measures such as insertion time and manipulation frequency. Testing on a CPR dataset further demonstrates comparable performance on several posture-based criteria, suggesting applicability to related structured simulated clinical skill assessments where participant motion is central to evaluation. Overall, CME-AQA enhances assessment accuracy for structured TCM rehabilitation training and facilitates more convenient and effective training-oriented skill evaluation.
[CV-16] OSOR: One-Step Diffusion Inpainting for Effect-Aware Object Removal
Link: https://arxiv.org/abs/2606.28094
Authors: Qinming Zhou,Chenxi Sun,Deyang Kong,Junhao He,Xiangheng Tang,Peike Yu,Haotian Wu,Leilei Cao,Linfeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code and resources are available at this https URL
Abstract:Real-world object removal is challenging due to two key difficulties: the target object’s non-local effects, such as shadows and reflections, which are difficult to model, and the fact that user-provided masks are often inaccurate or incomplete. With billions of parameters and tens of denoising steps, diffusion-based models achieve strong removal performance at the expense of substantial computational cost, limiting their use in interactive applications and on edge devices. To address these challenges, we present OSOR (One-Step Object Removal), which simultaneously achieves efficient, effect-aware, and mask-robust object removal. Concretely, OSOR introduces: (1) an occupancy-guided discriminator for precise boundary supervision, enabling stable single-step diffusion training; (2) an alpha head that leverages knowledge from pretrained diffusion models to predict appropriate removal regions with minimal overhead, thereby handling imperfect masks; and (3) a semantic-anchored verification pipeline (SAVP) that filters noisy instruction-based triplets to produce effect-aware supervision at scale. Using SAVP, we curate CORNE, which contains 280K verified removal pairs, and further annotate AnimeEraseBench and TextEraseBench to evaluate performance on more complex removal tasks. Experiments show that OSOR surpasses strong multi-step diffusion baselines in perceptual quality while achieving 4× to 30× faster inference.
[CV-17] Diffusion Model Attribution via Spectral Coupling of Denoiser Responses
Link: https://arxiv.org/abs/2606.28092
Authors: Pragati Shuddhodhan Meshram,Varun Chandrasekaran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Attributing a generated image to its source diffusion model is a fundamental challenge in provenance verification and intellectual property protection. This problem is particularly difficult because diffusion models trained on different datasets can converge to similar score functions and thus similar output distributions, making the generated images themselves unreliable as attribution evidence. Existing non-invasive methods either fail on architecturally similar variants or rely on signals that vanish when models share the same autoencoder. We propose Spectral Denoising Signatures (SDS), a non-invasive attribution method that identifies the source model by fingerprinting each candidate model’s denoising behavior. Our key insight is that a model’s denoising score function exhibits a distinctive spectral geometry, reflected in how it redistributes energy across spatial frequency bands during denoising. By probing this behavior with frequency-controlled perturbations, SDS extracts a stable signature that is intrinsic to the model, requiring only standard forward passes with no inversion, optimization, or generation-time enrollment. Our results demonstrate that SDS achieves approximately 99.9% accuracy across eight diverse diffusion models and 96.2% under cross-domain prompt shift, outperforming non-invasive baselines across variations in training data, architecture, and training procedure, establishing spectral geometry as a principled and practical basis for diffusion model attribution. Code is available at: this https URL
[CV-18] RPM-Distill: Physiology-guided Adaptive Cross-modal Distillation for Robust Remote Physiological Measurement ECCV2026
Link: https://arxiv.org/abs/2606.28089
Authors: Jiyao Wang,Qingyong Hu,Duoxun Tang,Xiao Yang,Kaishun Wu,Jiangbo Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2026
Abstract:Video-based remote physiological measurement (RPM) is highly accessible but remains fragile under varying illumination, skin tones, and motion. Radio frequency (RF) radar is largely invariant to illumination and appearance, providing complementary cardio-respiratory micro-motion cues; however, requiring radar at inference is often impractical due to its limited ubiquity and deployment overhead. We propose RPM-Distill, a physiology-guided cross-modal distillation framework that leverages synchronized radar only during training while retaining video-only inference. Our key observation is that although RGB and RF waveforms differ in sensing physics and time-domain morphology, they share a similar latent periodic rhythm in the frequency domain. We thus distill physiology-structured spectral evidence to improve robustness, via losses that (i) anchor the fundamental peak, (ii) match the off-peak background distribution, and (iii) preserve spectral morphology and sharpness. To avoid negative transfer under sample-level teacher quality and alignment uncertainty, a spectral policy network predicts sample-level distillation gates and component weights from the student–teacher spectral relation map, learned with a meta bilevel objective on a small labeled validation split. Through extensive experiments in challenging conditions and cross-dataset settings, RPM-Distill improves MAE by 81% and correlation by 21% over unimodal baselines. Code is at this https URL.
[CV-19] TextDS: Parameter-Efficient Representation Alignment for Scene Text Detection under Distribution Shifts ECCV2026
Link: https://arxiv.org/abs/2606.28077
Authors: Boyuan Chen,Zichen Dang,Chuang Yang,Lap-pui Chau,Yi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026. Project page: this https URL
Abstract:In real-world deployments, scene text detectors inevitably face distribution shifts beyond the training distribution. Prior work often depends on large-scale scene-text pretraining, yet evaluation under cross-domain changes and real-world imaging degradations remains limited. We propose TextDS, an efficient framework for scene text detection under distribution shifts. First, we propose a data-efficient dual-encoder design with visual foundation models, eliminating the reliance on large-scale scene-text pretraining. Second, we introduce Step-wise LoRA adaptation (SWLoRA), which performs progressive low-rank refinement with a dynamic early-exit mechanism for effective feature adaptation. Third, we propose Common Subspace Fusion (CSF) to align and fuse the two branches in a shared subspace while retaining complementary, shift-robust information. Finally, we construct adverse-condition scene text detection datasets to address the gap in evaluating under imaging degradation. Experiments show that TextDS achieves competitive performance in scene text detection, demonstrating robustness across domains and adverse imaging conditions with only 4.9M trainable parameters.
[CV-20] ReScene: Structured Indoor Scene Reconstruction from Multi-View Captures
Link: https://arxiv.org/abs/2606.28060
Authors: Haoran Xu,Lechao Zhang,Daoguo Dong,Yan Gao,Xin Tan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Constructing simulation-ready 3D scenes from multi-view captures is a key bottleneck for Embodied Artificial Intelligence, as downstream tasks require object-level structure, explicit inter-object relations, and physical plausibility. Existing approaches either rely on specialized capture hardware, suffer from single-view bias in object reconstruction, or yield layouts that are geometrically reasonable but physically inconsistent. We identify that the problem is not single-object reconstruction but cross-view relation fusion and physically plausible scene assembly. To address this challenge, we present ReScene, a framework that threads multi-view geometry throughout the pipeline as a unifying prior. Our method consists of two main components: HierView prioritizes reconstruction views based on semantic consistency and 3D coverage completeness, replacing the largest-mask heuristic that conflates image occupancy with object coverage; and Relation-Aware Assembly fuses multi-frame relation predictions from a vision-language model with geometric and room-shell priors into a confidence-weighted scene graph, enabling physically consistent scene assembly. ReScene sets a new state of the art across geometry, rendering, and perceptual quality on a set of ScanNet scenes, achieving a 17% reduction in Chamfer Distance and 26% in LPIPS over the strongest prior baseline, while running up to 10x faster than prior multi-view methods. Based on the reconstructed scenes, we also generate an embodied visual question answering dataset, on which fine-tuned Qwen-VL approaches the performance of strong closed-source models on several spatial reasoning tasks.
[CV-21] AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration
Link: https://arxiv.org/abs/2606.28049
Authors: Haotian Li,Yida Wang,Leyuan Wang,Jinshan Lai,Keyang Wang,Zonghao Guo,Qiang Ma,Liuyu Xiang,Jianwei Hu,Zhaofeng He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In recent years, multimodal large language models (MLLMs) have shown strong potential for embodied intelligence, yet their ability to maintain geometrically consistent spatial understanding across heterogeneous views remains under-evaluated. Existing benchmarks largely focus on single-agent, single-view perception, leaving a gap in the systematic assessment of collaborative air-ground settings, where multi-scale observations are complementary but introduce scale mismatch, asymmetric occlusion, and reference-frame inconsistencies. We present AirGroundBench, a diagnostic benchmark for evaluating multi-view spatial intelligence in heterogeneous UAV-UGV collaboration. AirGroundBench is built from 11 high-fidelity simulated environments with 1,021 synchronized air-ground observation pairs, yielding approximately 62,000 dual-view, four-option single-choice visual question answering instances and 115 closed-loop vision-language navigation episodes. It covers 10 task types organized into four progressively demanding capability dimensions: spatial perception, cross-view alignment, spatial transformation and reasoning, and embodied decision-making. To support geometry-grounded evaluation and analysis, we provide structured spatial annotations, including cross-view object identities and metric 2D and 3D bounding boxes. Evaluations of 13 representative MLLMs under UAV-only, UGV-only, and dual-view input settings reveal consistent bottlenecks: models perform relatively well on spatial perception but struggle with cross-view alignment and transformation-intensive reasoning, and these deficits propagate to sequential decision-making in vision-language navigation. Although dual-view inputs provide measurable gains over single-view variants, a persistent gap from human performance remains, highlighting geometric consistency as a key limitation of current embodied MLLMs.
[CV-22] Mind the Gap: Quantifying the Domain Gap in Cross-Sensor Diffusion Super-Resolution
Link: https://arxiv.org/abs/2606.28039
Authors: Dawid Kopeć,Katarzyna Jabłońska,Wojciech Kozłowski,Maciej Zięba
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26th International Conference on Computational Science
Abstract:Demand for high-resolution satellite imagery has increased interest in super-resolution (SR) to bridge the spatial resolution gap between freely available missions such as Sentinel-2 and commercial systems like PlanetScope. Because no sensor provides true paired low- and high-resolution observations, SR models are usually trained on synthetically degraded data, creating a domain gap on real cross-sensor imagery. In this work, we provide the first systematic study of how this synthetic-to-real mismatch affects the performance of modern diffusion-based SR models. Using a large, geometrically and temporally aligned dataset of Sentinel-2 and PlanetScope imagery, we evaluate five state-of-the-art diffusion architectures under controlled experimental settings. We also introduce LPIPS-Sat, a domain-adapted perceptual metric based on Sentinel-2 self-supervised features. Our results show two persistent challenges: synthetically trained models degrade sharply on real pairs, while models trained on real cross-sensor data exhibit optimisation difficulties and struggle to adapt to the physical and radiometric diversity. These findings highlight a key limitation of current SR and motivate methods that disentangle super-resolution from domain adaptation.
[CV-23] EMOSH: Expressive Motion and Shape Disentanglement for Human Animation ECCV2026
Link: https://arxiv.org/abs/2606.28026
Authors: Dongbin Zhang,Hao Liu,Binquan Dai,Kangjie Chen,Chuming Wang,Chen Li,Jing Lyu,Haoqian Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026, Project Page: this https URL
Abstract:High-fidelity and expressive controllable human animation is essential for content creation and digital avatar applications. However, existing methods face a dilemma between expressiveness and disentanglement. Mainstream 2D pose-conditioned approaches suffer from “motion-shape entanglement”, leading to the leakage of the driving subject’s body shape. Conversely, methods relying on 3D priors (e.g., SMPL) achieve geometric disentanglement but struggle to capture facial expressions and complex gestures, resulting in rigid animations. To this end, we propose EMOSH, a novel framework for high-fidelity controllable human video generation. First, an Expressive Human Model (EHM) is introduced as the core control representation. By explicitly disentangling shape and pose parameters, we fundamentally resolve the body shape leakage issue. Alongside this, a robust motion tracker is designed to accurately estimate EHM parameters from video. Second, we propose a Coarse-to-Fine Hybrid Motion Injection strategy, enabling more fine-grained control over expressions and gestures. Furthermore, we introduce a Spatially-Aligned Conditioning mechanism to bridge the domain gap between training and inference, improving identity consistency. Extensive experiments demonstrate that EMOSH outperforms previous methods in both self-driven and cross-driven scenarios, producing high-fidelity videos with vivid expressions while maintaining shape disentanglement.
[CV-24] TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL
Link: https://arxiv.org/abs/2606.28016
Authors: Jing Wang,Xiangxin Zhou,Jiajun Liang,Kaiqi Liu,Wanyun Pang,Zhenyu Xie,Tianyu Pang,Xiaodan Liang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Autoregressive (AR) video diffusion models enable low-latency streaming generation by synthesizing videos chunk by chunk with cached visual context, but this chunk-wise formulation makes temporal instruction following ambiguous. A single global prompt does not specify which sub-event should be realized in each chunk, while naively switching to step-wise prompts often leads to delayed reactions, blended step semantics, and error propagation across prompt transitions. These failures are difficult to address with supervised fine-tuning or distillation alone: SFT suffers from exposure bias, while rollout-based distillation still optimizes low-level denoising or teacher-distribution matching rather than directly enforcing action ordering and prompt-transition correctness. We address these challenges with TempAct, a planner–executor reinforcement learning framework that jointly optimizes temporal decomposition and step-conditioned execution for temporally plausible AR video generation. TempAct uses an LLM planner to explore span-aware step prompts that are executable by the video model, and trains an AR diffusion executor to follow these prompts under its own generated histories. Its key mechanism is hierarchical group exploration: candidate plans form planning groups, and each plan induces an execution group of multiple continuations from a shared visual context, enabling plan-level credit assignment for long-horizon temporal outcomes and executor-level credit assignment for prompt-switch behavior. We further design hierarchical rewards that combine plan-quality and full-video temporal feedback for the planner with local transition-level step-following rewards, aesthetic regularization, and KL constraints for the executor. Experiments on Self-Forcing and LongLive show that TempAct improves temporal consistency while preserving overall visual quality.
[CV-25] Curriculum-guided Change Detection Training: Toward Accurate Serac Fall Monitoring
Link: https://arxiv.org/abs/2606.28012
Authors: Arthur Dérédel,Carlos Crispim-Junior,Pierre Lemaire,Johan Berthet,Laure Tougne Rodet
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, 11 pages, 5 figures
Abstract:Change Detection (CD) aims to identify semantic or structural changes from nearly registered multi-temporal images. While recent advances in training methodologies have largely focused on semi-supervised learning and consistency regularization, alternative training paradigms remain underexplored. In particular, most deep CD methods rely on uniform sampling during training, implicitly assuming that all training samples contribute equally to the optimization process. However, such naive sampling can introduce noisy gradients and hinder robust representation learning. To address this limitation, we propose a curriculum learning framework tailored for change detection. Our approach investigates two complementary difficulty measures: the Solar Angular Gap (SAG), a physically grounded proxy for acquisition-condition variability, and the Structural Similarity Index Measure (SSIM), which evaluates appearance similarity between image pairs. Based on these criteria, the framework progressively introduces challenging samples during training, enabling models to learn robust representations in a coarse-to-fine manner. We evaluate our method on the challenging SeracFallDet benchmark, where results demonstrate consistent improvements of the proposed approach over standard uniform-sampling strategies for both pixel-based and object-based approaches. These results highlight the potential of curriculum learning to improve robustness in deep change detection. Importantly, our training framework is orthogonal to existing CD architectures, making it readily applicable to a broad range of methods.
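The curriculum idea above, in which training samples are ranked by a difficulty measure and harder ones are introduced progressively, can be sketched as follows. The linear pacing schedule, the `start_fraction` parameter, and the use of 1 - SSIM as the difficulty score are assumptions chosen for illustration; the abstract does not state the pacing function the authors use.

```python
def curriculum_subset(samples, difficulty, epoch, total_epochs, start_fraction=0.4):
    """Return the samples available at a given epoch, easiest first.

    difficulty holds one score per sample, higher meaning harder (for example
    1 - SSIM between the two images of a pair, or their solar angular gap).
    The available fraction grows linearly from start_fraction to 1.0.
    """
    order = sorted(range(len(samples)), key=lambda i: difficulty[i])
    progress = epoch / max(total_epochs - 1, 1)
    fraction = min(1.0, start_fraction + (1.0 - start_fraction) * progress)
    count = max(1, round(fraction * len(samples)))
    return [samples[i] for i in order[:count]]

# Hypothetical image pairs with precomputed SSIM values.
pairs = ["pair_a", "pair_b", "pair_c", "pair_d", "pair_e"]
ssim = [0.95, 0.40, 0.80, 0.60, 0.90]
difficulty = [1.0 - s for s in ssim]

first_epoch = curriculum_subset(pairs, difficulty, epoch=0, total_epochs=5)
last_epoch = curriculum_subset(pairs, difficulty, epoch=4, total_epochs=5)
```

Since the schedule only decides which samples are drawn, it leaves the model and loss untouched, consistent with the abstract's statement that the framework is orthogonal to the CD architecture.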
[CV-26] HumanMoveVQA: Can Video MLLMs reason about human movement in videos?
Link: https://arxiv.org/abs/2606.27999
Authors: Pulkit Gera,Faegheh Sardari,Asmar Nadeem,Valentina Bono,Padraig Boulton,Adrian Hilton,Armin Mustafa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite the rapid advance of Multimodal Large Language Models (MLLMs) in high-level video understanding, a fundamental bottleneck remains: these models collapse complex human motion into coarse semantic labels. Existing benchmarks mostly focus on scene-centric events or local joint articulations, failing to probe global human motion in space over time (trajectory and orientation changes). We introduce HumanMoveVQA, the first comprehensive benchmark designed to evaluate global trajectory and orientation reasoning from an exocentric perspective. Our benchmark utilizes a first-frame anchored world coordinate system, preserving translation and rotation relative to a fixed starting point. We propose a scalable, multi-stage pipeline that lifts 2D video observations into world-consistent 3D motion tracks to generate over 10K structured question-answer pairs across seven reasoning categories, including motion aggregation, sequential ordering, and trajectory-level inference. Our extensive evaluation reveals a critical capability gap in state-of-the-art proprietary models on deep human motion understanding. However, we demonstrate that this is a learnable problem; by fine-tuning an open-source baseline with our targeted, world-consistent supervision, we achieve a significant improvement. This establishes a rigorous geometric foundation for developing next-generation, movement-aware video understanding models.
[CV-27] Latent Visual Diffusion Reasoning with Monte Carlo Tree Search ECCV2026
Link: https://arxiv.org/abs/2606.27988
Authors: Xirui Teng,Nan Xi,Junsong Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Project page: this https URL
Abstract:Analyzing fine-grained skill activities (e.g., sports, surgery) requires not only recognizing visual patterns but also performing step-by-step visual reasoning that leads to the final judgment. While recent advances in action quality assessment have achieved remarkable progress in evaluating performance, existing models remain black boxes, where they lack the ability to explicitly reveal the reasoning processes underlying their judgments. To address this limitation, we propose Latent Visual Diffusion Reasoning (LVDR), a novel framework that integrates keypoint-guided Monte Carlo Tree Search (MCTS) to model and visualize the latent visual reasoning process. LVDR not only produces more accurate skill assessments but also uncovers the critical visual reasoning sequences that contribute to the final evaluation. Extensive experiments across four datasets spanning diverse sports and surgical domains demonstrate that LVDR achieves competitive quantitative performance while providing interpretable visual reasoning trajectories leading to the final predictions. Source codes and models can be found through the following link: this https URL.
[CV-28] Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation
链接: https://arxiv.org/abs/2606.27978
作者: Jiayi Xu,Di He,Guolin Ke
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Pixel-space continuous-token autoregressive (AR) generation directly models images as sequences of raw pixel patches, avoiding discrete tokenization or a separately pretrained tokenizer. However, it faces coupled challenges: high-dimensional patch generation causes large single-step errors, and teacher-forced training creates a train–inference gap that makes these errors accumulate across AR steps. Existing fixes such as x-prediction and input noise injection only partially mitigate these issues. Exact rollout training better matches inference-time conditions, but is impractical due to prohibitively slow sequential sampling. We propose Parallel Rollout Approximation (PRA), a scalable framework that addresses both challenges jointly. PRA generates low-dimensional intermediate states instead of high-dimensional pixel patches, then maps them back to pixel-space tokens with a pixel decoder, preserving a pixel-in, pixel-out AR interface. It also constructs inference-like pixel inputs through the same intermediate-state-to-pixel path used at inference, independently across positions, approximating the pixel-feedback interface encountered during inference-time rollout while retaining parallel teacher-forced training. On class-conditional ImageNet-1K generation at 256×256 resolution, PRA-S with 135M parameters achieves an FID of 2.58, surpassing the previous billion-scale pixel-space AR result of 3.60. Scaling to PRA-L with 511M parameters further improves FID to 1.94, establishing a new state of the art among pixel-space AR models. Beyond generation, PRA achieves higher ImageNet classification probing accuracy than other AR and diffusion baselines, suggesting its potential for unified pixel-space image generation and understanding.
[CV-29] ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering
链接: https://arxiv.org/abs/2606.27974
作者: ZhengXian Wu,Hangrui Xu,Kai Shi,Zhuohong Chen,Yunyao Yu,Chuanrui Zhang,Zirui Liao,Jun Yang,Zhenyu Yang,Haonan Lu,Haoqian Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoning. We propose ProMSA, a progressive multimodal search agent for KB-VQA. Given an image-question pair, the agent iteratively chooses image search, text search, or stop, under explicit tool-call budgets and with deduplication to avoid redundant retrieval. For training, we first use rejection-sampling SFT to learn valid tool-use formats, then optimize the agent with TN-GSPO, a sequence-level RL objective that normalizes updates by both generation length and tool-interaction depth. Experiments on E-VQA and InfoSeek show consistent gains over strong RAG and agent baselines, and improved retrieval and end-to-end accuracy. The code is available at this https URL.
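The iterative choose-a-tool loop with per-tool budgets and deduplication can be sketched as follows. This is only an illustration of the control flow described in the abstract: `run_search_agent`, the callable `policy`, the tool names, and the choice to stop when a requested tool has no budget left are all assumptions, and deduplication is applied here to retrieved items (the paper may deduplicate differently).

```python
from typing import Callable, Dict, List, Tuple

# (tool name, query); the tool is "image_search", "text_search" or "stop".
Action = Tuple[str, str]


def run_search_agent(policy: Callable[[List[str]], Action],
                     tools: Dict[str, Callable[[str], List[str]]],
                     budgets: Dict[str, int],
                     max_steps: int = 8) -> List[str]:
    """Collect evidence until the policy stops or a tool budget runs out."""
    evidence: List[str] = []
    seen = set()
    remaining = dict(budgets)  # per-tool call budget (assumed interface)
    for _ in range(max_steps):
        tool, query = policy(evidence)
        # Assumed rule: asking for an exhausted tool ends the episode.
        if tool == "stop" or remaining.get(tool, 0) <= 0:
            break
        remaining[tool] -= 1
        for item in tools[tool](query):
            if item not in seen:  # skip results that were already retrieved
                seen.add(item)
                evidence.append(item)
    return evidence
```

In the paper the policy is a trained multimodal model; here it is any function that maps the evidence gathered so far to the next action.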
[CV-30] Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control
链接: https://arxiv.org/abs/2606.27964
作者: Haoyuan Wang,Yabo Chen,Haibin Huang,Chi Zhang,Xuelong Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Building interactive world models requires generating realistic videos while maintaining controllable dynamics over long horizons. Autoregressive video generation offers a scalable foundation, but suffers from error accumulation and temporal degradation during extended rollouts. This issue is further amplified under heterogeneous controls such as human motion and camera trajectories, which may interfere and destabilize a pretrained video prior, while existing methods often trade off controllability and visual quality. We propose “Directing the World”, a fast autoregressive framework for controllable world-model video generation with compositional human-motion and camera-trajectory control. Our key idea is to decouple control learning while preserving a unified autoregressive video prior. We introduce a Fast-Slow Memory training strategy to stabilize long-horizon rollout learning and improve convergence. For human motion control, we design a t-guided Dynamic Projection mechanism and a refined Motion-CFG strategy, enabling temporally smooth and accurate motion alignment without degrading visual fidelity, and supporting multi-person control. After learning a robust motion prior, we introduce a second-stage camera-trajectory control module to compose human dynamics with viewpoint changes for coherent world exploration. We further construct a large-scale dataset with synchronized video, text, human-motion, and camera-trajectory annotations, organized into motion-centric and camera-centric subsets for decoupled training. Extensive experiments show stable long-horizon generation with precise controllability and high visual quality. See more at this https URL.
[CV-31] Understanding How MLLMs Describe Artworks Using Token Activation Maps ICPR2026
链接: https://arxiv.org/abs/2606.27947
作者: Nicola Fanelli,Pasquale De Marinis,Raffaele Scaringi,Eva Cetinic,Gennaro Vessio,Giovanna Castellano
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at PRESTIGE workshop at ICPR 2026
Abstract:Multimodal Large Language Models (MLLMs) describe artworks with remarkable fluency, yet the visual reasoning behind their outputs remains opaque. When an MLLM names a style, identifies a subject, or recognizes an iconographic symbol, does it ground each claim in the relevant region of the canvas, draw on an undifferentiated visual signal, or rely primarily on textual priors? We study this using the Token Activation Map (TAM), which produces, for each generated token, a heatmap isolating the visual evidence specific to that token from prior-context interference. Applying TAM to a curated set of paintings spanning multiple periods and genres, we analyze grounding patterns across five semantically distinct token categories: common visual objects, style descriptors, metadata, iconographic tokens, and affective expressions. We find that visual grounding varies substantially with token semantics. We further show that MLLMs attempt to identify artworks and artists, achieving higher accuracy in artist attribution than in title prediction, where hallucinations are more frequent. Finally, we compare TAM with SAM 3 open-vocabulary segmentation. To ensure reproducibility, we release our code, experimental configurations, prompts, and qualitative results on the project page at this https URL.
[CV-32] Controllable Histopathology Image Synthesis with Training-free Structural Initialization and Textural Modulation
链接: https://arxiv.org/abs/2606.27935
作者: Yuheng Qiu,Jingyi Luo,Chenfei Ye,Ting Ma,Jianfeng Cao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning has demonstrated remarkable success in high-throughput histopathology image analysis. However, the performance of learning-based models critically depends on the quality and size of annotations by expert pathologists, which is a resource-intensive and time-consuming process. To address the limitations of data scarcity and annotation burden, several methods have been proposed to synthesize paired histopathology data. Nevertheless, these frameworks typically still require annotation data, albeit in reduced quantities, to impose structural constraints during training. In this work, we present CHIS, a plug-in framework that guides the sampling trajectory of a pretrained diffusion model through two key stages: structural initialization at the start and textural modulation during generation. The initial noise state is refined by fusing the phase information from a prior mask with the amplitude of Gaussian noise in the frequency domain, yielding a structurally informed starting point. During the reverse diffusion process, we adaptively modulate both coarse-grained and fine-grained textures at different wavelet decomposition levels. This enables a diffusion model pretrained solely on unlabeled images to generate outputs that align with prior structural masks while preserving the reference tissue style. We conducted extensive experiments demonstrating the superiority of CHIS in generation fidelity and its substantial benefits for downstream segmentation tasks. Code is available at this https URL.
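The phase/amplitude fusion used for structural initialization can be illustrated on a 1D toy signal. The sketch below takes the Fourier phase of a mask and the Fourier amplitude of Gaussian noise and inverts the result; the helper names (`dft`, `idft`, `structural_init`) and the naive O(n²) transform are assumptions made to keep the example dependency-free, and the paper's actual 2D procedure may differ in its details.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Naive discrete Fourier transform (fine for toy sizes)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X: Sequence[complex]) -> List[complex]:
    """Inverse of `dft`."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def structural_init(mask: Sequence[float], noise: Sequence[float]) -> List[float]:
    """Combine the phase of `mask` with the amplitude of `noise`."""
    M, N = dft(mask), dft(noise)
    fused = [abs(nk) * cmath.exp(1j * cmath.phase(mk)) for mk, nk in zip(M, N)]
    # For real inputs the fused spectrum is conjugate-symmetric, so the
    # imaginary part of the inverse is only floating-point residue.
    return [z.real for z in idft(fused)]
```

The output keeps the amplitude spectrum of the noise while inheriting where the mask places its structure, which is the intuition behind a "structurally informed starting point".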
[CV-33] Home3D 1.0: A High-Fidelity Image-to-3D Asset Generation System for Interior Design
链接: https://arxiv.org/abs/2606.27923
作者: Yiyun Fei,Guoqiu Li,Jin Song,Chuqiao Wu,Delong Wu,Hong Wu,Ziru Zeng,Haohui Chen,YinDong Kong,Jing Li,Qi Wu,Feng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figures, 2 tables; technical report
Abstract:We present Home3D 1.0, a modular image-to-3D generation system that produces high-quality 3D assets from a single reference image, targeting interior design and e-commerce applications. Given a photograph of a furniture or decor item, the system outputs a mesh with physically-based rendering (PBR) materials, and the mesh can be decomposed into material-specific components. The pipeline is organized into four tightly coupled modules: Geometry reconstructs a watertight mesh through latent SDF modelling with a geometry VAE and a coarse-to-fine flow-matching DiT; Texture predicts multiview albedo observations, reprojects them onto the mesh, and completes unseen surface regions with a 3D texture field; Material uses MatWeaver to obtain component masks through video-based segmentation and UV-space voting, then retrieves and bakes PBR maps from a curated material library through hierarchical multi-modal matching; and Parts generates material-editable semantic part meshes with a PartVAE and PartDiT, decoding multi-head part-specific SDF fields in one pass. Each module is evaluated independently with dedicated metrics, highlighting both the current system capability and the remaining gaps toward broader deployment.
[CV-34] Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long Video Understanding ECCV
链接: https://arxiv.org/abs/2606.27922
作者: Shuimu Chen,Yuteng Chen,Yuanshen Guan,Zebang Cheng,Zeyu Zhang,Shengqian Qin,Bin Xia,Jiaran Li,Wenming Yang,Fei Ma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures, ECCV
Abstract:Current multimodal reflection mechanisms for long video understanding predominantly rely on closed-loop self-reflection within internal parameters. Lacking objective external evidence, models are frequently trapped in blind confidence and often fail to correct errors. Furthermore, applying reinforcement learning to multi-stage reflection pipelines introduces severe policy coupling, which is exacerbated by a critical scarcity of dedicated training data. To address these limitations, this work proposes Reflect-R1, the first Evidence-Driven self-correction framework for long video understanding. The framework constructs a three-stage pipeline consisting of intuition, verification, and arbitration. By dynamically retrieving objective visual evidence to verify initial intuitions and autonomously executing multiple temporal searches to resolve conflicts, it completely breaks the hallucination loop. To overcome policy coupling, we design a stage-decoupled reinforcement learning algorithm named SD-GRPO that independently computes advantage functions across different reasoning stages. Concurrently, we construct a dataset of 120K samples to bridge the training data gap. Extensive experiments on benchmarks such as VideoMME and LongVideoBench demonstrate that Reflect-R1 achieves state-of-the-art performance. Our method significantly improves the genuine rectification rate and enables authentic self-correction strictly grounded in objective evidence.
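One way to read "independently computes advantage functions across different reasoning stages" is to apply a group-relative normalization separately to each stage's rewards. The sketch below does exactly that and nothing more; `stage_decoupled_advantages`, the stage names, and the mean/std normalization are assumptions in the spirit of GRPO-style methods, not the paper's SD-GRPO definition.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def stage_decoupled_advantages(stage_rewards: Dict[str, Sequence[float]],
                               eps: float = 1e-8) -> Dict[str, List[float]]:
    """Normalize rewards within each stage of one rollout group.

    `stage_rewards` maps a stage name (e.g. "intuition") to the rewards
    that the rollouts of a single group obtained at that stage.
    """
    advantages: Dict[str, List[float]] = {}
    for stage, rewards in stage_rewards.items():
        mu, sigma = mean(rewards), pstdev(rewards)
        # Each stage uses only its own statistics, so a noisy stage
        # cannot rescale the learning signal of another stage.
        advantages[stage] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages
```

A stage in which every rollout earns the same reward gets all-zero advantages, so it contributes no gradient for that group.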
[CV-35] Every Step of the Way: Video-based Parkinsonian Turning Step Counting
链接: https://arxiv.org/abs/2606.27918
作者: Qiushuo Cheng,Jingjing Liu,Catherine Morgan,Alan Whone,Majid Mirmehdi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:As a prominent symptom of Parkinson’s disease (PD), turning impairment is evaluated through parameters such as turning angle, duration, and particularly, the number of steps required to complete a turn, which directly reflects motor dysfunction. Accurate step counting is challenging due to variability in real-world turning movements and atypical shuffling patterns in parkinsonian gait. Existing methods are predominantly wearable-based, requiring users to wear and manage dedicated devices, which can be inconvenient for continuous daily use. To address this, we propose a passive, video-based framework that estimates step count in a coarse-to-fine manner using diverse motion representations. Specifically, an initial step count is estimated from foot movement signals derived from 3D human mesh recovery, providing high-level motion structures. To incorporate fine-grained motion details, a motion encoder learns complementary gait dynamics from mesh and optical flow to refine the initial estimate. In this process, coarse foot movement signals query the pixel-level motion cues via cross attention to capture subtle parkinsonian gait dynamics. To handle varying video lengths, we partition each video into clips and integrate clip-wise motion embeddings via multiple instance learning (MIL) for step count residual prediction. Extensive experiments show our method consistently outperforms existing step counting methods on real-world PD turning datasets.
[CV-36] There and Back Again: A Flexible-Frame Transformer for Multi-Exposure Fusion ECCV2026
链接: https://arxiv.org/abs/2606.27905
作者: Lishen Qu,Yao Liu,Shihao Zhou,Jie Liang,Hui Zeng,Lei Zhang,Jufeng Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026
Abstract:Multi-exposure fusion (MEF) brings the dynamic range of conventional cameras closer to that of human vision, producing images with rich scene content. Given the large variability in scene luminance, exposure strategies often require different numbers of frames to capture the full radiance range faithfully. However, conventional MEF techniques are typically designed for a fixed number of inputs, forcing deployment systems to maintain separate models for different frame-count requirements, which undermines deployment efficiency. To address this limitation, we propose FreeMEF, the first flexible-frame transformer for MEF that seamlessly accommodates varying numbers of input exposures without retraining or architectural changes. The proposed approach consists of two key modules. First, we introduce a recurrent state space module (RSSM) that sequentially fuses features from arbitrary sequences via adaptive alignment and state-space recurrent modeling, thereby providing global information guidance for the subsequent restoration. Second, we devise a global feature guided block (GFGB) incorporating an extremity-aware hybrid attention (EAHA) and an affine-injection feed-forward network (AFFN), which effectively resolves the similarity paradox while simultaneously optimizing contrast and brightness regulation. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, which performs favorably against state-of-the-art methods both quantitatively and qualitatively.
[CV-37] Long-Term Prediction of Local and Global Human Motion with Occlusion Recovery
链接: https://arxiv.org/abs/2606.27900
作者: Qiaoyue Yang,Sven Heutger,Christopher Niemann,Magnus Jung,Ayoub Al-Hamadi,Sven Wachsmuth
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Advances in Visual Computing (ISVC 2025)
Abstract:Human motion describes the three-dimensional full-body movement of a person. Anticipating such motion holds significant relevance across a wide range of application domains such as human-robot interaction, autonomous driving, animation, and healthcare. In recent research, spatial and temporal dependencies are modeled by bidirectional attention mechanisms. These typically anticipate human motion in an autoregressive manner which could cause an accumulation of errors over time. As a consequence, they solely focus on local pose forecasting. To address these limitations, we propose a non-autoregressive transformer based on spatio-temporal attention, and train it not only for local pose anticipation, but also for global motion prediction in space. Furthermore, to enhance its applicability in real-world scenarios, our model is also trained to recover missing joints due to occlusions, and is capable of processing varying lengths of history observations. Our code is publicly available at this https URL.
[CV-38] OrthoTryOn: Geometric Orthogonalization for Conflict-Free Unified Fashion Generation ECCV2026
链接: https://arxiv.org/abs/2606.27880
作者: Zhaotong Yang,Ying Tai,Jiahui Zhan,Yu Zheng,Jianjun Qian,Jian Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV2026
Abstract:Unified fashion generation integrates tasks like virtual try-on and garment reconstruction into a single model to reduce task-specific adaptation costs. However, naive parameter sharing across semantically distinct tasks induces negative transfer through severe inter-task gradient conflict. We propose OrthoTryOn, a unified framework mitigating this interference within a shared Low-Rank Adaptation (LoRA) module. Its Orthogonal Subspace Projection (OSP) applies task-specific orthogonal rotations to bottleneck features, mapping them into decorrelated coordinate frames. To address residual semantic coupling at inference time, we further propose Fisher-guided Negative Guidance (FNG), a parameter-free strategy that utilizes diagonal Fisher information to quantify inter-task sensitivity overlap and explicitly repels generation trajectories from the most confusable task via Classifier-Free Guidance. Extensive experiments demonstrate that OrthoTryOn avoids the severe performance degradation typical of naive unified training and even surpasses independently trained task-specific models, achieving state-of-the-art results across multiple benchmarks while generalizing robustly across diverse diffusion backbones. Code is available at this https URL.
[CV-39] SpatialUAV: Benchmarking Spatial Intelligence for Low-Altitude UAV Perception Collaboration and Motion
链接: https://arxiv.org/abs/2606.27876
作者: Haoyu Zhang,Meng Liu,Qianlong Xiang,Kun Wang,Yaowei Wang,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures
Abstract:Spatial intelligence is essential for low-altitude unmanned aerial vehicle (UAV) perception, collaboration, and navigation. However, existing UAV benchmarks often emphasize image-level recognition, single-view understanding, or narrow answer formats, leaving 3D spatial inference, multi-view collaboration, scene dynamics, and diverse task formulations insufficiently evaluated. To address these gaps, we introduce SpatialUAV, a real low-altitude UAV benchmark comprising 4,331 curated instances across 14 fine-grained task types, covering semantic discrimination, spatial relation, aerial–aerial collaboration, aerial–ground collaboration, and motion understanding. SpatialUAV organizes all samples into a unified visual-input–question–answer schema, while supporting seven input configurations and nine answer formats, including option labels, region identifiers, geometric values, cross-view correspondences, and free-form motion descriptions. To ensure reliable and grounded evaluation, our data construction pipeline integrates detector-assisted regions, depth supervision, metadata-derived rules, extensive manual annotation, blind filtering, and multi-turn human validation, together with task-specific metrics for heterogeneous outputs. Evaluating representative vision-language models across three categories, we show that current models remain far from human-level performance, with pronounced bottlenecks in cross-view association, structured grounding, geometric reasoning, and temporal viewpoint understanding. These results offer empirical guidance for advancing low-altitude UAV spatial intelligence. Code and data are available at this https URL.
[CV-40] A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$
链接: https://arxiv.org/abs/2606.27864
作者: Tīkun Ông,Georg Bökman
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Vision transformers have become a dominant architecture for visual recognition. However, standard models do not explicitly encode the planar symmetries that arise in many vision domains. We introduce a family of vision transformers equivariant to arbitrary discrete subgroups of $\mathrm{O}(2)$, providing a unified framework that generalizes prior flipping- and $D_4$-equivariant transformer architectures. Our construction yields equivariant analogues of the core transformer components, together with expressivity guarantees for the resulting layers. In particular, we show that whenever $H \le G$, the class of $G$-equivariant ViTs embeds naturally into the class of $H$-equivariant ViTs. We also prove that, in the single-head setting, the corresponding equivariant self-attention layer realizes every $G$-equivariant self-attention map representable by ordinary self-attention. We further construct a $D_6$-equivariant model based on hexagonal patches, making the architecture compatible with six-fold rotational symmetries. We evaluate the resulting models on the PatternNet aerial image dataset in artificially data-scarce regimes across subgroups of $D_4$ and $D_6$. Our experiments compare two equivariant attention mechanisms and analyze how the choice of homogeneous-space configurations used in the nonlinearities affects performance. Preliminary results under matched parameter budgets indicate that equivariance can improve recognition accuracy, motivating further study of how discrete symmetry groups shape transformer-based visual recognition models.
[CV-41] ScaLe-INR: Scale and Learn Implicit Neural Representations NEURIPS2026
链接: https://arxiv.org/abs/2606.27862
作者: Buwaneka Epakanda,Athulya Ratnayake,Pandula Thennakoon,Mario De Silva,Avishka Ranasinghe,Roshan Godaliyadda,Parakrama Ekanayake
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted as a conference paper to NeurIPS 2026
Abstract:Implicit Neural Representations (INRs) parameterized by multilayer perceptrons excel at modeling continuous signals. However, a key challenge persists, as INRs fundamentally suffer from spectral bias and information cross-talk. When a single network attempts to capture multi-scale phenomena, high-frequency weight updates destructively interfere with the underlying low-frequency structural approximation. We introduce Scale and Learn INR (ScaLe-INR), a novel multi-branch architecture that resolves these limitations by explicitly matching the signal’s frequency spectrum with the optimal operating region of the INR. Drawing upon the Fourier inverse scaling theorem, we demonstrate that applying directional coordinate scaling expands a network’s representational bandwidth along specific spatial axes. To mathematically enforce functional disentanglement and minimize task-specific information leakage between branches, we propose a Directional Edge Guidance Loss, a spatially-conditioned sparsity prior derived from ground-truth gradients. By constraining the high-frequency branches to act as strict, localized edge-filters, ScaLe-INR eliminates spectral cross-talk, accelerates convergence, and achieves high-fidelity signal reconstruction on complex multi-scale topologies. We evaluate ScaLe-INR across diverse reconstruction and inverse tasks, demonstrating substantial performance gains over existing state-of-the-art (SOTA) methods. The proposed architecture improves upon the nearest baselines by +5.16 dB in image reconstruction and +0.65 dB in image denoising. Furthermore, it achieves 50.02 dB on audio reconstruction and 0.999 IoU (Intersection over Union) on 3D reconstruction, surpassing all SOTA models.
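For reference, the standard Fourier scaling property that this argument appears to rely on can be written out. The notation is an assumption for illustration: $f$ is a signal on $\mathbb{R}^d$, $A$ is an invertible scaling matrix, $g(\mathbf{x}) = f(A\mathbf{x})$, and $\hat{f}$ denotes the Fourier transform.

```latex
% Scaling property for g(x) = f(Ax), A an invertible d x d matrix
\hat{g}(\boldsymbol{\xi}) = \frac{1}{|\det A|}\,
    \hat{f}\!\left(A^{-\top}\boldsymbol{\xi}\right)

% Special case of axis-wise (directional) scaling, A = diag(a_1, ..., a_d)
\hat{g}(\boldsymbol{\xi}) = \frac{1}{\prod_{i=1}^{d} |a_i|}\,
    \hat{f}\!\left(\frac{\xi_1}{a_1}, \dots, \frac{\xi_d}{a_d}\right)
```

Read this way, scaling the input coordinate along axis $i$ by $a_i > 1$ moves content of $f$ at frequency $\xi_i$ to frequency $a_i \xi_i$ in $g$, so a network whose usable band along that axis is $B$ can reach roughly $a_i B$ after scaling.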
[CV-42] Hippocampus-DETR: An Explicit Memory Object Detection Framework Based on Hippocampus Modeling
链接: https://arxiv.org/abs/2606.27831
作者: Zhaoning Shi,Bo Ma,Hao Xu,Zepeng Yang,Bo Liang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper addresses the lack of explicit memory mechanisms in current object detection models and proposes Hippocampus-DETR, a novel detection framework based on biological hippocampal memory modeling. This framework integrates a hippocampal memory network module, HipNet, into the DETR architecture and systematically simulates the anatomical structure and functional organization of hippocampal subregions, including the entorhinal cortex, dentate gyrus, CA3, CA1, and subiculum. Through this design, Hippocampus-DETR realizes pattern separation, pattern completion, importance filtering, and information integration of visual encoding features. During training, different memory submodules are optimized using a layer-wise training strategy, ultimately forming a memory system with memory retrieval and completion capabilities. Experimental results demonstrate that Hippocampus-DETR achieves higher detection accuracy than current mainstream models. More importantly, models equipped with this framework also exhibit excellent generalization ability and data efficiency in tasks such as few-shot image classification, multimodal feature construction, and image restoration. Subsequent experiments further validate the functional necessity and internal interpretability of each memory submodule. This study not only provides a novel object detection framework, but also offers a feasible technical pathway for integrating neurocognitive mechanisms with deep learning models, highlighting its significant value in improving model learning efficiency and task robustness. The project is available at this https URL.
[CV-43] CSD: Content-aware Speculative Decoding for Efficient Image Generation
链接: https://arxiv.org/abs/2606.27829
作者: Mingcheng Wang,Junbo Qiao,Yunchen Li,Lingfu Jiang,Wei Li,Jie Hu,Jiao Xie,Zhou Yu,Xinghao Chen,Guixu Zhang,Shaohui Lin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Speculative decoding (SD) has emerged as a key solution to accelerate the inference of autoregressive models. However, in the field of image generation, it faces the challenge of low acceptance rates, and directly relaxing its criteria leads to degradation in image quality. In this paper, we propose a novel content-aware speculative decoding algorithm, termed CSD, which integrates an entropy-based probability relaxation mechanism with an optimal resampling strategy to enhance the inference efficiency for autoregressive image generation. By leveraging the informational uncertainty inherent in different regions of an image, CSD dynamically adjusts the acceptance probability of candidate tokens, increasing the acceptance rate in low-detail areas to accelerate generation. Moreover, a distribution alignment filter is introduced to ensure that the output distribution is aligned with the target model, which significantly improves the generative quality. Experiments conducted on Lumina-mGPT and Janus-Pro demonstrate the superiority of the proposed CSD. Our source code is available at this https URL.
[CV-44] Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning
链接: https://arxiv.org/abs/2606.27828
作者: Hohin Kwan,Hongyu Li,Ray Zhang,Manyuan Zhang,Xianghao Kong,Anyi Rao,Jiahao Xie,Si Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, update, and compose evidence as visual states evolve across frames. Existing video benchmarks often conflate this capability with scene complexity, static recognition, or uncontrolled temporal variation. To isolate this capability, we introduce Video-MME-Logical, a controlled benchmark organized around five temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. The benchmark contains 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. It enables difficulty-controlled final-answer evaluation by varying temporal horizon and reasoning complexity, and supports intermediate-state diagnostics by verifying whether models recover the required logical reasoning trace before producing the final answer. Experiments with state-of-the-art MLLMs reveal a substantial human-model gap, especially as temporal-logical complexity increases. Supervised fine-tuning on up to 500K generated samples improves performance but remains insufficient to close the reasoning gap, positioning Video-MME-Logical as a scalable testbed for analyzing and improving temporal-logical reasoning in MLLMs.
[CV-45] Scalable and Differentiable Point-Cloud Registration Using Maximum Mean Discrepancy ICML2026
链接: https://arxiv.org/abs/2606.27818
作者: Rixon Crane,Fahira Afzal Maken,Nicholas Lawrance,Stanislav Funiak,Kasra Khosoussi,Ming Xu,Russell Tsuchida
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICML 2026
Abstract:We present MMD-Reg, a novel correspondence-free approach to point-cloud registration that is differentiable and has linear computational complexity in the number of points. We model registration as a nonlinear least-squares problem based on the Maximum Mean Discrepancy, approximated using random Fourier features. The resulting objective can be solved efficiently with standard methods such as Levenberg-Marquardt, and the solution is differentiable via the implicit function theorem. This allows MMD-Reg to be used as a differentiable optimization layer within end-to-end trainable models, supporting registration under challenging conditions such as poor initial alignment and partial overlap. We demonstrate this Neural MMD-Reg formulation by integrating the layer with a set transformer, training the resulting model in supervised and unsupervised settings, and comparing its performance against recent learning-based methods. We also evaluate standalone MMD-Reg, comparing its accuracy and scalability against widely used non-learning-based registration methods.
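The core quantity, a squared Maximum Mean Discrepancy approximated with random Fourier features, can be sketched in a few lines. The code below assumes a Gaussian kernel with bandwidth `sigma`; the names `make_rff` and `mmd2` are illustrative, and the sketch only evaluates the discrepancy between two point sets, leaving out the optimization over rigid transforms that a registration method would perform.

```python
import math
import random
from typing import Callable, List, Sequence

Point = Sequence[float]


def make_rff(dim: int, num_features: int, sigma: float,
             seed: int = 0) -> Callable[[Sequence[Point]], List[float]]:
    """Build a mean-embedding function from random Fourier features."""
    rng = random.Random(seed)
    # Frequencies drawn from the spectral density of a Gaussian kernel.
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
         for _ in range(num_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def mean_embedding(points: Sequence[Point]) -> List[float]:
        acc = [0.0] * num_features
        for p in points:  # one pass over the cloud: linear in its size
            for j in range(num_features):
                proj = sum(w * x for w, x in zip(W[j], p))
                acc[j] += scale * math.cos(proj + b[j])
        return [a / len(points) for a in acc]

    return mean_embedding


def mmd2(points_a: Sequence[Point], points_b: Sequence[Point],
         embed: Callable[[Sequence[Point]], List[float]]) -> float:
    """Squared MMD estimate: distance between the two mean embeddings."""
    ea, eb = embed(points_a), embed(points_b)
    return sum((u - v) ** 2 for u, v in zip(ea, eb))
```

Because each cloud is summarized by a fixed-length mean embedding, the cost grows linearly with the number of points and no point-to-point correspondences are needed.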
[CV-46] Text as Illumination: Spatial Contrastive Retinex Learning for Language-guided Medical Image Segmentation MICCAI2026
链接: https://arxiv.org/abs/2606.27794
作者: Jian Shi,Cheng Zhen,Pingping Zhang,Rui Xu,Yanan Lv,Yili Ma,Huan Bi,Haojie Li,Huchuan Lu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2026. More modifications may be performed
Abstract:Language-guided Medical Image Segmentation (LMIS) has shown great potential to improve the delineation of anatomical structures and lesions by integrating clinical textual information. Existing methods generally rely on either implicit interaction between textual and visual features or auxiliary coarse-grained supervision for cross-modal alignment. However, these methods lack explicit and fine-grained constraints to ensure semantic consistency, causing a mismatch between language and the segmentation outputs. To address this issue, we propose Text-as-Illumination Retinex Network (TIRNet), a novel Retinex-inspired framework that treats text embeddings as semantic illumination for feature modulation, thereby improving semantic consistency in LMIS. TIRNet introduces two key blocks integrated at each decoder stage: (1) the Retinex-inspired Text Modulation Block (RTMB), which employs positive and negative illumination maps to enhance text-relevant foreground features and suppress background interference; and (2) the Consistent Detail Compensation Block (CDCB), which selectively recovers high-frequency details via a consistency-gated mechanism conditioned on illumination reliability. Furthermore, we propose a Multi-Scale Illumination Supervision Loss (MSIS-Loss), comprising a Region-Grounded Contrastive Loss (RGC-Loss) that enforces cross-modal similarity to be concentrated in text-relevant foreground regions and suppressed in background regions, and a Background Suppression Loss (BS-Loss) that provides pixel-level supervision for negative illumination maps, jointly ensuring a precise cross-modal alignment at each decoder stage. Extensive experiments on the MosMedData+ and QaTa-COV19 datasets demonstrate that TIRNet achieves state-of-the-art performance in LMIS. The code is available at: this https URL.
[CV-47] Improving Adversarial Robustness via Activation Amplification and Attenuation ECCV2026
链接: https://arxiv.org/abs/2606.27784
作者: Taïga Gonçalves,Yongsong Huang,Tomo Miyazaki,Shinichiro Omachi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ECCV 2026
Abstract:The existence of adversarial attacks is often attributed to the presence of non-robust features in neural networks. While prior defenses reduce their impact via pruning, masking, or feature recalibration, we instead propose to jointly learn to amplify and attenuate these signals through a simple activation scaling mechanism. To this end, we introduce Activation Amplification and Attenuation (A3), a lightweight plug-in module that enhances adversarial robustness with minimal modifications of the activations. A3 dynamically rescales the activations using a learnable mask and a scaling factor derived from the original activation magnitudes. The influence of adversarial perturbations can be amplified or attenuated using the same learnable parameters by simply flipping the sign of the scaling operation. The amplified signals serve as negative references to construct novel contrastive and ranking loss functions. Experimental analysis shows that learning to degrade the predictions in amplification mode simultaneously improves adversarial robustness in attenuation mode. Moreover, A3 relies on only a small number of learnable parameters, with most of its behavior being determined by the scaling mechanism rather than additional network capacity. Extensive experiments demonstrate that integrating A3 into different backbones, datasets, and training methods consistently improves adversarial robustness while introducing negligible computational and memory overhead compared to existing plug-in modules. Code is available at: this https URL.
[CV-48] MindFlow: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations ECCV2026
链接: https://arxiv.org/abs/2606.27779
作者: Hejia Chen,Haoxian Zhang,Xu He,Xiaoqiang Liu,Pengfei Wan,Shoulong Zhang,Shuai Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026
Abstract:Generating lifelike facial animation for dyadic conversations requires reconciling high-level cognitive intent with precise low-level motor reflexes, yet existing methods fall short in the semantic understanding of dialogue context and in precise dynamic control. In this paper, we propose MindFlow, a dual-pathway generative framework inspired by the Ventral-Dorsal pathway model in neuroscience, which decouples generation into two collaborative streams, thereby harmonizing deep semantic reasoning with fine-grained control. In the Ventral module, we transform the conventional Sentence-Action approach into a novel Chunk-State approach that models raw acoustic streams as a context-aware, evolving emotional state chain, capturing subtle paralinguistic nuances and mid-utterance emotional shifts missed by sentence-level modeling. The Dorsal module features a conditional autoregressive flow matching network for high-fidelity facial motion, driven by high-frequency acoustic cues and modulated by emotion states, plus a Selective Acoustic Injector for adaptive audio gating to ensure robustness in talking-and-listening dynamics without interference. Extensive experiments demonstrate that MindFlow achieves superior semantic appropriateness and motion naturalness compared to state-of-the-art baselines.
[CV-49] TRUST: Efficient Abdominal Trauma Recognition via Image-to-Ultrasound-Video Transfer Learning MICCAI2026
链接: https://arxiv.org/abs/2606.27777
作者: Enguang Wang,Hao Zhou,Shuo Gao,Tuo Liu,Guangquan Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2026, 11 pages, 5 figures
Abstract:Abdominal ultrasound is indispensable for rapid, noninvasive trauma triage. However, interpreting the subtle dynamic cues embedded in continuous scanning is time-intensive and operator-dependent. Parameter-Efficient Image-to-Video Transfer Learning (PEIVTL), which efficiently adapts pre-trained image models to the video domain, notably through visual-textual alignment, offers a promising paradigm for ultrasound video analysis. Nevertheless, substantial spatiotemporal and semantic variations arising from physician-dependent scanning practices continue to limit the effectiveness and generalizability of this framework. We propose TRUST, a scan-aware PEIVTL framework that explicitly models fine-grained spatiotemporal variations to enable reliable ultrasound video understanding. First, we introduce a Cross-Frequency Collaborative Adapter (CFCA) that establishes mutual constraints between low- and high-frequency components, enhancing discriminative spatial feature extraction under heavy speckle corruption. Second, we design a Multi-Granularity Motion-Aware (MGMA) module that integrates local temporal convolutions with motion-prior-guided global self-attention, jointly capturing stable intra-view patterns and abrupt inter-view transitions to characterize complex scanning dynamics. Third, a Visual Query Semantic Aggregation (VQSA) module dynamically generates text prototypes conditioned on visual features, enabling adaptive visual-textual alignment robust to intra-class variability under diverse scanning conditions. Experiments on in-house ultrasound trauma datasets demonstrate that TRUST outperforms state-of-the-art methods by 9.63% with superior computational efficiency.
[CV-50] ModaFlow: Modality-Aware Flow Matching for High-Fidelity Virtual Try-On
链接: https://arxiv.org/abs/2606.27773
作者: Xiangyu Sai,Meysam Madadi,Sergio Escalera,Yong Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Image-based virtual try-on has emerged as a compelling task in e-commerce and augmented reality, yet existing methods struggle to simultaneously preserve fine garment semantics and adapt to diverse person body geometries under large clothing-body deformations. We present ModaFlow, a modality-aware flow-matching based framework for high-fidelity virtual try-on that achieves precise alignment between textual descriptions and garment appearance. Unlike prior methods that treat multimodal conditions uniformly, ModaFlow introduces a modality-aware guidance scheme: visual garment embeddings extracted by a pretrained image prompt adapter provide deterministic, persistent structural guidance, while textual embeddings generated from garment descriptions are controlled via classifier-free guidance (CFG) with adaptive scaling and zero-initialized velocity. To further enhance flow field accuracy, we propose two regularization losses, cosine similarity and perceptual flow discrimination, that jointly improve directional consistency and perceptual realism of the velocity field. Additionally, a mask manipulation strategy stochastically samples among box, transparent, and relaxed masks during training, simulating diverse occlusion scenarios and enabling robust inference under unpaired settings where only a box mask is available. Experiments show that ModaFlow achieves state-of-the-art results in both qualitative and quantitative evaluations, reducing FID by approximately 30% on paired and 20% on unpaired benchmarks.
[CV-51] An Embedded Real-Time License Plate Recognition System for Complex Traffic Scenes ITSC
链接: https://arxiv.org/abs/2606.27772
作者: Anuki Pasqual,Dulan Lokugeegana,Manimohan Thiriloganathan,Nuthya Rathnayake,Kithsiri Samarasinghe,Udaya S. K. P. Miriya Thanthrige
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: Accepted at IEEE Intelligent Transportation Systems Conference (ITSC) 2026
Abstract:Vehicle license plate recognition is an integral component of intelligent transportation systems. In this work, we present an embedded real-time license plate recognition system customized for developing countries. We address the challenge of handling complex, unstructured traffic scenes with diverse vehicle types while implementing the system on an embedded platform for low-cost deployment. Our method consists of license plate detection on a multi-vehicle image, followed by character recognition on the detected license plates. Both steps use lightweight convolutional neural networks to balance accuracy and efficiency. We also introduce the SL-LPR dataset of Sri Lankan road images, which contains a variety of vehicle types and traffic conditions typically seen in developing countries. On this dataset, the license plate detection and character recognition models achieved 93.6% mAP and 87.88% accuracy, respectively, and were competitive against larger models on several public datasets. To achieve real-time performance in a resource-constrained embedded environment, we applied low-bitwidth quantization using the Brevitas library and implemented FPGA acceleration for the models using the FINN framework. The end-to-end system can operate at 11.5 FPS when implemented on the Xilinx Kria KV260 platform. These results demonstrate that our system is effective for real-time license plate recognition on an embedded device, even in complex traffic scenarios. The SL-LPR dataset is available for research use at: this https URL.
[CV-52] NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning
链接: https://arxiv.org/abs/2606.27771
作者: Tianlin Pan,Lianyu Pang,Cheng Da,Huan Yang,Changqian Yu,Kun Gai,Wenhan Luo
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of this drift: across three post-training methods (NFT, AWM, DPO), RL fine-tuning inflates the per-step velocity norm $|v_\theta|$ by 5% to 15% relative to the reference. A form of norm inflation has been studied in classifier-free guidance (CFG), where rescaling the velocity back to a reference norm at inference time can mitigate the resulting artifacts. However, this inference-time correction does not transfer cleanly to RL: rescaling $v_\theta$ to match $|v_{\text{ref}}|$ at inference time neither improves reward nor fixes the quality degradation, because the inflation is co-adapted into the model weights. Furthermore, an adjoint sensitivity analysis shows that velocity magnitude rescaling carries no coherent first-order reward signal at the batch level, indicating that suppressing norm inflation is unlikely to remove a consistently reward-carrying component. Since inference-time renormalization fails while norm suppression carries no reward cost, training-time intervention is the appropriate strategy. Together, these findings motivate NormGuard, a hinge penalty that activates only when $|v_\theta|$ exceeds $|v_{\text{ref}}|$ and composes additively with any velocity-local base loss. Across two base models, three post-training methods, and two reward proxies, NormGuard consistently improves MLLM-judged image quality and forensic realism while preserving reward, with gains that amplify under few-step inference and are not explained by early stopping.
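The hinge penalty described in this abstract can be sketched in a few lines. The abstract does not give the exact functional form, so everything here is an assumption for illustration: a linear (non-squared) hinge on the L2 norm, a single velocity vector rather than a batch, and a hypothetical `weight` coefficient.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def norm_hinge_penalty(v_theta, v_ref):
    """Zero while the fine-tuned velocity norm stays at or below the reference norm;
    grows linearly with the excess once it exceeds it (assumed form)."""
    return max(0.0, l2_norm(v_theta) - l2_norm(v_ref))


def total_loss(base_loss, v_theta, v_ref, weight=0.1):
    """Additive composition with an arbitrary base loss; `weight` is hypothetical."""
    return base_loss + weight * norm_hinge_penalty(v_theta, v_ref)
```

Because the penalty is exactly zero whenever the norm does not exceed the reference, it leaves the base objective untouched in that regime, which matches the abstract's description of a constraint that only activates under norm inflation.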
[CV-53] PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion
链接: https://arxiv.org/abs/2606.27760
作者: Zipeng Guo,Lichen Ma,Yu He,Xiaolong Fu,Jingling Fu,Junshi Huang,Yan Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:End-to-end pixel-space diffusion models bypass the lossy compression of Latent Diffusion Models (LDMs) but struggle to jointly model low-frequency semantics and high-frequency signals in high-dimensional space. Existing works heavily rely on complex pixel decoders to alleviate this issue. In this paper, we challenge this trend by revealing that these decoders primarily compensate for the optimization difficulties inherent to velocity prediction ($v$-prediction). Under the clean data paradigm ($x$-prediction), they are redundant. Motivated by this insight, we advocate for simplicity over complexity and introduce PixelU, a minimalist, single-stage U-shaped Diffusion Transformer tailored for pixel space. PixelU abandons auxiliary decoders in favor of zero-cost skip connections, which provide an “information highway” that directly routes uncorrupted high-frequency spatial details from shallow to deep layers. To further enable the backbone to focus exclusively on modeling low-frequency semantics, we introduce a constant-channel spatial down-sampling mechanism as a natural low-pass filter, which compresses deep features into a compact, low-frequency semantic manifold. Extensive experiments demonstrate that this decoupling of frequencies could outperform the strong baseline (JiT-G) with only about 1/3 of its computation cost. On ImageNet $256 \times 256$ and $512 \times 512$, PixelU achieves FID of 1.63 and 1.92 respectively, surpassing recent pixel-space methods and establishing a simple yet powerful new paradigm for end-to-end diffusion models.
[CV-54] Panoramic Scene Analysis: A Survey from Distortion-Aware Engineering to Sphere-Native Foundation Modeling
链接: https://arxiv.org/abs/2606.27745
作者: Qinfeng Zhu,Lei Fan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Panoramic images capture the complete visual sphere in a single frame, providing spatial context unattainable by conventional cameras. Yet this completeness comes at a geometric cost: the 2-sphere cannot be faithfully mapped to the plane, and every planar representation introduces distortions that violate the assumptions underlying standard vision architectures. This survey traces the evolution of panoramic scene analysis along a methodological trajectory, from projection-based adaptation, through distortion-aware engineering, to sphere-native modeling and geometry-aware tokenization for foundation models, and argues that this evolution reflects a progressive deepening of geometric commitment rather than a simple accumulation of techniques. We organize the literature along two orthogonal dimensions: architectural design (how operators interact with spherical geometry) and training paradigm (how knowledge is transferred across domains). Covering dense prediction (semantic segmentation, depth estimation, and room layout estimation), unified multi-task understanding, open-world perception, vision-language reasoning, and dynamic video analysis, we identify a central unresolved tension: among the methods surveyed, none simultaneously delivers strict spherical equivariance and full reuse of perspective-pretrained foundation-model weights, and we argue that this is a structural rather than incidental gap. We further expose five systematic gaps in current evaluation protocols, namely the absence of spherical-area-weighted metrics, seam-consistency testing, polar-robustness stratification, cross-projection generalization, and open-world protocol standardization, and propose a six-point research roadmap toward general-purpose panoramic intelligence. The corresponding repository is publicly available at: this https URL.
[CV-55] SIFT: Self-Imagination Fine-Tuning for Physically Plausible Motion in Video Diffusion Models ECCV2026
链接: https://arxiv.org/abs/2606.27741
作者: Ruoyu Wang,Jialun Liu,Huayang Huang,Haibin Huang,Jiepeng Wang,Chi Zhang,Xuelong Li,Yu Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Recent advances in video diffusion models have greatly improved visual fidelity, yet their generated motions often violate physical plausibility. We observe a common kinematic failure, “motion entanglement”, the unintended coupling of independent motion sources, such as camera movement and object motion. We identify that this issue stems from data bias and the reconstruction-based training design of diffusion models. Training on noisy videos that still retain coarse motion cues inadvertently encourages the model to replicate existing motion without an incentive to learn how to model kinematically-grounded motions. To address this, we propose a Self-Imagination Fine-Tuning (SIFT) paradigm, which enables the model to learn from its own generated videos rather than directly reconstructing real ones, breaking the reconstruction shortcut. We further employ motion-aware discriminative supervision and a progressive hard-case replay strategy to stabilize and accelerate learning. By leveraging freely-generated text prompts, our method can densely cover a broad motion space, including rare or finely-disentangled scenarios that would be costly to collect as video data. Extensive experiments demonstrate that our approach substantially improves the physical realism, motion disentanglement, and controllability of generated videos.
[CV-56] Learning 1-Bit LiDAR-based Localization with Auxiliary Objective ECCV
链接: https://arxiv.org/abs/2606.27729
作者: Kaijie Yin,Zhiyuan Zhang,Tian Gao,Wentao Zhu,Cheng-zhong Xu,Hui Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: European Conference on Computer Vision (ECCV)
Abstract:6-DoF LiDAR-based localization is a fundamental capability for autonomous systems operating in large-scale outdoor environments. Many deep-learning-based localization methods have achieved promising performance so far. However, as one of the always-on modules competing for limited on-board computational resources, the localization module is expected to consume only a small portion of the overall compute budget. Most existing learning-based methods are still too heavy for this purpose. In contrast, binary neural networks (BNNs) offer an appealing solution, but the 1-bit compression causes severe information loss and performance drop. In this paper, we address this challenge by proposing Binarized LiDAR-based Localization (BiLoc), the first binary neural network framework for 6-DoF LiDAR localization. Specifically, we reinterpret the training of BNNs from the perspective of the information-bottleneck principle, aiming at retaining minimal yet sufficient representations for pose estimation while suppressing redundant variations. And we introduce an auxiliary objective that adaptively regulates information retention in the binary encoder, effectively mitigating the information loss caused by binarization. This auxiliary objective provides additional optimization signals that compensate for the limited representational capacity and the gradient mismatch inherent in BNNs. Extensive experiments on large-scale outdoor LiDAR datasets demonstrate that BiLoc establishes a new state of the art for LiDAR localization with BNNs.
[CV-57] Scene and Human in One World: Reconstruction in a Feedforward Pass
链接: https://arxiv.org/abs/2606.27720
作者: Boao Shi,Qiao Feng,Yiming Huang,Lingjie Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing humans in dynamic scenes from moving monocular cameras remains challenging due to scale ambiguity, human-scene misalignment, and occlusion interference. Rather than treating human mesh recovery and scene reconstruction as separate tasks, we believe that accurate human-scene reconstruction requires the two tasks to mutually inform each other: parametric human models offer semantic structure and metric-scale priors, while scene geometry provides spatial context for human localization and alignment. Built on this insight, we introduce SHOW, a mask-promptable human mesh recovery framework that couples feed-forward 3D scene reconstruction with Human Mesh Recovery in a unified metric space. SHOW injects human semantics and scale priors from parametric human models into normalized point-map prediction, enabling metric-scale scene reconstruction from inherently scale-ambiguous monocular input. In turn, the recovered scene geometry constrains human mesh estimation, encouraging spatially consistent human placement and improved human-scene alignment. To handle complex multi-person and cluttered scenes, SHOW further incorporates a promptable masking mechanism that enables flexible target-human selection while suppressing background distractions and occlusion interference. Through joint training, the model learns both human-aware geometric features and geometry-constrained human features, producing aligned metric-scale reconstructions from monocular human-centric videos. Extensive experiments demonstrate that SHOW improves metric-scale consistency, human-scene alignment, and reconstruction accuracy under challenging camera motion, occlusion, and cluttered backgrounds.
[CV-58] MASS: Motion-Aligned Selective Scan for Refinement in Flow-Based Video Frame Interpolation ECCV2026
链接: https://arxiv.org/abs/2606.27718
作者: Jun-Sang Yoo,Seung-Won Jung
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ECCV 2026
Abstract:Video frame interpolation (VFI) remains a challenging task, particularly when dealing with large, non-linear motions and complex occlusions. While flow-based methods are prevalent, they often struggle with ambiguous correspondences. Recent VFI methods based on selective State Space Models (SSMs) are still limited by static grid-based scanning that misaligns with physical motion. In this paper, we propose Motion-Aligned Selective Scan (MASS), a novel framework that reformulates feature scanning from static spatial grids to dynamic motion trajectories. MASS builds a feature sequence along each pixel’s flow-guided trajectory and aggregates it with an SSM. Specifically, we introduce a learnable non-linear path integration to approximate complex curved trajectories via residual velocity updates, and a velocity-aware SSM that dynamically adjusts the sampling budget and step size based on motion magnitude. This adaptive strategy allocates denser sampling to fast-motion regions while keeping static regions efficient. Furthermore, the aggregated states guide a refinement module to rectify intermediate flows and masks in an end-to-end manner. Extensive experiments indicate that MASS achieves highly competitive overall performance on standard benchmarks, establishing state-of-the-art results particularly in challenging scenarios with large displacements and complex dynamics.
[CV-59] ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval
链接: https://arxiv.org/abs/2606.27708
作者: Siqiao Xue,Chunxue Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ZooClaw Team
Abstract:Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model’s broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe – full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model – and outperforms LoRA, larger backbones (up to 1B parameters), and external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely-used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
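The weight-interpolation step mentioned in this abstract amounts, in its usual form, to a per-parameter linear blend of the base and fine-tuned checkpoints. The sketch below assumes checkpoints are plain dictionaries mapping parameter names to flat lists of floats; the mixing coefficient `alpha` and the function name are hypothetical, and the abstract does not state which coefficient the authors use.

```python
def interpolate_weights(base, finetuned, alpha):
    """Blend two checkpoints: (1 - alpha) * base + alpha * finetuned.

    alpha = 0 returns the base model, alpha = 1 the fine-tuned model.
    Both checkpoints must expose the same parameter names and shapes.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must have identical parameter names")
    blended = {}
    for name in base:
        if len(base[name]) != len(finetuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [
            (1.0 - alpha) * b + alpha * f
            for b, f in zip(base[name], finetuned[name])
        ]
    return blended
```

Intermediate values of `alpha` trade in-domain gains against retention of the base model's general behaviour, which is the tradeoff the abstract says the recipe is meant to resolve.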
[CV-60] Class-frequency Guided Noise Schedule for Diffusion Models
链接: https://arxiv.org/abs/2606.27696
作者: Jiequan Cui,Beier Zhu,Qingshan Xu,Xiaojuan Qi,Bei Yu,Hanwang Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
Abstract:In this paper, we are the first to examine the correlations between class frequency and the multi-scale noise schedule within diffusion models. For score-based generative models, low-density regions often lead to inaccurately estimated scores, thereby compromising the generation quality. Although the multi-scale noise schedule can alleviate this issue during the diffusion process, low-frequency classes still face the challenge of large low-density regions, resulting in more inaccurate estimated scores than high-frequency classes. Furthermore, high-frequency classes tend to dominate the score space, causing a convergence of most data points towards generating samples from these classes. Consequently, samples generated within low-frequency classes exhibit suboptimal quality and limited diversity. To address this challenge, we propose the Class-frequency Guided (CFRG) noise schedule, leveraging the insight that low-frequency classes should be endowed with larger-scale noises. To illustrate the effectiveness of our method, we conduct experiments on various tasks, including image generation, image classification, and text-to-image generation, using imbalanced datasets, i.e., CIFAR-100-LT and ImageNet-LT. By employing the CFRG noise schedule, we achieve substantial improvements over baselines, manifesting the crucial role of frequency statistics in noise schedule design.
[CV-61] Two-Stage Cross-Domain Cervical Abnormality Screening with Cytopathological Image Synthesis and Knowledge Distillation
链接: https://arxiv.org/abs/2606.27678
作者: Jincheng Li,Yuzhi He,Yihui Zhan,Xinmei Zhang,Yifei Sun,Zelin Liu,Lichi Zhang,Minye Shao,Lili Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-domain diagnosis remains a major challenge in cervical cell pathology due to pronounced domain shifts across institutions and the subtle visual differences among disease stages, which jointly impair model generalization. To address these issues, this paper proposes a two-stage framework for cross-domain cervical cell detection. In the first stage, we propose the Spatially-Continuous Unpaired Neural Schrödinger Bridge (SC-UNSB), which constructs a synthetic intermediate domain to mitigate cross-domain distribution shifts by modeling image translation as an entropy-regularized optimal transport process. In the second stage, we propose a dual-level feature alignment strategy within a knowledge distillation framework, which progressively aligns shallow structural features and deep semantic representations to facilitate the transfer of domain-invariant knowledge from the source to the target model. Experimental results demonstrate that the proposed method effectively mitigates domain shift and category ambiguity, improving cross-domain detection performance.
[CV-62] DIM-WAM: World-Action Modeling with Diverse Historical Event Memory
链接: https://arxiv.org/abs/2606.27677
作者: Kai Wang,Zhaopeng Gu,Yixiang Chen,Yuan Xu,Qisen Ma,Peng Su,Zhaowen Li,Yan Huang,Liang Wang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World-action models have shown promising robot-manipulation performance by jointly predicting future visual states and actions. However, existing methods mainly rely on short-term history and short-horizon future prediction, which is insufficient for long-horizon tasks whose correct execution depends on earlier observations and task progress. Such temporally dependent tasks require effective use of complementary temporal information, including recent local context, cross-stage historical events, immediate future dynamics, and global task progress. To address long-term forgetting and poor awareness of the global task state, we introduce DiM-WAM, a memory-augmented world-action model that integrates multi-scale historical context, local future dynamics, and global task progress. The memory extracts compact visual event information from real observations, updates multiple memory banks through independent similarity-based merging, and then reads the bank-identity- and time-embedded long-term context to condition video and action denoising. A progress-supervision objective further encourages memory tokens to encode not only completed historical events but also the current task stage and its implications for the remaining task. On RMBench, DiM-WAM raises average success from 28.4% with LingBot-VA to 69.8%, exceeding the explicit-memory Mem-0 baseline at 42.0%. On four real-world Franka tasks, it improves average stage success from 70.7% to 91.5% and full-task success from 52.5% to 80.0%. Project page: this https URL.
[CV-63] Multi-Modal Conditioned High-Resolution Transformer for Urban Electromagnetic Field Map Prediction
链接: https://arxiv.org/abs/2606.27671
作者: Do-Eon Kim,Dongryul Park,Seungyoung Ahn,Namwoo Kang,Seong-heum Kim,Seongsin Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Predicting electromagnetic field (EMF) strength in urban environments is essential for cellular network planning but computationally expensive with physics-based simulators. We propose a multi-conditioned dense prediction framework that generates $500 \times 500$ EMF maps from building layout images and antenna configurations. Our architecture uses a High-Resolution Transformer (HRFormer) backbone with two complementary conditioning mechanisms: Feature-wise Linear Modulation (FiLM) injects scalar antenna parameters into all backbone stages, while cross-attention fuses 1-D radiation pattern tokens with spatial features at the deepest stage. We further introduce transmitter-relative spatial channels encoding distance, proximity, and bearing from the antenna, enabling coordinate-consistent test-time augmentation (TTA) that reduces test MAE by 6.3%. To address the prediction difficulty imbalance across EMF maps, we design a composite loss combining masked L1, multi-scale structural similarity (MS-SSIM), and a focal L1 term that upweights high-signal pixels, outperforming individual loss components in all metrics. Our best model achieves a test MAE of 0.0461, a 25.2% improvement over a plain UNet baseline and 31.8% over an HRFormer-only baseline.
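The FiLM conditioning named in this abstract has a simple generic form: a small network maps the conditioning scalars to one scale and one shift per channel, which are then applied to the feature map. The sketch below shows that generic mechanism only. The single dense layer, the toy weights, and the nested-list feature layout are assumptions for illustration and say nothing about how the authors wire FiLM into HRFormer.

```python
def linear(inputs, weights, bias):
    """Dense layer producing one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def film(feature_map, gamma, beta):
    """Feature-wise linear modulation: y = gamma_c * x + beta_c for channel c.

    `feature_map` is a list of channels, each a flat list of spatial values.
    """
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(feature_map, gamma, beta)
    ]


# Hypothetical example: two scalar antenna parameters condition two channels.
antenna_params = [0.5, 2.0]
gamma = linear(antenna_params, [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
beta = linear(antenna_params, [[0.0, 0.0], [0.0, 0.0]], [0.0, -1.0])
features = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(features, gamma, beta)
```

Since the modulation is per channel rather than per pixel, the same scalars can be injected at every backbone stage at negligible cost, which fits the abstract's statement that FiLM is applied to all stages.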
[CV-64] Explainable AI for Biodiversity Monitoring and Ecological Image Analysis
链接: https://arxiv.org/abs/2606.27667
作者: Brinnae Bent,Holly R. Houliston,Jiayi Zhou,Günel Aghakishiyeva,David W. Johnston
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Artificial intelligence is transforming biodiversity monitoring by enabling automated analysis of ecological imagery collected from camera traps, drones, satellites, underwater platforms, and other sensing systems. These tools can expand the scale and speed of conservation assessments, yet many computer vision models remain difficult to inspect, making it challenging to determine whether predictions are based on ecologically meaningful signals or on spurious correlations, sampling biases, and other artifacts that may undermine conservation decisions. We argue that explainable artificial intelligence (XAI) should become a standard component of ecological model validation because conservation practitioners increasingly depend on understanding not only whether a model is accurate, but why it is accurate. We provide practical guidance for applying XAI to three common ecological computer vision tasks: image classification, object detection, and image segmentation. To illustrate how XAI can support ecological model auditing, refinement, and deployment, we present two case studies using aerial imagery: harbor seal detection and cetacean anatomical segmentation. These examples demonstrate how explanation methods can identify biologically meaningful cues, reveal false positives driven by background and shape confounds, uncover edge and occlusion effects, and guide data collection, augmentation, and retraining strategies. More broadly, they show how explainability can help assess whether model reasoning aligns with ecological understanding. We conclude by identifying key challenges and opportunities. By making model behavior more transparent and scientifically interrogable, XAI can help ensure that AI-supported ecological evidence is more reliable, understandable, and actionable for biodiversity conservation.
[CV-65] MVPruner: Dynamic Token Pruning for Accelerating Multi-view Vision-Language Models in Autonomous Driving ECCV26
链接: https://arxiv.org/abs/2606.27660
作者: Nan Yang,Zhanwen Liu,Linfeng Zhang,Shangyu Xie,Yang Wang,Wenzhuo Zhou,Xiangmo Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ECCV26
Abstract:Vision-Language Models (VLMs) improve generalization and interpretability in autonomous driving but suffer from efficiency issues due to long visual token sequences, particularly in standard multi-view settings. Existing token pruning methods employ fixed pruning rate allocation and static importance metrics, ignoring dynamic inter-view importance differences and the evolving information importance during inference. Our analysis reveals that multi-view VLMs inherently encode task-related view priors in deeper layers and exhibit dynamic information requirements. Motivated by these findings, we propose MVPruner, a two-stage adaptive token pruning method that aligns pruning behavior with the model’s dynamic information requirements. The first stage allocates pruning budgets based on the information diversity of each view, and retains tokens with consistent contribution across stages, ensuring semantic representational capacity. The second stage allocates budgets and selects tokens guided by instruction text to guarantee task alignment. Experimental results on four benchmarks demonstrate the superior performance of our method. For example, DriveMM equipped with MVPruner achieves an 87.3% reduction in FLOPs and a 4.97× speedup in the prefilling phase while retaining 98.5% accuracy on the DriveLM benchmark.
[CV-66] GeoFace: Consistent Multi-View Face Generation with Geometry-Constrained Diffusion
链接: https://arxiv.org/abs/2606.27659
作者: Yeji Choi,Jinhyeok Choi,Jaewon Min,Minkyung Kwon,Jin Hyeon Kim,Seungryong Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present GeoFace, a geometry-constrained multi-view diffusion framework for consistent face generation from a single input. While recent multi-view diffusion models achieve photorealistic synthesis at the per-view level, they lack an explicit mechanism to enforce a shared 3D structure across views, often leading to inconsistent geometry across viewpoints. To address this, GeoFace proposes a unified dual-stream framework for joint generation of multi-view RGB images and 3D face geometry, where the appearance and geometry streams interact through shared attention layers. To encourage the two streams to mutually constrain each other, we introduce a geometry-guided attention alignment loss that supervises the cross-attention between appearance and geometry tokens with 3D-consistent correspondences, enabling the appearance stream to correctly reference pose-invariant geometric cues for robust alignment across viewpoints. Geometry is represented as a canonical UV position map, derived from a FLAME mesh fitted to multi-view observations, serving as a view-invariant shared constraint across all generated views. Experiments on RenderMe-360 and NeRSemble demonstrate that GeoFace consistently outperforms existing methods in both visual quality and cross-view geometric consistency, facilitating more efficient 3D reconstruction.
[CV-67] Temporal-Emerged Prompting for Segment Anything in Multiframe Infrared Small Target Detection ICML2026
链接: https://arxiv.org/abs/2606.27655
作者: Yinghui Xing,Donghao Chu,Shizhou Zhang,Di Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Accurately localizing and segmenting small targets in low signal-to-noise ratio (SNR) infrared sequences remains a challenging task. Since targets are often indistinguishable from the background in individual frames, existing methods, even when equipped with advanced foundation models and powerful inter-frame association mechanisms, still fail to detect them. Motivated by the observation that targets tend to emerge gradually from the background over time and become distinguishable, we propose Temporal-Emerged Prompting for Segment Anything Model (TEP-SAM), a principled framework designed to explicitly exploit such temporal-emerged cues to modulate and prompt SAM. TEP-SAM operates by jointly modeling global motion patterns and local motion deviations to locate potential targets. It further enhances target region features by leveraging motion discrepancy, thereby generating temporal-emerged cues for SAM and enabling non-interactive segmentation. By bridging large-scale semantic pretraining with task-specific temporal modeling, TEP-SAM effectively adapts SAM to the challenging multiframe infrared small target detection task. Extensive experiments demonstrate the effectiveness of our approach, particularly under severely low-SNR conditions and in complex dynamic backgrounds.
[CV-68] VLM-Aware Meta-Optic Front-End Design for Frozen Vision-Language Models
链接: https://arxiv.org/abs/2606.27646
作者: Chanik Kang,Raphaël Pestourie,Haejun Chung
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注: 18 pages, 6 figures, 3 tables
Abstract:Conventional machine-vision pipelines typically rely on high-quality optics that produce clean, human-interpretable images, and optical design has therefore been driven by image-level criteria such as resolution, aberration correction, and pixel fidelity. However, such optics are often impractical for size-, cost-, or form-factor-constrained applications, where compact meta-optics offer an attractive alternative but operate under strict physical efficiency limits. We propose CODA, a co-design framework that optimizes a continuous-density meta-optic front-end for frozen-model recognition using differentiable image formation and adjoint-gradient updates of Maxwell-based simulations. CODA directly optimizes the cross-entropy loss of a fixed zero-shot CLIP classifier without learned reconstruction, image signal processing, or image-fidelity auxiliary objectives. In a two-dimensional simulated imaging benchmark on ImageNet-100, CODA improves CLIP ViT-L/14 zero-shot accuracy from 53.75 ± 3.57% with a focal-concentration baseline to 65.41 ± 3.99%. The optimized optics further transfer without re-optimization across CLIP, SigLIP, and DINOv2 on ImageNet-100, CIFAR-100, and Food-101. These results demonstrate that, under constrained meta-optic imaging, downstream recognition can be improved by aligning optical design with frozen vision-model objectives rather than conventional image-formation criteria.
[CV-69] CascadeOcc: Rethinking 3D Occupancy World Models with Cascaded VQ Representations
链接: https://arxiv.org/abs/2606.27644
作者: Kyumin Hwang,Wonhyeok Choi,Jaeyeul Kim,Jihun Park,Daehee Park,Sunghoon Im
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Signal Processing Letters (SPL), 2026
Abstract:This letter proposes CascadeOcc, a novel occupancy world model that prioritizes intrinsic structural hierarchy over extrinsic auxiliary modalities for autonomous driving. Occupancy world models – forecasting the future driving environment and planning the driving trajectory – effectively bridge perception and planning, but current approaches often heavily rely on external modalities or large language models, failing to fully exploit the inherent structural potential of occupancy representations themselves. To enhance representational capacity for complex 3D scenes, we integrate a cascaded Vector Quantized (VQ) mechanism into an autoregressive framework. Following a coarse-to-fine principle, CascadeOcc progressively refines fine-grained details from global structures through a multi-scale architecture. Additionally, we incorporate a TimeMixer to capture multi-scale temporal dependencies, establishing a dual-hierarchy mechanism in both space and time. Experimental results on 4D occupancy forecasting and motion planning benchmarks demonstrate that CascadeOcc achieves superior performance among vision-centric approaches, validating that optimizing inherent representations is a powerful alternative to relying on external foundation models.
[CV-70] AI-Generated Image Recognition via Fusion of CNNs and Vision Transformers
链接: https://arxiv.org/abs/2606.27637
作者: Xuan-Bach Mai,Hoang-Minh Nguyen-Huu,Quoc-Nghia Nguyen,Hoang-Tung Vu,Minh-Triet Tran,Trung-Nghia Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SOICT 2024
Abstract:Recent advancements in synthetic data technology have opened a new era where images of remarkable quality are generated, blurring the lines between real-life images and those produced by Artificial Intelligence (AI). This evolution poses a significant challenge to ensuring the reliability and authenticity of data, underscoring the need for robust detection methods. In this paper, we present a robust approach aimed at addressing these pressing concerns. Our methodology revolves around leveraging fusion strategies, combining the strengths of multiple detection methods for identifying AI-generated images. Through extensive experimentation on the CIFAKE dataset, our model showcases remarkable performance, achieving an impressive accuracy rate of 97.32%. This accomplishment underscores the efficacy of our approach in accurately distinguishing between AI-generated images and real-life images, thus contributing to the advancement of data authentication techniques amidst the proliferation of synthetic data.
[CV-71] Denoising ICF Images with Multiplicative Uniform Noise: A Self-Supervised Study Based on the Log-Domain Noisier2Inverse Framework
链接: https://arxiv.org/abs/2606.27635
作者: Gyeongha Hwang,Bradley Thomas Wolfe,Naima Naheed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper documents the implementation and evaluation of a self-supervised denoising framework on Inertial Confinement Fusion (ICF) images corrupted by Multiplicative Uniform noise: the Log-Domain Noisier2Inverse framework. This framework is developed and analysed in this work; the key theoretical result – that minimising the log-domain self-supervised loss is equivalent to supervised learning in the transformed domain – is presented with full proof. We document significant implementation challenges arising from the unique characteristics of ICF imagery, describe the fixes applied at each stage, and report final quantitative results. The log-domain approach with per-image JSON Uniform noise loading (Variant B) achieves the best result: a mean PSNR of 21.41 dB and SSIM of 0.8358, a +19.46 dB improvement over the noisy input baseline of 1.95 dB, substantially outperforming BM3D log-domain (4.47 dB, SSIM 0.5181) and Noise2Self (4.75 dB, SSIM 0.0177). Variant A, using fixed Gaussian noise loading, achieves 21.39 dB PSNR and SSIM 0.8436. Of the three evaluated methods, Log-Domain Noisier2Inverse and Noise2Self are entirely self-supervised during training, requiring no clean ground truth data; BM3D is a classical filter-based method requiring no training at all. The clean reference images are used solely for quantitative evaluation of all three methods.
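Why a log-domain formulation suits multiplicative noise can be shown with a small sketch: if a noisy pixel is y = x · n, then log y = log x + log n, so the noise becomes additive and bounded. The code below is only an illustration of that identity, not the paper's pipeline. The noise range [0.5, 1.5], the `eps` offset, and all function names are assumptions.

```python
import math
import random


def add_multiplicative_uniform_noise(pixels, low, high, rng):
    """Corrupt each pixel by an independent factor n ~ Uniform(low, high)."""
    return [p * rng.uniform(low, high) for p in pixels]


def to_log_domain(pixels, eps=1e-8):
    """Map intensities to the log domain; eps guards against log(0)."""
    return [math.log(p + eps) for p in pixels]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
clean = [0.2, 0.5, 0.9]         # hypothetical clean intensities
noisy = add_multiplicative_uniform_noise(clean, 0.5, 1.5, rng)

# In the log domain the corruption is an additive residual close to log(n).
residual = [ln - lc for ln, lc in
            zip(to_log_domain(noisy), to_log_domain(clean))]
```

Each entry of `residual` lies between log(0.5) and log(1.5), independent of the clean intensity, which is what makes additive-noise denoising tools applicable after the transform.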
[CV-72] Qwen-Image-2.0-RL Technical Report
链接: https://arxiv.org/abs/2606.27608
作者: Yixian Xu,Kaiyuan Gao,Yuxiang Chen,Yilei Chen,Zecheng Tang,Zihao Liu,Zikai Zhou,Deqing Li,Hao Meng,Kuan Cao,Jiahao Li,Jie Zhang,Liang Peng,Lihan Jiang,Ningyuan Tang,Shengming Yin,Tianhe Wu,Xiaoyue Chen,Yan Shu,Yanran Zhang,Yi Wang,Yu Wu,Yujia Wu,Zekai Zhang,Zhendong Wang,Xiao Xu,Kun Yan,Chenfei Wu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 6 figures, 1 table
Abstract:We present Qwen-Image-2.0-RL, a post-training pipeline that applies reinforcement learning from human feedback (RLHF) and on-policy distillation (OPD) to improve both the visual quality and instruction-following capability of the Qwen-Image-2.0 diffusion model. To provide reliable reward signals, we construct task-specific composite reward models by fine-tuning vision-language models with a pointwise scoring paradigm and chain-of-thought reasoning. For text-to-image generation, the reward models cover alignment, aesthetics, and portrait fidelity dimensions. For image editing tasks, the reward system addresses instruction-following accuracy and face identity preservation. Building on this reward system, we develop a scalable GRPO-based RL training framework, incorporating a hybrid classifier-free guidance (CFG) strategy to preserve pre-trained knowledge, prompt curation via intra-group reward range filtering, and per-category reward weight calibration. To merge the task-specialized RL policies for T2I and editing, we propose on-policy distillation as the final training stage, which consolidates multiple teachers into a single student model through trajectory-level velocity matching. Extensive evaluation shows that Qwen-Image-2.0-RL achieves 57.84 overall score on Qwen-Image-Bench (+2.61 over the base model), Elo ratings of 1193 in text-to-image arena (+78) and 1349 in image edit arena (+93), demonstrating consistent gains in aesthetic quality, prompt adherence, and editing accuracy.
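As a rough illustration of two ingredients named in this abstract, the sketch below computes group-relative advantages in the style commonly associated with GRPO and applies a simple intra-group reward range filter. The report's exact formulas are not given in the abstract, so the normalization, the `min_range` threshold, and the function names are all assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards of samples generated from the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def keep_prompt(rewards, min_range=0.1):
    """Keep a prompt only if its samples' rewards differ enough.

    A group whose rewards are nearly identical gives almost no
    learning signal; min_range is a hypothetical threshold.
    """
    return max(rewards) - min(rewards) >= min_range


adv = group_relative_advantages([1.0, 2.0, 3.0])
```

With this normalization the advantages within a group sum to zero, so samples are only rewarded relative to their siblings from the same prompt.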
[CV-73] On the stability of scale-space metrics
链接: https://arxiv.org/abs/2606.27605
作者: William Leeb
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 7 figures
Abstract:We study the stability of a classical family of metrics defined over functions’ Gaussian scale-space representations, focusing on the comparison of images (functions of two variables). These metrics have precedents both in harmonic analysis, specifically the theory of Besov spaces, and in classical methods of image processing; special cases are also known to be metrically equivalent to certain Wasserstein distances. We quantify these metrics’ robustness to geometric deformations, and introduce rotationally-invariant versions that are stable to changes in angle when comparing tomographic projections. We also describe computationally efficient algorithms for evaluating the metrics from finite samples, and prove their robustness to additive noise. The results are illustrated through numerical experiments.
[CV-74] Spectral Subsurface Scattering from RGB via Biophysical Skin Inversion
链接: https://arxiv.org/abs/2606.27604
作者: Carlos Aliaga,Adrian Jarabo
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures
Abstract:In this paper we present a spectral optical inversion for skin for path tracing-based rendering of subsurface scattering. Skin is a complex multilayered medium, with appearance determined by the mixture of biophysical chromophores. However, current methods rely on medium homogenization, with optical parameters obtained via albedo inversion from a reflectance texture and hand-tuned scattering distance and anisotropy. This results in significant art-skilled manual labor for authoring, and an inaccurate scattering profile for skin. To solve these problems, we generalize existing albedo inversion techniques, and propose a framework that predicts full-spectral skin scattering parameters from a single RGB diffuse albedo. Our method builds upon a new mixture-of-media representation that approximates the aggregated multilayered appearance of skin by mixing the aggregated scattering of three uncorrelated media. We train a chained neural decoder that maps RGB diffuse albedo to the optical properties of the mixture of media, including anisotropy, scattering radius and scattering albedo. Then, we show this mixture can be used in a random-walk-based path tracer with minimal modifications, by simply randomly selecting the medium to traverse.
[CV-75] Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding ICML2026
链接: https://arxiv.org/abs/2606.27596
作者: Liu Yu,Can Chen,Ping Kuang,Zhikun Feng,Fan Zhou,Gillian Dobbie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 pages, 25 figures. Accepted by ICML 2026
Abstract:Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination. Deviating from the prevailing attention intensity assumption, we reveal a deeper dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence to lock onto language priors. This establishes a pathological shortcut that bypasses visual grounding. To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework. Fox diagnoses structural misalignment using a visual attention entropy probe to localize risky mediators unsupervisedly. We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path. Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency. Extensive experiments demonstrate that Fox achieves SOTA performance, outperforming SID by 29.1% while preserving linguistic richness. Code is available at this https URL.
[CV-76] CoIn: Comprehensive 2D-3D Inpainting with Gaussian Splatting Guidance
链接: https://arxiv.org/abs/2606.27584
作者: Hana Kim,Minje Kim,Tae-Kyun Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:3D scene inpainting is essential for reconstructing areas corrupted by occlusions or limited viewpoints. While recent methods leverage Gaussian Splatting (GS) for efficient 3D editing, they often depend on precise multi-view segmentation masks and are inherently constrained to object removal tasks. We propose CoIn, a novel framework that bridges 2D inpainting models and 3DGS through a multi-stage consistency pipeline. Our approach first generates initial inpainted images using a diffusion model, enabling the use of arbitrary-shaped masks and diverse tasks like object insertion. We then introduce Reference Adaptive GS with Feature Attention to reconstruct a coarse 3D scene by adaptively weighing towards a reference view (2D → 3D). This 3D representation provides geometric guidance to the diffusion process via GS-based Reference Feature Warping, ensuring multi-view consistency (3D → 2D). Finally, a Texture-Enhancing Discriminator refines the 3D scene to achieve high photometric realism (2D → 3D). Experiments show that CoIn, effectively leveraging bidirectional information flow, achieves state-of-the-art performance and effectively handles both object removal and object insertion with flexible mask input.
[CV-77] Beyond Points: Spherical Distributional Part Prototypes for Interpretable Classification
链接: https://arxiv.org/abs/2606.27582
作者: Duarte Leão,Diogo Pereira Araújo,Catarina Barata,Carlos Santiago
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Prototype-based neural networks aim to provide intrinsic interpretability by grounding predictions in a small set of part prototypes. However, modern vision backbones typically operate in normalized, directional embedding spaces where each semantic part exhibits substantial intra-class variability. As a result, point prototypes often become redundant or unstable, hurting both explanation quality and robustness. We propose vMFProto, a distributional part-prototype framework that models each class as a mixture of von Mises-Fisher components on the hypersphere. Each prototype learns its own concentration, capturing part-specific variability, and we use entropic optimal transport (OT) to obtain structured patch-to-prototype assignments. A two-stage training schedule performs OT-driven prototype discovery followed by end-to-end refinement with patch-level distillation and distribution-aware diversity regularization. Experiments on CUB-200-2011, Stanford Dogs, and Stanford Cars with frozen DINO backbones show that vMFProto achieves state-of-the-art explanation quality (consistency, stability, and distinctiveness) with competitive accuracy. Qualitative results confirm that vMFProto yields localized, non-redundant part evidence.
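The entropic optimal transport step mentioned in this abstract is typically solved with Sinkhorn iterations. The sketch below is a generic textbook version, not vMFProto's code: the cost matrix, marginals, regularization strength `reg`, and iteration count are hypothetical, and the paper's actual cost (presumably derived from patch-to-prototype similarity) is not specified here.

```python
import math


def sinkhorn(cost, row_marginal, col_marginal, reg=0.1, n_iters=200):
    """Entropic OT: return a transport plan for the given cost matrix.

    cost[i][j] is the cost of assigning patch i to prototype j.
    The plan is diag(u) * K * diag(v) with K = exp(-cost / reg).
    """
    n, m = len(row_marginal), len(col_marginal)
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marginal[i] / sum(K[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginal[j] / sum(K[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 2 patches x 2 prototypes with uniform marginals.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

The resulting plan puts most mass on the low-cost pairs while still respecting both marginals, which is what gives structured rather than winner-take-all assignments.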
[CV-78] Distribution-based deep multiple instance learning for tumor proportion scoring in NSCLC
链接: https://arxiv.org/abs/2606.27579
作者: Krzysztof Pysz,Artur Bartczak,Jarosław Kwiecień,Piotr Krajewski,Witold Dyrka
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
备注:
Abstract:Accurate assessment of tumor proportion score (TPS) in non-small cell lung cancer (NSCLC) is critical for treatment planning and prognosis. Key challenges include the tedious manual work required to annotate each slide, combined with the limited number of experts certified for this task. Multiple instance learning (MIL) has proven to be an effective approach for predicting TPS scores at the slide level; however, existing methods struggle with non-expressive (zero class) images. Our approach involves two models: (1) an embedding-extraction and multiclass-classification network that captures the histopathological features of individual patches, and (2) a MIL model that aggregates these embeddings to predict zero-inflated beta (ZIBeta) parameters representing the overall TPS probability distribution for the entire slide. Using only slide-level TPS scores as labels, we demonstrate how this end-to-end framework can leverage a novel distribution-based architecture to improve prediction accuracy and explainability. ZIBeta modeling significantly outperforms baseline linear and ridge regression while capturing expected accuracy through distribution concentration.
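To make the zero-inflated beta (ZIBeta) output concrete, here is a minimal sketch of its log-likelihood and mean for a score y in [0, 1). It follows the standard definition (a point mass `pi` at zero plus a Beta(a, b) density scaled by 1 - pi), and is not taken from the paper; the parameter names and the decision to reject y = 1 are assumptions.

```python
import math


def zibeta_log_likelihood(y, pi, a, b):
    """Log-likelihood of y under a zero-inflated Beta(a, b) with P(y=0)=pi."""
    if not (0.0 < pi < 1.0 and a > 0.0 and b > 0.0):
        raise ValueError("need 0 < pi < 1 and a, b > 0")
    if y == 0.0:
        return math.log(pi)                      # point mass at zero
    if not 0.0 < y < 1.0:
        raise ValueError("y must lie in [0, 1)")
    # log of the Beta function B(a, b) via log-gamma for stability.
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = ((a - 1.0) * math.log(y)
               + (b - 1.0) * math.log(1.0 - y) - log_beta_fn)
    return math.log(1.0 - pi) + log_pdf


def zibeta_mean(pi, a, b):
    """Expected score: the zero component contributes nothing."""
    return (1.0 - pi) * a / (a + b)
```

A tightly concentrated Beta component (large a + b) corresponds to a confident slide-level prediction, which is one way a distributional output can also express expected accuracy.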
[CV-79] DeLux: Cross-Modal Local Artifact Restoration in Video Using Neuromorphic Data
链接: https://arxiv.org/abs/2606.27576
作者: Bartosz Stachowiak,Dariusz Brzezinski
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Conventional RGB cameras suffer from lighting artifacts such as flare, glare, flicker, and overexposure, leading to irrecoverable information loss that necessitates computational restoration. However, existing approaches treat these problems in isolation, failing to recover structural details completely obscured by complex spatially discrete image degradations. In this paper, we propose a novel cross-modal restoration paradigm and present DeLux, a modular proof-of-concept pipeline that leverages neuromorphic event streams as a structural prior to guide the targeted detection and inpainting of lighting artifacts in RGB video. Validation on synthetic benchmarks and real-world automotive footage demonstrates that DeLux effectively suppresses local artifacts and restores affected regions. The proposed approach outperforms existing RGB-only baselines and event-guided HDR models, achieving an average MS-SSIM of over 0.99 across all artifact types and demonstrating up to an 88% reduction in artifact severity in real-world automotive footage. The synthetic artifact generation tools and curated real-world evaluation datasets are made publicly available to foster future research on cross-modal restoration.
[CV-80] Perceptual 3D Simulation With Physical World Modeling CVPR2026
链接: https://arxiv.org/abs/2606.27575
作者: Wanhee Lee,Klemen Kotar,Rahul Mysore Venkatesh,Jared Watrous,Daniel L. K. Yamins
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published as a conference paper at CVPR 2026
Abstract:Predicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3D transform signal for conditioning the world model at inference time. The persistent scene memory integrates predictions over time, enabling online updates and consistency under uncertainty. By combining learned inference with explicit geometric structure, P3Sim balances data-driven flexibility with built-in inductive bias. This design yields a flexible perceptual simulator that generalizes across diverse 3D transformation tasks, such as novel view synthesis, object manipulation, and dynamic scene prediction, advancing toward general purpose 3D scene understanding and transformation.
[CV-81] Radar Guided Camera Verification for Automatic Emergency Braking Rethinking Object Detection in Radar Camera Fusion
链接: https://arxiv.org/abs/2606.27556
作者: Ram Charan Akula,Sivanathan Kandhasamy,Manikandan Ganesan
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 8 figures
Abstract:Radar-camera fusion is widely used in Automatic Emergency Braking (AEB) systems because radar provides reliable range and velocity measurements while cameras provide visual confirmation of objects. Most deployed systems perform this confirmation using computationally intensive object detectors. However, if the radar has already localized a target, the camera may only need to verify the obstacle's presence rather than solve the full problem of identifying the object. Our work proposes a radar-scoped edge density gate that performs obstacle verification within radar-guided image regions of interest. This method requires no training data, model weights, or GPU acceleration and was integrated into a complete radar-camera fusion AEB system with brake-by-wire actuation. Evaluated on a real instrumented vehicle across 72 driving sessions and 131,603 camera frames, the proposed approach reduced the camera search space by up to 98.7 percent, achieved a mean processing latency of 0.121 ms per ROI, an AUC of 0.898, and a recall of 0.994. Across 33 staged threat scenarios, the complete AEB system recorded zero missed brake events.
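The idea of an edge density gate inside a radar-guided region of interest can be sketched in a few lines. This is an illustrative guess at the general technique, not the authors' method: the simple finite-difference gradient, both thresholds, the ROI tuple layout, and the function names are assumptions.

```python
def edge_density(gray, roi, grad_threshold=30):
    """Fraction of ROI pixels whose local gradient exceeds a threshold.

    gray: 2D list of grayscale intensities (rows of ints).
    roi: (top, left, bottom, right) with exclusive bottom/right bounds.
    """
    top, left, bottom, right = roi
    edges = total = 0
    for r in range(top, bottom - 1):
        for c in range(left, right - 1):
            gx = abs(gray[r][c + 1] - gray[r][c])   # horizontal difference
            gy = abs(gray[r + 1][c] - gray[r][c])   # vertical difference
            total += 1
            if gx + gy > grad_threshold:
                edges += 1
    return edges / total if total else 0.0


def obstacle_confirmed(gray, roi, density_threshold=0.05):
    """Gate: confirm the radar target if the ROI is textured enough."""
    return edge_density(gray, roi) >= density_threshold


# Hypothetical 8x8 test images: a flat patch and a checkerboard.
flat = [[10] * 8 for _ in range(8)]
checker = [[255 if (r + c) % 2 else 0 for c in range(8)] for r in range(8)]
```

Because only the small ROI is scanned and no model is evaluated, a gate of this kind needs no training data or GPU, which matches the motivation stated in the abstract.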
[CV-82] Understanding Cross-Rig Generalization in Automotive Perception: a Multi-Rig Benchmark and Rig Variation Metrics ECCV2026
链接: https://arxiv.org/abs/2606.27554
作者: Tim Alexander Bader,Tim Dieter Eberhardt,Maximilian Dillitzer,Wilhelm Stork
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026; Project Page: this https URL
Abstract:Camera-based perception systems for autonomous driving are typically developed and evaluated using fixed sensor rigs, while real-world vehicle fleets exhibit substantial variation in camera placement, orientation, field of view, and camera count. This mismatch introduces a cross-rig domain gap in which only the geometric observation process changes. To study this effect under controlled conditions, we introduce Plentiful CARLA Camera Rigs, a benchmark that renders identical driving scenes under 14 systematically designed camera rigs. This setup enables direct analysis of cross-rig generalization without confounding changes in scene content or appearance. Using the benchmark, we analyze cross-rig transfer behavior of representative multi-view perception architectures and observe substantial performance shifts induced by geometric rig variation. To facilitate structured analysis, we further introduce two calibration-based descriptors derived from rig metadata: Rig Variance, capturing internal rig diversity, and Rig Contrastive Distance, measuring geometric discrepancy between rigs. Our experiments show that geometric rig differences strongly correlate with relative cross-rig performance shifts and that Rig Contrastive Distance provides a reliable proxy for ranking transfer difficulty between sensor rigs.
[CV-83] Beyond MoCap: Scaling Motion Tokenizers with Synthetic Human Motion for Generative Modeling
链接: https://arxiv.org/abs/2606.27547
作者: Yiwen Yan,Wanning He,Yu-Wing Tai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human motion generation models are fundamentally constrained by the limited diversity of motion capture datasets, which predominantly contain common, repetitive actions and fail to cover the long tail of complex human movements, resulting in a restricted motion vocabulary in learned latent representations and poor generalization to rare, compositional, and highly dynamic motions. In this work, we propose a framework for expanding the motion representation space by leveraging large-scale synthetic human motion, introducing a data generation pipeline that produces diverse, physically plausible motion sequences beyond the distribution of existing datasets and integrating it with a redesigned VQ-VAE tokenizer that adapts to this expanded motion space. Unlike conventional tokenizers trained on narrow data distributions, our approach jointly scales both the training distribution and the discrete codebook, enabling the model to capture a significantly richer set of motion primitives. We demonstrate that training with synthetic motion substantially improves the coverage and compositionality of the learned motion vocabulary, leading to consistent gains across motion generation tasks such as text-to-motion and motion continuation, while remaining fully compatible with existing frameworks including MotionGPT. Our results suggest that the primary bottleneck lies in the limited support of the learned motion representation, rather than model architecture alone. Scaling synthetic motion in tandem with representation learning offers a principled path toward more expressive, controllable, and generalizable human motion synthesis.
[CV-84] MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
Link: https://arxiv.org/abs/2606.27537
Authors: Haoyu Chen,Kaichen Zhou,Hang Hua,Kaile Zhang,Jingwen Qian,Wufei Ma,Haonan Chen,Chunjiang Liu,Yizhou Zhao,Xiaoyuan Wang,Weiyue Li,Alan Yuille,Paul Pu Liang,Yilun Du
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video generation models aspire to simulate dynamic environments, and several benchmarks now evaluate memory consistency across frames. However, most assess consistency only while the target remains in view, and the few that force objects out of view evaluate static scenes where nothing changes during occlusion. To bridge this gap, we introduce MemoBench, a diagnostic benchmark built around the disappear-and-reappear paradigm in dynamically changing environments: a target object undergoes a physical process, disappears from view, and must be correctly recovered in its updated state upon reappearance. We curate 360 ground-truth clips spanning synthetic and real-world scenes, and design an evaluation suite combining automated metrics with VQA-based assessment across four diagnostic pillars. Evaluation of eight state-of-the-art models reveals key insights and open challenges regarding memory consistency under the disappear-and-reappear paradigm.
[CV-85] Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge ICML2026
Link: https://arxiv.org/abs/2606.27527
Authors: Thomas Shih-Chao Liang,Zhuoran Yu,Yong Jae Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026
Abstract:Large Language Models (LLMs) possess broad conceptual knowledge acquired through large-scale text pretraining, yet their potential to supervise models in other modalities remains underexplored. In this work, we propose LaViD (Language-to-Visual Knowledge Distillation), a simple and effective framework for transferring high-level semantic knowledge from a language-only teacher to a vision-only student model. Instead of relying on paired multimodal data, LaViD elicits conceptual signals from an LLM by prompting it to generate multiple-choice questions (MCQs) that probe semantic distinctions between visual classes. Each class is mapped to a soft label distribution over these MCQs, forming a rich conceptual signature that guides the student through an auxiliary distillation loss. Notably, despite using a language-only teacher without access to image data, LaViD consistently outperforms recent methods like MaKD that distill from vision-language models across multiple fine-grained benchmarks. It also achieves competitive or superior performance compared to state-of-the-art visual distillation methods such as DKD and MLKD, with further gains when combined with logit standardization. On the Waterbirds dataset, LaViD substantially improves worst-group accuracy, demonstrating enhanced robustness to spurious correlations with distillation. Code is available at this https URL.
[CV-86] Tessellating The Earth ECCV2026
Link: https://arxiv.org/abs/2606.27514
Authors: Daniel Cher,Hamza Iqbal,Eric Xing,Brian Wei,Nathan Jacobs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: European Conference on Computer Vision – ECCV 2026
Abstract:Geolocation encoders, which map geographic coordinates to learned representations, are emerging as an effective means of capturing visual and non-visual characteristics from a latitude-longitude pair alone. However, existing approaches project coordinates onto fixed bases (e.g., spherical harmonics), allocating representational capacity uniformly and devoting equal resources to the open ocean and to a developing city. We introduce Tessellating the Earth (TTE), a location encoder built from learnable Spherical Voronoi partitions that concentrates representational capacity where it is needed in a fully differentiable, end-to-end manner. Each Voronoi site carries its own embedding and migrates during training toward discriminative areas. To bridge the gap between local spatial structure and global semantic understanding, we introduce global semantic tokens: a set of shared learnable concept tokens that distill semantic knowledge from the satellite imagery into a compact vocabulary the location encoder can reference at inference, enabling geographically distant sites covering similar environments to share semantics. TTE sets a new state of the art for location encoders across a suite of geospatial classification and regression tasks, and achieves the strongest results when used as a geographic prior for fine-grained species classification on iNaturalist-2018. Code and weights are available at this https URL.
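To make the idea of a spherical Voronoi partition concrete, here is a minimal sketch of the lookup step only: a coordinate is assigned to the site with the smallest great-circle distance, which is the site whose unit vector has the largest dot product with the query. The soft variant with a `temperature` parameter is an assumption added for illustration of how such a lookup could be made differentiable; it is not taken from the paper, and the function names are hypothetical.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def site_similarities(lat_deg, lon_deg, sites):
    """Dot product between the query point and every site (cosine of the arc angle)."""
    p = to_unit_vector(lat_deg, lon_deg)
    return [sum(a * b for a, b in zip(p, to_unit_vector(*s))) for s in sites]


def nearest_site(lat_deg, lon_deg, sites):
    """Hard Voronoi assignment: index of the site closest on the sphere."""
    dots = site_similarities(lat_deg, lon_deg, sites)
    return max(range(len(sites)), key=dots.__getitem__)


def soft_assignment(lat_deg, lon_deg, sites, temperature=0.05):
    """Softmax over similarities (assumed relaxation, not the paper's method)."""
    dots = site_similarities(lat_deg, lon_deg, sites)
    m = max(dots)  # subtract the max for numerical stability
    exps = [math.exp((d - m) / temperature) for d in dots]
    total = sum(exps)
    return [e / total for e in exps]
```

With sites given as `(lat, lon)` pairs, `nearest_site` returns the cell a coordinate falls in, and `soft_assignment` returns weights that could blend the per-site embeddings.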
[CV-87] Structured-Li-GS: Structured 3D Gaussians Splatting with LiDAR Incorporation and Spatial Constraints
Link: https://arxiv.org/abs/2606.27509
Authors: Huaiyuan Weng,Huibin Li,Chul Min Yeum
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, ISPRS Congress 2026
Abstract:In this study, we develop a Structured framework for Gaussian Splatting (3DGS) with LiDAR integration (Structured-Li-GS). It is a lightweight Gaussian Splatting pipeline that leverages LiDAR-inertial-visual SLAM. Structured-Li-GS achieves high-quality 3D reconstructions with fewer Gaussians by training on accurate, dense, colorized point clouds. Gaussian primitives are anchored using sub-sampled point clouds, and their ellipsoidal parameters are initialized from local surface geometry. Our training strategy integrates a comprehensive set of loss terms, including photometric, flattening, offset, depth, and normal losses, guided by the dense point cloud, enabling accurate reconstruction without Gaussian densification. This approach produces up-to-scale, high-fidelity results with a moderate model size. For experimental validation, we develop a custom hardware-synchronized LiDAR-camera handheld scanner. Experiments on both benchmark datasets and our real-world in-house dataset demonstrate that Structured-Li-GS surpasses state-of-the-art methods while using fewer Gaussians.
[CV-88] TruEye: Fine-Grained Detection of AI-Generated Human Subjects in Images
Link: https://arxiv.org/abs/2606.27505
Authors: Jay Barot,Dan Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 3 figures
Abstract:AI generated images are proliferating across the Internet. While some are used for entertainment, others are weaponized for fraud and social engineering attacks on social media users. Existing detectors overfit to generators seen during training, treat detection as opaque binary classification, or rely on costly Large Language Models (LLMs) to explain their outputs. In this paper, we present TruEye, a novel model for fine grained detection and localization of AI manipulated or AI generated humans and scenes. Unlike conventional detectors that assign a single authenticity label, TruEye is the first to distinguish among five compositional categories of synthetic content, including the most challenging case in which a real human is composited into a real scene where they were never physically present. At its core is a mask conditioned dual stream transformer that separates human and scene tokens while preserving patch level spatial correspondence. Specialized reasoning within each stream and region gated cross attention enforce semantic coherence between subject and background, while token level supervision and global compositional classification yield robust, interpretable predictions without invoking an LLM. By restricting intra stream attention to semantically coherent tokens, TruEye also runs over 100× faster than LLM based competitors. Experiments on 6 datasets and our newly curated FineSyn dataset show that TruEye surpasses state of the art detectors with higher accuracy, faster inference, and stronger generalization to unseen AI generated or manipulated images.
[CV-89] ReWorld: Learning Better Representations for World Action Models
Link: https://arxiv.org/abs/2606.27504
Authors: Tianze Xia,Lijun Zhou,Kaixin Xiong,Jingfeng Yao,Yu Zhu,Zhenxin Zhu,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Haiyang Sun,Xinggang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 3 figures
Abstract:World Action Models (WAMs) model future environment evolution under action conditioning, offering a scalable paradigm for autonomous driving. However, existing approaches focus largely on model architecture design, and how a WAM can efficiently learn better world representations for planning remains underexplored. To address this gap, we propose ReWorld, the first representation learning framework specifically designed for autonomous-driving world action models. In WAMs, standard training supervises only the output ends of the generation and planning modules, leaving the intermediate representations that carry world knowledge to be shaped only indirectly, as byproducts of fitting these outputs. The core idea of ReWorld is to treat intermediate representations as direct targets of optimization, shaping them along three complementary dimensions. On the Video DiT responsible for generation, we impose future-predictive supervision on its intermediate representations. On the Action DiT responsible for planning, we first align its intermediate representations cross-modally with the video world representation, then further shape them to be discriminative around safety-critical boundaries via hard-negative supervision. In addition, we systematically analyze the effectiveness of existing representation learning methods in video generation world models, and discuss why their performance is limited on this task. Experiments on nuScenes and NAVSIM show that ReWorld improves fine-tuned video generation by 23.9% in FVD (81.3 to 61.9), raises closed-loop PDMS from 89.1 to 90.4 without any post-training such as RL or post-processing, and accelerates from-scratch convergence by approximately 2x.
[CV-90] SelectAnyTree: A Promptable Instance Segmentation Model for 3D Forest LiDAR Point Clouds
Link: https://arxiv.org/abs/2606.27491
Authors: Trung Thanh Nguyen,Daniel Lusk,Kilian Gerberding,Janusch Vajna-Jehle,Tuan-Anh Vu,Duc Viet Le,Tu Vo,Phi Le Nguyen,Yasutomo Kawanishi,Takahiro Komamizu,Ichiro Ide,Julian Frey,Teja Kattenborn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Automated instance segmentation of forest LiDAR point clouds is increasingly critical as forest monitoring moves toward scalable, detailed, 3D measurement. Yet, progress is constrained by label scarcity for tree instances; a single hectare can hold millions of points and hundreds of overlapping, complex crowns, making manual annotation from scratch with raw data laborious and error-prone. Annotations are often corrected from automatic pre-segmentations, but remain costly as these provide no interactive or AI-assisted refinement. Inspired by the promptable paradigm of foundation segmentation models, we propose SelectAnyTree, a promptable instance segmentation model that delineates any individual tree in a 3D forest point cloud from a few clicks. It introduces two key components: Click-to-query prompt encoder and Canopy Height Model (CHM)-guided first prompt. The former turns each click into a single content query, encoding its 3D position and positive/negative polarity together with a pooled local backbone feature. The latter provides treetops as a geometry- and ecologically guided first prompt without any user input. The resulting prompt query is then decoded into one tree mask by a state-space query decoder to efficiently capture long-range context in large-scale forest scenes with linear-time complexity. We evaluate SelectAnyTree in interactive and instance-level settings across seven diverse forest regions and an independent held-out test dataset, demonstrating strong generalization beyond the training domains. It segments a target tree to 78.2 Intersection over Union (IoU) from a single click, 24.8 points above the strongest promptable baseline, and reaches every accuracy target with the fewest clicks, while using far fewer parameters and less inference time than prior promptable models. The source code is available at this https URL.
[CV-91] Fine-tuning a multimodal large language model for clinician-grade autism behavioral scoring from short home videos
Link: https://arxiv.org/abs/2606.27484
Authors: Mohammadmahdi Honarmand,Parnian Azizian,Aaron Kline,Kae Nurge,Zerin Nasrin Tumpa,Saimourya Surabhi,Kaitlyn Dunlap,Yang Qian,Ali Kargarandehkordi,Sameer Neupane,Peter Washington,Dennis P. Wall
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Autism spectrum disorder (ASD) affects 1 in 31 US children, yet median age at diagnosis exceeds four years. Artificial intelligence pipelines that provide quantified diagnosis using easy to access observational data (e.g., home videos) could help with earlier diagnosis, and timely delivery of early treatments. We fine-tuned Gemini 2.5 Pro on 400 clinician-rated home videos with low-rank adaptation, training only on 30 behavioral features previously validated to produce reliable predictions when passed to various ML models. On 99 held-out children (49 ASD, 50 neurotypical), inter-rater reliability with clinicians (per-feature weighted Cohen’s kappa) improved by 40% (p < 0.001), with 27 of 28 evaluable features improving. As an emergent zero-shot capability, direct ASD diagnosis F1 improved by 53% (p < 0.001), matching or exceeding clinician outcomes. Classifier-assisted pipelines using fine-tuned LLM-derived behavioral features matched clinician-scored inputs across all tested pathways and achieved 77% accuracy (95% CI: 68-85%) and an AUC of 86% (95% CI: 78-92%). Fine-tuned multimodal LLMs can serve as scalable behavioral feature extractors for use in autism assessment and diagnosis.
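The reliability metric named above, weighted Cohen's kappa, can be sketched as follows. This is a minimal version assuming linear disagreement weights and an ordinal scale coded as integers `0..num_levels-1`; the abstract does not say which weighting scheme the authors used, so treat the choice of linear weights as an assumption.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, num_levels):
    """Linearly weighted Cohen's kappa for two raters on an ordinal scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("need at least two rating levels")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts
    marg_a = Counter(rater_a)                  # marginal counts, rater A
    marg_b = Counter(rater_b)                  # marginal counts, rater B
    max_dist = num_levels - 1
    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = abs(i - j) / max_dist  # 0 on the diagonal, 1 at the extremes
            obs_disagreement += weight * observed[(i, j)] / n
            exp_disagreement += weight * (marg_a[i] / n) * (marg_b[j] / n)
    if exp_disagreement == 0.0:
        raise ValueError("kappa is undefined when chance disagreement is zero")
    return 1.0 - obs_disagreement / exp_disagreement
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.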
[CV-92] SemCityLoc: Aerial 6DoF Localization Using Semantic 3D City Models ECCV2026
Link: https://arxiv.org/abs/2606.27444
Authors: Jingfeng Mao,Xuyang Chen,Qilin Zhang,Oussema Dhaouadi,Guangming Wang,Brian Sheil,Daniel Cremers,Yan Xia,Olaf Wysocki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ECCV 2026
Abstract:Aerial 6DoF localization typically relies on precise GNSS signals or radiometrically rich 3D reconstructions, limiting scalability and on-board deployment. We propose SemCityLoc, a semantic-geometric alignment system that reframes aerial pose estimation as structured surface registration between foundation-model-derived visual priors and standardized LoD-compliant 3D city models. Instead of matching sparse contours or dense texture, our method aligns semantic surfaces and monocular depth with lightweight semantic 3D building models, increasing pose discriminability in repetitive and occluded urban environments. To enable accurate evaluation, we introduce SemCityLockeD, the first real-world benchmark combining centimeter-accurate UAV poses with standardized LoD1–LoD3 semantic city models and challenging low-altitude imagery. Experiments demonstrate substantial improvements over existing map-based approaches, improving recall by up to 36% and reducing mean positional error from 9.89m to 2.62m in challenging urban canyons. Our results indicate that semantically structured geometry provides sufficient and scalable constraints for high-precision aerial localization without radiometric scene reconstructions. The code and data are available at this https URL.
[CV-93] Not All Relations Rotate Alike: Transformation-Aware Decoupling for Viewpoint-Robust 3D Scene Graph Generation
Link: https://arxiv.org/abs/2606.27412
Authors: Jingjun Sun,Chaowei Wang,Zhirui Liu,Jiaxu Tian,Ming Yang,Yaoxing Wang,Shan Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments:
Abstract:3D Scene Graph Generation (3DSGG) represents 3D scenes as structured object-relation-object graphs, providing a compact relational abstraction for spatial understanding. In embodied intelligence settings, the same 3D scene may be observed by agents from viewpoints that differ by yaw rotations. However, current 3DSGG models often fail to produce relation predictions that follow the expected transformation behavior under such viewpoint shifts. This behavior reveals an empirical mismatch related to predicate-level transformation heterogeneity: directional predicates such as left, front, right, and behind should transform with the observation frame, whereas most contact, support, and semantic predicates such as standing on and attached to should remain stable. To reduce this mismatch, we propose Transformation-Aware Decoupling (TAD), a viewpoint-robust 3DSGG framework that decouples relation reasoning according to predicate transformation behavior and is supported by viewpoint-stable object representations. TAD decomposes relation reasoning into two parts: one learns cues that should stay stable across viewpoints, while the other learns directional cues that should change with the observation frame. The two parts are merged for standard multi-label predicate prediction. Transformation-specific descriptors and group-aware auxiliary supervision encourage the two branches to capture complementary relation cues. Extensive experiments on 3DSSG show that TAD achieves state-of-the-art robustness under yaw viewpoint changes without training-time rotation augmentation, while maintaining competitive performance under the standard benchmark. The project page is available at this https URL.
[CV-94] RANSAC Scoring Done Right
Link: https://arxiv.org/abs/2606.27385
Authors: James Pritts,Felix Seegräber,Kevin Köser
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: pre-print
Abstract:The most widely used RANSAC variants score candidate models by counting inliers or summing per-point scores that saturate beyond a residual threshold. Every such score requires a user-supplied parameter that is a function of the inlier scale, which must itself be estimated from contaminated data. We remove this dependence by reversing the usual order of inference: rather than estimating the scale and then scoring against it, we marginalize the inlier scale analytically in closed form under a conjugate Inverse-Gamma prior for a fixed inlier partition, then optimize over partitions. A single closed-form expression spans the non-informative Jeffreys limit and informative empirical-Bayes priors, so the same score adapts across data-rich and data-scarce regimes without any change to the algorithm. The proposed RANSAC score is the first in which the inlier scale is genuinely absent from the formula. The score admits O(N log N) computation via sort-and-sweep. On a benchmark of nearly 70 000 image pairs spanning different two-view estimation problems and both engineered and learned feature pipelines, the proposed score exceeds the state of the art (RANSAC, MSAC, GaU, MAGSAC): it stays nearly flat under threshold miscalibration where baselines degrade, reaches near-optimal accuracy from as few as two validation pairs where baselines need on the order of 100 times more, and tightens its prior regularization as validation data grows scarce.
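The "marginalize the scale, then sweep over partitions" idea can be illustrated with a small sketch. It is not the authors' formula: it assumes scalar residuals, zero-mean Gaussian inliers with variance σ², an Inverse-Gamma(a, b) prior on σ², and outliers drawn uniformly from a region of size `outlier_volume`. Under those assumptions the inlier likelihood integrates in closed form to (2π)^(-k/2) · b^a/Γ(a) · Γ(a + k/2) / (b + S/2)^(a + k/2), where S is the sum of squared inlier residuals. For a fixed inlier count k this is largest when the k smallest residuals are chosen, so one sort plus a prefix-sum sweep covers all candidate partitions in O(N log N). The names `a`, `b`, and `outlier_volume` are hypothetical parameters.

```python
import math


def scale_marginalized_score(residuals, a=1.0, b=1.0, outlier_volume=100.0):
    """Return (best log score, inlier count) with the inlier scale integrated out."""
    squared = sorted(r * r for r in residuals)  # O(N log N) sort
    n = len(squared)
    log_outlier = -math.log(outlier_volume)     # uniform outlier log-density
    best_score, best_k = n * log_outlier, 0     # k = 0: every point is an outlier
    prefix = 0.0
    for k, sq in enumerate(squared, start=1):   # O(N) sweep over partition sizes
        prefix += sq                            # sum of the k smallest squares
        log_inlier = (
            -0.5 * k * math.log(2.0 * math.pi)
            + a * math.log(b)
            - math.lgamma(a)
            + math.lgamma(a + 0.5 * k)
            - (a + 0.5 * k) * math.log(b + 0.5 * prefix)
        )
        score = log_inlier + (n - k) * log_outlier
        if score > best_score:
            best_score, best_k = score, k
    return best_score, best_k
```

No residual threshold or scale estimate appears in the call; the prior parameters take over the role that a hand-tuned threshold plays in inlier counting.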
[CV-95] TOrtho-Gaussian: Splatting True Digital Orthophoto Maps
Link: https://arxiv.org/abs/2411.19594
Authors: Xin Wang,Wendi Zhang,Hong Xie,Haibin Ai,Qiangqiang Yuan,Zongqian Zhan
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: This work has been submitted to the IEEE Transactions on Geoscience and Remote Sensing for possible publication
Abstract:True Digital Orthophoto Maps (TDOMs) are essential products for digital twins and Geographic Information Systems (GIS). Traditionally, TDOM generation involves a complex set of photogrammetric processes, whose results may deteriorate due to various challenges, including inaccurate Digital Surface Models (DSMs), degraded occlusion detection, and visual artifacts in weakly textured regions and on reflective surfaces. To address these challenges, we introduce TOrtho-Gaussian, a novel method inspired by 3D Gaussian Splatting (3DGS) that generates TDOMs through orthogonal splatting of optimized anisotropic Gaussian kernels. More specifically, we first simplify the orthophoto generation by orthographically splatting the Gaussian kernels onto 2D image planes, formulating a geometrically elegant solution that avoids the need for an explicit DSM and occlusion detection. Second, to produce TDOMs of large-scale areas, a divide-and-conquer strategy is adopted to optimize memory usage and time efficiency of training and rendering for 3DGS. Lastly, we design a fully anisotropic Gaussian kernel that adapts to the varying characteristics of different regions, particularly improving the rendering quality of reflective surfaces and slender structures. Extensive experimental evaluations demonstrate that our method outperforms existing commercial software in several aspects, including the accuracy of building boundaries, the visual quality of low-texture regions and building facades. These results underscore the potential of our approach for large-scale urban scene reconstruction, offering a robust alternative for enhancing TDOM quality and scalability.
[CV-96] Enhanced Neural Video Representation Compression across Extreme Complexity and Quality Scales
Link: https://arxiv.org/abs/2606.28163
Authors: Ho Man Kwan,Tianhao Peng,Fan Zhang,Mike Nilsson,Andrew Gower,David Bull
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Implicit neural representations (INRs) have recently emerged as a promising approach to video compression, delivering competitive rate-distortion performance alongside rapid decoding. However, existing neural video codecs struggle to balance complexity and scalability. Lightweight models often suffer from degraded compression performance when scaled to different bitrate/quality levels, whereas high-performance models exhibit limited scalability, as their model complexity typically increases with quality. This lack of a unified architecture capable of maintaining consistent complexity across a wide range of bitrates severely limits their diverse real-world deployment. To address these challenges, we introduce NVRC++, a novel INR-based video codec that utilizes a lightweight INR with multiple high-resolution feature grids, providing high scalability at any given complexity level. This is paired with an optimization framework that enables efficient overfitting on high-resolution grids for long video sequences, thereby exploiting spatio-temporal redundancies without prohibitive computational or memory overhead. Additionally, an advanced entropy model is designed for efficiently compressing the high-dimensional grid parameters. As a result, NVRC++ provides four complexity levels (from 7kMACs/pixel to 360kMACs/pixel), each spanning wide bitrate and quality ranges while supporting real-time decoding. The experimental results show that NVRC++ offers a much faster decoding speed (up to 7.6x) compared to the SOTA INR-based video codec, NVRC, while delivering comparable performance.
[CV-97] Differentiable design of the PIAA-ZWFS: a flexible wavefront sensor that approaches the fundamental limit
Link: https://arxiv.org/abs/2606.28136
Authors: A. K. Taras,S. Y. Haffert,L. Desdoigts
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: Submitted to Astronomy & Astrophysics (A&A)
Abstract:Extreme adaptive optics (AO) is necessary for high contrast astronomy at scales of the habitable zone of nearby systems. We seek to evaluate wavefront sensors that approach fundamental limits of wavefront sensing, enabling adaptive optics systems to run faster or on fainter targets. We present the phase-induced amplitude apodisation Zernike wavefront sensor (PIAA-ZWFS): an adaptation of the conventional Zernike wavefront sensor (ZWFS) that leverages lossless apodisation of the pupil to concentrate the starlight in the focal plane. We optimise and evaluate the sensor with a differentiable modelling framework, drawing on concepts from Bayesian experimental design to minimise the variance of a maximum likelihood estimator that uses the system in the high Strehl regime. Our architecture shows state-of-the-art performance in simulation for different apertures, bandwidths, photon fluxes and source sizes, closing the gap to the fundamental limit by a factor of 10 (2.5) compared to the conventional ZWFS (optimised ZWFS) in a typical photon-limited case. For extended sources, we show that even an ideal point source sensor rapidly becomes sub-optimal, and our system outperforms it for stellar diameters larger than 0.8 λ/D. We verify that these gains do not come at the cost of dynamic range with either linear or non-linear reconstructors. Finally, we present a proof that there must be a trade-off between the information gained about amplitude and phase errors for any wavefront sensor. The PIAA-ZWFS is a viable wavefront sensor operating near the fundamental sensitivity limits.
[CV-98] MLVC: Multi-platform Learned Video Codec for Real-World Deployment ECCV2026
Link: https://arxiv.org/abs/2606.28027
Authors: Tanel Pärnamaa,Martin Lumiste,Ardi Loot,Evgenii Indenbom,Andrei Znobishchev,Ando Saabas
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ECCV 2026
Abstract:Neural video codecs have surpassed classical codecs in coding efficiency but remain impractical for deployment due to cross-platform incompatibility and high computational cost. Existing quantization-based solutions fail to produce deterministic results across diverse hardware platforms, leading to catastrophic decoding failures. We introduce MLVC, a hardware-robust neural video codec designed for practical cross-platform inference. The key idea is to explicitly transmit scale parameters through the hyperprior, which guarantees entropy coding consistency across devices without requiring bit-exact arithmetic. While this increases bitrate overhead, we recover most of the coding efficiency through architectural improvements (gated memory, ReGLU activation), a long-term reference recovery mechanism, and domain-specific perceptual training. On the VCD video conferencing benchmark, MLVC achieves 70% BD-rate (MOS) improvement over hardware HEVC, the strongest deployable baseline, while reaching subjective quality competitive with DCVC-RT, which cannot operate across diverse platforms. Both the encoder and decoder run at 100 FPS on average on commodity NPUs from Apple, Intel, and Qualcomm. MLVC is the first neural video codec to combine competitive compression performance, real-time speed, and cross-platform robustness across diverse consumer devices, making it suitable for widespread deployment. Code will be released.
[CV-99] Enhancing Co-packaging Optics Enabled Silicon Photonics Security Assurance Hardware Fingerprinting
Link: https://arxiv.org/abs/2606.27612
Authors: Liton Kumar Biswas,M Shafkat M Khan,Himanandhan Reddy Kottur,Hao Wang,Hamed Dalir,Navid Asadizanjani
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Author manuscript version of paper published in IMAPSource Proceedings 2025. Final published version available through IMAPS. 6 pages
Abstract:Silicon photonics enables integration of optical components using standard semiconductor processes, greatly improving data communication bandwidth and energy efficiency. However, photonic integrated circuits (PICs) face unique security challenges, such as counterfeit or tampering threats, that conventional electronic security methods do not address. We propose a novel hardware fingerprinting technique that embeds two-dimensional photonic crystal (PhC) patterns into the density control filler regions of a PIC. Each PhC pattern is designed to resonate at specific visible to near-infrared wavelengths, producing a distinctive optical signature (based on wavelength, polarization, and incident angle) for each device. Finite difference time domain (FDTD) simulation using ANSYS Lumerical is employed to optimize nanostructure dimensions and spacing so that each device’s reflection/absorption spectrum contains unique narrowband peaks. No extra fabrication steps or materials are required beyond standard lithography, keeping costs low. The embedded nanostructures have sub-50nm precision, making forgery extremely difficult. Our method yields a high resolution, scalable fingerprint for silicon photonic chips, enabling cost-effective device authentication and improved supply chain security.
Artificial Intelligence
[AI-0] Agentic Hardware Design as Repository-Level Code Evolution
Link: https://arxiv.org/abs/2606.28279
Authors: Cunxi Yu,Chenhui Deng,Nathaniel Pinckney,Brucek Khailany
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution. A Markdown harness is compiled into a project pack containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy; a hands-free agent loop then evolves an isolated git worktree, using repository operations for state management, tracing, and replay. This extends prior work on repository-scale self-evolution from EDA software systems to hardware-design artifacts themselves. We evaluate our approach on ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories, achieving 100% benchmark completion across all suites with a fully hands-free agentic loop. However, we do not claim that agentic AI for hardware design is solved: these benchmarks are controlled proxies for a much broader engineering problem in chip design. The discussion section examines the limitations of the current study and highlights open research challenges.
[AI-1] Parameter Efficient Hybrid Transformer (PEHT) for Network Traffic Prediction via Dynamic Urban Congestion Integration
Link: https://arxiv.org/abs/2606.28274
Authors: Abdolazim Rezaei,Mehdi Sookhak,Mahboobeh Haghparast
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate network traffic prediction is a critical element for efficient resource allocation in dynamic urban cellular networks. However, prediction remains challenging because network demand is influenced by complex mobility patterns, congestion dynamics, and heterogeneous user behavior. This paper introduces the Parameter-Efficient Hybrid Transformer (PEHT), a network traffic prediction framework that integrates urban mobility and congestion information into a Transformer-based architecture. PEHT separates primary network communication features from secondary urban mobility features and incorporates Low-Rank Adaptation (LoRA) into the Transformer encoder to reduce the number of trainable parameters while maintaining high predictive accuracy. A multimodal fusion strategy then injects external mobility and congestion features into the decoder to improve traffic forecasting. Experiments on the Telecom Italia Milan dataset and multiple synthetic congestion scenarios show that PEHT outperforms state-of-the-art baselines in terms of RMSE, MAE, and R^2. The implementation is available in the GitHub repository.
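As background for the LoRA step mentioned above, here is a generic sketch of low-rank adaptation on a single linear layer, not PEHT's implementation. The frozen weight `W` has shape d_out × d_in, and the trainable update is the product of `B` (d_out × r) and `A` (r × d_in), scaled by `alpha / r`. All names and shapes here are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    rank = len(A)                        # A has one row per rank dimension
    base = matvec(W, x)                  # frozen pretrained path
    update = matvec(B, matvec(A, x))     # low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def full_params(d_in, d_out):
    """Parameters updated when fine-tuning the dense weight directly."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Parameters updated when training only the factors A and B."""
    return rank * (d_in + d_out)
```

For a 512 × 512 layer with rank 8, `lora_params` gives 8,192 trainable values against 262,144 for the dense weight, which is where the parameter savings come from. Initializing `B` to zeros leaves the layer's output unchanged at the start of fine-tuning.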
[AI-2] How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks
Link: https://arxiv.org/abs/2606.28242
Authors: Julius Girardin,Emanuele Troiani,Yizhou Xu,Vittorio Erba,Florent Krzakala,Lenka Zdeborová
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Understanding how performance scales jointly with model size and data is a central problem in modern machine learning. Existing theoretical works on scaling laws typically describe generalization as a function of data or compute, often in fixed-feature or infinite-width regimes and for online SGD. Here, we instead study how generalization scales with the number of trainable parameters and the number of samples in a feature-learning model. We analyze ℓ₂-regularized empirical test error minimization in a quadratic two-layer network in a finite-sample setting with structured data. This setting allows for an explicit characterization of the generalization error as a function of the number of samples, model width, and regularization. Our results reveal a phase diagram with distinct scaling regimes as the number of parameters varies. In particular, the generalization error follows data-dependent power laws controlled by the spectral structure of the target. We further characterize the transitions between regimes, including the onset of interpolation, and their impact on generalization.
[AI-3] Govern the Repository Not the Agent: Measuring Ecosystem-Level Risk in AI-Native Software
链接: https://arxiv.org/abs/2606.28235
作者: Daniel Russo
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous coding agents now open and merge pull requests in shared repositories at scale, and the field evaluates them the way it has always evaluated components, one agent at a time, on isolated benchmark tasks. Yet agents that each pass their own tests still leave repositories that accumulate problems no single contribution accounts for. We ask whether this problem belongs to the individual agent or to the repository where it accumulates. We study integration friction, the cost of integrating a contribution into a codebase that other contributors are concurrently changing. Across more than 930,000 agent-authored pull requests, we measure how much of the variation in friction stays with the repository after the contribution, its author, its size, and its agent are accounted for. About half does, and it survives full controls. In the same repositories, agent-authored contributions concentrate this repository-level friction roughly twice as much as human ones (intraclass correlation 0.30 versus 0.16), a gap that holds after controlling for codebase size, age, task shape, process maturity, and merge path. The risk is a property of the ecosystem, not the agent. AI-native software is therefore better measured and governed at the ecosystem level than one agent at a time.
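The intraclass correlation quoted in the abstract above measures the share of variance that sits at the group level (here, the repository). The sketch below computes the classic one-way ANOVA estimator ICC(1) for equal-sized groups. The abstract does not say which estimator the paper uses, so this choice, the function name, and the toy data are all assumptions.

```python
def icc_oneway(groups):
    """ICC(1) from a one-way random-effects ANOVA with equal-sized groups."""
    k = len(groups[0])
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    n = len(groups)
    grand = sum(sum(g) for g in groups) / (n * k)
    means = [sum(g) / k for g in groups]
    # Between-group and within-group mean squares.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Made-up friction scores: three repositories, two pull requests each.
toy_friction = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_icc = icc_oneway(toy_friction)
```

In the toy data most of the spread lies between repositories, so the estimate is close to 1. A value near 0 would mean that knowing the repository says little about a contribution's friction.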
[AI-4] The Remittance Blueprint: Data-driven Intelligence for Sri Lanka
链接: https://arxiv.org/abs/2606.28190
作者: Dhinanjaya Fernando,Dinura Ginige,Kalana Lakshan,Chanupa Gurusinghe,Lasana Pahanga,Subavarshana Arumugam,Sandeepa Weerasekara,Sandareka Wickramanayake,Nisansa de Silva
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures
Abstract:This study analyzes Sri Lankan migration and remittances over 32 years (1994-2025). Using a 384-month harmonized dataset, we apply exploratory data analysis, stationarity-corrected time-series modeling (ADF, Johansen, VAR/VECM), and supervised learning. Results reveal remittance inflows are primarily driven by external macroeconomic variables, specifically exchange rate dynamics and global oil prices, rather than domestic indicators. Impulse response analysis confirms the asymmetric impact of currency depreciation and oil price shocks. Predictively, multivariate machine learning models outperform traditional univariate approaches; Ridge Regression achieves a 73.8% accuracy improvement over SARIMA (Annualized RMSE: USD 494.8 Mn). The optimized framework projects 2026 remittances at USD 9,001 million under stable conditions. These findings highlight the structural dependence of remittances on global economies, emphasizing the need for robust exchange rate policies, skilled migration, and formal financial channels to enhance long-term economic resilience.
[AI-5] CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association MICCAI2026
链接: https://arxiv.org/abs/2606.28179
作者: Zuoou Li,Wenlong Zhao,Kelly Yu,Weitong Zhang,Paul M. Matthews,Wenjia Bai,Bernhard Kainz,Mengyun Qiao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to MICCAI 2026
Abstract:Identifying robust associations between cardiac imaging phenotypes and clinical diseases is fundamental to population-scale cardiovascular research and reliable risk stratification. However, current phenome-wide association studies rely on pre-defined, single-variable phenotypes or expert-crafted features, which limits their ability to capture clinically meaningful non-linear effects and cross-phenotype interactions. To address this, we propose CPAgents, an iterative phenotype-Composition framework for cardiovascular Phenome-wide association study (PheWAS) that automatically constructs and validates interpretable composite phenotypes (e.g., polynomial, ratio, and interaction forms) from base imaging features. Specifically, our system coordinates three agents: (i) an Analyst that identifies statistical pathologies and nominates candidate transformations; (ii) a Proposer that generates constrained, medically and statistically motivated expressions under numerical safety rules; and (iii) a Verifier that evaluates candidates using multi-stage criteria and produces transparent evidence trails for accepted phenotypes. Evaluated on a population-scale cardiac imaging cohort, the discovered composite phenotypes markedly improve disease discrimination: across 72 classifier-disease-metric combinations, our variants achieve the top rank in 56 cases versus 18 for baselines, with gains observed across all nine clinical disease categories. Our framework yields compact, clinically interpretable phenotype formulas with transparent evidence trails, enabling scalable discovery of stronger phenotype-disease associations beyond expert-driven feature selection.
[AI-6] Tandem Reinforcement Learning with Verifiable Rewards
链接: https://arxiv.org/abs/2606.28166
作者: Difan Jiao,Raghav Singhal,Robert West,Ashton Anderson
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures, 8 tables
Abstract:Reinforcement learning with verifiable rewards (RLVR) has significantly improved the reasoning capability of large language models, reaching expert or even superhuman performance in domains such as competition math. However, whether weaker agents and humans can actually harness this capability is far less certain, with RLVR documented to drift reasoning toward idiosyncratic patterns such as poor readability and language mixing. Tandem training is a recently introduced paradigm that targets this compatibility problem: a trained, stronger senior co-generates each rollout with a frozen, weaker junior, and the two are rewarded as a team, so the senior is pushed to reason in ways the junior can follow. Yet this paradigm has so far been demonstrated only in proof-of-concept settings, leaving open whether it scales to the long chains of thought of the modern RLVR pipeline. In this work, we propose Tandem Reinforcement Learning (TRL), which carries the tandem training paradigm into RLVR. In TRL, the senior and a frozen junior alternate stochastically to co-generate the reasoning, the resulting generation is rewarded, and the standard GRPO loss is applied to the senior. Training Qwen3-4B-Instruct on competition math, we find that TRL matches vanilla GRPO on solo reasoning capability while three properties emerge together from the same rollout structure: stronger handoff robustness with the junior, reduced distributional drift from the junior, and a chain-of-thought more legible to the junior. Our results demonstrate a promising route for RLVR with practical payoffs in multi-model communication and human compatibility.
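The abstract above says the standard GRPO loss is applied to the senior. The core of GRPO is a group-relative advantage: each rollout's reward is normalised by the mean and standard deviation of the rewards in its sampling group. The sketch below shows only that step. The rewards are made up, the population standard deviation is one common choice among several, and how TRL treats junior-generated tokens is not stated in the abstract and is not modelled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and std of its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifiable rewards (1 = correct answer) for four rollouts of one prompt.
team_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(team_rewards)

# When every rollout gets the same reward, the group carries no learning signal.
flat_advantages = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage of the same size, so the update depends on how a rollout compares with its siblings and not on the absolute reward.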
[AI-7] Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models ICML2026
链接: https://arxiv.org/abs/2606.28153
作者: Yanchen Yin,Dongqi Han,Linghui Li
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 323 pages, 19 figures. Accepted at ICML 2026 as an Oral presentation
Abstract:Jailbreak attacks bypass LLM safety alignment, yet their mechanisms remain poorly understood. We provide evidence that attacks do not comprehensively eliminate safety features, but instead selectively suppress specific attention heads. We identify two functionally differentiated types: Adversarially Compromised Heads (ACHs) concentrated in early layers, which are suppressed under attacks, and Safety-Aligned Heads (SAHs) in mid-layers, which maintain robust activations even when attacks succeed. Ablation studies support the causal role of ACHs and the contribution of SAHs to robust activations: suppressing a small number of ACHs is sufficient to induce jailbreak-like behavior on normally refused inputs, while removing SAHs substantially weakens mid-layer safety activations. Token-level attribution further shows that ACH suppression is driven specifically by attack-template tokens, providing a mechanistic account of why attacks can bypass refusal decisions through ACH suppression while leaving internal safety signals sustained by SAHs – a phenomenon we term Robust Harmful Features. To validate the practical significance of this robustness, we show that simply reading these persistent activations – without any training – yields competitive aggregate detection performance with strong adversarial robustness.
[AI-8] Beyond Sparse Supervision: Diffusion-Guided Learning for Few-Shot Graph Fraud Detection
链接: https://arxiv.org/abs/2606.28134
作者: Liming Liu,Chao Hu,Mingfei Lu,Yiwei Ge,Xingle Li,Heyuan Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph-based fraud detection is essential for safeguarding large-scale transaction systems, where undetected anomalies may lead to substantial financial losses and security risks. Real-world fraud graphs pose two coupled challenges: sparse and imbalanced supervision, where verified fraudulent labels are scarce and heavily skewed toward benign accounts, and representation dilution, where spatial message passing may oversmooth camouflaged anomalies while spectral filters may suppress fraud-relevant mid- and high-frequency irregularities. To address these challenges, we propose ADC-GNN, short for Attention-guided Diffusion-Contrastive Graph Neural Network, a unified framework that combines diffusion-guided feature augmentation, contrastive representation learning, and multi-hop spectral attention for few-shot graph fraud detection. The diffusion component is formulated as a feature-space denoising augmentation mechanism rather than a full topology-generative graph diffusion model: it constructs noise-perturbed node-feature views under a cosine schedule and uses contrastive learning to stabilize node representations across perturbations. The spectral attention module further adaptively emphasizes fraud-relevant hop-level and relation-level cues. We evaluate ADC-GNN primarily on three public benchmarks and additionally report a proprietary real-world telecom transaction dataset with approximately 60,000 records as a private case study. Under the 1% training setting, ADC-GNN achieves consistent improvements over original graph fraud baselines and four protocol-consistent recent graph anomaly/fraud baselines on the public benchmarks. Additional analyses on split stability, training ratios, oversampling alternatives, module-level ablations, diffusion schedules, and runtime and memory-consumption comparisons further characterize the effective operating regime of ADC-GNN.
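The abstract above describes building noise-perturbed node-feature views under a cosine schedule. A minimal sketch of that idea follows, assuming the widely used cosine schedule form with offset `s = 0.008` and Gaussian noise. The abstract gives neither the exact schedule nor the noise model, so both are assumptions, as are the function names and the toy feature vector.

```python
import math
import random

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level at step t in 0..T, normalised so that step 0 gives 1."""
    def f(u):
        return math.cos(((u / T + s) / (1 + s)) * math.pi / 2) ** 2
    return f(t) / f(0)

def perturb_features(x, t, T, rng):
    """Return a noisy view: sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise."""
    ab = cosine_alpha_bar(t, T)
    return [math.sqrt(ab) * xi + math.sqrt(1 - ab) * rng.gauss(0, 1) for xi in x]

T = 10
schedule = [cosine_alpha_bar(t, T) for t in range(T + 1)]

rng = random.Random(0)
node_features = [0.5, -1.2, 3.0]          # hypothetical feature vector of one node
clean_view = perturb_features(node_features, 0, T, rng)
noisy_view = perturb_features(node_features, 5, T, rng)
```

Two views of the same node at different noise levels can then serve as a positive pair for a contrastive loss, which is the role the abstract assigns to the diffusion component.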
[AI-9] AI-Driven Synthesis for High-Tech System Design: Automating Innovation
链接: https://arxiv.org/abs/2606.28126
作者: Luuk Oerlemans,Steven Westerhof,Theo Hofman
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Robotics (cs.RO)
备注:
Abstract:This article addresses the combinatorial complexity inherent in modern high-tech system design by presenting automation-in-design (AiD) as a transformative paradigm. We propose computational design synthesis (CDS), a framework utilising deep learning and generative AI to automate the creation of novel systems. Two case studies (e-drive system design and spatial dimensioning problem) serve as proof-points for this approach. The AI-driven methods used in the case studies represent a fundamental shift in engineering, advancing from simulation-based optimisation towards autonomous design with minimal human supervision.
[AI-10] Ontology-Guided Evidence Path Inference for Multi-hop Knowledge Graph Question Answering
链接: https://arxiv.org/abs/2606.28076
作者: Yongxue Shan,Meihan Wu,Cundi Fang,Jie Peng,Xiaodong Wang
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures
Abstract:Knowledge graph question answering (KGQA) aims to answer natural-language questions by reasoning over structured facts. Existing multi-hop KGQA methods mainly rely on topic-centered expansion, which faces two key challenges: the search space rapidly grows with noisy mixed-type paths, and retrieved paths may fail to satisfy the semantic constraints of complex questions. To address these challenges, we propose OPI, an ontology-guided evidence path inference framework for multi-hop KGQA. OPI introduces a relation-centric ontology graph to capture the head-tail type constraints of relations, providing a compact interface for answer-side constraints. Based on this ontology graph, OPI first introduces a bidirectional retrieval mechanism by mapping the predicted answer type to compatible final-hop relations and combining topic-side prefix expansion with answer-side final-hop matching, thereby suppressing noisy mixed-type expansion. OPI further adopts an iterative refinement strategy to reassess retrieved paths and candidate answers under the question context, filtering type-compatible but question-irrelevant evidence for more reliable answer prediction. Experiments on WebQSP, CWQ, and MetaQA show that OPI substantially reduces the search space, improves Hit@1/F1 by 4.6/5.0 points on WebQSP and 8.9/3.3 points on CWQ over the strongest prior results, and achieves near-saturated Hit@1 on MetaQA with the retrieval module alone.
[AI-11] JD Oxygen AI Item Center (Oxygen AIIC) V1: An Industrial-Scale LLM/VLM-Centric Solution for Item Understanding Management and Applications
链接: https://arxiv.org/abs/2606.28070
作者: Oxygen AIIC,Chan Long,Chao Liu,Chaofan Chen,Chaohui Dong,Chunyuan Guo,Danping Liu,Debin Liu,Deping Xiang,Fulai Xu,Guangyue Liu,Hao Li,Huichun Hu,Jian Yang,Jianan Wang,Jianbo Zhao,Jiaoyang Li,Jiaxing Wang,Jinglong Li,Jinjin Guo,Jun Fang,Jun Liu,Kai Zhou,Li Wang,Lili Gao,Liying Chen,Luning Yang,Mengdi Zhou,Pengzhang Liu,Qi Lv,Qianyun Wang,Qixia Jiang,Ruyue Li,Shimu Liang,Shuxing Wang,Sijie Zhang,Siqi Li,Tianhao Gao,Wang Ke,Weihu Huang,Wencan Lai,Wenjie Zhang,Xiaohui Zhang,Xiaojing Dong,Ya Liu,Yifeng Zhang,Yixiang Wang,Yongtai Zhang,Yongyi Liao,Zhaoru Chen,Zhen Chen,Zhiyong Ma,Zhiyuan Liu,Zhongwei Liu,Ziyan Xing
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:JD.com, one of the world’s largest e-commerce platforms, serves over 700 million active users and millions of merchants, with a catalog of tens of billions of SKUs. At this scale, high-quality, structured item knowledge underpins a better consumer experience, lower management costs, and higher operational efficiency, yet producing and serving it poses three industrial-scale challenges: fast-emerging concepts, high-quality knowledge production for massive SKUs, and diverse downstream requirements. To address these challenges, we present the JD Oxygen AI Item Center (Oxygen AIIC), an industrial-scale platform built on LLMs/VLMs for item-knowledge production and service. Oxygen AIIC is built around four core pillars: (i) ontology engineering driven by efficient human-AI collaboration, which supports the dynamic evolution and agile expansion of an ontology with millions of entries; (ii) a “Semantic Search then Discrimination” (S2D) knowledge identification architecture that, combined with throughput improvement strategies, enables scalable, extensible, and high-throughput AI Item Library production for tens of billions of SKUs; (iii) self-evolving item-understanding LLMs/VLMs that improve in a stable and controllable manner, enabling knowledge production with 94.2% precision and 82.8% recall; and (iv) a unified item tunnel that serves as the data and service hub. Oxygen AIIC now covers tens of thousands of JD categories and processes hundreds of millions of item updates per day on Huawei Ascend NPUs. It has accumulated hundreds of billions of item-knowledge assets. Deployed across core business scenarios, including search, recommendation, operations, and category planning, Oxygen AIIC has delivered measurable gains at scale. Search-traffic coverage reaches 80.4%, item-information quality issues drop by 37%, and the automated fill rate of core attributes during item listing exceeds 80%.
[AI-12] OperatorSHAP: Fast and Accurate Shapley Value Estimation for Neural Operators
链接: https://arxiv.org/abs/2606.28065
作者: Joshua Stiller,Santo M. A. R. Thies,Felix Czaja,Eyke Hüllermeier
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding model predictions is essential for physical applications, where outputs often inform safety-critical decisions, such as structural load assessment, weather warnings, and clinical diagnosis. Shapley values satisfy many desirable properties as an attribution method, but their computational cost during inference hinders their practical use. Current amortized explainers, such as FastSHAP, are limited to homogeneous inputs, which is problematic for physical applications where data often comes from irregular grids and geometries. We introduce OperatorSHAP, a grid-agnostic attribution method and training procedure that allows us to train FastSHAP-like explainers for neural operators. We establish a theoretical framework for attributions in function space, connecting to Aumann-Shapley values. We further show that OperatorSHAP’s explanations are consistent with state-of-the-art discrete Shapley values across resolutions and transfer across grid sizes without retraining.
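To make the cost argument in the abstract above concrete, the sketch below computes exact Shapley values from the textbook definition, which enumerates every coalition and therefore scales as 2^n in the number of features. This is the generic discrete definition, not OperatorSHAP itself; the toy value function and all names are illustrative assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(n, value):
    """Exact Shapley values for players 0..n-1, given value(frozenset) -> float."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Toy model: additive contributions plus a bonus when features 0 and 1 co-occur.
BASE = [1.0, 2.0, 3.0]

def toy_value(coalition):
    bonus = 4.0 if {0, 1} <= coalition else 0.0
    return sum(BASE[j] for j in coalition) + bonus

phi = shapley_values(3, toy_value)
```

The interaction bonus is split evenly between features 0 and 1, and the attributions sum to the value of the full coalition (the efficiency property). Amortized explainers such as FastSHAP exist to avoid this exponential enumeration at inference time.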
[AI-13] ToolPrivacyBench: Benchmarking Purpose-Bound Privacy in Tool-Using LLM Agents
链接: https://arxiv.org/abs/2606.28061
作者: Shijing Hu,Liang Liu,Zhu Meng,Zhicheng Zhao
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 24 pages, 7 figures, 15 tables
Abstract:Large language models (LLMs) have increasingly moved from standalone text generation systems to agents that invoke external tools, access environments, and execute multi-step tasks. However, conventional function-calling benchmarks mainly evaluate task completion and API correctness, while privacy evaluation benchmarks typically focus on final responses or privacy judgments. Neither perspective captures purpose-bound information flow across an executed multi-tool trajectory. Motivated by this limitation in current agent evaluation, ToolPrivacyBench audits whether task-private atoms are routed only to authorized tools and downstream sinks, thereby evaluating both task completion and privacy over-disclosure during tool use. The benchmark contains 2,150 cases, including 1,150 fully synthetic privacy-sensitive business workflows and 1,000 cases adapted from existing multi-tool and function-calling benchmarks. Each case is represented by a policy knowledge base. After an agent executes against mock business backends, the evaluator compares recorded tool arguments and backend audit logs with this policy knowledge base. The evaluation covers nine widely used agents to characterize purpose-bound privacy over-disclosure. The results show that successful tool execution does not imply appropriate privacy disclosure: an agent may complete a task while transmitting unnecessary private information through intermediate tool calls. ToolPrivacyBench therefore formalizes a need-to-know disclosure boundary, under which each tool should receive only the information necessary for its stated purpose, and uses trajectory-level auditing to identify privacy over-disclosure in multi-tool workflows.
[AI-14] Lifted Causal Inference
链接: https://arxiv.org/abs/2606.28024
作者: Malte Luttermann,Tanya Braun,Ralf Möller,Marcel Gehrke
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the Annals of Mathematics and Artificial Intelligence journal
Abstract:Lifted inference exploits indistinguishabilities in probabilistic graphical models by using a representative for indistinguishable objects, thereby speeding up query answering while maintaining exact answers. In this article, we show how lifting can be applied to efficiently compute causal effects in relational domains. More specifically, we introduce parametric causal factor graphs (PCFGs) to incorporate causal knowledge in lifted models and give a formal semantics of interventions therein. We further present the Lifted Causal Inference (LCI) algorithm to compute causal effects on a lifted level, thereby drastically speeding up causal inference compared to propositional inference, e.g., in causal Bayesian networks. In addition, we present partially directed parametric causal factor graphs (PD-PCFGs) as a generalisation of PCFGs to handle partial causal knowledge and extend LCI to perform lifted causal inference in a PD-PCFG, thereby extending the applicability of lifted causal inference to a broader range of models requiring less prior knowledge about causal relationships.
[AI-15] RelBall: Relation Ball with Quaternion Rotation for Knowledge Graph Completion
链接: https://arxiv.org/abs/2606.27967
作者: Yike Liu,Peijia Xie,Chao He,Huiling Zhu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Real-world knowledge graphs are often incomplete, lacking many valid facts. Knowledge Graph Completion (KGC) aims to predict missing links using known triples, thereby enhancing graph coverage. A key challenge is modeling diverse relational patterns such as symmetry, antisymmetry, inversion, composition and semantic hierarchy. Existing models such as RotatE can capture symmetric, antisymmetric, inverse, and commutative composition patterns, yet struggle with non-commutative composition. Rotate3D addresses this by introducing non-commutativity via three-dimensional rotations, but still fails to capture the semantic hierarchies prevalent in knowledge graphs. Moreover, neither model can effectively model one-to-many relations. To overcome these limitations, we propose RelBall, which extends Rotate3D with two innovations. First, our model introduces modulus transformation to model hierarchies, driving abstract concepts toward smaller moduli and concrete instances toward larger ones. Second, it introduces a tail-centric relation ball to model one-to-one, one-to-many, many-to-one, and many-to-many relations. RelBall offers the following advantages: (1) coverage of all relational patterns, including the ones mentioned above; (2) an interpretable hierarchical representation where the modulus directly reflects semantic levels; (3) support for one-to-one, one-to-many, many-to-one, and many-to-many relations. Experiments on multiple datasets demonstrate RelBall’s competitive link prediction performance against various baselines.
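The abstract above relies on the fact that three-dimensional rotations, unlike the planar rotations of RotatE, do not commute. The sketch below shows this with unit quaternions: composing two 90-degree rotations in different orders sends the same vector to different places. It illustrates only the non-commutativity argument, not RelBall's scoring function; the helper names are assumptions.

```python
import math

def qmul(p, q):
    """Hamilton product of two quaternions (w, x, y, z)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2)

def axis_angle(axis, angle):
    """Unit quaternion for a rotation of `angle` radians about `axis`."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    s = math.sin(angle / 2) / norm
    return (math.cos(angle / 2), x * s, y * s, z * s)

def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q via q * (0, v) * conj(q)."""
    conj = (q[0], -q[1], -q[2], -q[3])
    _, x, y, z = qmul(qmul(q, (0.0, *v)), conj)
    return (x, y, z)

rx = axis_angle((1, 0, 0), math.pi / 2)   # 90 degrees about the x-axis
rz = axis_angle((0, 0, 1), math.pi / 2)   # 90 degrees about the z-axis
v = (0.0, 1.0, 0.0)

x_then_z = rotate(rz, rotate(rx, v))      # approximately (0, 0, 1)
z_then_x = rotate(rx, rotate(rz, v))      # approximately (-1, 0, 0)
```

If relations are modelled as such rotations, applying relation r1 and then r2 can differ from applying r2 and then r1, which is what lets a model represent non-commutative composition.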
[AI-16] Reasoning Beyond Prediction: From Data-Driven to Causal Software Engineering
链接: https://arxiv.org/abs/2606.27960
作者: Roberto Pietrantuono,Luca Giamattei,Stefano Russo
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted for publication in Communications of the ACM
Abstract:Software engineering is an intellectually demanding, creative discipline that juggles a web of interdependent tasks to design, build, and assure the quality of increasingly complex systems. As our expectations from software soar - with demands spanning AI-driven products, pervasively distributed and cloud-native architectures, and deeply embedded cyber-physical environments - its complexity steadily increases. In response, a new wave of co-engineering methods and tools, fueled by deep learning, has emerged to augment the process, enhancing automation and decision support. Yet, these advances remain far from delivering the kind of intelligent support that modern software development demands. We call for a new paradigm of human-machine cooperation: one where machines don’t just automate routine tasks or predict from learned patterns, but actively amplify engineers’ reasoning through the lens of causation. As software becomes smarter, a smarter support is needed.
[AI-17] Two-Stage Fine-Tuning for Protein Sequence Generation with Targeted Amino-Acid Composition ICML2026
链接: https://arxiv.org/abs/2606.27939
作者: Violeta Basten-Romero,Rubén Muñoz-Tafalla,Anna María Díaz-Rovira,Bertran Miquel-Oliver,Isaac Filella-Merce,Víctor Guallar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Genomics (q-bio.GN)
备注: 17 pages, 5 figures, ICML 2026 Workshop GenBio
Abstract:Protein language models are standard priors for biological sequence generation, but steering them toward explicit distributional design targets remains largely unexplored. We study a constrained protein generation problem in which sequences must match a desired amino-acid (AA) composition profile while preserving plausible sequence statistics and diversity. The motivating application is synthetic feed protein design, where the AA composition of dietary proteins directly determines their nutritional value. We propose a two-stage pipeline in which domain-adaptive fine-tuning (FT) on an in-domain protein dataset is followed by iterative reward-weighted FT via reinforcement learning (RL) anchored against the FT model as a frozen reference. We evaluate the pipeline on two AA compositions and find that FT brings the average composition close to the target, while the subsequent RL enforces specific sequence constraints that FT alone cannot satisfy. We additionally evaluate the design choices of the proposed composition reward term against two baselines and an ablated variant, isolate the contribution of each training stage, and verify that AA composition alignment is achieved without degrading sequence quality.
[AI-18] Agentic AI-Powered Re-Identification: An Emerging Scalable Threat to Mobility Microdata Privacy
链接: https://arxiv.org/abs/2606.27936
作者: Oscar Thees,Roman Müller,Matthias Templ
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: 15 pages, 2 figures
Abstract:The widespread collection of fine-grained location data by commercial data brokers creates a re-identification risk that is not widely recognised by the public. While prior research has established that mobility traces are highly unique and that individuals can, in principle, be identified from a handful of spatio-temporal points, such attacks have historically required significant manual effort from skilled analysts, limiting their practical scale. In this feasibility study, we demonstrate in a real-world setting that agentic AI fundamentally changes this threat model. We present an end-to-end pipeline in which large language model agents autonomously search the open web, cross-reference public records and social media, and resolve raw coordinate sequences to candidate identities - without human intervention. We evaluate the pipeline on a spatio-temporal dataset containing simulated location points anchored at and around true home and work addresses, focusing on a high-risk disclosure scenario. Our results demonstrate that, from spatio-temporal data and public sources alone, our agentic AI successfully re-identified 18 of the 25 re-identifiable individuals (72%) and 18 of 43 cases overall (41.9%). We discuss implications for Statistical Disclosure Control (SDC) practice and outline the near-future escalation that data custodians and regulators must anticipate. De facto anonymity - an implicit foundation of SDC practice - is shifting. Agentic AI strengthens the case that re-identification is reasonably likely by any means under the GDPR Recital-26 standard, at costs of minutes-and-dollars per target.
MSC classes: 68P27; ACM classes: K.4.1; I.2.11
[AI-19] SEADA: An efficient methodology for optimizing mixed-precision DNNs on multi-precision spatial architectures
链接: https://arxiv.org/abs/2606.27884
作者: Leandro Fiorin,Marco Ronzani,Cristina Silvano
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixed-precision computation has been introduced in deep neural networks (DNNs) as an effective approach to reduce latency, energy consumption, and memory footprint. However, efficiently mapping mixed-precision networks onto multi-precision spatial architectures poses several challenges. These include determining the appropriate precision for each layer, balancing layer-wise accuracy sensitivity to quantization against architectural heterogeneity and system-level constraints, and accurately estimating the system-level cost of heterogeneous precision assignments. This work presents SEADA, an efficient methodology designed to address these challenges. SEADA comprises: (i) a configurable system-level analytical cost model of a multi-precision spatial accelerator architecture; (ii) a fast mapping tool that identifies near-optimal mappings of DNN workloads onto the target integer accelerator; (iii) analytical models for floating-point layers to estimate the overall benefits of mixed-precision execution; and (iv) a per-layer precision selection methodology based on bit-level entropy, enabling efficient assignment across multiple numerical precisions. SEADA’s efficiency provides designers with a robust framework for the design-space exploration of multi-precision architectures.
[AI-20] S2-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation IJCAI2026
链接: https://arxiv.org/abs/2606.27872
作者: Zhipeng Xie,Zongyi Han,Xiangyi Wei,Shiliang Sun,Yang Li,Jing Zhao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to IJCAI 2026
Abstract:Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manipulation, but their performance degrades significantly in long-horizon tasks due to cumulative error propagation. This limitation largely arises from static feature fusion mechanisms that rely on fixed weights to combine visual, language, and action representations, preventing the model from adapting to different phases of task execution. To address this limitation, we propose S^2-VLA, a framework that introduces a State-Space Guided Adaptive Attention (SSGAA) mechanism. SSGAA maintains a belief state that tracks task progression and generates dynamic gating weights to adaptively fuse information from three complementary sources: visual features for spatial perception, task intents for high-level task planning, and temporal action sequences for execution consistency. This adaptive fusion allows the model to shift its focus throughout task execution, aligning with the evolving requirements of different task stages. Despite its compact 2B parameter size, S^2-VLA consistently outperforms larger 7B-scale models and achieves state-of-the-art performance on long-horizon manipulation benchmarks, including LIBERO and SimplerEnv, highlighting the importance of adaptive feature fusion for long-horizon robotic manipulation.
[AI-21] GNBAN: Graph Neural Basis Attention Networks for Long-Horizon Forecasting over Large Entity Sets
链接: https://arxiv.org/abs/2606.27863
作者: Janak M. Patel,Anirudh Deodhar,Dagnachew Birru
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures
Abstract:Demand forecasting at the bottom of a retail hierarchy requires predicting tens of thousands of correlated long-horizon series across products, stores, and regions. Modern systems must scale across massive catalogs, capture shared demand dynamics, and remain interpretable enough to be trusted. Classical statistical methods need a separate model per series and are hard to manage at scale; deep autoregressive models struggle as the joint state grows to tens of thousands of dimensions; and recent graph-based forecasters, while capturing cross-entity dependencies, often produce opaque long-horizon forecasts. We propose GNBAN (Graph Neural Basis Attention Network), an end-to-end architecture combining heterogeneous graph representation learning with an interpretable basis-decomposition head. Retail data are represented directly as a heterogeneous graph derived from the relational schema, so a single model serves the entire catalog. Rather than predicting the horizon directly, GNBAN decomposes each forecast into trend, seasonal, and generic components. Its key innovation is a per-basis attention mechanism: each basis function keeps its own learnable query and retrieves information independently from the entity’s historical neighborhood, letting different bases specialize to distinct temporal patterns while preserving interpretability. On two large-scale benchmarks, M5 Walmart and Favorita Grocery Sales, evaluated under matched protocols, GNBAN improves volume-weighted WRMSSE by roughly 4-5% over a matched graph baseline. Qualitative analysis shows the learned decomposition exposes trend, seasonal, and residual demand drivers without post-hoc explanation methods. These results demonstrate that scalable relational forecasting and interpretable forecast decomposition can be achieved together in a unified graph-based framework.
[AI-22] Applicability of memorization indicators for early spotting of overfitting while recalibrating sEMG-decoders on low sample sizes
链接: https://arxiv.org/abs/2606.27855
作者: Stephan J. Lehmler,Tobias Glasmachers,Ioannis Iossifidis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning models for surface electromyography (sEMG) can benefit substantially from subject-specific (re-)calibration, since no sufficiently large and diverse datasets are available to train fully generic decoders. However, for user acceptance, the number of repetitions that can realistically be collected during calibration is severely limited, which increases the risk of overfitting and, in extreme cases, can even degrade performance compared to the uncalibrated model. Classical overfitting indicators such as validation performance and regularization with early stopping are difficult to apply in this low-sample regime, as they require additional held-out data that is rarely available in practical calibration scenarios. In this work, we investigate a recently proposed class of memorization indicators based solely on the activation statistics of rectified linear units (ReLU) in deep neural networks, which can be computed directly from training data without any extra validation set. We conduct a transfer-learning experiment on a benchmark sEMG dataset, where a convolutional neural network is first pre-trained on multiple subjects and subsequently fine-tuned on individual users using only a small number of repetitions. During calibration, we monitor both decoding performance and the activation behaviour of the last hidden layer. Our results provide first evidence that decreases in test accuracy during fine-tuning are accompanied by characteristic changes in activation rates, indicating that activation-based memorization indicators are a promising tool for early spotting of unsuccessful learning in low-sample sEMG calibration settings.
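As a rough illustration of the quantity being monitored, the sketch below computes a plain ReLU activation rate from training data alone. The abstract does not define the exact memorization indicators, so the function `relu_activation_rate` and the toy `batch` are hypothetical and show only the basic statistic, not the authors' method.

```python
def relu_activation_rate(pre_activations):
    """Fraction of (sample, unit) pairs whose ReLU output is positive.

    `pre_activations` is a list of rows, one row of unit pre-activations
    per training sample. This is a hypothetical helper, not the paper's
    indicator.
    """
    total = sum(len(row) for row in pre_activations)
    if total == 0:
        raise ValueError("need at least one pre-activation value")
    active = sum(1 for row in pre_activations for v in row if v > 0.0)
    return active / total


# Made-up pre-activations of a last hidden layer: 2 samples x 4 units.
batch = [[0.5, -1.0, 2.0, -0.1],
         [-0.3, -0.2, 1.5, 0.0]]
rate = relu_activation_rate(batch)  # 3 of the 8 values are positive
```

Tracking this value per fine-tuning epoch needs no held-out set, which is the property the abstract highlights.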
[AI-23] WattLayer: Get Layers Right to Estimate Inference Energy of Neural Networks IJCAI ECAI2026
链接: https://arxiv.org/abs/2606.27841
作者: Adrien Sardi,Marie-Line Alberi Morel,Sara Alouf,Frédéric Giroire,Joanna Moulierac
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at IJCAI-ECAI 2026 Workshop SuRE
Abstract:The widespread adoption of Artificial Intelligence (AI) has led to increasing concerns about energy consumption, yet there is a lack of standardized methodologies to accurately estimate AI inference energy consumption, particularly across various tasks and architectures. In this study, we propose a task-independent, layer-wise energy estimation model for AI architectures. Our model is evaluated on a large dataset of more than 100,000 layers for 295 neural network architectures across 3 widely-used tasks and 3 distinct hardware platforms. Our approach achieves a median error of 19.6%, outperforming state-of-the-art methods. We further show that layer-wise decomposition generalizes to new tasks without complete retraining, by leveraging shared layers across architectures. It offers tools, insights and a precise methodology to empower stakeholders in designing energy-efficient AI systems.
[AI-24] NormAct: A Benchmark for Hidden Social Norm Compliance in Embodied Planning
链接: https://arxiv.org/abs/2606.27826
作者: Shiyun Zhao,Xinwei Song,Tianyu Guo,Xiaomeng Gao,Mingyuan Liu,Xu Han,Yuanyuan Zhang,Zhenliang Zhang,Xue Feng,Bo Dai
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) are increasingly deployed as embodied planners in egocentric environments, where task success requires not only achieving instructed goals but also acting in socially appropriate ways. While explicit goals may render certain actions optimal, implicit social norms often impose hidden constraints. Existing evaluations typically focus on explicit goal achievement or direct norm knowledge, seldom assessing whether planners can infer and apply these hidden constraints within action sequences. We introduce NormAct, a benchmark for embodied social-norm interactions that evaluates plans on Goal Achievement, Norm Compliance, and overall Task Success. NormAct uniquely embeds hidden norms within ordinary tasks, testing whether models can realize them without explicit instruction. Experiments with state-of-the-art MLLMs (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro) reveal a significant gap: models achieve explicit goals in 67.3% of cases, but comply with hidden norms in only 26.4%. Cue-condition experiments indicate that this gap stems not from a lack of general social knowledge, but from challenges in activating and grounding relevant norms in context. To address this, we propose NormPerceptor, a context-conditioned cue generator that infers scene-relevant norms prior to planning, increasing Task Success from 24.2% to 46.7%. Our results underscore the importance of enabling embodied agents to proactively detect hidden norms, ground them in visual evidence, and integrate them as action-planning constraints. Our benchmark is publicly available at this https URL.
[AI-25] Pepti-drift: Toxicity-Repulsive Drifting for Antigen-Conditioned Discrete Peptide Generation
链接: https://arxiv.org/abs/2606.27824
作者: Takashi Fujiwara,Hikaru Shindo,Kaushalya Madhawa,Jun Jin Choong,Keisuke Ozawa
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: preprint
Abstract:Peptides are a promising therapeutic modality that combine the chemical tunability of small molecules with the target specificity of macromolecular therapeutics. However, designing antigen-specific binding peptides while avoiding toxicity remains a major challenge for therapeutic peptide discovery. Here, we present Pepti-drift, a toxicity-aware latent refinement framework that generates peptide candidates through a single antigen-conditioned drift step. In a peptide embedding space, Pepti-drift learns to attract generated peptide latents toward antigen-matched binding peptides while repelling them from toxicity-associated regions. This is challenging because binding-promoting physicochemical features often overlap with toxicity-associated features in peptide representation space. To address this, we introduce a warm-up strategy to stabilize this competing objective by first learning binding-oriented attraction and then increasing toxicity repulsion.
[AI-26] ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents
链接: https://arxiv.org/abs/2606.27814
作者: Qitai Tan,Zefang Zong,Yang Li,Peng Chen
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement. On-policy distillation (OPD) provides dense teacher guidance and typically improves rapidly in the early stage, but its gains saturate once the student approaches the teacher, limiting the final performance ceiling. Reinforcement learning (RL) directly optimizes environment rewards and encourages exploratory improvement toward a higher reward-defined ceiling, but sparse and delayed feedback makes early-stage learning much less efficient than OPD. In this paper, we propose ATOD (Annealed Turn-aware On-policy Distillation), a hybrid online distillation algorithm that explicitly exploits this complementarity. (1) ATOD uses an annealed OPD-RL schedule: OPD dominates early training to approach teacher-level behavior, while RL is gradually strengthened to drive reward-based exploration. (2) ATOD introduces Turn-level Disagreement-Uncertainty Reweighting (T-DUR), which softly amplifies high-utility turns and improves dense supervision in long trajectories. Experiments on ALFWorld, WebShop, and Search-QA show that ATOD consistently outperforms competing post-training baselines: across the three student sizes, ATOD improves average success rate by 3.03 points over OPD and 23.62 points over GRPO, while surpassing the corresponding teacher models by 2.16 points.
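The annealed OPD-RL schedule in point (1) can be pictured as a weight that shifts from the distillation loss to the RL loss over training. The abstract does not give the schedule's form, so the linear ramp and the names `opd_weight` and `mixed_loss` below are assumptions made purely for illustration.

```python
def opd_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal the distillation weight from `start` to `end`.

    The linear shape is an assumption; the paper may use another schedule.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def mixed_loss(opd_loss, rl_loss, step, total_steps):
    """Blend the two objectives: OPD dominates early, RL dominates late."""
    w = opd_weight(step, total_steps)
    return w * opd_loss + (1.0 - w) * rl_loss


early = mixed_loss(2.0, 4.0, step=0, total_steps=100)    # pure OPD loss
middle = mixed_loss(2.0, 4.0, step=50, total_steps=100)  # equal mix
late = mixed_loss(2.0, 4.0, step=100, total_steps=100)   # pure RL loss
```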
[AI-27] Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
链接: https://arxiv.org/abs/2606.27806
作者: Xinyuan Song,Zekun Cai
类目: Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:World models for language agents come in two useful forms. An agent-based world model calls an LLM API and reasons flexibly in language, but its errors appear as hallucinated state changes that are hard to score with ordinary regression losses. A parameterized world model is a trained transition predictor; its errors are easier to measure with quantities such as NodeMSE, delta accuracy, and validity accuracy, but it is usually weaker as a standalone planner. We compare these two families on four graph-structured planning benchmarks and introduce operational hallucination metrics for the agent-based case. The comparison motivates Grounded Iterative Language Planning (GILP), which trains only a small parameterized backbone and combines it with API-based agent reasoning. The backbone supplies valid actions, predicted state deltas, risk, and value; the LLM drafts an action and imagined delta; and a consistency gate asks for revision when the two disagree. On real GPT-4o-mini calls, GILP reduces hallucinated-state rate from 0.176 to 0.035. In calibrated simulator ablations, it raises success from 0.668 to 0.838 while adding only ~22% extra LLM calls.
[AI-28] Optimizing Teacher-Student Partitioning for Scalable Knowledge Distillation on HPC Systems
链接: https://arxiv.org/abs/2606.27797
作者: Adrian P. Dieguez,Victor Conchello Vendrell,Alex Batlle,Vinnam Kim,Jordi Ros-Giralt,Harris Teague
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Knowledge Distillation (KD) enables training smaller student models under the guidance of larger teacher models, and the widely adopted TRL library implements it. Yet, TRL treats both models symmetrically, missing opportunities to exploit their pronounced asymmetry in memory footprint and communication requirements. This paper presents an HPC-aware methodology for KD that decouples teacher and student partitioning efficiently. Our approach achieves up to 67% higher samples-per-second than TRL by avoiding unnecessary teacher-model data structures and selecting the best split strategy. We combine vertical and horizontal partitioning of models, deriving an analytical expression that identifies the existence of inflection points between splitting regimes. These results show that exploiting teacher–student asymmetry through topology-aware parallelism notably accelerates GKD training on production HPC clusters at our company.
[AI-29] Understanding Rollout Error in Graph World Models
链接: https://arxiv.org/abs/2606.27780
作者: Xinyuan Song,Zekun Cai
类目: Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:World models are often used for planning by rolling learned dynamics forward. Many planning environments, however, are not vectors or images; they are graphs of agents, tools, skills, routes, and dependencies. In these settings, a local prediction error may stay local or spread through the graph, and the failure mode changes again when edges are predicted rather than fixed. This paper studies long-horizon rollout error in Graph World Models (GWMs). We formulate a unified fixed-edge and dynamic-edge GWM framework with action nodes for node-, edge-, and graph-level decisions. We develop graph-valued rollout bounds that separate topology-induced amplification from model-induced amplification, and we introduce a joint node-edge operator for dynamic-edge rollouts. Guided by the analysis, we propose Error-Aware GWM, which combines spectral regularization, rollout consistency, and critical-node weighting. Across synthetic topologies and heterogeneous agent-graph testbeds, rollout error and planning regret grow with horizon, dynamic-edge training is needed when structure evolves, and Error-Aware GWM prevents long-horizon divergence while preserving prediction accuracy. Real-world graph benchmarks clarify the scope of GWMs: they are most useful for dynamic graph rollout and agent planning, while specialized graph models remain strong on static or sparse prediction tasks.
[AI-30] RS-Diffuser: Risk-Sensitive Diffusion Planning with Distributional Value Guidance
链接: https://arxiv.org/abs/2606.27766
作者: Shiqiang Gong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: ICIC 2026 Oral
Abstract:Offline reinforcement learning enables policy learning from fixed datasets without additional environment interaction, making it appealing for safety-critical applications where online exploration is costly or unsafe. Diffusion-based decision-making methods have recently achieved strong performance in offline RL by modeling rich, multimodal trajectory distributions. However, existing diffusion planners are typically risk-neutral and therefore may overlook rare but catastrophic outcomes that are crucial in real-world deployment. In this work, we propose RS-Diffuser, a risk-sensitive offline diffusion planning framework that combines diffusion-based trajectory generation with distributional value critics. RS-Diffuser learns a diffusion planner over future state trajectories, a separate inverse dynamics model for action decoding, and a Monte Carlo distributional critic that estimates the full return distribution of candidate plans through quantile regression. At sampling time, we incorporate a risk-sensitive guidance signal into the denoising process, using gradients computed from tail-aware objectives such as Conditional Value at Risk to steer generation toward desired risk profiles. As a result, a single trained model can flexibly produce risk-averse, risk-neutral, or risk-seeking behaviors by changing only the inference-time risk parameter. Extensive experiments on risk-sensitive D4RL and risky robot navigation benchmarks demonstrate that RS-Diffuser achieves state-of-the-art performance, improving both overall return and worst-case robustness while reducing safety violations.
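To make the tail-aware objective concrete, the sketch below estimates lower-tail Conditional Value at Risk from a set of equally weighted return quantiles, such as a quantile-regression critic might output. The helper `cvar_from_quantiles` and the sample numbers are hypothetical; the paper's actual guidance uses gradients of such objectives inside the denoising loop, which is not shown here.

```python
import math


def cvar_from_quantiles(quantile_values, alpha):
    """Approximate lower-tail CVaR at level `alpha`.

    Assumes the inputs are equally weighted quantile estimates of the
    return distribution; averages the worst ceil(alpha * N) of them.
    """
    if not quantile_values:
        raise ValueError("need at least one quantile value")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(quantile_values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


# Made-up return quantiles for one candidate plan.
plan_quantiles = [10.0, -10.0, 5.0, 0.0]
risk_averse = cvar_from_quantiles(plan_quantiles, alpha=0.5)   # mean of -10 and 0
risk_neutral = cvar_from_quantiles(plan_quantiles, alpha=1.0)  # plain mean
```

Changing only `alpha` moves the score between worst-case and average-case views, which mirrors the single inference-time risk parameter the abstract describes.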
[AI-31] Towards Reliable and Robust LLM Planning: Symbolic Feedback-Driven Iterative Self-Refinement Framework
链接: https://arxiv.org/abs/2606.27757
作者: Jiajing Zhang,Jiamei Jiang,Chenyang Zhang,Feifei Mo,Linjing Li,Daniel Zeng
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have attracted widespread attention from academia and industry, yet their deployment raises critical security concerns regarding robustness and reliability. Planning, a core component of intelligent behavior, remains challenging for LLMs, which often produce infeasible or incorrect solutions in long-horizon decision-making tasks due to inherent complexity. In this paper, we propose a symbolic feedback-driven iterative self-refinement framework to enhance the robustness and reliability of LLMs in long-horizon planning. Specifically, a natural language prompting mechanism is introduced to map logical symbols into natural language descriptions, enabling LLMs to better capture task constraints and semantics. We further design a symbolic verifier that identifies errors and converts them into corrective instructions interpretable by the LLM, thereby guiding self-refinement. In addition, we leverage a plan recognizer to infer goal reachability, facilitating more effective guidance toward desired goals. Empirical results demonstrate that the proposed framework consistently improves both feasibility and correctness in long-horizon planning tasks. This highlights its effectiveness in enhancing the reliability of LLM-based planning and potential to enable more trustworthy AI systems.
[AI-32] Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?
链接: https://arxiv.org/abs/2606.27755
作者: Guoheng Sun,Kaixi Feng,Shwai He,Xiaochuan Gong,Yexiao He,Ziyao Wang,Zheyu Shen,Wanghao Ye,Ramana Rao Kompella,Gaowen Liu,Ang Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language-Action (VLA) models enable instruction-driven robotic manipulation, but they inherit oversized language backbones from pretrained VLMs whose capacity far exceeds what is needed for short robotic instructions. This raises a basic question: how much of a VLA model is actually necessary for closed-loop control? In this work, we study architectural redundancy in VLA models by using transformer block removal as a controlled intervention. We introduce Drop-Then-Recovery (DTR), an analysis protocol that removes selected blocks from a pretrained VLA model and then fine-tunes the resulting model to measure whether the removed capacity was necessary for downstream control. To make this intervention reliable, we propose GateProbe, a one-shot virtual-gate sensitivity metric that ranks blocks by their contribution to the downstream action loss. Across multiple VLA architectures, manipulation benchmarks and even real-robot industrial scenarios, we find a strong asymmetry in post-removal recoverability: language backbones are highly redundant for standard robotic manipulation tasks, whereas vision and action pathways are substantially less tolerant to removal. On LIBERO, removing half of the LLM blocks even improves OpenVLA-OFT from 95.0% to 98.3% under the same downstream fine-tuning budget, and retaining only two language blocks still recovers baseline-level performance. These results suggest that current VLA benchmarks may exert limited pressure on deep language grounding and compositional instruction understanding, and that future VLA architectures should allocate capacity more deliberately across language, vision, and action components. The code is available at this https URL.
[AI-33] From General-Purpose Audio Tagging to Spatially Grounded Sound Event Localization and Detection
链接: https://arxiv.org/abs/2606.27751
作者: Stefano Giacomelli,Stefano Damiano,Claudia Rinaldi,Fabio Graziosi,Toon van Waterschoot
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Technical Report (KU Leuven - UnivAQ)
Abstract:This report investigates the extension of pretrained General-Purpose Audio Tagging (GP-AT) models toward spatially grounded Sound Event Localization and Detection (SELD). The proposed AT2SELD framework couples a pretrained AT backbone with compact First-Order Ambisonics (FOA) spatial processing, track-wise SED and Cartesian DOA estimation, permutation-aware supervision, and calibration. It characterizes how semantic audio priors support localization-aware scene analysis under data, computation, and deployment constraints. The framework is developed through informed multi-stage Neural Architecture Search (NAS). Stage 1 shows that spectral FOA descriptors, based on magnitude, phase, and Intensity Vectors (IVs), provide the most reliable interface for semantic-to-spatial transfer. Stage 2 identifies early residual spatial encoding as the main capacity-sensitive component, while late track-wise abstraction and recurrent smoothing act mainly as refinement stages. Stage 3 shows that late cross-stitch coupling improves semantic-spatial interaction, whereas early fusion is costlier and less effective. Diagnostic evaluation analyzes the selected architecture under class balancing, focal loss, activity-conditioned DOA supervision, threshold calibration, and transfer across STARSS23, TAU2019, TAU-NIGENS2020, and TAU-NIGENS2021. Focal loss improves the activity point, active-only DOA supervision mitigates inactive target dominance, and validation-selected thresholds recover calibration without replacing spatial learning. Cross-dataset and oracle-activity analyses indicate strong fixed source localization on TAU2019, transferable representations from TAU-NIGENS2021, and meaningful but uncertain behavior on STARSS23. Overall, GP-AT priors appear promising for SELD design when embedded in spatial-aware architectures and optimized through integrated calibration and deployment-oriented strategies.
[AI-34] Flexformer: Flexible Linear Transformer with Learnable Attention Kernel
链接: https://arxiv.org/abs/2606.27748
作者: Haoran Zhang,Feng Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer models rely on the attention mechanism to capture long-range dependencies but suffer from quadratic complexity, limiting their scalability to long sequences. Kernel-based linear attention reduces this complexity but typically relies on fixed or weakly learnable kernels, restricting expressiveness and performance. In this work, we propose Flexformer, a flexible linear Transformer that learns attention kernels in a fully data-driven manner. Flexformer builds on random Fourier feature-based linear attention and treats spectral frequencies as trainable parameters, enabling the model to learn a broad family of attention kernels. We develop both stationary and nonstationary variants, with the latter offering strictly greater expressiveness. Extensive experiments on language modeling and sequence classification demonstrate that Flexformer consistently outperforms baselines. Moreover, Flexformer can be effectively distilled from pretrained Transformers to recover softmax attention and exhibits strong kernel transferability across domains, achieving both high efficiency and competitive performance on long-sequence tasks.
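The building block named in the abstract, random Fourier feature linear attention, can be sketched in plain Python. This is a generic textbook version with fixed frequencies, not Flexformer itself: in the paper the frequencies are trainable, and every name below (`rff_features`, `linear_attention`, `quadratic_attention`, the toy inputs) is hypothetical. The inputs are kept small so that all kernel weights stay positive and the normalizer is well defined.

```python
import math


def rff_features(x, freqs):
    """Map x to [cos(w.x), sin(w.x)] / sqrt(m), one pair per frequency row w."""
    proj = [sum(wi * xi for wi, xi in zip(w, x)) for w in freqs]
    scale = 1.0 / math.sqrt(len(freqs))
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]


def linear_attention(queries, keys, values, freqs):
    """Non-causal kernel attention, linear in sequence length."""
    phi_k = [rff_features(k, freqs) for k in keys]
    d_feat, d_val = len(phi_k[0]), len(values[0])
    # Key statistics are summed once: kv[f][j] = sum_i phi(k_i)[f] * v_i[j].
    kv = [[sum(pk[f] * v[j] for pk, v in zip(phi_k, values)) for j in range(d_val)]
          for f in range(d_feat)]
    z = [sum(pk[f] for pk in phi_k) for f in range(d_feat)]
    out = []
    for q in queries:
        pq = rff_features(q, freqs)
        denom = sum(pq[f] * z[f] for f in range(d_feat))
        out.append([sum(pq[f] * kv[f][j] for f in range(d_feat)) / denom
                    for j in range(d_val)])
    return out


def quadratic_attention(queries, keys, values, freqs):
    """Reference version that forms every query-key weight explicitly."""
    out = []
    for q in queries:
        pq = rff_features(q, freqs)
        weights = [sum(a * b for a, b in zip(pq, rff_features(k, freqs))) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out


freqs = [[0.1, 0.2], [0.3, -0.1]]  # fixed here; trainable in the paper
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[0.5, -0.5], [-1.0, 1.0], [0.2, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fast = linear_attention(queries, keys, values, freqs)
slow = quadratic_attention(queries, keys, values, freqs)
```

Both functions compute the same output; the linear version simply reorders the sums so that the cost grows with sequence length rather than its square.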
[AI-35] ToE: A Hierarchical and Explainable Claim Verification Framework with Dynamic Multi-source Evidence Retrieval and Aggregation
链接: https://arxiv.org/abs/2606.27736
作者: Zhaoqi Wang,Zijian Zhang,Kun Zheng,Zhen Li,Xin Li,Chunlei Li,Jiamou Liu
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:The rapid spread of fake news poses increasing threats to information ecosystems, especially as AI-generated misinformation under Generative Engine Optimization (GEO) poisoning allows adversarially crafted content to be systematically surfaced by retrieval systems, contaminating LLM reasoning. In this paper, we propose Tree of Evidence (ToE), a hierarchical evidence reasoning framework for automated fact-checking that models each claim as a dynamically expanding argument tree. ToE integrates a reinforcement learning-driven multi-source retrieval agent, an evidence evaluation agent, and an argument tree aggregation algorithm to iteratively decompose, retrieve, and verify claims through an explainable evidence chain. We further provide a theoretical analysis of the retrieval process, deriving a formal error bound that guarantees the learned policy converges to a neighborhood of the information-theoretically optimal policy. Experiments across multiple datasets and backbone LLMs demonstrate that ToE achieves improvements ranging from 4 to 24 percentage points over competitive baselines, with particularly pronounced gains on adversarially poisoned inputs.
[AI-36] The Simulacrum: Decision-Theoretic Pretraining for Near-Optimal Time-Series Forecasting and Inference
链接: https://arxiv.org/abs/2606.27711
作者: Pablo Montero-Manso,Marcel Scharth
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
备注:
Abstract:We introduce a neural network-based framework for learning time series estimators through a process we term decision-theoretic pretraining. Analysts specify a generative world, a distribution over data-generating processes, and a target decision objective. A neural network trained on stratified simulations from this world approximates the corresponding optimal decision rule, yielding a neural estimator that provides forecasts, parameter estimates, predictive intervals, or model selection for zero-shot inference on previously unseen time series. The joint specification of the generative world and objective enables the estimators to directly approximate process-level, finite-sample properties: near-optimal risk, bias control, minimax performance, and uniform calibration. Our experiments demonstrate that these neural estimators can outperform traditional baselines such as maximum likelihood estimation and model selection via AICc, for the same structural model classes. Furthermore, even when trained purely on simulations of structural models, they achieve competitive or state-of-the-art forecasting accuracy on major real-world benchmarks, compared with statistical, neural or large pre-trained models. We illustrate the framework by addressing two longstanding challenges: finite-sample bias and miscalibration in AR(p) models, and the forecast combination puzzle. These applications highlight the approach’s main advantage: its ability to approximate solutions to analytically intractable or computationally prohibitive time series problems, including complex structural equations or optimality criteria. Ultimately, by enabling explicit control over decision-theoretic trade-offs, the framework equips analysts with highly efficient estimation tools tailored to their specific analytical needs.
[AI-37] Room for Error: Large-Scale Simulation of Over-the-Air Acoustic Attacks
链接: https://arxiv.org/abs/2606.27701
作者: Andrew C. Cullen,Neil Marchant,Jiani Xie,Paul Montague,Benjamin I.P. Rubinstein
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 20 pages
Abstract:While voice control is rapidly becoming a ubiquitous vector of human-AI communication, the risks facing these systems remain poorly understood. This is, in part, a product of the difficulties in scaling strictly digital adversarial workflows to the physical world. These scale barriers have led the community to abstract away key acoustic factors relating to detectability and the influence of geometry on acoustics. These methodological and metrological shortcomings undermine our understanding of risk. We illuminate these issues through real-world testing, conceptual discussions, and a novel, high-throughput reality simulation framework. By testing over 8 million adversarial evaluations, we demonstrate that acoustic awareness yields relative Word Error Rate increases of up to 94.5% under Whisper and wav2vec. We employ this framework to formalize and operationalize a Dual-Form Signal-to-Noise Ratio to decouple source stealth from victim attack efficacy, resolving a crucial limitation in current works. This lays the groundwork for repeatable, verifiable research that embraces, rather than abstracts, the acoustic environment.
[AI-38] What Was That Again? Certified Robustness for Automatic Speech Recognition
链接: https://arxiv.org/abs/2606.27698
作者: Andrew C. Cullen,Neil Marchant,Jiani Xie,Paul Montague,Benjamin I.P. Rubinstein
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Sound (cs.SD)
备注: 17 pages
Abstract:Automatic Speech Recognition systems are notoriously sensitive to both adversarial and benign perturbations. While this has been repeatedly demonstrated using reference datasets, detecting such behaviors in deployed systems is incredibly challenging, due to the absence of oracle knowledge of the true transcription. We demonstrate that employing a certification-inspired mechanism can significantly decrease WER, increase recall, and decrease the Spearman correlation between confidence and WER. We achieve this through a dual-gate diagnostic pipeline: a Two-Sided Atomic Audit that accumulates statistical wealth to certify both token existence and adversarial exclusion, and a Rank-Based Tournament that selects the winning sequence. Our evaluations across four diverse architectures demonstrate up to a 55% relative reduction in Word Error Rate, while also providing granular word- and sentence-level certifications to enhance acoustic security.
[AI-39] Halt Fast! Early Stopping for Certified Robustness
链接: https://arxiv.org/abs/2606.27694
作者: Andrew C. Cullen,Paul Montague,Benjamin I.P. Rubinstein
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 24 pages
Abstract:Randomized Smoothing (RS) provides rigorous robustness guarantees for neural networks without architectural constraints, yet its adoption is limited by extreme computational costs. Standard RS requires tens of thousands of model evaluations per input and forces practitioners to commit to fixed sample sizes a priori. In this work, we present a novel meta-learning framework for anytime-valid certified robustness that adaptively deploys computational resources. By using a lightweight meta-learner to predict image-specific priors for a sequential E-process, we achieve a 20-fold reduction in sample complexity compared to traditional methods while maintaining rigorous statistical guarantees. Beyond raw efficiency, we demonstrate how anytime-validity enables adaptively allocating compute based upon application-specific risk thresholds, a form of resource triage impossible under classic certification frameworks. That this is achievable while also providing similar certification performance demonstrates that our approach provides a pathway for real-time, safety-critical certification deployments.
[AI-40] Deployment-Side Adaptiveness in Multi-Horizon Volatility Forecasting KDD2026
链接: https://arxiv.org/abs/2606.27688
作者: Riku Green,Zahraa S. Abdallah,Telmo M Silva Filho
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for KDD 2026 Machine Learning in Finance Workshop
Abstract:In financial forecasting, predictive performance depends not only on which model is trained, but also on how the trained model is deployed. We study this issue in multi-horizon volatility forecasting. Our starting point is that a trained multi-output (MIMO) forecaster does not define a single deployable predictor: by changing the inference-time rollout rule, the same trained model induces a family of forecasts with different accuracy and cost profiles. Across 20 stock-volatility series, three forecast horizons, and architectures ranging from linear models to PatchTST, we find that non-default rollout rules often improve over standard MIMO deployment. However, the best fixed rule varies substantially across architectures and horizons, making any single static replacement unreliable. We therefore evaluate validation-based deployment policies over the induced rule family. Under the primary MSE objective, validation-selected singletons provide a low-cost improvement over default MIMO, while small rule subsets recover much of the benefit of larger ensembles at substantially lower inference cost. We also find that policy rankings are metric-sensitive: MSE-selected policies do not transfer uniformly to QLIKE, a finance-standard volatility loss. These results show that inference-time deployment is a meaningful source of adaptiveness in financial forecasting, and that trained volatility forecasters should be evaluated not only by their architecture, but also by their deployment policy.
[AI-41] CBD: API-Only LLM Black-Box Unlearning through Controlled Behavioral Divergence
链接: https://arxiv.org/abs/2606.27683
作者: Zhiqiang Xie,Yijing Lin,Zhipeng Gao,Dong In Kim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Edge devices increasingly invoke large language models (LLMs) through API services for context-aware edge intelligence, while edge-generated data may be collected to improve LLMs and may introduce sensitive, copyrighted, harmful, or outdated information into model behavior. Machine unlearning offers a practical way to remove the influence of undesired data without retraining LLMs. However, existing methods still face two gaps. The first is API-only black-box access, where target model parameters and internal logits are unavailable. The second is how to preserve retained utility when unlearning target data and retained data share highly similar prompt structures or semantic patterns. To address these challenges, we propose Controlled Behavioral Divergence (CBD), an API-only black-box unlearning framework. CBD uses two auxiliary models to create controlled behavioral divergence between retained inputs and unlearning target inputs, converts this divergence into an unlearning relevance score, and routes unlearning-related prompts away from the target LLM. To improve discrimination accuracy under high similarity between target and retained data, CBD constructs a gradient-statistics-based discriminative basis by estimating empirical Fisher matrices and solving a regularized generalized eigenvalue problem, guiding the unlearning signal toward target-specific information rather than shared prompt structures. Compared with eleven white-box and gray-box unlearning baselines, CBD achieves a better unlearning-utility trade-off and its performance varies little across settings. On ToFU forget10, CBD approaches the retrained reference on the forget set while raising model utility to 74.90, about 15% above the second-best baseline. On WMDP, it lowers hazardous knowledge accuracy to 25.68, near random guessing, while preserving MMLU accuracy of 52.67. Code is at this https URL.
[AI-42] MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy
Link: https://arxiv.org/abs/2606.27652
Authors: Zhiyuan Han, Beier Zhu, Wenwen Tong, Chengwei Qin, Xinyi Wang, Jiayu Zhang, Jiangnan Chen, Hewei Guo, Dongchuan Ran, Lewei Lu, Xun Yang
Subjects: Artificial Intelligence (cs.AI)
Comments: Under review
Abstract:We find that explicit reasoning does not necessarily translate into better multimodal emotion recognition (MER) accuracy, even though it makes predictions more interpretable. Specifically, for reasoning-based MLLMs, fast thinking by triggering direct answers often outperforms slow thinking after deliberative reasoning. Our empirical analyses show that fast thinking improves recall with broader and more confident predictions, whereas slow thinking favors precision through conservative filtering of incorrect categories. Building on these insights, we propose MER-R1, a reinforcement learning framework that turns slow-fast complementarity into explicit optimization. Dual-objective disentanglement separates recall and precision into two optimization signals, allowing them to be jointly optimized rather than traded off against each other. Slow-fast confidence calibration further aligns the final slow-thinking answer with fast-thinking intuition, strengthening correct emotions while suppressing incorrect ones. In this way, MER-R1 unifies the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking. We further provide theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization. Extensive experiments on MER-UniBench and MME-Emotion show that MER-R1 achieves state-of-the-art performance and makes reasoning genuinely benefit emotion recognition.
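The recall/precision contrast described in this abstract can be illustrated with a small sketch. It assumes emotion predictions are treated as label sets; the example labels and the function name `precision_recall` are hypothetical and are not taken from the paper.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall for one multi-label prediction."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: a broad "fast" answer versus a conservative "slow" answer.
gold = ["happy", "surprised"]
fast_answer = ["happy", "surprised", "excited"]  # broader, one wrong label
slow_answer = ["happy"]                          # filtered, one missed label

fast_p, fast_r = precision_recall(fast_answer, gold)
slow_p, slow_r = precision_recall(slow_answer, gold)
```

In this toy case the broad answer has full recall but lower precision, while the conservative answer has full precision but lower recall, which is the kind of complementarity the abstract refers to.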
[AI-43] HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models
Link: https://arxiv.org/abs/2606.27627
Authors: Artem Ploujnikov, Francesco Verdini, Samir Sadok, Mirco Ravanelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted
Abstract:Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance degradation on various downstream tasks due to information loss during discretization. To address this, we propose a novel approach combining temporally compressed discrete tokens with dimensionality-reduced continuous residuals. Our framework consists of a hybridized discrete-continuous focal modulation codec and a hybrid Transformer. This architecture performs autoregressive inference in the discrete domain, coupled with non-autoregressive prediction and continuous residual upsampling. Experimental results show that our approach significantly improves the retention of speaker characteristics compared to discrete-only methods, while simultaneously reducing the number of required autoregressive steps.
[AI-44] Global Explanations for Multivariate Time Series Forecasting Models via K-Order Markov Approximations IJCAI2026
Link: https://arxiv.org/abs/2606.27599
Authors: Amadeo Tunyi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the Workshop on Explainable Artificial Intelligence (XAI), International Joint Conference on Artificial Intelligence (IJCAI 2026)
Abstract:While many explainable AI (XAI) methods have been proposed, most are not designed for time-series forecasting models and often rely on the implicit assumption that timestamp features are independent. This assumption ignores the fundamental property of temporal dependence and can lead to explanations that violate the sequential and causal structure of the data. We introduce KARMA, a method for explaining time-series predictors by constructing a Markov surrogate model that captures the temporal dependencies learned by the predictor. Our approach revolves around three main aspects: identifying the minimal history length K that is predictively sufficient for the model, estimating the best-fitting K-order Markov transition kernel from the discretized history space, and a five-level global explanation hierarchy that can be derived from the Markov transition kernel, which we illustrate using real-world weather data (Beijing PM2.5). We also certify using complex synthetic data with known true causal edges that KARMA (i) recovers the data causal structure as learned by the model via a controlled experiment and (ii) identifies temporal dependencies better than established attribution methods such as TimeSHAP.
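As a rough illustration of what estimating a K-order Markov transition kernel from a discretized history can look like, here is a minimal counting-based sketch. It is not the paper's procedure: the function name `fit_k_order_kernel` is hypothetical, and the toy sequence stands in for discretized model predictions.

```python
from collections import Counter, defaultdict


def fit_k_order_kernel(states, k):
    """Estimate P(next state | previous k states) by counting transitions."""
    counts = defaultdict(Counter)
    for t in range(k, len(states)):
        history = tuple(states[t - k:t])  # the k states preceding position t
        counts[history][states[t]] += 1
    kernel = {}
    for history, next_counts in counts.items():
        total = sum(next_counts.values())
        kernel[history] = {s: c / total for s, c in next_counts.items()}
    return kernel


# Toy discretized sequence (assumed data, two bins labelled 0 and 1).
sequence = [0, 1, 0, 1, 0, 1, 1, 0, 1]
kernel = fit_k_order_kernel(sequence, k=2)
```

Each key of `kernel` is a length-K history and each value is the empirical distribution over the next state. Histories that never occur in the sequence are simply absent, so a real surrogate would need smoothing or a fallback for them.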
[AI-45] Odyssey: Constructing Verifiable Local Truth-Preserving Foundation Models
Link: https://arxiv.org/abs/2606.27593
Authors: Sridhar Mahadevan
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages
Abstract:We introduce a categorical framework called ODYSSEY for constructing verifiable, local truth-preserving foundation models as compositions of foundries: building-block architectural components that specify a cover of local contexts, local representation families, restriction maps, gluing rules, obstruction policies, update obligations, and human-facing views. A foundry is an organized sheaf of knowledge that carries within it an argumentation component. Concrete foundries are built from generic foundries such as evidence/argument, operational decision, institutional/financial, market meaning, scientific challenge, research-program, assistant-build, and evaluation-harness foundries. Universal Foundry Learning (UFL) formalizes foundry construction as a composition of left and right Kan extensions, with left Kan extension rolling local artifacts into candidate foundries and right Kan extension enforcing the restriction, gluing, obstruction, and argumentation conditions required for promotion. Foundry SQL (FSQL) is a small typed query surface for slicing maintained foundry artifacts that uses TICKET (Topos Integration using Causal Kan Extension Transformers) certification for admitting external or pre-built models into durable ODYSSEY state. ODYSSEY is fully implemented and tested across a wide spectrum of concrete foundries, showing that the same categorical machinery supports domain construction, artifact replay, sheaf diagnostics, grounded Toulmin/local-LLM scrutiny, residual-obstruction ledgers, and optimized TICKET-compatible causal-claim extraction across heterogeneous sources. This paper is to be presented as a 2.5 hour tutorial at ICML 2026. The tutorial home page is at this https URL.
[AI-46] SceneBot: Contact-Prompted General Humanoid Whole Body Tracking with Scene-Interaction
Link: https://arxiv.org/abs/2606.27581
Authors: Sirui Chen, Shibo Zhao, Zhen Wu, Jiaman Li, Guanya Shi, C. Karen Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures
Abstract:Current humanoid reinforcement-learning policies excel at free-space motions but struggle with contact-rich tasks, as pure kinematic tracking cannot resolve the physical ambiguities of interacting with objects and uneven terrain. To address this, we introduce SceneBot, a unified motion-tracking framework capable of handling free-space locomotion, terrain traversal, and whole-body manipulation. SceneBot conditions a single policy on both reference motions and per-link contact labels, explicitly defining expected environmental interactions. To overcome the lack of annotated interaction data, we propose a hindsight scene reconstruction approach that infers scene-interaction graphs from retargeted human motion. Trained on 7.5 hours of this reconstructed, contact-rich data, SceneBot successfully generalizes to unseen motions and environments. Our results demonstrate that SceneBot is the first general framework to seamlessly unify free-space and contact-rich behaviors, executing complex, long-horizon tasks like carrying a box upstairs, and establishing contact conditioning as a powerful interface for humanoid control. All code and data will be open-sourced. More demos and information are available at: this https URL
[AI-47] Retroactive Advantage Correction: Closed-Form V-Trace Bias Correction for Delay-Aware RLHF ICML2026
Link: https://arxiv.org/abs/2606.27580
Authors: Arnav Raj
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the ICML 2026 Workshop on Reinforcement Learning from World Feedback (RLxF). Code: this https URL
Abstract:Reinforcement learning from human feedback (RLHF) in production does not always have a synchronous reward signal. Code-execution verifiers, slow judge ensembles, and queued human review can return several gradient steps after the rollout that produced them, breaking the synchronous-reward assumption underlying standard PPO. We address this gap with Retroactive Advantage Correction (RAC): each pending slow completion is queued, aged through a non-negative kernel, and reinjected as a clipped residual into the next optimiser step’s advantage. We prove that under an unbiased clipped importance ratio, the cumulative RAC correction is exactly unbiased when the effective delay kernel reinjects all of its mass, and carries a bias linear in the unreinjected fraction otherwise; at the no-delay identity kernel it reduces to V-trace. On a tabular Markov decision process (MDP) proof-of-concept, RAC reduces the closed-form policy bias by up to 47.9x at the two-slow-channel configuration, beating wait-for-slow at lower wall-clock cost. RAC integrates with PPO and GRPO through a two-line reward-manager patch.
[AI-48] PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration ICML2026
Link: https://arxiv.org/abs/2606.27578
Authors: Arnav Raj
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the ICML 2026 Workshop on Pluralistic Alignment. Code: this https URL
Abstract:Reward models for Reinforcement Learning from Human Feedback (RLHF) pool preferences across thousands of annotators and fit one global affine calibrator, collapsing raters with systematically different rating-scale offsets and slopes into a single average-rater fit that does not match any individual annotator. PEBS is a per-rater empirical-Bayes shrinkage estimator: it fits per-rater affine calibrators on a held-out slice of each annotator’s ratings and applies Morris-James-Stein empirical-Bayes shrinkage toward the population mean, in closed form and without retraining the reward model. On PRISM, PEBS reduces within-user held-out RMSE by 8.58% over the pooled population-slope baseline. The procedure replicates on PluriHarms harm ratings (Qwen-2.5 base, in-family) with a +9.66% RMSE reduction over the same population-slope baseline. PEBS is a closed-form post-hoc estimator for annotator-specific affine calibration in RLHF reward modeling; it leaves the reward base model unchanged and estimates only the rater-level map used at inference time for new ratings.
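The general idea of shrinking per-rater estimates toward a population mean can be sketched with a textbook empirical-Bayes rule. This is a generic illustration under stated assumptions, not the PEBS estimator itself: the function name `eb_shrink`, the method-of-moments variance estimate, and the toy numbers are all hypothetical.

```python
from statistics import mean, variance


def eb_shrink(estimates, sampling_vars):
    """Shrink noisy per-rater estimates toward their mean.

    estimates     -- one raw estimate per rater (for example a rating offset)
    sampling_vars -- the sampling variance of each raw estimate
    Needs at least two raters, because the sample variance is used.
    """
    mu = mean(estimates)
    # Method-of-moments estimate of the true between-rater variance, floored at 0.
    between = max(0.0, variance(estimates) - mean(sampling_vars))
    shrunk = []
    for est, var in zip(estimates, sampling_vars):
        # Weight near 1 keeps the raw estimate; weight near 0 falls back to the mean.
        weight = between / (between + var) if between + var > 0 else 0.0
        shrunk.append(mu + weight * (est - mu))
    return shrunk


# Toy offsets for three raters, each measured with the same noise level.
shrunk_offsets = eb_shrink([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

Raters whose estimates are noisy relative to the spread across raters are pulled more strongly toward the population mean. When the noise exceeds the observed spread, every estimate collapses to the mean.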
[AI-49] HIA-GAT: A Heterogeneous Interaction-Aware Graph Attention Network For Frame-Level Traffic Conflict Risk Prediction On Freeways
Link: https://arxiv.org/abs/2606.27577
Authors: Mahshid Malazizi, Seyedmehdi Khaleghian, Mina Sartipi, Toru Hirano, Yunfei Xu, Hoang H. Nguyen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper formulates frame-level freeway risk assessment as a multi-agent scene graph-level binary classification problem, where each video or trajectory frame is labeled risky if any TTC- or PET-based conflict violates a specified severity threshold. We construct a relation-aware graph per frame with vehicles as nodes and two interaction types as edges: same-lane (longitudinal) and adjacent-lane (lateral), augmented with physics-informed edge features aligned to rear-end and lane-change conflict mechanisms. Building on a structured benchmarking suite of non-graph models and graph baselines, we propose HIA-GAT, a dual-stream heterogeneous graph attention network that processes longitudinal and lateral interactions through dedicated attention pathways and fuses them via a conflict-type-aware gating mechanism with event-level gate supervision derived from SSM conflict attribution. Experiments on the NGSIM I-80 and US-101 freeway datasets across nine TTC and PET threshold configurations show that HIA-GAT achieves the best average risk-ranking performance (AUC 0.835 on I-80 and 0.867 on US-101), with the largest gains on PET-only (lane-change) settings where relational structure is essential. Beyond accuracy, the learned gate provides interpretable per-vehicle attribution of dominant conflict type, supporting actionable, real-time freeway safety monitoring. We show that graph structure is critical for modeling lateral conflict risk, while longitudinal risk can often be captured by non-relational aggregation.
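The frame-labelling rule in this abstract (a frame is risky if any conflict measure violates a threshold) can be sketched for the TTC case alone. The sketch uses the standard time-to-collision definition for a follower and leader in the same lane; the function names, the tuple layout, and the threshold values are assumptions, and PET is not covered.

```python
import math


def time_to_collision(gap_m, follower_speed, leader_speed):
    """TTC in seconds: gap divided by closing speed, infinite if not closing."""
    closing_speed = follower_speed - leader_speed  # metres per second
    return gap_m / closing_speed if closing_speed > 0 else math.inf


def frame_is_risky(pairs, ttc_threshold_s):
    """Label a frame risky if any follower-leader pair falls below the threshold."""
    return any(time_to_collision(*pair) < ttc_threshold_s for pair in pairs)


# Hypothetical frame: (gap in m, follower speed in m/s, leader speed in m/s).
frame = [(20.0, 30.0, 20.0), (50.0, 25.0, 25.0)]
risky_at_3s = frame_is_risky(frame, ttc_threshold_s=3.0)
risky_at_1_5s = frame_is_risky(frame, ttc_threshold_s=1.5)
```

The first pair closes a 20 m gap at 10 m/s, giving a TTC of 2 s, so the same frame is labelled risky under a 3 s threshold and safe under a 1.5 s threshold. That threshold sensitivity is why the abstract reports results across several threshold configurations.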
[AI-50] On the Inseparability of Instructions and Data in Shared-Embedding Sequence Models
Link: https://arxiv.org/abs/2606.27567
Authors: Dewank Pant, Shruti Lohani, Avijit Kumar
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 1 figure, 2 tables
Abstract:Prompt injection is the top security risk for LLM-integrated applications, yet every defense proposed so far has been broken. We prove this is not a coincidence: in shared-embedding architectures that lack enforced control-data separation, perfect prompt-injection prevention is mathematically impossible. We formalize prompted systems as Prompted Action Models whose outputs include control-authoritative actions: refusal decisions, tool authorization, policy routing, and memory writes. We define Semantic-Faithful Control (SFC), the property that such behavior depends only on the meaning of untrusted input, not on how it is encoded. We then prove SFC is unachievable within the shared pipeline, via three results: a provenance-recovery impossibility (shared representations make trusted and untrusted content statistically inseparable, bounded by total variation distance); control-path exposure (untrusted tokens enter control-relevant computation through the same attention value-aggregation that determines outputs); and a finite-coverage invariance gap (finite training cannot certify invariance over infinite semantic-equivalence classes). We ground each quantity in measurements on production tokenizers and models. The result is structural, not a gap in current defenses. It mirrors the code-data confusion in Von Neumann machines that gives rise to buffer overflows, a vulnerability class that took decades of layered defenses (DEP, Write-XOR-Execute, ASLR, stack canaries, and ultimately memory-safe languages) to contain, because no single mechanism sufficed. The implication is the same: prompt injection cannot be eliminated by better in-pipeline classification or alignment alone. It requires architectural separation of instruction and data channels. We identify the root cause and the class of solution it demands. 
[AI-51] Benchmarking Multi-Modal Graph-based Social Media Popularity Prediction
Link: https://arxiv.org/abs/2606.27539
Authors: Utkarsh Sahu, Zhisheng Qi, Li Zhu, Yizhao Yang, Jun Li, Ryan Rossi, Yu Wang
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Social media popularity prediction aims to forecast the future reach or influence of online content from early-stage observations. Accurate prediction enables key downstream applications, such as advertising optimization and strategic content planning by users, creators, and platforms. Despite substantial progress, existing popularity prediction works often fail to jointly consider multimodal content and temporal social interaction signals. Moreover, the literature remains highly fragmented across datasets, modalities, observation windows, prediction targets, and evaluation protocols. This fragmentation prevents fair comparison and obscures a systematic understanding of how textual, visual, temporal, and interaction-based signals jointly shape popularity dynamics. To address these challenges, we introduce MMG-Pop, a Multi-modal Graph-based Popularity Prediction benchmark, which unifies datasets, modalities, temporal interaction signals, and representative baselines under a standardized evaluation protocol. Furthermore, we propose MMG-PopNet, a unified multi-modal graph-based network that jointly models the aforementioned multi-modal signals and graph-structured social interactions. Extensive experiments on MMG-Pop, comprising four datasets across Bluesky and Reddit platforms, demonstrate the superior performance of MMG-PopNet and yield new insights into cross-platform training generalization, multi-task prediction benefits, multi-modality contributions, and LLM prediction limitation. These findings establish a unified foundation for future research on social dynamics modeling and intervention under heterogeneous modalities and socially-aware agentic ecosystem paradigms.
[AI-52] Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Link: https://arxiv.org/abs/2606.27483
Authors: Xuan Zhang, Zhijian Zhou, Lingfeng Qiao, Yulei Qin, Ke Li, Xing Sun, Xiaoyu Tan, Chao Qu, Yuan Qi
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model (LLM) agents have demonstrated strong capability in sequential decision-making, yet they remain fundamentally reactive in long-horizon tasks. Unlike humans who employ “what-if” reasoning to evaluate potential plans before commitment, standard agents lack an internal world model to simulate future outcomes. Therefore, we propose to internalize future-aware planning by training a single autoregressive model to verbalize both a prospective state rollout and a plan-conditioned success estimate, a textual analogue of the Q-value. Crucially, we identify a format-capability gap: simply fine-tuning agents on look-ahead traces during post-training leads to superficial mimicry of foresight without genuine predictive grounding. To bridge this gap, we introduce a three-stage training paradigm: (i) World Model Agentic Mid-Training (WM-AMT) to inject latent predictive capabilities into the policy; (ii) Format-Eliciting SFT (FE-SFT) to structure this injected capability; and (iii) Foresight-Conditioned Reinforcement Learning (FC-RL) to refine the calibration and utility of the generated simulations. Evaluated on search and mathematical reasoning tasks, our approach consistently outperforms other training baselines. Our results demonstrate that effective internal world modeling in LLM agents requires a capability-first training pipeline to achieve grounded and calibrated foresight.
[AI-53] Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks
Link: https://arxiv.org/abs/2606.27474
Authors: Aditi Gupta, Neel Mishra, Kushagra Trivedi, Pawan Kumar
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 7 pages + 2 pages References
Abstract:How should we evaluate generation systems that combine autoregressive (AR) and diffusion decoding? We study this question through Speculative Refinement (SpecRef), a training-free hybrid method that warm-starts a masked diffusion language model from an AR draft using entropy-guided selective masking. Evaluating SpecRef across six benchmarks (HumanEval, MBPP, GSM8K, BBH, ARC-Challenge, HellaSwag) with three distinct evaluation protocols (execution-based pass@1, exact-match, log-likelihood scoring), we surface several findings relevant beyond our specific system: (1) code benchmarks conflate structural discovery with logical correctness: providing a syntactic scaffold lifts accuracy from near zero to over 20% without changing the model, indicating that much of the baseline failure is structural; (2) a refinement tension phenomenon where multi-stage correction degrades already-correct tokens, exposing benchmark saturation ceilings invisible to single-model evaluation; (3) log-likelihood and generative evaluation produce different model rankings for the same model pair, suggesting they measure different capabilities; (4) standard Python post-processing silently breaks code evaluation for non-AR generators. These observations apply to any multi-stage or non-autoregressive generation pipeline and point toward more diagnostic evaluation practices.
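One plausible reading of "entropy-guided selective masking" is to keep the draft tokens the autoregressive model was confident about and mask the rest for the diffusion model to refill. The sketch below assumes a simple entropy threshold; the function names, the `<mask>` string, the threshold value, and the toy distributions are all hypothetical, and the paper's actual selection rule may differ.

```python
import math

MASK = "<mask>"  # placeholder mask token, an assumption for this sketch


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mask_uncertain_tokens(tokens, distributions, threshold):
    """Replace draft tokens whose predictive entropy exceeds the threshold."""
    return [
        MASK if token_entropy(dist) > threshold else tok
        for tok, dist in zip(tokens, distributions)
    ]


# Toy draft with the distribution the drafting model assigned at each position.
draft = ["def", "f", "(", "x"]
dists = [[1.0], [0.5, 0.5], [0.9, 0.1], [0.25, 0.25, 0.25, 0.25]]
warm_start = mask_uncertain_tokens(draft, dists, threshold=0.5)
```

Here the second and fourth positions have entropies of about 0.69 and 1.39 nats and are masked, while the confident positions are kept as the warm start.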
[AI-54] When Does Personality Composition Matter for Multi-Agent LLM Teams?
Link: https://arxiv.org/abs/2606.27443
Authors: Aryan Keluskar, Amrita Bhattacharjee, Huan Liu
Subjects: Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures
Abstract:Personality prompting shapes how large language models communicate, yet whether these behavioral shifts affect objective task outcomes remains under-explored. Prior work shows that agents prompted with low agreeableness produce adversarial language, while those prompted with high agreeableness become cooperative, but the relationship between communication style and task performance has not been systematically examined across multiple domains. In this work, we investigate whether personality composition matters for multi-agent team performance by manipulating personality traits across frontier LLMs on three task domains: structured coding, open-ended research collaboration, and competitive bargaining. We find that personality effects depend critically on task structure. In coding tasks, low agreeableness leads to large communication shifts that have little effect on milestone completion. In open-ended collaboration and bargaining, the same manipulation substantially degrades performance. We discuss implications for multi-agent system design and the limits of personality manipulation.
[AI-55] Towards Evaluation of Implicit Software World Models in Coding LLMs ICML2026
Link: https://arxiv.org/abs/2606.27406
Authors: Egor Bogomolov, Yaroslav Zharov
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to DL4Code workshop at ICML 2026
Abstract:Software engineering, whether performed by humans or by AI agents, requires reasoning about how software behaves. We call the internal model that supports such reasoning the software world model, and view current code-execution benchmarks as covering one well-studied slice of it – control flow. In this paper, we take a step toward a broader evaluation by shifting the observable axis to execution resources: alongside test outcome and exception class, we predict peak memory, wall-clock time, and ranked profiler outputs at method and line granularity. We use SWE-bench Verified as the source of data to hold the test close to real-world software engineering tasks. All tested models, frontier ones included, show modest performance and brittle behaviour, suggesting a notable lack of understanding of how software is executed, as opposed to how its source code is written.
[AI-56] Agentic Publication Protocol: An Attempt to Modernize Scientific Publication
Link: https://arxiv.org/abs/2606.27386
Authors: Sirui Lu, Xiao-Liang Qi
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures and 1 table
Abstract:Scientific publication is still organized primarily around static manuscripts, even though much of scientific progress depends on tacit know-how: how to run code, reproduce figures, interpret edge cases, choose useful follow-up directions, and avoid failed paths. Large language model agents create an opportunity to publish not only knowledge, but also operational know-how in a form that future readers and researchers can directly use. This paper outlines the Agentic Publication Protocol (APP), a lightweight repository format for packaging a paper together with code, data, environment information, reproducibility instructions, and an agent-facing instruction file. APP treats a version-controlled repository as the publication object and uses this http URL and optional skills to define a paper agent that can explain the work, reproduce key results when possible, and support follow-up research. We describe the design principles and details of the protocol, as well as the agent skills useful for publishing papers under the protocol. We also describe development tools for evaluating and improving the protocol and associated agent skills. Finally, we provide a broader discussion of the future of scientific research in the agent era.
[AI-57] AI-Model Network: Concept, Current State, and Future
Link: https://arxiv.org/abs/2606.27382
Authors: Li Zhetao, Zeng Xiyu, Wang Jianhui, Xiao Yong, Liu Zhongren, Wu Junru, Lai Junjie, Huang Jijun, Long Saiqin
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages, 14 figures
Abstract:While the primary function of computers lies in computation and processing, the core value of the Internet is rooted in sharing and collaboration. Computers create the Internet, and the Internet empowers the value of computers. The rapid development of the Internet, cloud computing, and big data is pushing artificial intelligence into the era of large models (LMs). However, the practical application of LMs is currently hindered by high training costs and deployment complexities, driving a shift toward lightweight, private, and domain-specific models. With the rapid proliferation and wide distribution of heterogeneous models, enabling effective interaction and collaboration among them has emerged as a critical bottleneck that urgently needs to be addressed in LM development. Drawing inspiration from the development of the Internet, this paper proposes the concept, vision, and system architecture of world wide AI-model network (AI-ModelNet). It is a novel paradigm that achieves interconnection, capability sharing, and collaborative reasoning by establishing pathways between models. We first briefly review the current state of single-model and multi-model research. Subsequently, the systemic vision and hierarchical architecture of AI-ModelNet are articulated, followed by validation of the framework’s feasibility through a prototype system and diverse application cases. Finally, key directions for future research are discussed preliminarily.
[AI-58] OverFlowLight: Real-Time Gridlock Prevention and Traffic Signal Optimization for Urban Intersections
Link: https://arxiv.org/abs/2606.27381
Authors: Mingyuan Li, Boyang Huang, Tianqi Jiang, Chenpu Li, Chunyu Liu, Yang Li, Ruimin Li, Qiang Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Queue overflow, a severe consequence of urban traffic congestion, occurs when vehicle queues exceed intersection capacity, obstructing upstream traffic and triggering cascading gridlocks. Prevailing traffic signal control (TSC) algorithms, primarily optimized for throughput, often fail to address overflow during peak hours, exacerbating congestion and creating safety hazards. We propose OverFlowLight, a real-time framework designed to preemptively resolve overflow and enhance overall TSC performance. It first introduces a mechanism to accurately detect overflow in real-time by leveraging multi-modal sensing from cameras and radars. Upon detection, it dynamically generates and inserts dedicated overflow phases into the signal cycle to clear the blocking queues. This is orchestrated by a hybrid control design that combines rapid rule-based overflow intervention with controller back ends such as reinforcement learning (RL) for longer-horizon efficiency. We conducted extensive real-world deployments of OverFlowLight across 43 intersections in three major cities. The framework demonstrates seamless integration with existing RL-based TSC agents, highlighting its modularity and practical applicability. Empirical results show that OverFlowLight reduces overflow incidents by 60.4% and increases network throughput by 18.2% compared to deployed baselines. Furthermore, it substantially diminishes the need for manual intervention common with expert-tuned signal plans. This work presents the first practical, scalable, and data-driven framework for actively preventing traffic gridlock, offering a crucial component for building resilient and efficient urban transportation systems. Our demonstration videos, codes and datasets are available at the anonymous URL, this https URL.
[AI-59] DataStates-LLM: Scalable Checkpointing for Transformer Models Using Composable State Providers
Link: https://arxiv.org/abs/2601.16956
Authors: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:
Abstract:The rapid growth of Large Transformer-based models, specifically Large Language Models (LLMs), now scaling to trillions of parameters, has necessitated training across thousands of GPUs using complex hybrid parallelism strategies (e.g., data, tensor, and pipeline parallelism). Checkpointing this massive, distributed state is critical for a wide range of use cases, such as resilience, suspend-resume, investigating undesirable training trajectories, and explaining model evolution. However, existing checkpointing solutions typically treat model state as opaque binary blobs, ignoring the "3D heterogeneity" of the underlying data structures, which vary by memory location (GPU vs. Host), number of "logical" objects sharded and split across multiple files, data types (tensors vs. Python objects), and their serialization requirements. This results in significant runtime overheads due to blocking device-to-host transfers, data-oblivious serialization, and storage I/O contention. In this paper, we introduce DataStates-LLM, a novel checkpointing architecture that leverages State Providers to decouple state abstraction from data movement. DataStates-LLM exploits the immutability of model parameters during the forward and backward passes to perform "lazy", non-blocking asynchronous snapshots. By introducing State Providers, we efficiently coalesce fragmented, heterogeneous shards and overlap the serialization of metadata with bulk tensor I/O. We evaluate DataStates-LLM on models up to 70B parameters on 256 A100-40GB GPUs. Our results demonstrate that DataStates-LLM achieves up to 4× higher checkpointing throughput and reduces end-to-end training time by up to 2.2× compared to state-of-the-art solutions, effectively mitigating the serialization and heterogeneity bottlenecks in extreme-scale LLM training.
[AI-60] Parameter-Efficient Quantum-Inspired Fast Weight Programmers for Traffic-Matrix Forecasting
Link: https://arxiv.org/abs/2606.27821
Authors: Kuo-Chung Peng, Jiun-Cheng Jiang, Chun-Hua Lin, Tai-Yue Li, Nan-Yow Chen, Samuel Yen-Chi Chen
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures
Abstract:Traffic matrices (TMs) capture network-wide origin-destination demand and are central to traffic engineering, yet accurate whole-matrix forecasting remains challenging when prediction must be performed under the memory, update, and training-budget constraints of online network control. This paper investigates whether compact quantum-inspired recurrent models can provide effective TM forecasts without relying on dedicated graph, transformer, or diffusion modules. We adapt gated quantum-inspired Kolmogorov-Arnold network fast-weight programmers (QKAN-FWPs) to direct multi-step Abilene TM forecasting, where each model predicts the next 20 five-minute frames of a 144-channel origin-destination (OD) matrix from a two-hour history. We benchmark three QKAN placement variants against a matched-size long short-term memory (LSTM) network, a larger LSTM, and a classical gated fast-weight programmer under a shared fixed-budget training protocol. Among the evaluated recurrent models, G-QKANFWP achieves the best pooled root-mean-square error (RMSE), while using only 22.4% of the larger LSTM. It also outperforms both the matched-size LSTM and the classical G-FWP baseline, indicating that the gain is not due to gated fast-weight framework alone. Convergence and channel-wise analyses further show that the quantum-inspired variants obtain lower validation-loss area under the learning curve (AULC) than matched-size recurrent baselines, while G-QKANFWP and GQKAN-FWP achieve substantially more OD-channel wins. These results identify a classical slow programmer with a quantum-inspired fast programmer as a promising accuracy-efficiency design for resource-conscious network traffic-matrix forecasting.
[AI-61] Reconstructing the Developmental Trajectory of Adipocytes in Human Adipose Tissue Using Single-Cell RNA Sequencing
Link: https://arxiv.org/abs/2606.27657
Authors: Weny S. M Sitinjak, Humasak Tommy Argo Simanjuntak
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 figures. The manuscript is currently under review at the International Journal on Electrical Engineering and Informatics
Abstract:Obesity is a global health crisis associated with metabolic disorders such as type 2 diabetes and cardiovascular disease. This study employed single-cell RNA sequencing to reconstruct the developmental trajectory of human adipocytes from adipose tissue samples. Our analysis identified 15 transcriptionally distinct cell clusters, including 7 transitional states, revealing the dynamic process of adipocyte differentiation. We detected 16 functionally active signaling pathways mediating cellular communication between adipocytes and their progenitors. Among these, insulin-like growth factor (IGF) and fibroblast growth factor (FGF) pathways emerged as the most prominent networks, showing consistent activity across differentiation stages (p < 0.05). The study revealed depot-specific differences, with visceral adipocytes undergoing additional extracellular matrix remodeling absent in subcutaneous differentiation. Spatial analysis further showed that IGF signaling was particularly active in perivascular niches, while FGF activity dominated in mature adipocyte zones. These results provide the first comprehensive map of human adipocyte development, highlighting IGF and FGF pathways as potential therapeutic targets. The identified signaling networks offer new insights for developing interventions to promote healthy adipose expansion or inhibit pathological fat accumulation. This work advances our fundamental understanding of adipose tissue biology while providing clinically relevant data for metabolic disorder treatments.
[AI-62] GRAFT: Biological Graph and Hypergraph Benchmarks for Linked Gene Expression and Phenotypic Trait Prediction in Arabidopsis thaliana
Link: https://arxiv.org/abs/2606.27413
Authors: Manuel Serna-Aguilera, Vanshika Jindal, Fiona L. Goggin, Jiamei Li, Aranyak Goswami, Alexander Bucksch, Suxing Liu, Khoa Luu
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2508.14934
Abstract:Understanding which genes control which traits in an organism remains one of the central challenges in biology. Despite significant advances in data collection technology, our ability to map genes to traits is still limited. This genome-to-phenome (G2P) challenge spans several problem domains, including plant breeding, and requires methods capable of reasoning over high-dimensional, heterogeneous, and biologically structured data. Current datasets and data repositories, however, are not well-equipped for this task. Current studies do not link gene expression and trait data, and most focus on very specific traits, limiting the breadth of possible correlations. To address this gap, we present the novel Gene-Graph Regression for Arabidopsis Functional Traits (GRAFT) dataset, a curated multi-modal dataset linking gene expression profiles with phenotypic trait measurements in Arabidopsis thaliana, a model organism in plant biology. GRAFT supports tasks such as phenotype prediction and interpretable graph learning. In addition, we benchmark conventional regression and explanatory baselines, including a biologically-informed hypergraph baseline, to validate gene-trait associations. To the best of our knowledge, this is the first dataset to provide multimodal gene information and heterogeneous trait or phenotype data for the same Arabidopsis thaliana specimens. With GRAFT, we aim to foster research to accurately understand the relationship between genotypes and phenotypes using gene information, higher-order gene pairings, and trait data from multiple sources.
[AI-63] Compression-Driven Anomaly Detection in Brain MRI Using an Interpretable Quantum Autoencoder
Link: https://arxiv.org/abs/2606.27411
Authors: Santanu Ganguly, Xing Liang, Dimitrios Makris
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:
Abstract:We study a quantum autoencoder (QAE) for compression-driven anomaly detection in brain MRI data. The approach leverages angle encoding to map image patches into quantum states, followed by a variational encoder-decoder architecture trained to discard information via auxiliary trash qubits. Anomaly scores reflect the degree to which inputs resist compression relative to normal data, with higher scores corresponding to deviations from the learned normal manifold. Evaluated on publicly available brain MRI DICOM datasets, the method achieves a slice-level ROC-AUC of approximately 0.95 and a patch-level ROC-AUC of approximately 0.813, outperforming classical autoencoder and PCA baselines. Analysis of the learned parameters reveals a pronounced encoder-decoder asymmetry, where effective anomaly detection arises from structured information compression within the encoder rather than increased parameter magnitude or decoder expressivity. This results in a controlled compression-reconstruction trade-off with a clear operating regime that supports principled threshold selection. Qualitative evaluation further shows that the QAE produces spatially localized anomaly heatmaps aligned with tumorous regions. The results, supported by promising baseline performances, demonstrate that quantum autoencoders provide an interpretable and controllable mechanism for anomaly detection based on incompressibility with respect to a learned latent representation. This work highlights the potential of quantum autoencoders as a principled tool for studying compression dynamics in quantum machine learning, with promising implications for decision support in medical imaging workflows.
[AI-64] Automated brain tumor detection in MRI images using CNN and ResNet architectures
Link: https://arxiv.org/abs/2606.27405
Authors: Annapurna V K, Asha N, K Paramesha, Shabana Sultana, Kirankumar Humse
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning has shown significant potential in medical image analysis, particularly for disease detection using MRI scans. Accurate and early diagnosis of brain tumors remains challenging due to the complexity of brain structures and reliance on manual interpretation. This work presents an automated deep learning-based approach for brain tumor detection from MRI images using Convolutional Neural Networks and Residual Networks. Transfer learning is applied with two pretrained architectures, ResNet18 and ResNet50, to classify MRI scans into tumor and non-tumor categories. Experiments are conducted on a dataset of 3,929 brain MRI images, evaluating the impact of model depth and fine-tuning strategies. The results show that ResNet18 achieves a higher accuracy of 97% compared to 96% for ResNet50, demonstrating better generalization on limited medical data. The proposed framework enables fast, accurate, and cost-effective brain tumor detection, supporting early diagnosis and clinical decision-making.
Machine Learning
[LG-0] VGB for Masked Diffusion Model: Efficient Test-time Scaling for Reward Satisfaction and Sample Editing
Link: https://arxiv.org/abs/2606.28301
Authors: Kijung Jeon, Thuy-Duong Vuong, Molei Tao
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
Comments: 72 pages
Abstract:Inference-time scaling is a promising paradigm to improve generative models, especially when outputs must satisfy structural constraints or optimize downstream rewards. We consider the Masked Diffusion Model (MDM) and introduce MDM-VGB, a discrete diffusion sampler that augments unmasking generation with theoretically principled reward-guided remasking. Inspired by the recent success of the classical Jerrum-Sinclair backtracking Markov chain in reward-tilted generation, MDM-VGB extends the backtracking random walk from a fixed prefix tree to a masked-state graph, allowing tokens to be unmasked and remasked at arbitrary positions. The resulting sampler favors unmasking and remasking moves that lead to higher-value partial configurations, enabling both effective high-reward generation and efficient repair of low-reward samples. We prove that MDM-VGB is robust to process-verifier noise and achieves quadratic complexity, while popular test-time heuristics such as best-of-N can incur exponential complexity due to error accumulation. Our theoretical findings are corroborated by strong empirical performance, particularly on popular constraint-satisfaction and scientific benchmarks such as Sudoku and QM9.
[LG-1] PAC-Bayesian Certificates for Quadratic Closed-Loop Control
Link: https://arxiv.org/abs/2606.28281
Authors: Domagoj Herceg
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:
Abstract:PAC-Bayesian bounds provide finite-sample guarantees for data-dependent randomized predictors, but applying them to learning-based control is difficult because the natural objective is a quadratic trajectory cost. Such losses are unbounded, non-Lipschitz, and lead to response-dependent Chernoff terms. We employ the System Level Synthesis (SLS) parameterization, which exposes the closed-loop trajectory map of a linear system directly and makes the quadratic control loss amenable to explicit certification. Moreover, we provide a set of PAC-Bayes-Chernoff certificates for posterior distributions over feasible closed-loop responses. For Gaussian disturbance trajectories with arbitrary covariance, we derive an exact one-sided Gaussian transform and a tractable quadratic upper bound expressed through closed-loop sensitivity quantities. We also derive a posterior-localized surrogate for settings where pointwise closed-loop response certificates are unavailable or have support-related admissibility issues. Although PAC-Bayes certifies a non-degenerate posterior, the convex quadratic form of the SLS loss transfers the certificate to the posterior mean response. We present a deterministic mean-response deployment result that is particularly suitable for control while retaining the stochastic posterior in the bound. Additionally, we provide a data-driven bound for this deployment, transitioning away from an oracle bound. Minimizing this bound naturally results in a learning algorithm for control selection from data. Numerical experiments on a double integrator show that the algorithm acts as a sensitivity-aware finite-sample regularizer, improving held-out cost and reducing closed-loop sensitivity in the low-data regime.
[LG-2] Disentangling Continuous-Time Latent Dynamics: Identifiability of Latent SDEs via Diffusion Shifts
Link: https://arxiv.org/abs/2606.28228
Authors: Yuanyuan Wang, Wenjie Wang, Haoxuan Li, Mingming Gong, Kun Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Causal representation learning for time series has developed strong identifiability results in discrete-time latent causal models, but identifiability in continuous-time latent stochastic differential equation (SDE) models remains largely open. We address this gap using environment-induced shifts in diffusion covariance. We study additive-noise latent SDEs observed through an unknown nonlinear diffeomorphism, with shared drift but environment-specific diffusion covariance. We show that two diagonal diffusion regimes with pairwise distinct coordinate-wise variance ratios identify the latent coordinates up to permutation and scaling, without any sparsity assumption on the drift. We first prove this result for linear Ornstein–Uhlenbeck systems and then extend it to general additive-noise latent SDEs. Under mild smoothness, the instantaneous drift-Jacobian causal graph is identifiable up to the same permutation. We propose a two-stage estimator for latent disentanglement and optional graph recovery; experiments on synthetic systems confirm the predicted identifiability boundary, and an application to Hardanger Bridge monitoring data illustrates the approach on real sensor trajectories.
[LG-3] Physics-Informed Neural Network with Transfer Learning for State Estimation in Lithium-Ion Batteries using the Single Particle Model with Electrolyte
Link: https://arxiv.org/abs/2606.28220
Authors: Gift Modekwe, Qiugang Lu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Physics-informed neural networks (PINNs) have emerged as a powerful tool for solving nonlinear partial differential equations (PDEs), including battery electrochemical models. They typically enforce conservation laws within the loss function to ensure physically consistent solutions. Traditional numerical methods, such as finite difference, finite volume, and finite element techniques, rely on discretization and can be computationally expensive for nonlinear systems. To address this challenge, PINNs offer improved scalability, particularly for reduced-order models like the single particle model with electrolyte (SPMe). The SPMe describes lithium-ion battery dynamics through coupled diffusion, transport, reaction kinetics, and voltage equations. Despite these advantages, training SPMe-based PINNs from scratch for different battery chemistries or operating conditions is demanding and often leads to slow convergence. To overcome this limitation, this work introduces a transfer learning framework for SPMe-PINNs. The model is first pretrained to learn general electrochemical dynamics and then adapted to a target battery by transferring weights, freezing selected layers, and fine-tuning the remaining parameters, including estimating key electrochemical variables. Validation using PyBaMM demonstrates accurate voltage prediction, indicating that the proposed approach preserves electrochemical consistency while reducing training time and enabling efficient generalization across batteries.
[LG-4] Non-Linear Strategic Classification Made Practical
Link: https://arxiv.org/abs/2606.28204
Authors: Jack Geary, Boyan Gao, Henry Gouk
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 15 pages, 4 figures, 2 tables
Abstract:Algorithmic developments in Strategic Classification have been mostly limited to linear classifiers in settings where the best response has a closed-form solution or can be easily approximated. While some work has explored the role of non-linear classifiers in strategic settings, progress in this direction is impeded by the computational intractability of the strategic behaviour. Addressing this, we present a novel method for approximating the best response by exploiting Lagrangian duality. By reformulating the strategic response as a constrained optimisation problem, we can construct a Lagrangian that is amenable to first-order optimisation methods. This approach reproduces closed-form strategic behaviour in linear settings and can be straightforwardly applied to non-linear settings. We show how the Implicit Function Theorem can be used in conjunction with our proposed response formulation during classifier learning to compute the total gradient of the loss. This connects the classifier parameters directly to the consequent strategic behaviour, yielding a novel training algorithm that can exploit this relationship. Experimental evaluation shows that the resulting models achieve improved strategic accuracy on common machine learning datasets.
[LG-5] COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives
Link: https://arxiv.org/abs/2606.28194
Authors: David Steinmann, Antonia Wüst, Kristian Kersting, Wolfgang Stammer
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:While interpretable models such as concept bottleneck models (CBMs) and program synthesis methods enable verification of model decisions, their evaluation is typically limited to simple tasks, leaving complex reasoning on real-world images largely unexplored. We introduce COCOLogic-V2, an object-centric dataset for visual inductive reasoning on real-world images covering a broad subset of first-order logic. By categorizing samples into positive variants, near-boundary (NB), and far-from-boundary (FB) negatives, COCOLogic-V2 enables fine-grained diagnosis of model accountability. Our evaluations show that models tend to separate positive and FB samples well but fail on NB samples, while perceptual noise and large rule-induced search spaces pose additional challenges in few-shot settings. Together, these results highlight that visual inductive reasoning remains an open challenge and COCOLogic-V2 provides a concrete foundation for advancing methods in this direction.
[LG-6] Recovering Sharp Conductivity Features in the Finite-Data Calderón Problem with Physics-Informed Neural Networks
Link: https://arxiv.org/abs/2606.28158
Authors: Ali AlHadi Kalout, Pablo Tejerina-Pérez, Konstantin Karchev, Pedro Tarancón-Álvarez, Leonid Sarieddine, Raul Jimenez, Max Engelstein, Guy David
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Numerical Analysis (math.NA)
Comments: 41 pages, 10 figures
Abstract:Physics-informed neural networks (PINNs) have recently emerged as a promising framework for addressing the Calderón inverse problem from limited boundary data. In this work, we revisit neural Calderón inversion by introducing multiscale boundary excitations based on randomized wavelet functions and investigating the role of Fourier-feature encoding (FFE) for representing sharp conductivity variations. We propose a physics-informed reconstruction framework that represents the unknown conductivity and the associated family of electric potentials with separate neural networks conditioned on the applied boundary excitations. The governing elliptic PDE is enforced through physics-informed residuals, while finite Dirichlet-to-Neumann (DtN) data are incorporated through boundary losses. Using synthetic data from a finite-difference forward solver, we evaluate the method on conductivity fields with inclusions, sharp interfaces, smooth profiles, and heterogeneous media. Results show that the framework recovers dominant conductivity structures from finite boundary measurements with relative errors of approximately 3% to 12%. We show that FFE improves the reconstruction of localized sharp features, particularly for inclusions and interfaces, but is not universally optimal, with raw-coordinate networks performing competitively for smoother fields. These results highlight coordinate representations and boundary excitation design as key factors in neural Calderón inversion.
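The Fourier-feature encoding (FFE) that this abstract contrasts with raw coordinates can be illustrated with a minimal sketch. It assumes the commonly used form gamma(x) = [sin(2*pi*b.x), cos(2*pi*b.x)] with randomly drawn Gaussian frequencies b; the function names, the bandwidth `sigma`, and the number of frequencies are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def sample_frequencies(num_freqs, dim, sigma, seed=0):
    """Draw a random Gaussian frequency matrix; sigma is an assumed bandwidth hyperparameter."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(num_freqs)]


def fourier_features(x, freqs):
    """Map a coordinate vector x to a sine and a cosine feature per frequency row."""
    feats = []
    for b in freqs:
        phase = 2.0 * math.pi * sum(bi * xi for bi, xi in zip(b, x))
        feats.append(math.sin(phase))
        feats.append(math.cos(phase))
    return feats


freqs = sample_frequencies(num_freqs=4, dim=2, sigma=3.0)
encoded = fourier_features([0.25, 0.75], freqs)
print(len(encoded))  # 8 features: one sine and one cosine per frequency
```

In encodings of this kind, a larger `sigma` admits higher spatial frequencies, which is the usual motivation for using them on sharp features; the abstract's observation that raw coordinates remain competitive for smooth fields is consistent with that trade-off.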
[LG-7] Regularized Reward-Punishment Reinforcement Learning
Link: https://arxiv.org/abs/2606.28152
Authors: Jiexin Wang, Eiji Uchibe
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:We propose KL-Coupled Policy Regularization (KCPR), a policy coordination framework for Reward-Punishment Reinforcement Learning (RPRL). Based on KCPR, we derive KL-Coupled Soft Optimality (KCSO) and develop its deep realization, klDMP. Unlike existing RPRL approaches that optimize reward-seeking and punishment-related policies largely independently, KCPR enables direct interactions between companion policies by treating each as a dynamically learned prior for the other. KCSO yields coupled soft-optimal policies and KL-regularized Bellman operators, allowing reward and punishment information to jointly influence value propagation. To improve learning stability, we introduce a companion-prior softening mechanism and evaluate separate replay-buffer designs for balancing reward- and punishment-related experience. Experiments in grid-world and Gazebo robotic navigation tasks demonstrate that klDMP improves safety and learning stability while maintaining competitive task performance compared with DQN, SQL and softDMP. These results suggest that policy-level coordination provides an effective mechanism for integrating multiple behavioral objectives and may serve as a useful design principle for reinforcement learning systems with interacting motivational processes.
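The idea of treating one policy as a prior for another builds on a standard result for KL-regularized objectives: maximizing E_pi[Q] - tau * KL(pi || prior) gives pi(a) proportional to prior(a) * exp(Q(a) / tau). The sketch below shows only this generic single-state update. The coupled operators of KCSO are not specified in the abstract, so the function name, the temperature `tau`, and the example numbers are assumptions for illustration.

```python
import math


def kl_regularized_policy(q_values, prior, tau):
    """Return the maximizer of E_pi[Q] - tau * KL(pi || prior) and its soft value.

    Assumes every prior probability is strictly positive.
    """
    logits = [math.log(p) + q / tau for q, p in zip(q_values, prior)]
    m = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    policy = [w / total for w in weights]
    soft_value = tau * (m + math.log(total))
    return policy, soft_value


# Hypothetical example: reward-side Q-values, with a companion policy used as the prior.
reward_q = [1.0, 0.5, 0.0]
companion_prior = [0.1, 0.6, 0.3]
policy, value = kl_regularized_policy(reward_q, companion_prior, tau=0.5)
print([round(p, 3) for p in policy], round(value, 3))
```

With a uniform prior this reduces to an ordinary softmax over Q-values; a non-uniform prior shifts probability mass toward the actions the companion policy prefers.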
[LG-8] Autoencoder Architectures for Athlete Performance Scoring from Wearable Telemetry
Link: https://arxiv.org/abs/2606.28145
Authors: Mateusz Kubita, Jan Zubalewicz, Krzysztof Siwek
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, submitted to SPA 2026 Conference
Abstract:Wearable devices produce large, high-dimensional training logs for everyday runners, and interpretation rather than data collection is now the limiting step. This paper evaluates five dimensionality reduction models, three autoencoder variants, PCA, and a Variational Autoencoder, on their ability to compress nine-sensor runner profiles into a single scalar performance indicator, the latent score. Because the setting is fully unsupervised, model quality is assessed along two complementary axes: reconstruction error (Mean Squared Error) and latent score interpretability, measured via Spearman and Kendall rank correlations, Mutual Information, and Permutation Importance. These are combined into a composite selection criterion that prevents selecting models on reconstruction accuracy alone. Feature rankings from the four metrics are aggregated via a modified Borda count, and their stability is confirmed by bootstrap validation. A two-feature linear baseline is included to anchor the comparison. The deep autoencoder achieved the lowest reconstruction error and the highest composite score. Once the PCA hidden layers were widened, the deeper variants became closely competitive with Deep AE on the composite criterion, indicating that the limiting factor was hidden-layer capacity rather than the one-dimensional bottleneck. Running pace, aerobic decoupling, and average heart rate emerged as the dominant latent score drivers across all models and resampling runs, consistent with established physiology.
[LG-9] MixTTA: Low-Rank Cross-Channel Mixing for Reliable Test-Time Adaptation ECCV2026
Link: https://arxiv.org/abs/2606.28142
Authors: Mansoo Jung, Youngwook Kim, Jungwoo Lee
Subjects: Machine Learning (cs.LG)
Comments: To be published in the 19th European Conference on Computer Vision – ECCV 2026
Abstract:Test-Time Adaptation (TTA) methods commonly update the affine parameters of normalization layers to adapt deployed models under distribution shifts. However, per-channel affine parameters perform axis-aligned scaling and shifting, making them geometrically incapable of correcting cross-channel structural changes induced by distribution shift. To address this limitation, we propose MixTTA, a lightweight plug-in module that equips normalization layers with a low-rank cross-channel transformation, enabling inter-channel mixing at each layer. To ensure that the low-rank branch captures only cross-channel interactions, we also propose Decoupling Projection that enforces strict separation from the diagonal affine path, along with Spectral Projection that prevents rank-1 collapse under non-stationary test streams. MixTTA can be seamlessly integrated into any existing normalization-based TTA method. Experiments in both standard and wild TTA settings show consistent improvements over strong baselines while mitigating adaptation failure under challenging conditions. The source code is publicly available at this https URL.
[LG-10] Dangerous Liaisons of Convex Learning and Non-Affine Aggregation
Link: https://arxiv.org/abs/2606.28123
Authors: Thomas Boudou, Batiste Le Bars, Nirupam Gupta, Aurélien Bellet
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:Last-iterate convergence and generalization guarantees in first-order convex learning hinge on the monotonicity of the update operator. While linear averaging preserves the monotonicity of gradient updates, this property is often violated when gradients are aggregated non-affinely, as in modern pipelines enforcing constraints like adaptivity, privacy, robustness or fairness. Whether it is possible to design non-affine aggregation rules that maintain monotonicity has remained an open question. We answer this question negatively: we prove that the monotonicity of aggregated gradients is preserved if and only if the aggregation rule is positively affine. Consequently, non-affine aggregation prevents steady convergence and substantially degrades algorithmic stability. We quantify these drawbacks and propose a path forward by identifying sufficient conditions under which monotonicity can be restored. Our results provide a unified theoretical framework explaining the disparate failure modes observed in modern learning systems.
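For reference, the monotonicity property this abstract refers to, and the easy direction of its claim, can be written out as follows. The weights w_i and offset b are notation introduced here, and reading "positively affine" as nonnegative weights plus a constant offset is an assumption; the converse ("only if") direction is the paper's result and is not reproduced.

```latex
% Gradients of differentiable convex functions f_i are monotone:
\langle \nabla f_i(x) - \nabla f_i(y),\; x - y \rangle \ge 0
\quad \text{for all } x, y .

% An aggregate A(g_1, \dots, g_n) = \sum_i w_i g_i + b with w_i \ge 0
% inherits monotonicity, since the offset b cancels in the difference:
\Big\langle \sum_i w_i \big( \nabla f_i(x) - \nabla f_i(y) \big),\; x - y \Big\rangle
= \sum_i w_i \, \langle \nabla f_i(x) - \nabla f_i(y),\; x - y \rangle \;\ge\; 0 .
```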
[LG-11] When One Adapter Speaks for Many: Discovering Low-Rank Redundancy in Continual Fine-Tuning ICML2026
Link: https://arxiv.org/abs/2606.28117
Authors: Tanguy Dieudonné, Giulia Lanzillotta, Enis Simsar, Louis Barinka, Thomas Hofmann
Subjects: Machine Learning (cs.LG)
Comments: ColorAI @ ICML 2026
Abstract:Low-Rank Adaptation (LoRA) has become the standard tool for parameter-efficient fine-tuning of large pretrained models. When applied sequentially across tasks in Continual Learning (CL), the standard assumption is that each new task requires a dedicated low-rank adapter. In this work, we challenge this assumption empirically and structurally. We show that task-specific LoRA adapters in CL exhibit significant low-rank redundancy: the subspaces spanned by adapters trained on different tasks substantially overlap, and in many cases earlier adapters can faithfully represent later tasks. Building on this observation, we propose LiteLoRA, a plug-and-play gating mechanism that learns at train time whether to recruit a new adapter or reuse existing low-rank representations. Our method reduces the number of active adapters by 20-70% while matching or exceeding state-of-the-art performance on standard CL benchmarks, revealing that structural redundancy is pervasive and that selective learning is sufficient to achieve stability without sacrificing plasticity.
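One generic way to quantify how much two low-rank subspaces overlap is the mean squared cosine of their principal angles. The sketch below is an illustration of that measure, not the paper's metric: it treats each adapter as a list of vectors spanning its subspace (for example, the columns of B in a LoRA update BA), and the function names and toy vectors are assumptions.

```python
import math


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip directions already contained in the span
            basis.append([wi / norm for wi in w])
    return basis


def subspace_overlap(vectors_a, vectors_b):
    """Mean squared cosine of the principal angles between two subspaces, in [0, 1].

    Assumes both inputs span at least one direction.
    """
    qa = orthonormal_basis(vectors_a)
    qb = orthonormal_basis(vectors_b)
    # The squared Frobenius norm of Qa^T Qb is the sum of squared principal-angle cosines.
    frob_sq = sum(sum(x * y for x, y in zip(a, b)) ** 2 for a in qa for b in qb)
    return frob_sq / min(len(qa), len(qb))


adapter_1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
adapter_2 = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]  # same plane, different basis
adapter_3 = [[0.0, 0.0, 1.0]]                    # orthogonal direction
print(subspace_overlap(adapter_1, adapter_2), subspace_overlap(adapter_1, adapter_3))
```

A value near 1 means one adapter's subspace is almost contained in the other's, which is the situation where reusing an existing adapter is plausible; a value near 0 means the adapters occupy nearly orthogonal directions.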
[LG-12] Fair Classification with Efficient and Post-hoc Controllable Fairness-Accuracy Trade-off
Link: https://arxiv.org/abs/2606.28097
Authors: Maaya Sakata, Kazuto Fukuchi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Post-hoc controllability of fair machine learning models, the ability to control the trade-off between fairness and accuracy after training, is valuable for practical deployment. Existing post-processing methods provide such post-hoc controllability but often suffer from significant accuracy degradation, whereas in-processing methods achieve efficient trade-offs but require computationally expensive retraining for each change in trade-off ratio. To achieve both post-hoc controllability and efficient trade-offs, we propose a novel fair classification algorithm that learns effective feature representations to improve the trade-off efficiency of post-processing fair classifiers, by a gradient-based optimization approach. Experimental results on real-world datasets demonstrate that our method achieves trade-off efficiency comparable to, or even surpassing, in-processing methods, without requiring any retraining.
[LG-13] From Detection to Action: Using LLM Agents for Fault-Tolerant Control
Link: https://arxiv.org/abs/2606.28011
Authors: Javal Vyas, Milapji Singh Gill, Artan Markaj, Felix Gehlhoff, Mehmet Mercangöz
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:
Abstract:We propose an agentic Large Language Model (LLM) framework for active Fault-Tolerant Control (FTC) that transforms fault detection outputs into constraint-aware recovery actions grounded in plant-specific knowledge. The approach couples (i) a multi-agent workflow that decomposes operator duties into monitoring, planning, action synthesis, simulation, validation, and reprompting; (ii) a Digital Process Plant Twin (DPPT) that exposes plant data, models, and a simulation service for pre-execution testing; and (iii) a Graph Retrieval-Augmented Generation (Graph RAG) layer built on the CPSMod ontology, which organizes plant knowledge (structure, function, hybrid dynamics, control context, and fault semantics) into a graph that supports relation-aware, multi-hop retrieval for the agents. Corrective actions are generated as minimal-risk state-machine recovery paths and corresponding discrete commands or continuous setpoint adaptations, then validated deterministically against interlocks, envelopes, and dynamic feasibility before any actuation. If no acceptable plan is found within a bounded time window, control is handed to a safety fallback. The framework is evaluated in simulation on two representative benchmarks: a discrete batch Mixing Module and a Continuous Stirred-Tank Reactor (CSTR) under closed-loop PID regulation. Results with lightweight LLMs (GPT-4o-mini and GPT-4.1-mini) show that semantically grounded agents can derive valid recovery decisions within latency budgets compatible with the respective process dynamics, demonstrating a practical pathway from detection to validated corrective action across both discrete and continuous FTC tasks.
[LG-14] Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings KDD2026
Link: https://arxiv.org/abs/2606.27997
Authors: Rostislav Gusev, Alexey Zaytsev
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
Abstract:Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets typically relies on heuristics and is rarely analyzed for the robustness of the resulting model rankings. We introduce a framework for selecting dataset subsets and for evaluating how well different selection strategies preserve the global model rankings. Our framework includes bootstrap aggregation, which provides valid confidence intervals, allowing a principled comparison of selection strategies. We consider clustering, design criteria (A/D-optimality), random baselines, and greedy farthest-first (FAFI). For the latter, we derive upper bounds on selection quality in terms of ranking errors as a function of the number of selected datasets. Empirically, in time series classification (TSC, 112 datasets) and in a supplementary natural language processing benchmark derived from MTEB (57 tasks), several selection strategies improve rank preservation compared with random subsets, including simple FAFI. In contrast, in recommender systems (30 datasets), the improvement of strategies over random selection is small and typically statistically insignificant. For TSC, our best-performing strategy achieves a Spearman correlation of 0.95 with the full benchmark model rankings using only five selected datasets. Additional experiments indicate that the effectiveness of selection approaches depends on both the quality of dataset representations and the scale of the benchmarking regime.
DOI: https://doi.org/10.48550/arXiv.2606.27997 (arXiv-issued, pending registration); Related DOI: https://doi.org/10.1145/3770855.3817569
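The greedy farthest-first (FAFI) strategy named in this abstract can be sketched as a standard farthest-first traversal over dataset representation vectors. The Euclidean distance, the fixed starting index, and the toy two-dimensional embeddings below are assumptions made for illustration; the abstract does not specify how datasets are represented.

```python
import math


def farthest_first(points, k, start=0):
    """Greedy farthest-first traversal: pick k indices, each farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        candidates = [i for i in range(len(points)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Hypothetical dataset embeddings: two tight pairs and one outlier.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2], [10.0, 0.0]]
print(farthest_first(embeddings, 3))  # [0, 4, 3]
```

Each step adds the dataset least well covered by the current selection, so near-duplicate datasets (such as indices 0 and 1 here) are not both chosen while more distant ones remain.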
[LG-15] Dual-Learning based Penalized Multi-Align Clustering for Multi-View Incomplete and Disorderly Data
Link: https://arxiv.org/abs/2606.27984
Authors: Liang Zhao, Shubin Ma, Bo Xu, Qingchen Zhang
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 7 figures
Abstract:Multimodal feature fusion can effectively capture complex patterns in real-world data by integrating complementary information from different modalities. However, in many applications, such as boiler combustion monitoring, equipment failure, inconsistent sensor sampling frequencies, and network delays often cause missing modalities and temporal asynchrony. These issues lead to incomplete and disorderly multimodal data. To address them, previous studies have proposed several data fusion methods that align cluster centers before fusion. However, these methods have two key limitations. First, they cannot guarantee accurate sample-level alignment of data pairs. Second, they do not address significant discrepancies in data sizes across different classes, which may affect subsequent fusion performance. To address these problems, we propose a dual-learning based penalized multi-align clustering model, named DLPMAC. The dual-learning mechanism enables the model to learn prior knowledge from each modality, including semantic and structural information. This helps preserve semantic consistency and structural similarity across modalities at both local and global levels. In addition, the penalized multi-align module performs multi-to-multi data alignment through a penalty mechanism. It allows one sample to form data pairs with different samples from other modalities, thereby improving data-pair alignment accuracy. The penalty mechanism also prevents data aggregation, avoiding the case where excessive samples are linked to a single sample. Experimental results demonstrate the effectiveness of DLPMAC in addressing data alignment and fusion challenges from both sampling and clustering perspectives. 
MSC classes: 62H30
Cite as: arXiv:2606.27984 [cs.LG] (or arXiv:2606.27984v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.27984 (arXiv-issued DOI via DataCite, pending registration)
[LG-16] RECAST: Model Reconstruction via Counterfactual-Aware Wasserstein Geometry under Limited Data ICML2026
Link: https://arxiv.org/abs/2606.27948
Authors: Xuan Zhao, Lena Krieger, Zhuo Cao, Arya Bangun, Hanno Scharr, Ira Assent
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Counterfactual explanations (CFs) help understand machine learning models by identifying minimal input changes that would lead to alternative model outcomes. Recent work demonstrates their utility for reconstructing black-box models, enabling third-party auditing of opaque decision systems for fairness and accountability. Still, CF-based reconstruction may suffer from decision boundary shifts, overfitting, and restrictive assumptions requiring online query access to target platforms. We propose REconstruction via Counterfactual-Aware waSserstein opTimization (RECAST) under limited data and restricted access, a behavioral surrogate model based on Wasserstein barycentric prototypes. Our approach addresses decision boundary shifts by incorporating CFs as informative, though less representative, samples for both classes, maintaining high surrogate fidelity in low-sample regimes without requiring online access during reconstruction. To enhance fairness auditing, our method enables systematic group fairness diagnostics. Experiments on real-world datasets and various setups show that RECAST effectively achieves high fidelity and query efficiency, as well as stable results even when the access is limited and noisy.
[LG-17] Graph Dimensionality Reduction for Contextual Bandits: Structure-Specific Regret Bounds under Approximate Smoothness and Noisy Eigenspaces
Link: https://arxiv.org/abs/2606.27917
Authors: Joyanta Jyoti Mondal, Ibne Farabi Shihab, Anuj Sharma
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 4 figures
Abstract:Contextual bandits with graph-structured arms arise in recommendation, citation retrieval, and social advertising, where arms connected on a graph tend to share reward signal. Standard dimensionality reduction ignores this structure, inflating exploration cost by a factor of $d/k$. We propose GraphDR-LinUCB, which projects arm features onto the graph’s low-frequency spectral subspace and runs linear UCB in the resulting $k$-dimensional space. We prove the first $\widetilde{O}(k\sqrt{T})$ regret bound for spectral-projection-based contextual bandits, reducing dimension dependence from $d$ to $k$; a perturbation argument extends this to noisy graphs, with an explicit penalty for reward-smoothness mismatch and graph-estimation error. Our central theoretical finding is that the high-frequency reward component need not incur a worst-case linear-in-$T$ penalty: its actual cost depends on its realized impact along the played path, not on its total energy. A simple spectral comparison between subspaces ($\Gamma_k$) predicts which reducer wins on a given dataset, correctly calling five of six real-dataset outcomes without any fitted threshold. Across a synthetic benchmark and six real datasets (MovieLens, Amazon, LastFM, ogbn-arxiv, MIND), GraphDR-LinUCB reduces cumulative regret by $15\times$ over full-dimensional LinUCB and outperforms competing graph-aware methods on five of six; the single failure is precisely where the graph’s spectral subspace is misaligned with the reward.
[LG-18] TA-SparseMG: Trend-Aware Sparse Forecasting via Multi-Scale Gating for Long-Term Time Series
Link: https://arxiv.org/abs/2606.27908
Authors: Wenchao Liu, Hongbing Wang, Youji Zhu, Xiaodong Liu, Xiangguang Xiong
Subjects: Machine Learning (cs.LG)
Abstract:Long-term time series forecasting finds extensive applications in domains such as power demand, traffic flow, meteorological observation, and renewable energy dispatch. Forecasting dynamically varying long-term time series poses inherent challenges, including statistical nonstationarity, local high-frequency disturbances, and coupled cross-period dependencies, which make it difficult for lightweight models to balance parameter efficiency and forecasting performance. To address this issue, this study presents TA-SparseMG, a lightweight cross-period forecasting model built on SparseTSF’s sparse cross-period modeling framework. It incorporates three key modules: a trend-aware reversible instance normalization module, a scale-adaptive gated denoising module, and a multiscale gated-attention MLP forecasting module. The trend-aware normalization module captures input-window statistics and calibrates forecast-window distributions, effectively mitigating distribution shift. The scale-adaptive gated denoising module performs feature smoothing and residual suppression before period rearrangement, thereby reducing interference from high-frequency perturbations. The multiscale gated attention prediction module strengthens the prediction head’s adaptive representational capacity via conditional gating and feature modulation. Extensive experiments across multiple LTSF benchmarks demonstrate that the proposed TA-SparseMG consistently achieves superior, stable performance. Ablation studies confirm that each module independently improves distribution adaptation, input robustness, and cross-period feature mapping capability.
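The normalization idea in this abstract can be illustrated with a minimal sketch of plain reversible instance normalization: statistics are taken from the input window, the model works in normalized space, and the forecast is mapped back. This is not the paper's trend-aware variant, whose details the abstract does not give; `naive_forecaster` is a hypothetical stand-in for the forecasting head.

```python
from statistics import fmean, pstdev

def normalize_window(window, eps=1e-5):
    """Standardize an input window and return the statistics needed to undo it."""
    mean = fmean(window)
    std = pstdev(window) + eps
    return [(x - mean) / std for x in window], (mean, std)

def denormalize_forecast(forecast, stats):
    """Map a forecast made in normalized space back to the original scale."""
    mean, std = stats
    return [y * std + mean for y in forecast]

def naive_forecaster(normalized_window, horizon):
    # Hypothetical stand-in for the forecasting head: repeat the last value.
    return [normalized_window[-1]] * horizon

history = [10.0, 12.0, 14.0, 16.0]
normalized, stats = normalize_window(history)
forecast = denormalize_forecast(naive_forecaster(normalized, horizon=2), stats)
```

Because the same statistics are used in both directions, a level shift in the input window carries over to the forecast without the model having to learn it.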
[LG-19] A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset
Link: https://arxiv.org/abs/2606.27886
Authors: Ahmed Mohamady, Robin Burchard, Kristof Van Laerhoven
Subjects: Machine Learning (cs.LG)
Abstract:Recent advances in Human Activity Recognition (HAR) from wearable sensors have shown that multi-modal deep learning models consistently outperform their uni-modal counterparts. Modalities can include IMUs, RGB cameras, audio signals, and others. One important aspect of multi-modal deep learning is the sensor fusion approach we apply. Over recent years, multiple fusion paradigms have been proposed for multi-modal HAR. However, to the best of our knowledge, no head-to-head comparison of these paradigms exists on a common multi-modal HAR benchmark dataset. To address this research gap, we systematically compare seven state-of-the-art sensor fusion methods on the recently released HARMES dataset, which comprises 61 hours of fully labeled IMU, audio, and ambient humidity data. The chosen dataset focuses on 15 household and personal hygiene activities of daily living (ADLs). By applying the seven different fusion techniques to a state-of-the-art multi-modal model architecture, we show that Gated Multi-modal Fusion achieves the highest macro F1-score (0.82), surpassing the concatenation-based late fusion HARMES paper baseline of 0.76 by +6pp under leave-one-participant-out evaluation. All code used in our experiments is made publicly available on GitHub.
[LG-20] FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models
Link: https://arxiv.org/abs/2606.27866
Authors: Fan Mo, Yuxuan Han, Geng Zhang, Wangbo Zhao, Yang You
Subjects: Machine Learning (cs.LG)
Abstract:Mixture-of-Experts (MoE) language models scale model ability with sparsely activated experts, making this architecture a standard recipe for modern large models. However, sparse activation does not remove the deployment burden of storing and serving all experts, and the available deployment budget can vary substantially across devices, users, and workloads. Existing MoE compression methods are still largely fixed-budget, typically optimizing one compressed endpoint at each chosen target budget. We study a different setting: converting a large pretrained MoE LLM into a nested family of deployable subnetworks across budgets. Our method first ranks expert FFN channels by their importance, then lets each expert learn a discrete action to prune its channels. By gradually increasing cost pressure, a single action-training run exports a series of action masks from high to low budgets, each of which identifies a reliable smaller subnetwork nested in the ranked base model. Moreover, we use a single recovery fine-tune at a mid pruning budget (40%) to recover degraded model quality and transfer the recovered model to other unseen budgets. Overall, our framework surpasses recent MoE compression baselines. Specifically, on Qwen2-57B-A14B, our method retains ~99.8% of base performance while pruning 50% of routed expert parameters even without fine-tuning. For deployment, our pruned subnetworks deliver real memory reduction and throughput gains, and further support realtime online budget switching with kernel-level co-design.
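To see why a fixed channel ranking yields nested subnetworks, here is a minimal sketch that keeps a prefix of the ranking for each budget. It only illustrates the nesting property; the paper learns a discrete pruning action per expert, which this sketch does not attempt. The importance scores and keep ratios are made-up values.

```python
def nested_channel_masks(importance, keep_ratios):
    """Return one keep-mask per budget; smaller budgets are subsets of larger ones."""
    # Rank channel indices from most to least important.
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    masks = {}
    for ratio in keep_ratios:
        # Keep a prefix of the ranking, so every smaller budget is contained in a larger one.
        kept = set(order[: max(1, round(ratio * len(importance)))])
        masks[ratio] = [i in kept for i in range(len(importance))]
    return masks

scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]  # hypothetical channel importances
masks = nested_channel_masks(scores, keep_ratios=[1.0, 0.5, 0.25])
```

Since each mask is a prefix of the same ordering, switching to a lower budget only drops channels and never adds new ones.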
[LG-21] USAD: Uncertainty-aware Statistical Adversarial Detection
Link: https://arxiv.org/abs/2606.27832
Authors: Zhijian Zhou, Xunye Tian, Jiacheng Zhang, Zesheng Ye, Yiyi Guo, Donghao Zhang, Liuhua Peng, Feng Liu
Subjects: Machine Learning (cs.LG)
Abstract:Statistical adversarial detection (SAD) treats detection as a two-sample test. Given a reference set of clean examples (CEs) and a batch of queries, potentially containing an unknown mixture of CEs and adversarial examples (AEs), SAD decides whether the query distribution drifts away from the CE distribution while controlling the false-alarm rate. Existing SAD-based methods mainly use maximum mean discrepancy (MMD) to measure the distributional discrepancy. However, MMD’s distributional properties limit its ability to capture characteristic uncertainty patterns of AEs that are crucial for detection: AEs typically exhibit abnormal feature spread (i.e., global uncertainty) and instability under perturbations (i.e., local uncertainty). To close the gap, we propose Uncertainty-aware Statistical Adversarial Detection (USAD), which explicitly captures these uncertainty patterns with two new statistics: (1) Variance Discrepancy (VD), which measures the difference in feature spread between AEs and CEs to capture global uncertainty differences. (2) Perturbation-based Covariance Discrepancy (PCD), which compares feature covariance under Gaussian perturbations to capture local uncertainty differences. By aggregating VD and PCD, USAD achieves superior detection performances over baseline methods against various adversarial attacks, highlighting the importance of considering characteristic behaviors of AEs for effective SAD. Our code is available at: this https URL.
[LG-22] Accelerating Hierarchical Sparse Predictive Coding with Hybrid Amortized Inference
Link: https://arxiv.org/abs/2606.27802
Authors: Kazuhisa Fujita
Subjects: Machine Learning (cs.LG)
Abstract:Hierarchical predictive coding provides an interpretable framework for perception as error-driven inference in multi-layer generative models, while sparse coding imposes parsimonious latent representations through explicit sparsity constraints. Their combination yields hierarchical sparse predictive coding models with appealing computational and neuroscientific properties, but practical use is often limited by the cost of iterative latent inference. In such models, each input may require many recurrent refinement steps before a useful sparse representation is obtained, and this burden becomes more severe as the hierarchy deepens. We study this bottleneck by holding the hierarchical sparse energy fixed and varying the inference procedure. The comparison includes four schemes: classical iterative inference based on ISTA, an accelerated MFISTA reference, structurally informed amortized inference using a LISTA-style bottom-up encoder adapted to the hierarchical model, and a hybrid method in which this fast amortized initialization is followed by a small number of corrective energy-based refinement steps. Under this shared objective, we measure reconstruction quality, sparsity, latency, and stability on static image benchmarks. The results show that a shallow LISTA-style initializer plus short corrective recurrence improves over pure amortization while remaining much faster than long iterative inference.
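The hybrid scheme (fast initialization followed by a few corrective steps) can be sketched on a single-layer sparse coding problem. This is an illustration under strong assumptions: the model is one layer rather than hierarchical, and the "amortized" initialization is faked by a long ISTA run instead of a learned LISTA-style encoder. The matrix `A`, signal `y`, and `lam` are made-up values.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return [max(abs(x) - t, 0.0) * (1 if x > 0 else -1) for x in v]

def objective(A, y, z, lam):
    """Sparse coding energy 0.5 * ||A z - y||^2 + lam * ||z||_1."""
    residual = [sum(a * zj for a, zj in zip(row, z)) - yi for row, yi in zip(A, y)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(zj) for zj in z)

def ista(A, y, lam, z0, steps):
    """Run `steps` ISTA iterations starting from z0."""
    # Step 1/||A||_F^2 is safe because ||A||_F^2 >= largest eigenvalue of A^T A.
    step = 1.0 / sum(a * a for row in A for a in row)
    z = list(z0)
    for _ in range(steps):
        residual = [sum(a * zj for a, zj in zip(row, z)) - yi for row, yi in zip(A, y)]
        grad = [sum(A[i][j] * residual[i] for i in range(len(A))) for j in range(len(z))]
        z = soft_threshold([zj - step * g for zj, g in zip(z, grad)], step * lam)
    return z

A = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]]
y = [1.0, 0.0, 0.5]
lam = 0.1
cold = ista(A, y, lam, [0.0, 0.0, 0.0], steps=3)
warm_init = ista(A, y, lam, [0.0, 0.0, 0.0], steps=200)  # stand-in for an encoder output
hybrid = ista(A, y, lam, warm_init, steps=3)
```

With this step size ISTA never increases the energy, so a few refinement steps from a good initialization end at an energy no higher than the same number of steps from zero.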
[LG-23] Difference of Convex Programming in the Wasserstein Space with Applications to MMD Optimization
Link: https://arxiv.org/abs/2606.27767
Authors: Clément Bonet, Pierre-Cyril Aubin-Frankowski, Youssef Mroueh
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:Optimizing functionals over the space of probability measures is now ubiquitous in machine learning. A widely used approach is to perform the optimization directly over the Wasserstein space, but many objective functionals of practical interest are non-convex along Wasserstein geodesics, making the analysis of standard first-order methods challenging. In this work, we study a class of objectives over the Wasserstein space that admit a difference-of-convex (DC) decomposition and we lift the classical convex-concave procedure (CCCP) to this setting. Under smoothness and strong convexity assumptions on the convex components of the decomposition, we prove almost stationarity along the iterates of the resulting algorithm. Our main focus is on the Maximum Mean Discrepancy (MMD) and the Energy Distance (ED) functionals, for which we develop explicit Wasserstein DC decompositions, and establish local convergence of the scheme under mild assumptions. Empirically, we show that well-chosen DC decompositions yield faster and more stable convergence than Wasserstein gradient descent on these MMD objectives.
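For readers unfamiliar with the convex-concave procedure, the following sketch shows its classical Euclidean form on a one-dimensional toy objective, not the Wasserstein-space version studied in the paper. The objective f(x) = x^4 - x^2 and its split into g(x) = x^4 and h(x) = x^2 are chosen here purely for illustration.

```python
import math

def f(x):
    """Toy difference-of-convex objective: g(x) - h(x) with g = x**4, h = x**2."""
    return x ** 4 - x ** 2

def cccp_step(x_k):
    """Linearize h at x_k, then minimize the convex surrogate x**4 - h'(x_k) * x."""
    slope = 2.0 * x_k          # h'(x_k)
    target = slope / 4.0       # stationarity of the surrogate: 4 x**3 = slope
    return math.copysign(abs(target) ** (1.0 / 3.0), target)

x = 1.0
values = [f(x)]
for _ in range(60):
    x = cccp_step(x)
    values.append(f(x))
```

Starting from x = 1, the iterates approach 1/sqrt(2), a stationary point of f, and the objective values do not increase along the way.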
[LG-24] Layerwise Progressive Freezing: A Training Scaffold for Depth-Scalable Binary Networks
Link: https://arxiv.org/abs/2606.27759
Authors: Evan Gibson Smith, Bashima Islam
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2601.22660
Abstract:Training binary neural networks (BNNs) from scratch is dominated by the straight-through estimator (STE), whose forward/backward mismatch produces severe accuracy degradation as networks deepen. We study an orthogonal axis: when and where binarization is enforced during training. We introduce StoMPP (Stochastic Masked Partial Progressive Binarization), which gradually replaces clipped weights and activations with their hard binary counterparts layer by layer from input to output, using stochastic partial masks with soft refresh. StoMPP delivers two complementary benefits. As a standalone training rule, it provides a fully STE-free procedure that improves over vanilla STE with gains that grow with depth (ResNet-50 BNN: +18.0/+13.5/+3.8 on CIFAR-10/100/ImageNet), and the pattern holds across ResNet-18/34/50, MobileNetV2, and BERT fine-tuning. Composed with surrogate gradients by applying STE only to frozen entries, it reaches +27.1/+19.8/+17.7 over vanilla STE on the same setting. Underlying both regimes is a single mechanistic finding: progression order is decisive. Forward layerwise progression prevents depth collapse, reverse progression collapses to near-chance, and binary-weight networks (without binary activations) are insensitive to order. We trace this asymmetry to activation-induced gradient blockades: a committed binary activation severs gradient flow upstream, and ordering controls when these blockades form. To isolate the progression’s contribution from any benefit conferred by STE, we conduct all ablations in the STE-free regime; the resulting characterization (schedule, refresh, ordering, dynamics) thus reflects the progression itself rather than its interaction with surrogate gradients.
[LG-25] PerturbCellRL: Verifier-Guided Reinforcement Learning for Single-Cell Perturbation Prediction
Link: https://arxiv.org/abs/2606.27752
Authors: Dongxia Wu, Mingyu Li, Yuhui Zhang, Anurendra Kumar, Emma Lundberg, Serena Yeung-Levy, Emily B. Fox
Subjects: Machine Learning (cs.LG)
Abstract:Single-cell perturbation models can reduce costly wet-lab screening by predicting how cells respond transcriptionally to interventions. While recent generative models improve population-level prediction, individual generated cells are not explicitly checked for biological consistency. We introduce PerturbCellRL, a reinforcement learning (RL) framework that post-trains a pretrained single-cell transcriptomic generator using a suite of cell-level verifiers as rewards. These verifiers define four rewards: Pearson top-k similarity, RMSE top-k proximity, DE Spearman, and Pathway activity. The Pathway activity verifier rewards cells whose pathway responses match known perturbation biology. We evaluate PerturbCellRL on multiple genetic and chemical perturbation benchmarks. Across these benchmarks, PerturbCellRL improves over the pretrained flow-matching generator on reward-aligned evaluation metrics and a held-out evaluation metric. Moreover, PerturbCellRL remains competitive with state-of-the-art methods on population-level metrics. Together, these results frame trustworthy single-cell prediction as verifier-guided generative alignment, moving beyond matching expression distributions toward predictions whose single-cell perturbation effects are explicitly checked for biological consistency.
[LG-26] The Weakest Link Tells It All: Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment
Link: https://arxiv.org/abs/2606.27739
Authors: Tianyu Jia, Yue Fang, Hongxin Ding, Rihong Qiu, Zhibang Yang, Zhijing Wu, Xu Chu, Junfeng Zhao, Yasha Wang
Subjects: Machine Learning (cs.LG)
Abstract:Process reward models (PRMs) enhance the reasoning capabilities of large language models (LLMs) by providing fine-grained feedback, yet training PRMs typically requires expensive stepwise annotations. Outcome-supervised PRMs offer a scalable alternative by learning from final-answer correctness alone, but this introduces a fundamental credit assignment challenge, i.e., attributing outcomes to responsible reasoning steps. Existing approaches rely on either uniform or causal assignment, both of which fail to anchor credit in step correctness and thus hinder process error identification. In this work, we propose Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment (LCA), an outcome-supervised PRM framework that jointly learns credit assignment and reward modeling under the principle of Weakest Link Assignment: a reasoning chain is as strong as its weakest link. To address mutual dependence between credit assignment and reward modeling, we formalize outcome-supervised PRM as a Multiple Instance Learning (MIL) problem and introduce Softmax-Weighted-Sum (SWS) pooling, an MIL pooling technique tailored for strong dependence and redundancy among reasoning states. We prove Bayes consistency of our algorithm under mild assumptions. Extensive experiments demonstrate that LCA consistently outperforms state-of-the-art outcome-supervised PRMs across multiple tasks and backbones. Code is available at this https URL.
Cite as: arXiv:2606.27739 [cs.LG] (or arXiv:2606.27739v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.27739 (arXiv-issued DOI via DataCite, pending registration)
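The weakest-link principle can be made concrete with a small sketch: a chain's score is its lowest step score, and a softmax-weighted sum gives a smooth version that concentrates weight on low-scoring steps. The abstract does not define SWS pooling, so the weighting below (softmax over negated scores with a `temperature` parameter) is an assumption made for illustration, not the paper's formula.

```python
import math

def weakest_link_score(step_scores):
    """Hard weakest-link pooling: the chain is as strong as its weakest step."""
    return min(step_scores)

def soft_weakest_link_score(step_scores, temperature=0.1):
    """Smooth pooling (assumed form): softmax weights favor low-scoring steps."""
    logits = [-s / temperature for s in step_scores]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, step_scores))

step_scores = [0.9, 0.8, 0.2, 0.95]  # hypothetical per-step correctness scores
hard = weakest_link_score(step_scores)
soft = soft_weakest_link_score(step_scores)
```

Lowering `temperature` moves the soft score toward the minimum, while a large value moves it toward the plain average of the step scores.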
[LG-27] Reduction of Probabilistic Chemical Reaction Networks ICML2026
Link: https://arxiv.org/abs/2606.27737
Authors: Mauricio Montes, Gregoire Sergeant-Perthuis
Subjects: Machine Learning (cs.LG); Category Theory (math.CT)
Comments: Accepted to ICML 2026
Abstract:Programming adaptive behaviors at the cellular level is a long-standing goal that raises the question of how probabilistic computation can be implemented in biochemical systems. Chemical reaction networks (CRNs) provide such a substrate and have been shown to realize probabilistic models, including hidden Markov models and factor graphs, with dynamics reproducing Bayesian inference and belief propagation. However, encoding these algorithms typically requires prohibitively large reaction networks, and classical CRN reduction techniques do not directly apply. By recovering the factor graph structure encoded in Napp–Adams-compiled CRNs, we transport recent factor-graph reduction results to their chemical implementations, obtaining significantly smaller CRNs while preserving the belief-propagation fixed points on surviving variables.
[LG-28] Learning to Reason with Curriculum II: Compositional Generalization
Link: https://arxiv.org/abs/2606.27721
Authors: Nived Rajaraman, Audrey Huang, Miroslav Dudik, Robert Schapire, Dylan Foster, Akshay Krishnamurthy
Subjects: Machine Learning (cs.LG)
Comments: 82 pages, 5 figures
Abstract:Compositional generalization, the ability to solve complex problems by combining solutions to simpler sub-problems, is a fundamental capability of both natural and artificial intelligence, and a key mechanism underlying chain-of-thought reasoning. However, the theoretical underpinnings of compositional generalization remain poorly understood: when and why does decomposing a problem into parts yield more efficient learning than solving it directly? We study this question through the canonical problem of learning to simulate semiautomata (predicting the outcome of $T$ steps of sequential computation), a model that captures state tracking, regular language recognition, and modular arithmetic. We show that an autocurriculum-based approach building on Part I of this series, recursively decomposing longer sequences into shorter sub-problems, learning to solve them, and composing the solutions, achieves dramatically better statistical complexity than direct methods. (i) For a setting inspired by supervised fine-tuning (SFT) where the learner receives interactive feedback on intermediate states of the computation, curriculum facilitates learning from only $2^{\mathcal{O}(\sqrt{\log T})}$ tokens of supervision; i.e., subpolynomial in the sequence length $T$, overcoming the $\Omega(T)$ token barrier required by direct simulation. (ii) For a setting inspired by reinforcement learning with verifiable rewards (RLVR), where the learner improves a pre-trained reference model using an outcome verifier, we show that curriculum reduces the requirement on the reference model from coverage at the full sequence length $T$ to coverage at a shorter block length $B \ll T$, an exponentially weaker condition.
[LG-29] Aurora: A Leverage-Aware Spectral Optimizer
Link: https://arxiv.org/abs/2606.27715
Authors: Alec Dewulf, Dhruv Pai, Li Yang, Ashley Zhang, Ben Keigwin
Subjects: Machine Learning (cs.LG)
Comments: 30 pages, 12 figures
Abstract:We show that for tall matrix parameters, like projection matrices in the MLP layers, the Muon update can have row norms that are arbitrarily non-uniform. This can lead to a self-reinforcing feedback loop whereby neurons receive persistently small updates and eventually do not contribute meaningfully to network outputs. This problem is effectively mitigated by an additional row normalization step, but current methods do this in a way that moves the Muon update geometry away from the polar factor of the momentum matrix, which we find is undesirable. We propose Aurora, an optimizer that enforces row-uniformity of matrix parameter updates while respecting Muon’s polar factor geometry. Aurora outperforms Muon in our pre-training experiments and, when combined with existing methods, achieves state-of-the-art performance among spectral optimizers on the optimizer track of the modded-nanoGPT speedrun. Additionally, we find that Aurora’s empirical gains over Muon scale with the MLP expansion factor, suggesting that Aurora may allow for effective training of very wide MLP layers.
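The claim about non-uniform row norms can be checked numerically on a toy tall matrix. The sketch below approximates the polar factor with the classical cubic Newton-Schulz iteration (Muon itself is usually described with a tuned higher-order variant, so this is a stand-in), then prints nothing and simply computes row norms. The 4x2 `momentum` matrix is a made-up example, and Aurora's own update rule is not reproduced here.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def polar_factor(M, iterations=30):
    """Approximate U V^T from the SVD of M (full column rank assumed)."""
    # Frobenius normalization puts all singular values in (0, 1], where the iteration converges.
    fro = sum(x * x for row in M for x in row) ** 0.5
    X = [[x / fro for x in row] for row in M]
    for _ in range(iterations):
        XXtX = matmul(X, matmul(transpose(X), X))
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

def row_norms(M):
    return [sum(x * x for x in row) ** 0.5 for row in M]

momentum = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.0], [0.0, 0.1]]  # hypothetical tall matrix
norms = row_norms(polar_factor(momentum))
```

For a tall matrix with orthonormal columns, the squared row norms sum to the number of columns, so they cannot all be large; here the last two rows receive updates roughly ten times smaller than the first two.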
[LG-30] Are Time-Series Foundation Models Ready for E-Nose Data? An Empirical Assessment of Their Embeddings
Link: https://arxiv.org/abs/2606.27672
Authors: Taeyeong Choi, Mohammed Kamruzzaman
Subjects: Machine Learning (cs.LG)
Comments: Submitted to IEEE SENSORS 2026
Abstract:Inspired by advances in natural language processing and computer vision, “time-series foundation models” (TSFMs) have recently been introduced with the promise of strong generalization across diverse time-series tasks, including forecasting, classification, and anomaly detection, as well as across domains such as healthcare, climate science, and manufacturing. However, their utility for gas-sensing data remains largely unexplored. To address this gap, this paper systematically evaluates recent TSFMs on electronic nose (E-Nose) data. In particular, we investigate whether embeddings produced by representative TSFMs, including Chronos-2 and MOMENT, provide effective representations for gas identification and concentration prediction. Specifically, we show that fine-tuning is necessary to achieve satisfactory performance on E-Nose data, and fusing TSFM embeddings with representations learned by specialized predictive models can further improve the performance, suggesting both the potential and limitations of current TSFMs for gas-sensing applications.
[LG-31] TeRoR: Decoupled Temporal Rotation with Relational Circular Region for Temporal Knowledge Graph Embedding
Link: https://arxiv.org/abs/2606.27651
Authors: Peijia Xie, Yike Liu, Chao He, Huiling Zhu
Subjects: Machine Learning (cs.LG)
Abstract:In recent years, with the emergence of Temporal Knowledge Graphs (TKGs), research on learning entity and relation representations in TKGs has attracted increasing attention, giving rise to a large number of TKG embedding methods. TeRo is a simple and efficient temporal knowledge graph embedding approach. However, TeRo does not do well in modeling the mapping properties of various relations, such as one-to-many, many-to-one, and many-to-many. Meanwhile, it also has limitations in the expression of temporal information. To address these issues, we propose a novel TKG embedding method named TeRoR. This method divides the temporal evolution of entity embeddings, and conducts independent rotation transformations on head and tail entities in the complex vector space to strengthen temporal information modeling capacity. In terms of relational characteristics, we train a radius to constrain the rotated and translated head entities within a circular region centered on the tail entity, which effectively captures the diverse mapping properties of relations. Experimental results demonstrate that TeRoR achieves competitive performance against state-of-the-art models on four distinct TKG datasets.
[LG-32] Continual Learning for Sequential Personalization of Small Language Models: A Stability Monitoring Analysis
Link: https://arxiv.org/abs/2606.27634
Authors: Thomas S. Paula, Lucas S. Kupssinskü, Rodrigo C. Barros
Subjects: Machine Learning (cs.LG)
Abstract:Small Language Models (SLMs) are increasingly being considered for deployment on edge devices such as laptops, enabling private, low-latency, and locally personalized applications. However, personalization requires models to adapt over time to evolving user- or task-specific data, placing them in a continual learning setting. This creates the risk of catastrophic forgetting, where learning new information degrades performance on previously learned tasks or broader model capabilities. Recent benchmarks such as TRACE have shown that continual fine-tuning can significantly degrade the general abilities of aligned large language models. In this work, we present a study for sequential LoRA personalization of SLMs. We save model checkpoints after each adaptation stage and evaluate them on current tasks, previously seen tasks, and a fixed reference set. This checkpoint-level protocol enables us to monitor task performance, forgetting, and reference set drift over time. We show that lightweight reference set distributional diagnostics can reveal model-specific instability patterns during sequential LoRA personalization of SLMs, including cases where task-level metrics alone hide harmful adaptation. We hope this can highlight new research avenues for monitoring stability of SLMs in a continual learning setting.
[LG-33] Physics-Guided Robotic Radiation Source Localization along Arbitrary Measurement Paths in Unstructured Environments
Link: https://arxiv.org/abs/2606.27624
Authors: Hojoon Son, Kai Tan, Fan Zhang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 18 pages, 14 figures, 2 tables
Abstract:Using robots to estimate the location of the radiation source is an effective way to improve efficiency and safety. Existing methods focus on planning the robot’s path to achieve precise estimation, typically approaching the source. However, approaching the source increases the risk of radiation damage to a robot. In addition, a path-planning algorithm designed solely for radiation source localization (RSL) limits the flexibility of missions that deploy robots into radioactive environments. This study presents an automation framework for robotic RSL that leverages a physics-informed machine learning (PIML) model to precisely estimate the source location, regardless of measurement paths, in unknown environments. Physics-inspired model tensors have been designed for PIML to handle attenuated gamma-ray flux signals from unknown obstacles, and multiple models are computed in parallel to improve the robustness and precision of the RSL. The proposed method is evaluated in high-fidelity simulation environments using Monte Carlo particle transport across diverse randomized domains, including spatial scales, radiation source types, obstacle materials and geometries, and robot trajectories. The method is also validated through physical experiments on configurations that are not included in the simulation-based evaluation. The continuous learning technique is applied in real-robot deployment to enhance the practical applicability of the online robotic RSL system. The proposed method advances robot radiation perception from pointwise flux detection to spatial intelligence.
[LG-34] FoggyTrust: Robust Federated Learning with Hierarchical Trust Networks
Link: https://arxiv.org/abs/2606.27622
Authors: Emmanuel Rassou, Tomas Gonzalez
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages, 4 figures
Abstract:Byzantine-robust federated learning seeks to protect distributed model training from malicious or corrupted clients without requiring access to their private data. FLTrust addresses this challenge by introducing a trusted server-side root dataset that assigns trust scores to client updates for more robust aggregation. In this work, we propose FOGGYTRUST, a hierarchical extension of FLTrust that localizes trust computation to fog nodes, allowing the framework to better handle globally heterogeneous data while preserving robustness within locally homogeneous client groups. We further show that this two-level architecture can simultaneously address distribution mismatch in trust estimation and client drift across groups by combining local trust-based aggregation with heterogeneity-aware global optimizers such as FedAdam and SCAFFOLD. Across benchmark datasets, FOGGYTRUST achieves its strongest gains on more challenging heterogeneous settings, particularly on CIFAR-10 under Krum and Trim attacks, where it achieves an over 50% improvement over FLTrust. We also test FOGGYTRUST in a real-world safari dataset to show the promise of hierarchical trust networks for robust federated learning in socially impactful, safety-critical settings such as distributed wildlife monitoring.
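The trust-score aggregation that FLTrust is known for can be sketched as follows: each client update is weighted by the clipped cosine similarity to an update computed on the trusted root dataset, and rescaled to the root update's norm. In a hierarchical setup like the one described, a function of this shape would presumably run at each fog node with that node's own root update; how FOGGYTRUST actually combines the groups is not specified in the abstract, so the function name and structure here are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def trust_weighted_aggregate(client_updates, root_update):
    """FLTrust-style aggregation; root_update is assumed to be nonzero."""
    root_norm = math.sqrt(dot(root_update, root_update))
    total_trust = 0.0
    aggregate = [0.0] * len(root_update)
    for update in client_updates:
        norm = math.sqrt(dot(update, update))
        if norm == 0.0:
            continue  # a zero update carries no direction to score
        # Clipped cosine similarity: updates pointing away from the root get zero trust.
        trust = max(0.0, dot(update, root_update) / (norm * root_norm))
        # Rescale to the root norm so that a large malicious update cannot dominate.
        scale = trust * root_norm / norm
        aggregate = [a + scale * u for a, u in zip(aggregate, update)]
        total_trust += trust
    if total_trust == 0.0:
        return [0.0] * len(root_update)
    return [a / total_trust for a in aggregate]

root = [1.0, 0.0]
clients = [[2.0, 0.0], [-5.0, 0.0]]  # second client sends a sign-flipped update
result = trust_weighted_aggregate(clients, root)
```

In this toy case the sign-flipped update receives zero trust, and the honest update is rescaled to the root norm before averaging.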
[LG-35] COOPA: A Modular LLM Agent Architecture for Operations Research Problems
Link: https://arxiv.org/abs/2606.27611
Authors: Chuanhao Li, Xiaoan Xu, Dirk Bergemann, Ethan X. Fang, Yehua Wei, Zhuoran Yang
Subjects: Machine Learning (cs.LG)
Abstract:Operations Research (OR) provides a rigorous framework for high-stakes decision-making, but effective OR modeling requires substantial domain knowledge, mathematical abstraction, and solver expertise. Recent LLM-based systems automate parts of this pipeline, yet remain limited by low accuracy on complex problems, opaque outputs, and narrow solver support. We propose COOPA (COoperative OPerations Agent), a modular LLM-agent architecture for interpretable and scalable OR decision support. It combines three components: iterative confidence-based modeling, which generates multiple candidate formulations, self-evaluates them across modeling dimensions, and selects one using a max-min confidence criterion; element-level provenance and confidence explanations, which link variables, parameters, constraints, and objectives to quoted source text and provide an audit trail for human verification; and multi-solver routing to specialized optimizer agents for different OR problem classes. Across three OR benchmarks, eight LLM backbones, and four baselines under identical conditions, COOPA achieves the best macro-average accuracy on six of eight backbones and improves over the strongest baseline by up to 6.7 percentage points. A within-system ablation isolates the contribution of iterative confidence-based modeling, while additional analyses and case studies illustrate the value of source traceability and multi-solver dispatch.
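A max-min confidence criterion, as named in the abstract, picks the candidate whose worst dimension is best. The sketch below shows that selection rule only; the candidate names, the modeling dimensions, and the confidence values are hypothetical, and how COOPA produces the scores is not covered.

```python
def select_max_min(candidates):
    """Return the candidate whose lowest per-dimension confidence is highest."""
    return max(candidates, key=lambda name: min(candidates[name].values()))

# Hypothetical self-evaluation scores for three candidate formulations.
candidates = {
    "formulation_a": {"variables": 0.95, "constraints": 0.40, "objective": 0.90},
    "formulation_b": {"variables": 0.80, "constraints": 0.75, "objective": 0.70},
    "formulation_c": {"variables": 0.99, "constraints": 0.65, "objective": 0.99},
}
chosen = select_max_min(candidates)
```

Here `formulation_b` wins even though its average is not the highest, because the rule penalizes any single weak dimension rather than rewarding strong ones.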
[LG-36] Training Observable Control Policies to Expose Agent State Through Actions
Link: https://arxiv.org/abs/2606.27609
Authors: Andres Enriquez Fernandez, John J. Bird
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract:Physical or operational constraints often impose communications limitations on autonomous agents. Such limitations complicate monitoring or multiagent coordination. Even when strong communications are absent, some information may still be available. The remainder of the relevant agent state may be reconstructed via estimation. The actions taken by an agent are a potential source of information – as the agent interacts with the environment, these actions may be observed even in the absence of explicit communication. We investigate using actions to estimate the state of an agent, using reinforcement learning to develop policies which make the estimation problem more tractable. Policy observability is encouraged through the training reward and is analyzed using simulation of the trained agent. In an aircraft tracking problem a policy with enhanced observability is found that has minimal impact on nominal task performance.
[LG-37] Quantum Generative Diffusion Model for Real-World Time Series
Link: https://arxiv.org/abs/2606.27561
Authors: Jack Waller, Filippo Caruso, Dimitrios Makris, Rajagopal Nilavalan, Xing Liang
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Abstract:Generative models have achieved remarkable success in data synthesis, though recent advances driven by increasing model scale have introduced challenges in computational cost and efficiency. Quantum machine learning offers a promising alternative, representing complex data distributions using compact, highly expressive models. Here, we propose QDiffusion-TS, the first quantum generative diffusion model for time series synthesis, and validate it on the IQM quantum processor. The framework extends a classical diffusion architecture by replacing feed-forward components within the denoising transformer with quantum neural networks, yielding a hybrid quantum transformer that reduces the number of trainable parameters in each replaced component by nearly three orders of magnitude. Evaluated on financial time series from Apple and Amazon, the model generates synthetic data that more accurately reproduces the real distributions, reducing Wasserstein distance by approximately 44% relative to its classical counterpart across both datasets. In a downstream forecasting task, augmentation with the generated data improves predictive performance by up to 71% in RMSE over a baseline trained solely on real data. These results show that quantum enhanced architectures can consistently match and frequently surpass classical performance with substantially fewer parameters, establishing a practical framework towards more efficient and scalable data-driven generative modelling.
[LG-38] Productionized Fairness Measurement Under Privacy Constraints
Link: https://arxiv.org/abs/2606.27558
Authors: Osonde A. Osoba, Yuzi He, Saikrishna Badrinarayanan, Varun Mithal, Sakshi Jain, Natesh S. Pillai
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:Fairness measurements in the form of disaggregated evaluations often rely on demographic signals that are legally constrained or culturally sensitive. Race and ethnicity signals are among the more difficult signals to curate and use for this task. This paper presents Privacy-Preserving Probabilistic Race/Ethnicity Estimation (PPRE) as a method for enabling fairness measurements with respect to race/ethnicity for U.S. LinkedIn members in a privacy-preserving manner. PPRE applies privacy technologies (specifically: secure two-party computation, differential privacy, and additive homomorphic encryption) on top of two race/ethnicity demographic signal sources (the Bayesian Improved Surname Geocoding estimator and a sparse golden survey set of self-reported demographics) to power a fairness measurement solution with respect to U.S.-based race/ethnicity demographics. We detail its privacy guarantees and demonstrate its application on candidate- and viewer-side fairness measurements. We close with a transferable framework for institutions seeking to implement similar privacy-preserving measurement infrastructure.
[LG-39] Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings ICASSP2026
Link: https://arxiv.org/abs/2606.27543
Authors: Zahra Omidi, John H. L. Hansen
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures. Accepted to ICASSP 2026
Abstract:The variations in vocal effort range (e.g. whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is challenging since effort lies on a continuum, adjacent categories are easily confused, and labeled data remain scarce. Prior SSL approaches with wav2vec2, HuBERT, and AST improve performance on the AVID corpus but still suffer from boundary errors. In this study, we introduce WavLM for the first time in vocal effort classification and benchmark it against wav2vec2 and HuBERT. To address data scarcity, we conduct a systematic study of augmentation strategies, covering RIR convolution, additive noise, time masking, speed perturbation, band-limiting, MixUp, and CutMix. Augmentation consistently improves WavLM, with gains ranging from +0.6% to +1.8% absolute. We further propose Gaussian-neighbor soft labels, which further reduce near-boundary confusions by modeling the vocal effort continuum. Our best system, WavLM-BASE with gradual unfreezing, augmentation, and Gaussian-neighbor soft labels, achieves 78.2% mean accuracy, establishing a new state-of-the-art on AVID.
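One plausible reading of "Gaussian-neighbor soft labels" is sketched below: because the effort classes are ordered (whisper, soft, neutral, loud, shout), the one-hot target is replaced by a discretized Gaussian centered on the true class, so adjacent classes receive some probability mass. The exact formulation and the width `sigma` are assumptions; the abstract does not give them.

```python
import math


def gaussian_soft_label(true_idx, num_classes, sigma=0.5):
    """Soft target over ordered classes, peaked at `true_idx`.

    weight_k is proportional to exp(-(k - true_idx)^2 / (2 * sigma^2)),
    normalized to sum to 1. `sigma` is a hypothetical width parameter.
    """
    weights = [
        math.exp(-((k - true_idx) ** 2) / (2.0 * sigma ** 2))
        for k in range(num_classes)
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    classes = ["whisper", "soft", "neutral", "loud", "shout"]
    target = gaussian_soft_label(classes.index("neutral"), len(classes))
    for name, p in zip(classes, target):
        print(f"{name:8s} {p:.3f}")
```

With a target like this, confusing "neutral" with "soft" is penalized less than confusing it with "whisper", which matches the stated goal of modeling the effort continuum.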
[LG-40] Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition INTERSPEECH2026
Link: https://arxiv.org/abs/2606.27536
Authors: Zahra Omidi, John H.L. Hansen
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures. Accepted to Interspeech 2026
Abstract:Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for categorical emotion and dimensional VAD. Hard-label training is compared with targets from primary and merged primary–secondary annotator vote distributions. Distributional objectives improve alignment with human vote distributions, reducing JSD/KLD relative to hard-label training. Analysis shows that hard supervision partly benefits from assigning ambiguous utterances to the residual Other class, whereas distributional supervision redistributes uncertainty across emotion categories. Entropy-stratified evaluation shows that high-ambiguity utterances remain challenging, but distribution-based supervision better captures perceptual uncertainty. These findings support moving beyond hard labels toward targets that reflect listener disagreement.
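For readers unfamiliar with the JSD/KLD metrics used above, here is a minimal sketch of how a predicted distribution can be compared against an annotator vote distribution. The three-class vectors are made-up toy values, not data from the paper, and natural logarithms are an arbitrary choice of units.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2) nats."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    votes = [0.6, 0.3, 0.1]       # hypothetical annotator vote distribution
    hard_label = [1.0, 0.0, 0.0]  # consensus label as a one-hot target
    print(js_divergence(hard_label, votes))
    print(kl_divergence(votes, [0.5, 0.3, 0.2]))
```

JSD stays finite even when one distribution assigns zero probability to a class, which makes it convenient for comparing against one-hot targets, where plain KLD can blow up.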
[LG-41] Boundary condition fidelity for bottom-hole pressure and CO2 plume prediction in geological carbon storage
Link: https://arxiv.org/abs/2606.27515
Authors: Romal Ramadhan, Seyyed A. Hosseini, Larry W. Lake
Categories: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Abstract:Accurate prediction of bottom-hole pressure (BHP) and CO2 plume migration is essential for safe geological carbon storage, yet practical simulations often rely on truncated domains where artificial boundaries distort pressure diffusion and CO2 saturation footprints. In this study, we evaluate how boundary-condition fidelity affects BHP and CO2 plume prediction by comparing ten reduced-domain boundary treatments against full-domain reference simulations in homogeneous and heterogeneous reservoirs. We test uniform pore-volume multipliers, transmissibility modifiers, corner-adjusted pore-volume corrections, layered corrections, and gradual modifiers using BHP RMSE, NRMSE, peak pressure deviation, and plume Intersection over Union (IoU) as performance metrics. Our results show that conserving corner pore volume is the most important requirement for truncated-domain modeling. We find that uniform treatments which neglect corner storage generate large pressure errors, with BHP RMSE of 362 to 382 psi in the homogeneous model and 250 to 304 psi in the heterogeneous model, and yield plume IoU values near 0.80 to 0.84, indicating roughly 16 to 20% of the combined plume area is misrepresented. Corner-adjusted scenarios substantially reduce pressure errors and raise plume IoU above 0.94, but we observe that transmissibility correction is not universally beneficial. In homogeneous reservoirs, uniform transmissibility adjustment improves pressure fidelity; in heterogeneous reservoirs, it can over-restrict flow across variable-permeability boundary faces, increasing BHP error and contracting the predicted plume. We find the gradual modifier with transmissibility correction provides the most consistent performance, achieving BHP NRMSE below 3.7% and plume IoU above 0.97 in both reservoir types.
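The evaluation metrics named above can be sketched in a few lines. Representing the plume as a set of grid-cell indices and normalizing RMSE by the range of the reference series are assumptions made for illustration; the abstract does not state how the plume is delineated or which NRMSE normalization is used.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between two equal-length series."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def nrmse(pred, ref):
    """RMSE divided by the reference range (assumed normalization)."""
    return rmse(pred, ref) / (max(ref) - min(ref))


def plume_iou(pred_cells, ref_cells):
    """Intersection over Union of two plume footprints given as cell sets."""
    pred_cells, ref_cells = set(pred_cells), set(ref_cells)
    union = pred_cells | ref_cells
    # Two empty footprints are treated as a perfect match.
    return len(pred_cells & ref_cells) / len(union) if union else 1.0


if __name__ == "__main__":
    # Toy footprints on a 2-D grid: (row, column) cells above a saturation cutoff.
    reduced = {(0, 0), (0, 1), (1, 0)}
    full = {(0, 1), (1, 0), (1, 1)}
    iou = plume_iou(reduced, full)
    print(iou, 1.0 - iou)  # IoU and the mismatched share of the combined area
```

Since IoU is the overlap divided by the combined footprint, `1 - IoU` is the share of the combined area covered by only one of the two plumes, which is how an IoU of 0.80 to 0.84 corresponds to the "roughly 16 to 20%" figure quoted above.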
[LG-42] Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience
Link: https://arxiv.org/abs/2606.27475
Authors: Raymond Yu, William Huey, Mustafa Mukadam, Anusha Nagabandi, Abhishek Gupta
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 35 pages, 23 figures
Abstract:Robots trained on real world data tend to be imprecise, slow, and brittle to perturbations. Improving these policies with reinforcement learning (RL) is an appealing alternative, but this process often requires expensive training in the real world. Performing policy improvement in simulation instead provides a far cheaper alternative, but unconstrained RL in simulation can exploit contact and dynamics mismatches, resulting in unsafe behaviors that do not transfer to hardware. Common forms of regularization can furthermore limit improvement by overconstraining to an imperfect behavior prior. In this work, we propose Support-Constrained Off-Domain REinforcement (SCORE), a real-to-sim-to-real framework that constrains RL in simulation to the support of a generative policy pretrained on real data. We instantiate this constraint through flow steering, restricting SCORE to actions the base policy can already produce, which ensures transferable behaviors while maximizing policy improvement. Improving a policy with SCORE requires minimal effort: it learns from sparse rewards, avoids distillation, and leaves the base policy untouched. Across eight real-world dexterous multi-fingered robotic manipulation tasks, SCORE improves average success rate from 37.8% to 89.9%, compared to 59.5% for the best baseline, and reaches success in 36.8% fewer steps than the base policy. Ultimately, through extensive experiments and ablations, we show that simulation can substantially improve real-world manipulation policies when policy optimization is appropriately constrained, introducing a new paradigm for real-to-sim-to-real policy improvement. Videos and code are available at this https URL.
[LG-43] Operator Learning for Cubic Nonlinear Schrödinger Equation on Periodic Domains
Link: https://arxiv.org/abs/2606.27459
Authors: Emmanuel E. Oguadimma, Victory C. Obieke, Xueying Yu
Categories: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA)
Comments: 21 pages
Abstract:We consider the cubic nonlinear Schrödinger (NLS) equation on two-dimensional flat tori with varying aspect ratios. In this formulation, the choice of aspect ratio governs the Fourier resonance structure, so rational and irrational geometries can exhibit different high-frequency cascade behaviors. We present a geometry-conditioned Fourier neural operator (FNO) for the cubic defocusing NLS equation, where the input consists of the real and imaginary parts of the solution together with the aspect-ratio parameter $\omega^2$. The model is trained to approximate the one-step solution operator and is evaluated on unseen trajectories generated from random-phase initial data using a Fourier pseudospectral method. Our numerical experiments show that the learned operator captures the main solution dynamics on both tori and reproduces the distinct Sobolev norm behavior of the two geometries, with stronger $H^2$-growth on the rational torus and more constrained behavior on the irrational torus, consistent with the findings of \cite{hrabski2021energy}. We perform ablation studies to examine the roles of retained Fourier modes, activation functions, Fourier-layer depth, and explicit geometry conditioning. The results indicate that including $\omega^2$ improves long-time predictive accuracy, especially for the rational geometry, and support the use of geometry-aware neural operators for learning spectral-transfer phenomena in nonlinear dispersive partial differential equations.
[LG-44] Prism Transformer: Progressive Head Schedules for Hierarchical Attention Processing
Link: https://arxiv.org/abs/2606.27449
Authors: Shubham Aggarwal
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 4 figures, 5 tables
Abstract:Multi-head attention conventionally partitions the hidden dimension equally across all heads at every layer, enforcing an identical representational subspace dimension ($d_h = d_{\text{model}}/h$) throughout the model's depth. In this work, we identify this uniform allocation as a fundamental structural bottleneck: due to their restricted dimensional space, early-layer heads are unable to faithfully capture complex, high-dimensional contextual patterns. To resolve this, we introduce the Prism Transformer, a novel architectural paradigm that replaces the static, uniform head configuration with a progressive head schedule. By monotonically increasing the head count across layers, the Prism Transformer naturally establishes a local-to-global representational hierarchy: early layers leverage fewer, exceptionally wide heads to capture complex, local compositional patterns, while deep layers deploy many narrow heads to decompose these patterns into specialized linguistic features. Crucially, this structural shift is parameter-neutral and compute-neutral and introduces zero training or inference overhead, preserving the same weight matrices and FLOP budget as the standard Transformer. Across three model scales (124M, 354M, and 757M), the Prism Transformer outperforms uniform baselines at every scale, reducing validation loss and improving downstream zero-shot benchmarks (including PIQA, HellaSwag, ARC-Easy, and WinoGrande). Our findings demonstrate that non-uniform subspace allocation unlocks latent capacity within the standard Transformer budget, enabling more effective use of model capacity.
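To make the parameter-neutrality claim concrete, the sketch below computes the per-head width for a layer-wise head schedule at fixed model width. The schedule `[2, 4, 8, 12, 16, 24]` and the width 768 are hypothetical values chosen so that every head count divides the width; the paper's actual schedules are not given in the abstract.

```python
def head_dims(d_model, head_schedule):
    """Per-layer head width d_h = d_model / h for each head count h."""
    dims = []
    for h in head_schedule:
        if d_model % h != 0:
            raise ValueError(f"{h} heads do not evenly divide d_model={d_model}")
        dims.append(d_model // h)
    return dims


def attention_projection_params(d_model):
    """Weights in the Q, K, V and output projections of one layer (no biases).

    Each projection is d_model x d_model regardless of how the columns are
    later split into heads, so the count does not depend on the head count.
    """
    return 4 * d_model * d_model


if __name__ == "__main__":
    schedule = [2, 4, 8, 12, 16, 24]  # hypothetical progressive schedule
    print(head_dims(768, schedule))
    print(attention_projection_params(768))
```

Because the head count only changes how the same projected vectors are reshaped before attention, widening early heads and narrowing late ones leaves the projection weight count untouched.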
[LG-45] Learning in Markovian bandits with non-observable states and constrained decision epochs
Link: https://arxiv.org/abs/2606.27448
Authors: Thomas Hira, Victor Boone, Urtzi Ayesta, Ina Maria Verloop
Categories: Machine Learning (cs.LG)
Abstract:This paper studies the problem of regret minimization in Markovian bandits with non-observable states and possibly constrained decision epochs. The focus is restricted to a "pure" regret benchmark that compares the performance of the learning algorithm to the best pure policy, which (akin to optimal policies of stochastic bandits) picks the optimal arm from start to finish without ever switching. We introduce a generalization of rested Markovian bandits, self-degrading Markovian bandits, for which pure policies are always asymptotically optimal. We show that without prior knowledge of the underlying bandit, the regret of algorithms that switch arms rarely necessarily scales super-logarithmically for every bandit, i.e., as $\omega(\log(T))$, where $T$ is the learning horizon. Despite the unreachability of the logarithmic regime, we design UCB-NOM, an optimistic algorithm inspired by UCB, whose regret is nearly logarithmic. Lastly, we show that given prior knowledge of the Markovian bandit in the form of a bound on the bias functions of its arms, a proper instantiation of UCB-NOM achieves $O(\log(T))$ regret. We further show that this prior knowledge allows for an $O(\sqrt{T} \log(T))$ worst-case regret bound for UCB-NOM. Notably, our regret bounds do not depend on the number of states of the underlying Markov chains. Our findings suggest that the non-observability of states is a mild inconvenience in self-degrading Markovian bandits.
[LG-46] PairSAE: Mechanistic Interpretability from Pair Representations in Protein Co-Folding
Link: https://arxiv.org/abs/2606.27440
Authors: Giosue Migliorini, Aristofanis Rontogiannis, Grigori Guitchounts, Nicholas Franklin, Axel Elaldi, Olivia Viessmann
Categories: Machine Learning (cs.LG)
Comments: Accepted at the Machine Learning in Structural Biology (MLSB) 2025 workshop
Abstract:Foundation models for structural biology have achieved remarkable performance in predicting biomolecular structure and show promise for the design of proteins and small molecules. Yet understanding which internal features drive their outputs remains challenging. Standard sparse autoencoders (SAEs), effective on transformer-style sequence embeddings, do not transfer cleanly to pairformer-like architectures: naively operating on pairwise representations yields a quadratic blow-up of features and obscures concepts distributed jointly across sequence and pair representations. We introduce PairSAE, which summarizes pairwise tensors via an N-mode SVD into token-wise interaction roles, then uses a sparse autoencoder to learn a shared set of token-level features that decode into both sequence and pair representations. Evaluated on Boltz-2 activations for PLINDER protein-ligand complexes, PairSAE yields interpretable features that align with UniProt annotations and predict Boltz-2 affinity values. These results indicate that PairSAE links the latent space of foundation models for structural biology to interpretable structural concepts, clarifying what the model “knows” while avoiding pairformer-induced pitfalls that limit conventional SAEs.
[LG-47] Unified Zero-Shot Time Series Forecasting: A Darts Foundation
Link: https://arxiv.org/abs/2606.27438
Authors: Zhihao Dai, Dennis Bader, Alain Gysi
Categories: Machine Learning (cs.LG)
Abstract:Since its initial release in 2020, Darts has become a widely used open-source Python library for time series analysis. A series of foundation models have recently claimed accuracy improvements in zero-shot forecasting, promising a paradigm shift from training custom models to harnessing pre-trained general-purpose forecasters. Foundation models, however, are often released as isolated packages with fragmented interfaces and limited interoperability with common tooling, making joint evaluation and integration within complete pipelines difficult. In Darts, we developed a unified FoundationModel class collection (Chronos-2, TimesFM 2.5, TiRex, PatchTST-FM) that provides standardized, full-cycle forecasting interfaces with minimal external dependencies for integrating foundation models into the ecosystem. Existing Darts pipelines can now use foundation models with only a name change; new pipelines can use them for zero-shot or fine-tuned forecasting, uncertainty estimation, and backtesting, combined with data processing and evaluation tooling, all within a unified framework.
[LG-48] Test-Input Generation for Tensor Programs: What Actually Finds Kernel Bugs
Link: https://arxiv.org/abs/2606.27396
Authors: Dipankar Sarkar
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure, LNCS format. Companion paper: arXiv:2606.20128 (P1). Additional companions (P2, P4) to follow on arXiv this week; IDs will be added in a v2 replace
Abstract:Test-input generation for tensor kernels is folkloric. Most projects pick a representative shape and dtype, run a fixed-shape allclose-style check, and ship. We make the choices explicit and measure them. Using the gpuemu op-schema-aware seeded fuzzer (arXiv:2606.20128), we evaluate seven test-generation strategies across a 26-op corpus (16 correct controls and 10 LLM-style buggy variants seeded with documented transcription patterns) on an RTX 3060 GPU instance. Strategies vary the shape candidate set, the dtype mix, and the input value distribution. We report each strategy on two axes: bug recall and control false-positive (FP) rate. Boundary-only shape sampling is the operationally safe winner: 78% recall on the 10 buggy kernels with 0% FP on the 16 controls. Adversarial value sampling reaches higher recall (99%) but inflates control FP to 94% because the strategy injects NaN and Inf inputs and the validator’s NaN check fires on every kernel that propagates them, not only on buggy kernels. On the two softmax tail-mask bugs the “regular” strategy (no boundary shapes) catches 0%, while boundary raises recall to 100% and 62% respectively. That gap is the clearest single signal in the data. The corpus result is about which seeded bug patterns each strategy catches, not about the bug rate of any specific deployed LLM.
[LG-49] Forecasting Technological Directions in Wireless Networks and Mobile Computing via AutoML Framework
Link: https://arxiv.org/abs/2606.27394
Authors: Ahmed Abolfadl, Marwa Mahmoud, Basma Afifi, Mervat Abu-Elkheir, Maggie Mashaly
Categories: Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: Conference: 2025 IEEE Middle East Conference on Communications and Networking (MECOM)
Abstract:The exponential increase in scientific publications has driven the emergence of new trends. Accurate forecasting of these developments is essential for researchers and professionals to stay updated with advancements in the field. This study presents an automated pipeline for trend prediction in the wireless networks and mobile computing domain by integrating clustering, topic modeling, and time series analysis. The process begins with the collection of 127,820 abstracts from high-impact journals and conferences, followed by extensive preprocessing and semantic embedding using the SPECTER model. AutoCluster applies meta-learning to select the most suitable clustering algorithm based on the dataset meta-features, ensuring semantically coherent groupings. AutoTopicModeling then employs a successive halving strategy to identify the best-performing topic model per cluster, followed by LLM-assisted topic labeling and optional label generalization. Finally, AutoTrendAnalysis transforms topic-labeled data into time series and applies forecasting models (ARIMA, STL, Prophet, or LSTM) to predict future topic popularity. Topics are classified as strong, weak, or noise signals based on forecast trajectories, offering interpretable insights into emerging and declining research themes. The framework is scalable, adaptive, and designed for robust trend analysis across scientific domains. Experimental results demonstrated high predictive accuracy, achieving a Root Mean Square Error (RMSE) of 36.76.
[LG-50] Elastic Time: Dynamic Frame Rate Bottlenecks for Neural Audio Coding INTERSPEECH2026
Link: https://arxiv.org/abs/2606.27320
Authors: Dimitrios Bralios, Paris Smaragdis, Minje Kim
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Interspeech 2026
Abstract:Neural audio autoencoders have become a core component of compression, feature extraction, and generation. However, while existing systems support variable bitrate, the vast majority of models still operate at a fixed latent frame-rate, allocating equal temporal budget to regions with very different information density, which can result in unnecessarily long sequences. We introduce Elastic Time, a dynamic frame-rate bottleneck that converts fixed-frame-rate autoencoders to dynamic ones. Our method learns a lightweight latent predictor used to decide which frames can be skipped and later reconstructed, enabling efficient greedy boundary selection at inference. Experiments show our method enables deployment-time rate control while improving efficiency-quality tradeoffs relative to baselines. Overall, we provide a flexible mechanism for adjusting temporal resolution in audio autoencoders, potentially facilitating more efficient downstream modeling for generation and long-context tasks.
[LG-51] Surprises in Proper Positive-Only Learning
Link: https://arxiv.org/abs/2606.28309
Authors: Shai Ben-David, Farnam Mansouri, Anay Mehrotra, Manolis Zampetakis
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Abstract:Binary classification from positive-only samples is a variant of PAC learning in which the learner receives i.i.d. samples from the positive region of an unknown target concept, but is evaluated under the original distribution (which places mass on both positive and negative regions). This model dates back to Natarajan [1987, STOC], and the characterization of improper learning is well-known – it even appears in textbooks. The characterization of proper positive-only learning, however, has long remained open. In this work, we revisit and settle this question: a concept class is properly learnable from positive-only samples if and only if it has finite VC dimension and satisfies a new combinatorial condition, which we call uniform exterior separability. Together with several separation results, this characterization reveals a surprisingly rich landscape that differs sharply from standard PAC learning: proper and improper learning are separated, randomized and deterministic proper learning are separated, there are classes for which no ERM is a learner, and finite VC dimension does not suffice even for non-uniform learning. Along the way, we introduce new combinatorial dimensions that we believe can be of broader interest in learning theory.
[LG-52] Second-Order KKT Guarantees for Bregman ADMM in Nonconvex and Non-Lipschitz Optimization
Link: https://arxiv.org/abs/2606.28307
Authors: Shuang Li, Zhihui Zhu, Qiuwei Li
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Abstract:We analyze Bregman ADMM for nonconvex linearly constrained problems under two-sided relative smoothness, a condition that replaces the standard Lipschitz gradient assumption with a Hessian comparison relative to a Bregman kernel. This setting covers polynomial objectives arising in matrix and tensor models for which a global Lipschitz-gradient constant need not exist. We show that on an invariant open state-space domain, one iteration of Bregman ADMM defines a smooth primal–dual fixed-point map whose strict-saddle KKT points are unstable fixed points; consequently, from random initialization the iterates converge to a strict saddle with probability zero. Combined with existing first-order convergence results, this yields almost-sure second-order stationarity of limiting KKT points. We extend the analysis to a multi-block star consensus formulation for distributed optimization. The technical novelty lies in a determinant reduction with a Bregman-specific symmetrization and scaling step in the two block spectral argument, together with a null space cancellation exploiting the star graph structure in the consensus case. Numerical experiments on distributed matrix factorization illustrate the theory, and a symmetric tensor factorization example demonstrates the broader Bregman proximal splitting idea beyond the separable consensus setting.
[LG-53] Bridging Ab Initio Symmetries and Global Nuclear Masses with Interpretable Neural Networks
Link: https://arxiv.org/abs/2606.28287
Authors: Phong Dang, Evander Espinoza, Xiaoliang Wan, Michela Negro, Jerry P. Draayer, Feng Pan, Tomas Dytrych, Daniel Langr, David Kekejian
Categories: Nuclear Theory (nucl-th); Machine Learning (cs.LG)
Abstract:Ab initio modeling has established Wigner’s SU(4) and Elliott’s SU(3) as dominant symmetries of the nuclear force in light and intermediate-mass nuclei. We ask whether they also govern nuclear binding across the entire chart. Our aim is not high-precision prediction but physical insight, through interpretable, symmetry-based models. From the SU(3) and SU(4) Casimir operators we construct three neural-network (NN) mass models: Feature-Informed NN (FINN) for point predictions, Gaussian-Informed NN (GINN) adding uncertainty quantification, and Wigner-Informed NN (WINN) – a mass formula using the Casimirs as an operator basis. All are trained on AME2016 and validated on nuclei new to AME2020. The SU(4) operators alone cut the root-mean-square error (RMSE) by nearly half on train and test data, and by about a fifth on extrapolation, relative to the liquid-drop baseline – showing that Wigner’s symmetry carries predictive information beyond bulk properties. Despite its compact form, WINN reaches the lowest validation RMSE, 0.430 MeV – competitive with state-of-the-art mass models – which we read less as a benchmark than as evidence that its symmetry basis captures important physics. WINN further reveals i) an enhancement of the quadratic SU(4) Casimir near the neutron dripline, signaling restoration of Wigner’s symmetry, and ii) an unexpected gain of the quartic operator in the superheavy region. We thereby elevate emergent symmetries from the hidden order within individual nuclei to a governing principle of the whole nuclear chart.
[LG-54] Parameter-Efficient Continuous-Variable Photonic Quantum Neural Networks for Edge Quantum AI: Demonstration in Oral Cancer Detection
Link: https://arxiv.org/abs/2606.28252
Authors: Akshay Bhagwan Sonawane, Sophie Choe, Lakshman Tamil
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:Early detection of oral cancer markedly improves clinical outcomes, yet specialized diagnostic tools remain scarce in low-resource settings. Smartphone-based screening is a scalable alternative but needs lightweight models that run within edge-hardware constraints. Hybrid classical-quantum architectures are emerging candidates for parameter-efficient learning, yet most rely on qubit hardware that needs cryogenic operation, which is unsuitable for edge deployment. Continuous-variable (CV) photonic quantum computing, which operates at room temperature, offers a complementary route. We investigate a hybrid classical-CV quantum classifier for oral cancer detection from smartphone images. The pipeline combines a MobileNetV1 feature extractor, principal component analysis to 16 dimensions, and a parameterized CV-QNN of displacement, interferometric, and Kerr gates on a photonic backend. We propose a simplified $\Phi \circ D \circ U_1$ CV-QNN architecture that cuts trainable parameters by 40-45% relative to the standard CV-QNN layer of Killoran et al. (2019a), and identify dimensionality-reduction and encoding-restriction strategies that mitigate barren plateaus, raising loss-gradient variance by roughly 58 orders of magnitude. Whether the simplified layer beats the full layer is width-dependent: the full layer holds a small but significant edge at two qumodes, whereas the simplified layer is significantly better at four qumodes while using 44% fewer parameters. The strongest model, a four-qumode simplified CV-QNN with only 18 parameters, attains the highest validation AUC of all models, exceeds a 55-parameter classical baseline using 67% fewer parameters, and reaches 100% calibrated test accuracy across all seeds. These results support CV photonic quantum machine learning for parameter-efficient, room-temperature medical image classification and motivate progress toward edge quantum AI.
[LG-55] Physics-constrained neural networks for surrogate modeling of lossless periodic structures
Link: https://arxiv.org/abs/2606.28119
Authors: Eric Prehn, Peter Jung
Categories: Optics (physics.optics); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures. Supplementary Document 1 and Supplement 2 (Visualization 1) are provided as ancillary files
Abstract:We introduce a physics-constrained neural network (PCNN) for the rapid prediction of rigorous coupled-wave analysis (RCWA) outputs in the form of Jones matrices. Starting from energy conservation in lossless layered periodic structures, we use the fact that RCWA outputs lie on a Stiefel manifold. This energy constraint is enforced as a hard condition by projecting onto the manifold using differentiable symmetric orthogonalization. The resulting surrogate enforces energy conservation by construction while preserving differentiability for gradient-based inverse design. The performance and generality of the proposed approach are demonstrated through the inverse design of a diffractive waveguide combiner for augmented reality glasses.
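As an illustration of the projection step mentioned in the abstract, the sketch below orthogonalizes a small real matrix so that its columns become orthonormal, i.e. a point on a Stiefel manifold. It is only a toy under stated assumptions: the abstract does not say how the authors compute the symmetric orthogonalization, so a Newton-Schulz iteration is swapped in here, the matrix is real rather than complex-valued, and the names `symmetric_orthogonalize` and `raw` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def symmetric_orthogonalize(a, iters=50):
    """Approximate the polar factor Q = A (A^T A)^(-1/2) of a full-rank real matrix.

    Assumption: the Newton-Schulz iteration X <- X (3 I - X^T X) / 2 stands in
    for whatever solver the paper uses. It needs all singular values of the
    start matrix in (0, sqrt(3)); dividing by the Frobenius norm ensures this
    for full-rank input.
    """
    fro = sum(v * v for row in a for v in row) ** 0.5
    x = [[v / fro for v in row] for row in a]
    k = len(a[0])
    for _ in range(iters):
        g = matmul(transpose(x), x)  # Gram matrix X^T X
        m = [[(3.0 if i == j else 0.0) - g[i][j] for j in range(k)]
             for i in range(k)]
        x = [[0.5 * v for v in row] for row in matmul(x, m)]
    return x


# Hypothetical unconstrained network output (3 x 2, full column rank).
raw = [[2.0, 1.0], [0.5, 3.0], [1.0, -1.0]]
q = symmetric_orthogonalize(raw)
gram = matmul(transpose(q), q)  # close to the 2 x 2 identity
```

Every operation in the loop is a matrix product, which is why a projection of this kind can remain differentiable inside a network; an SVD-based polar decomposition would be the more common choice in practice.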
[LG-56] Mosaic: A Benchmark Suite for Differentiable Physics Solvers
Link: https://arxiv.org/abs/2606.27895
Authors: Andrin Rehmann,Heiko Zimmermann,Dion Häfner
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments: 32 pages, 24 figures, 3 tables. Code available at this https URL
Abstract:Differentiable partial differential equation (PDE) solvers underpin solver-in-the-loop ML training, gradient-based optimal control, and inverse problems, yet the practical cost of obtaining correct, usable gradients from a given solver on a given problem is largely undocumented. Integration effort, computational cost, gradient accuracy, and numerical conditioning vary widely across solvers and are discoverable only by trial and error. We introduce Mosaic, an extensible benchmarking framework for differentiable PDE solvers that standardizes access to solver gradients. Each solver is packaged as a containerized component (Tesseract) exposing a uniform gradient API regardless of language or automatic differentiation (AD) strategy, enabling researchers to evaluate, compare, and build on non-trivial physical solvers. Our evaluation of 14 solvers across fluid dynamics, structural mechanics, and heat transfer demonstrates that the benchmark surfaces practically relevant differences: order-of-magnitude variation in computational cost and Jacobian conditioning, alongside structural incompatibilities that eliminate solvers from realistic tasks entirely. Despite this variation, all solvers that produce gradients converge to similar optima, indicating that the practical barriers are memory limits, numerical stability, and setup compatibility rather than gradient accuracy alone. Mosaic is open-source and available at this https URL.
[LG-57] Quantum Dynamic Time Warping for Multivariate Time Series Classification
Link: https://arxiv.org/abs/2606.27815
Authors: Diego Alvarez-Estevez,Alejandro Mayorga-Redondo,Eduardo Mosqueira-Rey
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:Dynamic Time Warping (DTW) is a cornerstone for time series classification, but its reliance on Euclidean distances fails to capture latent cross-channel correlations in complex multivariate data. We propose a hybrid Quantum Dynamic Time Warping (qDTW) architecture, replacing the classical distance metric with the parameterized geometry of a quantum Hilbert space. Through structural ablation on benchmarks up to C=8 spatial dimensions, we establish fundamental topological rules for quantum sequence alignment. We introduce a Unified Pre-Embedding Adjoint Ansatz that decouples trainable entanglement from classical data, eliminating the severe phase-scrambling and information bottlenecks inherent to traditional measurements. We demonstrate this decoupled architecture allows untrained quantum kernels to act as highly expressive baselines, while parameterized training effectively untangles deeply overlapping hyper-dimensional data. Furthermore, we identify a strict spatial-temporal expressivity tradeoff: temporal depth (data re-uploading) is necessary for dimensionally restricted univariate circuits, but applying it to wide multi-qubit registers triggers chaotic frequency-spectrum explosions and representation collapse. By navigating these topological hazards, our multivariate quantum architecture outperforms classical baselines, setting a new standard for integrating parameterized quantum circuits with dynamic programming.
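For readers unfamiliar with the classical baseline, here is a minimal sketch of standard DTW with a Euclidean local distance on multichannel sequences. It shows only the classical dynamic program; the `dist` argument marks where a different local metric (such as a quantum kernel distance, which is not implemented here) could in principle be plugged in. The function names and toy sequences are illustrative assumptions, not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two multichannel samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(x, y, dist=euclidean):
    """Classical DTW cost between sequences x and y (lists of channel vectors)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = best alignment cost of x[:i] against y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(x[i - 1], y[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # only x advances
                                    cost[i][j - 1],      # only y advances
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Two-channel toy sequences; b repeats the first sample of a.
a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
b = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
```

Because warping absorbs the repeated sample, `dtw_distance(a, b)` is zero even though the sequences have different lengths.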
[LG-58] Distributed Air-Gap Flux and Rotor-Current Fusion for Operating-Regime Identification in a 10-MW Kaplan Hydrogenerator
Link: https://arxiv.org/abs/2606.27800
Authors: Eduardo Jr Piedad,Rafel Roig,Xavier Escaler,Eduardo Prieto-Araujo,Oriol Gomis-Bellmunt
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 7 tables
Abstract:Reliable monitoring of hydroelectric generators requires descriptors that capture both electrical loading and electromagnetic field behavior. This work investigates operating-regime identification in the Porjus U9 10-MW Kaplan hydrogenerator using synchronized measurements from ten stator-mounted Hall probes and six rotor-current channels. Seven steady guide-vane-opening settings are considered, and each 300s record is divided into 1s windows. The resulting windows are represented by spatial Fourier descriptors of the circumferential air-gap field, probe-wise temporal flux indicators, and channel-wise RMS rotor-current features. Correlation analysis and principal component analysis are used to examine how the feature groups vary with the operating point, and Random Forest, radial-basis-function support vector classification, and multilayer perceptron models are evaluated for supervised identification of the guide-vane-opening state. The analysis shows that RMS rotor-current features mainly track the loading axis, while the magnetic-flux features reveal complementary information associated with spatial imbalance, waveform distortion, and weak low-frequency modulation. Spatial descriptors alone provide limited separability, yielding test accuracies below 27%, whereas rotor-current features alone reach about 84-85%. Combining flux and current information gives the most discriminative representation; the SVC-RBF model achieves 99.5% test accuracy and macro-F1 score. The results indicate that distributed air-gap magnetic sensing, when fused with rotor-current measurements, can support accurate and interpretable data-driven monitoring of Kaplan hydrogenerator operating regimes.
[LG-59] CANNs: A Toolkit for Research on Continuous Attractor Neural Networks
Link: https://arxiv.org/abs/2606.27783
Authors: Sichao He,Aiersi Tuerhong,Shangjun She,Tianhao Chu,Yuling Wu,Junfeng Zuo,Si Wu
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Code: this https URL; Rust backend: this https URL
Abstract:Continuous attractor neural networks (CANNs) are the canonical computational framework for how the brain encodes continuous variables such as spatial position, head direction, and movement direction, and explain the activity of hippocampal place cells, entorhinal grid cells, and head-direction cells. CANN research, however, is fragmented: most results rest on lab-specific implementations, general-purpose simulators lack CANN-specific abstractions, and the path from spike trains to attractor geometry in real recordings lacks a standardized toolkit. Here, we present a comprehensive open-source toolkit that unifies the full CANN research workflow. It combines three tightly integrated components: 1) canns, a Python library on BrainPy/JAX that provides standardized 1D/2D CANNs, spike-frequency-adaptation variants, grid cell networks, hierarchical path-integration models, and brain-inspired attractor architectures, together with curated datasets, task generators, an analyzer module and trainer modules for biologically plausible plasticity; 2) canns-lib, a Rust acceleration backend delivering hundreds-of-times speedups for spatial-navigation workloads and modest gains for Ripser-based persistent homology; 3) ASA (Attractor Structure Analyzer), a PySide6 pipeline applying persistent homology and cohomology to experimental neural recordings to detect ring-like and toroidal attractor signatures in real data. The toolkit ships with full-detail reproducible pipelines that recover recent CANN results including SFA-driven anticipative tracking, theta sweeps in head-direction/place/grid systems, and hierarchical path integration.
[LG-60] Adversarial Contamination Meets Hard Thresholding: An Iterative Algorithm with Signal Adaptivity and Minimax Optimality
Link: https://arxiv.org/abs/2606.27685
Authors: Shixiang Liu,Hanming Yang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 56 pages, 6 figures
Abstract:Pervasive data contamination – stemming from measurement errors, outliers, or adversarial corruption – has motivated the development of robust statistical methods. In this context, we propose a two-stage Adversarial Contamination-resistant Iterative Hard Thresholding (AC-IHT) algorithm for high-dimensional regression with contamination. Our nonconvex algorithm achieves minimax near-optimal (up to logarithmic terms) estimation by iteratively updating the coefficient vector and the contamination vector with different thresholding scales. We further demonstrate that our AC-IHT estimator is signal-adaptive: under proper signal conditions, it adaptively attains a sharper estimation rate and more accurate support recovery. Moreover, it enjoys the strong oracle property, laying a theoretical foundation for asymptotic inference. Numerical experiments confirm its superior finite-sample performance. Finally, we discuss theoretical extensions of the proposed procedure to generalized linear models and to heavy-tailed noise settings.
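To make the thresholding idea concrete, the sketch below implements plain iterative hard thresholding (IHT) for sparse linear regression. This is the classical building block only, not the two-stage AC-IHT procedure: it has no contamination vector and a single top-k thresholding rule, which is an assumption made for illustration. The toy design and the names `hard_threshold` and `iht` are hypothetical.

```python
def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def iht(x, y, k, step=1.0, iters=20):
    """Plain IHT for y = x beta + noise with at most k nonzero coefficients."""
    n, p = len(x), len(x[0])
    beta = [0.0] * p
    for _ in range(iters):
        resid = [y[i] - sum(x[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Negative gradient of 0.5 * ||y - x beta||^2 with respect to beta.
        descent = [sum(x[i][j] * resid[i] for i in range(n)) for j in range(p)]
        beta = hard_threshold([beta[j] + step * descent[j] for j in range(p)], k)
    return beta


# Toy design with orthonormal columns, so step=1.0 is safe.
x = [[0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.5, 0.5, -0.5], [0.5, -0.5, -0.5]]
y = [-0.5, -0.5, 2.5, 2.5]  # generated from beta = [2, 0, -3] without noise
beta_hat = iht(x, y, k=2)
```

On this noiseless toy problem the iteration recovers the sparse coefficient vector after the first step; a contamination-resistant variant would, per the abstract, also maintain and threshold a separate contamination vector.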
[LG-61] Sampling the Schwinger Model with Gauge-Equivariant Diffusion
Link: https://arxiv.org/abs/2606.27481
Authors: Octavio Vega,Aida X. El-Khadra
Categories: High Energy Physics - Lattice (hep-lat); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
Comments: Conference paper at PAI 2026. 6 pages, 1 figure
Abstract:We present a first study of a diffusion-based approach to accelerated sampling of the N_f = 2 lattice Schwinger model. Our work is inspired by recent and growing successes in developing such generative models for ensemble generation in LFT to overcome the well-known critical slowing down problem. We train a U(1)-equivariant score-based generative model to sample gauge link configurations from the marginal Schwinger model. By computing model likelihoods, we obtain unbiased estimates for observables that closely match those produced by MCMC simulations. We also demonstrate improvement over HMC as measured qualitatively by a reduction in topological freezing near critical parameters.
[LG-62] The Decision Geometry of Covariance Estimation for the Global Minimum-Variance Portfolio under Heavy Tails
Link: https://arxiv.org/abs/2606.27462
Authors: Xavier Fonseca
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
Comments: 19 pages, 1 figure
Abstract:The global minimum-variance portfolio (GMVP) is the canonical decision built from an estimated covariance matrix, yet covariance estimators are universally evaluated by matrix-norm loss, which is not the object the decision depends on. We characterise exactly how covariance-estimation error maps into GMVP suboptimality. We prove an exact regret identity and a non-asymptotic bound showing decision regret depends on the estimation error only through its action on the portfolio weights, scaled by portfolio concentration and the conditioning of the true covariance. From this we derive the decision geometry: GMVP regret is invariant to a (p-1)-dimensional projection of the p^2-dimensional error matrix, with invariance to the covariance-scale direction as an exact special case. We then apply the framework to heavy-tailed returns (tail index kappa in (2,4)), establishing the regret convergence rate implied by the centred operator-norm rate, and confirm the theory on a skew-t/t-copula simulation design with pre-registered analysis. The decision-focused advantage is a sharper constant and a concentration discount rather than a faster rate; we report an honest high-conditioning boundary of the rate prediction. The results complement recent decision-focused learning approaches by supplying the exact estimation geometry and consistency theory they lack.
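The scale invariance mentioned in the abstract can be checked numerically with the textbook GMVP formula w = Σ⁻¹1 / (1ᵀΣ⁻¹1). The sketch below is a toy under stated assumptions: regret is taken to be the excess true variance of the portfolio built from an estimated covariance, which may differ from the paper's exact definition, and the 2 x 2 matrices are made-up numbers.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][j] * z[j] for j in range(r + 1, n))) / m[r][r]
    return z


def gmvp_weights(cov):
    """Fully invested minimum-variance weights: cov^{-1} 1, normalized to sum to 1."""
    z = solve(cov, [1.0] * len(cov))
    total = sum(z)
    return [v / total for v in z]


def variance(w, cov):
    """Portfolio variance w^T cov w."""
    return sum(w[i] * cov[i][j] * w[j] for i in range(len(w)) for j in range(len(w)))


def regret(cov_hat, cov_true):
    """Assumed definition: extra true variance from using weights built on cov_hat."""
    return (variance(gmvp_weights(cov_hat), cov_true)
            - variance(gmvp_weights(cov_true), cov_true))


true_cov = [[0.04, 0.01], [0.01, 0.09]]
scaled = [[3.0 * v for v in row] for row in true_cov]  # wrong scale, same shape
shrunk = [[0.04, 0.0], [0.0, 0.09]]                    # ignores the correlation
```

Here `regret(scaled, true_cov)` is zero up to rounding even though `scaled` is far from `true_cov` in any matrix norm, while `regret(shrunk, true_cov)` is strictly positive; this is the gap between matrix-norm loss and decision loss that the abstract describes.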
[LG-63] Directed Graph Topology Inference via Graph Filter Identification
Link: https://arxiv.org/abs/2606.27455
Authors: Rasoul Shafipour,Andrei Buciulea,Santiago Segarra,Antonio G. Marques,Gonzalo Mateos
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Signal Processing (eess.SP)
Comments: 13 pages main body, 2 pages supplementary material. Submitted to the IEEE Transactions on Signal Processing
Abstract:We address the problem of inferring a directed network from nodal measurements generated by linear diffusion dynamics on the sought graph. Observations are modeled as the outputs of a graph convolutional filter, i.e., a polynomial (with unknown coefficients) of a local diffusion graph-shift operator encoding the latent graph topology, excited with an ensemble of independent graph signals with arbitrarily-correlated nodal components. Unlike prior efforts that considered undirected graphs and white signal excitations, here the graph-shift operator and the observations’ covariance matrix are not simultaneously diagonalizable. In this challenging context, we first rely on measurements of the output signals along with prior statistical information on the inputs to identify the diffusion filter. Such system identification problem involves solving a system of quadratic matrix equations, which we show is identifiable under spectral-diversity assumptions on the input covariances. For algorithmic purposes we recast it as a smooth quadratic minimization subject to Stiefel manifold constraints. Subsequent identification of the network topology given the graph filter estimate boils down to finding a sparse and structurally admissible shift that commutes with the given filter, thus, forcing the latter to be a polynomial in the sought graph-shift operator. A joint graph filter and topology identification algorithm is also proposed, which alternates between the aforementioned steps in a mutually reinforcing fashion to offer improved sample complexity. Numerical tests corroborate the effectiveness of the proposed algorithms in recovering synthetic digraphs and real-data case studies, and illustrate their potential utility on urban mobility analyses as well as portfolio optimization.
[LG-64] DFM: Difference Feature Modeling with Text-Guided Gated Contrastive Loss for Remote Sensing Image Change Captioning ICME2026
Link: https://arxiv.org/abs/2606.27410
Authors: Yelin Wang,Zijia Song,Chuanguang Yang,Miaoyu Wang,Zhulin An,Libo Huang,Yongjun Xu
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: Accepted by IEEE ICME 2026
Abstract:The primary goal of Remote Sensing Image Change Captioning (RSICC) is to automatically generate descriptions of changes between remote sensing images captured at different time points. Existing models still rely on a single autoregressive generation paradigm, which tends to prioritize learning easily generated vocabulary over capturing discriminative differences between images. To address this, we reframe the training paradigm and propose a novel Difference Feature Modeling (DFM) framework. Specifically, we introduce a Text-guided Gated Contrastive Loss (TGCL) to guide the vision encoder to extract critical features from a text-modal perspective. Additionally, we incorporate a pre-trained Change Detection model to transfer stable change detection knowledge. In order to further enhance the representation, we design a Joint Feature Modeling (JFM) module to achieve the fusion of multi-scale difference representations, thereby capturing comprehensive spatiotemporal variations between multi-temporal images. Extensive experiments on multiple datasets demonstrate the effectiveness of our approach.


