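This post lists the latest papers retrieved from arXiv.org on 2026-06-24. It is updated automatically and organized into six main areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from arXiv.org daily, with a scheduled automatic update at around 12:30 every morning.
Tip: if the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.
Contents
Overview (2026-06-24)
593 papers were updated today, including:
- Natural language processing: 94 papers (Computation and Language (cs.CL))
- Artificial intelligence: 198 papers (Artificial Intelligence (cs.AI))
- Computer vision: 130 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 110 papers (Machine Learning (cs.LG))
- Multiagent systems: 11 papers (Multiagent Systems (cs.MA))
- Information retrieval: 17 papers (Information Retrieval (cs.IR))
- Human-computer interaction: 27 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems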
[MA-0] Generating Realistic Individual Activity Schedules via Activity Location Allocation Based on Simulated Travel Times
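[Quick read]: This paper addresses the difficulty of generating individual daily activity schedules that are both realistic and consistent with the travel times reported in travel surveys. Existing methods typically synthesize schedules from publicly available population data and travel survey data, but because mobility modelling is complex, the travel-time distribution of the generated schedules often deviates markedly from survey results. The key to the solution is an iterative refinement framework based on dynamic programming: it repeatedly simulates travel times and reallocates activity locations accordingly, gradually narrowing the gap between simulated travel times and survey data. Numerical experiments with dummy data show that iterative refinement reduces the discrepancy between simulated and surveyed travel times by 52.2% relative to the first iteration.
Link: https://arxiv.org/abs/2606.24566
Authors: Tatsuya Mitomi, Yahya Gamal, Esra Suel, Gary Polhill, Daniel Konioukhov, Alison Heppenstall
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: 8 pages, 5 figures. This is the author version of a short paper accepted for presentation in the poster session at the 17th Conference on Spatial Information Theory (COSIT 2026)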
Abstract:Individual level daily activity schedules are essential for a wide range of applications, including infectious disease control, urban transportation planning, and policy design. In practice, such schedules are typically generated by combining population data with travel survey data. These data sources are used because they are often publicly available, whereas observed individual activity schedules are difficult to obtain due to privacy concerns. However, because of the complexity of mobility modelling, it is difficult to generate realistic activity schedules that also preserve travel times consistent with those reported in travel surveys. To address this issue, we propose a framework for generating activity schedules that iteratively applies a dynamic programming method to allocate activity locations based on simulated travel times. Numerical experiments with dummy data show that the proposed method reduces the discrepancy between simulated travel times and those reported in travel surveys by 52.2% relative to the first iteration through iterative refinement.
[MA-1] Age of LLM: A Strategic 1v1 Benchmark for Reasoning Diplomacy and Reliability of Large Language Models under Fog of War
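[Quick read]: This paper addresses how to evaluate the strategic reasoning and decision-making of large language models (LLMs) in complex, dynamic, highly adversarial environments, focusing on model behavior under uncertainty, incomplete information, and strictly enforced rules. The core challenge is that existing benchmarks struggle to capture a model's adaptability, belief tracking, and rule-following reliability in multi-agent games. The key to the solution is Age of LLM, a turn-based 1v1 adversarial benchmark played on a 13×7 grid with three deliberate stressors: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a strict reliability dimension (every turn must follow a fixed JSON schema, and illegal actions are silently discarded). The engine is private and each match uses a fresh random map seed and opponent, which mitigates the data contamination that affects public benchmarks. Models receive a near rule-only prompt with no build-order advice; only two tactical seed phrases were present during data collection. Findings: the nuclear rush dominates (78% on the rules-coherent sub-corpus, 85% corpus-wide), and its success stems from a mechanical advantage under secret-simultaneous launch rules rather than a cognitive deterrence failure; military conquest is rare but faster (12.3 vs 18.9 turns on average); diplomacy is frequent but almost never leads to a concluded agreement; about 58% of illegal actions stem from fog or state-tracking errors, making the illegal-action rate a measure of belief tracking; and, as an exploratory finding, reliability is weakly associated with winning. The corpus is small, unbalanced, and not side-swapped, so the ranking is only a preliminary descriptive result. Still, the turn-by-turn logs of actions and messages offer a lens on how LLMs reason under adversarial uncertainty, including belief tracking, spontaneous deception, and per-model cognitive "personas", which the author frames as a future research direction. The replay format, an isometric viewer, and all replays are released; the engine source is available on request.
Link: https://arxiv.org/abs/2606.24391
Authors: Arnaud Ricci
Affiliations: Independent researcher; Switzerland
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 25 pages including appendices, 8 figures, 4 tables; appendices include verbatim system prompt and engine resolution pseudocode. All correlations reported with p-values, 95% bootstrap confidence intervals and Spearman’s rho; includes a Steiger test and Bradley-Terry fit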
Abstract:We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded. The engine is private and each match uses a fresh random map seed and opponent, mitigating the data contamination that affects public benchmarks. Models receive a (near) rule-only prompt with no build-order advice (two tactical seed phrases were present during data collection; see Section 2.7). We benchmark 15 reasoning models across 54 matches and 5,258 actions. Findings: (1) the nuclear rush dominates (78% on the rules-coherent v0.11+ sub-corpus; 85% corpus-wide) with a sole-launcher signature that is largely mechanical under secret-simultaneous launch rules, not a cognitive deterrence failure; (2) military conquest is rare but faster (12.3 vs 18.9 turns); (3) diplomacy is prolific yet almost never consummated; (4) ~58% of illegal actions are fog/state errors, making the illegal-action rate a measure of belief-tracking; (5) – the least established, and the only one we label exploratory – a weak link associates reliability with winning. The corpus is small, unbalanced and not side-swapped, so the ranking is a preliminary descriptive view, not a contribution. Beyond ranking, the turn-by-turn traces of actions and messages make the corpus a lens on how LLMs reason under adversarial uncertainty – their belief-tracking, spontaneous deception, and per-model cognitive “personas” – which we frame as a future research direction. We release the replay format, an isometric viewer and all replays; engine source on request.
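The reliability stressor described in the abstract (a strict JSON schema per turn, with illegal actions silently discarded) can be illustrated with a minimal sketch. The benchmark engine is private, so the action names and the `actions`/`type` fields below are hypothetical placeholders, not the actual schema.

```python
import json

# Hypothetical action vocabulary; the real Age of LLM schema is not public.
ALLOWED_ACTIONS = {"move", "build", "attack", "message"}


def parse_turn(raw: str) -> list:
    """Return the well-formed actions of one turn, silently dropping the rest."""
    try:
        turn = json.loads(raw)
    except json.JSONDecodeError:
        return []  # an unparseable turn contributes no actions
    if not isinstance(turn, dict):
        return []
    actions = turn.get("actions")
    if not isinstance(actions, list):
        return []
    legal = []
    for action in actions:
        # Keep only objects whose "type" is a known action name.
        if (
            isinstance(action, dict)
            and isinstance(action.get("type"), str)
            and action["type"] in ALLOWED_ACTIONS
        ):
            legal.append(action)
    return legal


def illegal_action_rate(raw: str) -> float:
    """Share of submitted actions that were discarded (0.0 if none were submitted)."""
    try:
        turn = json.loads(raw)
        submitted = len(turn["actions"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0
    if submitted == 0:
        return 0.0
    return 1.0 - len(parse_turn(raw)) / submitted
```

A real engine would also check each action against the game state (for example, whether a unit targeted through fog of war actually exists), which is where the fog and state-tracking errors reported in the abstract would surface.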
[MA-2] Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy
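[Quick read]: This paper addresses a new bottleneck that large language models (LLMs) create in research production: as producing research artifacts becomes increasingly automated, the core challenge shifts from producing artifacts to judging the validity of scientific claims. The paper presents Agon, a research orchestration system that validates what can be checked programmatically inside the workflow and leaves the remaining judgments to human scientists. Its design rests on six principles: Prompt Economy, Future-Facing, Minimal Prompts, OmniDisciplinary, Massive Parallelism, and Zero-Code. Across 444 iterations of Prompt Economy loops, using only small starting topics and no human-written experimental code, Agon demonstrated scalability and exposed new classes of failure. These failures are organized into a four-dimensional taxonomy covering severity, fixability, visibility, and capability locus, which separates failures the system can detect and fix from those that require human judgment. Overall, the authors argue that Agon is pushing research toward a new paradigm: machine scales, human steers.
Link: https://arxiv.org/abs/2606.24177
Authors: Youran Sun, Xingyu Ren, Chugang Yi, Jiaxuan Guo, Kejia Zhang, Jianda Du, Haizhao Yang
Affiliations: University of Maryland, College Park; The Chinese University of Hong Kong; Stanford University
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: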
Abstract:Large language models are making research production scalable, shifting the bottleneck from producing artifacts to judging claims. We present Agon, a research orchestrator that validates what can be checked inside the workflow and leaves the remaining judgments to human scientists. Agon is built on six design principles: Prompt Economy, Future-Facing, Minimal Prompts, OmniDisciplinary, Massive Parallelism, and Zero-Code. We ran Agon across domains for 444 iterations of Prompt Economy loops, using only small starting topics and no human-written experimental code. These deployments demonstrate scalability while exposing new classes of failure. We organize these failures into a taxonomy along severity, fixability, visibility, and capability locus. The taxonomy separates failures the loops can see and fix from those that require human judgment. Together, these results show that Agon is pushing research toward a new paradigm: machine scales, human steers.
[MA-3] EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games ICML2026
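[Quick read]: This paper addresses a weakness of regularized policy gradient methods (such as PPO) used in self-play for two-player zero-sum imperfect-information games: when the uniform distribution is the regularization target, policy updates become inefficient and convergence suffers in settings with many strictly dominated strategies or hard exploration. The uniform distribution regularizes equally toward all actions regardless of their viability, which makes it hard for the algorithm to focus on high-value strategies in complex games. The key to the solution is EMAgnet, which uses an exponential moving average (EMA) of the last-iterate policy's parameters as a dynamic regularization target. This target adapts as the agent's strategy improves, guiding learning away from unviable actions. On both standard and modified benchmarks, EMAgnet achieves lower exploitability in the majority of tested environments, with consistent gains in games containing many strictly dominated strategies.
Link: https://arxiv.org/abs/2606.23995
Authors: Tristan Maidment, JB Lanier, Chase McDonald, Nathan Tsang, Eugene Vinitsky, Roy Fox, Albert Wang, Wesley N. Kerr
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: Accepted at NExT-Game 2026: New Frontiers in Game-Theoretic Learning (ICML 2026 Workshop). 13 pages, 2 figures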
Abstract:Recent work has established that regularized policy gradient methods such as PPO, when used in self-play, can match or exceed specialized game-theoretic algorithms for solving two-player zero-sum imperfect-information games. The uniform distribution has emerged as a strong policy regularization target for this purpose, but it regularizes equally toward all actions regardless of their viability. We introduce EMAgnet, which instead regularizes toward an exponential moving average (EMA) of the last-iterate policy’s parameters, providing an adaptive regularization target that evolves with the agent’s improving strategy. We evaluate EMAgnet on both standard two-player zero-sum benchmarks and modified benchmarks with exploration challenges and large numbers of strictly dominated strategies. Relative to PPO self-play with uniform-magnet regularization under both linear and power-law annealing schedules, EMAgnet achieves lower exploitability in the majority of tested environments, with consistent performance gains across games containing strictly dominated strategies.
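The idea of regularizing toward an EMA of the policy's own parameters, rather than toward the uniform distribution, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the policy is a plain softmax over a list of logits, the decay of 0.99 is arbitrary, and the KL direction used for the penalty is a guess.

```python
import math


def ema_update(ema_params, params, decay=0.99):
    """Move the EMA ("magnet") parameters a small step toward the current ones."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def kl_to_magnet(logits, magnet_logits):
    """KL(policy || magnet policy); assumed form of the regularization penalty."""
    policy = softmax(logits)
    magnet = softmax(magnet_logits)
    return sum(p * math.log(p / q) for p, q in zip(policy, magnet) if p > 0.0)


def uniform_magnet_kl(logits):
    """KL(policy || uniform), the baseline target the abstract contrasts with."""
    return kl_to_magnet(logits, [0.0] * len(logits))
```

In a training loop, the penalty would be added to the policy-gradient loss with some coefficient, and `ema_update` would be called after each optimizer step so that the target tracks the improving policy instead of pulling probability mass back toward dominated actions.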
[MA-4] Critique of Agent Model
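[Quick read]: This paper addresses the vagueness of the concept of an "agent" in current AI. With large language models (LLMs) widely marketed as "coding agents", "AI co-scientists", and other agentic tools, it has become essential to clarify where automation ends and genuine agency begins. The core question is under what conditions a system has real agency rather than being an automation tool orchestrated by external workflows. The key contribution is a theoretical distinction between agentic systems, whose competence resides in externally engineered workflows, and agentive systems, in which goal, identity, decision-making, self-regulation, and learning are internalized within the system so that capabilities arise endogenously. Building on this, the paper proposes the Goal-Identity-Configurator (GIC) architecture for a general-purpose agent model, combining hierarchical goal decomposition, identity evolution, simulative reasoning grounded in a separately trained world model, learned self-regulation, and self-directed learning from both real and simulated experience, with the aim of open-world autonomy. The paper also discusses how auditability, controllability, and safety can keep more autonomous systems under human oversight.
Link: https://arxiv.org/abs/2606.23991
Authors: Eric Xing, Mingkai Deng, Jinyu Hou
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: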
Abstract:What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as "coding agents", "AI co-scientists", and other "agentic" tools that promise to drive up productivity, and at the same time, "existential" concerns such as AI escaping human control with destructive power under a speculative "machine agency" against humans, it has become essential to clarify where automation ends and agency begins, both for building capable systems and for understanding whether and what to fear. Drawing on Descartes' grounding of agency in independent thought, and on portrayals of autonomous beings in science fiction, we survey the current landscape of AI agents, and analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning. Specifically, we argue that genuine agency requires these structures to be internalized within the system itself rather than assembled through external scaffolding. This distinction between agentic systems, whose competence resides in engineered workflows, and agentive systems, whose capabilities (including social interaction) arise endogenously, defines the boundary between systems designed for prescribed tasks, and those capable of operating in the open world with true autonomy. Building on this analysis, we propose the Goal-Identity-Configurator (GIC) architecture for a general-purpose agent model, combining hierarchical goal decomposition, identity evolution, simulative reasoning grounded in a separately trained world model, learned self-regulation, and self-directed learning from both real and simulated experience. Furthermore, we share insight on the auditability, controllability, and safety of agentive systems that possess greater autonomy and "agency", but remain under human oversight.
[MA-5] Maestro Order: A Model-Agnostic Orchestration Harness
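[Quick read]: This paper addresses the core problem that a single forward pass of a capable model is an unreliable problem-solver: its output is fluent and often right, but wrong often enough to be dangerous (in language models, such confident errors are known as hallucinations). The paper proposes Maestro Order, a model-agnostic orchestration harness that composes unreliable base solvers into a reliable problem-solving system using four structural primitives (decompose, ensemble, verify, recurse) and a budget-aware controller. The harness treats any model as a black-box solver behind a uniform interface, adds a verifier ensemble whose discrimination is measured online, and allocates verification and voting to the stages with the highest marginal reliability per unit cost; deterministic design, observable state and message handling, and fault-tolerant engineering make the system reproducible, monitorable, and robust. Results from a Monte Carlo simulation show that verification amplifies reliability geometrically (for example, from 0.55 to 0.98 with two gates and to 0.999 with four), that voting helps only above chance and is limited by shared errors, and that the budget-aware controller reaches a target reliability at a small fraction of the cost of voting alone by choosing the cheapest mechanism for each regime. The paper closes with failure modes (verifier gaming, correlated errors, and decomposition error compounding) and concrete guidance: build robust checkers, diversify solvers, and let the controller put compute where the information is.
Link: https://arxiv.org/abs/2606.23983
Authors: Hidayet Aksu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: 10 pages, 4 figures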
Abstract:A single forward pass of a capable model is a fast, fluent, and unreliable problem-solver: it is right often enough to be useful and wrong often enough to be dangerous; in language models, such confident errors are known as hallucinations. We present Maestro Order, a model-agnostic orchestration harness that turns unreliable solvers into reliable problem-solving systems by composing them according to four structural primitives (decompose, ensemble, verify, and recurse) and a budget-aware controller that decides where to spend compute. The harness treats any model as a black-box base solver behind a uniform interface, layers a verifier ensemble whose discrimination is measured online, and allocates verification and voting to the stages with the highest marginal reliability per unit cost. We give the architecture, the message and state schema, the controller algorithm, and the engineering that makes it deterministic, observable, and fault-tolerant. We then specify an evaluation methodology (reliability at fixed cost, coverage, calibration, and ablations) and report results from a faithful Monte Carlo simulation of the harness over a parameterized solver/verifier model. The simulation reproduces the predicted laws quantitatively: verification amplifies reliability geometrically (e.g. 0.55 → 0.98 with two gates, → 0.999 with four), voting helps only above chance and is limited by shared errors, and a budget-aware controller reaches a target reliability at a small fraction of the cost of voting alone by selecting the cheapest mechanism for each regime. We close with failure modes (verifier gaming, correlated errors, and decomposition error compounding) and concrete guidance: build robust checkers, diversify solvers, and let the controller put compute where the information is.
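One simple way to see why stacked verification can amplify reliability geometrically is Bayes' rule: each independent gate multiplies the odds of a correct answer by the verifier's likelihood ratio. The sketch below is an assumption-laden illustration rather than the paper's solver/verifier model; the pass rates of 0.95 for correct answers and 0.15 for wrong ones are hypothetical values chosen only because they land near the figures quoted in the abstract.

```python
def reliability_after_gates(prior, pass_if_correct, pass_if_wrong, gates):
    """Probability an answer is correct given it passed `gates` independent checks.

    prior           -- reliability of the base solver before any verification
    pass_if_correct -- chance a gate accepts a correct answer (hypothetical)
    pass_if_wrong   -- chance a gate accepts a wrong answer (hypothetical)
    """
    good = prior * pass_if_correct ** gates
    bad = (1.0 - prior) * pass_if_wrong ** gates
    return good / (good + bad)


if __name__ == "__main__":
    for k in (0, 2, 4):
        value = reliability_after_gates(0.55, 0.95, 0.15, k)
        print(f"{k} gates: {value:.4f}")
```

With these assumed rates the function gives roughly 0.980 after two gates and roughly 0.9995 after four, starting from 0.55. The independence assumption is what the abstract's warning about correlated errors and verifier gaming undermines: if gates share failure modes, the odds stop multiplying.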
[MA-6] Welfarist Control Design – How to fulfill the societal mandate in multi-agent control?
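[Quick read]: This paper addresses how control engineers should set societal values and ethical criteria as the allocation of scarce resources in socio-technical systems (highway lanes, grid capacity, water rights, and so on) becomes increasingly automated. Current design decisions are driven largely by industry norms and conventions rather than by systematic, responsible ethical design. The approach considers three control design paradigms: online feedback optimization, control of Markov decision processes, and model predictive control. It begins by aggregating individual agents' preferences into control design objectives and then ensures and certifies that those specifications are fulfilled. The authors argue that the feedback nature of control systems enables allocation of shared resources that better matches the societal mandate, moving automated system design from convention-driven practice toward principled, certifiable design.
Link: https://arxiv.org/abs/2606.23931
Authors: Sophie Hall, Kai Zhang, Ilia Shilov, Heinrich H. Nax, Saverio Bolognani
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: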
Abstract:At the core of most socio-technical systems lies a scarce resource that is allocated among agents: highway lanes, public transit, road space, water rights, energy access, grid capacity, user attention, pollution rights, etc. With further automation of the underlying allocation processes, control engineers are increasingly tasked to make decisive assumptions regarding what society wants. In practice to date, design choices are largely driven by industry norms and conventions rather than a result of conscientiously responsible and ethical design. In this paper, we look at tools available to control engineers to design systems in a more principled manner in order to match the societal mandate. We consider three control design paradigms: online feedback optimization, control of Markov decision processes, and model predictive control. Beginning with aggregating individual agents’ preferences into control design objectives, subsequently ensuring and certifying the fulfillment of those specifications, we argue that the feedback nature of control systems enables appropriate allocation of the shared resources in ways hitherto unparalleled.
[MA-7] Decentralized Coordination of Autonomous Traffic Through Advanced Air Mobility Corridors
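[Quick read]: This paper addresses the efficient integration of Advanced Air Mobility (AAM) traffic into existing airspace, in particular the concern that operations based on dedicated corridors may be inefficient. The conventional belief is that corridor-based operations are inefficient without centralized traffic management; this paper challenges that view empirically. The key finding is that, in fully decentralized settings with only local information, autonomous aircraft can learn to self-organize into corridor flows, conforming to corridor boundaries more than 94% of the time and reaching their goals efficiently. In low- and medium-density scenarios, tactical interventions to maintain separation are needed only infrequently; they become more frequent when traffic density is high. The results support the feasibility of efficient, safe autonomous coordination in decentralized airspace management.
Link: https://arxiv.org/abs/2606.23832
Authors: Jasmine Jerry Aloor, Hamsa Balakrishnan
Affiliations: Stanford University; AIAA (American Institute of Aeronautics and Astronautics)
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: Presented at the AIAA SciTech 2026 Forum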
Abstract:The use of dedicated corridors for Advanced Air Mobility (AAM) traffic is one of the most commonly proposed pathways to integrating them into existing airspace operations. Most prior research has focused on the design of networks of AAM corridors and conflict resolution for aircraft within corridors. It is also generally believed that while attractive from an implementation perspective, corridor-based operations may be inefficient, especially in the absence of centralized traffic management. In this paper, we show that contrary to this belief, it is possible for autonomous aircraft to learn to self-organize into corridor flows in decentralized settings. We illustrate our approach using scenarios in which fixed-wing aircraft need to safely and efficiently traverse (1) a single corridor with metering after the exit, (2) a sequence of two consecutive corridors, and (3) a corridor that splits into two. We find that in decentralized settings with only local information, the aircraft are able to conform to the corridor boundaries more than 94% of the time and reach their goal in a relatively efficient manner. Furthermore, tactical interventions to handle violations of the separation minimum are needed only infrequently in low- and medium-density settings. However, such tactical interventions become more frequently necessary only when traffic density is high. Related DOI: https://doi.org/10.2514/6.2026-0236
[MA-8] From Task-Guided Conversational Graphs to Goal-Oriented Dialogue Runtimes
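[Quick read]: This paper addresses the difficulty of maintaining conversational continuity in large language model (LLM) workflows for complex, multi-objective, interruptible dialogues. When a user maintains several interdependent goals, those goals can be suspended, resumed, revised, or invalidated by actions in other goals, and graph-based or multi-agent orchestration frameworks do not by themselves guarantee that goal state stays consistent and continuous. The key to the solution is the Goal-Oriented Dialogue Runtime (GODR), a framework-neutral design pattern that treats goals, task frames, lifecycle state, invalidation rules, and resumption contracts as first-class runtime objects while delegating bounded execution to graph runtimes, agents, tools, or APIs. GODR is not meant for simple guided processes; it targets complex, multi-domain, interruptible conversations where objective continuity cannot be recovered reliably from agent identity, chat history, or execution-graph position alone. The paper formalizes the problem, proposes runtime objects and architecture-selection criteria, and frames evaluation as an agenda for future empirical validation rather than as a measured performance claim.
Link: https://arxiv.org/abs/2606.23797
Authors: Mariano Garralda-Barrio
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 21 pages, 7 figures, 10 tables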
Abstract:Graph and multi-agent orchestration frameworks make production large language model (LLM) workflows practical, but they do not by themselves solve conversational continuity when users maintain several interdependent objectives. This conceptual systems paper focuses on the high-complexity end of that design space, where goals can be suspended, resumed, revised, and invalidated by actions in other goals. We introduce the Goal-Oriented Dialogue Runtime (GODR), a framework-neutral design pattern that treats goals, task frames, lifecycle state, invalidation rules, and resumption contracts as first-class runtime objects while delegating bounded execution to graph runtimes, agents, tools, or application programming interfaces (APIs). GODR is not proposed as a replacement for workflow graphs in simple guided processes; it is intended for complex, multi-domain, interruptible conversations where objective continuity cannot be recovered reliably from agent identity, chat history, or execution-graph position alone. The paper formalizes the problem, proposes runtime objects and architecture-selection criteria, and frames evaluation as an agenda for future empirical validation rather than as a measured performance claim.
[MA-9] Emergent Relational Order in LLM Agent Societies: From Collective Affect to Authority Stratification ACL2026
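[Quick read]: This paper seeks a mechanistic explanation of Fei Xiaotong's Differential Order Pattern, in particular how it can emerge from individual interactions and sustain long-term social order. Prior work has mostly treated it as a culturally specific phenomenon without systematically modelling or empirically testing its underlying dynamics, and existing LLM-based simulations focus on short-term coordination rather than long-horizon social structure. The paper proposes CAREB-MAS, a multi-agent system grounded in Affect Control Theory, Social Identity Theory, and Durkheimian collective affect. Agents reason through an emotion-ethics-belief chain and maintain dynamically evolving egocentric identities, while the macro environment specifies only individual production, preference-based allocation, and minimal interaction protocols. In long-horizon simulations, five core Differential Order phenomena emerge spontaneously: stable labor specialization, guanxi-based economic ethics, cooperation that decays with social distance, emergent relational authority, and clan-based center-periphery stratification. These patterns shift with production structure, from kin-centered integration toward greater functional interdependence. The results support interpreting Differential Order as a structure-sensitive emergent outcome of general social mechanisms, and suggest that LLM-based multi-agent simulation offers an interdisciplinary framework for studying social structure and change.
Link: https://arxiv.org/abs/2606.23764
Authors: Zhiyuan Ji, Xinyu Chen, Ziqi Dai, Shiyun Tang, Chunyu Wei, Yueguo Chen
Affiliations: Renmin University of China; Beihang University; Minzu University of China
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 37 pages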
Abstract:Fei Xiaotong’s Differential Order Pattern characterizes rural society as egocentric and relationally graded, with cooperation attenuating over social distance. Although often treated as culturally specific, its mechanistic basis remains under-operationalized, and prior LLM-based simulations have mainly addressed short-term coordination rather than long-horizon social structure. We propose CAREB-MAS, a multi-agent framework grounded in Affect Control Theory, Social Identity Theory, and Durkheimian collective affect. Agents reason through an emotion-ethics-belief chain and maintain dynamically evolving egocentric identities, while the macro environment specifies only individual production, preference-based allocation, and minimal interaction protocols. Across long-horizon simulations, agents spontaneously reproduce five core Differential Order phenomena: stable labor specialization, guanxi-based economic ethics, relational decay of cooperation, emergent relational authority, and clan-based center-periphery stratification. These patterns shift with production structure from kin-centered integration toward greater functional interdependence. Extensive experiment results support interpreting Differential Order as a structure-sensitive emergent outcome of general social mechanisms, with LLM-based multi-agent simulation providing an interdisciplinary framework for studying social structure and change.
[MA-10] Engineering Reliable Autonomous Systems: Challenges and Solutions
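[Quick read]: This paper addresses the reliability of autonomous systems in real engineering practice: as such systems become more prevalent, building them to be trustworthy, verifiable, and safe is a central challenge. The core problem is a gap between techniques and practice: academia has developed mature verification and validation (V&V) methods and software architecture approaches, but they are not yet regularly used in industry. Drawing on formal methods, multiagent systems, and software engineering, the paper presents a roadmap that catalogues the challenges in verification and validation of autonomous systems, engineering real-world autonomous systems, and software architectures for safe autonomous systems, and distinguishes challenges that existing techniques can already address from open problems that require further research. The roadmap is intended to support future research and collaboration between academia and industry.
Link: https://arxiv.org/abs/2606.23760
Authors: Marie Farrell, Matt Luckcuck, Angelo Ferrando, Rafael C. Cardoso, Natasha Alechina, Marco Autili, Diana Benjumea Hernandez, Luciana Brasil Rebelo dos Santos, Daniela Briola, Ana Cavalcanti, Christian Colombo, Louise A. Dennis, Clare Dixon, Michael Fisher, Mario Gleirscher, Taylor Johnson, Charles Lesire, Livia Lestingi, Sven Linker, Brian Logan, Colin Paterson, Fabio Papacchini, Patrizio Pelliccione, Pedro Ribeiro, Maike Schwammberger, Silvia Lizeth Tapia Tarifa, Hazel Taylor, Jim Woodcock, Mengwei Xu, Yi Yang, Huan Zhang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: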
Abstract:Engineering reliable autonomous systems is an important and growing topic in computer science. As autonomous systems become more prevalent, easy-to-use techniques for building them reliably are increasingly important. This workshop report captures and expands on the discussions at the Lorentz Center Workshop “Engineering Reliable Autonomous Systems” (ERAS), held from 10 to 14 June 2024. The workshop was co-organised by the organisers of the Workshop on Formal Methods for Autonomous Systems (FMAS) and the Workshop on Agents and Robots for reliable Engineered Autonomy (AREA). It brought together members of the FMAS and AREA communities, industry practitioners, and representatives from sectors where autonomous systems pose distinctive engineering challenges. The workshop focused on three main research topics: techniques for verification and validation of autonomous systems; engineering real-world autonomous systems; and software architectures for safe autonomous systems. Its main outcome is a catalogue of challenges in these areas and, most importantly, a pathway to solutions. Some challenges can already be tackled by techniques that are well known in academia but have not yet become regularly used in practice. Other challenges remain unresolved and require further research. This roadmap is intended to support future research and industrial collaboration.
Natural Language Processing
[NLP-0] Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models
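[Quick read]: This paper addresses the limited performance of encoder-decoder pre-trained language models in few-shot settings, with a focus on commonsense knowledge retrieval and completion. The core challenge is how to exploit diverse pre-training objectives to improve the model on generation and question answering tasks. The key to the solution is the Match Task to Objective (MTO) framework, which provides methods for determining the appropriate pre-training objective for a given task and automated methods for preparing task-related data for unsupervised adaptation. In the fine-tuning stage, novel templates aligned with the pre-training and adaptation objectives are designed. When aligned with task requirements, these strategies achieve a performance gain of over 120% compared with conventional methods in few-shot settings and exceed the baseline even in full-dataset scenarios. The approach also extends to prompt-tuning, where it offers guidance for soft prompt engineering and optimization and significantly improves prompt-tuning performance.
Link: https://arxiv.org/abs/2606.24841
Authors: Ahmad Pouramini, Hesham Faili
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: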
Abstract:Prompt-based learning has emerged as a dominant paradigm in natural language processing. This study explores the impact of diverse pre-training objectives on the performance of encoder-decoder pre-trained language models across generation and question answering tasks, with a focus on commonsense knowledge retrieval and completion. We highlight the benefits of incorporating multiple objectives during both pre-training and fine-tuning stages. We introduce the Match Task to Objective (MTO) framework and methods for determining the appropriate objective for a given task. This framework offers automated methods to prepare task-related data for adaptation through unsupervised training, based on the identified objective. In the fine-tuning stage, we design novel templates that align with the objectives of the pre-training and adaptation stages. When aligned with task requirements, these strategies can achieve a performance gain of over 120% compared to conventional methods in few-shot settings. They significantly outperform related works in few-shot settings and exceed the baseline even in full-dataset scenarios. Furthermore, we extend this approach to include prompt-tuning methodologies, providing guidance for more effective soft prompt engineering and optimization. Our strategies significantly enhance prompt-tuning performance as well. These insights hold substantial value, precisely guiding the selection and optimization of models customized for specific tasks. Code is available at this https URL
[NLP-1] Less is More: Quality-Aware Training Data Selection for Scientific Summarization
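[Quick read]: This paper addresses two core problems in scientific long-document summarization: first, author-written abstracts are treated as gold reference summaries although their quality and alignment with the source article vary considerably; second, publicly available scientific summarization datasets are limited in scale and structure for modern long-context models. The solution has two parts: (a) constructing and releasing one of the largest biomedical and life-science datasets for long-document summarization, containing 1.88 million PMC articles, and (b) analyzing the quality of author-written abstracts with source-grounded and model-based metrics. The study finds that author-written abstracts vary in their alignment with the full article and that these quality signals can guide training-data selection. Training on selected high-quality subsets outperforms random sampling at matched training sizes and can match or exceed larger random subsets on factuality-oriented metrics. The results suggest that reference quality is an important factor in scientific summarization and that quality-aware data selection can improve training efficiency.
Link: https://arxiv.org/abs/2606.24828
Authors: Maria Nefeli Paraskevopoulou, Tatiana Passali, Grigorios Tsoumakas
Affiliations: Aristotle University of Thessaloniki
Categories: Computation and Language (cs.CL)
Comments: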
Abstract:Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific summarization datasets remain limited in scale and structure for modern long-context models. In this work, we address both challenges by a) constructing and releasing one of the largest biomedical and life science datasets for long-document summarization, containing 1.88 million PMC articles, and b) analyzing the reference quality of author-written abstracts with source-grounded and model-based metrics. We show that author-written abstracts vary in their alignment with the full article and that these quality signals can guide training-data selection. Training on selected high-quality subsets outperforms random sampling at matched training sizes and can match or exceed larger random subsets on factuality-oriented metrics. Our findings suggest that reference quality is an important factor in scientific summarization and that quality-aware data selection can improve training efficiency.
[NLP-2] L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models
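[Quick read]: This paper addresses the long-standing scarcity of NLP resources for Marathi, especially the lack of high-quality annotated corpora and standardised evaluation benchmarks. Marathi's rich morphology, relatively free word order, lack of capitalisation conventions, and pervasive code-mixing with Hindi and English make computational modelling difficult. The paper introduces L3Cube-MahaPOS, a manually annotated Marathi part-of-speech (POS) tagging dataset of 32,354 sentences drawn from news text, annotated with a 16-tag scheme aligned with Universal Dependencies. A structured preprocessing pipeline covering Unicode normalisation, Devanagari-aware tokenisation, and noise filtering ensures label consistency across all splits. The dataset is benchmarked across six model families (HMM, CRF, BiLSTM, BiLSTM+CharCNN, MuRIL, and the Marathi-specific MahaBERT-v2); the best system reaches 88.67% token-level accuracy and a macro-F1 of 81.67% over 15 evaluated tag classes. The dataset, annotation guidelines, and trained model checkpoints are released to support Marathi NLP research.
Link: https://arxiv.org/abs/2606.24825
Authors: Hariom Ingle, Ronit Ghode, Ishwari Gondkar, Jidnyasa Harad, Raviraj Joshi
Affiliations: L3 Cube Labs, Pune, India; Indian Institute of Technology Madras, Chennai, India; Department of Information Technology, PICT, Pune, India
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: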
Abstract:Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most spoken languages worldwide, it remains severely under-resourced in annotated corpora and standardised evaluation benchmarks. Marathi presents unique challenges for computational modelling owing to its rich morphology, relatively free word order, lack of capitalisation conventions, and pervasive code-mixing with Hindi and English. We introduce L3Cube-MahaPOS, a gold-standard POS tagging dataset for Marathi comprising 32,354 manually annotated sentences drawn from news text. Annotation was performed entirely manually by a team of Marathi-proficient annotators following a 16-tag Universal Dependencies-aligned scheme. A structured preprocessing pipeline covering Unicode normalisation, Devanagari-aware tokenisation, and noise filtering ensures label consistency across all splits. We benchmark the dataset across six model families spanning HMM, CRF, BiLSTM, BiLSTM+CharCNN, MuRIL, and the Marathi-specific transformer MahaBERT-v2. The best system achieves 88.67% token-level accuracy and a macro-F1 of 81.67% over 15 evaluated tag classes. We release the dataset, annotation guidelines, and trained model checkpoints to foster further research in Marathi NLP.
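The gap between the two headline numbers (token-level accuracy versus macro-F1) comes from how each metric weights tag classes. The following is a minimal sketch of both metrics over flat tag sequences; it is not the authors' evaluation script, and averaging over the tags that occur in the gold data is an assumption made for this illustration.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1, so rare tags count as much as frequent ones.

    Averages over tags that appear in `gold` (an assumption of this sketch).
    """
    scores = []
    for tag in sorted(set(gold)):
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
        fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["NOUN", "VERB", "NOUN", "ADJ"]
    pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
    print(token_accuracy(gold, pred), macro_f1(gold, pred))
```

In the toy example, one missed VERB costs a quarter of the accuracy but a full third of the macro-F1, because VERB's F1 drops to zero. The same effect explains how a tagger can score noticeably lower on macro-F1 than on accuracy when infrequent tags are harder.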
[NLP-3] SHERLOC: Structured Diagnostic Localization for Code Repair Agents
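Quick read: This paper tackles the compute that LLM agents waste on inefficient fault localization in repository-level coding tasks. Existing localization frameworks do find faulty locations, but they are evaluated as file retrieval and return locations without the diagnostic context a repair agent needs, which limits downstream repair. The proposed solution is SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework that pairs a reasoning LLM with compact repository tools and self-recovery, with no fine-tuning or multi-agent orchestration. SHERLOC reaches state-of-the-art localization across model scales (84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified), and at roughly 30B parameters it matches or outperforms other agentic methods. Injecting SHERLOC's locations and diagnostic findings into repair agents raises the resolve rate on SWE-Bench Verified by 5.95 percentage points on average, while cutting localization tokens by 36.7% and total tokens by 23.1%.
Link: https://arxiv.org/abs/2606.24820
Authors: Hovhannes Tamoyan,Sean Narenthiran,Erik Arakelyan,Mira Mezini,Boris Ginsburg
Affiliations: NVIDIA; TU Darmstadt; hessian.AI; National Research Center for Applied Cybersecurity ATHENE
Categories: Computation and Language (cs.CL)
Comments: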
Abstract:LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact repository tools and self-recovery, without fine-tuning or multi-agent orchestration. SHERLOC reaches state-of-the-art localization across model scales: 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified; at ~30B parameters, it matches or outperforms other agentic methods. Injecting our locations and diagnostic findings into repair agents yields, on average, +5.95 pp resolve rate on SWE-Bench Verified while cutting localization and total tokens by 36.7% and 23.1%.
[NLP-4] Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce
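Quick read: This paper challenges the narrow role that commercial NLP gives shopping chatbots, which are treated as recommenders or conversion tools whose job is to match a user to a catalogue entry and close a sale. With the arrival of agent-native micro-payment rails (e.g., x402, AP2), what is scarce changes: when the buyer is an autonomous agent that can investigate exhaustively, the bottleneck becomes acquiring trustworthy, decision-relevant information. The authors therefore recast agentic e-commerce as a micro-transaction market for verified information, in which buyer agents spend fractions of a cent to progressively unlock seller- and reviewer-supplied data (service histories, third-party test reports, bills of materials, audited sales and support metrics), paid for a la carte under a freemium model, with reviewer trust scored by reputation. The central claim is that such a market rewards genuine product quality and yields truer competition than ranking-based storefronts. The vision is then translated into concrete NLP problems: cost-optimal information acquisition, data pricing and negotiation, real-time entity resolution, grounded value exchange, and privacy-preserving persona modelling, which the authors argue deserve more attention than chat fluency.
Link: https://arxiv.org/abs/2606.24783
Authors: Filippos Ventirozos,Matthew Shardlow
Affiliations: Manchester Metropolitan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure. Vision paper, under review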
Abstract:Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2) changes what is scarce. When the buyer is an autonomous agent that can investigate exhaustively, the bottleneck is no longer matching products but acquiring trustworthy, decision-relevant information about them. We envision agentic e-commerce as a micro-transaction market for verified information: buyer agents spend fractions of a cent to progressively unlock seller- and reviewer-supplied data – service histories, third-party test reports, bills of materials, audited sales and support metrics – paid for a la carte under a freemium model, with reviewer trust scored reputationally. We sketch the architecture of such a market and argue that it rewards genuine product quality and yields truer competition than ranking-based storefronts. We then translate the vision into concrete NLP problems – cost-optimal information acquisition, data pricing and negotiation, real-time entity resolution, grounded value exchange, and privacy-preserving persona modelling – and argue that these, not chat fluency, deserve the field’s attention.
[NLP-5] Posterior Refinement: Fast Language Generation via Any-Order Flow Maps
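Quick read: This paper addresses the quality loss that non-autoregressive models suffer when generating many tokens at once. Masked Diffusion Models (MDMs) degrade because of factorization error, while Flow Map Language Models (FMLMs) achieve excellent few-step generation but give up inference-time flexibility. The proposed FMLM+ framework equips FMLM with masking-style noise schedules, so that while generating the full sequence in a single step it also scores the global consistency of each token a posteriori. On top of this, the authors propose Posterior Refinement, an inference-time strategy that lets the model adaptively correct its own outputs and match discrete baselines with 32x fewer function evaluations (NFEs). Across diverse benchmarks, FMLM+ with Posterior Refinement improves the speed-quality tradeoff over both the MDM and FMLM families.
Link: https://arxiv.org/abs/2606.24773
Authors: Manan Agarwal,Sheel Shah,Chanhyuk Lee,Jaehoon Yoo,Jerry Huang,Seunghoon Hong,Aditi Raghunathan,Jinwoo Kim,Nicholas M. Boffi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 23 figures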
Abstract:Non-autoregressive generation offers a powerful paradigm for iterative refinement, allowing models to recursively critique, erase and regenerate arbitrary subsets of tokens. However, existing non-autoregressive models fail to realize this potential. Masked Diffusion Models (MDMs) suffer from factorization error, causing sample quality to collapse when generating multiple tokens simultaneously. Flow Map Language Models (FMLMs) circumvent this bottleneck via joint sequence transport for excellent few-step generation, but sacrifice the inference-time flexibility of MDMs. We introduce FMLM+, a framework that bridges this gap by equipping FMLM with masking-style noise schedules. While generating the full sequence in a single step, FMLM+ simultaneously scores the global consistency of each token a posteriori. We leverage this to introduce Posterior Refinement, a novel inference-time refinement strategy that enables the model to adaptively self-correct its outputs, matching the performance of discrete baselines with 32x fewer NFEs. Across diverse benchmarks, we demonstrate that FMLM+ with Posterior Refinement improves the speed–quality tradeoff over both MDM and FMLM families, providing a scalable foundation for high-fidelity language modeling.
[NLP-6] CANDLE: Character-level Arabic Noise Deduplication using Lightweight Encoder
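Quick read: This paper addresses noise from repeated characters in Arabic text, that is, telling the correct spelling of a word apart from the informal character elongation common on social media. Earlier approaches depend on handcrafted rules, dictionaries, or morphological analyzers, all of which CANDLE avoids. CANDLE is a lightweight character-level system built on Connectionist Temporal Classification (CTC), a formulation the authors say has not previously been explored for character deduplication; it frames normalization as a sequence alignment problem over a character-based encoder. On three benchmarks (clean newspaper text, manually curated ambiguous cases, and real-world social media text), the CTC model reaches a Sentence Error Rate (SER) as low as 5.37% and clearly outperforms a classification-based baseline. To cut inference overhead, the 6-layer CTC model is distilled into a 2-layer student, a 3x reduction in depth with minimal performance loss. Normalization also helps downstream: tokenizer fertility drops by up to 12.8% (relative) across a range of Arabic LLM tokenizers, lowering inference cost and improving context window utilization. Code and models are publicly released.
Link: https://arxiv.org/abs/2606.24758
Authors: Faris Alasmary,Taif Nono,Orjuwan Zaafarani,Kholood Al Tabash,Ahmad Ghannam,Anas Salamah,Shouq Sadah,Lahouari Ghouti
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: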
Abstract:Handling repeated characters in text can be tricky, since they can represent either the correct spelling of a word or informal character elongation often seen in social media posts. We present CANDLE, a lightweight system for character-level Arabic noise deduplication that addresses this challenge without relying on handcrafted rules, dictionaries, or morphological analyzers. At the heart of CANDLE is a novel application of Connectionist Temporal Classification (CTC) to this task, a formulation not previously explored for character deduplication, which frames normalization as a sequence alignment problem over a character-based encoder. Evaluated on three benchmarks spanning clean newspaper, manually curated ambiguous cases, and real-world social media text, the CTC model achieves a Sentence Error Rate (SER) as low as 5.37% and consistently outperforms a classification-based baseline by a large margin. To reduce inference overhead, we distill the 6-layer CTC model into a 2-layer student, achieving a 3× depth reduction with minimal performance degradation. Beyond deduplication accuracy, normalization yields a practical downstream benefit: a relative reduction in tokenizer fertility of up to 12.8% across a diverse set of Arabic LLM tokenizers, directly lowering inference costs and improving context window utilization. We release all code and models publicly to support reproducibility and advance future research (this https URL).
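Why CTC is a natural fit for deduplication can be seen from its standard decoding rule: consecutive repeated labels are merged, then blanks are dropped, so a genuine double letter survives only if the model places a blank between the two copies. The sketch below shows that generic collapse rule only. It is not CANDLE's code; the `_` blank symbol and the Latin-script examples are assumptions chosen for readability.

```python
BLANK = "_"  # assumed blank symbol for this illustration


def ctc_collapse(path):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    output = []
    previous = None
    for symbol in path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if symbol != previous and symbol != BLANK:
            output.append(symbol)
        previous = symbol
    return "".join(output)


# Elongation with no blank between repeats collapses to a single letter.
print(ctc_collapse(list("sooo")))                    # so
# A blank between two copies keeps a legitimate double letter.
print(ctc_collapse(["g", "o", BLANK, "o", "d"]))     # good
```

Under this rule the model's only decision for each run of repeated characters is where, if anywhere, to emit a blank, which is how a per-character encoder can output a shorter normalized string.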
[NLP-7] CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation ICASSP
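Quick read: This paper addresses mispronunciation by text-to-speech (TTS) systems of the dense written forms found in Chinese news text, such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. These forms are frequent in real listening workflows, and a TTS system that reads them wrongly changes the spoken meaning. The solution is CN-NewsTTS Bench v0.1, an open target-level benchmark that tests pronunciation from raw text, without user-side rules, LLM rewriting, SSML hints, or manual edits. The release contains a 200-record development set, an 800-record public test set, 992 auto-evaluable targets, fixed transcripts from a three-ASR ensemble, an automatic target scorer, and initial results for seven product TTS systems. Additional analyses cover ASR-route diagnostics, ASR-subset ablations, category-level results, confidence intervals, and provider configuration metadata. The best system reaches 0.879 strict accuracy, while several systems remain below 0.60.
Link: https://arxiv.org/abs/2606.24714
Authors: Shijun Luo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure, 8 tables. ICASSP-style preprint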
Abstract:Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. These forms are frequent in real listening workflows, and a text-to-speech (TTS) system can preserve the written string while changing the spoken meaning. We introduce CN-NewsTTS Bench v0.1, an open target-level benchmark for evaluating whether Chinese news TTS products pronounce such targets correctly from raw text, without user-side rules, LLM rewriting, SSML hints, or manual edits. The release contains a 200-record development set, an 800-record public test set, 992 public auto-evaluable targets, fixed transcripts from a three-ASR ensemble, an automatic target scorer, and initial results for seven product TTS systems. We additionally report ASR-route diagnostics, ASR-subset ablations, category-level results, confidence intervals, and provider configuration metadata. The best system reaches 0.879 strict accuracy, while several systems remain below 0.60.
[NLP-8] DREAM: Dense Retrieval Embeddings via Autoregressive Modeling
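Quick read: This paper addresses the difficulty of obtaining supervision for dense retrieval: contrastive training needs labeled positive and negative document pairs, which are costly to collect. The core idea is to use the autoregressive next-token prediction objective of a large language model (LLM) as the supervision signal. DREAM (Dense Retrieval Embeddings via Autoregressive Modeling) injects retriever-generated query-document similarity scores into selected attention heads of a frozen LLM, so that the scores determine how much attention each candidate document receives while the LLM predicts the target output. The resulting prediction loss sends gradients back through the attention mechanism to train the retriever. On BEIR and RTEB, with embedding backbones from 0.5B to 3B parameters, DREAM consistently outperforms existing baselines.
Link: https://arxiv.org/abs/2606.24667
Authors: Yixuan Tang,Yi Yang
Affiliations: The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: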
Abstract:Dense retrieval embedding models are a fundamental component of modern retrieval-based AI systems. Most dense retrievers are trained with contrastive objectives, which require labeled positive and negative document pairs that are often costly and difficult to obtain. In this work, we investigate whether the autoregressive next-token prediction objective of a large language model (LLM) can provide supervision for dense retrieval. The intuition is simple: if a document contains information relevant to a query, conditioning on that document should make the target output easier for the LLM to predict. A key challenge is that the next-token prediction loss is computed inside the LLM, while the retriever is a separate embedding model. To address this challenge, we propose DREAM (Dense Retrieval Embeddings via Autoregressive Modeling), which injects retriever-generated query-document similarity scores into selected attention heads of a frozen LLM. During training, these scores determine how much attention each candidate document receives while the LLM predicts the target output. The resulting prediction loss provides gradients for retriever training through the attention mechanism. We evaluate DREAM on retrieval benchmarks BEIR and RTEB using embedding backbones ranging from 0.5B to 3B parameters. DREAM consistently outperforms existing baselines across different model scales. These results demonstrate that DREAM provides a promising approach for training dense retrievers through autoregressive modeling.
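The intuition that a prediction loss can train a retriever can be checked on a toy model. The sketch below is a heavy simplification and an assumption throughout, not DREAM itself: instead of attention heads in a frozen LLM, it treats softmax-normalized retriever scores as mixture weights over per-document probabilities of the target token (the hypothetical `target_prob_given_doc` values), and derives the gradient of the negative log-likelihood with respect to the scores by hand.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_score_grads(scores, target_prob_given_doc):
    """Toy objective: p(target) = sum_i softmax(scores)_i * p(target | doc_i).

    Returns the negative log-likelihood and its gradient w.r.t. each score,
    which is -w_i * (p_i - p) / p for mixture weight w_i.
    """
    weights = softmax(scores)
    p = sum(w * q for w, q in zip(weights, target_prob_given_doc))
    loss = -math.log(p)
    grads = [-w * (q - p) / p for w, q in zip(weights, target_prob_given_doc)]
    return loss, grads


# Hypothetical setup: document 0 makes the target far easier to predict.
scores = [0.0, 0.0, 0.0]
target_prob = [0.9, 0.1, 0.1]

loss, grads = loss_and_score_grads(scores, target_prob)

# One plain gradient-descent step on the retriever scores.
updated_scores = [s - g for s, g in zip(scores, grads)]
updated_loss, _ = loss_and_score_grads(updated_scores, target_prob)
print(grads, loss, updated_loss)
```

The gradient is negative for the helpful document and positive for the others, so descent raises the score of the document that makes the target easier to predict, with no relevance label involved.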
[NLP-9] AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach
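Quick read: This paper addresses structured information extraction from the rapidly growing and complex product data of Brazilian e-commerce, where traditional Product Attribute Value Extraction (PAVE) methods struggle with the linguistic nuances and diversity of Portuguese product descriptions. It makes two contributions. The first is AI-PAVE-Br, an LLM-based system that uses targeted prompt engineering to perform PAVE for Brazilian e-commerce catalogs. The second is the Golden Set, a curated, manually annotated Portuguese PAVE dataset structured by Entity, Category, and Subcategories, released as a benchmark for reproducible research. According to the authors, AI-PAVE-Br substantially outperforms conventional Named Entity Recognition (NER) baselines.
Link: https://arxiv.org/abs/2606.24655
Authors: Murilo Gazzola,Hugo Gobato Souto,Samuel Silva,Júlia Schubert Peixoto,Felipe Siqueira,André Luis Pedroso de Morais,Caio Gomes
Affiliations: University of São Paulo; Federal University of Rio de Janeiro
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: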
Abstract:The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of product descriptions in Portuguese. To address this critical gap, this paper introduces two major contributions. First, we present AI-PAVEBr, a specialized system engineered with Large Language Models (LLMs) to perform high-accuracy PAVE specifically for Brazilian e-commerce catalogs. Second, to facilitate reproducible research and provide a definitive benchmark, we introduce and share the Golden Set, a new, meticulously curated, and manually annotated dataset for PAVE in Portuguese. We detail the creation process and structure (Entity, Category, Subcategories) of this high-quality reference set. Our experiments conclusively show that AI-PAVE-Br, leveraging targeted prompt engineering, dramatically outperforms conventional Named Entity Recognition (NER) baselines. This work not only delivers a superior, scalable solution for a major non-English market but also enriches the NLP community with a valuable, publicly available resource for future PAVE research.
[NLP-10] Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling NEURIPS2024
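Quick read: This paper addresses two obstacles to long-sequence language modeling: the O(L^2) compute cost of self-attention, and the degradation caused by positional-encoding limits such as RoPE. The proposed Harmonic architecture is a hierarchical state space model (SSM) that stacks three recurrent levels at progressively slower timescales, where each level receives the prediction error of the level below rather than its raw hidden state. Compute is O(L) per forward pass. With equal token budgets on enwiki8, Harmonic beats a comparable Transformer and Mamba in bpt (lower is better) at every tested length from 1K to 32K tokens; at 64K tokens, Mamba and the Transformer run out of memory on an 80GB H100 while Harmonic still trains, reaching 6.169 bpt. Replacing all attention layers of TinyLlama 1.1B with HarmonicBlock removes the RoPE length limit, and the resulting model keeps a stable loss across sequence lengths of 1K to 8K.
Link: https://arxiv.org/abs/2606.24650
Authors: Petr Nyoma
Affiliations: Independent Researcher
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures. NeurIPS 2024 format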
Abstract:We present Harmonic, a hierarchical state space model (SSM) for language modeling. The architecture stacks three recurrent levels at progressively slower timescales; each level receives the prediction error of the level below as input, rather than its raw hidden state. On enwiki8 with equal token budgets, Harmonic outperforms a comparable Transformer (28M params) by +1.4% at 1K tokens, +6.7% at 8K tokens, and +11.4% at 32K tokens (bpt, lower is better). It also outperforms Mamba at every tested length by 0.7–1.8%. At 64K tokens, both Mamba and Transformer run out of memory on an 80GB H100; Harmonic trains successfully, reaching 6.169 bpt. Results replicate on WikiText-103 (H-TF gap +1.7% to +7.2% across 1K–32K). At 1B parameter scale, replacing all attention layers in TinyLlama 1.1B with HarmonicBlock eliminates the RoPE positional encoding limit: the resulting Hallamonic model maintains stable loss across sequence lengths 1K–8K on two independent clean benchmarks (Lambada and fineweb-edu held-out), while TinyLlama degrades catastrophically past its 2K-token RoPE limit (gap: +9.4 bpt at seq=8K on Lambada). Compute is O(L) per forward pass vs. O(L^2) for attention. Logs: this https URL.
[NLP-11] ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge INTERSPEECH2026
Quick read: This paper addresses the weak fine-grained discrimination of Large Audio-Language Models (LALMs) when they are used to judge paralinguistic attributes of generated speech: Style, Rate, Emphasis, Age, and Gender. Prior work mainly evaluates holistic naturalness and leaves these distinctions underexplored. The authors introduce ParaPairAudioBench, a pairwise benchmark of 5,175 audio pairs covering the five dimensions, with same-transcript and cross-transcript conditions to separate reliance on lexical content from reliance on acoustics. Experiments show that current LALM judges trail human judgments by 32 percentage points on average and exhibit severe calibration failures, especially in Tie cases where the correct decision is to abstain. The key contribution is a multi-dimensional, calibration-aware benchmark for assessing how reliable LALM-as-a-Judge is for paralinguistic speech evaluation.
Link: https://arxiv.org/abs/2606.24648
Authors: Jisu Jeon,Seungyeon Jwa,Joosung Lee,Jinhyeon Kim,Woojin Chung,Hwiyeol Jo,Jeonghoon Kim,Jonghyun Choi,Soyoon Kim
Affiliations: Hongik University; Seoul National University; NAVER Cloud; KAIST
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2026
Abstract:Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPairAudioBench, a pairwise benchmark of 5,175 audio pairs across five paralinguistic dimensions: Style, Rate, Emphasis, Age, and Gender. Our experiments show that current LALM judges still lag behind human judgments by 32 percentage points on average and exhibit severe calibration failures, particularly in Tie cases where the correct decision is to abstain. To further analyze lexical versus acoustic reliance, the benchmark includes both same-transcript and cross-transcript conditions. ParaPairAudioBench enables multi-dimensional, calibration-aware assessment of the reliability of LALM-as-a-Judge for paralinguistic speech evaluation.
[NLP-12] The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking
Quick read: This paper addresses a failure mode of LLM-based fact-checking systems: they reach high verdict accuracy on standard benchmarks yet routinely output Supports labels whose cited evidence does not actually license the claim. Structured decomposition is the natural way to inspect such warrants, but rigid extraction protocols strip away the full-claim context that the extracted facets need. The authors propose SIFT, which re-scores extracted evidence spans against the full claim (claim-conditioned re-scoring), together with WSP (Warranted Supports Proportion), an automatic NLI check that the cited warrant entails the claim. Evaluated on FEVER, SciFact, 5PILS, and DP with four open-source backbones, SIFT recovers accuracy in settings where naive decomposition costs up to 27.6 points, while raising WSP above that of direct prompting. WSP itself calibrates against human gold evidence at AUC 0.92 and precision 0.98.
Link: https://arxiv.org/abs/2606.24627
Authors: Arka Ujjal Dey,John Collomosse
Affiliations: University of Bath; University of Southampton
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect those warrants, but rigid extraction protocols strip the full-claim context that facets need. We introduce SIFT – claim-conditioned re-scoring of extracted evidence spans against the full claim – paired with WSP (Warranted Supports Proportion), an automatic NLI check that the cited warrant entails the claim. We evaluate on FEVER, SciFact, 5PILS, and DP across four open-source backbones. SIFT recovers accuracy on cells where naive decomposition costs up to 27.6 points, while raising WSP above direct prompting; WSP itself calibrates against human gold evidence at AUC 0.92 and precision 0.98.
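One plausible reading of WSP, sketched below, is the share of Supports verdicts whose cited evidence passes an entailment check against the claim. This is an assumption based on the abstract's one-line description, not the authors' definition or code. The `toy_entails` function is a hypothetical word-overlap stand-in for a real NLI model, and the record fields (`label`, `claim`, `evidence`) are made up for the example.

```python
def warranted_supports_proportion(predictions, entails):
    """Share of 'Supports' verdicts whose cited evidence entails the claim.

    `entails(evidence, claim)` is any callable returning True or False;
    in practice it would wrap an NLI model.
    """
    supports = [p for p in predictions if p["label"] == "Supports"]
    if not supports:
        return 0.0  # convention chosen for this sketch when nothing is supported
    warranted = sum(1 for p in supports if entails(p["evidence"], p["claim"]))
    return warranted / len(supports)


def toy_entails(evidence, claim):
    """Placeholder NLI check: every claim word must appear in the evidence."""
    return set(claim.lower().split()) <= set(evidence.lower().split())


predictions = [
    {"label": "Supports", "claim": "paris is in france",
     "evidence": "paris is a city in france"},
    {"label": "Supports", "claim": "paris is the largest city in europe",
     "evidence": "paris is a city in france"},
    {"label": "Refutes", "claim": "paris is in spain",
     "evidence": "paris is a city in france"},
]

wsp = warranted_supports_proportion(predictions, toy_entails)
print(wsp)
```

A system can score well on verdict accuracy while this ratio stays low, which is the gap the paper's title refers to.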
[NLP-13] Privacy-Preserving RAG via Multi-Agent Semantic Rewriting: Achieving Confidentiality Without Compromising Contextual Fidelity
Quick read: This paper addresses privacy leakage in Retrieval-Augmented Generation (RAG) deployed in sensitive scenarios, where malicious prompts can extract sensitive data, such as personally identifiable information (PII), from retrieved external knowledge. The authors propose a multi-agent sanitization framework in which three specialized agents (privacy extraction, semantic analysis, and reconstruction) collaboratively rewrite retrieved content, removing sensitive identifiers while preserving the semantic core. All rewriting runs as a one-time offline preprocessing step, so it adds no latency to online inference. On the ChatDoctor and Wiki-PII datasets across six LLMs, the framework sharply reduces leakage under targeted attacks, for example from 144 exposed instances to 1 on LLaMA-3-8B, while keeping contextual fidelity at a BLEU-1 of 0.122 versus 0.117 for the existing SAGE method. The source code is publicly available.
Link: https://arxiv.org/abs/2606.24623
Authors: Yuanhe Zhao,Tianyu Zhang,Huafei Xing,Derek F. Wong,Jianbin Li,Tao Fang
Affiliations: Nankai University; University of Macau; North China Electric Power University; Macao Mobile Communications Co., Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This full manuscript contains 23 pages and has been formally accepted for publication in Information Processing & Management (Elsevier IPM). Tao Fang is the corresponding author
Abstract:Retrieval-Augmented Generation enhances large language models by incorporating external knowledge, but deploying it in sensitive scenarios risks privacy leakage via malicious prompts. To address this, we propose a multi-agent framework that sanitizes retrieved content through semantic rewriting. By employing three specialized agents for privacy extraction, semantic analysis, and reconstruction, our approach collaboratively removes sensitive identifiers while preserving the semantic core. We evaluate the framework on the ChatDoctor and Wiki-PII datasets across six large language models. Experimental results demonstrate a significant reduction in privacy leakage under targeted attacks. For instance, we reduced targeted information exposure in LLaMA-3-8B from 144 instances in the baseline to just 1. Furthermore, we maintain strong contextual fidelity with a BLEU-1 score of 0.122, outperforming the existing SAGE method’s 0.117. Finally, the framework operates as an asynchronous preprocessing module, introducing no additional latency to online inference, as all rewriting is executed as a one-time offline preprocessing step. To promote reproducibility, the source code of this work is publicly available at this https URL.
[NLP-14] Same Lesson Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models
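Quick read: This paper asks whether large language models (LLMs) preserve culturally grounded meaning when they generate narratives across cultures, in particular when different cultures convey the same moral lesson through different forms. It introduces a multilingual narrative evaluation framework built on 414 semantically equivalent proverbs spanning 15 languages, used as culturally grounded prompts for four LLMs that generate about 13k narratives. The analysis examines whether models preserve meaning across languages, how cross-lingual conditioning shapes narrative realization, and whether different model families converge on similar interpretations. Results indicate that cross-lingual prompting largely preserves proverb-level meaning while systematically redistributing agency, social positioning, and narrative structure. Strong inter-model convergence appears in both monolingual and cross-lingual settings, which suggests that multilingual LLMs rely on shared semantic abstractions despite architectural and linguistic differences. The authors conclude that evaluating multilingual narratives by semantic similarity alone may overestimate cultural preservation, and that more comprehensive evaluations of cultural grounding are needed.
Link: https://arxiv.org/abs/2606.24610
Authors: Jory Alshaalan,Haya Albaker,Abeer Aldayel,Aljawharah Alabdullatif,Rehab Alahmadi
Affiliations: King Saud University
Categories: Computation and Language (cs.CL)
Comments: This paper is under review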
Abstract:The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models preserve culturally grounded meaning when equivalent moral lessons are conveyed through distinct cultural forms. This study introduces a multilingual evaluation narrative framework that integrates a cross-linguistic collection of 414 proverbs spanning 15 languages and uses four LLMs to generate 13k narratives. By employing semantically equivalent proverbs as culturally grounded prompts, the analysis assesses whether models preserve meaning across languages, how cross-lingual conditioning influences narrative realization, and whether different model families converge on similar interpretations. Results indicate that cross-lingual prompting largely preserves proverb-level semantic meaning while systematically redistributing agency, social positioning, and narrative structure. Additionally, strong inter-model convergence is observed in both monolingual and cross-lingual settings, suggesting that multilingual LLMs rely on shared semantic abstractions despite architectural and linguistic differences. These findings shed light on the need for more comprehensive evaluations of cultural grounding. Relying exclusively on semantic similarity in multilingual narrative assessments may overestimate cultural preservation by neglecting culturally meaningful variations in narrative expression.
[NLP-15] Qwen-AgentWorld: Language World Models for General Agents
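Quick read: This paper studies how language-model-based world modeling can extend the capabilities of general agents, where a world model predicts environment dynamics from current observations and actions. It introduces Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, described as the first language world models able to simulate agentic environments across 7 domains through long chain-of-thought reasoning. Trained on more than 10M real-world environment interaction trajectories, the models follow a three-stage pipeline: CPT injects general-purpose world modeling from state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity with hybrid rubric-and-rule rewards. For evaluation, the authors present AgentWorldBench, built from real-world interactions of 5 frontier models on 9 established benchmarks, on which Qwen-AgentWorld significantly outperforms existing frontier models. The paper also explores two complementary uses. As a decoupled environment simulator, the model supports scalable, controllable simulation for agentic RL, with gains beyond real-environment training alone. As a unified agent foundation model, world-model training acts as a warm-up that improves downstream performance on 7 agentic benchmarks.
Link: https://arxiv.org/abs/2606.24597
Authors: Yuxin Zuo,Zikai Xiao,Li Sheng,Fei Huang,Jianhong Tu,Yuxuan Liu,Tianyi Tang,Xiaomeng Hu,Yang Su,Qingfeng Lan,Yantao Liu,Qin Zhu,Yinger Zhang,Bowen Yu,Haiquan Zhao,Haiyang Xu,Jianxin Yang,Jiayang Cheng,Junyang Wang,Lianghao Deng,Mingfeng Xue,Tianyi Bai,Yang Fan,Yubo Ma,Yucheng Li,Zeyu Cui,Zhihai Wang,Zhihui Xie,Zhuorui Ye,An Yang,Dayiheng Liu,Jingren Zhou,Ning Ding
Affiliations: Qwen Team
Categories: Computation and Language (cs.CL)
Comments: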
Abstract:A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environments covering 7 domains via long chain-of-thought reasoning. Leveraging more than 10M environment interaction trajectories of 7 domains in real-world environments, we develop Qwen-AgentWorld through a three-stage training pipeline: CPT injects general-purpose world modeling capabilities from the state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity through a tailored framework with hybrid rubric-and-rule rewards. To evaluate language world models, we present AgentWorldBench, a comprehensive benchmark constructed from real-world interactions of 5 frontier models on 9 established benchmarks. Empirical results demonstrate that Qwen-AgentWorld significantly outperforms existing frontier models. (ii) Beyond foundation models, we further investigate two complementary paradigms through which world modeling enhances general agents. First, as a decoupled environment simulator, Qwen-AgentWorld supports scalable and controllable simulation of thousands of real-world environments for agentic RL, yielding gains that surpass real-environment training alone. Second, as a unified agent foundation model, world-model training acts as a highly effective warm-up that improves downstream performance across 7 agentic benchmarks. Code: this https URL
[NLP-16] To Compare or Not to Compare: On Methodological Practices in Evaluating Social Bias
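Quick read: This paper addresses the contradictory conclusions that methodological fragmentation produces in social-bias evaluation of Large Language Models (LLMs), which the authors attribute largely to ignoring the structural framing of benchmark evaluations. They introduce a unified, controllable framework that standardizes heterogeneous benchmarks so that isolated demographic assessments can be systematically contrasted with forced-choice comparative settings. This makes it possible to disentangle the confounding effects of Chain-of-Thought (CoT) reasoning, neutral fallback options, and other structural artifacts. Across multiple model families, isolated assessments limit prejudice activation, whereas comparative settings strongly trigger latent discrimination, a shift driven mainly by underspecified contexts. CoT reasoning worsens bias in comparative settings, and the bias persists as a deterministic prejudice even when models are given neutral fallback options or claim to answer randomly. The comparative prejudice also grows with model size. The resulting guideline: researchers should use comparative settings to audit hidden biases, but practitioners cannot safely rely on comparative deployments in ambiguous real-world tasks.
Link: https://arxiv.org/abs/2606.24596
Authors: Federico Marcuzzi,Xuefei Ning,Roy Schwartz,Iryna Gurevych
Affiliations: INSAIT, Sofia University “St. Kliment Ohridski”, Bulgaria; Tsinghua University, China; The Hebrew University of Jerusalem, Israel; Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany
Categories: Computation and Language (cs.CL)
Comments: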
Abstract:As Large Language Models are increasingly deployed in critical applications, robustly evaluating their social biases is paramount. However, the current literature suffers from widespread methodological fragmentation, which yields contradictory conclusions. This stems largely from ignoring the structural framing of benchmark-level evaluations. To resolve this, we introduce a unified and controllable framework that standardizes heterogeneous benchmarks to systematically contrast isolated demographic assessments with forced-choice comparative settings. Crucially, this allows us to disentangle the confounding effects of Chain-of-Thought reasoning, neutral fallback options, and other structural artifacts in social bias evaluations. Our evaluation across multiple model families reveals a massive, systematic paradigm gap: while isolated assessments limit prejudice activation, comparative settings act as aggressive catalysts for latent discrimination, a shift primarily driven by underspecified contexts. Alarmingly, CoT reasoning exacerbates social biases under comparative settings, and this systemic bias persists as a deterministic prejudice even when models are provided neutral fallback options or claim to answer randomly. Finally, we demonstrate that this comparative prejudice is a generalized phenomenon that scales positively with model size. Ultimately, we offer a crucial methodological guideline: while researchers must leverage comparative settings to robustly audit hidden biases, practitioners cannot safely rely on comparative deployments in ambiguous real-world tasks.
[NLP-17] MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery
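Quick read: This paper addresses a gap in how the long-term memory of LLM agents is evaluated: existing methods measure memory indirectly through downstream behavior (later answers, personalization quality, task success) and leave the memory artifact itself largely unaudited. The authors propose evaluating long-term memory as an auditable post-interaction artifact: after ordinary assistance, what structured user state can be reconstructed from the memory the agent leaves behind? MEMPROBE instantiates this view. A memory-equipped agent assists simulated users, each carrying a hidden, taxonomy-anchored user-state bank, across a trajectory of leak-controlled tasks; the bank is then reconstructed from the agent's memory under both full-store and top-k access. The benchmark uses synthetic ground truth, spans 50 simulated users with 31 hidden dimensions each (1,550 recovery targets), and tests 5 representative memory systems. The main finding is that successful assistance and recoverable memory are distinct capabilities: task completion nearly saturates, even for a memoryless baseline, while category-balanced recovery stays around 0.6 and drops further under top-k retrieval. The authors position recovery as a concrete optimization objective for future memory agents.
Link: https://arxiv.org/abs/2606.24595
Authors: Enze Ma,Yufan Zhou,Wei-Chieh Huang,Jie Yang,Huanhuan Ma,Zixuan Wang,Chengze Li,Chunyu Miao,Philip S. Yu,Zhen Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: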
Abstract:Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through downstream behavior, such as later answers, personalization quality, or task success, which tests that understanding only indirectly and leaves the memory artifact itself largely unaudited. We argue that long-term memory should instead be evaluated as an auditable post-interaction artifact: after ordinary assistance, what structured user state can be reconstructed from the memory the agent leaves behind? We instantiate this view in MEMPROBE, a benchmark in which a memory-equipped agent assists simulated users, each carrying a hidden, taxonomy-anchored user-state bank, across a trajectory of leak-controlled tasks, after which that bank is reconstructed from the agent’s resulting memory under both full-store and top-k access. Built on synthetic ground truth for efficient, scalable measurement, MEMPROBE spans 50 simulated users with 31 hidden dimensions each (1,550 recovery targets) and tests 5 representative memory systems. Testing state-of-the-art memory agents, we find that successful assistance and recoverable memory behave as distinct capabilities. Task completion nearly saturates, even for a memoryless baseline, while category-balanced recovery stays moderate (about 0.6) and drops further under top-k retrieval. MEMPROBE is the first benchmark to study memory recovery directly, reconstructing the user state a system retains and scoring it against ground truth. We see recovery as a concrete objective for future memory agents to optimize, and MEMPROBE as a step toward an environment where agents are trained to remember their users, growing more faithful the longer they know them.
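The abstract reports a category-balanced recovery score without defining it. A common way to balance across categories, assumed here rather than taken from the paper, is to compute the recovery rate inside each category and then average those rates, so that categories with many targets do not dominate. The category names and outcomes below are hypothetical.

```python
from collections import defaultdict


def category_balanced_recovery(targets):
    """Mean of per-category recovery rates.

    `targets` is an iterable of (category, recovered) pairs, where
    `recovered` says whether that hidden dimension was reconstructed.
    """
    per_category = defaultdict(list)
    for category, recovered in targets:
        per_category[category].append(1.0 if recovered else 0.0)
    rates = [sum(hits) / len(hits) for hits in per_category.values()]
    return sum(rates) / len(rates)


# Hypothetical outcomes: a large, easy category and a small, hard one.
targets = (
    [("preferences", True)] * 8 + [("preferences", False)] * 2
    + [("constraints", True)] * 1 + [("constraints", False)] * 3
)

balanced = category_balanced_recovery(targets)
pooled = sum(recovered for _, recovered in targets) / len(targets)
print(f"balanced={balanced:.3f} pooled={pooled:.3f}")
```

In this toy case the pooled rate (9 of 14) looks better than the balanced one (the mean of 0.8 and 0.25) because the easy category contributes most of the targets.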
[NLP-18] AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
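[Quick read]: This paper addresses two core challenges in adversarial evaluation of large language models (LLMs): how to systematically generate hard inputs, and how to reliably confirm that the resulting failures are real. The key to its solution is AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries the target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. Experiments cover 45 seeds across three task categories (reasoning, instruction-following, and tool use), and every seed produced a confirmed failure. Key findings: operator effectiveness varies sharply by category; the binary failure rate hides actual difficulty, with survival curves showing that instruction-following seeds need 2.4 attacker iterations on average versus 1.1 for the other categories; pairwise judge agreement reaches 80-87%, yet Cohen's kappa is near zero because of label skew, so category-level disagreement rates are more informative; and adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting that the mutations exploit general behavioral patterns rather than model-specific weaknesses.
Link: https://arxiv.org/abs/2606.24589
Authors: Khanak Khandelwal (Indian Institute of Technology Jodhpur)
Affiliations: Indian Institute of Technology Jodhpur
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures, 5 tables. Code and data at this https URL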
Abstract:Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories: reasoning, instruction-following, and tool use. Every seed produced a confirmed failure. Four findings stand out. First, operator effectiveness varies sharply by category: inject_distractor scores 0.00 mean reward on instruction-following seeds but 0.80-0.83 on reasoning and tool-use. Second, binary failure rate hides difficulty: instruction-following seeds required 2.4 attacker iterations on average versus 1.1 for other categories, a gap visible in survival curves. Third, pairwise judge agreement of 80-87% coexists with near-zero Cohen’s kappa due to label skew; category-level disagreement rates are more informative. Fourth, adversarial prompts generated against Llama 3.1 8B transfer zero-shot to Llama 3.3 70B, suggesting the mutations exploit general behavioral patterns rather than model-specific weaknesses. Code, dataset, and analysis scripts are available at this https URL .
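The third finding (high pairwise agreement alongside near-zero Cohen's kappa) can be reproduced with a small worked example. The sketch below is not taken from the paper: the verdict lists are hypothetical and were chosen so that two judges each say "fail" 90% of the time but do so independently of each other.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both judges pick the same label independently.
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in set(a) | set(b))
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one label is ever used
    return (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts for 100 items (not data from the paper):
# 81 both "fail", 9 + 9 disagreements, 1 both "pass".
judge_a = ["fail"] * 81 + ["fail"] * 9 + ["pass"] * 9 + ["pass"] * 1
judge_b = ["fail"] * 81 + ["pass"] * 9 + ["fail"] * 9 + ["pass"] * 1

print(percent_agreement(judge_a, judge_b))  # raw agreement is 82%
print(cohens_kappa(judge_a, judge_b))       # kappa is approximately zero
```

With labels this skewed, chance agreement alone is already 0.9 × 0.9 + 0.1 × 0.1 = 0.82, so an observed agreement of 82% carries no information beyond chance, which is why per-category disagreement rates can be the more useful diagnostic.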
[NLP-19] Cross-Lingual Exploration for Parametric Knowledge
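[Quick read]: This paper addresses the uneven accessibility of parametric knowledge across languages in large language models (LLMs): standard inference techniques struggle to surface localized facts in non-native-language contexts, which leads to failures in cross-lingual knowledge transfer and consistency. Its core solution is to explore cross-lingual prompting strategies to systematically uncover hidden factual knowledge. The study identifies four inherent dimensions that govern parametric knowledge retrieval and evaluates them on multilingual factual benchmarks covering 17 typologically diverse languages. The results show that cross-lingual exploration significantly improves knowledge transfer and factual recall, and offers a more compute-efficient Pareto frontier than native-language scaling. Cross-lingual consistency also improves, by more than accuracy gains alone can explain. Overall, the work establishes multilingual prompt exploration as an efficient, scalable inference-time strategy for unlocking latent parametric knowledge.
Link: https://arxiv.org/abs/2606.24579
Authors: Elisha Diskind,Itamar Trainin,Uri Shaham,Leshem Choshen,Idan Szpektor,Omri Abend
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 5 figures, preprint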
Abstract:Parametric knowledge in Large Language Models is not equally accessible across languages. As a result, standard inference techniques often struggle to surface localized facts, leading to failures in cross-lingual knowledge transfer and consistency. In this work, we investigate techniques for accessing hidden factual knowledge by exploring cross-lingual prompting strategies. We identify four inherent dimensions of cross-lingual exploration that directly govern parametric knowledge retrieval and evaluate them on multilingual factual benchmarks covering 17 typologically diverse languages. Our results demonstrate that cross-lingual exploration significantly improves knowledge transfer and factual recall, representing a more efficient compute Pareto frontier than native-language scaling. Furthermore, we observe corresponding improvements in cross-lingual consistency, exceeding what can be explained by accuracy gains alone. Overall, our work establishes multilingual prompt exploration as a highly effective inference-time strategy for unlocking latent parametric knowledge.
[NLP-20] NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?
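[Quick read]: This paper addresses the bottleneck that current AI coding agents can imitate or reproduce existing results on research tasks but struggle to achieve genuine scientific discovery. The core challenge is that existing benchmarks yield unreliable evaluations because of environment fragmentation, and lack an effective measure of whether agents can go beyond reproduction. In response, the study proposes NatureBench, a set of 90 real cross-discipline research tasks distilled from peer-reviewed Nature-family papers, and uses the automated NatureGym pipeline to build a standardized, containerized execution environment for each task, which mitigates the evaluation bias caused by heterogeneous environments. A strict web-search-disabled protocol forces agents to rely on their own knowledge and reasoning, giving a more faithful picture of their capacity for scientific innovation. Across ten frontier agent configurations, the strongest model surpasses the published SOTA on only 17.8% of tasks (under the g0.1 criterion). Successes come mainly from methodological translation, that is, converting scientific problems into familiar supervised prediction tasks, rather than from genuine scientific invention; failures are dominated by wrong method choice and insufficient compute budget rather than task misunderstanding. The authors release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction, in support of credible, reproducible evaluation of research agents.
Link: https://arxiv.org/abs/2606.24530
Authors: Yuru Wang,Lejun Cheng,Yuxin Zuo,Sihang Zeng,Bingxiang He,Che Jiang,Junlin Yang,Yuchong Wang,Kaikai Zhao,Weifeng Huang,Kai Tian,Zhenzhao Yuan,Jincheng Zhong,Weizhi Wang,Ning Ding,Bowen Zhou,Kaiyan Zhang
Affiliations: Horizon Research; Frontis.AI; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: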
Abstract:We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: this https URL
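[NLP-21] AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning
[Quick read]: This paper addresses the challenge of reasoning over complex, unstructured archives of workplace documents: locating sparse evidence in a large, messy collection of files, reconciling inconsistent terminology, units, and time conventions, and computing an accurate answer. Existing benchmarks cover only parts of this setting, and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. The study therefore introduces the Agora benchmark, which pairs 362 questions with eight domain collections of 9,664 authentic documents totalling 372M tokens, far beyond any current model's context window, so agents must explore deliberately rather than scan exhaustively. The core of the approach is an agentic construction pipeline that combines cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering to keep the data authentic and challenging. Evaluation of eight models shows the task is far from solved: even the strongest reaches only 59.4% accuracy, with notable variation across domains.
Link: https://arxiv.org/abs/2606.24526
Authors: Honglin Guo,Qi Zhang,Yu Zhang,Weijie Li,Rui Zheng,Zhikai Lei,Qiyuan Peng,Zhiheng Xi,Tao Gui,Qi Zhang
Affiliations: Fudan University; Zhejiang University; Shanghai Qiji Zhifeng Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments: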
Abstract:Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and none jointly stresses archive-groundedness, agentic exploration, and cross-domain coverage. We introduce Agora, a benchmark pairing 362 questions with eight domain collections of 9,664 authentic documents and 372M tokens, far exceeding any model’s context window, so agents must explore deliberately rather than scan exhaustively. Agora is built by an agentic pipeline combining cross-document task synthesis, leakage-preventing obfuscation, and difficulty filtering. Evaluating eight models, we find the task far from solved: even the strongest reaches only 59.4% accuracy, with notable variation across domains.
[NLP-22] Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams
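[Quick read]: This paper addresses scam phone call detection in low-resource languages such as Turkish, where annotated data is scarce and technological defenses are limited. Existing research focuses mostly on English and other high-resource languages, leaving the real threats faced by vulnerable communities uncovered. The study introduces the first public multi-modal Turkish dataset of 100 aligned audio-transcript pairs of scam and benign calls, and evaluates seven large language models (LLMs) under three input conditions: raw audio, automatic speech recognition (ASR) transcripts, and transcripts corrected by a native speaker. Transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably, which suggests the gain comes from using text at all rather than from manual correction. The main contribution is a real-world multi-modal dataset for a low-resource language, together with evidence that transcript-based input improves scam detection; the authors call for more culturally and linguistically inclusive AI safety research and more robust multi-modal fraud-prevention systems.
Link: https://arxiv.org/abs/2606.24523
Authors: Arda Eren,Micheal Cheung,Youqian Zhang,Grace Ngai,Eugene Yujun Fu
Affiliations: University of California, Berkeley; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Poster paper accepted at 47th IEEE Security & Privacy 2026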
Abstract:Scam phone calls exploit vulnerable communities worldwide, yet research on detection has focused almost exclusively on English and other high-resource languages. In low-resource settings such as Turkish, detection is especially difficult, as annotated data is scarce and technological defenses remain limited. This research investigates how large language models (LLMs) can support scam detection in Turkish by introducing the first public multi-modal dataset of 100 aligned audio-transcript pairs of scam and benign conversations. We evaluate seven LLMs spanning three model families: Gemini 2.5 (Flash, Flash-Lite, Pro), GPT-4o, and Qwen (Max, Plus, Turbo), under three input conditions: raw audio, automatic speech-to-text transcripts, and transcripts refined by a native speaker. Our results suggest that transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. By centering a low-resource language and real world threat, this work highlights the urgent need for culturally and linguistically inclusive AI safety research and more robust multi-modal systems for fraud prevention.
[NLP-23] A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial
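[Quick read]: This paper addresses diagnostic delay in rare diseases caused by the scarcity of specialized clinical expertise, the lack of high-quality training data, and the limited clinical deployability of existing large language models (LLMs). Its core solution is RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM with 32B parameters, trained with reasoning-enhanced training on 49,170 publicly available free-text cases and 104,666 phenotype-anchored synthetic cases. RaDaR outperforms the other evaluated open-source models, including the 671B-parameter DeepSeek-R1, on public benchmarks and at four external validation centers. In a retrospective cohort, it prioritized the final diagnosis before documented clinical suspicion in 61.06% of cases, a potential lead time of 1.87 months, equal to 50.18% of the within-center diagnostic interval. In a randomized physician-assistance trial, RaDaR assistance raised physicians' diagnostic accuracy by 21.44 percentage points over internet search alone. Synthetic-data ablations suggest that phenotype-anchored narratives provide a useful training signal for long-tail rare diseases, with a monotonic scaling trend within the tested data range. Together, RaDaR and its development and validation framework provide a deployable rare-disease reasoning model and a reproducible development approach for diagnostic AI under data scarcity.
Link: https://arxiv.org/abs/2606.24510
Authors: Haichao Chen,Songchi Zhou,Zhengyun Zhao,Shikai Hu,Xianghong Jin,Hongwei Ji,Li He,Shuli Li,Yiming Qin,Xin Tan,Runfeng Shi,Yih Chung Tham,Jiaye Zhu,Ye Li,Ye Jin,Longhao Cao,Dawei Li,Honghan Wu,Hongqiu Gu,Guanqiao Li,Tudor Groza,Chunying Li,Dian Zeng,Weihong Yu,Gareth Baynam,Saumya Shekhar Jamuar,Min Shen,Shuyang Zhang,Bin Sheng,Sheng Yu,Tien Yin Wong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 36 pages, 5 figures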
Abstract:Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support rare disease diagnosis, current models are constrained by insufficient clinical deployability, limited clinically grounded evidence, and scarcity of training data. Here we present RaDaR (Rare Disease navigatoR), an open-source, compact reasoning LLM (32B parameters) for rare disease diagnosis. RaDaR was trained with 49,170 publicly available free-text cases and 104,666 synthetic cases with reasoning-enhanced training. RaDaR showed the strongest performance among evaluated open-source models, including the 671B DeepSeek-R1, across public benchmarks and four external validation centers. In a retrospective cohort, RaDaR prioritized the final diagnosis before documented clinical suspicion in 61.06 percent of cases, corresponding to a potential lead time of 1.87 months and 50.18 percent of the within-center interval. In a randomized physician-assistance trial, RaDaR assistance improved physicians’ rare-disease diagnostic accuracy by 21.44 percentage points compared with internet search alone. Synthetic-data ablations suggested that phenotype-anchored narratives provide useful training signal for long-tail rare diseases, with a monotonic scaling trend within the tested data range. Together, RaDaR and its development and validation framework provide a deployable rare-disease reasoning model and a reproducible development framework for diagnostic AI under data scarcity.
[NLP-24] UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction ACL
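[Quick read]: This paper addresses L1-aware vocabulary difficulty prediction: predicting how hard target-language words are for learners with different first-language (L1) backgrounds. The core challenge is modeling how word difficulty varies across languages and L1s. The key to the solution is a hybrid model that fuses multilingual contextual representations with hand-engineered features: sentence-embedding encoders such as BGE-M3, multilingual E5, and LaBSE supply the contextual representations, and engineered features capture frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Separate regression systems are trained for Spanish, German, and Chinese. Development results show consistent gains over the official closed-track baselines, and the official submissions reach RMSE scores of 1.132, 1.037, and 0.891 respectively, with Chinese lowest. Frequency is the most stable predictor, while contextual predictability, form similarity, retrieval, and semantic features add complementary L1-sensitive signals. Error analysis shows strong ranking performance but weaker calibration on the easiest items, which are often overpredicted.
Link: https://arxiv.org/abs/2606.24501
Authors: Nouran Khallaf,Serge Sharoff
Affiliations: University of Leeds; Alexandria University
Categories: Computation and Language (cs.CL)
Comments: Published at BEA2026, 21st Workshop on Innovative Use of NLP for Building Educational Applications, at ACL, July 2026, San Diego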
Abstract: This paper describes UOL@IDEM’s closed-track submission to the BEA 2026 shared task on L1-aware vocabulary difficulty prediction. We model the task as regression and train separate systems for Spanish, German, and Mandarin Chinese (below we use "Chinese" for brevity). Our system combines multilingual contextual representations with engineered features capturing frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Development results show consistent gains over the official closed-track baselines, with sentence-embedding encoders such as BGE-M3, multilingual E5, and LaBSE performing best. Official submissions achieve RMSE scores of 1.132, 1.037, and 0.891 for Spanish, German, and Chinese, respectively. Feature analysis identifies frequency as the most stable predictor, while contextual predictability, form similarity, retrieval, and semantic features provide complementary L1-sensitive signals. Error analysis shows strong ranking performance but weaker calibration for the easiest items, which are often overpredicted. See this https URL
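[NLP-25] The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs
[Quick read]: This paper addresses the structurally higher token overhead that tokenizer design imposes on African languages in multilingual large language models: the same meaning is split into different numbers of subword tokens in different languages, and African languages generally have higher token fertility, so they bear a disproportionate burden in inference cost, context capacity, and latency. The study benchmarks 20 African languages (five language families, three scripts) on parallel corpora and quantifies each language's tokenization premium relative to English: a median of 1.88x and up to 8.92x on GPT-5 / o200k_base. The penalty is most extreme for Ge’ez/Ethiopic and N’Ko scripts (7-9x) and is highly stable across corpora (FLORES vs SIB-200, Pearson r = 0.9998). In deployment terms, this means up to 8.9x inference cost with an equivalent generation-latency multiplier (7.4x for Amharic), and an effective context window as small as 11% of English's. The best available tokenizer, Gemma 4, lowers the mean premium from 3.31x (cl100k_base) to 2.38x but does not eliminate the penalty. The authors release an open measurement tool (afri-fertility), a public leaderboard, a results dataset, and mitigation guidance for builders.
Link: https://arxiv.org/abs/2606.24460
Authors: Olaoye Anthony Somide
Affiliations: DataLens Africa Research · CipherSense AI Technologies Ltd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 40 pages, 5 figures, 25 tables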
Abstract:Commercial large language models bill, scale latency, and budget context per token. Yet tokenizers assign more subword tokens to the same meaning in some languages than in others, so speakers of languages with high token-fertility pay a structural penalty before a model is ever invoked. This penalty is documented for multilingual settings in general, but it has not been measured systematically for African languages at the level of enterprise deployment economics and cognitive context capacity. We measure it across 20 African languages spanning five language families and three scripts (Latin, Ge’ez/Ethiopic, N’Ko; 19 appear in the primary FLORES-200+ corpus, with Nigerian Pidgin measured via MAFAND-MT only), using parallel corpora so that the language effect is isolated from content. Across 11 frontier and open tokenizers on FLORES-200+, every African language carries a tokenization premium above English (median 1.88x on GPT-5 / o200k_base, up to 8.92x for N’Ko); the penalty is largest for Ethiopic and N’Ko scripts (reaching 7-9x) and is near-invariant across corpora (FLORES vs SIB-200 Pearson r = 0.9998). Translated into deployment terms, this results in up to 8.9x inference cost and an equivalent generation-latency multiplier (N’Ko vs English on GPT-5; 7.4x for Amharic), and as little as 11% of English’s effective context window. The best currently available tokenizer for African languages, Gemma 4, reduces the mean premium from 3.31x (cl100k_base) to 2.38x, but no tokenizer eliminates the penalty. We release an open measurement tool (afri-fertility), a public leaderboard, a results dataset, and mitigation guidance for African builders. The penalty falls hardest on the languages whose speakers can least afford it, a digital divide encoded directly into the subword vocabulary.
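The step from a tokenization premium to deployment figures is simple arithmetic, sketched below. This is not the afri-fertility tool: the function names are hypothetical, and the sketch assumes per-token billing that scales linearly and a context limit counted in tokens. The 8.92x and 7.4x inputs are the premiums quoted in the abstract.

```python
def token_premium(tokens_in_language: int, tokens_in_english: int) -> float:
    """Ratio of token counts for the same parallel content (1.0 means parity)."""
    return tokens_in_language / tokens_in_english

def relative_cost(premium: float) -> float:
    """With per-token billing, cost scales directly with the premium."""
    return premium

def effective_context_fraction(premium: float) -> float:
    """Share of English's content that fits in the same token budget."""
    return 1.0 / premium

for language, premium in [("N'Ko", 8.92), ("Amharic", 7.4)]:
    print(
        f"{language}: {relative_cost(premium):.1f}x cost, "
        f"{effective_context_fraction(premium):.0%} of English's effective context"
    )
```

For the 8.92x premium, 1 / 8.92 is about 0.11, which matches the abstract's figure of as little as 11% of English's effective context window.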
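[NLP-26] An LLM-based Two-Stage Transformer Framework for Cross-Domain Bearing Fault Diagnosis with Limited Data
[Quick read]: This paper addresses bearing fault diagnosis in industrial settings where dataset heterogeneity, changing operating conditions, and limited labeled data occur at the same time. Existing methods usually treat these issues in isolation and rely on implicit feature alignment, so they lose effectiveness when the challenges are concurrent. The key to the solution is a knowledge-guided two-stage transfer learning framework: a lightweight GPT-2-style Transformer with causal self-attention extracts hierarchical features from vibration signals, and pre-trained encoder weights and fault prototype embeddings act as knowledge carriers, creating an explicit path from multi-source pre-training to target-domain adaptation. The framework combines multi-source learning for generalizable representations, prototype-based knowledge modulation for target adaptation, and taxonomy-adaptive classification for transfer across heterogeneous fault categories. On four real-world datasets, it reaches 92.61% average accuracy with only 10% labeled target data, 17.24 percentage points above state-of-the-art methods, pointing to a practical route to cost-effective predictive maintenance in Industry 4.0 settings.
Link: https://arxiv.org/abs/2606.24459
Authors: Jinghan Wang,Feng Cheng,Wentao Wu,Hang Li,Gaoliang Peng,Tianchen Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted as a conference article of AIM 2026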
Abstract:Bearing fault diagnosis faces critical challenges when dataset heterogeneity, operating condition variations, and limited labeled data occur simultaneously in industrial environments. Existing approaches address these issues in isolation and rely on implicit feature alignment, limiting effectiveness under concurrent challenges. This paper proposes a knowledge-guided two-stage transfer learning framework that employs a lightweight GPT-2-style Transformer with causal self-attention for hierarchical feature extraction from vibration signals, establishing explicit pathways where pre-trained encoder weights and fault prototype embeddings serve as knowledge carriers from multi-source pre-training to target adaptation. The framework addresses the dual-shift challenge through multi-source learning for generalizable representations, prototype-based knowledge modulation for target adaptation, and taxonomy-adaptive classification for seamless transfer across heterogeneous fault categories. Experimental validation on four real-world datasets demonstrates 92.61% average accuracy with only 10% labeled target data, outperforming state-of-the-art methods by 17.24 percentage points, establishing a practical pathway toward cost-effective predictive maintenance in Industry 4.0 applications.
[NLP-27] Bayesian control for coding agents
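[Quick read]: This paper addresses the fact that tool-use decisions in modern coding agents are typically governed by fixed rules that ignore uncertainty. Its core solution is to formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over the correctness of the candidate solution and uses it to decide whether to gather more evidence, refine the candidate, verify it, or stop. Across six generators and nine coding benchmarks, the approach is most valuable when verification is costly and critics are informative but imperfect. The belief state also yields an interpretable correctness score that outperforms token-probability and raw tool-success baselines for uncertainty quantification.
Link: https://arxiv.org/abs/2606.24453
Authors: Theodore Papamarkou,Vladislav Smirnov,Viktor Mazanov,Artem Vazhentsev,Preslav Nakov,Timothy Baldwin,Artem Shelmanov
Affiliations: PolyShape; National Technical University of Athens; MBZUAI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: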
Abstract:Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty. We formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over candidate correctness and dynamically decides whether to gather more evidence, refine the candidate, verify it, or stop. Across six generators and nine coding benchmarks, Bayesian control proves to be most valuable when verification is costly and critics are informative but imperfect. Beyond control, the belief state yields an interpretable correctness score that outperforms token-probability and raw tool-success baselines for uncertainty quantification.
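The belief-maintenance idea can be illustrated with a single Bayes update per critic verdict. The sketch below is an assumption-laden toy, not the paper's controller: the critic's pass rates (`tpr`, `fpr`), the thresholds in `choose_action`, and the action names are all hypothetical, and it leaves out the costs and the separate "verify" action that the abstract describes.

```python
def update_belief(prior: float, passed: bool, tpr: float, fpr: float) -> float:
    """Posterior P(candidate is correct) after one critic verdict.

    tpr: P(critic passes | candidate correct)
    fpr: P(critic passes | candidate incorrect)
    """
    like_correct = tpr if passed else 1.0 - tpr
    like_wrong = fpr if passed else 1.0 - fpr
    numerator = like_correct * prior
    return numerator / (numerator + like_wrong * (1.0 - prior))

def choose_action(belief: float, accept_at: float = 0.95, refine_at: float = 0.2) -> str:
    """Toy threshold rule (hypothetical): stop, refine, or keep gathering evidence."""
    if belief >= accept_at:
        return "stop"
    if belief <= refine_at:
        return "refine"
    return "gather_evidence"

belief = 0.5  # assumed prior that the generated candidate is correct
for verdict in (True, True, True):  # three passes from an imperfect critic
    belief = update_belief(belief, verdict, tpr=0.9, fpr=0.3)
    print(round(belief, 3), choose_action(belief))
```

Because the critic is informative but imperfect, one passing verdict does not settle the question; the belief only crosses the acceptance threshold after several consistent signals. A cost-sensitive controller would additionally weigh the price of each further check against the expected gain in certainty.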
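[NLP-28] Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning
[Quick read]: This paper addresses the Self-Confirmation Trap that arises when large language model (LLM) agents learn from experience in open-world interaction through a single-agent loop. Because the same agent executes the task, summarizes the outcome, and decides what to store, wrong-but-self-consistent trajectories can be mistaken for successful experience, and the errors accumulate during later retrieval and reuse. The paper proposes EDV (Execute-Distill-Verify), a framework for reliable experience learning that decouples the three stages. In the Execute stage, multiple heterogeneous agents explore the same task space in parallel to produce diverse candidate trajectories; in the Distill stage, a dedicated third-party agent compares these trajectories to produce candidate experiences, reducing executor-centric summarization bias; in the Verify stage, the execution group validates candidates through a consensus mechanism, and only approved experiences are written to shared or private memory. This design filters erroneous and noisy content before memory insertion and turns experience learning from isolated self-reflection into collaborative construction. On three challenging long-horizon benchmarks (tau2-bench, Mind2Web, and MMTB), EDV consistently outperforms strong baselines.
Link: https://arxiv.org/abs/2606.24428
Authors: Shiding Zhu,Yudi Qi,Yajie Wang,Jiaze Li,Chao Song,Yaorui Shi,Yibo Miao,Hanqi Gao,Kai Zhang
Affiliations: Zhejiang University; Shenzhen International Graduate School, Tsinghua University; Northwestern Polytechnical University; University of Science and Technology of China; Shanghai Jiaotong University
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 11 figures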
Abstract:Experience-driven self-evolution is critical for large language model (LLM) agents to improve through open-world interaction. However, existing experience learning methods mostly rely on single-agent loops, where the same agent executes tasks, summarizes outcomes, and determines memory content. This setup makes agents vulnerable to the Self-Confirmation Trap: wrong-but-self-consistent trajectories are misidentified as successful experience, leading to cumulative errors during retrieval and reuse. To address this issue, we propose EDV, an Execute-Distill-Verify framework for reliable experience learning. In the Execute stage, multiple heterogeneous agents explore the same task space in parallel to generate diverse candidate trajectories. In the Distill stage, a dedicated third-party agent comparatively analyzes these trajectories to produce candidate experiences, reducing executor-centric summarization bias. In the Verify stage, the execution group validates candidates via a consensus mechanism, and only approved experiences are written into shared or private memory. By decoupling the three stages, EDV transforms experience learning from isolated self-reflection into collaborative construction, filtering erroneous and noisy content before memory insertion. We evaluate EDV on three challenging long-horizon benchmarks: tau2-bench, Mind2Web and MMTB. Results show EDV consistently outperforms strong baselines, validating that reliable experience construction is essential for robust agent self-evolution. Our code is available at this https URL.
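[NLP-29] Beyond Logprobs: A Multi-Signal Confidence Engine for LLM-Based Document Field Extraction IJCAI ECAI2026
[Quick read]: This paper addresses the trust problem posed by silently wrong large language model (LLM) extractions in high-stakes document processing pipelines. The core challenge is not extraction accuracy alone but reliable, field-level confidence estimation: deciding whether each extraction can be trusted for automation or should be deferred to human review. Existing signals such as token-level log-probabilities, verbalized confidence, and multi-sample self-consistency all collapse toward all-positive behaviour at practical thresholds and fail to separate trustworthy from untrustworthy extractions. The key to the solution is ExtractConf, a cross-domain, field-agnostic confidence engine built on two structurally different readings of the same document: a field-guided Hunter call that extracts each field under schema-slot completion pressure, and a document-guided Mapper call that scans the document holistically and surfaces values grounded in its content. The two fail differently (Hunter hallucinates values for absent fields, Mapper misses visually non-salient ones), so their disagreement is independently informative. ExtractConf fuses cross-call disagreement, LLM-internal uncertainty, OCR quality, image quality, and spatial layout into a classifier that needs no domain-specific rules or retraining. On DocILE (55-field invoices, 26% failure rate) it achieves 0.928 ROC AUC and reduces selective prediction risk by 70% relative to logprob-mean; at 80% coverage, accuracy reaches 99.1%, enabling a practical human-in-the-loop workflow. Zero-shot transfer to CORD receipts achieves 0.858 AUC, and lightweight Lasso recalibration reduces expected calibration error (ECE) by 89% and the Brier score by 43%, indicating that the signals generalise across document domains.
Link: https://arxiv.org/abs/2606.24420
Authors: Nitesh Kumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Extended version of a paper accepted (Oral) at the RobustifAI Workshop, IJCAI-ECAI 2026, Bremen, Germany. 9 pages, 5 figures, 2 tables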
Abstract: In high-stakes document processing pipelines, including financial reconciliation, compliance verification, and procurement automation, an LLM extraction that is silently wrong is more dangerous than one that is visibly absent. The central challenge is not extraction accuracy alone but reliable confidence estimation: knowing, field by field, whether an extraction can be trusted for automation or deferred to human review. Token-level log-probabilities, verbalized confidence, and multi-sample self-consistency all collapse toward all-positive behaviour at practical thresholds, offering no reliable separation between trustworthy and untrustworthy extractions. We present ExtractConf, a cross-domain, field-agnostic confidence engine that grounds confidence estimation in two structurally different readings of the same document. A field-guided Hunter call extracts each field under schema-slot completion pressure; a document-guided Mapper call scans holistically and surfaces values grounded in document content. This asymmetry yields different failure modes: Hunter hallucinates values for absent fields, while Mapper misses visually non-salient ones. Their disagreement is independently informative. ExtractConf fuses cross-call disagreement, LLM-internal uncertainty, OCR, image quality, and spatial layout into a classifier requiring no domain-specific rules or retraining. On DocILE (55-field invoices, 26% failure rate), it achieves 0.928 ROC AUC and reduces selective prediction risk by 70% over logprob-mean. At 80% coverage, accuracy reaches 99.1%, enabling a practical human-in-the-loop workflow. Zero-shot transfer to CORD receipts achieves 0.858 AUC; lightweight Lasso recalibration reduces ECE by 89% and Brier by 43%, confirming the signals generalise across document domains.
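The "accuracy at 80% coverage" figure is a selective-prediction metric: the system keeps only its most confident extractions and defers the rest to a human. The sketch below shows how such a number is computed in general. It is not ExtractConf's code; the function name and the ten confidence/correctness pairs are hypothetical.

```python
def accuracy_at_coverage(confidences, correct, coverage):
    """Accuracy on the most confident `coverage` fraction of predictions.

    The remaining low-confidence predictions are treated as deferred to review.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    keep = max(1, round(coverage * len(ranked)))
    kept = ranked[:keep]
    return sum(is_correct for _, is_correct in kept) / keep

# Hypothetical field extractions: confidence score and whether the value was right.
confidences = [0.99, 0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.15, 0.10]
correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

print(accuracy_at_coverage(confidences, correct, 1.0))  # automate everything
print(accuracy_at_coverage(confidences, correct, 0.8))  # defer the least confident 20%
```

The metric only improves when the confidence score ranks wrong extractions below right ones, which is why a signal that collapses toward all-positive behaviour offers little benefit in this setting.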
[NLP-30] AutoSpecNER: A Fine-Grained Named Entity Recognition Dataset for Vehicle Specification Extraction
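[Quick read]: This paper addresses the performance bottleneck in fine-grained entity recognition for vehicle advertisements caused by the scarcity of annotated automotive resources. The core challenge is accurately extracting key specifications such as MODEL, ENGINE_SPEC, and BATTERY_CAPACITY from unstructured, information-dense listing text. The key to the solution is AutoSpecNER, an expert-annotated dataset of 659 advertisements from a popular car-selling website, with over 10,000 entities annotated across 15 fine-grained categories and an average inter-annotator agreement of 91.5%. On this dataset, the authors benchmark rule-based extraction, fine-tuned transformer encoders, and large language models: DeBERTa performs best with a 90% micro-F1 score, ahead of the rule-based baseline (43%) and the strongest large language model (77.8%).
Link: https://arxiv.org/abs/2606.24387
Authors: Jordan Lee,Filippos Ventirozos,Abdirahman Abdullahm,Ioanna Nteka,Peter Appleby,Matthew Shardlow
Affiliations: Manchester Metropolitan University; Autotrader Research Group, Autotrader UK
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 2 figures, 7 tables, Pre-print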
Abstract:Vehicle advertisements contain rich specification information, but automotive NER resources remain limited. We introduce AutoSpecNER, an expert-annotated dataset for fine-grained entity recognition in vehicle listings. The dataset includes 659 advertisements from a popular car-selling website, with over 10,000 entities annotated across 15 categories, including MODEL, ENGINE_SPEC, and BATTERY_CAPACITY. Annotation quality was validated through inter-annotator agreement, achieving an average score of 91.5%. We benchmark rule-based extraction, fine-tuned transformer encoders, and large language models. DeBERTa achieves the best performance with a 90% micro-F1 score, outperforming the rule-based baseline (43%) and the strongest large language model (77.8%).
[NLP-31] On the Stability of Prompt Ranking in Large Language Model Evaluation
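[Quick read]: This paper addresses the instability of prompt rankings under small changes in evaluation conditions in prompt-based use of large language models (LLMs). Common workflows implicitly assume that the relative ranking of candidate prompts stays stable across evaluation conditions. By systematically studying common sources of variability, such as random seeds and limited evaluation subsets, across three open-weight LLMs and two benchmark tasks, the paper finds that although overall rank correlations are often moderate to high, the identity of the top-performing prompt changes frequently, which makes prompt selection unreliable. The key to the solution is a stability-aware selection strategy based on a lower confidence bound that accounts for both a prompt's performance and its variance, improving robustness in unstable settings while remaining competitive in more stable ones. The findings underline the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking.
Link: https://arxiv.org/abs/2606.24381
Authors: Shaoshuai Du,Penghao Liang,Yixian Shen,Chuanqi Shi,Hang Zhang,Lun Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: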
Abstract:Prompt-based interaction has become a dominant paradigm for using large language models (LLMs), where multiple candidate prompts are evaluated and the top-ranked one is selected for downstream use. This workflow implicitly assumes that prompt rankings are stable under minor variations in evaluation conditions. In this paper, we systematically study prompt ranking stability under common sources of variability, including random seeds and limited evaluation subsets. Across three open-weight LLMs and two benchmark tasks, we find that while overall rank correlations are often moderate to high, the identity of the top-performing prompt frequently changes, leading to unreliable selection decisions. To address this issue, we propose a simple stability-aware selection strategy based on a lower confidence bound, which accounts for both performance and variance. Our results show that this approach improves robustness in unstable settings while remaining competitive in more stable regimes. These findings highlight the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking.
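A lower-confidence-bound selection rule can be sketched in a few lines. The abstract does not give the exact form of the bound, so the version below (mean minus `z` standard errors) is an assumption, and the prompt names and per-seed scores are hypothetical.

```python
import statistics

def lower_confidence_bound(scores, z=1.0):
    """Mean score minus z standard errors (assumed form of the bound)."""
    standard_error = statistics.stdev(scores) / len(scores) ** 0.5
    return statistics.mean(scores) - z * standard_error

def select_prompt(results, z=1.0):
    """Pick the prompt whose lower confidence bound is highest."""
    return max(results, key=lambda name: lower_confidence_bound(results[name], z))

# Hypothetical accuracy of two prompts across four evaluation seeds.
results = {
    "prompt_a": [0.80, 0.60, 0.85, 0.55],  # higher mean, large seed-to-seed swings
    "prompt_b": [0.69, 0.68, 0.70, 0.69],  # slightly lower mean, very stable
}

print(select_prompt(results, z=0.0))  # plain mean ranking
print(select_prompt(results, z=1.0))  # stability-aware ranking
```

With `z=0.0` the rule reduces to picking the best mean, which favours the volatile prompt; with `z=1.0` the stable prompt wins because its bound is tighter. The choice of `z` controls how strongly variance is penalised.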
[NLP-32] ComputeFHE: A Privacy-Preserving General-Purpose Computation Library
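[Quick read]: This paper addresses the practical barriers to adopting Fully Homomorphic Encryption (FHE): high computational cost and development complexity. Its core solution is ComputeFHE, an open-source C++ library based on the TFHE cryptosystem and built on top of OpenFHE. It provides encrypted integer and fixed-point data types together with arithmetic, logical, comparison, conditional, and oblivious array-access operations, so developers can implement privacy-preserving algorithms in a familiar imperative style. The library supports two computation paths: conventional TFHE arithmetic based on standard two-input logic gates, and an optimized Arithmetic Logic Unit (ALU) architecture that uses FHE-friendly logic primitives. Experiments show significant reductions in the number of required bootstrapping operations, with speedups of up to 3.9x for selected operations. A built-in simulation mode supports testing, debugging, and complexity analysis without performing actual cryptographic computation, while reporting circuit complexity and bootstrapping costs.
Link: https://arxiv.org/abs/2606.24379
Authors: Faris Serdar Tasel,Efe Ciftci
Affiliations: Çankaya University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 16 pages, 3 figures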
Abstract:Fully Homomorphic Encryption (FHE) enables computations to be performed directly on encrypted data while preserving data confidentiality. However, its practical applications remain limited by high computational costs and development complexity. This paper presents ComputeFHE, an open-source C++ library that facilitates the development of privacy-preserving applications based on the TFHE cryptosystem. The library provides encrypted integer and fixed-point data types together with arithmetic, logical, comparison, conditional, and oblivious array-access operations which allow developers to implement algorithms using a familiar imperative programming paradigm. ComputeFHE supports both conventional TFHE arithmetic based on standard two-input logic gates and an optimized Arithmetic Logic Unit (ALU) architecture utilizing FHE-friendly logic primitives. Experimental results demonstrate significant reductions in the number of required bootstrapping operations, achieving performance improvements of up to 3.9x for selected operations. In addition, the library includes a simulation mode that enables testing, debugging, and complexity analysis without performing actual cryptographic computations while providing circuit complexity and bootstrapping costs. Built on top of OpenFHE, ComputeFHE offers a practical and accessible framework for developing and evaluating privacy-preserving algorithms and applications.
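"Conditional" and "oblivious array-access" operations exist because encrypted code cannot branch on a secret value or index memory with it. A common general technique is to replace the branch with arithmetic and to touch every array element. The plaintext sketch below illustrates that idea only; it is not the ComputeFHE API (the library is C++ and its function names are not given in the abstract), and the names here are hypothetical.

```python
def oblivious_select(cond_bit: int, if_true: int, if_false: int) -> int:
    """Branch-free choice: cond_bit must be 0 or 1.

    Under FHE the same formula would run on ciphertexts, so the evaluator
    never learns which value was selected.
    """
    return cond_bit * if_true + (1 - cond_bit) * if_false

def oblivious_read(array, index: int) -> int:
    """Read array[index] while touching every element.

    The access pattern is identical for every index, so it reveals nothing
    about which position was requested.
    """
    result = 0
    for position, value in enumerate(array):
        hit = int(position == index)  # would be an encrypted equality test under FHE
        result = oblivious_select(hit, value, result)
    return result

print(oblivious_read([10, 20, 30], 1))
```

The cost of this style is visible even in plaintext: a read takes time linear in the array length, which is one reason the number of homomorphic operations (and bootstrappings) matters so much for practical performance.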
[NLP-33] MorfFlex: Handling Rich Morphology LREC2026
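[Quick read]: This paper addresses the size and maintenance burden of morphological dictionaries for languages with highly complex morphology, in particular languages with extensive regularity in both inflection and derivation. Its solution is the MorfFlex architecture, which encodes a system of inflectional and derivational patterns so that the dictionary can be maintained in a far more compact form. In the Czech instance, MorfFlex CZ, the dictionary contains over 100 million wordforms and more than 1 million lemmas, yet stays manageable because it is generated from manually maintained pattern-based source files and conversion scripts. The key idea is to reduce a very large inventory of wordforms to a limited set of morphological patterns, which lowers maintenance cost while keeping the dictionary complete. The resource supports consistent manual morphological annotation in the Prague Dependency Treebanks and underpins tools such as MorphoDiTa.
Link: https://arxiv.org/abs/2606.24366
Authors: Jaroslava Hlaváčová, Marie Mikulová, Barbora Štěpánková, Milan Straka, Jan Hajič
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to LREC 2026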
Abstract:We present MorfFlex, a morphological dictionary architecture suitable for languages with extensive regularity in both inflection and derivation. As the primary example of MorfFlex in use we introduce MorfFlex CZ, a morphological dictionary of Czech. It is distributed as a simple, unstructured list of wordform, lemma, tag triplets, however, its manually maintained, unpublished source files and conversion scripts encode a sophisticated system of inflectional and derivational patterns. These patterns dramatically reduce the otherwise enormous size of the dictionary, which currently contains over 100 million wordforms and more than 1 million lemmas. The MorfFlex CZ dictionary serves as an essential resource for ensuring the consistency of manual morphological annotation in the Prague Dependency Treebanks and underpins state-of-the-art automatic tools such as MorphoDiTa. In this paper, we focus on: (i) presenting an effective method for managing the rich morphological system within the dictionary, and (ii) demonstrating the utility of such a language resource for maintaining annotation consistency in corpora and supporting the development of advanced NLP applications.
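A minimal sketch of how a pattern-based source can expand into the flat list of (wordform, lemma, tag) triplets that the abstract describes. The pattern name `fem_hard_a`, the four-ending paradigm, and the simplified tags are illustrative assumptions, not MorfFlex CZ's actual source format or tagset; only a few regular Czech singular forms are used.

```python
# Hypothetical inflectional pattern: (ending, simplified tag). The first entry
# is treated as the lemma-forming ending.
PATTERNS = {
    "fem_hard_a": [("a", "Sg.Nom"), ("y", "Sg.Gen"), ("u", "Sg.Acc"), ("ou", "Sg.Ins")],
}

# Compact source: one (stem, pattern) pair per lexeme.
LEXICON = [("žen", "fem_hard_a"), ("ryb", "fem_hard_a"), ("knih", "fem_hard_a")]


def expand(lexicon, patterns):
    """Expand stem + pattern pairs into (wordform, lemma, tag) triplets."""
    triplets = []
    for stem, pattern_name in lexicon:
        endings = patterns[pattern_name]
        lemma = stem + endings[0][0]
        for ending, tag in endings:
            triplets.append((stem + ending, lemma, tag))
    return triplets


TRIPLETS = expand(LEXICON, PATTERNS)
```

Three source entries and one pattern yield twelve triplets here; adding a lexeme costs one line in the source, which is the kind of compression the pattern system is meant to provide.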
[NLP-34] Automatic Part-of-Speech Tagging of Arabic-English Dictionary Senses through WordNet
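[Quick read]: This paper addresses part-of-speech (POS) tagging of the senses in a bilingual dictionary, specifically the Al-Mawrid Arabic-English dictionary. The challenge is to obtain accurate tags without large annotated corpora or rich lexical resources. The key step is to transfer the POS tags of the English translation equivalences (TEs) to the dictionary senses after a disambiguation process, with the English POS tags taken from the Princeton WordNet. Because the method needs neither large hand-annotated data nor a complex rule system, it reduces the demand for linguistic experts, investment, and time, and is an instance of the resource-light approaches used to build NLP tools for low-resource languages.
Link: https://arxiv.org/abs/2606.24359
Authors: Diaa M. Fayed, Aly A. Fahmy, Mohsen A. Rashwan, Wafaa K. Fayed
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 3 figures, 5 tables, Published in Proceedings of the 15th Conference on Language Engineering, Egyptian Society of Language Engineering (ESOLE’15), Dec., 2015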
Abstract:This paper proposed an algorithm for part-of-speech (POS) tagging senses of a bilingual dictionary. The algorithm is applied on the Al-Mawrid Arabic-English dictionary. The tagging task is accomplished by transferring the POS tags of the English translation equivalences (TEs) to the dictionary senses after dis-ambiguities process. The English POS tags of senses are acquired from the Princeton WordNet. POS tagging of bilingual dictionary senses is prerequisite to link a bilingual dictionary to WordNet and/or standardizing that dictionary into WordNet-LMF format where the synset (set of synonyms), not word, is the basic brick. The registered accuracy is high though the cost is little. Building NLP/HLT tools needs linguistic experts, large investments, and long time. For statistical approach, we need large annotated corpora and for rule-based approach, we need large lexicon that contains rich linguistic and world knowledge. That motivates the appearance of what are called resource-light approaches to develop natural language processing (NLP) tools for poor-resource languages.
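The tag-transfer idea can be sketched as follows. The abstract does not specify the disambiguation procedure, so a simple majority vote over the translation equivalences is assumed here; `TOY_WORDNET_POS` is a small hand-written table standing in for a real Princeton WordNet lookup, and `tag_sense` is a hypothetical helper.

```python
from collections import Counter

# Toy stand-in for WordNet: English word -> set of POS tags it can take.
TOY_WORDNET_POS = {
    "book": {"n", "v"},
    "volume": {"n"},
    "write": {"v"},
    "compose": {"v"},
    "fast": {"a", "r"},
    "quick": {"a", "r"},
    "rapid": {"a"},
}


def tag_sense(translation_equivalences):
    """Assign a POS to a dictionary sense by voting over its English TEs.

    Returns None when no TE is known or when the vote is tied.
    """
    votes = Counter()
    for te in translation_equivalences:
        for pos in TOY_WORDNET_POS.get(te, ()):
            votes[pos] += 1
    if not votes:
        return None
    best = max(votes.values())
    winners = [pos for pos, count in votes.items() if count == best]
    return winners[0] if len(winners) == 1 else None
```

For example, a sense glossed as "book, volume" resolves to a noun because "volume" is unambiguous, even though "book" alone could be a noun or a verb.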
[NLP-35] Meet UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies LREC2026
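[Quick read]: This paper describes the conversion of the Czech Prague Dependency Treebank-Consolidated (PDT-C) to the Universal Dependencies (UD) framework. Although the two annotation schemes look similar at first sight, they differ in many small ways in the topology of dependency structures and in the granularity of the POS and relation-type inventories, which makes direct mapping difficult. The authors identify these differences, illustrate a selection of them with examples, discuss the diverging motivations behind the two schemes, and describe ways to overcome the differences during conversion. They argue that although PDT is less universal and more tightly bound to one language, its rich multi-layer annotation provides all the information needed for basic UD trees, and much more.
Link: https://arxiv.org/abs/2606.24337
Authors: Marie Mikulová, Barbora Štěpánková, Daniel Zeman, Jan Štěpánek, Milan Straka, Jan Hajič
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to LREC 2026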
Abstract:Czech has been part of Universal Dependencies since its first release in 2015. It has also been one of the best represented languages, with the Prague Dependency Treebank being order of magnitude larger than most other UD treebanks. More recently, three other datasets from the Prague family were added and the annotations thoroughly revisited, forming the “Prague Dependency Treebank-Consolidated” (PDT-C). In comparison to the original PDT, PDT-C is more than twice as large, but it is also much more diverse in terms of genres and domains. In this paper, we describe the conversion of the new resource to Universal Dependencies. While the two annotation schemes are relatively similar at the first sight, there are numerous small differences in topology of the dependency structures and in granularity of the POS and relation type inventories. We demonstrate a selection of such differences on examples, discuss the diverging motivations, as well as ways to overcome the differences during conversion. We argue that while PDT is less “universal” and more tightly bound to one language, its multi-layer annotation is rich and provides all information needed for basic UD trees, and much more.
[NLP-36] Transformer-Based Language Models Across Domain Verticals: Architectures Applications and Critical Assessment
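[Quick read]: This review addresses a practical problem: transformer-based language models are released so quickly that practitioners struggle to separate durable ideas from incremental announcements. It works at two levels. At the level of mechanism, it organises the main transformer families into a working taxonomy (encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants) and then covers the post-2023 developments that changed practice: instruction tuning, reinforcement learning from human feedback (RLHF), direct preference optimisation, mixture-of-experts scaling, retrieval augmentation, and the current flagship model families from OpenAI, Anthropic, Google, Meta, Mistral and DeepSeek. At the level of use, it surveys deployments in healthcare, finance, legal, education, customer service, creative writing and scientific work, linking each to the specific capabilities that make a transformer the appropriate tool. The main contribution is a critical assessment built on this survey: it compares architectures on four axes that matter to deployment decisions, quantifies the trade-off between parameter count and energy cost, and discusses how alignment methods, data provenance and benchmark saturation change what it means to call a model state of the art. It closes with a list of research questions that deserve more attention.
Link: https://arxiv.org/abs/2606.24331
Authors: Guruprakash J, Krithika L.B
Affiliation: VITAP
Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: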
Abstract:Transformer-based language models have become the default substrate for natural language processing and the pace of new releases has made it hard for practitioners to separate durable ideas from the noise of incremental announcements. This review works at two levels. At the level of mechanism, we organise the main transformer families into a working taxonomy, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. We then extend the discussion to post-2023 developments that changed the picture in practice: instruction tuning, reinforcement learning from human feedback, direct preference optimisation, mixture-of-experts scaling, retrieval augmentation and the current flagship model families from OpenAI, Anthropic, Google, Meta, Mistral and DeepSeek. At the level of use, we survey deployments across healthcare, finance, legal, education, customer service, creative writing and scientific work. Based on this we link each to the specific capabilities that make a transformer the appropriate tool. The contribution of this paper is a critical assessment that is based on the survey. We compare architectures on four axes that matter to deployment decisions, we quantify the trade-off between parameter count and energy cost. We also discuss how alignment methods, data provenance and benchmark saturation change what it means to call a model “state of the art”. The final section lists the research questions that we think deserve more attention.
[NLP-37] Prague Dependency Treebank – Consolidated 2.0: Enriching a Complex Annotation Scheme LREC2026
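[Quick read]: This paper addresses the systematic inclusion and linking of multiple layers of language in corpus annotation, including a meaning representation with several types of inter-sentential phenomena, especially coreference and discourse relations. It presents the second consolidated version of the Prague Dependency Treebank (PDT-C 2.0), which concludes an almost 30-year project of sustained development and yields a uniformly and coherently annotated, genre-diversified resource of almost 4 million tokens of Czech, accompanied by fully compatible lexicons. Beyond continuous linguistic research, the richly annotated corpus is widely used in international comparisons of traditional and novel NLP tools and in conversions into other formalisms.
Link: https://arxiv.org/abs/2606.24324
Authors: Marie Mikulová, Jiří Mírovský, Milan Straka, Pavlína Synková, Jan Štěpánek, Barbora Štěpánková, Jan Hajič
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to LREC 2026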
Abstract:The Prague Dependency Treebank framework is unique in its attempt to systematically include and link different layers of language, including a meaning representation with several types of inter-sentential phenomena, especially coreference and discourse relations. We present its second consolidated version (PDT-C 2.0), which concludes almost 30-years long project of sustained development of the resource to a uniformly and coherently annotated, genre-diversified, almost 4 million token language resource of Czech language, with accompanying fully compatible lexicons. In addition to continuous linguistic research, the richly linguistically annotated corpus is also widely used in international comparisons of the development of traditional and novel NLP tools as well as in conversions into other formalisms. The corpus and the trained parsers are available under the CC BY-NC-SA licence.
[NLP-38] AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression
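[Quick read]: This paper addresses the performance bottleneck in long-form audio-video understanding caused by limited context windows and severe information redundancy. The challenge is to compress a very large multimodal input while keeping the semantically important information, so that it fits what the LLM can process. The proposed framework, AVOC, inserts a learnable token compression module between the modality encoders and the LLM backbone and reframes multimodal token compression as a top-K retrieval problem: given a fixed context budget, retrieve the compact subset of tokens that best supports answering the user query. Drawing on three classical Information Retrieval criteria (relevance, importance, and diversity), AVOC instantiates each as a mechanism tailored to audio-video understanding and integrates them into a unified retrieval-style compression pipeline. AVOC surpasses the second-best model by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively, and maintains robust performance on the Audio-Video Needle-in-a-Haystack task at durations up to one hour.
Link: https://arxiv.org/abs/2606.24286
Authors: Yijing Chen, Wenhui Tan, Xiaoyi Yu, Yuyue Wang, Xin Cheng, Kaisi Guan, Hao Jiang, Xiangyang Li, Guojie Zhu, Ruihua Song
Affiliation: Gaoling School of Artificial Intelligence, Renmin University of China; Huawei Inc.
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: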
Abstract:Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To address these bottlenecks, we propose AVOC, a framework for long-form audio-video understanding in Omni-modal Large Language Models. AVOC introduces a learnable token compression module between the modality encoders and the LLM backbone. We reframe multimodal token compression as a top-K retrieval problem: given a fixed context budget, the module must retrieve a compact subset of tokens that best supports answering the user query. We draw inspiration from three classical Information Retrieval criteria for selecting informative units from a large candidate pool: relevance, importance, and diversity. AVOC instantiates each criterion as a tailored mechanism for audio-video understanding, and integrates them into a unified retrieval-style compression pipeline. Experiments show that AVOC achieves state-of-the-art performance on long-form audio-video benchmarks, surpassing the second-best model by 4.9 and 5.5 points in average accuracy on OmniVideoBench and LVOmniBench, respectively. Moreover, AVOC maintains robust performance on Audio-Video Needle-in-a-Haystack task at durations up to one hour.
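To make the "compression as top-K retrieval" framing concrete, here is a minimal non-learned sketch. AVOC's module is learnable and its actual mechanisms are not given in the abstract; the greedy scoring rule, the weights, and the precomputed `importance` scores below are assumptions chosen only to show how relevance, importance, and diversity can be combined under a fixed budget `k`.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_tokens(tokens, query, importance, k, w_rel=1.0, w_imp=0.5, w_div=1.0):
    """Greedily pick k token indices under a fixed context budget.

    score = relevance to the query + query-independent importance
            - redundancy with tokens already selected (diversity term).
    """
    selected = []
    candidates = list(range(len(tokens)))

    def score(i):
        redundancy = max((cosine(tokens[i], tokens[j]) for j in selected), default=0.0)
        return w_rel * cosine(tokens[i], query) + w_imp * importance[i] - w_div * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # keep temporal order for the LLM


tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
kept = select_tokens(tokens, query=[1.0, 0.0], importance=[0.2, 0.2, 0.9, 0.1], k=2)
```

In this toy input, token 1 is almost a duplicate of token 0, so the diversity term pushes the second slot to the high-importance token 2 instead.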
[NLP-39] CALIBER: Calibrating Confidence Before and After Reasoning in Language Models
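[Quick read]: This paper addresses the calibration of confidence estimates in reasoning language models. Existing methods elicit confidence only once, either before thinking or after answering, and do not distinguish what confidence should refer to in each information state: before reasoning, confidence should estimate the chance that the model solves the prompt (prompt-level success), whereas after reasoning it should predict whether the realized answer is correct (answer-level correctness). The proposed method, CALIBER (Calibration Before and After Reasoning), elicits both estimates and supervises each with the target matched to its information state. On BigMathDigits with the 7B model, CALIBER reduces Expected Calibration Error (ECE) by 52.5% over the strongest single-confidence baseline while achieving the best Brier score and AUROC and staying within 2.1 points of the best accuracy; on a 30B model it achieves the best ECE on BigMathDigits and stays competitive in Brier score and AUROC. Out of distribution, it achieves the best ECE and Brier score on GPQA and TriviaQA and remains competitive on SimpleQA. Ablations show that this position-target alignment helps most under distribution shift.
Link: https://arxiv.org/abs/2606.24281
Authors: Conor Finlay, Joshua Kurien, Saurabh Dash, Marzieh Fadaee, Beyza Ermis
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: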
Abstract:Reasoning language models are increasingly asked not only to answer difficult questions, but also to estimate their likelihood of success. Existing methods typically elicit confidence only once: either before thinking or after answering. We argue that confidence in reasoning models is state-dependent: before thinking, confidence should estimate the chance of the model correctly solving the prompt, while after thinking it should predict whether the realized answer is likely to be correct. This distinction determines the appropriate supervision target: prompt-level success should supervise confidence estimates made after seeing the prompt, while individual answer-level correctness should supervise confidence estimates made after answering. We introduce CALIBER (Calibration Before and After Reasoning), which elicits both estimates and supervises each with the target matched to its information state. Under this unified protocol, CALIBER reduces Expected Calibration Error (ECE) by 52.5% over the strongest single-confidence baseline on BigMathDigits for the 7B model, while achieving the best Brier score and AUROC, and remains within 2.1 points of the best accuracy. Further, on a larger 30B model, CALIBER achieves the best ECE on BigMathDigits while remaining competitive in Brier score and AUROC. Out of distribution, it achieves the best ECE and Brier score on GPQA and TriviaQA, and remains competitive on SimpleQA. Ablations further show that this position-target alignment is most beneficial under distribution shift where it consistently reduces calibration error across all out-of-distribution benchmarks.
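The two supervision targets and the ECE metric can be sketched in a few lines. The abstract does not say how prompt-level success is estimated, so the empirical success rate over several sampled answers is assumed here; `supervision_targets` is a hypothetical helper, and the ECE uses the standard equal-width binning.

```python
def supervision_targets(rollout_correct):
    """Targets for one prompt given 0/1 correctness of several sampled answers.

    pre-reasoning target  : prompt-level success rate (assumed: mean over samples)
    post-reasoning targets: correctness of each individual answer
    """
    pre = sum(rollout_correct) / len(rollout_correct)
    post = [float(c) for c in rollout_correct]
    return pre, post


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE with equal-width bins: sum over bins of (|B|/N) * |acc(B) - conf(B)|."""
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[index].append((conf, outcome))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(o for _, o in bucket) / len(bucket)
            ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


pre_target, post_targets = supervision_targets([1, 0, 1, 1])
overconfident_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A prompt solved in three of four samples gets a pre-reasoning target of 0.75, while each realized answer keeps its own 0/1 target; reporting 0.9 confidence on answers that are right 75% of the time gives an ECE of about 0.15.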
[NLP-40] Pigeonholing: Bad prompts hurt models to collapse and make mistakes
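[Quick read]: This paper studies how bad contexts cause performance degradation and mode collapse in the in-context learning of Large Language Models (LLMs), a phenomenon the authors call "pigeonholing". Such contexts arise without any malicious intent, for example when a user suggests an incorrect solution or when the conversation already contains the assistant's earlier incorrect responses. The model then repeats wrong answers from the context, converges on a narrow set of answers without exploring alternatives, and flips its stance to align with earlier claims, which hurts performance on tasks such as mathematical justification and code generation. As a step toward mitigation, the authors propose RLVR with synthetic errors, which improves models by 43-60% under bad contexts compared with vanilla RLVR baselines.
Link: https://arxiv.org/abs/2606.24267
Authors: Hyunji Nam, Keertana Chidambaram, Dorottya Demszky, Natasha Jaques
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: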
Abstract:While in-context learning is generally shown to be effective in Large Language Models (LLMs), bad contexts can cause performance degradation and mode collapse, a phenomenon we call “pigeonholing.” Unintentionally bad contexts can happen without malicious jailbreaking intents: For example, a user asks the model to justify an incorrect math theorem or fails to correct the model’s buggy code. Specifically, we investigate “pigeonholing” in two scenarios: (1) when the user suggests a solution, and (2) when the conversation context includes the assistant’s previous (incorrect) responses. Our experiments across 10 verifiable and open-ended tasks with 10 different models show that pigeonholing manifests in several ways: (1) repeating the incorrect answers from context (leading to 38-40% performance drop), (2) converging on a narrow set of answers in coding and text generation without exploring alternatives, and (3) flipping stance on controversial topics to align with the user or the assistant’s previous claims. We find that pigeonholing worsens almost monotonically with the number of conversation turns (performance drops by additional 14+% as repeated mistakes increase from 1 to 5), and pigeonholing-induced mode collapse can happen even when the provided example is correct. As a step toward mitigation, we propose RLVR with synthetic errors which improves models by 43-60% under bad contexts compared to vanilla RLVR baselines.
[NLP-41] SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization ACL2026
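[Quick read]: This paper addresses three compounding problems that fine-tuned encoders face when deployed across heterogeneous NLP tasks: mismatched inductive biases, class-imbalance corruption of feature statistics, and the lack of a mechanism to condition attention on external lexical knowledge. It proposes SurgELLM, a unified transformer framework with three dedicated lightweight modules: (1) a surgical feature gate, a learned per-dimension sigmoid over curated lexical indicators and the [CLS] token that provably degenerates to the identity when the features are uninformative; (2) task-conditioned prefix tokens, which prepend quantized feature values and the task identity to every input; and (3) Instance-Weighted Normalization (IWN), which removes class-prior bias from the gate statistics. The authors also prove an excess-risk bound linking the benefit of the gate to surgical feature alignment. Across four tasks (SST-2, multi-hop retrieval, LLM-prompt attribution, and authorship detection), covering 17,830 examples and eleven model variants over three seeds, the IWN variant achieves macro-F1 0.940, which is +0.036 over the strongest non-IWN baseline and +0.130 on authorship detection. A random-vocabulary control (-0.028 average F1) confirms that the gains are lexical rather than parametric. Code, vocabularies, and a 99.5%-recovery auto-extraction recipe are released.
Link: https://arxiv.org/abs/2606.24259
Authors: Noor Islam S. Mohammad, Ulug Bayazit
Affiliation: Istanbul Technical University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), ACL 2026, San Diego, California, USA. Available at this https URL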
Abstract:Fine-tuned encoders deployed across heterogeneous NLP tasks face three compounding problems: mismatched inductive biases, class-imbalance corruption of feature statistics, and no mechanism to condition attention on external lexical knowledge. We introduce SurgELLM, a unified transformer framework that addresses each with a dedicated lightweight module: a surgical feature gate (learned per-dimension sigmoid over curated lexical indicators and [CLS]; provably degenerates to identity when features are uninformative), task-conditioned prefix tokens (quantized feature values and task identity prepended to every input), and Instance-Weighted Normalization (IWN; removes class-prior bias from gate statistics). We prove an excess-risk bound linking gate benefit to surgical feature alignment. Across four tasks, SST-2, multi-hop retrieval, LLM-prompt attribution, and authorship detection, covering 17,830 examples and eleven model variants over three seeds, the IWN variant achieves macro-F1 0.940 (+0.036 over the strongest non-IWN baseline; +0.130 on authorship detection). A random-vocabulary control (-0.028 avg. F1) confirms gains are lexical, not parametric. Code, vocabularies, and a 99.5%-recovery auto-extraction recipe are released.
[NLP-42] Decoherence as Defence and the Magnitude of Noise Regularisation: A Rigorous N-Qubit Theory of Stochastic Quantum Neural Networks for Adversarially Robust Network Intrusion Detection
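[Quick read]: This paper resolves two questions left open by earlier work on stochastic quantum neural networks (SQNNs) for intrusion detection: whether an effective regulariser exists, and whether the loose adversarial-robustness bound can be replaced by a predictive theory. On the first, the earlier study found that the depolarising channel does not act as a dropout-style regulariser and instead behaves as output noise; this paper studies per-gate stochastic deactivation ("true quantum dropout") as the alternative. On the second, it gives an N-qubit formulation through the stochastic master equation and its vectorised Liouvillian and proves a decoherence-contraction theorem: a depolarising channel of strength γ over L entangling layers contracts every weight-w Pauli read-out by a factor (1−4γ/3)^(wL), which is (1−4γ/3)^L for the weight-1 read-out used here. On the real NSL-KDD dataset under white-box FGSM and PGD attacks, a depolarising SQNN trained with the channel is significantly more robust than the noiseless circuit (ℓ∞ PGD-20, p=0.04, large effect), never suffers the catastrophic robustness collapse seen in the noiseless model and in gradient-trained classical detectors (which fall from 95% to 47%), and cuts robustness variance roughly twofold. The analysis attributes this robustness to a noise-reshaped training boundary rather than to attack-time gradient contraction. For generalisation, the paper derives an adaptive-penalty formula showing that per-gate dropout implements a curvature-weighted L₂ penalty in weight space whose strength is maximised at p=1/2, whereas depolarising noise implements an output-space penalty. A 30-seed study confirms the prediction: both mechanisms reduce the train-test gap by a small but statistically significant margin (≈0.01; p<10⁻⁴ and p=0.004), are statistically indistinguishable from each other, and act mainly where overfitting is largest; raising the dropout rate past 1/2 does not help. The single-seed dichotomy reported in prior work does not survive replication. The paper closes with a neutral-atom realisation and a feasibility-by-N analysis.
Link: https://arxiv.org/abs/2606.24219
Authors: Gautier-Edouard Edouard Filardo (CREOGN)
Affiliation: unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: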
Abstract:Stochastic quantum neural networks (SQNNs) encode neuronal activations as qubits, synaptic topology as entanglement, and neural noise through a Lindblad master equation. A recent conference study applied a ring-entangled SQNN to collaborative intrusion detection and reached three conclusions: ring entanglement is essential for non-local anomaly detection; an adversarial-resilience bound holds but is conservative; and the depolarising channel fails to act as a dropout-style regulariser, behaving instead as output noise. It left open whether a per-gate stochastic deactivation (“true quantum dropout”) could regularise where the depolarising channel could not, and whether the loose robustness bound could be replaced by a predictive theory. This paper resolves both and extends the framework to real data and to neutral-atom hardware. We give an $N$-qubit formulation through the stochastic master equation and its vectorised Liouvillian, and prove a decoherence-contraction theorem: a depolarising channel of strength $\gamma$ over $L$ entangling layers contracts every weight-$w$ Pauli read-out by a factor $(1-4\gamma/3)^{wL}$ (for the weight-$1$ read-out used here, $(1-4\gamma/3)^{L}$); building on the general noise-as-defence result of Du et al., we make this quantitative and operational for intrusion detection. On the real NSL-KDD dataset under white-box FGSM and PGD attacks, a depolarising SQNN trained with the channel is, over seven seeds under strong $\ell_\infty$/$\ell_2$ attacks, significantly more robust than the noiseless circuit ($\ell_\infty$ PGD-$20$, $p=0.04$, large effect) and, critically, never suffers the catastrophic robustness collapse that the noiseless model and gradient-trained classical detectors (which fall from $95\%$ to $47\%$) do, cutting robustness variance roughly twofold; we show this robustness arises from a noise-reshaped training boundary rather than from attack-time gradient contraction.
For generalisation, we derive an adaptive-penalty formula showing that per-gate dropout implements a curvature-weighted $L_2$ penalty $\tfrac{p(1-p)}{2}\sum\theta^2\partial^2_\theta L$ in weight space, maximised at $p=1/2$, whereas depolarising noise implements an output-space penalty. A $30$-seed study confirms the formula’s quantitative prediction: both mechanisms reduce the train-test gap by a small but statistically significant margin ($\approx 0.01$; $p<10^{-4}$ and $p=0.004$), are statistically indistinguishable from each other, and the effect is concentrated where overfitting is largest; increasing the dropout rate past $1/2$ does not help, as the formula predicts. The single-seed dichotomy of prior work does not survive replication. We close with a neutral-atom realisation and a feasibility-by-$N$ analysis.
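The two closed-form quantities quoted in the abstract are easy to evaluate numerically. The sketch below simply codes the contraction factor for a weight-w Pauli read-out and the p(1−p)/2 prefactor of the dropout penalty; the function names and the grid scan are illustrative, and nothing here simulates a quantum circuit.

```python
def contraction_factor(gamma, layers, weight=1):
    """Factor (1 - 4*gamma/3) ** (weight * layers) by which a weight-`weight`
    Pauli read-out shrinks after `layers` depolarising layers of strength gamma."""
    return (1.0 - 4.0 * gamma / 3.0) ** (weight * layers)


def dropout_penalty_prefactor(p):
    """Prefactor p(1-p)/2 of the curvature-weighted L2 penalty for dropout rate p."""
    return p * (1.0 - p) / 2.0


# Scan dropout rates on a grid to locate the strongest penalty.
grid = [i / 100 for i in range(101)]
strongest_rate = max(grid, key=dropout_penalty_prefactor)
```

At gamma = 0.75 the single-layer factor is exactly zero (the read-out carries no signal), and the grid scan places the strongest penalty at p = 0.5, matching the statement that raising the dropout rate past 1/2 does not help.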
[NLP-43] Co-occurring associated retained concepts in Diffusion Unlearning ICLR2026
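[Quick read]: This paper addresses a side effect of unlearning in diffusion models: when a target concept is erased, benign concepts that co-occur with it (such as "person" when erasing nudity) are often suppressed as well, which degrades the model's generation ability. The authors define these concepts, which must be preserved, as CARE (Co-occurring Associated REtained concepts) and introduce the CARE score, a general metric that quantifies their preservation across unlearning tasks. On this basis they propose ReCARE (Robust erasure for CARE), a framework that automatically constructs the CARE-set, a curated vocabulary of benign co-occurring tokens extracted from target images, and uses it during training to erase only the target concept while keeping the related benign concepts stable. Experiments on several target concepts (Nudity, Van Gogh style, and the Tench object) show that ReCARE achieves overall state-of-the-art performance in balancing robust concept erasure, overall utility, and CARE preservation.
Link: https://arxiv.org/abs/2606.24192
Authors: Miso Kim, Georu Lee, Yunji Kim, Hoki Kim, Jinseong Park, Woojin Lee
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted as a poster at ICLR 2026. Code available at this https URL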
Abstract:Unlearning has emerged as a key technique to mitigate harmful content generation in diffusion models. However, existing methods often remove not only the target concept, but also benign co-occurring concepts. As illustrated in Fig.1, unlearning nudity can unintentionally suppress the concept of person, preventing a model from generating images with person. We define these undesirably suppressed co-occurring concepts that must be preserved CARE (Co-occurring Associated REtained concepts). Then, we introduce the CARE score, a general metric that directly quantifies their preservation across unlearning tasks. With this foundation, we propose ReCARE (Robust erasure for CARE), a framework that explicitly safeguards CARE while erasing only the target concept. ReCARE automatically constructs the CARE-set, a curated vocabulary of benign co-occurring tokens extracted from target images, and leverages this vocabulary during training for stable unlearning. Extensive experiments across various target concepts (Nudity, Van Gogh style, and Tench object) demonstrate that ReCARE achieves overall state-of-the-art performance in balancing robust concept erasure, overall utility, and CARE preservation.
[NLP-44] A Synthetic Reliability-Aware PINN Benchmark for Offshore Wind Turbine Support-Structure Monitoring with Bayesian Inverse Identification
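[Quick read]: This paper addresses fast and reliable structural health monitoring (SHM) of offshore wind turbine (OWT) support structures from sparse measurements. High-fidelity finite element or aeroelastic analyses are hard to use directly in online monitoring loops, while purely data-driven surrogates can require large training sets. The proposed benchmark, Digi Turbine, is a synthetic reliability-aware Physics Informed Neural Network (PINN) workflow for monopile support structures: it embeds a simplified Euler-Bernoulli beam equation with a Winkler soil foundation in the training objective, couples it with Bayesian-prior-informed inverse identification, and adds First Order Reliability Method (FORM) screening. All validation uses synthetic configurations with analytical or finite-difference ground truth, motivated by the NREL 5MW reference turbine context.
Link: https://arxiv.org/abs/2606.24176
Authors: Puneet Kant, Monika Tanwar
Affiliation: Indian Institute of Technology Jodhpur
Categories: Computation and Language (cs.CL); Computation (stat.CO)
Comments: 18 Pages, 8 Figures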
Abstract:Reliable structural health monitoring (SHM) of offshore wind turbine (OWT) support structures requires fast state estimation from sparse measurements. Repeated high fidelity finite element or aeroelastic analyses are difficult to use directly in online monitoring loops, while purely data-driven surrogates can require large training sets. This paper presents Digi Turbine, a synthetic reliability-aware Physics Informed Neural Network (PINN) benchmark for OWT monopile support structure monitoring. The workflow embeds a simplified Euler Bernoulli beam equation with Winkler soil foundation in the training objective, couples it with Bayesian-prior-informed inverse identification, and adds First Order Reliability Method (FORM) screening. All validation uses synthetic configurations with analytical or finite-difference ground truth motivated by the NREL 5MW reference turbine context.
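A physics-informed loss of the kind described can be sketched as the mean squared residual of the governing equation at collocation points. The sketch assumes the static form EI·w'''' + k·w = q(x) for a beam on a Winkler foundation, uses central finite differences in place of the automatic differentiation a real PINN would use, and picks nondimensional parameter values for illustration; none of these details come from the paper.

```python
import math

# Illustrative, nondimensional parameters (assumptions, not the paper's values).
EI, K_SOIL, Q0, LENGTH = 1.0, 10.0, 1.0, 1.0


def load(x):
    """Sinusoidal distributed load q(x)."""
    return Q0 * math.sin(math.pi * x / LENGTH)


def analytical_deflection(x):
    """Exact solution of EI*w'''' + k*w = q for the sinusoidal load above."""
    return load(x) / (EI * (math.pi / LENGTH) ** 4 + K_SOIL)


def physics_residual(w, x, h=1e-2):
    """Residual EI*w'''' + k*w - q at x, with a central-difference w''''."""
    w4 = (w(x - 2 * h) - 4 * w(x - h) + 6 * w(x) - 4 * w(x + h) + w(x + 2 * h)) / h ** 4
    return EI * w4 + K_SOIL * w(x) - load(x)


def physics_loss(w, points):
    """Mean squared residual over collocation points (the PINN physics term)."""
    return sum(physics_residual(w, x) ** 2 for x in points) / len(points)


collocation = [0.1 * i for i in range(1, 10)]
loss_exact = physics_loss(analytical_deflection, collocation)
loss_zero = physics_loss(lambda x: 0.0, collocation)
```

The exact deflection drives the physics term close to zero (up to discretisation error), while a trivially wrong candidate such as w(x) = 0 is penalised; in a PINN this term is added to the data-misfit term from the sparse measurements.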
[NLP-45] A Pāninian Foundation for Indic Language Processing
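[Quick read]: This position paper addresses the fragmented and underdeveloped state of natural language processing (NLP) infrastructure for Indic languages. Current tools and benchmarks are organised around individual languages or small subsets of genealogical families, so each language gets its own analyzers, parsers, and datasets, with little sharing or transfer. The authors argue that this overlooks a deep regularity: through more than two millennia of convergence around Sanskrit, Indic languages came to share a morphosyntactic architecture formalized in Pānini's grammar, the Astādhyāyī, and this architecture cuts across genealogical lines. They propose a four-part benchmark suite grounded in this Pāninian framework to make the shared architecture explicit and measurable, effectively merging many sparse and apparently disparate Indic language resources into a single high-resource metalanguage bedrock, with the aim of making systems more accurate, more data-efficient, and more transferable. The framework also raises a question for interpretability research: whether neural models trained on these languages come to represent Pānini's categories on their own.
Link: https://arxiv.org/abs/2606.24172
Authors: Ritwik Banerjee, Lav R. Varshney
Affiliation: Stony Brook University; AI Innovation Institute, Stony Brook University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 0 figures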
Abstract:More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped. The cause is structural: the field organizes its tools and benchmarks around individual languages or small subsets of genealogical language families, building separate analyzers, parsers, and datasets for each language and starting over for the next. This overlooks a deep regularity. Through more than two millennia of convergence around Sanskrit, Indic languages came to share a morphosyntactic architecture formalized in Pānini’s grammar, the Astādhyāyī. This cuts across genealogical lines, uniting languages through a common framework. We argue that this Pāninian framework supplies a unifying computational architecture the field has lacked, and that benchmarks grounded explicitly in it would make Indic language systems more accurate, more data-efficient, and more transferable, effectively merging many apparently disparate and sparse Indic language resources into a single high-resource metalanguage bedrock. We propose a four-part benchmark suite to render this shared architecture explicit, measurable, and ready to be leveraged for practical applications. Moreover, we underscore the question it raises for interpretability research: whether neural models trained on these languages come to represent Pānini’s categories on their own.
[NLP-46] CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking
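[Quick read]: This paper addresses reliable provenance for LLM outputs, which requires multi-bit watermarks that remain detectable after editing or paraphrasing while keeping strict control of the false-positive rate (FPR). Existing watermarks based on error-correcting codes (ECC) rely largely on hard-decision decoding and discard token-level reliability information. The proposed method, CORE-BREW, is a Constant-hit-Rate Embedding extension of block-wise BREW: it calibrates the watermark channel by targeting a fixed hit rate p*, which yields closed-form per-token log-likelihood ratios (LLRs) for soft-decision decoding. It supports two detection modes: Strict-Safe, which preserves the bounded-distance designated-codeword acceptance region, and FPR-Calibrated, which uses likelihood-based scoring and lightweight list decoding to characterize the trade-off between FPR and true-positive rate (TPR). Experiments on open-source LLMs under token-level edits and paraphrasing show improved low-FPR discrimination and robustness over prior multi-bit watermarking baselines at comparable semantic quality.
Link: https://arxiv.org/abs/2606.24163
Authors: Joeun Kim, HoEun Kim, Young-Sik Kim
Affiliation: DGIST
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: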
Abstract:Reliable provenance for LLM outputs requires multi-bit watermarks that remain robust under editing while maintaining strict false-positive control. Existing ECC-based LLM watermarks rely largely on hard-decision decoding, discarding token-level reliability information. We propose CORE-BREW, a Constant-hit-Rate Embedding extension of block-wise BREW for robust multi-bit watermarking. CORE-BREW calibrates the watermark channel by targeting a fixed hit rate p-star, yielding closed-form per-token log-likelihood ratios (LLRs) for principled soft-decision decoding. It supports two detection modes: Strict-Safe, which preserves the bounded-distance designated-codeword acceptance region, and FPR-Calibrated, which uses likelihood-based scoring and lightweight list decoding to characterize the FPR-TPR trade-off. Experiments on open-source LLMs under token-level edits and paraphrasing demonstrate improved low-FPR discrimination and robustness over prior multi-bit watermarking baselines while maintaining comparable semantic quality.
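Why a fixed hit rate gives closed-form LLRs can be shown with a simplified channel model. The sketch assumes each token is a binary observation that agrees with the embedded bit with probability p* and disagrees with probability 1 − p* (a binary symmetric channel); this model, and the helper names, are assumptions for illustration and are not taken from the paper.

```python
import math


def token_llr(hit, p_star):
    """log P(obs | bit=1) / P(obs | bit=0) for one token under the assumed
    binary symmetric channel with agreement probability p_star."""
    step = math.log(p_star / (1.0 - p_star))
    return step if hit else -step


def bit_llrs(observations, p_star):
    """Sum per-token LLRs for each message-bit position.

    observations: iterable of (bit_position, hit) pairs.
    The sign gives the hard decision; the magnitude is the reliability that a
    soft-decision ECC decoder can exploit.
    """
    totals = {}
    for position, hit in observations:
        totals[position] = totals.get(position, 0.0) + token_llr(hit, p_star)
    return totals


observed = [(0, True), (0, True), (0, False), (1, False), (1, False)]
llrs = bit_llrs(observed, p_star=0.8)
hard_bits = {position: int(value > 0) for position, value in llrs.items()}
```

Hard-decision decoding would keep only `hard_bits`; the soft values additionally show that bit 1 (two agreeing tokens) is more reliable than bit 0 (two hits against one miss), which is the information a soft decoder uses when edits flip some tokens.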
[NLP-47] BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
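[Quick read]: This paper addresses the lack of a systematic understanding of how well foundation models perform across diverse behavioral science tasks, contexts, and populations. Evaluations usually focus on individual-level prediction accuracy and overlook whether model outputs match the population-level distribution, which is essential for behavioral validity. The authors introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models along four core capabilities: behavior prediction and simulation, strategic decision-making, subject-trait inference, and behavioral knowledge application, at both the individual and distributional levels. The results reveal a considerable gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models fine-tuned on behavioral data achieve substantially stronger distributional alignment. Using the benchmark tasks, the authors further develop this http URL-1.5, which leads on distributional metrics and remains competitive on individual-level metrics, suggesting that proper behavioral adaptation can close the gap. The work establishes BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems.
Link: https://arxiv.org/abs/2606.24162
Authors: Jin Huang, Yutong Xie, Wanli Song, Xingjian Zhang, Walter Yuan, Matthew O. Jackson, Qiaozhu Mei
Affiliation: University of Michigan; MobLab; Stanford University; Santa Fe Institute
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: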
Abstract:Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject experiment simulation, there remains no systematic understanding of how well they perform across diverse behavioral science tasks, contexts, and populations. We introduce BehaviorBench, a comprehensive benchmark that evaluates foundation models along four core capabilities: (1) behavior prediction and simulation, (2) strategic decision-making, (3) subject-trait inference, and (4) behavioral knowledge application. Crucially, BehaviorBench evaluates model outputs at both the individual and distributional levels, capturing not only per-subject accuracy but also population-level alignment, an essential requirement for behavioral validity. Leveraging the tasks in BehaviorBench, we further develop this http URL-1.5, extending the this http URL family of behavioral foundation models fine-tuned on behavioral data. Our results reveal a considerable gap: proprietary general-purpose models excel at individual-level prediction and knowledge-intensive tasks, whereas behavioral foundation models, fine-tuned on behavioral data, achieve substantially stronger distributional alignment. Notably, this http URL-1.5 leads on distributional metrics and remains competitive on individual-level metrics, suggesting that proper behavioral adaptation can close the gap. Our results highlight the importance of distributional evaluation, establish BehaviorBench as a foundation for developing and assessing behaviorally aligned AI systems, and demonstrate this http URL-1.5’s potential for a broad range of behavioral science studies. Our BehaviorBench and this http URL-1.5 models can be accessed via this https URL.
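The difference between individual-level and distributional evaluation can be seen in a toy example. The abstract does not name the benchmark's metrics, so per-subject accuracy and total variation distance between response distributions are used here as stand-ins; the response data are made up for illustration.

```python
from collections import Counter


def accuracy(predicted, observed):
    """Individual level: fraction of subjects whose response is predicted exactly."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def total_variation(predicted, observed):
    """Distributional level: 0.5 * sum |P(k) - Q(k)| over response options k."""
    p_counts, q_counts = Counter(predicted), Counter(observed)
    options = set(p_counts) | set(q_counts)
    return 0.5 * sum(
        abs(p_counts[k] / len(predicted) - q_counts[k] / len(observed)) for k in options
    )


human = ["A"] * 6 + ["B"] * 4
always_mode = ["A"] * 10                  # predicts the majority answer for everyone
right_mix = ["B"] * 4 + ["A"] * 6         # correct proportions, wrong subjects

mode_scores = (accuracy(always_mode, human), total_variation(always_mode, human))
mix_scores = (accuracy(right_mix, human), total_variation(right_mix, human))
```

The always-majority predictor scores higher on accuracy (0.6) but misses the population distribution (distance 0.4), while the second predictor reproduces the distribution exactly despite low per-subject accuracy (0.2), which is why the two levels are reported separately.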
[NLP-48] MedBench v5: A Dynamic Process-Oriented and Hallucination-Aware Benchmark for Clinical Multimodal Models
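[Quick read]: This paper addresses shortcomings of existing medical AI benchmarks in process visibility, quantitative evaluation of atomic skills, and hallucination detection. Its core contribution is MedBench v5, a redesigned evaluation framework for clinical multimodal models (language models, vision-language models, and agent systems) that shifts from static question answering (QA) to dynamic, process-oriented evaluation. The key innovations are: a dual-dimensional framework that combines Clinical Cognitive Responsiveness (14 sub-dimensions) with Medical Atomic Skills (4 agent environments) and covers 63 tasks; three switchable information-flow stressors (omission, contradiction, and evidence delay) that enable factorized analysis of performance degradation; a dynamic process audit protocol with five reasoning nodes that produces model-specific failure fingerprints; and hallucination propagation monitoring that tracks hallucinations through initiation, propagation, anchoring, and contradiction interaction, exposing silent hallucinations. Experiments show that although frontier models score well on overall task performance, their process stability is vulnerable to the stressors, particularly in contradiction detection, diagnosis updating, hallucination propagation, and contradiction-based self-correction, while final evidence grounding can remain only superficially stable. MedBench v5 provides a unified infrastructure for capability profiling, controllable stress testing, process auditing, and hallucination trajectory analysis in clinical AI evaluation.
Link: https://arxiv.org/abs/2606.24155
Authors: Ding Jinru,Jiang Chuchu,Lu Lu,Pang Wenrao,Bian Mouxiao,Gao Zhuangzhi,Chen Jiangyuan,Peng Xinwei,Chen Ruiyao,Ren Sijie,Lu Renjie,Han Bin,Liu Meiling,Xu Jie
Affiliations: Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL)
Comments: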
Abstract:Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language, and agent systems) that moves from static QA to dynamic, process-oriented evaluation. MedBench v5 features: (1) a dual-dimensional framework combining Clinical Cognitive Responsiveness (14 sub-dimensions) and Medical Atomic Skills (4 agent environments), covering 63 tasks; (2) three switchable information-flow stressors (omission, contradiction, evidence delay) for factorized degradation analysis; (3) a dynamic process audit protocol with five reasoning nodes that produces model-specific failure fingerprints; (4) hallucination propagation monitoring across initiation, propagation, anchoring, and contradiction interaction, capturing silent hallucination. Experiments on frontier models show that strong overall task performance does not guarantee process stability: stressors mainly disrupt contradiction detection, diagnosis updating, hallucination propagation, and contradiction-based self-correction, while final evidence grounding can remain superficially stable. MedBench v5 provides a unified infrastructure for capability profiling, controllable stress testing, process auditing, and hallucination trajectory analysis in clinical AI evaluation.
[NLP-49] Metis: Bridging Text and Code Memory for Self-Evolving Agents
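[Quick read]: This paper addresses a core tension in how self-evolving agents represent experience memory: existing systems fix the representation at design time, as either natural-language text or callable code, rather than deriving it from the characteristics of the experience itself, so the trade-offs between text memory and code memory remain poorly understood. The proposed solution is Metis, a system built on a hierarchical dual-representation memory. Metis organizes textual experience into execution plans, environment facts, and common pitfalls, and crystallizes recurring plans into validated callable tools only when repeated reuse justifies it. This combines the broad applicability of text memory with the execution efficiency of code memory while incurring tool-generation cost only when needed. On AppWorld, Metis improves task accuracy by up to 20.6% over ReAct while reducing execution cost by up to 22.8%.
Link: https://arxiv.org/abs/2606.24151
Authors: Zijie Dai,Siuhin He,Hui Li,Qihui Zhou,Jiajun Li,Mingcong Song,Guoping Long,Hongjie Si,Xin Yao,Lin Zhang,James Cheng,Xiao Yan
Affiliations: The Chinese University of Hong Kong; Huawei; Wuhan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress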
Abstract:Self-evolving agents improve over time by distilling experience from past executions and reusing it in future tasks. Existing systems represent such experience either as natural-language text injected into the agent context or as code exposed as callable tools. However, the choice between these representations is typically made at design time rather than derived from the characteristics of the experience itself, leaving the trade-offs between them poorly understood. We present the first controlled study that isolates text memory and code memory over an identical set of experiences. Our results show that the two forms exhibit complementary trade-offs in construction cost, execution efficiency, and transferability, such that neither representation alone is sufficient. Guided by these findings, we propose Metis, a self-evolving agent system built on a hierarchical dual-representation memory. Metis organizes textual experience into execution plans, environment facts, and common pitfalls, and selectively crystallizes recurring plans into validated callable tools. This design combines the broad applicability of text memory with the execution efficiency of code memory while incurring tool-generation cost only when justified by repeated reuse. We evaluate Metis on AppWorld, a challenging benchmark for interactive agents. The results show that Metis improves task accuracy by up to 20.6% over ReAct while reducing execution cost by up to 22.8%. Compared with representative self-evolving agent systems, Metis consistently achieves a better balance between accuracy, execution efficiency, and memory-construction cost.
[NLP-50] Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning
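[Quick read]: This paper addresses dynamic optimization of the training-data mixture during large language model (LLM) pre-training. Existing Online Data Mixing (ODM) methods rely on a single optimization perspective and so cannot account for the dynamics of data composition along multiple dimensions. To overcome this, the paper proposes the Holistic Data Scheduler (HDS), an online data mixing framework. HDS formulates data scheduling as a reinforcement learning problem in a continuous control space and uses the Soft Actor-Critic (SAC) algorithm for stable, sample-efficient exploration of the high-dimensional policy space. At its core is a multi-objective, holistic reward function that combines a data-driven reward for quality, a loss-driven reward capturing inter-domain influence, and a model-driven reward based on weight norms. On The Pile, HDS reaches the final validation perplexity of the next-best method with 44% fewer training iterations, improves MMLU 0-shot by 7.2%, and shows consistent gains on other benchmarks.
Link: https://arxiv.org/abs/2606.24133
Authors: Chenhao Dang,Jing Ma,Mingjie Liao
Affiliations: China Electronics Technology Group Corporation 15th Research Institute; Renmin University of China; Alibaba Group
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Our code is at this https URL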
Abstract:The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mixtures during training, has emerged as a promising direction to improve efficiency. However, existing methods are constrained by their reliance on a singular optimization perspective, which fundamentally overlooks the need for complex LLM pre-training to consider the dynamic data composition from multiple dimensions. To overcome this limitation, we introduce the Holistic Data Scheduler (HDS), a novel online data mixing framework. HDS formulates the data scheduling challenge as a reinforcement learning problem in a continuous control space and leverages the Soft Actor-Critic (SAC) algorithm for its stability and sample efficiency in exploring the high-dimensional policy space. At the core of HDS lies a novel multi-objective, holistic reward function that integrates three critical perspectives: a data-driven reward for quality, a loss-driven reward capturing inter-domain influence, and a model-driven reward based on weight norms. To validate our design and determine its optimal configuration, we conducted systematic experiments on LLMs of various sizes. On The Pile benchmark, HDS reaches the final validation perplexity of the next best method with 44% fewer training iterations. Furthermore, it achieves a 7.2% improvement on the MMLU 0-shot task along with consistent gains on other benchmarks, showcasing its ability to enhance both training efficiency and final model capability.
[NLP-51] When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs
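[Quick read]: This paper addresses early warning of collapse during parameter-efficient fine-tuning (PEFT) of discrete diffusion language models (DLMs). Existing practice inherits inexpensive diagnostics from denoising-time confidence monitors, but their meaning during PEFT training has not been tested. The study finds that a warning based on top-1 argmax concentration fires for every one of 816 LoRA/PEFT configurations from three DLM families, while the training logs record no actual collapses (0/816) at the 200-step horizon, giving the signal zero precision. The cause is pre-equilibrium saturation: top-1 concentration is already high before optimization begins and quickly becomes insensitive to final training stability. The paper instead evaluates the max LoRA gradient norm, a parameter-side signal that samples gradient routing rather than token concentration. On a pooled held-out LLaDA-family split, a train-optimized threshold identifies configurations in the top decile of final loss with precision 0.68 and F1 = 0.79, above the all-positive top-1 baseline even at the lower split-bootstrap confidence bound. Autoregressive controls and cross-family threshold failures, however, limit the result to short-horizon DLM-LoRA inspection rather than a universal collapse detector. The recommended workflow: drop top-1 concentration as a PEFT alarm, log the max gradient norm early in training, and calibrate thresholds per DLM family before routing runs for inspection.
Link: https://arxiv.org/abs/2606.24119
Authors: Lucky Verma,Pratik Yadav
Affiliations: University of Maryland, Baltimore County
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 3 figures. Code and result artifacts: this https URL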
Abstract:Discrete diffusion language model (DLM) fine-tuning inherits inexpensive diagnostics from denoising-time confidence monitors, but their PEFT-training meaning is untested. We test top-1 argmax concentration as a collapse warning. Across 816 LoRA/PEFT configurations from three DLM families, the warning fires for every configuration while logs record 0/816 actual collapses at the 200 step horizon, giving zero precision. The cause is pre-equilibrium saturation: top-1 concentration is already high before optimization and quickly becomes insensitive to final training stability. We then evaluate max LoRA gradient norm, a parameter-side signal that samples gradient routing rather than token concentration. On a pooled held-out LLaDA-family split, a train-optimized threshold identifies top-decile final-loss configurations with precision 0.68 and F1=0.79, above the all-positive top-1 baseline even at the lower split-bootstrap confidence bound. Autoregressive controls and cross-family threshold failures bound the result to short-horizon DLM-LoRA inspection rather than a universal collapse detector. Workflow: drop top-1 as a PEFT alarm, log max-gradient early in training, and calibrate thresholds per DLM family before routing runs for inspection.
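The threshold calibration described in this abstract can be illustrated with a minimal sketch. It assumes one max-gradient-norm value per run and a boolean label marking runs in the top decile of final loss; the function names, the F1-maximizing selection rule, and all numbers are illustrative assumptions, not the authors' code or data.

```python
def precision_recall_f1(flags, labels):
    """Precision, recall, and F1 for boolean alarm flags against boolean labels."""
    tp = sum(1 for f, y in zip(flags, labels) if f and y)
    fp = sum(1 for f, y in zip(flags, labels) if f and not y)
    fn = sum(1 for f, y in zip(flags, labels) if not f and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def calibrate_threshold(grad_norms, labels):
    """Pick the gradient-norm threshold that maximizes F1 on the training split."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(grad_norms)):
        flags = [g >= t for g in grad_norms]
        _, _, f1 = precision_recall_f1(flags, labels)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


# Toy, made-up numbers (not from the paper): max LoRA gradient norm per run,
# and whether the run ended in the top decile of final loss.
train_norms = [0.2, 0.3, 0.4, 1.5, 1.8, 0.25, 2.1, 0.35]
train_bad = [False, False, False, True, True, False, True, False]
threshold = calibrate_threshold(train_norms, train_bad)

# An "always fire" alarm, like a saturated monitor, flags every run.
always_fire = [True] * len(train_bad)
baseline = precision_recall_f1(always_fire, train_bad)
calibrated = precision_recall_f1([g >= threshold for g in train_norms], train_bad)
```

An alarm that fires on every run has recall 1 but precision equal to the base rate, which is why a calibrated threshold can beat it. As a rough side calculation, if the reported precision 0.68 and F1 0.79 are taken at face value, the implied recall is P·F1 / (2P − F1) ≈ 0.94; the inputs are rounded, so this is only approximate.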
[NLP-52] PORTER: Language-Grounded Event Representations for Portable Structured EHR Foundation Models
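[Quick read]: This paper addresses a representational limit of electronic health record (EHR) foundation models: because they encode clinical events as discrete tokens from a fixed vocabulary, they cannot directly represent events containing unseen concepts or new combinations of concepts and attributes such as numeric values, which limits transfer across institutions and even across deployment pipelines within one institution. The proposed solution is PORTER, a language-grounded structured EHR foundation model that decouples event representation from the fixed vocabulary. PORTER represents events through their descriptions with a frozen text encoder, integrates numeric values through a dedicated pathway, and learns clinical dynamics over patient timelines with an autoregressively pretrained temporal backbone. Across 74 clinical prediction tasks at a pediatric hospital, PORTER matched the mean AUROC of a fixed-vocabulary model with the same backbone and pretraining objective; when the same timelines were rendered with event descriptions unseen during pretraining, it transferred without retraining or vocabulary mapping and recovered 97.1% of the mean AUROC of a model trained directly on the target vocabulary. On MIMIC, PORTER outperformed the fixed-vocabulary model, which dropped 69% of events because their tokens were unseen. Mechanistic analyses show that cross-vocabulary transfer tracks preservation of patient-level representation geometry rather than text-encoder scale, and that the numeric pathway improves sensitivity to magnitude without disrupting clinical concept identity. PORTER also achieved higher AUROC than a task-specific text serialization comparator at 329-fold lower amortized compute. It is a step toward vocabulary-independent EHR foundation models that reduce the need for vocabulary harmonization while preserving in-domain performance and enabling efficient cross-task reuse.
Link: https://arxiv.org/abs/2606.24102
Authors: Lin Lawrence Guo,Adam Paul Yan,Emily Vettese,Lillian Sung
Affiliations: The Hospital for Sick Children (SickKids); SickKids Enterprise-wide Data in Azure Repository (SEDAR); Beth Israel Deaconess Medical Center (BIDMC); MIMIC (Medical Information Mart for Intensive Care)
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: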
Abstract:Most electronic health record (EHR) foundation models encode clinical events as discrete event tokens from a fixed vocabulary and therefore cannot directly represent events containing unseen concepts or new combinations of concepts and attributes such as numeric values. This limits transfer across institutions and even across deployment pipelines within the same institution. We introduce PORTER, a language-grounded structured EHR foundation model that decouples event representation from this fixed vocabulary. PORTER represents events through their descriptions using a frozen text encoder, integrates numeric values through a dedicated pathway, and learns clinical dynamics over patient timelines with an autoregressively pretrained temporal backbone. Across 74 clinical prediction tasks at a pediatric hospital, PORTER matched the mean AUROC of a fixed-vocabulary model with the same temporal backbone and pretraining objective. When the same patient timelines were rendered using event descriptions not seen during pretraining, PORTER transferred without retraining or vocabulary mapping, recovering 97.1% of the mean AUROC of a model trained directly on the target vocabulary. When transferred to MIMIC, PORTER outperformed the fixed-vocabulary model, which dropped 69% of events because their tokens were unseen. Mechanistic analyses showed cross-vocabulary transfer tracked preservation of patient-level representation geometry rather than the scale of the text encoder, and the numeric pathway improved sensitivity to magnitude without disrupting clinical concept identity. PORTER also achieved higher AUROC than a task-specific text serialization comparator, at 329-fold lower amortized compute. PORTER is a step toward vocabulary-independent EHR foundation models that reduce the need for vocabulary harmonization while preserving in-domain performance and enabling efficient cross-task reuse.
[NLP-53] Predicting Poets Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems
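[Quick read]: This paper asks whether the geographic origin of Tang-dynasty poets leaves a detectable linguistic trace in their poetry, that is, whether poets from different regions show distinguishable regional features. The approach builds a poet-level corpus of 357 poets by aggregating each poet's works in the Complete Tang Poems (Quan Tang Shi) and linking poets to their circuit of origin among the ten Tang circuits via the China Biographical Database (CBDB), then frames origin prediction as multi-class classification. Using character n-gram TF-IDF together with interpretable domain features (imagery, season, and allusion), classical and neural models predict broad region (South vs. North) at 0.69 accuracy, well above the 0.53 majority baseline, and predict the finer circuit-level origin above chance. Key findings: (i) linguistic distance between circuits grows with geographic distance (Mantel r = 0.40, p ≈ 0.09 over nine circuits), evidence of a distance-decay effect in poetic language; (ii) the signal changes over time, with South/North separability at chance in the High Tang and strongest in the Late Tang, consistent with court-driven homogenization at the empire's height followed by regional divergence; (iii) the model's confident errors are historically meaningful: in the Early Tang, every misclassification is a southern poet read as northern, reflecting the prestige of the northern court idiom. In addition, the classical-Chinese pretrained model GuwenBERT only matches simple TF-IDF, and combining the two adds nothing, indicating that character n-grams already capture the regional signal. The results position interpretable machine learning as a hypothesis generator for literary history.
Link: https://arxiv.org/abs/2606.24093
Authors: Chi-Sheng Chen,Hung-Yun Liu
Affiliations: Harvard University; University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: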
Abstract:We ask whether the geographic origin of Tang-dynasty poets leaves a detectable linguistic trace in their work. Aggregating every poem attributed to each author in the Complete Tang Poems (Quan Tang Shi) and linking poets to their administrative circuit of origin via the China Biographical Database (CBDB), we build a poet-level corpus of 357 poets across the ten Tang circuits and frame origin prediction as multi-class classification. Using character n-gram TF-IDF together with interpretable domain features (imagery, season, and allusion), classical and neural models predict a poet’s broad region (South vs. North) at 0.69 accuracy, well above the 0.53 majority baseline, and finer circuit-level origin above chance. Beyond classification, three findings emerge. (i) Linguistic distance between circuits grows with geographic distance (Mantel r = 0.40, p ≈ 0.09 over nine circuits), evidence of a distance-decay effect in poetic language. (ii) The signal interacts with time: South/North separability is at chance in the High Tang and strongest in the Late Tang, consistent with court-driven homogenization at the empire’s height followed by regional divergence. (iii) The model’s confident errors are historically meaningful: in the Early Tang, every misclassification is a southern poet read as northern, reflecting the prestige of the northern court idiom. We further show that, when given the whole corpus through a hierarchical frozen-encoder representation, a classical-Chinese transformer (GuwenBERT) only matches, and does not beat, simple TF-IDF, and that combining them adds nothing, indicating that character n-grams already capture the regional signal. Our results position interpretable machine learning as a hypothesis generator for literary history.
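The distance-decay finding rests on a Mantel test, which correlates two distance matrices and assesses significance by permuting the rows and columns of one of them together. The following is a minimal standard-library sketch of that general procedure; the toy matrices, the one-sided p-value, and the permutation count are assumptions for illustration and do not reproduce the paper's data or exact settings.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Mantel r and a one-sided permutation p-value for two distance matrices."""
    rng = random.Random(seed)
    n = len(dist_a)
    flat_b = upper_triangle(dist_b)
    observed = pearson(upper_triangle(dist_a), flat_b)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of dist_a together so it stays a valid distance matrix.
        permuted = [[dist_a[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), flat_b) >= observed:
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return observed, p_value


# Toy data: four "circuits" on a line; linguistic distance grows with geographic distance.
geo = [[abs(i - j) for j in range(4)] for i in range(4)]
ling = [[0.1 * abs(i - j) ** 2 for j in range(4)] for i in range(4)]
r, p = mantel(geo, ling)
```

With only a handful of regions the number of distinct permutations is small, so p-values are coarse; this is one reason a moderate correlation over nine circuits can come with a p-value near 0.09.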
[NLP-54] Blockwise Policy-Drift Gating for On-Policy Distillation
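[Quick read]: This paper addresses the fragility of sampled-token on-policy distillation (OPD) on long-horizon reasoning tasks caused by policy drift. When rollouts generated by the student are reused for training, the student keeps changing, so its distribution moves away from that of the behavior policy that produced the rollouts, which distorts the teacher signal and hurts stability and final solve rate. The proposed fix is blockwise policy-drift gating, a lightweight, student-only mechanism. It computes log-probability shifts between the behavior student and the current student along the sampled token path, aggregates them over fixed blocks (for example 64 tokens), and uses the resulting detached, mean-normalized gates to reweight the per-position OPD losses. It does not change teacher targets, teacher top-K supports, or the rollout policy. In a six-variant Qwen3 math reasoning benchmark with a uniform 200-step training budget and pass@8 as the primary metric, fixed 64-token block gating raises the mean pass@8 of sampled-token OPD from 0.4978 to 0.5160 across AIME24, AIME25, MATH500, and AMC23, and under Teacher-TopK/LSM, Block64 gives the best four-benchmark mean among trained students. The results identify local old-current policy drift as a practical control signal for reused rollouts and motivate block-level gating as a simple default for improving solve-rate robustness.
Link: https://arxiv.org/abs/2606.24084
Authors: Liwen Zheng,Haiyun Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages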
Abstract:On-policy distillation (OPD) trains a student policy using teacher signals computed on trajectories sampled by the student itself. Recent work shows that sampled-token OPD can be fragile on long-horizon reasoning tasks and that local teacher-support matching is a simple and effective repair. This paper introduces blockwise policy-drift gating, a lightweight student-only old-current drift controller for OPD under rollout reuse. The method computes log-probability shifts between the behavior student and the current student on the sampled token path, aggregates these shifts over fixed blocks or spans, and uses the resulting detached, mean-normalized gates to reweight OPD position losses. It does not change teacher targets, teacher top-K supports, or the rollout policy. In a six-variant Qwen3 math reasoning benchmark with a uniform 200-step training budget for all trained variants, we use pass@8 as the primary problem-level solve-rate metric. Fixed 64-token block gating improves sampled-token OPD mean pass@8 from 0.4978 to 0.5160 across AIME24, AIME25, MATH500, and AMC23. On Teacher-TopK/LSM, Block64 gives the best four-benchmark mean pass@8 among trained students. The results identify local old-current policy drift as a practical control signal for reused OPD rollouts and motivate block-level gating as a simple default for improving solve-rate robustness.
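The gating recipe in this abstract (per-token log-probability shifts, aggregated over fixed blocks, turned into mean-normalized gates that reweight position losses) can be sketched as follows. The abstract does not state the gate function, so the `exp(-drift)` shape and the use of absolute shifts are assumptions made here purely for illustration, as are the function names.

```python
import math


def blockwise_gates(behavior_logps, current_logps, block_size=64):
    """Per-token gates from old-vs-current log-probability drift, shared within each block."""
    assert len(behavior_logps) == len(current_logps)
    shifts = [abs(c - b) for b, c in zip(behavior_logps, current_logps)]
    raw = []
    for start in range(0, len(shifts), block_size):
        block = shifts[start:start + block_size]
        block_drift = sum(block) / len(block)
        # Assumed gate shape: blocks with larger drift get smaller weight.
        gate = math.exp(-block_drift)
        raw.extend([gate] * len(block))
    mean_gate = sum(raw) / len(raw)
    # Mean-normalize so the gates average to 1 and the overall loss scale is unchanged.
    return [g / mean_gate for g in raw]


def gated_loss(position_losses, gates):
    """Reweight per-position OPD losses with the gates and average."""
    return sum(l * g for l, g in zip(position_losses, gates)) / len(position_losses)


# Toy example with block_size=2: the second block has drifted, the first has not.
old = [-1.0, -1.0, -1.0, -1.0]
new = [-1.0, -1.0, -2.0, -2.0]
gates = blockwise_gates(old, new, block_size=2)
loss = gated_loss([0.5, 0.5, 0.5, 0.5], gates)
```

In an autograd framework the gates would be computed without gradient tracking (the abstract calls them detached), so they act only as weights. Because the gates average to 1, a uniform loss is left unchanged; the reweighting matters only when losses differ across blocks.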
[NLP-55] CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression
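[Quick read]: This paper examines how compressing input and output text affects the trade-off between inference cost and performance in large language models. The widely promoted caveman style ("Talk short. Drop grammar. Save token.") reduces cost or not depending on which channel is compressed: the user's prompt or the model's response. The paper presents Cavewoman, a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and agreement with the model's own unconstrained reference text. Output compression cuts realized cost on most API models (1.4-2.4x per model, up to 3x in the best case) and on all four open-weight models under public-tier pricing. Input compression is a strict lose-lose: it raises net cost rather than lowering it (about 1.15x on the five-benchmark mean, up to 1.8x on the worst dataset and 2.7x under stronger compression), because models compensate with longer responses even as accuracy collapses. Under the same setting, surface text diverges from the unconstrained reference while some generations remain correct, and this divergence survives length-controlled re-scoring, multiple-comparisons correction, and complementary semantic measures. The takeaway is to evaluate input and output compression separately, recognize the counterintuitive cost of input compression, and prefer output compression for real savings.
Link: https://arxiv.org/abs/2606.24083
Authors: Morayo Danielle Adeyemi,Ryan A. Rossi,Franck Dernoncourt
Affiliations: Adobe Research (Adobe)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: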
Abstract:“Talk short. Drop grammar. Save token.” This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel (the user’s prompt or the model’s response) is being compressed. We present Cavewoman, a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and reference-text agreement against the model’s unconstrained reference. We evaluate eight models on five datasets at five reduction levels, with both channels measured on the same items. Output compression cuts realized cost on most API models (1.4-2.4x per model, up to 3x in the best case) and on all four open-weight models under public-tier pricing. Input compression has the opposite effect, a strict lose-lose: it raises net cost rather than lowering it (~1.15x on the five-benchmark mean, up to 1.8x on the worst dataset and 2.7x under stronger compression), because models compensate with longer responses even as accuracy collapses. Under the same setting, surface text diverges from the unconstrained reference: on the non-reasoning models, roughly half of all generations are correct yet their surface text no longer entails the model’s own unconstrained baseline generation. The divergence survives length-controlled re-scoring, multiple-comparisons correction, and replication under complementary semantic measures. Code and data are available at this https URL.
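The mechanism behind the lose-lose result can be illustrated with simple token accounting. The sketch below assumes that per-item cost is input tokens times an input price plus output tokens times an output price; that cost model, the prices, and the token counts are hypothetical and are not taken from the paper.

```python
def realized_cost(input_tokens, output_tokens, price_in, price_out):
    """Per-item cost: tokens on each channel times that channel's per-token price."""
    return input_tokens * price_in + output_tokens * price_out


# Hypothetical prices and token counts, chosen only to illustrate the mechanism.
PRICE_IN, PRICE_OUT = 1.0, 4.0  # arbitrary units per token; output priced higher

baseline = realized_cost(200, 300, PRICE_IN, PRICE_OUT)
# Output compression: same prompt, shorter response.
output_compressed = realized_cost(200, 120, PRICE_IN, PRICE_OUT)
# Input compression: shorter prompt, but the model compensates with a longer response.
input_compressed = realized_cost(100, 380, PRICE_IN, PRICE_OUT)

output_ratio = baseline / output_compressed  # > 1 means cheaper than baseline
input_ratio = input_compressed / baseline    # > 1 means more expensive than baseline
```

When output tokens cost more than input tokens, a modest increase in response length can outweigh a large saving on the prompt, which is the pattern the abstract reports for input compression.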
[NLP-56] Sentence-Level Contextual Entrainment in Large Language Models
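[Quick read]: This paper studies contextual entrainment, a recently discovered phenomenon in large language models (LLMs), at the sentence level. Prior work considered probability increases only for individual tokens; this paper extends the phenomenon to sentences by examining the per-token mean log-probability of a sentence, across 26 LLMs from seven families and two datasets covering both subjective and objective tasks. It finds that sentences appearing in the prompt, even counterfactual statements, become significantly more probable at inference time, and that the effect gradually decreases as model size increases. The key finding is that only 2% to 4% of attention heads control the phenomenon, and turning off these heads mitigates contextual entrainment without hurting the model's performance.
Link: https://arxiv.org/abs/2606.24077
Authors: Yang Liu,Chenhui Chu
Affiliations: Kyoto University
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 3 figures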
Abstract:Contextual entrainment, which is a newly discovered phenomenon in large language models (LLMs), refers to the tendency of a model to assign higher probabilities to tokens that appear in its context. In this work, we extend this phenomenon from the token level to the sentence level by examining the per-token mean log-probability of a sentence instead of the probabilities of individual tokens. We investigate sentence-level contextual entrainment across 26 LLMs from seven families and two datasets, which cover both subjective and objective tasks. We find that sentence-level contextual entrainment exists. This means that the sentences in the prompt (even if they are counterfactual statements) can significantly increase their probability during model inference time. As the model size increases, contextual entrainment gradually decreases. We also find that contextual entrainment is controlled by 2% to 4% of the attention heads. Turning off these attention heads can effectively mitigate contextual entrainment without hurting the model’s performance.
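The sentence-level measure named in the abstract, the per-token mean log-probability, is straightforward to compute once token log-probabilities are available. The sketch below assumes those log-probabilities have already been obtained from a model; the comparison as a simple difference and all numbers are illustrative assumptions, not the paper's exact metric.

```python
def mean_logprob(token_logprobs):
    """Per-token mean log-probability of a sentence, given its token log-probabilities."""
    return sum(token_logprobs) / len(token_logprobs)


def entrainment_gain(logprobs_with_context, logprobs_without_context):
    """How much the sentence's mean log-probability rises when it appears in the prompt."""
    return mean_logprob(logprobs_with_context) - mean_logprob(logprobs_without_context)


# Made-up token log-probabilities for one three-token sentence, scored twice:
# once with the sentence present in the prompt and once without it.
with_context = [-0.5, -1.0, -0.3]
without_context = [-2.0, -2.5, -1.5]
gain = entrainment_gain(with_context, without_context)
```

Averaging over tokens rather than summing keeps sentences of different lengths comparable, which a raw sequence log-probability would not.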
[NLP-57] VieSpeaker: A Large-Scale Vietnamese Speaker Recognition Dataset Beyond Visual Dependency INTERSPEECH2026
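[Quick read]: This paper addresses the shortage of resources for Vietnamese speaker recognition, in particular the limited scale and acoustic diversity of existing corpora. Large-scale datasets usually rely on facial cues to link speech to speaker identities, which restricts collection to recordings where speakers appear on camera. The paper proposes a face-independent dataset construction pipeline and introduces VieSpeaker, a large-scale Vietnamese speaker recognition dataset. The key idea is to use textual metadata and large language model (LLM) reasoning to infer speaker identities from transcripts and contextual information, removing the dependence on visual information. VieSpeaker contains approximately 902 hours of speech from 4,715 speakers, and models trained on it show improved robustness and generalization compared with existing Vietnamese datasets. The work demonstrates the feasibility of face-independent dataset construction and offers a new direction for building large-scale speech resources.
Link: https://arxiv.org/abs/2606.24066
Authors: Viet Hoang Pham,Tran Trung Nguyen,Bao Thu Ho,Phuong Tuan Dat,Thi Thu Trang Nguyen
Affiliations: Hanoi University of Science and Technology, Vietnam
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure, 6 tables, Accepted at Interspeech 2026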
Abstract:Speaker recognition has advanced rapidly with large-scale training datasets, yet Vietnamese remains under-resourced, with existing corpora limited in scale and acoustic diversity. Most large-scale datasets rely on facial cues to link speech with speaker identities, restricting data collection to recordings where speakers appear on camera. We propose a face-independent dataset construction pipeline and introduce VieSpeaker, a large-scale Vietnamese speaker recognition dataset. Our approach leverages textual metadata and large language model reasoning to infer speaker identities from transcripts and contextual information. VieSpeaker contains approximately 902 hours of speech from 4,715 speakers. Experiments show that models trained on VieSpeaker achieve improved robustness and generalization compared to existing Vietnamese datasets. This work demonstrates the feasibility of face-independent dataset construction and provides a new direction for building large-scale speech resources.
[NLP-58] Selective Capability Unlearning in End-to-End Spoken Language Understanding
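[Quick read]: This paper addresses a problem that arises when specific functionalities (an intent and its associated slot-generation behavior) must be removed from deployed spoken language understanding (SLU) systems because of policy or safety constraints. In autoregressive models, suppressing a target intent does not eliminate the conditional mapping that generates slots conditioned on that intent: when the intent prefix is supplied externally, the model can reconstruct the original intent-slot structure. The paper calls this failure capability persistence. The proposed solution is Binding Subspace (BSU), a representation-level framework that isolates and attenuates the intent-conditioned directions in the hidden representations that underlie this mapping. Across SLU benchmarks, BSU substantially reduces forced-prefix recoverability while preserving performance on retained intents.
Link: https://arxiv.org/abs/2606.24063
Authors: Akanksha Singh,Vinod Kumar Kurmi
Affiliations: Indian Institute of Science Education and Research Bhopal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, preprint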
Abstract:Modern spoken language understanding (SLU) systems are increasingly deployed in real-world settings, where specific functionalities may need to be removed due to policy or safety constraints. In SLU, a functionality corresponds to an intent and its associated slot-generation behavior. However, in autoregressive models, suppressing a target intent does not eliminate the conditional mapping that generates slots conditioned on that intent. When the intent prefix is externally supplied, the model can reconstruct the original intent-slot structure. We identify this structural failure as capability persistence. We propose Binding Subspace (BSU), a representation-level framework that isolates and attenuates intent-conditioned directions underlying this mapping. Across SLU benchmarks, BSU substantially reduces forced-prefix recoverability while preserving retained performance.
[NLP-59] Best Preprocessing Techniques for Sentiment Analysis
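[Quick read]: This paper addresses a gap in research on sentiment analysis of Twitter datasets: the effect of the order of preprocessing steps on model performance has not been studied systematically, even though preprocessing is critical for reducing noise and improving efficiency. Its empirical analysis identifies the relative importance of the steps and their best order: tokenisation is the most impactful technique and spelling correction the least; stemming and stop-word removal are interchangeable, and stop words are better removed without removing negation. The best order is tokenisation, text cleaning, stemming, then stop-word removal. This gives practitioners a systematic preprocessing procedure that improves model output without a costly exploratory phase.
Link: https://arxiv.org/abs/2606.24055
Authors: Saranzaya Magsarjav,Melissa Humphries,Jonathan Tuke,Lewis Mitchell
Affiliations: Adelaide University
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures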
Abstract:Sentiment analysis in Twitter datasets is important because it enables monitoring public opinion on products and analysis of political and social movements. One critical step is preprocessing: the automated processing of text for machine learning algorithms. Preprocessing plays a critical role in reducing noise and improving efficiency. However, little research has systematically examined the order in which preprocessing techniques are implemented. We find that, when accounting for order, spelling correction is the least impactful preprocessing technique, whereas tokenisation is the most impactful. Stemming and stop-word removal are interchangeable, and it is better to remove stop words without removing negation. The best order for applying the preprocessing techniques was tokenisation, text cleaning, stemming, and then stopword removal. Our results provide a systematic approach for practitioners to deploy preprocessing to improve model output without the costly preprocessing exploratory phase.
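The reported best order (tokenisation, text cleaning, stemming, then stop-word removal that keeps negation) can be sketched with the standard library alone. The stop-word list, the cleaning rules, and the suffix stripper below are toy stand-ins chosen for illustration; they are assumptions, not the tools used in the paper.

```python
import re

# Small illustrative stop-word list; negation words are deliberately left out of it.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "i"}
NEGATIONS = {"not", "no", "never"}


def tokenise(text):
    """Split on whitespace."""
    return text.split()


def clean(tokens):
    """Lowercase, drop URLs and @mentions, and strip non-letter characters."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        if tok.startswith("http") or tok.startswith("@"):
            continue
        tok = re.sub(r"[^a-z]", "", tok)
        if tok:
            cleaned.append(tok)
    return cleaned


def stem(tokens):
    """Toy suffix stripper standing in for a real stemmer such as Porter's."""
    out = []
    for tok in tokens:
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out


def remove_stop_words(tokens):
    """Drop stop words but always keep negation words."""
    return [t for t in tokens if t in NEGATIONS or t not in STOP_WORDS]


def preprocess(text):
    """Apply the reported best order: tokenise, clean, stem, remove stop words."""
    return remove_stop_words(stem(clean(tokenise(text))))


result = preprocess("@bob I was NOT loving the phones!! http://t.co/x")
```

Keeping "not" matters for sentiment: without it, "not loving" and "loving" would reduce to the same tokens.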
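[NLP-60] Towards Version-aware Operations and Transaction Memories for Multi-layer MeMo
[Quick read]: This paper addresses the cost of retraining language models whenever knowledge changes, asking how model knowledge can be updated efficiently, traceably, and reversibly without retraining the whole model. It builds on MeMo, which proposes language models with explicit multi-layer correlation matrix memories (CMMs) in which memorization, retrieval, and forgetting are architectural operations. For changes expressible as MeMo memory associations, the paper proposes a version-aware operation layer that compiles high-level operations such as replace, obsolete, keep-history, rollback, and trace into MeMo-native primitive calls over sequences and tokens, and observes that such an operation is rarely a single association but an ordered transaction of primitive edits. The framework introduces two auxiliary CMMs: a Version CMM (V-CMM) that maps version transitions to transaction handles, and a Transaction CMM (T-CMM) that stores reusable change contents and inverse programs. It supports both direct sequence-level edits and structured diff-level inputs, and outlines an evaluation route covering update success, rollback, traceability, locality, and transaction reuse.
Link: https://arxiv.org/abs/2606.24040
Authors: Peiran Li
Affiliations: Freie Universität Berlin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: Accepted by MeMo Workshop on Mechanistic Interpretability Neuro-symbolic Approaches by-design, Rome (Italy), 24/6/2026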
Abstract:MeMo proposes language models with explicit multi-layer correlation matrix memories (CMMs), where memorization, retrieval, and forgetting are architectural operations. This paper asks how such memories can reduce the need for retraining when knowledge changes. For changes expressible as MeMo memory associations, the model’s accessible knowledge can be updated by editing explicit memories rather than retraining the whole model. We propose a version-aware operation layer in which high-level operations such as replace, obsolete, keep-history, rollback, and trace are compiled into MeMo-native primitive calls over sequences and tokens. The key observation is that a version-aware operation is rarely a single MeMo association. It is an ordered transaction of primitive edits, for example forgetting one sequence-token chain, memorizing another, preserving a historical chain, and recording an inverse program. The framework introduces two auxiliary CMMs: a Version CMM (V-CMM) for mapping version transitions to transaction handles, and a Transaction CMM (T-CMM) for storing reusable change contents and inverse programs. It supports both direct sequence-level edits and structured diff-level inputs, and outlines an evaluation route for update success, rollback, traceability, locality, and transaction reuse.
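The idea that a version-aware operation is an ordered transaction of primitive edits with a recorded inverse program can be illustrated with a toy model. In the sketch below a plain Python dictionary stands in for the memory; it is not a correlation matrix memory, and the class, method names, and handle format are hypothetical illustrations rather than the paper's interface.

```python
class ToyMemory:
    """Dictionary stand-in for an editable associative memory (not a real CMM)."""

    def __init__(self):
        self.store = {}         # key sequence -> value token
        self.transactions = {}  # transaction handle -> inverse program
        self.versions = {}      # (old_version, new_version) -> transaction handle

    def memorize(self, key, value):
        self.store[key] = value

    def forget(self, key):
        self.store.pop(key, None)

    def retrieve(self, key):
        return self.store.get(key)

    def replace(self, key, new_value, old_version, new_version):
        """Compile a high-level replace into ordered primitives plus an inverse program."""
        old_value = self.retrieve(key)
        handle = f"tx-{len(self.transactions)}"
        inverse = [("forget", key)]
        if old_value is not None:
            inverse.append(("memorize", key, old_value))
        # Forward program: forget the old association, then memorize the new one.
        self.forget(key)
        self.memorize(key, new_value)
        self.transactions[handle] = inverse
        self.versions[(old_version, new_version)] = handle
        return handle

    def rollback(self, old_version, new_version):
        """Undo a version transition by replaying its stored inverse program."""
        handle = self.versions[(old_version, new_version)]
        for op, *args in self.transactions[handle]:
            getattr(self, op)(*args)


memory = ToyMemory()
memory.memorize("capital of X", "A")
memory.replace("capital of X", "B", old_version="v1", new_version="v2")
after_replace = memory.retrieve("capital of X")
memory.rollback("v1", "v2")
after_rollback = memory.retrieve("capital of X")
```

The `versions` and `transactions` dictionaries loosely mirror the roles the abstract assigns to the V-CMM (version transition to handle) and T-CMM (handle to stored inverse program).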
[NLP-61] RoPE-Aware Bit Allocation for KV-Cache Quantization
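[Quick read]: This paper addresses accuracy loss in low-bit key-value cache (KV-cache) quantization caused by ignoring the structure of rotary position embeddings (RoPE). Existing quantizers often treat each cached key as a flat vector, but under RoPE a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks, and high-energy blocks are more sensitive to quantization error. The proposed solution is Block-GTQ, a RoPE-aware block-wise bit allocator built on TurboQuant-MSE (TQ-MSE): for each layer and KV head, it computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain, so high-energy blocks receive more bits. Under matched K/V bit budgets, Block-GTQ cuts per-layer mean absolute error (MAE) on RoPE query-key logits by 32-80% at 2 and 3 b/dim K-only quantization and wins all 367/367 layer comparisons against uniform TQ-MSE on a ten-model diagnostic panel. These gains carry over to downstream long-context tasks: at K2V2 on Llama-3.1-8B-Instruct, the six-task NIAH average rises from 70.6 to 97.4 and the LongBench-EN average from 36.87 to 53.31; on AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B and no fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16's 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. The paper also implements a packed-cache serving path: on a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K context, where fp16 runs out of memory.
Link: https://arxiv.org/abs/2606.24033
Authors: Fengfeng Liang,Yuechen Zhang,Jiaya Jia
Affiliations: Hong Kong University of Science and Technology; The Chinese University of Hong Kong; MiMo, Xiaomi Corporation
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint. Code available at this https URL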
Abstract:Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key’s contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits. We introduce Block-GTQ, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE(TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel, cutting per-layer MAE by 32-80% at 2 and 3 b/dim K-only quantization and winning all 367/367 layer comparisons against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4, and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, without an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16’s 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. We further implement a packed-cache serving path. On a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K where fp16 OOMs. Code is available at this https URL.
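Greedy bit allocation by marginal gain, the general technique named in the abstract, can be sketched in a few lines. The distortion model below (error proportional to block energy and shrinking by a factor of 4 per extra bit) is a textbook assumption adopted only for illustration; the paper's actual energy score and gain function are not given in the abstract, and all names and numbers here are hypothetical.

```python
import heapq


def distortion(energy, bits):
    """Assumed distortion model: error shrinks by 4x per extra bit, scaled by block energy."""
    return energy * 4.0 ** (-bits)


def greedy_bit_allocation(energies, total_bits, max_bits=8):
    """Give one bit at a time to the block with the largest marginal distortion reduction."""
    bits = [0] * len(energies)
    # Max-heap via negated gains: (negative marginal gain, block index).
    heap = [(-(distortion(e, 0) - distortion(e, 1)), i) for i, e in enumerate(energies)]
    heapq.heapify(heap)
    for _ in range(total_bits):
        if not heap:
            break
        _, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < max_bits:
            gain = distortion(energies[i], bits[i]) - distortion(energies[i], bits[i] + 1)
            heapq.heappush(heap, (-gain, i))
    return bits


# Toy energies for four RoPE frequency blocks; budget of 2 bits per block on average.
energies = [16.0, 4.0, 1.0, 0.25]
allocation = greedy_bit_allocation(energies, total_bits=8)
uniform_error = sum(distortion(e, 2) for e in energies)
greedy_error = sum(distortion(e, b) for e, b in zip(energies, allocation))
```

Under this assumed model the high-energy blocks end up with more bits than the low-energy ones, and the total modeled error falls below that of the uniform 2-bit allocation at the same budget.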
[NLP-62] Reinforcement Learning Towards Broadly and Persistently Beneficial Models
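[Quick read]: This paper addresses insufficient generalization of model alignment as AI systems are deployed in diverse, high-stakes settings, with particular attention to reinforcement learning (RL), which can introduce unexpected misalignment through reward hacking, deception, or other unintended strategies. The approach constructs a dataset of realistic situations designed to measure and train beneficial traits, such as truthfulness, fairness, risk awareness, and corrigibility, across domains including health, science, and education, and then trains models with RL on this dataset. Evaluated on more than 50 independent benchmarks of alignment and beneficial behavior, beneficial-trait RL improves performance on over 80% of these out-of-distribution benchmarks relative to a compute-matched baseline. Even an intervention limited entirely to the health domain produces broad improvements on non-health alignment evaluations, including reduced reward hacking, deception, and general misalignment. The trained models also show improved alignment persistence, including greater resistance to adversarial prompting and harmful finetuning, although further work is required to isolate the sources of these effects. The results suggest that RL that reinforces beneficial behavior in realistic domains can produce models that are more robustly aligned with human flourishing.
Link: https://arxiv.org/abs/2606.24014
Authors: Akshay V. Jagadeesh,Rahul K. Arora,Khaled Saab,Ali Malik,Mikhail Trofimov,Foivos Tsimpourlas,Johannes Heidecke,Karan Singhal
Affiliations: OpenAI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Blog: this https URL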
Abstract:As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which can introduce unexpected misalignment through reward hacking, deception, or other unintended strategies. We study whether RL on beneficial behavior, instantiated in realistic domains, can produce broad and persistent alignment generalization beyond the training distribution. We construct a dataset of realistic situations designed to measure and train beneficial traits, such as truthfulness, fairness, risk awareness, and corrigibility, spanning varied domains, including health, science, and education. We then train models with RL on this dataset and evaluate them on more than 50 independent benchmarks of alignment and beneficial behavior. Compared to a compute-matched baseline, beneficial trait RL improves performance on over 80% of these out-of-distribution benchmarks. We observe substantial out-of-distribution alignment transfer: a beneficial-behavior RL intervention entirely limited to one domain, health, produces broad improvements on non-health alignment evaluations, including reduced reward hacking, deception, and general misalignment. Finally, we study alignment persistence: whether behavior remains robustly aligned under attempts to steer models towards misalignment. Models trained with beneficial trait RL show improved persistence, including greater resistance to adversarial prompting and harmful finetuning; further work is required to isolate the sources of these effects. These results suggest that RL to reinforce beneficial behavior in realistic domains can produce models that are more robustly aligned with human flourishing.
[NLP-63] Towards Spec Learning: Inference-Time Alignment from Preference Pairs
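Quick read: This paper addresses two costs of steering large language model (LLM) behavior: iterative hand-crafting of prompts, which is involved, brittle, and error-prone, and preference-based fine-tuning, which is more rigorous but often prohibitively expensive. It proposes spec learning, a framework that takes a brief user instruction and a small set of preference judgments and compiles them into specifications written as natural-language prompts. The specifications condition the LLM at inference time, so no parameter updates are needed. On specialized-domain datasets with dense preference signal, responses generated from the compiled specifications often outperform direct preference optimization (DPO). Because the specifications are human-readable, they also serve as an interpretable, transparent written record of the preference signal that produced them.
Link: https://arxiv.org/abs/2606.24004
Authors: Dhriti Krishnan, Tejas Goyal, Jaromir Savelka
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: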
Abstract:Steering a large language model (LLM) toward a desired behavior typically relies on an iterative process of hand-crafting a prompt based on a careful inspection of the model’s responses. This is an involved, brittle, and error-prone process. Preference-based fine-tuning is a more rigorous but often prohibitively expensive solution. We propose spec learning, a framework that relies on a brief user instruction and a small set of preference judgments. These are compiled into specifications in the form of natural-language prompts for an LLM. Specifications condition LLMs at inference time, and no parameter updates to the underlying models are required. We show that the responses generated based on the compiled specifications often outperform direct preference optimization (DPO) on datasets from specialized domains whose preference signal is dense. Unlike opaque weight updates, the resulting specifications are human-readable and double as interpretable and transparent written embodiments of the preference signal that produced them.
[NLP-64] RASC: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring
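Quick read: This paper addresses the authoring of clinical value sets, the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. Prior work on the RASC benchmark showed that direct zero-shot LLM generation suits the task poorly because clinical code systems are large, version-controlled, and not reliably memorized by language models. The paper studies a stage-wise alternative: stage 1 builds a candidate pool optimized for recall, and stage 2 uses a constrained LLM adjudicator for candidate selection. Qwen3-based retrieval with vocabulary-aware expansion and code-display rescue retrieval raises candidate-pool recall from the original RASC baseline of 0.553 to 0.730 on the full test split, and pool recall is 0.655 on the held-out-publisher stratum. Higher recall alone is not enough: the original SAPBert cross-encoder applied to the expanded pool reaches macro F1 of only 0.287 (full test) and 0.233 (held-out publisher). Blinded GPT-5 adjudication over the same pool raises these to 0.549 and 0.533. Retrieval-constrained LLM adjudication therefore improves value set completion substantially while preserving the constraint that every returned code comes from an auditable candidate pool.
Link: https://arxiv.org/abs/2606.23992
Authors: Sumit Mukherjee
Affiliations: Oracle Health, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: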
Abstract:Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark showed that direct zero-shot large language model (LLM) generation is poorly suited to this task: clinical code systems are large, version-controlled, and not reliably memorized by language models. We study a stage-wise alternative in which candidate-pool construction is optimized for recall and a constrained LLM adjudicator is optimized for candidate selection. On the full 3,744-value-set RASC test split, Qwen3-based retrieval with vocabulary-aware expansion and code-display rescue retrieval increases candidate-pool recall from the original RASC retrieval baseline of 0.553 to 0.730; on the held-out-publisher stratum, pool recall is 0.655. The higher-recall pool alone is not sufficient: applying the original SAPBert cross-encoder to this expanded pool gives full-test macro F1 of 0.287 and held-out-publisher macro F1 of 0.233. Replacing the stage-2 selector with blinded GPT-5 adjudication over the same pool increases full-test macro F1 to 0.549 and held-out-publisher macro F1 to 0.533. These results show that retrieval-constrained LLM adjudication can substantially improve value set completion while preserving the safety constraint that all returned codes must come from an auditable candidate pool.
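The two quantities reported above, candidate-pool recall and macro F1, together with the constraint that returned codes must come from the pool, can be sketched as follows. This is a hedged illustration rather than the benchmark's scoring code: the function names are hypothetical, the code strings are placeholders, and macro F1 is assumed to be the unweighted mean of per-value-set F1.

```python
def pool_recall(pool, gold):
    """Fraction of gold codes that appear in the candidate pool."""
    gold = set(gold)
    return len(gold & set(pool)) / len(gold) if gold else 1.0


def constrain_to_pool(selected, pool):
    """Drop any selected code that is not in the auditable candidate pool."""
    pool = set(pool)
    return [code for code in selected if code in pool]


def set_f1(predicted, gold):
    """F1 between a predicted code set and the gold value set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions, golds):
    """Unweighted mean of per-value-set F1 (assumed definition)."""
    scores = [set_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)


# Placeholder codes: "Z" is proposed by the selector but is not in the pool.
pool = ["A", "B", "C"]
gold = ["A", "B", "D", "E"]
returned = constrain_to_pool(["A", "Z"], pool)
```

The sketch also makes the abstract's point visible: stage 2 can never score above what the pool allows, so pool recall bounds the recall of the final value set.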
[NLP-65] Faithful by Construction: Claim-Anchored Attribution for Multi-Document Summarization
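Quick read: This paper addresses hallucination in end-to-end LLM multi-document summarization, together with attributions that are coarse (whole documents or passages) and generated post hoc, which leaves summary statements hard to verify. It revisits the modular Extract–Select–Rewrite paradigm and treats the intermediate representation, atomic claims, as the unit of attribution. The resulting framework, CAMS (Claim-Anchored Multi-document Summarization), (i) extracts atomic claims with token-level provenance from every source document, (ii) clusters equivalent claims across documents and flags inter-source conflicts, (iii) selects a support-aware, salient subset, and (iv) rewrites the selection into a summary in which every sentence is anchored to a support-checked claim that links back to one or more source spans. Because content is localized before it is generated, fine-grained multi-source traceability is preserved structurally, while support-aware selection, constrained rewriting, and verification encourage, rather than guarantee, factual faithfulness. CAMS matches strong end-to-end and span-attribution baselines on summary quality, substantially improves faithfulness and citation precision, lifts multi-source attribution accuracy by roughly two-thirds, and exposes a controllable faithfulness–coverage trade-off that end-to-end models leave implicit.
Link: https://arxiv.org/abs/2606.23989
Authors: Shuo Guan
Affiliations: UBS AG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: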
Abstract:End-to-end large language models (LLMs) produce fluent multi-document summaries but remain prone to hallucination, and the attributions they offer are typically coarse (whole documents or passages) and generated post hoc, leaving each summary statement hard to verify. We revisit the modular Extract–Select–Rewrite paradigm and recast its intermediate representation as the unit of attribution. We present CAMS, a Claim-Anchored Multi-document Summarization framework that (i) extracts atomic claims with token-level provenance from every source document, (ii) clusters equivalent claims across documents while flagging inter-source conflicts, (iii) selects a support-aware and salient subset, and (iv) rewrites the selection into a summary in which every sentence is anchored to a support-checked claim that links back to one or more source spans. Because content is localized before it is realized, the pipeline is attribution-oriented by construction and faithfulness-oriented by construction: it structurally preserves fine-grained, multi-source traceability while using support-aware selection, constrained rewriting, and verification to encourage, rather than guarantee, factual faithfulness. We evaluate quality, faithfulness, and localization on MultiNews, analyze conflict handling on DiverseSumm, and test zero-shot transfer on WCEP, using a two-regime protocol that separates reference-free citation quality from gold-aligned localization accuracy, and we add an evaluator-decoupled audit that tests citation precision with a support model never used for selection or verification. CAMS matches strong end-to-end and span-attribution baselines on summary quality while substantially improving faithfulness and citation precision, lifting multi-source attribution accuracy by roughly two-thirds, and exposing a controllable faithfulness–coverage trade-off that end-to-end models leave implicit.
[NLP-66] Does My Embedding Reflect That A = B? Evaluating Mathematical Equivalence in Embedding Models
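Quick read: This paper asks whether embedding models recognize statements that are mathematically equivalent but phrased very differently. Because mathematics is highly abstract, the same statement can use quite different terminology depending on the subfield, and the growth of formalization resources has increased the need for tools that navigate reliably between mathematical 'languages' (for example, from Lean to natural language). The authors introduce the Mathematically Equivalent but Lexically Different Pairs (MELD) dataset and show that current state-of-the-art embedding models tend to group statements by terminology rather than by the underlying mathematics. In response, they propose a contrastive approach that learns embeddings of mathematical text by aligning informal statements with different formalizations. Experiments show improvements on informal-formal retrieval and also on MELD, which contains only natural-language statements.
Link: https://arxiv.org/abs/2606.23959
Authors: Jiaying Ye, Samarth Rao, Leo Carlin, Kedar Chintalapati, Saharsh Bhargava, Rachit Jaiswal, Michael Zhou, Jared Darlington, Jarod Alper, Vasily Ilin, Henry Kvinge
Affiliations: University of Washington; Pacific Northwest National Laboratory
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, comments welcome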
Abstract:Because mathematics is highly abstract, a single statement can take very different forms depending on what subfield it is framed in. There are many examples where breakthroughs occurred after researchers discovered that a question had already been answered in a different field. At the same time, the growth of new resources related to formalization has increased the need for tools that enable efficient and reliable navigation between mathematical ‘languages’ (e.g., from Lean to natural language). In this paper, we investigate whether current embedding models capture mathematical equivalence. To do this, we introduce the Mathematically Equivalent but Lexically Different Pairs (MELD) Dataset, a collection of mathematically equivalent statements that are expressed in very different language. We show that current state-of-the-art embedding models tend to group statements by the terminology used to make them instead of the underlying math. Motivated by this, we propose a contrastive approach to learning embeddings of mathematical text that focuses on aligning informal statements with different formalizations. Our experiments demonstrate that this leads to improvements not only on informal-formal retrieval tasks but also on MELD, which only contains natural language statements.
[NLP-67] Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English INTERSPEECH2026
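Quick read: This paper examines how modern speech models internally represent consonant cluster reduction (CCR) in African American English (AAE), a widespread phonological process and a source of automatic speech recognition (ASR) disparity, and in particular whether CCR is encoded as simple segmental deletion or as structured phonological variation. The authors run speaker-independent, layer-wise probing of a self-supervised model (wav2vec2-base) and a supervised model (Whisper-small) on two tasks: segmental reduction detection and segmental restoration of the underlying cluster identity. Both models separate reduced from canonical forms with high accuracy, and reduced segments retain cues to their underlying stops. This indicates that CCR is encoded as structured, gradient phonological variation rather than simple segment loss.
Link: https://arxiv.org/abs/2606.23948
Authors: Hamid Mojarad, Kevin Tang
Affiliations: Heinrich Heine University Düsseldorf; University of Florida
Categories: Computation and Language (cs.CL)
Comments: This paper has been accepted for presentation at Interspeech 2026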
Abstract:Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is consonant cluster reduction (CCR) in African American English (AAE), a widespread phonological process and a source of automatic speech recognition (ASR) disparity. To examine how CCR is represented, we conduct speaker-independent layer-wise probing of wav2vec2-base and Whisper-small using two tasks: segmental reduction detection and segmental restoration of underlying cluster identity. Both models distinguish reduced and canonical forms with high accuracy. Crucially, reduced segments retain cues to their underlying stops, indicating that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion. These results demonstrate structured phonological encoding of AAE CCR patterns in modern speech models.
[NLP-68] QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages
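Quick read: This paper addresses the fact that standard tokenizer evaluation metrics such as fertility rate fail to capture morphological correctness in low-resource agglutinative languages such as Southern Quechua (quz). It presents QuechuaTok, a systematic benchmark comparing four tokenization strategies (BPE, Unigram LM, WordPiece, and a morphology-aware PRPE tokenizer). The key step is to add morphological boundary accuracy (MorphAcc) as a metric, using the SQUOIA finite-state morphological analyzer as a silver standard. BPE achieves the lowest fertility rate (1.636 at a 16k vocabulary) but only 6.67% MorphAcc, whereas PRPE reaches 83.33% MorphAcc, the highest of all systems. Fertility rate alone is therefore insufficient for evaluating tokenizers for agglutinative languages.
Link: https://arxiv.org/abs/2606.23943
Authors: Maria Contreras
Affiliations: Universidad Peruana de Ciencias Aplicadas (UPC); Stanford University
Categories: Computation and Language (cs.CL)
Comments: 4 pages, 3 tables, 1 figure. Code available at this http URL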
Abstract:Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark comparing four tokenization strategies - BPE, Unigram LM, WordPiece, and a morphology-aware PRPE tokenizer - for Southern Quechua (quz), a low-resource agglutinative language spoken by 8-10 million people in South America. Using a 200k-sentence corpus and the SQUOIA finite-state morphological analyzer (Rios, 2016) as silver standard, we evaluate three metrics: fertility rate, OOV rate, and morphological boundary accuracy (MorphAcc). Our results show that BPE achieves the lowest fertility rate (1.636 at 16k vocab) by memorizing surface word forms, while achieving only 6.67% MorphAcc. PRPE achieves 83.33% MorphAcc - the highest of all systems - demonstrating that fertility rate alone is insufficient to evaluate tokenizers for agglutinative languages. All code and models are publicly available at this http URL
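To make the contrast between fertility rate and boundary accuracy concrete, here is a small sketch. The paper's exact MorphAcc definition is not given in the abstract, so this assumes one plausible reading: the share of gold morpheme boundaries that coincide with a token boundary. The function names and the example segmentation of the word are illustrative only.

```python
def boundaries(segments):
    """Character offsets of the internal boundaries of a segmented word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def fertility(tokenized_words):
    """Average number of tokens per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


def morph_boundary_accuracy(tokenized_words, gold_words):
    """Share of gold morpheme boundaries that are also token boundaries.

    Assumed definition; the benchmark may score boundaries differently.
    """
    hit = total = 0
    for tokens, morphs in zip(tokenized_words, gold_words):
        gold = boundaries(morphs)
        hit += len(gold & boundaries(tokens))
        total += len(gold)
    return hit / total if total else 0.0


# Illustrative gold segmentation of one word into three morphemes.
gold = [["wasi", "kuna", "pi"]]
whole_word = [["wasikunapi"]]       # one token: lowest possible fertility
morph_split = [["wasi", "kuna", "pi"]]
```

A tokenizer that memorizes the whole surface form gets the best fertility (1.0) and zero boundary accuracy, while the morpheme-level split has fertility 3.0 and recovers every boundary, which mirrors the BPE versus PRPE pattern reported above.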
[NLP-69] Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs
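Quick read: This paper addresses the gap between the rationales produced by driving VLA models and the motion they actually plan: current rationales often lack the step-by-step decision semantics that would keep them causally connected to the trajectory, and existing methods mostly rely on post-hoc alignment. Neuro-Symbolic Drive treats classical rule-based planners as executable symbolic reasoning engines. The planners are instrumented in simulation to capture both the executed trajectory and the internal decision trace at each rule-evaluation step; each trace is serialized into structured rule-grounded reasoning and paired with the trajectory to fine-tune Qwen3.5-4B as a driving VLA. Because the traces come directly from the planner states that determine the action, reasoning and motion generation are coupled by construction rather than by post-hoc alignment. On the simulator-generated benchmark, detailed rule-grounded reasoning reduces ADE@3s from 0.47 to 0.26 and miss rate from 8.30% to 6.40% under three-camera perception, and from 0.54 to 0.26 and from 10.13% to 5.99% under eight-camera perception.
Link: https://arxiv.org/abs/2606.23938
Authors: Xiangbo Gao, Xiukun Huang, Boyu Lu, Junge Zhang, Mengjie Mao, Jiachen Li, Wei Xiong, Zhengzhong Tu
Affiliations: Texas A&M University; Carnegie Mellon University; University of Maryland; University of California, Riverside; University of Pittsburgh
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: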
Abstract:Driving VLA models incorporating Chain-of-Thought (CoT) reasoning are attractive because they leverage pretrained VLM representations and expose intermediate decisions in natural language, yet current rationales often lack the step-by-step decision semantics needed to keep the rationale causally connected to the planned motion. We introduce Neuro-Symbolic Drive, a neuro-symbolic driving framework that supervises a driving VLA with rule-grounded reasoning traces extracted directly from classical rule-based planners. Our key observation is that rule-based planners are symbolic AI systems that already function as executable reasoning engines: they reason about active safety constraints, search over candidate maneuvers, and select a final trajectory. We instrument these planners in simulation to capture both the executed trajectory and the internal decision trace at each rule-evaluation step. Each trace is serialized into structured rule-grounded reasoning and paired with the trajectory to fine-tune Qwen3.5-4B as a driving VLA. Because these traces are derived directly from the planner states that determine the action, they ensure reasoning is structurally coupled to motion generation by construction, rather than by post-hoc alignment. On our simulator-generated benchmark, detailed rule-grounded reasoning reduces ADE@3s from 0.47 to 0.26 and miss rate from 8.30% to 6.40% under three-camera perception, and from 0.54 to 0.26 and 10.13% to 5.99% under eight-camera perception. Neuro-Symbolic Drive thus converts neuro-symbolic planning logic into structured supervision. Code base: this https URL.
[NLP-70] When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents
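Quick read: This paper tests whether exact-match retrieval recall is a reliable proxy for whether a retriever supplies useful policy context to a downstream decision model. The setting is pre-action policy classification in tau-bench with Qwen2.5-3B/7B classifiers. Instead of relying on recall alone, the authors put retrieved policy clauses into the classification loop and measure downstream performance. Although the exact governing clause is retrieved at rank 1 for only 7% of airline states, the primary 3B classifier obtains macro-F1 0.58 with retrieved clauses versus 0.60 with gold clauses (Delta=-0.02), though the confidence interval is too wide to establish non-inferiority. The same qualitative pattern appears with a second retriever and at 7B. Exact-match clause recall can therefore underestimate downstream policy utility in this benchmark setting.
Link: https://arxiv.org/abs/2606.23937
Authors: Tianyu Ding, Juan Pablo De la Cruz Weinstein
Affiliations: Amazon Web Services
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: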
Abstract:Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B classifiers. Under gold-policy conditioning, a compact structured state improves macro-F1 over raw trajectories by 0.13-0.17 after tuning. We then replace the benchmark-designated policy clause with the top-ranked clause retrieved from decision-time context. Although the exact governing clause is retrieved at rank 1 for only 7% of airline states, the primary 3B classifier obtains macro-F1 0.58 with retrieved clauses versus 0.60 with gold clauses (Delta=-0.02, task-cluster 95% CI [-0.23,+0.21]); mismatched-policy and no-policy controls score 0.32 and 0.21. We do not detect a macro-F1 difference between retrieved and gold clauses in this configuration, although the interval remains too wide to establish non-inferiority. The same qualitative pattern appears with a second retriever and at 7B, while varying across fine-tuning configurations. These results indicate that exact-match clause recall can underestimate downstream policy utility in this benchmark setting, motivating evaluation with retrieved policies in the classification loop rather than recall alone.
[NLP-71] Mind the Heads: Topological Representation Alignment for Multimodal LLMs
Quick read: This paper addresses the coarse granularity of representation alignment in Multimodal Large Language Models (MLLMs): existing methods typically align a fixed layer of the language backbone and overlook the fine-grained structure of individual attention heads. It proposes Head-Wise Representation Alignment (HeRA), which is grounded in the Platonic Representation Hypothesis and aims to preserve the topological structure of representations (their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, HeRA introduces a contrastive objective that acts as a differentiable proxy for matching local structures and applies it during multimodal training to attention heads selected by their MKNN score. Counterintuitively, aligning the least aligned heads yields the largest gains. Across multiple MLLMs and 18 benchmarks, HeRA consistently improves performance on vision-centric tasks and acts as a regularizer against visual hallucinations by curbing over-reliance on linguistic priors.
Link: https://arxiv.org/abs/2606.23885
Authors: Davide Caffagni, Alberto Compagnoni, Federico Melis, Sara Sarto, Pier Luigi Dovesi, Mark Granroth-Wilding, Marcella Cornia, Lorenzo Baraldi
Affiliations: University of Modena and Reggio Emilia; University of Pisa; AMD Silo AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:
Abstract:Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing methods typically align a fixed layer of the language backbone, overlooking the fine-grained structure of Transformer models. In this work, we propose Head-Wise Representation Alignment (HeRA), a method that enforces cross-modal alignment at the level of individual attention heads. Our approach is grounded in the Platonic Representation Hypothesis, focusing on preserving the topological structure of representations (i.e., their local neighborhood relationships) across modalities. Following the Mutual K-Nearest Neighbor (MKNN) alignment metric, we introduce a contrastive objective that acts as a differentiable proxy for matching local structures. HeRA applies this objective during multimodal training to specific attention heads in the LLM, selected by their alignment score according to the MKNN metric. Counterintuitively, we find that aligning the least aligned heads yields the largest gains. Extensive evaluations across multiple MLLMs and 18 benchmarks demonstrate that HeRA consistently improves performance on challenging vision-centric tasks and serves as an effective regularizer against visual hallucinations by naturally curbing the over-reliance on linguistic priors. Our code is publicly released.
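The mutual k-nearest-neighbor idea behind the MKNN score can be sketched in a few lines: for each sample, compare its neighbor set in one representation space with its neighbor set in the other. This is a simplified illustration, not the paper's code. It assumes Euclidean distance, and `least_aligned_heads` is a hypothetical helper that only mirrors the selection rule described in the abstract.

```python
import math


def knn_indices(features, k):
    """For each row, the indices of its k nearest other rows (Euclidean)."""
    result = []
    for i, x in enumerate(features):
        dists = [(math.dist(x, y), j) for j, y in enumerate(features) if j != i]
        dists.sort()
        result.append({j for _, j in dists[:k]})
    return result


def mutual_knn_alignment(feats_a, feats_b, k):
    """Mean fraction of k-nearest neighbours shared by the two spaces."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


def least_aligned_heads(scores, n):
    """Hypothetical helper: the n head names with the lowest alignment score."""
    return sorted(scores, key=scores.get)[:n]


vision = [[0, 0], [1, 0], [10, 0], [11, 0]]
same_topology = [[0, 0], [2, 0], [20, 0], [22, 0]]   # rescaled, same neighbours
other_topology = [[0, 0], [10, 0], [1, 0], [11, 0]]  # neighbours swapped
```

Rescaling leaves the score at 1.0 because only neighborhood structure matters, which is why the metric is described as topological; swapping which samples sit close together drives it to 0.0.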
[NLP-72] One Year Later…The Harms Persist But So Do We!
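Quick read: This paper addresses the inadequate and inconsistent safety safeguards of general-purpose LLMs in mental health-related conversations. It introduces an eight-dimension harm taxonomy and a multi-dimensional evaluation framework, and evaluates six proprietary LLMs across 16 DSM-5 conditions using four adversarial attack variants. Safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder show failure rates of up to 100%. The authors argue that ethical design and deployment require clearly defined harm categories across clinical conditions, with safeguards implemented accordingly, and that until then the growing integration of these models into educational settings puts vulnerable populations at significant risk.
Link: https://arxiv.org/abs/2606.23884
Authors: Annika Marie Schoene, Cansu Canca, Gautham Vijay Kumar, Anson Antony
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 8 tables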
Abstract:General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety safeguards remain inadequate and inconsistent across clinical conditions. This study evaluates six proprietary LLMs across 16 DSM-5 conditions using four adversarial attack variants, introducing an eight-dimension harm taxonomy and a multi-dimensional evaluation framework. Results show that safeguards hold reliably only for suicide and self-harm, while conditions such as eating disorders, substance use disorder, and major depressive disorder exhibit failure rates of up to 100%. We argue that ethical design and deployment of these LLMs demand clearly defined harm categories across clinical conditions and implementation of safeguards accordingly. Until such safeguards are in place, these models pose significant risks to vulnerable populations, making their growing integration into educational settings particularly concerning.
[NLP-73] ESBMC-PLC: A Unified IEC 61131-3 Formal Verification Framework as a PLCverif Successor
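Quick read: This paper addresses two limitations of PLCverif, the most mature open-source platform for PLC formal verification: it does not support Ladder Diagram (LD) programs, the dominant PLC notation, and it relies on CBMC as its primary backend, which restricts verification to bounded proofs. ESBMC-PLC+ is a unified framework with two additions: (1) an ST/SCL frontend built on the MATIEC IEC 61131-3 compiler, which routes C-compiled Structured Text to ESBMC with nondeterministic input modeling and YAML property injection; and (2) function block state semantics for graphical LD, extending the DFS resolver to model TON/TOF/TP timers, CTU/CTD counters, and R_TRIG/F_TRIG edge triggers as persistent scan-cycle state variables in the GOTO IR. According to the authors, ESBMC-PLC+ is the first open-source PLC verification framework to support all three major IEC 61131-3 input formats through a single ESBMC backend, with unbounded safety proofs via k-induction. On 8 benchmark programs, including programs with up to 8 integer timers, it matches PLCverif's input coverage while giving stronger guarantees; against nuXmv's BDD backend it is 400-2,000x faster on timer programs and completes proofs where nuXmv BDD times out at 120 s.
Link: https://arxiv.org/abs/2606.23870
Authors: Pierre Dantas, Lucas Cordeiro, Waldir Junior
Affiliations: The University of Manchester; UFAM
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 21 pages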
Abstract:PLCverif is the most mature open-source platform for PLC formal verification, developed at CERN and in production use since 2019. Yet it has two fundamental limitations: no support for Ladder Diagram (LD) programs, the dominant PLC notation, and reliance on CBMC as its primary backend, which restricts verification to bounded proofs. The PLCverif authors themselves identified ESBMC as the appropriate backend improvement. Prior work established ESBMC-PLC (a textual LD frontend with k-induction) and ESBMC-GraphPLC (graphical PLCopen XML support); together, they cover LD with unbounded proofs but not Structured Text (ST), and graphical LD with timer/counter function blocks remains unverifiable. This paper presents ESBMC-PLC+, a unified framework that closes both gaps: (1) an ST/SCL frontend via the MATIEC IEC 61131-3 compiler, routing C-compiled ST to ESBMC with nondeterministic input modeling and YAML property injection; (2) function block state semantics for graphical LD, extending the DFS resolver to model TON/TOF/TP timers, CTU/CTD counters, and R_TRIG/F_TRIG edge triggers as persistent scan-cycle state variables in the GOTO IR. ESBMC-PLC+ is the first open-source PLC verification framework to support all three major IEC 61131-3 input formats via a single ESBMC backend, enabling unbounded safety proofs via k-induction. A feature comparison with PLCverif and experimental evaluation on 8 benchmark programs, including programs with up to 8 integer timers, shows that ESBMC-PLC+ matches PLCverif’s input coverage while providing stronger guarantees. Against nuXmv’s BDD backend, ESBMC-PLC+ is 400-2,000x faster on timer programs and completes proofs where nuXmv BDD times out at 120s.
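The phrase "persistent scan-cycle state variables" can be illustrated with a toy model of a TON on-delay timer. This sketch is not taken from ESBMC-PLC+ and does not reproduce its GOTO IR encoding: the class layout, the use of whole scan cycles as the time unit, and the choice to count the first scan with IN true as one tick are all simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class TON:
    """On-delay timer modelled as state that persists across scan cycles."""

    pt: int          # preset time, measured in scan cycles (assumed unit)
    et: int = 0      # elapsed time, carried from one scan to the next
    q: bool = False  # output, true once IN has held for pt scans

    def scan(self, in_signal: bool) -> bool:
        """Execute one scan cycle and return the timer output Q."""
        if in_signal:
            # Saturate at pt so the elapsed time stays a bounded integer.
            self.et = min(self.et + 1, self.pt)
        else:
            self.et = 0
        self.q = in_signal and self.et >= self.pt
        return self.q


timer = TON(pt=3)
trace = [timer.scan(x) for x in [True, True, True, True, False, True]]
```

Because `et` and `q` survive between calls to `scan`, a safety property such as "Q is true only when ET equals PT" becomes an invariant over the scan loop, which is the kind of statement an inductive proof would target.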
[NLP-74] Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment
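Quick read: This paper studies emergent misalignment (EM) in large language models and asks whether EM reflects the adoption of a misaligned persona or the disruption of the model's aligned character. It evaluates self-generated text recognition (SGTR) finetuning as a character-targeted intervention, distinct from existing in-training defenses, in two-stage finetuning experiments across three models and multiple EM datasets, against benign finetuning baselines. All interventions produce comparable EM reversal, but only when they restore capabilities that EM had degraded. For prevention, only SGTR finetuning consistently reduces misalignment without worsening any individual metric, which suggests that character fortification specifically drives prevention. Further evidence ties EM to the model's default character: EM finetuning makes the model's identity self-reports more diverse, artificially corrupting self-recognition exacerbates the misalignment caused by EM finetuning, and removing the identity-bearing system prompt substantially reduces the effect of EM finetuning. Together these findings reframe EM as the destabilization of aligned character rather than the adoption of a coherent misaligned persona.
Link: https://arxiv.org/abs/2606.23700
Authors: Arush Tagade, Shaoheng Zhou, Jiaxin Wen, Shi Feng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 11 figures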
Abstract:Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates through disruption of the model’s aligned character rather than direct learning of harmful content. Motivated by this connection, we study self-generated text recognition (SGTR) finetuning as a character-targeted intervention that is distinct from existing in-training defenses. We conduct two-stage finetuning experiments across three models (GPT-4.1, Qwen2.5-32B-Instruct, Seed-OSS-36B-Instruct) and multiple EM datasets, comparing SGTR finetuning against benign finetuning baselines (correct domain-specific data, general knowledge, and word counting), and find it an effective defense in both reversal and prevention settings. We find that all interventions produce comparable EM reversal, but only when restoring capabilities that EM had degraded. For prevention, only SGTR finetuning consistently reduces misalignment without exacerbating any individual metric, suggesting that character fortification specifically drives prevention. We provide further evidence for EM’s relation to the LLM’s default character by showing that EM finetuning induces diversity into the LLM’s identity self-reports, that artificially corrupting self-recognition exacerbates misalignment caused by EM finetuning, and that removing the model’s identity-bearing system prompt substantially reduces the effect of EM finetuning. Together, these findings reframe EM not as the adoption of a coherent misaligned persona but as the destabilization of aligned character.
[NLP-75] Quantifying Prior Dominance in RAG Systems
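Quick read: This paper addresses what it calls 'epistemic blindness' in evaluations of Retrieval-Augmented Generation (RAG): discrete heuristics cannot distinguish genuine extraction of contextual information from recall of parametric memory. It introduces the Normalized Context Utilization (NCU) metric, which uses continuous token log-probabilities under zero-shot, oracle, and adversarial conditions to quantify contextual information gain. Across models from 1.5B to 72B parameters plus a proprietary commercial API, strict factual extraction (without Chain-of-Thought reasoning) shows extreme diminishing returns from scale: efficient Small Language Models (SLMs) match or outperform high-capacity architectures. 'Prior Dominance' correlates with model scale and proprietary alignment. The evaluated commercial API overrode explicit external evidence in nearly half of adversarial conflicts and frequently suffered systemic confidence collapse (Negative Transfer) when its parametric priors were contradicted.
Link: https://arxiv.org/abs/2606.23695
Authors: Barak Or
Affiliations: ArtificialGate Ltd.; Google; Reichman Tech School; Technion – Israel Institute of Technology; Reichman University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, Preprint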
Abstract:Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from "epistemic blindness": failing to distinguish genuine contextual information extraction from parametric memory recall. To address this, we introduce the Normalized Context Utilization (NCU) metric, leveraging continuous token log-probabilities across zero-shot, oracle, and adversarial conditions to strictly quantify contextual information gain. Evaluating architectures ranging from 1.5B to 72B parameters alongside a proprietary commercial API reveals that for strict factual extraction (without Chain-of-Thought reasoning), traditional scaling laws exhibit extreme diminishing returns: highly efficient Small Language Models (SLMs) match or outperform high-capacity architectures. Furthermore, we demonstrate that "Prior Dominance" correlates with model scale and proprietary alignments. The evaluated commercial API not only overrode explicit external evidence in nearly half of adversarial conflicts, but also frequently suffered from systemic confidence collapse (Negative Transfer) when its parametric priors were contradicted. Our findings highlight the structural epistemic advantage and superior contextual adherence of SLMs in strict extraction workflows.
[NLP-76] ModTGCN: Modularity-aware Graph Neural Networks for Text Classification PAKDD2026
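Quick read: This paper addresses graph-based text classification models that rely on local neighborhood aggregation and overlook the global, class-consistent community structure of semantic document graphs, which can blur class boundaries and cause over-smoothing. ModTGCN is a modularity-aware graph neural network that jointly optimizes cross-entropy and a modularity-based auxiliary objective to promote class-coherent document communities while preserving discriminative representations. The modularity term is computed on a document-document similarity graph derived from transformer embeddings (pretrained or fine-tuned). For scalability, the original heterogeneous TextGCN graph is decoupled into separate document-word and word-word components, giving 2x-10x faster training. The paper also studies graph construction strategies, label-aware edge reweighting, and supervision choices for modularity optimization. Experiments on five benchmarks show consistent gains, with larger improvements on complex, low-homophily datasets such as Ohsumed and 20NG.
Link: https://arxiv.org/abs/2606.23694
Authors: Rajarshi Misra, Aditya Sharma, Vinti Agarwal, Hari Om Aggrawal
Affiliations: BITS Pilani; Pilani Campus
Categories: Computation and Language (cs.CL)
Comments: PAKDD2026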
Abstract:Graph-based text classification models typically rely on local neighborhood aggregation and overlook global community structure, despite semantic document graphs exhibiting strong class-consistent clustering. Ignoring this can blur class boundaries and lead to over-smoothing. We propose ModTGCN, a modularity-aware graph neural network for text classification that jointly optimizes cross-entropy and a modularity-based auxiliary objective to promote class-coherent document communities while preserving discriminative representations. The modularity term is computed on a document-document similarity graph derived from transformer embeddings (pretrained or fine-tuned). To improve scalability, we decouple the original heterogeneous TextGCN graph into separate document-word and word-word components, achieving 2x-10x faster training. We further study graph construction strategies, label-aware edge reweighting, and supervision choices for modularity optimization. Experiments on five benchmarks show consistent gains, with larger improvements on complex, low homophily datasets such as Ohsumed and 20NG.
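For readers unfamiliar with the modularity score that the auxiliary objective builds on, the standard Newman definition for an undirected weighted graph is Q = (1/2m) * sum over i,j of (A_ij - k_i * k_j / 2m) * [c_i = c_j], where k_i is the weighted degree of node i and 2m is the sum of all entries of A. The sketch below computes this hard-label quantity on a dense adjacency matrix. It is an illustration of the score only; the paper's training objective is presumably a differentiable variant, which is not reproduced here.

```python
def modularity(adj, labels):
    """Newman modularity of a labelled, undirected graph.

    adj is a symmetric adjacency matrix (list of lists); labels[i] is the
    community (here: class) assigned to node i.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # equals twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed weight minus the weight expected by chance.
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Toy document graph: two disconnected pairs of similar documents.
adj = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
```

Labels that follow the graph's communities score high, labels that cut across them score negative, and putting everything in one community scores zero, which is why rewarding modularity pushes same-class documents into coherent clusters.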
[NLP-77] Progressive Alignment Objectives for Aligner-Encoder based ASR INTERSPEECH2026
[Quick read]: This paper addresses a problem in Aligner-Encoder models for end-to-end automatic speech recognition (ASR): alignment tends to form abruptly in the upper layers of the network, which makes training sensitive and unstable on long utterances. The proposed solution is InterAligner, which adds an intermediate Aligner objective together with an intermediate Connectionist Temporal Classification loss (InterCTC), so that alignment can build up progressively across layers, improving optimization stability and robustness to long sequences. With a 17-layer Conformer on LibriSpeech, the method reduces word error rate (WER) to 3.1% on test-clean and 5.6% on test-other, outperforming the baseline that applies the Aligner objective only at the final layer, with the largest gains on long utterances.
Link: https://arxiv.org/abs/2606.24147
Authors: Jaeyong Lee, Masato Mimura, Takafumi Moriya
Affiliations: NTT, Inc.
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to Interspeech 2026
Abstract:Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without cross-attention or a transducer lattice. In practice, this alignment often forms abruptly in the upper layers, making training sensitive and brittle on long utterances. We propose InterAligner, which adds an intermediate Aligner objective so alignment can form progressively across depth, together with an intermediate CTC loss (InterCTC) to stabilize optimization. On LibriSpeech with a 17-layer Conformer, a final-only Aligner reaches 5.0/7.8 WER (test-clean/other). InterCTC improves this to 3.4/6.0, and InterAligner further reduces WER to 3.1/5.6, with the largest gains on long utterances.
Information Retrieval
[IR-0] Are We Ready For An Agent-Native Memory System?
Link: https://arxiv.org/abs/2606.24775
Authors: Wei Zhou, Xuanhe Zhou, Shaokun Han, Hongming Xu, Guoliang Li, Zhiyu Li, Feiyu Xiong, Fan Wu
Categories: Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: Paper list available at: this https URL. Source code available at: this https URL
Abstract:Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at this https URL.
[IR-1] PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation
Link: https://arxiv.org/abs/2606.24346
Authors: Kirill Dubovikov(1), Omar El Mansouri(1), Hachem Madmoun(1), Yanda Li(1), Sandeep Kumar(1), Aya El Mir(1), Supriyo Ghosh(2), Writabrata Bhattacharya(2), Adrian Garcia-Garcia(2), Onkar Pandit(2), Sunil Kumar Sahu(2), Federico Castanedo(2), Larry Murray(2), Martin Takac(1), Salem Lahlou(1) ((1) Mohamed bin Zayed University of Artificial Intelligence, (2) Inception AI)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Petroleum-engineering search exposes a supervision gap for strong general retrievers: relevant evidence exists in public web text, but domain relevance labels are scarce. To address this gap, we propose PETRA, a large-scale Petroleum Engineering Text for Retrieval Adaptation dataset and pipeline that converts noisy public web data into a curated domain corpus and synthetic supervision for dense retrieval and reranking. PETRA contains 1.36M curated chunks, approximately 2B token equivalents, approximately 859k embedding training rows from approximately 224k anchors, and roughly 400k teacher-scored reranker candidate rows. Its construction combines high-recall energy-domain curation, an energy-domain classifier with 98.4% test accuracy, chunk-grounded query generation, LLM-written hard negatives, and retrieval-mined candidate lists. PETRA improves first-stage in-domain Normalized Discounted Cumulative Gain (nDCG) from 0.703 to 0.763 through score fusion. Reranker adaptation improves the public Earth Science benchmark by 44% relative and a six-task reasoning-intensive panel by 23%. Failed training recipes show that high train-holdout accuracy on synthetic labels does not predict retrieval gains; retrieval-mined data helps only after being repackaged as teacher-scored candidate lists sampled from the inference-time candidate distribution.
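Since several entries in this section report nDCG, here is a minimal sketch of how the metric is commonly computed. It assumes linear gain and a log2 rank discount, and it derives the ideal ranking from the same judged list; the abstract does not state which variant PETRA uses, and the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking.

    relevances: graded relevance of each result, in ranked order.
    Assumes the list already contains every judged document for the query.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1], k=3)   # already in ideal order
swapped = ndcg_at_k([0, 1], k=2)      # the only relevant item is ranked second
```

A perfectly ordered list scores 1.0, and pushing a relevant document down the ranking lowers the score through the logarithmic discount.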
[IR-2] Unified Dominance Graph for Interval-Predicate Approximate Nearest Neighbor Search
Link: https://arxiv.org/abs/2606.24204
Authors: Kwun Hang Lau, Ruiyuan Zhang, Elton Chun-Chai Li, Wun Yu Chan, Xiaojun Cheng, Xiaofang Zhou
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Comments:
Abstract:Approximate Nearest Neighbor Search (ANNS) is a core primitive for unstructured data retrieval. Real-world applications–such as temporal databases, financial data analysis, and retrieval-augmented generation–often require hybrid queries whose valid objects are constrained by continuous interval attributes, such as lifespans or price ranges. We study Interval-Predicate ANNS (IPANNS), where validity is determined by a predicate between an object interval and a query interval. Existing range-filtering ANNS (RFANNS) methods are designed for single-dimensional scalar filters, but interval predicates such as containment and overlap rely on two coupled endpoint constraints. Treating endpoints as independent scalar attributes can incur large intersection overhead, while containment-specific methods lack a generalized indexing abstraction. In this paper, we propose the Unified Dominance Graph (UDG), a graph-indexing framework for the closed two-bound conjunctive fragment of IPANNS. For a chosen interval predicate, UDG maps object and query endpoints into a normalized two-dimensional dominance space and builds a dominance-labeled graph over the transformed coordinates. Containment, overlap, and other supported endpoint-bound predicates therefore reuse the same construction and search algorithms after semantic mapping, while each UDG instance remains tied to its selected predicate. UDG compresses query-state-specific proximity graphs into one compact index. To improve graph search under restrictive interval filters, we add validity-preserving patch edges that provide routing choices when few objects remain valid. Extensive evaluations on standard benchmarks and real-world datasets show that UDG achieves stable query performance across multiple interval relations and workloads, significantly outperforming existing hybrid search baselines while maintaining low indexing overhead.
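The idea of turning an interval predicate into a two-dimensional dominance test can be sketched as follows. This is a hypothetical illustration for closed intervals, not UDG's actual mapping or normalization, which the abstract does not spell out; all function names are assumptions.

```python
def dominated(point, corner):
    """True if point is <= corner in both coordinates."""
    return point[0] <= corner[0] and point[1] <= corner[1]


def map_containment(obj, query):
    """Object [l, r] lies inside query [ql, qr]  <=>  l >= ql and r <= qr.

    Negating the left endpoints turns both constraints into '<=' tests.
    """
    (l, r), (ql, qr) = obj, query
    return (-l, r), (-ql, qr)


def map_overlap(obj, query):
    """Object [l, r] overlaps query [ql, qr]  <=>  l <= qr and r >= ql."""
    (l, r), (ql, qr) = obj, query
    return (l, -r), (qr, -ql)


def is_valid(obj, query, mapping):
    """Evaluate an interval predicate as a single dominance check."""
    point, corner = mapping(obj, query)
    return dominated(point, corner)


inside = is_valid((2, 5), (1, 6), map_containment)      # [2, 5] within [1, 6]
sticks_out = is_valid((0, 5), (1, 6), map_containment)  # left end escapes the query
touches = is_valid((0, 2), (1, 6), map_overlap)         # shares [1, 2] with the query
disjoint = is_valid((7, 9), (1, 6), map_overlap)        # starts after the query ends
```

Because both predicates reduce to the same `dominated` check after mapping, a single index structure over the transformed coordinates could in principle serve either one, which is the reuse the abstract describes.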
[IR-3] MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval
Link: https://arxiv.org/abs/2606.24200
Authors: Junhyeok Lee, Han Jang, Hyeonjin Goh, Kyu Sung Choi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Under review. 15 pages, 3 figures
Abstract:Retrieval-augmented generation (RAG) in clinical settings increasingly requires multilingual retrieval against predominantly English evidence corpora. Multilingual medical retrieval demands three capabilities: cross-lingual alignment, concept discrimination, and evidence retrieval. However, existing benchmarks evaluate these only in isolation, leaving the interaction between biomedical expertise and multilingual coverage unmeasured. We introduce MMed-Bench-IR, a benchmark designed to disentangle these axes across 6 languages and three structurally heterogeneous tasks: (1) cross-lingual medical QA retrieval with 6,127 queries grounded in the Unified Medical Language System (UMLS), (2) concept discrimination over 4,975 confusion sets at three difficulty tiers, and (3) multilingual evidence retrieval for RAG with 2,040 quality-assured queries. The three tasks share zero concept and query overlap by design, ensuring that aggregate scores reflect genuine capability breadth. Evaluation of ten systems across six paradigm families reveals severe cross-lingual failure: biomedical encoders that score 0.818 nDCG@10 in English drop to 0.056 in Japanese, a gap that English-only benchmarks cannot detect.
[IR-4] Dialogue to Discovery: Attribute-Aware Preference Elicitation for Conversational Product Search Assistants
Link: https://arxiv.org/abs/2606.24194
Authors: Sarthak Harne, Natwar Modani, Debabrata Mahapatra, Shubham Agarwal
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Conversational product search assistants offer a more expressive, natural, and interactive alternative to traditional keyword-based product search. With limited screen space, showing only a few items increases the need for precise preference elicitation, which can prolong conversations, leading to user frustration and session abandonment. Conversely, rushing to recommend items without a clear understanding of preferences risks poor matches and a degraded user experience. We present Dialogue to Discovery (D2D), an attribute-oriented preference elicitation framework that dynamically exploits the structure of product attributes to efficiently steer conversations toward the user’s desired item. D2D adaptively prioritizes the most informative queries and strategically times product recommendations, reducing premature or off-target suggestions that harm engagement. To evaluate D2D, we curate three datasets from the Amazon Reviews corpus. In simulated conversations modelled using a multi-factor utilitarian patience framework, D2D achieves a 22.2-29.9% improvement in target-finding accuracy, 6.6-16.1% reduction in abandonment, and 27.5% shorter average conversations over the state-of-the-art baselines. A complementary user study further confirms significant gains in both user satisfaction and perceived efficiency.
[IR-5] Aspect-Based Sentiment Evolution and its Correlation with Review Rounds in Multi-Round Peer Reviews: A Deep Learning Approach
Link: https://arxiv.org/abs/2606.24188
Authors: Ruxue Hana, Haomin Zhoua, Jiangtao Zhong, Chengzhi Zhang
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:
Abstract:Mining sentiment information from the textual content of peer review comments offers valuable insights into the scientific evaluation process. However, previous studies are often constrained by coarse-grained analysis and the lack of differentiation across review rounds. Notably, the dynamic shifts in reviewers’ focus and sentiment tendencies throughout multiple review stages remain underexplored. To address this gap, the present study investigates the distribution and evolution of aspect-level sentiments and examines their correlation with the number of review rounds. We begin by segmenting the multi-round review comments of 11,063 accepted papers from Nature Communications and identifying fine-grained review aspect clusters. A manually annotated corpus of approximately 5,000 review sentences is then constructed. Using this dataset, we train a series of deep learning-based aspect sentiment classification models. Among them, the LCF-BERT-CDM model achieves the best performance, with a Macro-F1 score of 82.65%. Subsequent statistical analysis reveals a consistent trend: as the number of review rounds increases, the proportion of positive sentiments rises, while negative sentiments decline. Correlation analysis further indicates that aspect sentiment scores are negatively associated with the total number of review rounds. Key aspects exhibiting stronger correlations include “experiments”, “research significance” and “result analysis”.
[IR-6] Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers
Link: https://arxiv.org/abs/2606.24099
Authors: Yuzhuo Wang, Chengzhi Zhang, Min Song, Seong Deok Kim, Youngsoo Ko, Juhee Lee
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments:
Abstract:Algorithms have become central to scientific research in the era of artificial intelligence (AI). Although algorithm mentions in papers are often used to indicate popularity and influence, existing studies usually evaluate individual algorithms in isolation and pay limited attention to the collective influence formed through their interconnections. This study constructs large-scale algorithm co-occurrence networks in natural language processing (NLP) based on the full text of academic papers and investigates algorithm influence from a network perspective. Using deep learning models, we extract algorithm entities and build overall, cumulative, and annual co-occurrence networks. We analyze their structural characteristics and apply multiple centrality measures to assess the group influence of algorithms across the whole field and over time. The results show that algorithm networks display typical features of complex networks, with increasingly dense connections developing over approximately two decades. Classic, high-performing algorithms and those located at the intersections of different research periods tend to have high popularity, control, centrality, and balanced influence. When the influence of an algorithm declines, it usually loses its core network position first, followed by weaker associations with other algorithms. This study is the first large-scale analysis of algorithm co-occurrence networks. Covering more than four decades of academic publications, it provides a temporal and structural view of algorithm influence and offers a foundation for future research on networks linking algorithms, scholars, and tasks.
[IR-7] Is Higher Team Gender Diversity Correlated with Better Scientific Impact?
Link: https://arxiv.org/abs/2606.24098
Authors: Chengzhi Zhang, Jiaqi Zeng, Yi Zhao
Categories: Digital Libraries (cs.DL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments:
Abstract:Collaborative research involving scholars of various genders constitutes a prominent theme in scientific research that has garnered substantial attention. While several studies have investigated the connection between gender-specific collaboration patterns and the scientific impact of papers, the specific gender diversity factors that contribute to enhanced scientific impact remain largely unexplored. In this study, we analyze the correlation between gender diversity and the scientific impact of papers, using the Natural Language Processing (NLP) and Library and Information Science (LIS) domains as examples. Our findings reveal three key observations. First, significant gender disparities exist in both the NLP and LIS domains, with female scholars underrepresented. The gender disparity is more pronounced in the NLP domain than in the LIS domain. Second, based on papers from the NLP and LIS domains, we find that papers with different gender compositions achieve varying numbers of citations, with mixed-gender collaborations gradually obtaining higher average citation counts than same-gender collaborations. Lastly, there is an inverted U-shaped relationship between the gender diversity of paper collaborations and the number of citations received by those papers. Based on the most impactful gender diversity calculations, the ideal gender ratio for NLP and LIS teams falls within a range where one gender constitutes 5% to 15% of the total number of authors. This paper contributes to the exploration of the most impactful gender diversity in collaborative research and offers insights to guide more effective scientific paper collaboration.
[IR-8] ChartWalker: Benchmarking the Cross-Chart RAG Task
Link: https://arxiv.org/abs/2606.23997
Authors: Ning Tang, Chenghan Xie, Hanyang Yuan, Yi Li, Renhong Huang, Qian Kou, Xiaofeng Shi, Hua Zhou, Jiarong Xu
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Cross-Chart Retrieval-Augmented Generation (RAG) is critical for complex multi-modal analytical tasks in scientific, business, and political domains. However, existing benchmarks either focus on tables, which are well-structured and textualized, or generate cross-chart questions by simply extracting key points, which often induces lexical overlap between queries and evidence and yields logically inconsistent reasoning chains. To address this, we introduce ChartWalker, a novel framework for constructing challenging cross-chart RAG tasks. ChartWalker features a hierarchical knowledge graph construction method tailored to charts, which organizes entities and relations by granularity to preserve analytical structure. We then propose a structure-aware sampling algorithm that synthesizes semantically coherent, multi-hop reasoning paths, enabling explicit control over query difficulty and granularity for QA generation. Built with this framework, we release ChartWalker-Bench, a comprehensive benchmark spanning diverse domains and cross-chart query types. Extensive evaluations across major RAG paradigms reveal significant performance gaps, underscoring the benchmark’s difficulty and utility. Furthermore, we provide ChartWalker-Agent, an agentic baseline to facilitate analysis and inspire future system design.
[IR-9] Unified Multi-Task Relevance Modeling for E-Commerce: Comparing Task Routing Architectures Across LLMs and Cross-Encoders SIGIR2026
Link: https://arxiv.org/abs/2606.23919
Authors: Md Omar Faruk Rokon, Jhalak Nilesh Acharya, Shasvat Desai, Hong Yao, Kuang-chih Lee
Categories: Information Retrieval (cs.IR)
Comments: Accepted at E-commerce workshop, SIGIR 2026
Abstract:How can we build a single relevance model that handles six different entity-pair relationship types in e-commerce, from query-product matching to product-type similarity, when each task has different data volumes, different semantic requirements, and potentially conflicting learning signals? This question is important because current industry practice relies on separate models for each task, preventing knowledge transfer and producing inconsistent relevance signals. Our work is driven by the following insight: encoder-based and decoder-only models encode task identity through different mechanisms, so the choice of task routing architecture (how task identity is communicated to the shared model) affects these two families in asymmetric ways. As our key novelty, we combine three ideas: (a) a unified multi-task framework that jointly trains on six entity-pair tasks under a shared three-point relevance scale, (b) a systematic comparison of three task routing architectures (text-prefix routing, multi-head classification, and multi-head with private transformer layers) across both LoRA-adapted LLMs and fully fine-tuned cross-encoders, and (c) a majority-vote ensemble that exploits the diversity induced by private-layer routing. First, we show that the MHP Ensemble (multi-head with private layers) achieves 89.96% accuracy on 453K test examples, the highest across all configurations. Second, we show that removing text prefixes without private layers causes severe degradation for decoder-only LLMs while cross-encoders remain robust, suggesting an encoder-decoder asymmetry in task identity encoding. Third, we show that multi-task training yields up to 14% improvement on low-resource tasks over single-task baselines.
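A majority-vote ensemble over per-example labels can be sketched in a few lines. This is a generic illustration, not the paper's MHP Ensemble: the three-point labels (0, 1, 2), the number of models, and the tie-breaking rule (the first label encountered among tied candidates wins) are assumptions.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties go to the label seen first, i.e. the earliest model in the list.
    """
    ensemble = []
    for votes in zip(*predictions_per_model):
        label, _count = Counter(votes).most_common(1)[0]
        ensemble.append(label)
    return ensemble


# Hypothetical predictions from three models on three test pairs.
model_a = [2, 0, 1]
model_b = [2, 1, 1]
model_c = [0, 1, 2]
combined = majority_vote([model_a, model_b, model_c])
```

Voting only helps when the member models make different mistakes, which is why the abstract ties the ensemble's benefit to the diversity induced by private-layer routing.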
[IR-10] Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs
Link: https://arxiv.org/abs/2606.23915
Authors: Tianyu Ding, Aditya Nannapaneni, Juan Pablo De la Cruz Weinstein
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. We audit eight automatic scorers – lexical, embedding, and BERTScore baselines alongside entailment/grounding-trained models (clean and FEVER NLI, the checker MiniCheck) – across three evaluation constructs (provenance/topicality, generated-answer attribution, and fact-check entailment), asking whether any scorer transfers: stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct. In the construct with the most multi-dataset human-labeled coverage – generated-answer attribution (AttributionBench’s four source datasets, n = 1,610, with independent HAGRID, n = 2,150) – none does: the per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031 on AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to AUROC 0.53 (chance) on long-form LFQA, where BERTScore wins (0.91); the flip is not a length or truncation artifact. This instability has a concrete decision cost: a naive “best-on-average” rule for choosing an evaluator fails leave-one-dataset-out (mean held-out regret 0.172 AUROC, worse than fixing one scorer), so metric choice must be validated on the target dataset rather than learned from others. A prompt-based LLM judge avoids the chance-level collapses the automatic scorers suffer (no LFQA collapse) but is not uniformly best, ~100x costlier, and non-deterministic – relocating, not removing, the validation burden.
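The ranking inversion reported above is measured with Kendall's tau. The sketch below implements the simple tau-a variant in plain Python to show what a negative value means; the paper may use a tie-corrected variant, and the scorer values are made-up toy numbers, not results from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1   # the pair is ordered the same way in both lists
        elif sign < 0:
            discordant += 1   # the pair is ordered oppositely
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy scores for three hypothetical scorers on two datasets.
dataset_one = [0.9, 0.7, 0.5]
dataset_two = [0.5, 0.7, 0.9]   # same scorers, opposite quality order
tau_inverted = kendall_tau(dataset_one, dataset_two)
tau_partial = kendall_tau([1, 2, 3], [1, 3, 2])
```

A tau near -1 means the scorer that looks best on one dataset tends to look worst on the other, which is why the abstract argues that evaluator choice has to be validated on the target dataset.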
[IR-11] Scaling Dense Retrieval with LLM-Annotated Training Data: Structured Mining and Progressive Curriculum for E-Commerce Sponsored Search SIGIR2026
Link: https://arxiv.org/abs/2606.23911
Authors: Md Omar Faruk Rokon, Shasvat Desai, Jhalak Nilesh Acharya, Isha Shah, Kumar Priyam, Brahanyaa Somasundaram, Vamsee Tangirala, Minuteresa Thomas, Vivek Arora, Vijay Manchi, Hong Yao, Kuang-chih Lee
Categories: Information Retrieval (cs.IR)
Comments: Accepted at E-Commerce Workshop, SIGIR 2026
Abstract:How can we generate high-quality training data for dense retrieval models at production scale, without relying on click signals or manual annotation? This question is critical for e-commerce sponsored search, where click-based training suffers from position bias and tail-query sparsity, and manual labeling at the scale of hundreds of millions of query-item pairs is economically infeasible. Our work is driven by the following insight: heterogeneous retrieval systems disagree on most items they retrieve, and this disagreement creates a natural source of structured training signal: easy positives where all systems agree, hard positives that only lexical systems find, and hard negatives that fool exactly one system. As our key novelty, we combine three ideas into an end-to-end pipeline: (a) multi-channel retrieval mining with rank metadata from three production systems, (b) graded-relevance annotation by a calibrated three-model cascade that reaches 89.1% agreement with trained human annotators, and (c) three-stage progressive curriculum training that organizes 240M+ training examples across five difficulty levels. We deploy the trained two-tower BERT model on Walmart’s sponsored search and evaluate it against 30K queries labeled by trained third-party human annotators. First, we show that the system achieves +5.1% NDCG@10 over the click-trained production baseline, with the largest gain on tail queries. Second, we show that embarrassing retrievals (rating 0) drop from 8.7% to 3.5%. Third, a two-week online A/B test with tens of millions of ad requests per arm confirms +2.80% ad spend, +1.4% CTR, +2.8% eCPM, and +2.9% click conversion rate. Overall, our work provides a practical and scalable blueprint for replacing click-based training with structured LLM-annotated supervision in production retrieval systems.
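The disagreement-based bucketing described in the abstract can be illustrated with a small rule-based sketch. Everything specific here is hypothetical: the channel names, the binary `relevant` flag standing in for the graded annotation, and the exact rules, which are one plausible reading of the three categories rather than the paper's pipeline.

```python
ALL_CHANNELS = {"lexical", "dense", "graph"}   # hypothetical channel names
LEXICAL_CHANNELS = {"lexical"}


def bucket_example(retrieved_by, relevant):
    """Assign a query-item pair to a training bucket.

    retrieved_by: set of channel names that returned the item for the query.
    relevant: True if the annotation marks the item as relevant.
    """
    hits = set(retrieved_by)
    if relevant and hits == ALL_CHANNELS:
        return "easy_positive"    # every system agrees on a relevant item
    if relevant and hits and hits <= LEXICAL_CHANNELS:
        return "hard_positive"    # only lexical retrieval found it
    if not relevant and len(hits) == 1:
        return "hard_negative"    # an irrelevant item fooled exactly one system
    return "other"


example_buckets = [
    bucket_example({"lexical", "dense", "graph"}, relevant=True),
    bucket_example({"lexical"}, relevant=True),
    bucket_example({"dense"}, relevant=False),
    bucket_example({"dense", "graph"}, relevant=False),
]
```

Buckets like these could then feed a curriculum that starts with easy positives and introduces harder examples later, in the spirit of the progressive training the abstract mentions.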
[IR-12] INSPIRE: Intent-aware Neural Sponsored Product Retrieval for E-commerce SIGIR
Link: https://arxiv.org/abs/2606.23889
Authors: Shasvat Desai, Hong Yao, Utkarsh Porwal, Kuang-chih Lee
Categories: Information Retrieval (cs.IR)
Comments: Accepted to ACM SIGIR E-commerce Workshop, 2026
Abstract:Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to losses in both user engagement and monetization. Yet understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences. From the advertiser’s perspective, many products are explicitly designed to target specific intents, such as dietary preferences or size variants, and must be surfaced at the right moment to be effective. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent-aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries. We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to fine-tune, through LoRA-based supervised fine-tuning, a lightweight student LLM that predicts intent attributes. Finally, we introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products.
[IR-13] Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification ACL2026
Link: https://arxiv.org/abs/2606.23881
Authors: Qian Ma, Qiong Wu, Zhengyi Zhou, Yao Ma
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted by ACL 2026 Findings. Project page this https URL
Abstract:Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observable content in images. While recent multimodal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks that require grounding at both the fine-grained entity level and the evidence level. Most existing multimodal retrieval-augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking into a single re-ranking stage, leading to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue that both entity-level and fact-level groundings are key bottlenecks. We observe that although MLLMs often fail under open-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names. Based on this insight, we propose a simple and training-free identify-before-answer (IBA) framework that decouples entity identification from section-level re-ranking. Our approach prompts an MLLM to select high-confidence entities using only candidate names, followed by an off-the-shelf textual re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multimodal re-ranking baselines while reducing training and inference complexity. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once the correct entity is fixed. Our implementation is made public to ease reproducibility.
[IR-14] HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models
Link: https://arxiv.org/abs/2606.23843
Authors: Hoang-Bao Le, Aiden Durrant, Thai Son Mai, Binh T. Nguyen, Liting Zhou, Cathal Gurrin
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:
Abstract:Vision-Language Models (VLMs) are typically pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. However, they remain surprisingly brittle to negation: models often rely on shallow word co-occurrence and are easily distracted by misleading or irrelevant textual cues, even when their overall retrieval or classification performance is strong. Moreover, directly finetuning on negation data can interfere with previously acquired knowledge, causing noticeable degradation on standard vision-language benchmarks. To tackle these issues, this work introduces HANCLIP (Hyperbolic + Angular + Negation), a family of VLMs that explicitly restructures the embedding space to encode “what an image is not” alongside “what it is.” HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. This geometry-aware design strengthens negation sensitivity while preserving the global structure of pretrained representations, rather than overwriting them. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark, while maintaining competitive or improved performance on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining, demonstrating that a carefully designed geometric objective can substantially extend the reasoning capabilities of existing VLMs using only modest additional data.
[IR-15] EvidenceLens: A Claim-Evidence Matrix for Auditing Financial Question Answering
Link: https://arxiv.org/abs/2606.23724
Authors: Fengchen Gu, Xiaotian Ren, Zhengyong Jiang, Zhilu Zhang, Ángel F. García-Fernández, Angelos Stefanidis, Mian Zhou, Huakang Li, Jionglong Su
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Large language models are increasingly used to answer questions over annual reports, earnings decks, and analyst notes, yet their outputs remain difficult to verify in high-stakes financial workflows. A fluent answer can blend directly grounded statements, weak synthesis, and unsupported claims across narrative text, tables, and charts. We present EvidenceLens, a visual analytics prototype that treats financial question answering as a claim-evidence alignment problem. The system decomposes an answer into atomic claims, summarizes support composition, confidence, and support gaps, and coordinates claim-level inspection with source passages, table cells, and chart regions. Its core visual representation is a multimodal claim-evidence matrix that makes coverage, contradiction, and modality imbalance immediately visible. To support reproducibility, we also specify a JSON-based artifact schema, a lightweight multimodal alignment pipeline, and a deterministic review-priority ranking that maps backend signals into an auditable visual structure. Through representative report-auditing scenarios, we show how EvidenceLens helps analysts distinguish grounded claims from overconfident synthesis that conventional chat interfaces flatten.
[IR-16] EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL
Link: https://arxiv.org/abs/2606.23693
Authors: Jaehoon Lee,CheolWon Na,Suyoung Bae,Jin-Seop Lee,Jihyung Lee,YunSeok Choi,Jee-Hyong Lee
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 20 pages, 8 figures
Abstract:Text-to-SQL enables users to query databases using natural language by generating executable SQL queries. Recent methods have increasingly adopted Large Language Model-based reinforcement learning (RL) to leverage execution feedback for training. However, existing RL methods assign uniform query-level rewards to all clauses in a SQL query, treating correct and incorrect clauses equally. This coarse-grained reward design leads to insufficient learning signals for correct SQL generation. To address this issue, we propose EXPO-SQL (EXecution-based clause-level Policy Optimization for Text-to-SQL), which provides fine-grained supervision through clause-level rewards. To assign clause-level rewards, our method identifies erroneous clauses by analyzing execution results, including error messages and clause-wise incremental execution. Experiments on widely-used Text-to-SQL benchmarks demonstrate that EXPO-SQL significantly outperforms existing supervised fine-tuning, prompting, and RL-based methods through fine-grained clause-level learning. Our code is available at https://github.com/jhn25/EXPO-SQL.
Human-Computer Interaction
[HC-0] “Zooming In” on Agentic Web Browsers as Assistive Technologies: A Case Study with a Low-Vision Technology Expert
Link: https://arxiv.org/abs/2606.24870
Authors: Laura Colazzo,Giuseppe Anzillotti
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:Agentic Web Browsers (AWBs), powered by Large Language Models (LLMs), are emerging as autonomous systems capable of navigating the Web on behalf of users. Beyond enhancing productivity, they could also offer significant promise as Assistive Technologies (ATs) for visually-impaired individuals, transforming web interaction into a fluid conversational exchange. In this paper, we present a case study with a low-vision technology expert, examining how AWBs can support visually-impaired users in web navigation. The findings show that, despite the current limitations, the navigation experience is notably fluid and flexible, underscoring the strong potential of AWBs to enhance accessibility and reduce barriers in web interaction, with implications that may extend beyond accessibility to agentic UX more broadly.
[HC-1] It's Complicated: On the Design and Evaluation of AI-Powered AAC Interfaces
Link: https://arxiv.org/abs/2606.24854
Authors: Blade Frisch,Will Wade,Dylan Gaines,Michelle Kinsella,Betts Peters,Tamara Broderick,Keith Vertanen
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Presented at Speech AI for All: The What, How, and Who of Measurement Workshop at the CHI Conference on Human Factors in Computing Systems, Barcelona, Spain, 2026
Abstract:Artificial intelligence (AI) can enhance what people who use augmentative and alternative communication (AAC) are able to do with their systems. However, evaluating AI-powered AAC interfaces can be difficult. People are intersectional beings and current evaluation metrics can struggle to capture the multifaceted and nuanced desires people may have for their AAC. We explore the complicated nature of six AAC problem spaces, explore how AI might be used in these spaces, and suggest more robust methods of evaluation that take the intersectional nuances of people into account. We also discuss broader issues that arise across these problem spaces and how they could be addressed using our proposed evaluation methods.
[HC-2] Virtual Simulation for Mental Health
Link: https://arxiv.org/abs/2606.24826
Authors: Anna Fang
Categories: Human-Computer Interaction (cs.HC)
Comments: Doctoral dissertation
Abstract:Poorly designed interventions or those deployed without adequate safeguards can harm the communities they aim to serve, thus exacerbating existing vulnerabilities and leaving individuals unsupported. This is especially the case for the mental health context, where there is a growing trend of relying on technological interventions due to their accessibility and ability to deliver large-scale support. However, the mental health context is also particularly sensitive to change and risks of failure are dire; at their worst, failures in mental health interventions can result in lasting negative outcomes for individuals and tragic losses as people fall through the cracks. Thus, enabling safe ways to experiment in the mental health context is vital to allow both individuals and communities to engage with new interventions without risk of their real-world consequences. Virtual simulation, which uses virtual environments to replicate real-world interactions, processes, and behaviors, offers a promising opportunity for enabling safe, controlled experimentation with its ability to accurately replicate social situations, fears, stressors, and the potential outcomes of specific interactions. This work explores how simulation approaches can support emerging mental health processes through (1) evaluating community-level outcomes using agent-based modeling and (2) individual training in the mental health context through embodied, controlled spaces. I demonstrate this use of virtual simulation systems through a grounded human-centered approach, where system design is guided by empirical understanding of current real-world needs and challenges. By leveraging simulation to create environments where mental health strategies can be safely tested and practiced, this work aims to open new possibilities for designing scalable, user-centered systems that are effective and safe.
[HC-3] Assessing Distribution Shift in Human Activity Recognition for Domain Generalization
Link: https://arxiv.org/abs/2606.24781
Authors: Rebecca Adaimi,Edison Thomaz
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 22 pages with references
Abstract:While the field of Human Activity Recognition (HAR) continues to draw interest from researchers and advance in important ways, some key challenges remain. One of the most difficult aspects of building HAR models that show good performance in real-world settings is dealing with data diversity from device and sensor heterogeneity, and contextual changes that are intrinsic to real-world applications. While data diversity in HAR has been well-acknowledged in the literature, there remains a gap in understanding the effect of various types of distribution shifts on HAR models and the domain generalization problem that arises. Towards that end, this paper systematically evaluates 4 different types of distribution shifts, including variations in device type, sensor placement, sampling rate, and user behavior. Quantifying their effects, we illustrate that diversity shifts predominantly define all types of shifts, indicating the existence of unique features that are not shared across different domains. We then introduce a uniform HAR-based distribution shift benchmark and conduct a comprehensive evaluation of up to 28 domain generalization methods. Our analysis exposes the limitations of current domain generalization algorithms in achieving model generalizability, marginally outperforming the empirical risk minimization baseline. This work represents the first systematic exploration of domain generalization and adaptation concerning specific distribution shifts in sensor-based HAR, offering an open-source benchmark platform and datasets to spur further research.
[HC-4] Task Decomposition for Efficient Annotation
Link: https://arxiv.org/abs/2606.24734
Authors: Nupoor Gandhi,Emma Strubell
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:High-quality annotations of structured representations are expensive to collect over large corpora. Manual annotation of structure is laborious, and model-based annotation, although cheaper to generate, requires expensive validation and potentially significant supervision to ensure that the annotation quality is strong enough to be useful downstream. In traditional annotation workflows, annotation of each complete example is performed end-to-end by a single annotator. However, structured annotation is complex, and each aspect of the task represents a unique challenge with an associated inferential load for a given annotator. Modern annotation projects can incorporate heterogeneous groups of annotators, including both models and human annotators with varying domain and linguistic expertise. It remains unclear, however, how to redesign annotation tasks in this setting, where efforts are discriminately allocated across heterogeneous annotators with respect to distinct annotation challenges. We propose to decompose annotation tasks into sub-tasks in order to reduce the aggregate inferential load of annotation projects. Inspired by the notion of centers from centering theory, we introduce a formal model of inferential load based on the degrees of freedom in the space of valid annotations. Using this model, we show that identifying these centers (i.e. salient anchor entities realized by annotation sub-tasks) constrains the output space complexity, and decompositions which isolate and advance center identification reduce the aggregate inferential load. We provide guidelines for decomposing complex structured annotation tasks, supported by examples demonstrating improved cost-efficiency from our prior work. Finally, we present a procedure for allocating sub-tasks across annotators to maximize quality under a fixed budget. 
[HC-5] SciFi-VIS: Way Out There – How SciFi and Visualization Influence Each Other IEEE-VIS2026
Link: https://arxiv.org/abs/2606.24731
Authors: Ulrik Günther,Julián Méndez,Gabriela Molina León,Samuel Pantze,Mario Romero,Abdulhaq Adetunji Salako,Annalena Ulschmid
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted workshop to take place at IEEE VIS 2026
Abstract:We propose a hybrid half-day workshop at IEEE VIS 2026, calling for participation from visualization researchers and science fiction creators in order to develop a systematic understanding of the two-way relationship these communities have long shared. We invite submissions of creative formats showcasing connections and inspiring future research. Our workshop plan includes a keynote, lightning talks, brainstorming, cross-community critique, affinity mapping, and discussion around identified themes.
[HC-6] SupplyNet: Supporting Visual Exploratory Learning in Supply Chain via Contextual Multi-Agent Simulation
Link: https://arxiv.org/abs/2606.24694
Authors: Yanjia Li,Kelcy Kexin Han,Tianrui Hu,Yi-Fan Cao,Huamin Qu,Sicheng Song
Categories: Human-Computer Interaction (cs.HC)
Comments: 25 pages, 7 figures
Abstract:Simulation has long supported supply chain management instruction by letting learners observe network behavior and test decision strategies. Recent progress in LLM-driven agents opens new possibilities for richer, more adaptive simulations, but many existing systems still present abstract, opaque data that overwhelms learners and discourages active exploration. We introduce SupplyNet, a gamified visual simulation system built on a contextual graph-based LLM multi-agent framework that models interdependent supply chain dynamics and provides responsive feedback through tiered challenges. SupplyNet turns the simulation into a manipulable decision space by integrating an interactive network view of system state, a branching timeline for “what-if” exploration and comparison, and a task-oriented analysis console for structured performance breakdowns. Together, these visual components support counterfactual exploration, causal tracing, and comparative reasoning about outcomes. A user study suggests that SupplyNet increases engagement and supports users’ perceived understanding of supply chain dynamics, highlighting the potential of pairing contextual multi-agent simulation with visualization to advance operational comprehension.
[HC-7] Measuring Users' Mental Models of Speech Translation in Human-AI Collaboration ACL2026
Link: https://arxiv.org/abs/2606.24644
Authors: HyoJung Han,Nishant Balepur,Jordan Boyd-Graber,Marine Carpuat
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: ACL2026
Abstract:Millions of people use machine translation (MT) tools daily, yet little is known about their perception of what systems can and cannot do. This paper studies users’ mental models of speech translation systems through a new framework based on cross-lingual question answering, where users either accept MT output or request professional re-translation to answer questions based on the information presented in a foreign language. By analyzing user behavior and accuracy trends across varying translation qualities, we examine to what extent they can predict where the system is likely to be wrong, and how this mental model evolves. Users develop stronger mental models with practice, especially when they have some knowledge of the source language, primarily by relying on surface-level error cues. Moreover, providing speech transcriptions can help users develop better mental models. Our results show the promise of cross-lingual question answering as a downstream task for studying MT mental models and advancing our understanding of human-AI collaboration.
[HC-8] Visualizing “We the People”: Bridging the Perception Gap through Pluralistic Data Storytelling
Link: https://arxiv.org/abs/2606.24635
Authors: Lisa Schirch,Beth Goldberg
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Graphics (cs.GR)
Comments:
Abstract:Traditional visual data storytelling relies on binary graphics that depict two simplified groups in conflict. This can increase political polarization by oversimplifying intra-group disagreements and erasing ambiguity and shared ideas or values. This can inadvertently foster “us versus them” thinking. Intentional, pluralistic design choices for AI-enabled digital platforms can produce visualizations that emphasize nuance, opinion distribution, and intergroup commonalities. To demonstrate this potential, we examine deliberative technologies that map high-dimensional opinion spaces and highlight areas of both consensus and dissensus. The paper highlights the We the People deliberation conducted by Jigsaw and the Napolitan Institute in September 2025, which engaged over 2,400 Americans across all 435 congressional districts in an AI-supported, asynchronous dialogue regarding freedom and equality. By utilizing AI to synthesize long-form, text-based participant inputs into interactive “opinion landscapes,” the initiative provided an alternative format for pluralistic data storytelling that humanized diverse viewpoints and revealed hidden areas of substantial broad consensus. The paper concludes that shifting from divisive, contrast-heavy visual frameworks to distribution-focused, interactive models represents a highly scalable, low-cost intervention capable of bridging perceptual gaps and cultivating a more resilient, collaborative democratic culture.
[HC-9] Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback
Link: https://arxiv.org/abs/2606.24622
Authors: Andreas Chouliaras,Luke Connolly,Dimitris Chatzpoulos
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: The extended version of a paper published at the 2026 IEEE Conference on Artificial Intelligence (CAI). Includes an additional appendix with extended derivations and supplementary results. The main paper has 8 pages, 6 figures, 1 table
Abstract:Training safe Reinforcement Learning (RL) systems is inherently challenging, with no guarantee of avoiding unwanted behaviors. The most effective defenses against this are (i) transparency through explainability and (ii) alignment via human feedback. While both show promising results, no publicly available framework currently combines them. To address this, we introduce Themis, an XAI-enabled testing and evaluation framework for Reinforcement Learning from Human Feedback. Themis supports over 200 widely used environments and is easily configurable for experiments in RL, transparency, and alignment. Our results show that Themis can train reward models that match or outperform the environment’s true reward signal using human preferences. We also provide a cloud-based platform for collecting human feedback and managing experiments. It is user-friendly, auto-scalable, and supports large participant groups across multiple experiments without extra development overhead. Tests show Themis can support one thousand users in back-to-back experiments on a modest commercial machine.
[HC-10] Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation IJCAI2026
Link: https://arxiv.org/abs/2606.24515
Authors: Marta Sumyk,Oleksandr Kosovan
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to the 4th International Workshop on Generalizing from Limited Resources in the Open World (GLOW @ IJCAI 2026)
Abstract:Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains difficult because open-ended desktop environments rarely provide scalable, machine-readable reward signals: task success is often visually grounded and hard to specify with handcrafted reward functions or dense manual labels. We propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalable supervision signal for GUI agents. Given a final screenshot and the original instruction, a Vision-Language Model judges task completion and provides terminal feedback without task-specific heuristics or manual labels during policy optimization. Because autonomous evaluators are imperfect, we model their feedback as a noisy binary reward channel and derive a noise-corrected reward estimator for Proximal Policy Optimization. Experiments across macOSWorld, Windows Agent Arena, and OSWorld show that corrected evaluator rewards outperform both zero-shot baselines and raw evaluator rewards, improving success rates by an average of 12.6 percentage points over zero-shot performance and 5.1 points over raw evaluator fine-tuning. These results suggest that autonomous evaluation can serve as a practical reward signal for RL in GUI environments when evaluator noise is explicitly modeled and corrected.
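The abstract describes evaluator feedback as a noisy binary reward channel but does not give the estimator itself. As a rough illustration only, the sketch below uses the standard unbiased correction from the label-noise literature, which may differ from the paper's derivation. The function names and the error rates `fp_rate` (evaluator says success on a failed task) and `fn_rate` (evaluator says failure on a successful task) are assumptions introduced here.

```python
def corrected_reward(observed: int, fp_rate: float, fn_rate: float) -> float:
    """Unbiased estimate of the true binary reward from a noisy observation.

    Assumed noise model (not taken from the paper):
      P(observed = 1 | true = 0) = fp_rate
      P(observed = 0 | true = 1) = fn_rate
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0.0:
        raise ValueError("fp_rate + fn_rate must be below 1 for the correction to exist")
    return (observed - fp_rate) / denom


def expected_corrected(true_reward: int, fp_rate: float, fn_rate: float) -> float:
    """Expectation of the corrected reward given the true reward (sanity check)."""
    p_one = (1.0 - fn_rate) if true_reward == 1 else fp_rate
    return (p_one * corrected_reward(1, fp_rate, fn_rate)
            + (1.0 - p_one) * corrected_reward(0, fp_rate, fn_rate))
```

Under this noise model the corrected reward averages to the true reward, at the cost of higher variance as `fp_rate + fn_rate` approaches 1.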
[HC-11] Optimizing Visual Analytics Workflows: From Theory to Practice
Link: https://arxiv.org/abs/2606.24454
Authors: Philip Beaucamp,Alfie Abdul-Rahman,Rita Borgo,Wolfgang Jentner,Saiful Khan,Yiwen Xing,David Ebert,Min Chen
Categories: Human-Computer Interaction (cs.HC)
Comments: 22 pages, 15 figures
Abstract:The principle of visual analytics (VA) is to provide integrated workflows where human-centric processes (e.g., visualization and interaction) and machine-centric processes (e.g., statistics and algorithms) complement each other. To implement this principle in practice, it is necessary to reason about the trade-offs among different processes and make optimal use of them in a workflow. Building on an existing ontology of the methodology for analyzing such trade-offs information-theoretically and for optimizing VA workflows systematically, we investigate ways to transform this methodology from theory to practice. In particular, we adopted the action research method. Through case studies in different application domains, VA researchers with different background knowledge and experiences offered their answers to several hypotheses about using the methodology in practice and proposed ways forward. In this paper, we present our collective analysis, the strengths and feasibility of this theory-based methodology, as well as the obstacles to its broad deployment in practice. To address these challenges, we outline a roadmap to remove such obstacles.
[HC-12] Average Rankings Mask Per-Subject Optimality: A Friedman-Nemenyi Benchmark of EEG Motor-Imagery BCI Decoders
Link: https://arxiv.org/abs/2606.24394
Authors: Xavier Vasques,Paul Barbaste,Olivier Oullier
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO); Neurons and Cognition (q-bio.NC)
Comments: 16 pages, 6 figures, 4 tables
Abstract:Electroencephalography (EEG) is the dominant non-invasive modality for brain-computer interfaces (BCIs), yet reliable decoding of motor imagery is hampered by inter- and intra-individual variability. A recurring claim is that one decoding pipeline, most often a spatial or Riemannian method, is broadly preferable. We test the weakest version of that claim under the most favourable conditions. Using the Mother of All BCI Benchmarks (MOABB) framework, we evaluated 1,056 decoding configurations (feature extractor x scaler x classifier), 340,000 subject-level model fits, across three public left-versus-right motor-imagery datasets (PhysionetMI, 109 participants; Cho2017, 52; Zhou2016, 4) and two frequency bands (8-15 Hz, 8-30 Hz). Every model is fit and tested within a single session of a single participant, the easiest regime, giving every pipeline its best chance. We apply the statistics standard for multi-classifier comparison: Friedman omnibus tests, Nemenyi critical-difference analysis and Wilcoxon signed-rank tests with effect sizes. Covariance tangent-space projection (cov-tgsp) and Common Spatial Patterns (CSP) are the strongest families, but their ordering is dataset-dependent and, on the largest and most heterogeneous cohort (PhysionetMI), statistically indistinguishable (Nemenyi p = 0.27; Kendall’s W = 0.11). At the individual level the single best pipeline is optimal for only 35% of PhysionetMI participants, and nonlinear descriptors are best for roughly one third; matching pipeline to participant adds about seven accuracy points over the best fixed choice. The ranking is not an artefact of dimensionality, and classifier and scaler choices are secondary to the feature representation. Even in the easiest regime, no single pipeline dominates: a lower bound on the personalization problem and a quantitative case for participant-aware model selection rather than a universal decoder.
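For readers unfamiliar with the statistics named in this abstract, the sketch below computes the textbook Friedman chi-square statistic and Kendall's W from a subjects-by-pipelines accuracy matrix. It is an illustration in plain Python, not the authors' MOABB pipeline: the function names are made up for this example and no tie correction is applied to the statistic.

```python
def rank_descending(scores):
    """Rank scores within one subject; rank 1 is the best, ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_and_kendall_w(score_matrix):
    """Return (Friedman chi-square, Kendall's W) for rows=subjects, cols=pipelines."""
    n = len(score_matrix)        # number of subjects (blocks)
    k = len(score_matrix[0])     # number of pipelines (treatments)
    rank_sums = [0.0] * k
    for row in score_matrix:
        for j, r in enumerate(rank_descending(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    w = chi2 / (n * (k - 1))     # W near 0: subjects disagree on the ordering
    return chi2, w
```

A small W, such as the 0.11 reported for PhysionetMI, means the per-subject rankings agree only weakly, which fits the abstract's point that the best pipeline on average is often not the best for a given participant.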
[HC-13] Real-Time Interactive Music Generation via Data-Free Streaming Consistency Distillation
Link: https://arxiv.org/abs/2606.24307
Authors: Baisen Wang,Chenxi Bao,Qisong Han
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Interactive music and live performance rely on real-time human expression, but modern generative music AI remains largely absent from this domain due to its prohibitive inference latency and offline rendering paradigm. To provide pioneer musicians with a novel medium for interactive composition, we should fundamentally change these static models into dynamic, playable instruments. In this paper, we propose a framework that bridges this gap. To achieve the low latency required for live interaction without sacrificing structural coherence, we formulate distillation within a streaming autoregressive latent space. Our approach removes the need for expensive paired audio-latent datasets by utilizing prompt-only inputs to synthesize teacher-guided, chunk-wise trajectories on the fly. Because live instruments require high acoustic fidelity, we introduce music-aware consistency objectives, which combine latent, spectral, and temporal-difference losses, to preserve crucial qualities like timbre, transients, and rhythmic stability during accelerated single-step streaming generation. Implemented via parameter-efficient adaptation, our distillation reduces generation steps to achieve a low real-time factor. Crucially, by operating as a continuous autoregressive stream, the system can seamlessly assimilate dynamic human inputs on the fly, allowing users to instantly steer the musical trajectory without interrupting the audio flow. Ultimately, this work recontextualizes generative text-to-music models not as passive prompt-and-wait systems, but as responsive instruments, opening new frontiers for live human-AI musical co-creation.
[HC-14] A Dynamic Coupling Theory of Expertise Through Thinking Flow and Workflow Evolution
Link: https://arxiv.org/abs/2606.24197
Authors: Annie Yuan
Categories: Human-Computer Interaction (cs.HC)
Comments: 19 pages, 4 figures
Abstract:Expertise has long been explained through tacit knowledge, deliberate practice, skill acquisition, and expert performance. While these perspectives have advanced understanding of expertise, they often describe its conditions or outcomes rather than the cognitive architecture through which expertise continuously emerges and evolves. This paper proposes Workflow Cognition as a theoretical framework for explaining expertise as a dynamic cognitive phenomenon. Workflow Cognition is defined as the cognitive architecture emerging from the recursive coupling of Thinking Flow and Workflow Evolution. Thinking Flow refers to ongoing processes of perception, interpretation, judgement, decision-making, and reflection; Workflow Evolution refers to the continuous adaptation of actions, task structures, and operational strategies within situated practice. Through their coupling, expertise is not treated as a static accumulation of knowledge or skill, but as an evolving process generated through cognition-in-practice. Building on this framework, the paper advances a new ontological definition of expertise: expertise is an emergent manifestation of Workflow Cognition operating across longitudinal professional experience. Knowledge, skills, decisions, aesthetic preferences, and behavioural patterns are therefore interpreted as observable expressions of expertise rather than expertise itself. Drawing on illustrative comparisons across craft, creative production, education, and leadership, the paper introduces a Dynamic Coupling Model of Expertise and establishes a foundation for future work on Longitudinal Tacit Cognition, Longitudinal Aesthetic Cognition, and Expertise Workflow Grammar. The framework contributes a cognitive ontology of expertise and supports future computational representations of human expertise within AI+Expert systems. 
[HC-15] Human-Centered Design: The Disclosure of Generative Artificial Intelligence for Emerging Professionals
Link: https://arxiv.org/abs/2606.24136
Authors: Sydney Lee
Categories: Human-Computer Interaction (cs.HC)
Comments:
Abstract:As Human-Centered Design continues to grow, generative AI has the potential to streamline the research process by iterating tasks within established workflows to increase efficiency. However, integrating AI raises concerns surrounding ethical bias, complexity, and the lack of prioritization of humanistic values. Emerging professionals represent a cohort with the opportunity to learn Human-Centered Design principles, yet without this foundation AI becomes more of a crutch than a tool, leading to reduced experience with deep work, decreased autonomy, and deskilling of key foundations. Disclosures are a common method to self-report AI usage, but they provide little clarification on appropriate implementation and may encourage omission to avoid consequences. This paper reflects on experiences in the Human-Centered Design course ITIS8300, which emphasized optimizing user experience, enhancing innovation and collaboration, and improving efficiency through iterative user feedback. A semester-long project, structured through milestones and team roles including a generative AI advocate, resulted in a high-level disclosure report detailing design processes, methodology, findings, and rationale for AI usage. The course offered freedom in execution while setting clear boundaries for incorporating human feedback, reinforcing justification for HCI workflows and encouraging transparent AI use. This approach mirrors an industry with minimal regulation, demonstrating that when AI usage is safe, justified, and transparent, it can significantly advance the field through AI-augmented workflows and support co-creation and increased productivity.
[HC-16] The impact of generative artificial intelligence on academic development of Chinese students in humanities and social sciences
Link: https://arxiv.org/abs/2606.24104
Authors: Lei Fan,Fangxue Liu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative artificial intelligence (GenAI) is reshaping learning in higher education, with particularly pronounced implications for the humanities and social sciences (HSS), where learning outcomes are commonly expressed through written and interpretive forms that align closely with GenAI’s capabilities. Yet, systematic evidence on the educational impacts of GenAI on HSS students remains limited. Addressing this gap, this study draws on a large-scale survey of HSS students in China to examine its role in academic development. Guided by relevant learning theories, this study focuses on four dimensions: patterns of use, effects on learning processes and academic performance, challenges associated with GenAI use, and preferred approaches to curricular integration. We found that more than half perceived enhanced learning motivation, independent thinking and creativity, although a substantial minority reported little change or even decline. Comparatively, a notably larger majority reported academic performance gains, although these gains may partly reflect limitations in conventional assessment practices. The study identifies variations in perceived learning and performance improvements among students with differing durations of GenAI experience, along with observable disciplinary differences and modest gender differences. While an overwhelming majority valued the importance of ethical considerations, only slightly more than half were satisfied with privacy protection. Limited accuracy and overreliance emerged as the most pressing concerns reported by students. Students favored partial or optional curricular integration supported by practice-oriented training, and widely recognized GenAI’s significance for their future professional development. Grounded in student perspectives, this study offers evidence-based recommendations for the responsible and pedagogically meaningful integration of GenAI
[HC-17] Do Language Models Pass the Bechdel Test? Auditing Gender Biases in LLM-Generated Screenplays
Link: https://arxiv.org/abs/2606.24022
Authors: Megha N. Govindu,Stephanie T. Wang,Sorelle A. Friedler,Danaé Metaxa
Categories: Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments:
Abstract:As large language models (LLMs) are increasingly used in media production from journalism to filmmaking, what impact do they have on the stories being told? Prior work has shown LLMs to perpetuate social biases, including those related to gender. We complement existing literature on gender bias in LLM outputs by auditing the network structure of LLM-generated movie screenplays through automating the Bechdel test, a popular measure of women’s representation in literary and film works. We also introduce the use of social network analysis measures to further analyze representational bias in LLM-generated scripts. We evaluate screenplays generated by three state-of-the-art LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) against 768 corresponding human-written screenplays, finding that human-written scripts are more likely to pass the Bechdel test. However, other network analyses, like centrality, homophily, and triadic relationships demonstrate that in some cases LLM-scripts have less bias, although all script types demonstrate some representational bias under most measures. We conclude by discussing the continued need for further quantitative assessments of media representations and AI-generated content.
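The Bechdel test is usually stated as three conditions: a work has at least two named women, they talk to each other, and the conversation is about something other than a man. The sketch below checks those conditions over a pre-annotated list of dialogue exchanges. It is only an illustration: the `Exchange` schema, the `mentions_man` flag, and the function name are hypothetical, and the abstract does not say how the authors extract speakers, genders, or topics from screenplays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exchange:
    """One line of dialogue (hypothetical schema for this example)."""
    speaker: str
    listener: str
    mentions_man: bool  # assumed to be annotated upstream


def passes_bechdel(exchanges, women):
    """Return True if the annotated script meets all three Bechdel conditions."""
    women = set(women)
    if len(women) < 2:           # condition 1: at least two named women
        return False
    for e in exchanges:
        between_women = (
            e.speaker in women
            and e.listener in women
            and e.speaker != e.listener
        )                         # condition 2: they talk to each other
        if between_women and not e.mentions_man:
            return True           # condition 3: about something other than a man
    return False
```

Treating each exchange as a directed speaker-to-listener edge also gives the character network on which measures such as centrality or homophily could be computed.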
[HC-18] Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLMs)
链接: https://arxiv.org/abs/2606.23840
作者: Marvin Pafla,Jesse Hoey,Kate Larson,Mark Hancock
类目: Human-Computer Interaction (cs.HC)
备注: 11 pages
Abstract:Explainability is often framed as a property of an AI model, with explanations extracted from its internals and shown to users. In this argument paper, we instead provide an embodied account of explainability based on Dourish and enactivist cognition: understanding is created in use as people act on affordances in shared practice. Using demonstrations and conceptual analysis, we reveal ontological obstacles when “looking inside” large language models: surrogates import external abstractions that can be mistaken for the model’s, and focusing on internal reasoning misses that explainers participate in their own understanding. We discuss these obstacles in XAI practice, arguing that many explanations are misnamed, which skews their purpose and can increase overreliance. Finally, we highlight how embodied explanations reorganize sense-making by making what matters publicly available for action, and argue that explainability claims should be reserved for designs that provide affordances to probe, coordinate, and repair behaviour in situated practice.
[HC-19] n Digits on a Train: AI-Assisted Verification of Two Eigenvalue Problems
链接: https://arxiv.org/abs/2606.23821
作者: Matthew J. Colbrook
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Spectral Theory (math.SP)
备注:
Abstract:Accurate numerical eigenvalues are often difficult to certify, especially in singular or non-normal settings. This article reports a human–AI collaboration on two such computations. For a singular self-adjoint Schrödinger operator, a verified zero count and Dirichlet–Neumann bracketing certify the complete negative spectrum to ten decimal places. For a delicate non-normal atom–molecule benchmark, a previously unresolved resonance pair is separated, with each member enclosed to ten digits. The second result is achieved not by increasing the precision of one-way shooting, but by reformulating the problem as a global matching system for projective solution lines. The infinite tail is encoded as uncertainty in the terminal projective data, and a componentwise, tail-robust Krawczyk–Brouwer inclusion supplies the certificate. This gives a reusable architecture for analytic boundary-value systems with ill-conditioned propagation and uncertain asymptotic data. The collaboration also exposes the strengths and limits of AI assistance. AI rapidly produced accurate candidates and plausible proof strategies, but several failed, including one apparently complete tail argument that omitted the componentwise check required by a nonuniform polydisc. Validated computation is a stringent test of AI-assisted mathematics: the output is not merely a number, but a number with a proof. These examples show why the proof object matters, and why human mathematical judgment remained decisive. More broadly, as AI makes code, exposition, and plausible numerical claims inexpensive, standards for verification, attribution, peer review, and training must adapt. The implications are unsettling; the opportunity is extraordinary.
[HC-20] Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability
链接: https://arxiv.org/abs/2606.23701
作者: Sherri Weitl-Harms,John Hastings
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 20 pages, 6 figures, 11 tables. arXiv admin note: text overlap with arXiv:2408.01527
Abstract:Qualitative product feedback can reveal nuanced user experiences, but its implicit sentiment is difficult to measure. This paper presents a scalable and interpretable framework that uses large language models (LLMs) to quantify product desirability from such data. Using two Product Desirability Toolkit (PDT) datasets from ZORQ and CARMA comprising 106 respondent term groupings with gold-standard human annotation, zero-shot continuous numerical sentiment scoring and categorical sentiment classification are evaluated without relying on explicit review scores. Across the datasets, LLMs generated numerical sentiment scores directly from qualitative responses and closely matched expert labels, achieving Pearson correlations up to 0.97 and classification accuracy up to 94%. LLMs maintained robustness even when handling data presented in multiple forms and consistently expressed high confidence. In contrast, lexicon-based and transformer baselines did not produce statistically significant results. Among the models tested, GPT-4o-mini achieved performance comparable to larger models at 94% lower cost, supporting scalable deployment. The framework also incorporates model confidence ratings and human-readable rationale explanations (xAI), improving interpretability, transparency, and trust while supporting practical use in product satisfaction assessment. In general, using the PDT tool as a survey method along with a cost efficient LLM for sentiment analysis has the potential to provide for product evaluation with results that are rich in terms of sentiment scores (both numerical and classified sentiment) and in terms of the high-level user impressions of the product that can be used to identify ideas for product development and improvement, as well as marketing ideas for target audiences.
ACM classes: I.2.7; D.2.8; I.2.6; H.5.2
[HC-21] A Geometry-Informed Computer Vision Method for Detecting and Examining Overtaking Vehicles From A Bicycle
链接: https://arxiv.org/abs/2606.23699
作者: Gandhimathi Padmanaban,Rayane Moustafa,Fred Feng
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 18 pages, 6 figures, in preparation for journal submission
Abstract:Instrumented bicycle studies have produced direct field evidence on vehicle passing behavior, but extracting overtaking events from continuous rear-facing video has remained dependent on manual, frame-by-frame annotation. This bottleneck constrains sample sizes and limits naturalistic cycling safety research. We present a geometry-informed computer vision pipeline that automates overtaking event detection from a single bicycle-mounted camera without multi-sensor configurations or explicit camera calibration. The system combines RT-DETR object detection with ByteTrack multi-object tracking through a three-stage geometric validation module enforcing bearing angle trend, apparent size growth, and spatial confirmation criteria derived from perspective projection principles. Validated on 315 manually annotated real-world overtaking events from urban roads in Ann Arbor, Michigan, the pipeline achieved 97.8% recall with zero false positives. The system identified overtaking intentions a mean of 2.44 seconds before vehicle passage, with 84.1% of events exceeding the 1.5-second human reaction time threshold, demonstrating feasibility for active cyclist warning. Lateral passing distance measurements from 96 events revealed 33.3% of passes below the 5-foot (152.4 cm) threshold, consistent with non-compliance rates in prior field and self-reported studies. A preliminary calibration-free lateral distance estimation approach using bounding box geometric features achieved mean absolute errors of 13-14 cm under leave-one-out cross-validation, sufficient to distinguish close passes from standard passes for safety categorization. By automating event isolation from consumer-grade footage, the system removes the primary annotation bottleneck of instrumented bicycle research and provides a scalable foundation for vehicle-bicycle interaction analysis across larger datasets and diverse urban environments.
[HC-22] When Surveys Become Conversations: Adaptive Matrix Validation for AI-Assisted Interviews
链接: https://arxiv.org/abs/2606.24244
作者: Tyler H. McCormick
类目: Methodology (stat.ME); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Econometrics (econ.EM); Machine Learning (stat.ML)
备注:
Abstract:AI-assisted interviews promise to reduce respondent burden in surveys by allowing respondents to describe experiences naturally while an AI system noisily maps those accounts into structured survey variables. That mapping is a measurement process that is fallible, versioned, adaptive, and potentially behaves differently across subgroups. This paper proposes Adaptive Matrix Validation (AMV), a design in which each respondent completes an AI-assisted interview, which is then mapped into tabular data by the AI. Respondents are also asked a small, randomized set of structured questions, which are used for statistical adjustment. The estimator first calibrates the mapped values using validation answers from other respondents, then corrects the remaining error with the validation answers observed for the target respondent. The paper develops estimators for item means, subgroup estimates, and regression coefficients when outcomes, predictors, or both are mapped from interviews. It also gives planning formulas for the number of validation questions required and the sample size. A design-calibration simulation, an American Time Use Survey emulation, and a CHAMPS verbal-autopsy narrative study show when sparse validation can improve precision and when it cannot.
[HC-23] Zero-Shot Neural Priors for Generalizable Cross-Subject and Cross-Task EEG Decoding
链接: https://arxiv.org/abs/2606.23706
作者: Baimam Boukar Jean Jacques,Brandone Fonya,Nchofon Tagha Ghogomu,Pauline Nyaboe,Kipngeno Koech
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:The development of generalizable electroencephalography (EEG) decoding models is essential for robust brain-computer interfaces (BCI) and objective neural biomarkers in mental health. Conventional approaches have been hindered by poor cross-subject and cross-task generalization, owing to high inter-subject variability and non-stationary neural signals. We address this challenge with a zero-shot cross-subject decoding framework on the large-scale Healthy Brain Network dataset, benchmarking a convolutional neural network baseline, a hybrid LSTM, and a Transformer-based foundation model. To adapt the Transformer for regression while averting catastrophic forgetting, we propose a novel progressive unfreezing strategy. The baseline yielded an nRMSE of 0.9991, whereas our fine-tuned Transformer achieved 0.9799 on unseen subjects. This work advances scalable, calibration-free EEG decoding for computational psychiatry and behavioral prediction.
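To help read the nRMSE figures quoted above, here is a minimal sketch of one common definition: RMSE divided by the standard deviation of the targets. The abstract does not state which normalization the authors use, so this choice is an assumption, and the function name and toy values are hypothetical.

```python
import math


def nrmse(y_true, y_pred):
    """RMSE normalized by the population standard deviation of y_true (assumed definition)."""
    n = len(y_true)
    mean = sum(y_true) / n
    # Mean squared error between targets and predictions.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    # Population variance of the targets.
    var = sum((t - mean) ** 2 for t in y_true) / n
    return math.sqrt(mse / var)


targets = [1.0, 2.0, 3.0, 4.0]
constant_mean = [2.5] * len(targets)  # a predictor that always outputs the target mean
print(nrmse(targets, constant_mean))
```

Under this assumed normalization, a predictor that always outputs the target mean scores exactly 1.0 and a perfect predictor scores 0.0, so values near 1.0 indicate only a small gain over the mean predictor.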
计算机视觉
[CV-0] DiffusionBench: On Holistic Evaluation of Diffusion Transformers
链接: https://arxiv.org/abs/2606.24888
作者: Xingjian Leng,Jaskirat Singh,Zhanhao Liang,Ethan Smith,Martin Bell,Aninda Saha,Yuhui Yuan,Liang Zheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion transformer (DiT) research on image generation has converged to a single evaluation setup: class-conditional generation on ImageNet. While methods improve the FID and related metrics, it is increasingly unclear whether they reflect real progress in generative modeling. The natural alternative, i.e., text-to-image (T2I) generation, is perceived as too costly or inconvenient to train and evaluate and is often skipped. We argue that this perception no longer holds. We introduce NanoGen, a unified DiT training and evaluation framework. NanoGen matches state-of-the-art DiT baselines on ImageNet and, with 12 lines of configuration change, also trains competitive text-to-image models. It currently supports RAE, VAE, pixel-space, and MeanFlow diffusion methods under both ImageNet and T2I setups. Under NanoGen, training T2I requires comparable compute to ImageNet. After training 21 latent diffusion models with NanoGen, we observe that method ranking shows no strong correlation between ImageNet and T2I generation: Pearson correlation is between -0.377 and -0.580 across three metrics. This suggests that a method which improves class-conditional ImageNet FID may show no corresponding improvement on T2I, clearly indicating the necessity of evaluating DiTs on both tasks. To this end, we summarize ImageNet and text-to-image results, which yields DiffusionBench, a holistic benchmark for DiT research. We recommend reporting DiffusionBench in place of ImageNet alone: methods that improve DiffusionBench are more likely to reflect broader progress.
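The ranking comparison described above comes down to a Pearson correlation between per-method scores on two benchmarks. The sketch below shows that computation on made-up numbers; the score lists are hypothetical, and whether the authors correlate raw metric values or rank positions is not stated in the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Hypothetical scores for four methods on two benchmarks.
imagenet_scores = [1.0, 2.0, 3.0, 4.0]
t2i_scores = [4.0, 3.0, 2.0, 1.0]
print(pearson(imagenet_scores, t2i_scores))
```

A coefficient near +1 would mean the two benchmarks order methods the same way; the negative values reported in the abstract indicate that they do not.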
[CV-1] BenchX: Benchmarking AI Models for Cancer Detection and Localization with Demographic and Protocol Biases
链接: https://arxiv.org/abs/2606.24883
作者: Qi Chen,Wenxuan Li,Pedro R. A. S. Bassi,Xinze Zhou,Jakob Wasserthal,Ibrahim Ethem Hamamci,Sezgin Er,Ashwin Kumar,Yiwen Ye,Yuhan Wang,Yuyin Zhou,Akshay S. Chaudhari,Curtis Langlotz,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies occur when patient demographics and imaging protocols vary, for example, in detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. We leverage large language models (LLMs) to extract and organize subgroup information from clinical data, which makes the analysis both scalable and reproducible. Our benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African Americans. However, collecting sufficient annotated data for these rare cases is often impractical. The benchmark provides a foundation for building more reliable and robust AI models for tumor detection and highlighting the need for rigorous, subgroup-level evaluation in medical imaging and computer vision. Datasets, code
[CV-2] FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation
链接: https://arxiv.org/abs/2606.24876
作者: Orest Kupyn,Goutam Bhat,Philipp Henzler,Fabian Manhardt,Christian Rupprecht,Federico Tombari
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, oftentimes leading to poor gradient flow. FLAT solves this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at this https URL
[CV-3] FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation
链接: https://arxiv.org/abs/2606.24874
作者: Haorui Ji,Weizhe Liu,Hongdong Li,Hengkai Guo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse voxel representation has emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve high-frequency visual details of input images due to two structural bottlenecks. First, they adopt discriminative 2D features optimized for semantic abstraction to construct sparse voxel latents, which suppress reconstructive cues and induce a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that boosts both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning, propose Diffusion-Aligned Structured Latents (DA-SLAT) and couple it with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework, which integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms all state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.
[CV-4] IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation
链接: https://arxiv.org/abs/2606.24849
作者: Zixuan Li,Haokun Lin,Yicheng Xiao,Zhiwei Li,Xinyang Song,Zelong Zheng,Yong He,Heng Yao,Ke Ding,Chao Yu,Chuan Yuan,Qi Li,Zhenan Sun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade, where structural queries first form a latent visual plan and semantic queries then render appearance conditioned on this plan. To guide the structural queries, we introduce training-only sketch supervision, which encourages them to capture structure from sketches without requiring sketch extraction or intermediate decoding at inference time. IV-CoT performs implicit CoT reasoning in a single forward pass and achieves superior results on GenEval and T2I-CompBench. Visualizations and analyses demonstrate that the learned structural and semantic queries play complementary roles in structure-aware generation.
[CV-5] Spherical-to-ERP Epipolar Rectification for Single-Axis Disparity in 360 Stereo
链接: https://arxiv.org/abs/2606.24847
作者: Sahereh Obeidavi,Dieter Landes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 Pages, 4 Figures, Conference
Abstract:Omnidirectional stereo images provide full-surround perception but violate the geometric assumptions of classical disparity estimation: in spherical or fisheye views, epipolar correspondences follow curved great-circle paths, producing two-dimensional displacements that cannot be treated as single-axis disparity before geometric rectification. In this work, we adopt a standard spherical-to-equirectangular (ERP) projection as a preprocessing step, which straightens epipolar curves and restores a one-dimensional disparity structure - horizontal for left-right rigs and vertical for top-bottom rigs. Building on our previously introduced RAFT + Epipolar-Aligned Channel Selection (EACS) framework, originally developed for rectilinear and ERP stereo, we examine whether the same modular pipeline remains accurate when the input originates from spherical stereo imagery. After ERP projection, dense optical flow from RAFT is reduced to disparity by retaining only the baseline-aligned flow component. Experiments on synthetic fisheye stereo datasets show that this spherical-to-ERP-to-RAFT+EACS pipeline produces accurate, smooth, and structurally consistent disparity maps at real-time speed. These findings confirm that established ERP preprocessing can be effectively combined with our earlier RAFT+EACS method to enable practical, interpretable, and efficient disparity estimation from spherical stereo, providing a straightforward pathway for extending conventional stereo pipelines to 360 imaging.
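The step of reducing dense optical flow to disparity by keeping only the baseline-aligned component can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' EACS implementation: the function name, the nested-list flow layout, and the sign convention (the raw component is kept unchanged) are all hypothetical.

```python
def flow_to_disparity(flow, rig="left-right"):
    """Keep only the flow component aligned with the stereo baseline.

    flow: H x W nested list of (u, v) tuples, where u is the horizontal
          and v the vertical displacement in pixels (assumed layout).
    rig:  "left-right" keeps u, "top-bottom" keeps v.
    """
    if rig == "left-right":
        axis = 0  # horizontal baseline: disparity lies along x
    elif rig == "top-bottom":
        axis = 1  # vertical baseline: disparity lies along y
    else:
        raise ValueError(f"unknown rig type: {rig!r}")
    # Discard the off-axis component, which should be near zero after rectification.
    return [[pixel[axis] for pixel in row] for row in flow]


# Toy 1 x 2 flow field with small off-axis residuals.
toy_flow = [[(3.0, 0.1), (2.5, -0.2)]]
print(flow_to_disparity(toy_flow, rig="left-right"))
```

The off-axis component that is dropped here could also serve as a rough check of how well the ERP projection straightened the epipolar curves, although the abstract does not say whether the authors use it that way.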
[CV-6] Bridging the Manifold Gap: Riemannian Residual Line Search for One-Step Image Editing
链接: https://arxiv.org/abs/2606.24844
作者: Hongzhu Yi,Zhongtian Luo,Tong Li,Yiyan Fan,Jungang Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:One-step diffusion editors are fast because they avoid inversion and iterative optimization, but a single transport update must be aggressive enough to realize the target prompt and conservative enough to preserve the source image, and no fixed update strength satisfies both demands across edit types. We treat this tension as a post-hoc candidate-selection problem on top of energy-field transport rather than as a new editing model. Our proposed method, Riemannian Residual Line Search, first builds a stronger edit by estimating the local time curvature of the prompt-delta field and projecting the corrected direction back onto the update norm of the original first-order energy-field transport estimation. It then forms a small residual path from the source image to this strong edit, retains the original first-order output as one candidate, and picks the final image by maximizing target-prompt CLIP alignment. On a 700-sample PIE-Bench++ evaluation across 10 edit type IDs, our method achieves state-of-the-art (SOTA) performance among current one-step update algorithms.
[CV-7] GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction
链接: https://arxiv.org/abs/2606.24829
作者: Chenrui Fan,Paolo Favaro
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 17 figures, 18 tables
Abstract:Camera-prompted text-to-video (T2V) models are increasingly used to synthesize virtual camera captures, such as orbiting objects or moving through static scenes. For these outputs, visual plausibility is insufficient: the generated frames should also provide coherent multi-view evidence for a single static 3D scene. We introduce GeoT2V-Bench, a reconstruction-based diagnostic benchmark for evaluating whether camera-prompted T2V clips can support explicit rigid 3D reconstruction. Our pipeline estimates per-frame camera intrinsics and poses with VGGT-style geometry estimation, fits DeformableGS, derives a static MedianGS proxy by temporal-median aggregation, and renders this proxy along the estimated camera path. Instead of producing a pass/fail label or a single scalar score, GeoT2V-Bench reports a continuous reconstruction profile covering apparent image motion, estimated trajectory behavior, MedianGS static rendering error, static-render flow agreement, and the gap between flexible and static fits. On a fair-format four-seed evaluation with 3,840 completed reconstructions from 12 open-weight model configurations and 80 GeCo-Eval static-scene prompts, we find that visible motion, static rendering error, flow agreement, and flexible-vs-static behavior often disagree. GeoT2V-Bench therefore captures complementary failure modes that emerge when generated videos are tested as global static-scene acquisitions.
[CV-8] High-Fidelity Synthetic Transmission Electron Microscopy Image Generation Using Diffusion Probabilistic Models for Data-Limited Semiconductor Metrology
链接: https://arxiv.org/abs/2606.24817
作者: Johannes Boehm,Bappaditya Dey
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: To be presented at the 2026 International Symposium ELMAR, published by IEEE in the conference proceedings
Abstract:Advanced semiconductor nodes drastically increased demand for Transmission Electron Microscopy (TEM), yet destructive sample preparation, slow imaging and high costs severely limit the availability of diverse datasets needed for downstream machine learning (ML). Synthetic data generation is becoming essential, but current generative models often miss TEM-specific noise, structural detail, and stochastic variability crucial for evaluation. We present a Denoising Diffusion Probabilistic Model (DDPM) framework for synthetic TEM image generation under extreme data scarcity. A progressive patch-based training strategy scales from low-resolution patches to full images, enabling from-scratch training with only 15 samples. We integrate a custom TrivialAugment adaptation, cross-process domain transfer, classifier guidance, and RePaint-style inpainting, culminating in full-image generation that preserves global structural and spatial relationships in compliance with FAB metrology requirements. Beyond synthesis, we repurpose DDPM feature representations for segmentation, partitioning encoder feature maps to obtain coherent region masks. Our synthetic images achieve up to MS-SSIM 0.98 and qualitative expert assessment consistent with structural similarity results, facilitating downstream ML training for defect detection, segmentation, and metrology while preserving statistical and physical realism.
[CV-9] DDStereo: Efficient Dual Decoder Transformers for Stereo 3D Road Anomaly Detection
链接: https://arxiv.org/abs/2606.24805
作者: Shiyi Mu,Zichong Gu,Zhiqi Ai,Yilin Gao,Shugong Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Stereo-based 3D object detection still faces two critical safety challenges: real-time performance and open-set generalization. Existing stereo 3D methods typically achieve twice the accuracy of monocular methods but suffer from significantly lower inference speeds, making them unsuitable for real-time applications. Meanwhile, recent advances in open-world detection have introduced open-set and open-vocabulary algorithms in monocular 2D and 3D settings, yet stereo-based open-set detection remains largely unexplored. To bridge this gap, we propose DDStereo, a novel Dual-Decoder Stereo Transformer for real-time open-set 3D object detection. DDStereo features two lightweight decoder branches: one for open-set foreground 2D detection and the other for 3D attribute regression. These decoders share object-level queries to achieve unified target-level alignment. To enhance inference efficiency, we designed a compact disparity feature extractor and a streamlined decoder architecture. Experiments on public stereo 3D benchmarks demonstrate that DDStereo achieves state-of-the-art accuracy under both closed-set and open-set protocols. Notably, our method surpasses existing stereo 3D detectors in inference speed and, for the first time, achieves real-time performance comparable to monocular approaches.
[CV-10] OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis
链接: https://arxiv.org/abs/2606.24799
作者: Chenrui Fan,Paolo Favaro
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 40 pages, 33 figures, 19 tables
Abstract:Generic text-to-video models can be used as rich open-world scene priors. Despite the high quality of today’s generated videos, they do not directly yield reliable 3D assets: camera motion is difficult to control, view coverage is partial, and frames often contain inconsistencies across time. We introduce OrbitForge, an adapter built from frozen video priors and per-prompt Gaussian Splatting reconstruction optimization that converts a single text-generated video into a canonical closed-orbit 3D Gaussian Splatting scene. We use 3D reconstruction as an anchor to improve the 3D consistency of the generated video. We obtain a preliminary 3D reconstruction from a first generated video via Deformable Gaussian Splatting with a robust MedianGS proxy. We render views from a prescribed orbit to detect missing viewpoints. OrbitForge uses the text-to-video model to complete only the missing views, and reconstructs the completed orbit into a final Gaussian Splatting scene. This design requires no task-specific video or multiview fine-tuning, avoids per-prompt score-distillation optimization, and does not progressively generate views one step at a time. We further argue that this setting demands coverage-aware evaluation: local smoothness alone rewards methods that never attempt a full orbit. On a frozen 300-prompt T3Bench-derived audit, OrbitForge reconstruction attains a 359.0-degree measured median span, raises originally unsupported-bin Q10 ImageReward from 8.07 to 16.36 relative to MedianGS-only reconstruction, while remaining competitive with VideoMV on the coverage-quality.
[CV-11] EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence
链接: https://arxiv.org/abs/2606.24797
作者: Linpeng Huang,Weixing Chen,Zexin Chen,Yang Liu,Liang Lin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evidence remains largely unexamined. This disconnect between answer generation and evidence understanding motivates the construction of the Evidence-Grounded Video Question Answering Benchmark (EG-VQA), an open-ended evaluation protocol in which each QA pair is explicitly annotated with supporting temporal evidence, thereby requiring joint reasoning and precise evidence localization. EG-VQA is comprised of 2,067 videos and 11,838 QA pairs with fine-grained evidence annotations. To evaluate predicted evidence, Evidence-Grounded F1 (EG-F1) is introduced as a unified metric in which temporal alignment and semantic consistency against ground-truth evidence are jointly measured. Experimental evaluation reveals that even strong proprietary models struggle to accurately ground their predictions, exposing a fundamental discrepancy between answer correctness and faithful evidence localization. To bridge this gap, EG-Reasoner, an evidence-grounded reasoning model trained with explicit supervision, is proposed. State-of-the-art performance is achieved among open-source models, with results competitive against proprietary systems; particularly pronounced gains are observed on reasoning-intensive tasks such as counterfactual questions. These findings demonstrate that scaling alone is insufficient for robust video understanding and that structured evidence supervision is essential for the development of more reliable and interpretable VideoQA systems.
[CV-12] Pocket-SLAM: Rendering-Area-Aware Pruning for Memory-Efficient 3DGS-SLAM ICRA
Link: https://arxiv.org/abs/2606.24796
Authors: Leshu Li, Jie Peng, Yang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2026 IEEE International Conference on Robotics and Automation (ICRA)
Abstract:3D Gaussian Splatting (3DGS) has garnered significant attention in Simultaneous Localization and Mapping (SLAM) due to its advances in capturing fine-grained geometry features and synthesizing novel views. For SLAM in large-scale scenes, such as autonomous driving, 3DGS-SLAM faces a critical limitation: memory consumption increases continuously over time as Gaussian points accumulate, leading to poor memory efficiency and limiting its applicability. In this work, we propose a rendering-area-aware pruning strategy that selectively removes Gaussians based on their contribution to the effective rendering area, rather than solely relying on Gaussian-level heuristics such as opacity or gradient magnitude. This perspective directly targets the sources of memory redundancy, effectively reducing the peak memory footprint of 3DGS-SLAM during runtime. Evaluations on the EuRoC and KITTI datasets demonstrate that our method consistently outperforms existing pruning approaches in large-scale outdoor scenes, achieving over 60% memory reduction and more than 2 times FPS improvement while preserving localization and mapping accuracy. These results highlight rendering-area-aware pruning as a promising direction for scaling 3DGS-SLAM to real-world autonomous driving scenarios. Our code is publicly available at this https URL.
[CV-13] Counting Trees from Satellite Imagery with Noisy Supervision
Link: https://arxiv.org/abs/2606.24786
Authors: Dimitri Gominski, Maurice Mugabowindekwe, Qiue Xu, Xiaowei Tong, Martin Brandt, Hieu Le, Rasmus Fensholt, Dimitris Samaras, Loic Landrieu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Counting individual trees is a fundamental task for environmental monitoring, yet remains largely unexplored with satellite imagery. At these resolutions, isolated trees may still be identifiable, but crown boundaries become ambiguous in dense forests, making the notion of an individual tree inherently ill-defined. Moreover, large-scale manual annotations of individual trees are prohibitively expensive. While scalable supervision can be derived from airborne LiDAR, the resulting annotations are noisy and difficult to exploit effectively. We address these challenges by formulating tree counting as a spatial density matching problem supervised through Unbalanced Optimal Transport. This formulation naturally accommodates both precise localization of isolated trees and robust density estimation in dense forests. We further introduce a self-correction mechanism that leverages transport residuals to progressively refine noisy supervision during training. We evaluate our approach on TinyTrees, a new benchmark spanning three continents and three satellite sensors, comprising over 215 million tree annotations (including 773K manually verified instances) across 23,000 this http URL. Our method consistently outperforms detection-based, regression-based, and transport-based distribution-matching baselines, demonstrating the effectiveness of unbalanced transport and reliability-aware supervision for large-scale tree counting from satellite imagery. Code, data and models are available at this https URL.
[CV-14] AerialFusionMapNet: Online HD Map Construction with Aerial-Onboard BEV Fusion ITSC
Link: https://arxiv.org/abs/2606.24784
Authors: Daniel Lengerer, Mathias Pechinger, Klaus Bogenberger, Carsten Markgraf
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026
Abstract:High-resolution aerial imagery has recently emerged as a complementary modality for automated driving perception and has shown potential to improve bird's-eye-view (BEV) scene understanding when fused with onboard sensors. Prior work demonstrated performance gains for online high-definition (HD) map construction through aerial-onboard fusion; however, conventional end-to-end fusion does not fully exploit the structural information contained in aerial representations. In this work, we introduce AerialFusionMapNet, a fusion-based mapping framework with a structured two-stage training strategy that explicitly enhances the contribution of aerial features within a unified pipeline. The proposed training scheme enables more effective integration of structural aerial priors. On the nuScenes geographic split, AerialFusionMapNet achieves up to 54.7 mAP, improving on the 48.8 mAP of prior aerial-onboard fusion baselines by 5.9 mAP absolute (12.1% relative). The results suggest that structured training design, rather than increased architectural complexity, plays a more decisive role in unlocking the full potential of aerial imagery for online HD map construction. Code and trained models are available at this https URL.
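The absolute and relative gains quoted in this abstract follow from the two mAP figures alone. A minimal sketch of that arithmetic is below; the function name `gains` is illustrative and does not come from the paper.

```python
def gains(baseline: float, new: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent) of `new` over `baseline`."""
    absolute = new - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# mAP values as stated in the abstract (nuScenes geographic split).
absolute, relative_pct = gains(baseline=48.8, new=54.7)
print(f"+{absolute:.1f} mAP absolute, +{relative_pct:.1f}% relative")
```

The relative figure is measured against the baseline score, which is why a gain of 5.9 mAP corresponds to roughly 12.1%.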
[CV-15] Revealing Training Data Exposure in Vision Language Large Models via Parameter Gradients
Link: https://arxiv.org/abs/2606.24774
Authors: Zhihao Zhu, Hongyi Tang, Yi Yang, Ahmed Abbasi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-Language Large Models (VLLMs) trained on massive crawled corpora raise pressing copyright and data-provenance concerns. These concerns are particularly acute in healthcare, where patient medical images paired with clinical reports demand rigorous privacy safeguards. However, existing training data detection methods either fail in cross-modal scenarios or rely on superficial output signals with insufficient discriminative power. We introduce GradAudit, a gradient-based auditing framework that examines internal optimization dynamics rather than treating VLLMs as black boxes. Our approach builds on a key observation: model parameters converge to regions where gradients on training samples become stable and well-aligned, whereas gradients on non-training samples remain noisy and inconsistent. By analyzing these gradient signatures, GradAudit achieves strong separability and detects genuine image-text associations learned during training, not merely individual modality membership. Empirically, across both medical and general-domain datasets, GradAudit substantially outperforms state-of-the-art baselines in both pretraining and fine-tuning VLLMs. In a case study employing copyrighted content, we show that existing training data detection methods not only underestimate the extent of unauthorized data usage, but that this underestimation becomes more pronounced as models become more recent and more advanced.
[CV-16] Compact Object-Level Representations with Open-Vocabulary Understanding for Indoor Visual Relocalization
Link: https://arxiv.org/abs/2606.24767
Authors: Zhaopeng Cui, Jiarui Hu, Jingbo Liu, Boming Zhao, Xiyue Guo, Boyin Feng, Haocheng Peng, Yujun Shen, Hujun Bao, Guofeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by RA-L 2026
Abstract:Indoor visual relocalization plays a critical role in emerging spatial and embodied AI applications. However, prior research was predominantly devoted to low-level vision schemes, struggling to perceive scene semantics and compositions, which limits both interpretability and applicability. In this paper, we explore the issue of how to organize rich object information in a scene, including semantics, layout, and geometry, into a structured map representation, thereby utilizing object units exclusively to drive the camera relocalization task. To this end, we propose OpenReLoc, a camera relocalization system designed to provide scene understanding and accurate pose estimation capabilities. Leveraging recent foundation models, we first introduce a multi-modal mechanism to integrate open-vocabulary semantic knowledge for effective 2D-3D object matching. Additionally, we design object-oriented reference frames as position priors, paired with a reference frame selection strategy based on the Distance-IoU (DIOU), enabling extension to scalable scenes. Moreover, to ensure stable and accurate pose optimization, we also propose a dual-path 2D Iterative Closest Pixel loss guided by object shape. Experimental results demonstrate that OpenReLoc achieves superior relocalization recall and accuracy across various datasets. Our source code will be released upon acceptance.
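For readers unfamiliar with the Distance-IoU score mentioned above, the sketch below computes it for two axis-aligned 2D boxes using the commonly cited definition (IoU minus the squared center distance divided by the squared diagonal of the smallest enclosing box). How OpenReLoc applies the score to reference frame selection is not described in the abstract, so this shows only the metric itself, and the function name `diou` is illustrative.

```python
def diou(box_a, box_b):
    """Distance-IoU of two non-degenerate axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU: overlap area over union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both inputs.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2


same = diou((0, 0, 2, 2), (0, 0, 2, 2))
shifted = diou((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = diou((0, 0, 1, 1), (2, 2, 3, 3))
```

Unlike plain IoU, the score stays informative for non-overlapping boxes: it becomes negative and decreases as the centers move apart.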
[CV-17] UniDrive: A Unified Vision-Language and Grounding Framework for Interpretable Risk Understanding in Autonomous Driving
Link: https://arxiv.org/abs/2606.24759
Authors: Xiaowei Gao, Pengxiang Li, Yitai Cheng, Ruihan Xu, James Haworth, Stephen Law, Yun Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent multimodal large language models (MLLMs) have shown strong potential for autonomous driving scene understanding, yet existing methods still face a fundamental trade-off between temporal reasoning and spatial precision. Models that rely on single-frame or low-resolution inputs often miss small, distant, or partially occluded hazards, while language-centric driving models frequently provide limited grounded evidence for their explanations. To address this gap, we propose UniDrive, a unified visual-language and grounding framework for interpretable risk understanding in autonomous driving. UniDrive combines a temporal reasoning branch that models scene dynamics from multi-frame visual input with a high-resolution perception branch that preserves fine-grained spatial details from the latest frame. The two branches are integrated through a gated cross-attention fusion module, enabling dynamic context to be aligned with precise spatial evidence. Based on the fused representation, UniDrive jointly generates natural-language risk descriptions and grounded bounding-box outputs for risk objects. Experiments on the DRAMA-Reasoning benchmark show that UniDrive outperforms representative image-based and video-based baselines in both captioning and risk-object grounding. In particular, UniDrive achieves the best overall performance on the validation split and demonstrates clear advantages in small-object localization, zero-shot generalization to NuScenes and BDD100K, and human-rated interpretability and trustworthiness. These results suggest that explicitly combining temporal semantics and high-resolution perception provides a stronger foundation for interpretable and safety-oriented autonomous driving systems. The code is available at this https URL.
[CV-18] Adaptive Hebbian Memory Routing in Vision Transformers for Few-Shot Learning
Link: https://arxiv.org/abs/2606.24756
Authors: Mohammed Yusuf Mujawar, Noorbakhsh Amiri Golilarz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Few-shot image recognition requires models to adapt to new classes from a small labeled support set. Hebbian fast-weight memory can provide temporary associative information during an episode, but fixed memory behavior may not be appropriate for every few-shot task. In this work, we propose Adaptive Hebbian Routing for few-shot Vision Transformers. The method uses a lightweight MLP router to control the contribution of Hebbian memory, the strength of memory updates, and the retention of previous memory from support-set features. We study Adaptive Placement, Adaptive Plasticity, and Fully Adaptive Hebbian Routing. Experiments use ViT-Small, DeiT-Small, and Swin-Tiny under 5-way 1-shot evaluation on Omniglot, CIFAR-FS, and cross-domain transfer from CIFAR-FS to Omniglot. In the direct Swin comparison, fixed and adaptive Hebbian variants use the same memory location. Adaptive Plasticity improves the fixed Hebbian result from 96.74% to 96.92%, while Fully Adaptive Routing achieves the best result at 96.94%. The fully adaptive Swin model also reduces inference time from 16.51 ms to 14.05 ms relative to fixed Hebbian Swin. On CIFAR-FS, adaptive variants improve performance across all three backbones, and the multi-shot evaluation shows that these gains remain useful as the number of support examples increases. These results show that adaptive plasticity and adaptive memory activation can improve few-shot Transformer representations beyond fixed Hebbian behavior.
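To make the three routed quantities concrete (memory contribution, update strength, and retention), here is a minimal sketch of a generic Hebbian fast-weight memory. It assumes the textbook outer-product rule; the paper's exact update and its MLP router are not given in the abstract, so the scalars `gate`, `strength`, and `retention` are fixed by hand and all function names are hypothetical.

```python
def hebbian_write(memory, key, value, strength, retention):
    """Outer-product update: M <- retention * M + strength * outer(value, key)."""
    return [
        [retention * memory[i][j] + strength * value[i] * key[j]
         for j in range(len(key))]
        for i in range(len(value))
    ]


def hebbian_read(memory, query, features, gate):
    """Add the gated memory recall (M @ query) to the input features."""
    recalled = [sum(row[j] * query[j] for j in range(len(query))) for row in memory]
    return [f + gate * r for f, r in zip(features, recalled)]


# Store one support-set association, then query it with the same unit-norm key.
key = [1.0, 0.0]
value = [0.5, -0.25, 2.0]
memory = [[0.0, 0.0] for _ in value]
memory = hebbian_write(memory, key, value, strength=1.0, retention=0.9)

recalled = hebbian_read(memory, key, features=[0.0, 0.0, 0.0], gate=1.0)
bypassed = hebbian_read(memory, key, features=[1.0, 1.0, 1.0], gate=0.0)
```

With `gate=0.0` the memory is bypassed entirely, which is the kind of per-task control that a fixed Hebbian configuration cannot provide.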
[CV-19] BioMedVR: Confusion-Aware Mixture-of-Prompt Experts for Biomedical Visual Reprogramming ECCV2026
Link: https://arxiv.org/abs/2606.24740
Authors: Jiaxiang Liu, Tianxiang Hu, Juwei Guan, Yujie Wu, Yusong Wang, Yao Mu, Zuozhu Liu, Mingkun Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026. 19 pages, 6 figures. Project page: this https URL
Abstract:Recent advances in vision-language models (VLMs) such as CLIP have demonstrated strong generalization across natural-image domains. However, adapting these models to biomedical imaging is non-trivial: full-model fine-tuning is computationally expensive, while medical data are often scarce and exhibit subtle, fine-grained inter-class differences, making parameter-efficient adaptation particularly critical. Visual Reprogramming (VR) offers a parameter-efficient alternative by injecting learnable perturbations into the input space, but existing VR approaches for VLMs mainly focus on positive class prompts and overlook confusing negatives, leading to miscalibrated predictions in fine-grained medical scenarios. We present BioMedVR, the first VR-based framework for biomedical imaging, enabling few-shot adaptation of pretrained VLMs through compact learnable VR modules. To mitigate class confusion, we introduce a Confusion Minimization Mechanism that leverages LLM-generated confusion-aware attributes together with a Confusion-Suppression Loss to explicitly reduce false-positive alignment. Moreover, the designed Mixture-of-Prompt Experts combines a positive expert for main-class discrimination and a negative expert for confusion suppression, balanced via adaptive gating. Extensive experiments on 18 datasets, including 11 biomedical datasets and 7 natural image benchmarks, demonstrate that BioMedVR achieves superior accuracy and generalization, effectively bridging VR and VLMs in biomedical domains.
[CV-20] VSANet: View-aware Sparse Attention Network for Light Field Image Denoising
Link: https://arxiv.org/abs/2606.24737
Authors: Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Light field (LF) image denoising is challenging due to the high-dimensional structure of LF data. While noise is independent across sub-aperture images, scene content exhibits strong cross-view correlations. We introduce VSANet, a view-aware sparse attention network for LF denoising. Specifically, we propose a view-aware sparse attention (VSA) block that represents the 4D LF feature map as a unified spatial-angular token space and performs cross-view aggregation via locality-sensitive hashing-based sparse attention. This enables global feature interactions with linear complexity, effectively exploiting LF correlations across views and spatial locations. In addition, we design a feature refinement (FR) block to emphasize informative features in spatial, angular, and epipolar subspaces. The VSA and FR blocks are integrated within a sequential attention refinement module, forming the core of VSANet. Experiments demonstrate that VSANet outperforms state-of-the-art LF denoising methods.
[CV-21] SER: Learning to Ground Video Reasoning with Semantic Evidence Rewards
Link: https://arxiv.org/abs/2606.24726
Authors: Sheng Xia, Zhengqin Lai, Tianxiang Jiang, Kanghui Tian, Shoujun Zhou, Bin Li, Yi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video MLLMs often struggle with fine-grained spatio-temporal reasoning, sometimes generating correct answers based on irrelevant frames or objects. Although outputting spatio-temporal evidence during reasoning is a promising direction, existing RL frameworks typically rely on geometry-only (IoU) rewards, which can be sensitive to boundary perturbations and overlook semantic alignment. To address this, we propose Semantic Evidence Reward (SER), which reformulates spatio-temporal evidence grounding as a constrained verification task. Instead of computing pixel-level overlap, SER uses a referee VLM as a local checker to evaluate model-generated evidence claims across two dimensions: relevance and localization quality, combined with a temporal penalty. This design reduces the reliance on dense box annotations and enables training directly on standard video QA data. On the V-STAR benchmark, SER achieves 49.6% mLGM, improving by 3.0 points over the strong evidence-grounded baseline Open-o3-Video, demonstrating its potential in enhancing both answer accuracy and evidence grounding.
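The sensitivity of IoU-style rewards to boundary perturbations can be seen with a one-dimensional (temporal) example. The sketch below is only an illustration of that sensitivity under made-up interval values; it does not implement SER or its referee VLM, and `temporal_iou` is an illustrative helper name.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union


# The same one-second shift applied to a short and a long evidence segment.
short = temporal_iou(pred=(11.0, 13.0), gt=(10.0, 12.0))  # 2-second segment
long_ = temporal_iou(pred=(11.0, 31.0), gt=(10.0, 30.0))  # 20-second segment
print(f"short segment IoU: {short:.3f}, long segment IoU: {long_:.3f}")
```

An identical boundary error costs the short segment far more reward than the long one, even though both predictions cover semantically similar evidence.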
[CV-22] Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations ECCV2026
Link: https://arxiv.org/abs/2606.24716
Authors: Jonas Klotz, Cassio F. Dantas, Pallavi Jain, Diego Marcos, Begüm Demir
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV 2026
Abstract:Sparse autoencoders (SAEs) are increasingly used to extract interpretable concepts from vision and vision language models, yet existing evaluation methods largely rely on proxy metrics or qualitative inspection rather than measuring semantic correspondence. We present a human-grounded evaluation framework that quantifies alignment between SAE latents and human-annotated concepts, without requiring user studies, and validate this matching through targeted attribute perturbations. To enable this intervention-style evaluation in vision, we construct synCUB and synCOCO, synthetic benchmarks of paired images that differ in exactly one attribute. We introduce Fully-Binary Matching Pursuit (FBMP), a coalition-based matching procedure that supports many-to-one mappings between SAE latents and annotated concepts, and consistently outperforms one-to-one baselines. For functional validation, we propose a Targeted Attribute Perturbation Alignment Score (TAPAScore), which tests whether matched concepts respond selectively and in the expected direction under targeted image-level attribute perturbations. Under sanity checks, our matching and TAPAScore are the only evaluated metrics that reliably distinguish trained SAEs from untrained ones. Across SAEs trained on CLIP and DINOv2 embeddings, we find that increased overcompleteness can reduce perturbation alignment, indicating a reduction in interpretability. Our evaluation framework suggests that moderate dictionary sizes provide the best trade-off, yielding the most interpretable SAEs. Code and datasets are available at this https URL.
[CV-23] Agentic Collaborative Cognition for Zero-Shot 3D Understanding ECCV2026
Link: https://arxiv.org/abs/2606.24649
Authors: Wenxin Wang, Bo Zhang, Feng Chen, Zixuan Wang, Wen Li, Changsheng Li, Yinjie Lei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026. Project page: this https URL
Abstract:Recent advancements have explored agentic zero-shot 3D understanding by reformulating it as video keyframe understanding with Multimodal Large Language Models (MLLMs). However, existing methods face an intrinsic bottleneck due to the finite observation perspectives inherent in videos and the implicit perception of 3D scenes. In this paper, we propose a collaborative multi-agent framework that assigns a Planning Agent to handle high-level viewpoint planning and supplement novel perspectives, and a Perception Agent to explicitly summarize the 3D scene into a structured holistic cognitive map. Specifically, the Planning Agent first analyzes this cognitive map to determine query-relevant viewpoints and supplements missing critical perspectives to ensure comprehensive observation. Subsequently, the Perception Agent documents object-level attributes from these views by assigning consistent instance identifiers across viewpoints, thereby integrating fragmented observations into the holistic cognitive map. In parallel, it provides feedback to filter out mismatched candidate objects and guide subsequent viewpoint planning. Through this closed-loop iterative process, the two agents collaboratively identify candidates until the Perception Agent determines that sufficient information has been captured to complete the task. Extensive experiments demonstrate that our method achieves state-of-the-art performance on 6 benchmarks, with improvements of 11.1% Acc@0.5 on ScanRefer, 14.6 BLEU-1 on 3D-assisted dialog, and 2.1 EM on SQA3D.
[CV-24] ArtiTwinSplat: Interactable Digital Twin Reconstruction via Gaussian Splatting from RGB-D videos ICRA2026
Link: https://arxiv.org/abs/2606.24628
Authors: Pranjal Mishra, René Zurbrügg, Max Wilder-Smith, Marco Hutter, Marc Pollefeys, Zuria Bauer, Hermann Blum
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the ICRA 2026 Workshop on Advances and Challenges in AI-Driven Automation and Robotic System Integration with Digital Twins, Vienna, June 2026
Abstract:Deploying robots in unstructured real-world environments requires accurate, interactive models of the objects. Constructing these models at scale remains a critical bottleneck for robotic system integration. We present ArtiTwinSplat, a framework that automatically constructs articulated, photo-realistic digital twins of objects directly from RGB-D videos, requiring no CAD models, simulation assets, or manual annotations. Our method is built on 3D Gaussian Splatting, which preserves geometric fidelity and photometric realism, coupled with an unsupervised articulation discovery pipeline that recovers part structure and joint kinematics from observed motion alone. With tracking and optimization stages, our method provides stable, queryable digital twins that support real-time rendering, viewpoint control, and interactive manipulation. Unlike prior methods confined to simulation, ArtiTwinSplat operates directly on real-world observations and produces twins that are immediately usable by downstream robot planning and learning systems. This method offers a practical, scalable pathway toward digital twin construction, lowering the integration barrier for articulated object manipulation in embodied AI and human-robot collaboration contexts.
[CV-25] ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering ECCV2026
Link: https://arxiv.org/abs/2606.24602
Authors: Zhentao Guo, Chen Duan, Tongkun Guan, Zining Wang, Kai Zhou, Pengfei Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:Despite remarkable progress in multimodal understanding, current MLLMs still exhibit limitations in video text understanding, particularly when semantics emerge through the integration of temporally distributed textual cues across multiple frames. This perception challenge fundamentally differs from static image text understanding, yet existing datasets fail to capture it: the vast majority of questions remain answerable from single frames, inadequately reflecting real-world video text comprehension demands. To address this, we present ViTexQA, a large-scale video-text QA dataset, and FrameThinker for robust multi-frame temporal reasoning. We build ViTexQA via a quality-controlled Chain-of-Thought (CoT) annotation pipeline boosted with temporal constraints; all its QA pairs demand cross-frame text fusion to solve, enforcing true temporal reliance. FrameThinker adopts two-stage training for explicit temporal modeling: CoT-Guided Supervised Fine-Tuning (SFT) generates frame-aware reasoning chains, followed by Temporally-grounded Reinforcement Learning (RL) optimized with multi-frame coherence rewards. Evaluations show our method outperforms SOTA baselines on ViTexQA, lifting ROUGE-L by 6.3%.
[CV-26] EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics
Link: https://arxiv.org/abs/2606.24586
Authors: Nahuel Gonzalez, Marta Robledo-Moreno, Ivan DeAndres-Tame, Ruben Vera-Rodriguez, Ruben Tolosana
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate (EER). This paper introduces EERLoss: a subdifferentiable, arbitrarily accurate approximation to EER for training deep biometric models. Furthermore, this framework has the potential to be adapted to optimize any specific operating point on the DET curve, enhancing its generalizability. To validate this approach, EERLoss is evaluated on a particularly demanding behavioral biometric modality: keystroke dynamics verification. This task is characterized by its high intra-class and low inter-class variability. Experiments are conducted on the large-scale KVC-onGoing benchmark, incorporating data from over 185,000 subjects across different scenarios. A comprehensive ablation study initially demonstrates the superiority of EERLoss in comparison to existing state-of-the-art loss functions. It also converges substantially faster compared to other losses, reducing the overall training cost. Additionally, a comparison is made between the proposed loss and the KVC-winning architecture by re-training it with EERLoss, demonstrating that the proposed approach significantly outperforms the original SoTA, achieving a relative EER reduction of up to approx. 30%. This improvement on a challenging, large-scale benchmark validates the effectiveness of EERLoss as a task-aligned training objective specifically suited for high-variance biometric traits.
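For context, the Equal Error Rate that EERLoss approximates is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below computes the usual empirical, non-differentiable EER from two score lists. It is not the paper's loss, and the score values and the function name are made up for illustration.

```python
def equal_error_rate(genuine, impostor):
    """Empirical EER: sweep thresholds and return the error rate at the point
    where false acceptance (FAR) and false rejection (FRR) are closest."""
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Higher score means "more likely the same user"; values are illustrative.
overlapping = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
separable = equal_error_rate([0.9, 0.8], [0.2, 0.1])
```

Because this quantity is built from threshold counts, its gradient with respect to the scores is zero almost everywhere, which is the misalignment that motivates a subdifferentiable surrogate.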
[CV-27] Jolia: Concept-Level Vision-Language Alignment for 3D CT Contrastive Learning
Link: https://arxiv.org/abs/2606.24570
Authors: Julien Khlaut, Charles Corbière, Baptiste Callard, Amaury Prat, Leo Butsanets, Antoine Saporta, Théo Danielou, Leo Machado, Korentin Le Floch, Tom Boeken, Pierre Manceron, Corentin Dancette
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language contrastive pretraining has become the dominant recipe for 3D medical foundation models, leveraging the large volumes of paired scans and reports produced in clinical practice. However, medical images usually span dozens of organs, and radiological reports are much longer than typical natural image captions and are composed of multiple structured sections. CLIP-style pretraining compresses this structure by encoding each modality into a single global token, at the risk of losing important details. We introduce ConQuer (Concept Queries), an image-text pretraining method that augments CLIP’s global alignment with a set of localized alignments, one per concept. ConQuer splits the report into concept-specific sections and learns cross-attention queries that pool the matching image features without using any segmentation mask or spatial supervision. Contrastive learning is then applied independently for each concept. Concepts can be any unit of semantic localization; here, they are anatomical regions, one query per organ or gross body region. As a byproduct, each query learns attention maps focused on its concept, providing built-in spatial interpretability. We use ConQuer to train Jolia, a 3D CT foundation model on chest and abdominal CT. Jolia consistently outperforms a CLIP baseline on findings classification, report generation, and cross-center transfer, and sets a new state of the art across multiple public benchmarks. Jolia’s weights will be released upon acceptance.
[CV-28] Multilevel Stochastic Plug-and-Play for Sparse-View CT Reconstruction
Link: https://arxiv.org/abs/2606.24567
Authors: Antoine De Paepe, Alexandre Bousse, Dimitris Visvikis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 12 pages, 6 figures, 3 tables
Abstract:Sparse-view computed tomography (SVCT) reduces radiation exposure and acquisition time, but the limited number of projection views makes the reconstruction problem severely ill-posed and leads to streak artifacts when analytical methods are used. Plug-and-Play (PnP) methods provide an effective way to combine data fidelity with learned image priors, while stochastic PnP methods further improve robustness by matching the denoiser input distribution through re-noising. However, these methods often require many iterations to converge, which limits their practical efficiency. In this work, we propose a multilevel (ML) stochastic PnP method for SVCT that accelerates stochastic PnP reconstruction. We highlight that, in the stochastic setting, directly enforcing prior coherence across levels would require accurately estimating fine-level prior gradients through multiple denoiser function evaluations, which substantially increases the computational cost. Motivated by this observation, we perform the multilevel steps in multiresolution analysis (MRA) approximation spaces. This choice is supported by the structure of the wavelet decomposition, which causes the prior-coherence correction to vanish in expectation, thereby avoiding costly estimation of fine-level stochastic prior gradients for the coarse-level corrections. Experiments on SVCT reconstruction show that our method, called Multilevel Stochastic Plug-and-Play (ML-SPnP), achieves reconstruction quality comparable to state-of-the-art methods while substantially reducing runtime.
[CV-29] PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
Link: https://arxiv.org/abs/2606.24564
Authors: Zhenyang Li, Lutao Jiang, Yizhou Zhao, Ying-Cong Chen, Xin Wang, Weikai Chen, Yifan Peng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures
Abstract:Reconstructing realistic, physically plausible garments from a single image remains a fundamental challenge. Template-free methods capture surface geometry but lack explicit sewing structure for simulation, while programmatic systems are simulation-ready but constrained by predefined templates. This reveals a fundamental representation gap between geometric reconstruction and structured garment construction. We present PatternGSL, a structured garment representation in the form of a template-free and learnable specification language that encodes complete sewing patterns, including panel boundaries, parameterized seams, and explicit stitch topology, in a compact and standardized form. PatternGSL preserves the physical rigor of pattern-based models while removing template dependence, elevating sewing structure as a first-class target for generative modeling. We further propose a vision-language framework that predicts PatternGSL specifications directly from a single image and decodes them into garments using lightweight deterministic validity handling, without optimization-based refinement or manual cleanup. In addition, we introduce PatternGSLData, the first large-scale image-to-GSL paired dataset comprising 300K samples with complete sewing pattern annotations, enabling supervised VLM training for structured garment reconstruction. Experiments demonstrate improved pattern accuracy over prior baselines, explicit sewing-structure recovery, reliable cloth simulation, and pattern-level editing through the same deterministic decoding pipeline. Code and data-processing scripts will be released at this https URL.
[CV-30] Quantum CT via Dynamic Interval Encoding and Prior-Balanced QUBO Reconstruction
Link: https://arxiv.org/abs/2606.24561
Authors: Ao Wang, Yikuang Yuluo, Yujie Liu, Shuangyang Zhong, Yuwen Zhang, Zihao Wang, Fenglin Liu, Andreas Maier, Haijun Yu, Yixing Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 10 figures
Abstract:Quadratic unconstrained binary optimization (QUBO)-based quantum computed tomography (CT) casts reconstruction as a binary quadratic problem for quantum annealing and hybrid quantum–classical solvers. For grayscale CT, however, image encoding is constrained by the binary-variable budget: fixed global bit-plane encodings increase QUBO size and coupling complexity as gray-level precision improves, whereas low-bit encodings introduce quantization error. We propose a QUBO-based grayscale CT reconstruction framework that combines dynamic interval encoding with prior-balanced optimization. Each refinement round encodes active pixels only within local gray-level intervals around the current estimate, and a boundary-hit-guided update rule adaptively switches between search expansion and local refinement. To improve optimization stability, the method balances projection-domain data consistency and an edge-preserving quadratic prior before forming the final QUBO. Sparse-view and limited-angle fan-beam CT experiments show that the proposed method recovers structures and gray-level distributions more faithfully than the evaluated analytic, iterative, variational, and representation-based baselines. Expressivity analysis and ablation studies further indicate that the improvement mainly arises from effective gray-level representation through dynamic local encoding and more stable data-fidelity–prior coupling. Experiments on the D-Wave hybrid binary quadratic model (BQM) solver further demonstrate that the formulation is executable on a hardware-backed hybrid quantum–classical backend.
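The trade-off between binary-variable budget and gray-level precision can be illustrated with a toy quantizer. The sketch assumes a plain uniform binary encoding over an interval `[lo, hi]`; the paper's actual interval update rule and QUBO construction are not specified in the abstract, and the pixel value, interval bounds, and function name are hypothetical.

```python
def quantize(value, lo, hi, n_bits):
    """Encode `value` with `n_bits` binary variables spanning [lo, hi].

    Returns the bits (least significant first) and the decoded gray level.
    """
    levels = 2 ** n_bits - 1
    step = (hi - lo) / levels
    clipped = min(max(value, lo), hi)
    code = round((clipped - lo) / step)
    bits = [(code >> k) & 1 for k in range(n_bits)]
    decoded = lo + step * sum(b << k for k, b in enumerate(bits))
    return bits, decoded


pixel = 100.3  # hypothetical true gray level
# Fixed global encoding: 3 bits must cover the full 0..255 range.
_, global_decoded = quantize(pixel, lo=0.0, hi=255.0, n_bits=3)
# Local interval encoding: the same 3 bits cover a window around an estimate.
local_bits, local_decoded = quantize(pixel, lo=96.0, hi=104.0, n_bits=3)

global_err = abs(pixel - global_decoded)
local_err = abs(pixel - local_decoded)
```

With the same number of binary variables per pixel, narrowing the interval shrinks the quantization step, so precision improves without enlarging the QUBO. The cost is that the interval must contain the true value, which is presumably what an expansion-or-refinement rule has to manage.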
[CV-31] Heterogeneous Knowledge Distillation via Geometry Decoupling and Momentum-Aware Gradient Regulation
Link: https://arxiv.org/abs/2606.24557
Authors: Wuming Yang,Xiang Zhang,Hongmin Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Under review
Abstract:Heterogeneous Knowledge Distillation (HKD) aims to transfer knowledge across varying architectures (e.g., from Transformer to CNN) but inherently suffers from severe training instability. We reveal that this instability stems from two highly coupled challenges: massive feature norm discrepancies that cause optimization drag, and severe gradient conflicts between the primary and distillation objectives arising from distinct inductive biases. To achieve stable distillation, we propose SPOFA, a framework built upon a novel Feature and Gradient Dual Stabilization mechanism. Specifically, at the feature level, we introduce a LayerNorm-based decoupling projector that explicitly decouples feature magnitude from direction, creating a bounded and stable space for semantic alignment. At the gradient level, we propose a momentum-driven Exponential Moving Average (MEMA) dynamic scaler. By establishing a robust historical baseline of the optimization trajectory, MEMA actively evaluates instantaneous gradient conflicts and adaptively penalizes harmful distillation signals, guaranteeing stable convergence. Importantly, SPOFA achieves this dual stabilization with an extremely lightweight parameter footprint. Extensive experiments on two mainstream benchmarks demonstrate that SPOFA achieves state-of-the-art accuracy, significantly outperforming computationally expensive methods while introducing only minimal computational overhead compared to standard baselines.
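The feature-level claim, that a LayerNorm-style projection separates feature magnitude from direction, can be checked numerically with a small sketch. This is not SPOFA's projector: the parameter-free `layer_norm` function, the 50x scale gap between "teacher" and "student" features, and the array shapes are all assumptions chosen for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free layer normalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 16))
teacher = 50.0 * student          # same direction, much larger norm

# Raw features differ enormously because of the norm mismatch ...
raw_gap = np.linalg.norm(teacher - student)
# ... but after normalization only the direction is compared.
aligned_gap = np.linalg.norm(layer_norm(teacher) - layer_norm(student))
```

Here `raw_gap` is large while `aligned_gap` is close to zero, which is the bounded alignment space the abstract refers to. The gradient-level MEMA scaler is not sketched because the abstract does not specify its update rule.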
[CV-32] Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning
Link: https://arxiv.org/abs/2606.24548
Authors: Jiayi Lei,Yuandong Pu,Xingyu Han,Rongpeng Zhu,Jing Xu,Jinyao Wang,Zijian Zhou,Bin Fu,Yuewen Cao,Yihao Liu,Yongsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures. Project page: this https URL
Abstract:Text-to-image (T2I) generation models have achieved remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether their success reflects genuine causal understanding or sophisticated pattern matching over visual-textual correlations. Inspired by Russell’s inductivist turkey, we introduce Counterfactual-World (CF-World), a counterfactual benchmark designed to investigate whether text-to-image models can generate images under rules that systematically contradict real-world priors. CF-World organizes each scenario into three progressive levels: factual generation under ordinary world knowledge, explicit counterfactual generation with direct visual instructions, and implicit counterfactual generation requiring causal deduction from altered rules. We evaluate both open-source and closed-source T2I models using a Vision Language Model (VLM)-based evaluator (CF-Eval). Furthermore, we introduce two metrics: Prior Resistance Rate (PRR), which measures a model’s ability to overcome entrenched real-world priors, and Reasoning Retention Rate (RRR), which assesses whether models can maintain reasoning-dependent counterfactual generation without explicit visual cues. Experiments show that all models exhibit sharp degradation from factual to counterfactual settings. Further analyses suggest that these failures arise because current T2I models encode world knowledge and visual appearances as tightly coupled patterns. Consequently, their heavy reliance on frequent visual co-occurrences within the training data forces them to default to familiar commonsense priors when tasked with rendering counterfactual worlds.
[CV-33] PointVG-R: Internalizing Geometric Reasoning in MLLMs for Precise Pointing Localization via Visual Chain of Thought
Link: https://arxiv.org/abs/2606.24539
Authors: Ling Li,Bowen Liu,Zinuo Zhan,Jianhui Zhong,Ziyu Zhu,Bingcai Wei,Kenglun Chang,Zhidong Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pointing-based visual grounding requires models to precisely locate target objects by deciphering complex spatial relationships between the visual scene and pointing gestures. Traditional methods typically encode input images into static feature representations and perform reasoning primarily within the linguistic domain, often overlooking the rich perceptual cues and explicit spatial geometry inherent in images. In this study, we aim to mitigate the cognitive vulnerability of models in interpreting gestural spatial relations by proposing PointVG-R, a reasoning-guided Multi-modal Large Language Model (MLLM). PointVG-R introduces geometric-aware reasoning for pointing-based grounding, enabling the model to think with images through the strategic integration of Reinforcement Learning (RL) and cold-start data. Specifically, we design a novel geometric reasoning pipeline that simulates the iterative cognitive process humans employ when interpreting pointing gestures. Furthermore, we construct EgoPoint-CoT, a high-quality visual Chain-of-Thought (CoT) dataset featuring detailed reasoning trajectories to guide the model via Supervised Fine-Tuning (SFT) and RL. To address the varying quality of learning signals encountered during training, we further propose an Adaptive Importance Weighting strategy based on Group Variance, which dynamically adjusts reward signals to optimize the learning process. Experimental results demonstrate that PointVG-R achieves SOTA performance, outperforming the baseline by 15.86 points in mIoU. Extensive ablation studies further validate the efficacy of our proposed modules. Code: this https URL.
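For readers unfamiliar with the reported metric, the following is a minimal sketch of mean IoU over axis-aligned boxes. It assumes boxes given as `(x1, y1, x2, y2)` tuples; the abstract does not say whether the paper's mIoU is computed over boxes or masks, so the function names and the box format are illustrative assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(predictions, ground_truths):
    """Average IoU over paired predicted and ground-truth boxes."""
    scores = [box_iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)

# Two boxes of area 4 overlapping on an area of 2: IoU = 2 / 6.
example = box_iou((0, 0, 2, 2), (1, 0, 3, 2))
```

A gain of 15.86 points on this scale corresponds to an absolute increase of 0.1586 in the averaged score.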
[CV-34] ForensicsTok: Forensics-Guided Tokenized Modeling for Image Tampering Localization
Link: https://arxiv.org/abs/2606.24538
Authors: Lei Xu,Haowei Wang,Shen Chen,Taiping Yao,Bin Li,Changsheng Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 4 figures, 8 tables
Abstract:Multi-modal Large Language Models (MLLMs) offer powerful reasoning for forensic tasks, yet existing approaches utilizing exogenous segmentation decoders often suffer from suboptimal localization. The reliance on stitched pipelines introduces information bottlenecks during backpropagation, which dilutes spatial signals and is limited by semantic priors of the segmentor. To address these limitations, we propose ForensicsTok, which reformulates image manipulation localization as an autoregressive sequence generation task. ForensicsTok directly generates spatially grounded token sequences, enabling precise mask prediction without intermediary supervision. Specifically, we introduce a Token Splatting Decoder (TSD) to map tokens to binary masks via codebook-aware code smoothing, which mitigates sharp gradients from deterministic detokenizers. Furthermore, to capture diverse tampering clues, we propose a Hierarchical Expert Fusion (HEF) module that injects multi-scale features from a forensic expert model. This unified architecture effectively compensates for the lack of forensic priors in standard MLLMs. Extensive experiments on six benchmarks show that ForensicsTok substantially improves over existing MLLM-based baselines and slightly improves over strong forensic expert baselines, while exhibiting stronger robustness to perturbations.
[CV-35] VisCritic: Visual State Comparison as Process Reward for GUI Agents ECCV2026
Link: https://arxiv.org/abs/2606.24525
Authors: Jiachen Qian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures; ECCV 2026 submission; supplementary material uploaded as ancillary file
Abstract:GUI agents powered by vision-language models show strong potential for automating digital tasks, yet frequently fail in long-horizon scenarios due to the absence of step-level verification. Existing process reward models verify actions through textual reasoning alone, missing the visual nature of GUI state changes. We introduce VisCritic, a visual process reward framework that verifies agent actions by directly comparing pre-action and post-action screenshots in visual feature space. VisCritic employs a Siamese vision transformer to extract change-aware representations, coupled with an Action-Aware Critic Head that jointly evaluates action success, task progress, and error type. A critic-training data construction pipeline generates weakly supervised samples from existing trajectories without additional human labels for critic training. Experiments and offline analyses across five benchmarks demonstrate that VisCritic serves as a plug-and-play enhancement for diverse GUI agents, generally improving benchmark metrics while providing visual diagnostic cues.
[CV-36] What Do Flow-Based Inverse Solvers Approximate? A Posterior-Transport View
Link: https://arxiv.org/abs/2606.24516
Authors: Jian Xu,Delu Zeng,John Paisley,Qibin Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:A growing family of training-free solvers (FlowDPS, FLOWER, PnP-Flow, and their diffusion ancestors DPS and DAPS) repurpose a pretrained flow-matching prior to solve imaging inverse problems by adding a measurement-guidance term to the deterministic probability-flow ODE. Despite strong empirical results, what these per-step corrections actually approximate, and how far the resulting samples are from the true posterior p(x | y), has not been characterized. We give a posterior-transport account of flow-based inverse problem solving. Our starting point is a simple but consequential fact: for a deterministic flow prior, Bayesian conditioning is realized entirely by a reweighting of the source distribution, not by a drift correction; pushing the reweighted source through the unmodified velocity field yields exact posterior samples. From this we show that trajectory-guidance solvers can be read as the minimum-kinetic-energy correction field needed to morph the unconditional source into the posterior, and that FlowDPS / FLOWER / PnP-Flow correspond to distinct zeroth-order / Gaussian / proximal approximations of this single object; we bound the resulting posterior bias in Wasserstein distance. A controlled 2D study with a closed-form posterior confirms the theory decisively: source reweighting matches the true posterior to the Monte-Carlo floor on every metric, whereas trajectory guidance incurs 200-800× larger error and collapses posterior modes, regardless of guidance strength. Guided by the analysis we propose a cheap, principled velocity-correction solver that is competitive across two in-domain priors (AFHQ, CelebA) and two out-of-distribution settings while, unlike point-estimate source-space optimizers, producing diverse posterior samples with uncertainty that correlates with reconstruction error.
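The source-reweighting statement can be verified on a one-dimensional toy problem. The sketch below assumes an affine map `flow_map` as a stand-in for the learned deterministic flow, a Gaussian source, and a Gaussian measurement model; none of these specifics come from the paper. Source samples are weighted by the likelihood of the measurement at their pushed-forward location, and the weighted samples are compared with the closed-form Gaussian posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: prior x = m + s * z with z ~ N(0, 1), measurement y = x + N(0, tau^2).
m, s, tau, y = 1.0, 2.0, 1.0, 3.0

def flow_map(z):
    """Stand-in for integrating the unmodified velocity field from source to data."""
    return m + s * z

z = rng.standard_normal(200_000)          # samples from the source distribution
x = flow_map(z)                           # pushed through the unmodified flow

# Reweight the source by the likelihood p(y | flow_map(z)); self-normalize.
log_w = -0.5 * ((y - x) / tau) ** 2
w = np.exp(log_w - log_w.max())
w /= w.sum()

post_mean = np.sum(w * x)
post_var = np.sum(w * (x - post_mean) ** 2)

# Closed-form Gaussian posterior for comparison (mean 2.6, variance 0.8).
exact_var = 1.0 / (1.0 / s**2 + 1.0 / tau**2)
exact_mean = exact_var * (m / s**2 + y / tau**2)
```

No drift correction appears anywhere: conditioning enters only through the weights `w` on the source samples, which is the fact the abstract takes as its starting point. In high dimensions plain importance weights degenerate, so this toy should not be read as a practical solver.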
[CV-37] GeoIMO: Geometry-Driven Independent Motion Classification for Event Cameras
Link: https://arxiv.org/abs/2606.24499
Authors: Anil Bayram Gogebakan,Filippo Marostica,Alessio Caviglia,Alessandro Savino,Stefano Di Carlo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing automotive event datasets rely on appearance-based annotations from frame pipelines, making them poorly suited for motion-aware event perception. We present a geometry-driven, annotation-free framework that classifies detected objects as static or independently moving by exploiting ego-motion structure directly from the event stream. A Focus of Expansion model with yaw compensation estimates global background motion, while objects are labeled as moving when local motion deviates from this prediction, as quantified by a scale-invariant residual. Temporal stabilization improves robustness across consecutive event windows. The method requires no learning, no manual motion labels, and works with any input bounding boxes. Experiments on MVSEC and the Prophesee 1 Megapixel Automotive Detection dataset demonstrate consistent performance across diverse driving scenarios, with yaw compensation improving results during turns and a simple translational local model offering a favorable accuracy-efficiency trade-off.
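The Focus of Expansion test can be sketched as follows. Under forward translation, background flow at a pixel points radially away from the FOE, so an object whose local motion deviates from that direction is a candidate for independent motion. The residual definition (one minus cosine similarity), the uniform horizontal `yaw_shift` used as a crude yaw compensation, and the threshold are assumptions for illustration; the abstract does not give the paper's exact formulas.

```python
import numpy as np

def foe_residual(flow, pixel, foe, yaw_shift=0.0, eps=1e-9):
    """Scale-invariant deviation of observed flow from the radial FOE direction.

    Returns 0 when the (yaw-compensated) flow points straight away from the FOE,
    1 when it is perpendicular, and 2 when it points toward the FOE.
    """
    compensated = np.asarray(flow, dtype=float) - np.array([yaw_shift, 0.0])
    radial = np.asarray(pixel, dtype=float) - np.asarray(foe, dtype=float)
    denom = np.linalg.norm(compensated) * np.linalg.norm(radial) + eps
    return 1.0 - float(compensated @ radial) / denom

def is_moving(flow, pixel, foe, yaw_shift=0.0, threshold=0.3):
    """Label an object as independently moving if its residual is large."""
    return foe_residual(flow, pixel, foe, yaw_shift) > threshold

foe = (0.0, 0.0)
static_like = foe_residual((2.0, 0.0), (10.0, 0.0), foe)   # flow along the radial direction
crossing = foe_residual((0.0, 3.0), (10.0, 0.0), foe)      # flow perpendicular to it
```

Because only the angle between vectors is used, multiplying the flow by any positive factor leaves the residual unchanged, which is one simple way to obtain the scale invariance the abstract mentions.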
[CV-38] VistaRef: Boosting Visual Spatial Orientation Awareness for Pointing-to-Object Detection
Link: https://arxiv.org/abs/2606.24498
Authors: Ling Li,Zhizhen Cai,Xinkun Wu,Ziyu Zhu,Jiaqing Lyu,Bowen Liu,Zhidong Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Grounding deictic gestures in natural images is fundamental to AR and human-robot collaboration, providing a basis for seamless spatial interaction. While Transformer-based visual models have achieved significant progress in general object detection, their global attention mechanisms often neglect micro-geometric relationships, degrading orientation accuracy. In pointing tasks, this deficiency manifests as an inability to accurately capture the pointing ray implied by finger poses, which results in pointing drift and localization ambiguity when dealing with distant or densely packed objects. To address this, we propose VistaRef, a framework designed to explicitly enhance spatial orientation awareness. First, we develop the Local Hand Entity Modeling (LHEM) module, which incorporates hand-pose embeddings to strengthen the model’s capability to capture subtle finger deviations. Second, drawing inspiration from multi-view geometry, we construct the Geometric Ray Modeling (GRM) module to transform implicit orientation information into explicit spatial geometric features, guiding feature aggregation and deep fusion via attention mechanisms. Furthermore, we introduce a novel Orientation-Consistent Alignment Loss (OCAL) to synergistically supervise hand presence and pointing consistency, ensuring that all architectural improvements collectively serve the core objective of spatial localization. Experimental results demonstrate that VistaRef significantly outperforms the baseline, achieving a 14-point absolute gain in grounding accuracy. Qualitative analysis further confirms that VistaRef effectively models the geometric correlation from hand to target, bridging the spatial perception gap inherent in traditional Transformers for complex scenarios. Code: this https URL.
[CV-39] RetiSEM: Generalising Causal Models for Fragmented Biomedical Data
Link: https://arxiv.org/abs/2606.24488
Authors: Inam Ullah,Imran Razzak,Shoaib Jameel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:
Abstract:Learning causal models from fragmented biomedical data is challenging because clinical, molecular, and imaging variables are often incomplete or not jointly observed. We propose RetiSEM, a domain-constrained structural equation modelling (SEM) framework for causal graph recovery and mediation analysis under limited multimodal resources. This proposed work organises variables into biologically informed blocks, applies forbidden-edge constraints, and decomposes pathway-level effects into TE, NDE, and NIE components. We evaluate RetiSEM across ten synthetic benchmark scenarios that vary in dimensionality, nonlinearity, causal depth, and pathway structure, together with a fragmented real-world setting that combines NHANES clinical variables with externally derived retinal representations. This approach achieves lower structural error and higher causal accuracy than unconstrained baselines across the synthetic benchmarks. In the real-data analysis, retinal variables behave mainly as downstream biomarker-like indicators, with smaller but detectable indirect effects. These findings support our strategy as an interpretable framework for testing structured causal hypotheses in limited-resource biomedical AI. The code and resources for this work are publicly available at: this https URL.
[CV-40] Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods ECCV2026
Link: https://arxiv.org/abs/2606.24484
Authors: Xingsong Ye,Yongkun Du,Jiaxin Zhang,Haojie Zhang,Chong Sun,Chen Li,Jing Lyu,Zhineng Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. Thus, we aim to advance this task from both data and model perspectives. On the data side, we construct a 2M synthetic dataset, WATER-S, with the scale improved by hundreds of times compared to existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at this https URL.
[CV-41] MambaRaw: Selective State Space Modeling for Efficient 4K Raw Image Reconstruction ECCV2026
Link: https://arxiv.org/abs/2606.24479
Authors: Peize Li,Fanhu Zeng,Tongda Xu,Xingguo Xu,Xinjie Zhang,Xingtong Ge,Haotian Zhang,Yan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:In-camera JPEG previews are ubiquitous in raw image formats and provide an sRGB reference at negligible storage cost. Although existing metadata-based reconstruction frameworks can exploit this side information when recovering raw images, their context models often become computationally expensive, especially at high resolution (e.g., 4K raw images), given that attention mechanisms scale quadratically with feature maps, hindering their practical application. To address these limitations, we propose MambaRaw, a JPEG-conditioned metadata-based raw image reconstruction framework that uses State Space Models (SSMs) to estimate entropy parameters efficiently. Our key contribution comprises a Spatial-Energy Coupled Context Modeling mechanism with two lightweight modules: (1) TileMambaBlock, which performs Mamba-style selective scanning only on information-dense tiles to improve efficiency; and (2) Energy-Aware Refinement (EAR), an identity-initialized residual module that enhances feature representation to match the long-tail energy distribution of raw signals. Extensive experiments on three camera datasets (Sony, Olympus, Samsung) show consistent improvements over strong metadata-based baselines and set a new state of the art for JPEG-guided raw reconstruction with great efficiency. Notably, at low metadata bitrates, MambaRaw increases PSNR by 1.2–1.4 dB and reduces end-to-end coding latency by about 9%. Code is released at this https URL.
[CV-42] video-SALMONN-R3: Learning to ReWatch ReAsk and ReAnswer for Efficient Video Understanding
Link: https://arxiv.org/abs/2606.24477
Authors: Yixuan Li,Guangzhi Sun,Yudong Yang,Wei Li,Zejun MA,Chao Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:
Abstract:Video large language models (LLMs) are often constrained by computation and memory budgets, leading them to use reduced frame rates and spatial resolutions, which may cause them to miss critical information for question answering (QA). A practical and efficient solution is a two-stage paradigm: first perform coarse video understanding to localize relevant segments, and then re-watch these segments at higher temporal or spatial fidelity. In this paper, we present video-SALMONN-R3, the first end-to-end video-LLM that enables re-watch through reinforcement learning without relying on chain-of-thought (CoT) cold-start. This design removes the need for costly CoT data annotations and avoids CoT-based supervised fine-tuning (SFT), which can otherwise degrade the pretrained video understanding abilities. To address the mismatch between the reasoning-first behavior induced by re-watch and the answer-first tendency of pretrained video-LLMs, we propose a re-answer strategy, in which the model first produces a direct answer in the first watch and then refines it after re-watching. Finally, to improve question adherence during re-watching, we propose a re-ask mechanism that re-injects the query when revisiting localized segments. Experimental results show that video-SALMONN-R3 consistently outperforms both the base model and the QA-SFT baseline, while surpassing prior re-watch-based approaches with significantly lower computational cost. Code, models, and data will be publicly released upon acceptance.
[CV-43] Boosting Text-Driven Video Segmentation via Geometry-Aware Distillation ECCV2026
Link: https://arxiv.org/abs/2606.24464
Authors: Tianyu Zhu,Yingping Liang,Hesong Li,Ying Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV2026
Abstract:Text-driven Referring Video Object Segmentation (RVOS) aims to locate and segment target objects in videos given natural language. However, existing models are typically trained on 2D image or video datasets with naive segmentation losses, which overlooks the geometric consistency across frames and leads to weak spatial understanding. In this paper, we propose Geometry-enhanced Language-guided Video segmentation (GeoLaV), a two-stage framework that distills 3D geometric knowledge from images to enhance text-driven video segmentation. In the first stage, we perform monocular geometry pretraining with monocular novel-view synthesis, enabling the model to acquire geometry-consistent visual representations via spatial alignment on large-scale single-image datasets. In the second stage, we introduce geometry-aware distillation and fine-tune the model on video segmentation datasets, transferring 3D structural knowledge from a general 3D prior model. This process reinforces 3D awareness and improves both spatiotemporal coherence and language grounding in segmentation. Extensive experiments show that our method using only image segmentation data already provides notable zero-shot generalization in RVOS. When combined with geometry-aware distillation for fine-tuning on videos, our method achieves state-of-the-art performance across multiple RVOS benchmarks. The code is available at this https URL.
[CV-44] Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching
Link: https://arxiv.org/abs/2606.24457
Authors: Junpeng Jing,Ronglai Zuo,Zhelun Shen,Shangchen Zhou,Rolandos Alexandros Potamias,Stefanos Zafeiriou,Krystian Mikolajczyk,Jiankang Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in stereo matching have achieved remarkable accuracy, but often rely on large models, heavy computation, or additional foundation-model priors, making them difficult to deploy on resource-constrained platforms. In contrast, efficient stereo models offer faster inference but are commonly considered less capable of strong zero-shot generalization. In this paper, we challenge this assumption by introducing Lite Any Stereo V2 (LAS2), an ultra-fast model series designed for efficient zero-shot stereo matching. LAS2 is developed from both architecture and training perspectives. Architecturally, we revisit efficient stereo design under practical deployment settings and propose a 2D-only cost aggregation framework, optimized for real inference latency rather than theoretical MACs alone. For training, we develop a three-stage strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation. To improve the reliability of real-world pseudo supervision, we further introduce pseudo-label filtering and an error-clamping operation, enabling smoother synthetic-to-real transfer. We instantiate LAS2 as a family of models, including feed-forward variants for different efficiency budgets and an iterative variant for higher accuracy. Extensive experiments show that LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency. Specifically, LAS2-H achieves stronger overall zero-shot performance than the iterative method Fast-FoundationStereo, with 1.8x and 2.7x faster inference on H200 and Orin, respectively. The project page, demos, and code are available at this https URL.
[CV-45] SENTRY: SAM2-Enhanced Neighbor-Aware and Temporally Reasoned Memory for Visual Tracking ECCV2026
Link: https://arxiv.org/abs/2606.24449
Authors: Mohamad Alansari,Yonathan Michael,Hasan AlMarzouqi,Muzammal Naseer,Naoufel Werghi,Sajid Javed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the European Conference on Computer Vision (ECCV 2026)
Abstract:We revisit the memory update mechanism in SAM2-based visual object tracking and identify confidence-only mask selection as the dominant cause of drift under occlusion, rapid motion, and distractors. We introduce SENTRY, a training-free, plug-and-play, refine-before-write module that validates each memory update for short-horizon temporal consistency before committing it. SENTRY aggregates diverse segmentation hypotheses per frame, backtracks them into short tracklets, and uses neighbor-aware cycle-consistent matching against recent trajectories to favor temporally and geometrically consistent masks. It leaves the base architecture untouched, replacing confidence-driven writes with consistency-validated ones. For fair evaluation, we re-evaluate major open-source SAM2-based trackers across all available scales and datasets, filling gaps in prior reports. Integrated into five strong baselines, SENTRY delivers consistent gains across nine benchmarks, achieving new zero-shot SOTA on LaSOT, LaSOT_ext, GOT-10k, VOT20, VOT22, and DiDi. Despite these checks, the SAM2-L version runs at 32.8 FPS on an A100, and across compatible hosts adds only about 0.4–0.6 GB VRAM. Our results provide the first unified all-scale evaluation of SAM2-based trackers and show that enforcing temporal validity at write time stabilizes memory-augmented tracking without retraining.
[CV-46] P-MTP: Efficient Document Parsing via Multi-Token Prediction with Progressive Depth Scaling
Link: https://arxiv.org/abs/2606.24447
Authors: Le Xiang,Chenxi Zhai,Shu Wei,Jingjing Wu,Qunyi Xie,Xiao Tan,Kunbin Chen,Wei He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-Language Models (VLMs) have revolutionized document parsing by enabling end-to-end mapping from images to structured text, yet they impose a significant latency bottleneck, particularly for token-dense documents. While Multi-Token Prediction (MTP) has emerged as a promising approach for accelerating inference, its potential is constrained by optimization instability when scaling to deeper look-ahead depth. In this paper, we propose P-MTP, a framework that leverages Progressive Multi-Token Prediction with a lightweight MTP module to scale the look-ahead depth for high-throughput document parsing. Specifically, we introduce Progressive Curriculum Loss that adaptively re-weights different look-ahead depths using cumulative path reliability and retrospective target consistency. By effectively suppressing gradient noise in long-range predictions, P-MTP facilitates an automated easy-to-hard optimization transition, enabling the model to master increasingly distant look-ahead depths. Furthermore, we propose Confidence-Gated Dynamic Drafting to maximize the effective look-ahead depth and acceptance rate by adaptively calibrating speculative length during inference, thereby minimizing computational waste and further pushing the boundaries of inference speedup. Experimental results across multiple benchmarks and architectures demonstrate that P-MTP achieves up to a 5× speedup with negligible loss in accuracy, providing the first successful validation of extensive look-ahead MTP in the document parsing domain.
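One plausible reading of confidence-gated drafting is sketched below: keep drafting look-ahead tokens while the cumulative confidence of the drafted path stays above a threshold, then accept the longest prefix the main model agrees with. The gating rule (a running product of per-token confidences), the threshold value, and the function names are assumptions made for illustration; the abstract does not specify how P-MTP calibrates the speculative length.

```python
def draft_length(confidences, threshold=0.7):
    """Number of leading draft tokens whose cumulative confidence
    (running product of per-token probabilities) stays >= threshold."""
    cumulative = 1.0
    length = 0
    for c in confidences:
        cumulative *= c
        if cumulative < threshold:
            break
        length += 1
    return length

def accepted_prefix(draft, verified):
    """Length of the longest prefix on which draft and verifier tokens agree."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n

# Per-depth confidences from a hypothetical MTP head.
confidences = [0.9, 0.9, 0.5, 0.99]
k = draft_length(confidences)            # cumulative: 0.9, 0.81, 0.405 -> stop after 2
n_ok = accepted_prefix([11, 12, 13, 14][:k], [11, 12, 99, 14])
```

Gating on the cumulative value rather than on each token separately means a single uncertain position stops the draft even if later positions look confident, which avoids verifying tokens that depend on a likely-wrong prefix.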
[CV-47] S1-Omni-Image: A Unified Model for Scientific Image Understanding Generation and Editing
Link: https://arxiv.org/abs/2606.24441
Authors: Qingxiao Li,Zikai Wang,Qingli Wang,Nan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 15 figures
Abstract:We present S1-Omni-Image, an open-weight unified multimodal model for scientific image understanding, generation, and editing. Unlike general-purpose image generation models, scientific image tasks require not only high-fidelity synthesis, but also robust understanding of scientific semantics, structural relations, domain knowledge, and task intent. To this end, S1-Omni-Image builds on the scientific multimodal reasoning backbone S1-VL-32B and couples its understanding capability with an image generation module under a unified think-before-generate paradigm. Given a user instruction, the model first produces a task-oriented reasoning trace, a textual answer, and a task special token; their hidden states are then injected into the generation module to condition image generation or editing. S1-Omni-Image supports scientific image understanding, generation, and editing in a unified framework. For generation, it focuses on scientific illustrations and text rendering, including logical diagrams, relational comparisons, data charts, and realistic scientific visualizations. For editing, it casts segmentation and other domain-specific vision tasks as native image editing problems, enabling multi-turn illustration editing, medical and geographic image segmentation, medical image translation, and scientific image super-resolution. We construct SciGenEdit, a 314K-sample training dataset, and release the model weights, inference code, and SciGenEdit-10K. Experiments show that S1-Omni-Image substantially improves scientific image generation and editing while preserving the scientific image understanding capability inherited from S1-VL-32B. It outperforms open-source models on GenExam and TechImage-Bench, achieves state-of-the-art results on four editing benchmarks including MSD, cigRockSEM, SynthRAD2025, and IXI, and maintains stable performance on scientific image understanding evaluations.
[CV-48] MedPCFM: Improving Medical Point Cloud Completion by Integrating Point Transformers and Flow Matching
Link: https://arxiv.org/abs/2606.24433
Authors: Kamil Kwarciak,Marek Wodzinski
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 9 figures
Abstract:Medical point cloud completion is important for anatomical reconstruction and downstream clinical workflows, yet generative modeling in this setting remains insufficiently studied. We investigate completion through continuous-time generative modeling and introduce PCFM, a PTv3-backed flow matching approach for medical point cloud completion. We evaluate on SkullFix and SkullBreak, and additionally on the more recent Mandibular Defect dataset. We build strong baselines by adapting PTv3 to a deterministic encoder-decoder completion model and by instantiating diffusion completion (PCDiff) with both PVCNN and PTv3 denoisers. PCFM with PTv3 is competitive with the deterministic PTv3 baseline and achieves state-of-the-art generative performance across datasets, while requiring substantially fewer sampling steps than diffusion. At the best operating points, PTv3 also yields clear throughput gains, providing up to a 7× speed-up for PCFM compared to a PVCNN backbone. Finally, we study empirical scaling trends by varying model size and point cardinality, showing consistent gains with higher point resolution and informative trade-offs across model scales.
[CV-49] Transformation Behavior of Images in Latent Space
Link: https://arxiv.org/abs/2606.24430
Authors: Christian Zöllner(1),Mozzam Motiwala(1),Aysel Ahadova(1),Gerrit Anders(4),Robert Hüneburg(2 and 3),Jacob Nattermann(2 and 3),Matthias Kloor(1) ((1) Department of Applied Tumor Biology Institute of Pathology Heidelberg University Hospital, (2) National Center for Hereditary Tumor Syndromes University Hospital Bonn, (3) Department of Internal Medicine I University Hospital Bonn, (4) Leibniz Institut für Wissensmedien)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Training of neural networks for histopathology classification tasks typically relies on data encoding into latent space, which reduces complexity and improves performance. There are several encoder networks available, either pretrained on general image datasets such as ImageNet, or specifically on histopathological images. Training of encoder networks should be adapted to downstream tasks, allowing encoding of biologic/diagnostic content while rendering networks invariant to label-irrelevant transformations. This paper investigates the effect of classical image transformation on the latent space, using networks provided by Lunit Inc. and Bioptimus, both focusing on pathological images, and by Meta Research Team. We assess variance of embeddings resulting from standard data transformations by comparing original and transformed image embeddings and by contrasting them with random, unrelated embeddings, using image tiles from hematoxylin/eosin-stained sections available in a colorectal tissue dataset and the publicly accessible TCGA dataset. Our findings show that embeddings of original and transformed images are closer to each other than to random embeddings, indicating robustness to transformations. However, they are not fully invariant, revealing that the encoder networks do not completely neutralize transformation effects in latent space, explaining why transformation-mediated augmentation of datasets can improve performance. Significant differences were observed between general and histopathology-specific encoder networks.
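The comparison described in this abstract (original vs. transformed embedding, contrasted with unrelated embeddings) can be sketched as follows. This is a minimal illustration, not the authors' protocol: cosine similarity as the distance measure, the function names, and the toy vectors are all assumptions, and real embeddings would come from a pretrained encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def invariance_report(original, transformed, unrelated):
    """Return (similarity to the transformed view, mean similarity to unrelated tiles).

    Perfect invariance would give 1.0 for the first value; robustness only
    requires the first value to exceed the second.
    """
    sim_transformed = cosine_similarity(original, transformed)
    sim_unrelated = sum(
        cosine_similarity(original, u) for u in unrelated
    ) / len(unrelated)
    return sim_transformed, sim_unrelated


if __name__ == "__main__":
    # Toy vectors standing in for encoder outputs (assumed values).
    original = [1.0, 0.0, 0.5]
    transformed = [0.9, 0.1, 0.5]   # e.g. the same tile after a rotation
    unrelated = [[0.0, 1.0, 0.0], [-1.0, 0.2, 0.3]]
    print(invariance_report(original, transformed, unrelated))
```

A similarity below 1.0 for the transformed view, yet above the unrelated baseline, corresponds to the "robust but not fully invariant" outcome reported in the abstract.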
[CV-50] EgoSAT: A Comprehensive Benchmark of Egocentric Streaming Interaction Understanding ECCV2026
Link: https://arxiv.org/abs/2606.24422
Authors: Yijia Lei,Jinzhao Li,Yichi Zhang,Jiacheng Hua,Yin Li,Miao Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Project page: this https URL
Abstract:We introduce EgoSAT, the first comprehensive benchmark for egocentric video reasoning in streaming settings, designed to evaluate the capabilities of modern vision-language models (VLMs). The benchmark targets streaming interaction understanding, where video frames arrive sequentially and models must continuously interpret evolving visual context. EgoSAT unifies several previously distinct tasks within a single streaming framework. In this formulation, queries about completed events correspond to retrospective reasoning, queries about ongoing activities require online understanding, and queries about future actions involve prospective anticipation. This unified setting requires models to reason about the past, present, and future while operating under the constraint that only previously observed frames are available. EgoSAT contains 1,997 unique videos spanning 165 hours of egocentric footage and around 4,800 high-quality question-answer pairs, carefully designed to probe reasoning across varying temporal contexts. Using this benchmark, we evaluate a diverse set of both open-weight and closed-weight VLMs, providing a systematic assessment of their ability for streaming interaction understanding. By distinguishing answerability and conducting diagnostics on confidence of models, we find existing models not only struggle with prospective and retrospective modeling, but also exhibit severe mis-calibration: confidence often fails to track inherent answerability, leading to dangerous “confidently wrong” behaviors. Project page: this https URL
[CV-51] Modality-Aware Out-of-Distribution Detection for Multi-Modal Action Recognition ECCV’26
Link: https://arxiv.org/abs/2606.24404
Authors: Lars Doorenbos,Duc Manh Vu,Serdar Ozsoy,Juergen Gall
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV '26
Abstract:The incorporation of additional modalities into action recognition models increases their performance across a wide range of settings. However, how this additional information can contribute to making the models more robust remains underexplored, particularly for the case of multi-modal out-of-distribution (OOD) detection. While methods exist that regularize the multi-modal training process with OOD detection in mind, they still apply off-the-shelf OOD detectors designed for the uni-modal case during inference, discarding important information. Based on an interesting relationship we find between the multi-modal and uni-modal predictions, we propose to use this signal to build a post-hoc detector explicitly designed for the multi-modal scenario. We combine this new source of information with a feature-space score, which detects off-manifold samples in the multi-modal space, and normalize them by the multi-modal logits. In doing so, the proposed hybrid detector is compatible with existing training-time approaches and consistently improves performance. Experiments on a wide range of established datasets from the MultiOOD benchmark show that, on average, our approach outperforms the state of the art. Our results show the importance of explicitly considering the different modalities at inference time for multi-modal OOD detection.
[CV-52] MATCH: Flow Matching for Multi-View Anomaly Detection ECCV2026
Link: https://arxiv.org/abs/2606.24375
Authors: Mathis Kruse,Melissa Schween,Bodo Rosenhahn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2026
Abstract:Detecting anomalies in industrial objects is an important topic for increasing production efficiency. More complex objects often require the analysis of several view points, which has led to the field of multi-view anomaly detection. We present MATCH, the first multi-view anomaly detection method based on Flow Matching (FM). With the ODE formulation of Flow Matching, we can estimate likelihoods and thereby derive an anomaly score to detect anomalies in multi-view image data at object, image, and pixel-level. The architectural flexibility of FM models allows us to efficiently transform features of different spatial sizes to the normal distribution. We evaluate thoroughly on the already established Real-IAD data set and are also the first to provide a comprehensive evaluation of popular anomaly detection methods for the MANTA-Tiny data set. MATCH achieves state-of-the-art performance in both anomaly detection and segmentation, all while running on consumer-level hardware. By omitting the costly divergence term needed for likelihood estimation, we ensure that MATCH is usable in real-time production scenarios. Lastly, several ablation studies are conducted to validate the methodological choices.
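The likelihood estimate mentioned in this abstract follows from the standard change-of-variables identity for continuous-time flows. The block below is a sketch under stated assumptions: features x enter at t = 0 and are transported to a standard normal at t = 1 by a velocity field v_θ, and the final score s(x) is one plausible reading of "omitting the divergence term", not a formula confirmed by the source.

```latex
\begin{align}
\frac{\mathrm{d}x(t)}{\mathrm{d}t} &= v_\theta\bigl(x(t), t\bigr),
  \qquad x(0) = x, \quad t \in [0, 1] \\
\log p_{\text{data}}(x) &= \log \mathcal{N}\bigl(x(1);\, 0, I\bigr)
  + \int_0^1 \nabla \cdot v_\theta\bigl(x(t), t\bigr)\,\mathrm{d}t \\
s(x) &= \tfrac{1}{2}\,\lVert x(1) \rVert_2^2
  \quad \text{(assumed score once the divergence integral is dropped)}
\end{align}
```

The divergence integral is the expensive part because it needs a Jacobian trace (or an estimator of it) at every ODE step, whereas the Gaussian term only needs the endpoint x(1).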
[CV-53] Structural Kolmogorov-Arnold Convolutions: Learnable Function on the Values or the Filter Shape as Parameter-Efficient Alternative to Per-Edge Convolutional KANs
Link: https://arxiv.org/abs/2606.24371
Authors: Stefano Mereu,Oleksandr Kuznetsov,Gabriele Marchello,Alessandro Galdelli,Emanuele Frontoni,Adriano Mancini,Ferdinando Cannella
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Convolutional Kolmogorov–Arnold Networks (KANs) replace the fixed weights of a convolutional kernel with learnable univariate functions. The dominant formulation attaches one such function to every kernel entry and lets it act on pixel values, expressive but parameter-heavy and prone to overfitting. We argue that the learnable functions are better placed in the structure of the convolution than on each edge, and we organise the design space along a single axis: whether the function acts on the pixel values or on the filter shape. We study three realisations. SV-KAN applies one shared univariate function to the values and leaves the spatial filter free and static: a classical convolution with a single learnable shared activation. AG-KAN keeps the shared value function but supplies the spatial structure through a content-adaptive Gaussian gate. RF-KAN instead moves the learnable functions onto the filter shape, building each filter from oriented ridge profiles expanded in a localised oscillatory (Morlet) wavelet basis with content-adaptive amplitudes. Under a matched four-layer protocol with in-run references and three seeds, RF-KAN and SV-KAN reach 88.47±0.10% and 88.20±0.31% on CIFAR-10 and 64.40±0.19% and 64.57±0.30% on CIFAR-100, at about 0.4 M parameters. At this matched scale the shape model and the simplest value model meet at the top, both above a plain convolution and every per-edge KAN we tested, including the official Gram variant, at roughly a fifth of the parameters. A controlled study attributes the RF-KAN gain to an intrinsically localised oscillatory basis and to content adaptivity, and an ablation that removes the learned shape entirely, leaving only the shared value function, collapses accuracy by over forty points, identifying the learned shape as the load-bearing ingredient at this scale.
[CV-54] SignNet-1M: Large-Scale Multilingual Sign Language Video Dataset with Downstream Benchmarks ECCV2026
Link: https://arxiv.org/abs/2606.24361
Authors: Zhewen He,Junyi Hu,Haomian Huang,Zhenhua Li,Yu-Shen Liu,Yi Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages. Accepted to ECCV 2026
Abstract:Sign language models are typically trained on datasets captured under constrained conditions, with limited viewpoint, background, and signer-identity diversity, leading to poor robustness under real-world distribution shifts. We introduce SignNet-1M, a large-scale augmented dataset spanning ASL, CSL, and German Sign Language (DGS). SignNet-1M synthesizes realistic variations along three axes: (i) novel-view rendering (rotation and zoom) via 3D Gaussian Splatting (3DGS), (ii) scene/identity editing via diffusion models for background replacement and signer substitution while preserving sign motion and linguistic content, and (iii) post-rendering augmentations that emulate capture and compression artifacts (e.g., pose/temporal perturbations and video-level corruptions) to better match in-the-wild recordings. Beyond data release, we provide a unified benchmark suite across downstream tasks (e.g., translation and recognition) and ablations that isolate each augmentation component. Experiments across backbones show that training with SignNet-1M consistently improves generalization under cross-view, cross-background, cross-identity, and post-rendering shifts, while maintaining strong in-distribution performance. The dataset, full augmentation pipeline, and benchmark are available at this https URL.
[CV-55] Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints ECCV2026
Link: https://arxiv.org/abs/2606.24353
Authors: Hojun Choi,Seulbin Hwang,Dae Jung Kim,Kisung Kim,Hyunjung Shim,Jinhan Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This paper has been accepted by ECCV 2026
Abstract:Bird’s-eye view (BEV) perception fuses multi-camera images into a unified top-down representation for autonomous driving. Despite recent progress, state-of-the-art methods remain confined to closed-set scenarios, making them vulnerable to unpredictable real-world environments. In this work, we introduce open-vocabulary BEV segmentation (OVBS), which leverages vision-language models (VLMs) to recognize categories beyond the training set while maintaining precise BEV perception and real-time efficiency. A key challenge in OVBS lies in the 3D geometric inconsistency inherent in the ill-posed lifting of 2D VLM semantics into BEV. To address this, we propose OVBEVSeg, a geometry-aware OVBS framework that enhances efficient Gaussian splatting (GS)-based unprojection by leveraging robust 3D geometric constraints across three progressive stages: (1) 2D-to-BEV pseudo-labeling via reliable 3D projection for OV generalization; (2) joint 2D-BEV per-scene optimization with BEV structural constraints for 3D geometric consistency; and (3) 3D geometric distillation for online efficiency. On the nuScenes dataset, OVBEVSeg achieves state-of-the-art performance, outperforming closed-set methods by 15.3 mIoU on unseen categories. Remarkably, even with no novel-class ground-truth labels, it remains competitive with self- and semi-supervised baselines trained with up to 40% of ground-truth annotations. Furthermore, it achieves 2.5x faster inference with only 0.22x the memory consumption of projection-based methods. Project page: this https URL.
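For readers unfamiliar with the mIoU figures quoted above, here is a minimal sketch of mean intersection-over-union on flattened label grids. It assumes one integer class label per BEV cell and skips classes absent from both prediction and ground truth; the paper's exact evaluation protocol may differ, and `mean_iou` is an illustrative name.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length lists of integer labels."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes that appear in neither grid
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # A 2x2 BEV grid flattened row by row; labels are toy values.
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print(mean_iou(pred, target, num_classes=2))
```

In the toy grid, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is their average. Reported gains such as "15.3 mIoU" are differences of this quantity expressed in percentage points.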
[CV-56] TIGER: Taming Identity Geometry and Generative Priors for High-Quality Face Video Restoration
Link: https://arxiv.org/abs/2606.24336
Authors: Yang Zhou,Wenxue Li,Peng Zhang,Yifei Chen,Fei Wang,Daiguo Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Face Video Restoration (FVR) aims to recover high-fidelity facial videos from degraded input while preserving identity and semantic consistency across frames. Existing methods often struggle to simultaneously address three key challenges: identity shift, viewpoint-entangled guidance, and perceptual realism. To tackle these issues, we propose TIGER, a structured tri-prior fusion framework that Tames Identity, Geometry, and gEnerative pRiors for high-quality FVR. Specifically, an Identity Prior is first established by injecting subject-discriminative embeddings into the latent space, effectively anchoring the subject’s identity against severe degradations. Then, to provide temporally consistent structural guidance for dynamic videos, TIGER constructs a Geometry Prior by lifting 2D reference cues into a disentangled 3D parameter space, creating a geometric anchor through cross-source parameter fusion. Moreover, to achieve maximum efficiency without compromising realism, we harness the video generation model’s Generative Prior through a one-step rectified flow. We further design a progressive three-stage training optimization strategy that refines structural fidelity, textural reconstruction, and distribution-level realism to ensure robust optimization. We also construct a large-scale FVR dataset to facilitate robust training and standardized evaluation. Extensive experiments demonstrate that TIGER achieves state-of-the-art performance in both identity fidelity and temporal stability, delivering a high-quality, efficient and identity-consistent FVR. Project page: this https URL.
[CV-57] Ill-Posed by Design: Probing Evidence Use in VLMs
Link: https://arxiv.org/abs/2606.24335
Authors: Boaz Meivar,Shaked Perek,Shani Shvartzman,Eli Schwartz,Shai Avidan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Counterfactual analysis is widely used to study evidence use in vision-language models, but its diagnostic value is limited on well-posed tasks: when several cues independently support the same answer, removing one may not change the prediction. We propose monocular metric object-size estimation as an ill-posed diagnostic setting for evidence selection: because physical size cannot be determined from a single uncalibrated image, models must rely on imperfect cues: category priors, target appearance, local context, apparent image size, and scene geometry. We assemble Metric VQA (10,813 dimension queries from Objectron and 331 tape-measured in-the-wild scenes) and evaluate 12 open-weight VLMs (3–397B parameters) with counterfactual analysis decomposing six visual and language evidence channels. Even the largest VLMs tested (Qwen3-VL-235B, Qwen3.5-397B, InternVL3.5-241B) trail a text-only frontier LLM on the in-the-wild split. The diagnostic analysis shows: target identity is the most load-bearing cue, target pixels and local context help only some models, apparent size shifts predictions without a directional readout, and global scene geometry is largely unused. We analyze LoRA fine-tuning as an actionable intervention specific to metric estimation: while the task is learnable, the models do not learn to leverage scene geometry.
[CV-58] UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation ECCV2026
Link: https://arxiv.org/abs/2606.24333
Authors: Jiahao Lyu,Pei Fu,Zhenhang Li,Shaojie Zhang,Jiahui Yang,Yu Zhou,Can Ma,Zhenbo Luo,Jian Luan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:In-Image Machine Translation (IIMT) aims to translate scene text in an image and render the translated text back into the original regions while preserving the overall visual appearance. Recent unified multimodal models provide a promising solution by combining visual-text understanding and image generation within a single framework. However, directly adapting such models to IIMT remains challenging. In particular, they often suffer from understanding-generation conflicts, where the translation inferred during understanding is inconsistent with the text supervision used in generation, and spatial position misalignment, where the rendered text does not accurately match the target text regions. To address these issues, we present UniTranslator, a unified multimodal framework for IIMT that tightly couples translation understanding and text editing. Specifically, we introduce an Understand-Generation Alignment Module (UGAM) to bridge the representation gap between understanding and generation, encouraging semantic consistency between translated content prediction and text rendering. We further propose a Spatial Mask Decoder (SMD) with pixel-level supervision over text regions to improve spatial grounding, geometric alignment, and layout controllability during generation. Extensive experiments on multiple benchmarks demonstrate that UniTranslator achieves state-of-the-art performance across diverse language directions and complex real-world layouts. Moreover, our results reveal a strong mutual reinforcement effect between translation understanding and image generation, highlighting the advantage of unified translation multimodal learning. Code is available at this https URL.
[CV-59] REDI-Match: Rotation-Equivariant Distillation for Efficient and Robust Dense Matching
Link: https://arxiv.org/abs/2606.24330
Authors: Yinji Ge,Guixu Zheng,Wulong Guo,Qian Feng,Xu Wu,Kai Zhou,Xinyuan Liu,Fei Xing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision Foundation Models (VFMs) have significantly advanced dense feature matching, yet severe in-plane rotation remains a critical challenge. Existing solutions face a fundamental dilemma: data-driven methods require inefficient parameter scaling to implicitly learn rotations, whereas strictly equivariant networks lack the semantic capacity of modern VFMs. Consequently, current frameworks typically freeze VFMs and shift the entire burden of rotation generalization to the downstream decoder. To break this architectural bottleneck, we propose REDI-Match, an efficient framework driven by a novel Rotation-Equivariant Distillation (REDI) paradigm. Instead of relying on rotation data augmentation to establish rotational correspondences, REDI distills the non-equivariant semantic representations of a VFM into a lightweight, strictly rotation-equivariant encoder, leveraging an equivariant geometric architecture to constrain robust high-dimensional semantics. To fully exploit these features, we equip the decoder with an entropy-driven spatial alignment module. By evaluating discrete rotation hypotheses, this mechanism explicitly locks onto the canonical coordinate system, eliminating global ambiguity before continuous refinement. Extensive experiments demonstrate that REDI-Match establishes a new state-of-the-art (SOTA) across multiple benchmarks. Notably, it achieves a 13.89% absolute pose accuracy improvement on the highly challenging SatAst dataset while operating 1.9x faster than the current SOTA (RoMa v2), enabling real-time inference (~41 FPS) on a single RTX 4090 GPU. Code: this https URL.
[CV-60] TrOCR for Medieval HTR: A Systematic Ablation Study with Cross-Dataset Validation ICDAR
Link: https://arxiv.org/abs/2606.24302
Authors: Sachin Sharma,Michele Flammini,Federico Simonetta
Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Comments: Accepted at Document Analysis Systems Workshop 2026 (ICDAR Satellite event)
Abstract:Fine-tuning transformer-based handwritten text recognition (HTR) models on medieval manuscripts is challenging because these models are pre-trained on modern text and must adapt to a very different visual domain. This paper studies how three controllable fine-tuning choices (contrast normalization, data augmentation, and layer freezing) affect recognition accuracy when adapting TrOCR to small historical datasets. We run controlled experiments on a 13th-century Italian manuscript (I-CT 91 “Cortonese”) and replicate the same experimental grid on the public READ-16 benchmark as robustness evidence. On Cortonese, our best configuration achieves 8.03% character error rate (CER). Statistical comparisons across 13 configurations show that freezing up to three encoder layers or six decoder layers does not significantly harm accuracy, while deeper freezing becomes progressively detrimental. Removing contrast normalization (CLAHE) yields 7.84% CER, comparable to a domain-specialized baseline, suggesting strong optimization can reduce reliance on image preprocessing. Cross-dataset validation on READ-16 shows that decoder freezing thresholds transfer more robustly than encoder thresholds, and combined freezing strategies require dataset-specific re-validation. Finally, we use Grad-CAM gradient attributions and decoder cross-attention maps to diagnose error patterns and failure modes revealed by the ablations. Source code is available at this https URL
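The character error rate (CER) figures above are conventionally computed as edit distance divided by reference length. The sketch below shows that standard definition. Corpus-level aggregation (summing edits and characters over all lines) is an assumption, since the abstract does not say how lines are pooled, and the function names are illustrative.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    # Toy transcription pair, not taken from either dataset.
    print(character_error_rate(["laude novella"], ["laude nouella"]))
```

Under this definition, a CER of 8.03% means roughly eight character edits per hundred reference characters.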
[CV-61] MM-TRELLIS: Point-Cloud Guided Multi-Modal 3D Vehicle Generation in Autonomous Driving
Link: https://arxiv.org/abs/2606.24301
Authors: Hongli Xiao,Youjian Zhang,Yucai Bai,Chaoyue Wang,Yaohui Jin,Xiaoguang Ren,Wenjing Yang,Long Lan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recovering realistic 3D vehicle models from autonomous driving scenes is crucial for synthesizing training data and building simulation environments. However, most existing vehicle generation methods fail to fully exploit multimodal sensors (i.e., multi-view images and LiDAR point clouds) and rely on neural rendering based reconstruction, leading to low-quality meshes. Recently, native 3D generative models have made significant progress, yet they are not built for arbitrary multi-view inputs and often struggle with in-the-wild driving images. In this work, we present MM-TRELLIS, a multi-modal version of TRELLIS for in-the-wild 3D vehicle generation that integrates LiDAR and image sensors from autonomous driving datasets into native 3D generative models. Specifically, multi-view images are cycled as conditioning inputs, while LiDAR point clouds provide test-time guidance to ensure geometric accuracy and cross-view consistency. During denoising, we first align the guidance point cloud with the model priors, then enforce consistency between the generated geometry and the guidance point cloud. Finally, we introduce a voxel filtering strategy based on the opacity of 3D Gaussian Splatting to suppress floaters and produce clean meshes. Comprehensive experiments on the Waymo dataset demonstrate that our method outperforms existing methods in high-fidelity 3D vehicle generation. Code is available at this https URL.
[CV-62] Training-free Cross-domain Few-shot Segmentation via Robust Semantic Representation and Matching ECCV2026
Link: https://arxiv.org/abs/2606.24297
Authors: Sujun Sun,Mingwu Ren,Haofeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:Cross-domain Few-shot Segmentation (CD-FSS) aims to transfer knowledge learned from source domain to distinct target domains, segmenting unseen target classes with only a few annotated samples. Although existing methods have made significant progress, they still rely on training or fine-tuning processes, which incur high computational costs and risk overfitting. We observe that when powerful and general-purpose vision foundation models are incorporated into these methods, their performance shows only marginal improvement or even degrades due to overfitting. To address this, we eliminate trainable parameters and propose a training-free framework to avoid both training overhead and overfitting. Built upon the self-supervised vision encoder DINOv3, our framework addresses cross-domain challenges through three core modules. First, the Semantic-aware Feature Re-fusion (SAFR) module identifies and re-fuses features that emphasize semantic patterns, generating representations with enhanced semantic discriminability. Additionally, the Adaptive Support Enhancement (ASE) module narrows semantic gaps between support and query through robust query information aggregation. Finally, the Hybrid Prototype Matching (HPM) module integrates matching results from diverse prototypes to adapt to varying semantic complexity across domains. Extensive experiments on four target domain datasets demonstrate that our method achieves state-of-the-art performance in CD-FSS without any training.
[CV-63] Hierarchical Spatial and Channel Aggregation for Cross-domain Few-shot Segmentation ECCV2026
Link: https://arxiv.org/abs/2606.24296
Authors: Sujun Sun,Mingwu Ren,Haofeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2026
Abstract:Cross-domain Few-shot Segmentation (CD-FSS) aims to learn generalizable segmentation capability from abundant annotated samples in the source domain, enabling accurate segmentation of novel classes in the target domain with only a few annotated samples. Existing CD-FSS methods mainly focus on mitigating feature distribution shifts caused by style gaps while ignoring significant differences in class semantic granularity and discriminative attributes across domains, leading to two key degradations in support-query matching: semantic over-alignment and attribute over-alignment. To this end, we propose the Dual Hierarchical Aggregation Network (DHANet), which comprises three key modules. First, the Hierarchical Spatial Aggregation (HSA) module performs multi-scale region aggregation of pixel features along the spatial dimension, generating hierarchical semantic-enhanced features to alleviate semantic over-alignment. Additionally, the Hierarchical Channel Aggregation (HCA) module conducts multi-scale attribute aggregation along the channel dimension, generating hierarchical attribute-enhanced features to mitigate attribute over-alignment. Finally, we propose the Online Probabilistic Semantic Bank (OPSB), which progressively constructs and updates class probability distributions from query predictions during inference, and samples multiple pseudo-prototypes as additional support information to mitigate insufficient support. Extensive experiments on four target-domain datasets demonstrate that our method achieves state-of-the-art performance.
[CV-64] ActiveScope: Actively Seeking and Correcting Perception for MLLMs ICML2026
Link: https://arxiv.org/abs/2606.24292
Authors: Yajing Wang,Chao Bi,Junshu Sun,Shufan Shen,Zhaobo Qi,Shuhui Wang,Qingming Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive vision-language understanding, yet still struggle with fine-grained perception in high-resolution images. While existing training-free methods typically rely on attention-based localization or coarse-to-fine search, they are often misled by distractors and fail to locate multiple targets. Our investigation attributes these failures to Contextual Dominance, where salient distractors overwhelm target attention and cause inaccurate localization, and Semantic Bias, where global semantics cause the model to fixate on the most salient concept, resulting in incomplete localization in multi-object scenarios. Built on these insights, we propose ActiveScope, a training-free framework that enhances MLLMs by actively seeking and correcting perception. ActiveScope features two modules. The Semantic Anchor Localization (SAL) utilizes fine-grained semantic anchors to independently localize key targets, thereby mitigating semantic bias. The Interference-Suppressed Refinement (ISR) refines localization by suppressing attention on salient distractions to overcome contextual dominance. Extensive experiments on high-resolution image understanding benchmarks demonstrate that ActiveScope outperforms existing training-free methods (e.g., 96.34 percent accuracy on V* Bench), validating the superiority of the active search and self-correction paradigm. Our code is available at this https URL.
[CV-65] UniRED: Unified RGB-D Video Frame Interpolation with Event Guidance
Link: https://arxiv.org/abs/2606.24282
Authors: Yinuo Zhang,Guangshun Wei,Yuanfeng Zhou,Yiran Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:High frame-rate RGB-D videos are crucial for a variety of downstream tasks, including motion analysis, dynamic scene understanding, and 3D reconstruction. However, due to hardware and sensing constraints, practical RGB-D cameras are typically limited to low frame rates, making it difficult to capture rapid scene dynamics. Existing video interpolation methods have achieved strong performance on RGB data, but they are not readily applicable to RGB-D scenarios, where they often yield blurry boundaries, visible artifacts, and degraded geometric consistency. Furthermore, motion estimation from only two boundary frames is inherently under-constrained in complex dynamic scenes. Event cameras, by contrast, provide asynchronous measurements with ultra-high temporal resolution, offering dense motion cues. In this paper, we propose a unified multimodal framework for RGB-D video interpolation that jointly exploits RGB appearance, depth geometry, and event-based temporal cues. Specifically, it first extracts and fuses RGB, depth and event cues, then estimates bidirectional flow with motion basis refinement for RGB and Z-axial refinement for depth, and finally synthesizes the target RGB-D frame via bidirectional warping and soft blending. In addition, we construct a new RGB-D-Event dataset to alleviate the scarcity of tri-modal training data. Extensive experiments on a public benchmark and the proposed dataset demonstrate that our method achieves superior photometric fidelity for RGB interpolation and stronger geometric accuracy for depth interpolation than existing approaches.
[CV-66] MotifGen: Spatiotemporal interpolation of misaligned satellite images via multi-source generative modeling in an application to tropical cyclones
Link: https://arxiv.org/abs/2606.24263
Authors: Clément Dauvilliers(Inria),Claire Monteleoni(Inria)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Microwave satellite imagery plays a crucial role in monitoring tropical cyclone precipitation and intensity worldwide, but suffers from long revisit times, potentially missing rapid storm evolution phases. While this raises the need for an interpolation method, it is made challenging by the high level of heterogeneity of microwave data coming from different instruments. In this work, we introduce the first generative model that can be applied to multiple geospatial sources that change across samples, occur at irregular time intervals, are misaligned geographically, and come from instruments with varying characteristics. We apply this model to the case of spatio-temporal interpolation of tropical cyclone microwave images from other microwave and infrared instruments. We train using a self-supervised task in which a random source is masked and reconstructed, and show that it leads to a significant decrease in Continuous Ranked Probability Score over supervised training. We show a further improvement by combining infrared and microwave data compared to microwave only. Using these improvements, the generative model produces an ensemble mean on par with that of a deterministic model, while generating a power spectrum significantly closer to that of true observations. To the best of our knowledge, this is the first generative model that interpolates microwave images of cyclones by combining multiple microwave instruments and infrared observations at irregular time intervals.
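The Continuous Ranked Probability Score (CRPS) mentioned above can be estimated from a finite ensemble with the standard energy form, CRPS ≈ mean|x_i − y| − ½ · mean|x_i − x_j|. The sketch below applies it to a single scalar such as one pixel. Averaging over pixels and samples, and the name `ensemble_crps`, are assumptions rather than details from the paper.

```python
def ensemble_crps(members, observation):
    """Ensemble estimate of the CRPS for one scalar observation.

    Lower is better. With a single member it reduces to the absolute error.
    """
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must contain at least one member")
    # Accuracy term: how far the members are from the observation.
    skill = sum(abs(x - observation) for x in members) / m
    # Spread term: mean absolute difference over all ordered member pairs.
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return skill - 0.5 * spread


if __name__ == "__main__":
    # Toy brightness-temperature ensemble for one pixel (assumed values).
    print(ensemble_crps([251.0, 254.5, 249.0, 252.0], observation=252.5))
```

The spread term is what separates this score from a plain mean absolute error: it makes the score depend on the whole predictive distribution rather than only on the ensemble mean.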
[CV-67] 3DCarGen: Scalable 3D Car Generation via 3D-consistent Multi-view Synthesis
Link: https://arxiv.org/abs/2606.24257
Authors: Hongli Xiao,Youjian Zhang,Yaohui Jin,Xiaoguang Ren,Wenjing Yang,Long Lan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:High-quality 3D vehicle assets are essential for autonomous driving simulation. Although multi-view diffusion-based paradigms enable controllable single-image reconstruction, they typically produce limited viewpoints and exhibit cross-view geometric inconsistencies, thereby reducing reconstruction fidelity in real-world scenarios. In this work, we introduce 3DCarGen, a scalable single-view 3D car generation framework designed for real-world images by synthesizing an arbitrary number of 3D-consistent multi-view images. Specifically, given a single image as input, we first synthesize a set of images from fixed viewpoints. These images are then fed into a feed-forward reconstruction model, resulting in a coarse 3D representation based on 3D Gaussian Splatting. Conditioned on this explicit 3D prior, our multi-view diffusion model generates 3D-consistent images from arbitrary camera viewpoints. We further extend a fast mesh reconstruction algorithm by incorporating color-normal joint optimization to recover detailed and coherent 3D vehicle models from the synthesized dense views. Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves robust geometric consistency and reconstruction fidelity compared to existing methods. Code and models will be released.
[CV-68] Trimming the Long-Tail of Visual World Modeling Evaluation
Link: https://arxiv.org/abs/2606.24256
Authors: Bingxuan Li,Yining Hong,Cheng Qian,Hyeonjeong Ha,Jiateng Liu,Zhenhailong Wang,Yue Guo,Yunzhu Li,Heng Ji
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.
[CV-69] Social Structure Matters in 3D Human-Human Interaction Generation
Link: https://arxiv.org/abs/2606.24255
Authors: Zhongju Wang,Beier Wang,Yatao Bian,Pichao Wang,Zhi Wang,Daoyi Dong,Hongdong Li,Huadong Mo,Zhenhong Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Although text-to-motion generation has achieved strong progress in synthesizing realistic single-person motions from language, extending it to text-driven 3D human-human interaction (HHI) remains non-trivial, as HHI requires modeling the underlying social structure that governs phase progression, actor roles, and inter-actor coordination. In this paper, we formulate HHI generation as a social structure modeling and grounding problem: the model must first infer how an interaction unfolds and how the two actors coordinate their roles, and then realize this structure as continuous, physically plausible, and partner-aware 3D motion. To study how such structure should be modeled, we first examine the capability boundary of large language models (LLMs) for HHI generation. Our analysis shows that LLMs can "think" by recovering phase decompositions and partner-aware roles, but cannot directly "move", as they fail to generate dynamic, physically plausible, and interaction-aware motion. This motivates our planner-executor paradigm, "Think with LLM, Move with Motion Skill". The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning them with motion sequence. The motion executor then grounds the planned social structure into coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. Together, our Solo-to-Social framework bridges social organization and motion realization, producing 3D HHI with improved phase consistency, role alignment, and partner-aware coordination.
[CV-70] TuringViT: Making SOTA Vision Transformers Accessible to All
Link: https://arxiv.org/abs/2606.24253
Authors: Qiman Wu,Hanlin Chen,Lyujie Chen,Rui Xin,Jianlei Zheng,Mingyuan Wang,Jiahui Hu,Da Zhu,Yuecheng Ma,Yuhua Wei,Yizhao Wang,Hua Zhou,Yuheng Zhang,Anhua Liu,Shaman Tang,Yue He,Pengfei Diao,Shuang Su,Haotong Xin,Weichao Huang,Hang Zhang,Xianming Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Modern VLMs and VLA systems commonly adopt off-the-shelf ViTs such as SigLIP2 as visual encoders, but diverse downstream requirements in latency, temporal modeling, and VLM integration often call for customized SOTA-level ViTs. Training such encoders remains beyond the reach of much of the community, as it requires massive image-text data, while standard softmax attention makes high-resolution or dynamic-resolution pretraining prohibitively costly and often forces low-resolution pretraining followed by post-hoc adaptation. TuringViT addresses these challenges with three key designs: Turing Linear Attention (TLA) for efficient sequence modeling, VISTA-Curation to construct supervision-rich image-video training data, and native dynamic-resolution pretraining that supports flexible inputs from the start and transfers seamlessly to downstream VLMs. As a result, TuringViT outperforms leading open-source ViT baselines with only 10% of the data, achieves stronger downstream VLM performance, and delivers substantially better latency scaling on high-resolution inputs. Our scaling-law analysis further shows that TuringViT continues to improve predictably with curated data scale, far from saturation. Its fast adaptation, hardware-friendly design, and efficient deployment have made it a unified visual foundation across XPeng’s AI systems. More broadly, TuringViT provides a reproducible pipeline that dramatically lowers the cost for the community to train, customize, and deploy SOTA-level ViTs, moving toward making such Vision Transformers accessible to all.
[CV-71] M2C-EvDet: Multi-Domain Multi-Order Cross-Modal Knowledge Distillation for Event-based Object Detection
Link: https://arxiv.org/abs/2606.24248
Authors: Wei Bao,Siqi Li,Shouan Pan,Yi Xie,Yue Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Event-based object Detection (EvDet), as a biologically inspired visual perception paradigm, demonstrates superior performance in scenarios demanding high temporal resolution and a wide dynamic range. Nevertheless, the inherent sparse representations and inadequate visual semantics of event data result in a considerable performance disparity between EvDet and frame-based object detection. Previous works attempt to alleviate this cross-modal discrepancy through knowledge distillation, yet they only focus on spatial visual semantics or pair-wise relational information, thus limiting performance in more complex scenarios. To address this challenge, this paper proposes M²C-EvDet, a Multi-domain and Multi-order Cross-modal knowledge distillation framework for EvDet. Built upon frequency learning and hypergraph computation, M²C-EvDet integrates two specialized modules: Adaptive Frequency-Decoupled Feature Distillation (AF²D²) and Multi-Order Relational Distillation (MORD).
[CV-72] From Open Waters to Enclosed Cabins: ProteusVPR for Cross-Scene Visual Place Recognition in Maritime Perception and Cabin Inspection
Link: https://arxiv.org/abs/2606.24234
Authors: Zexi Chena,Zitai Huang,Qiwen Gu,Zhiqi Li,Shengli Dong,Chenlei Wang,Junqiao Zhao,Hongdong Wang,Bing Han
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Autonomous robotic inspection in maritime environments presents unique challenges for Visual Place Recognition (VPR) due to cross-scene perceptual shifts. Robots navigating ship-borne environments must transition between visually distinct domains: open decks with sparse textures and severe illumination changes, and enclosed cabins with repetitive structures and high visual ambiguity. Existing VPR methods, designed primarily for urban or indoor scenes, fail to generalize reliably across these starkly different scenarios. To address this, we propose ProteusVPR, a two-stage retrieval-refinement framework. The first stage employs any standard VPR model for initial image retrieval. The second stage introduces a geometric-visual estimation network that fuses the retrieved image with two temporally preceding frames, incorporating geometric descriptors, a local affine coordinate system, and camera azimuth encoding to achieve precise localization. To support this task, we introduce the XHZ dataset, an 8K-panoramic ship-borne dataset collected from an operational vessel, featuring multi-floor cabin structures, deck transition zones, and strict query-database separation for rigorous evaluation. Extensive experiments on the XHZ dataset demonstrate that ProteusVPR consistently improves localization accuracy across multiple VPR backbones, reducing mean localization error by over 60% on average, and that it offers an effective and robust solution for precise visual localization in challenging, cross-scene maritime environments.
[CV-73] Latent Visual States for Efficient Multimodal Reasoning
Link: https://arxiv.org/abs/2606.24233
Authors: Xiuwei Chen,Wentao Hu,Yongxin Wang,Zisheng Chen,Likui Zhang,Kun Xiang,Jianhua Han,Hui-Ling Zhen,Jingyuan Zou,Hang Xu,Xiaodan Liang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The integration of visual evidence has significantly enhanced the capabilities of large multimodal models. However, this integration predominantly relies on generating discrete outputs (e.g., code or box coordinates) to invoke external tools, a process that introduces rigid dependencies and substantial latency. To overcome these limitations, we propose EVA (LatEnt Visual StAtes), a novel framework that natively generates continuous latent visual representations. These internal representations manifest as an adaptive sequence of Latent_slot tokens, serving as intermediate visual thoughts during the reasoning process. These Latent_slot tokens are then trained end-to-end with the discrete text tokens. This co-optimization, notably, causes extreme policy deviation in the ‘transition window’ following the Latent_slot tokens. We develop D-GSPO (Decouple-GSPO) to target this root cause by decoupling the optimization of latent and discrete components. To support SFT, we construct EVA-230K, a high-quality text-image interleaved CoT dataset encompassing a diverse range of real-world scenes, documents, charts and OCR tasks. Extensive experiments across multiple benchmarks confirm that EVA achieves significant performance gains while enhancing inference efficiency.
[CV-74] FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image
Link: https://arxiv.org/abs/2606.24232
Authors: Kim Youwang,Zhengyu Yang,Liuhao Ge,Yu Rong,Timur Bagautdinov,Su Zhaoen,Nir Sopher,Jovan Popović,Teng Deng,Tae-Hyun Oh,Chen Cao
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL
Abstract:We introduce FiCA, a Feed-forward, instant Gaussian Codec Avatar generation pipeline that creates lifelike avatars from a single portrait image. Generating a photorealistic and drivable avatar from just a single image is significantly challenging due to the limited visual information available to accurately infer the 3D appearance and geometry of human heads. To address this, we develop a novel system that combines human-centric vision foundation models with a diffusion model. This system is designed to fully exploit partial visual observations to generate lifelike human avatars. Our proposed diffusion model learns a generative mapping from these partial observations to complete and authentic 3D mesh reconstruction. Additionally, we introduce a feed-forward mesh refinement network that enhances the fidelity and identity preservation of the generated avatars, eliminating the need for person-specific test-time optimization. By leveraging a universal prior model that decodes a generated mesh into a set of 3D Gaussians, we generate a photorealistic 3D Gaussian avatar, capable of being driven with novel expressions in real-time. Our experiments demonstrate that the avatars generated by our feed-forward approach faithfully represent diverse identities and surpass the visual quality of avatars produced by recent competing methods.
[CV-75] Geometry-Instructed Video Editing
Link: https://arxiv.org/abs/2606.24225
Authors: Chirui Chang,Xiaoyang Lyu,Yi-Hua Huang,Haoru Tan,Shizhen Zhao,Yikang Ding,Jianmin Bao,Xin Tao,Pengfei Wan,Xiaojuan Qi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Object-level geometric edits, including translating, rotating, scaling, duplicating, or removing an object, are routine operations in digital content creation (DCC) workflows, yet they remain unreliable in generative video editing. The key challenge lies in specifying the target object’s 3D state change unambiguously across viewpoint and time, while consistently updating geometry-dependent secondary effects such as shadows and reflections. We introduce GIVE, a geometry-instructed video editing framework that represents edits through a unified object-state formulation. Two video-aligned geometry streams describe the target object before and after editing: a depth-box encoding coarse 3D placement and extent, and an orientation-box providing an appearance-agnostic orientation cue. Together, these streams provide a compact pre/post geometric specification for object-state transitions. To provide paired supervision for learning these edits, we build a scalable graphics-engine pipeline that executes object-level edit programs and renders controlled before/after pairs, isolating the intended geometric edit while keeping secondary effects consistent with the transformation. Experimental results demonstrate that GIVE produces faithful geometric edits with temporal coherence and consistent secondary effects across operators in a unified framework, and shows promising transfer to in-the-wild videos. Project page: this https URL
[CV-76] MorVess: Morphology-Aware Pulmonary Vessel Segmentation Network
Link: https://arxiv.org/abs/2606.24214
Authors: Fuyou Mao,Yifei Chen,Beining Wu,Lixin Lin,Jinnan Dai,Zhiling Li,Yilei Chen,Yaqi Wang,Hao Zhang,Yan Tang,Huiyu Zhou,Feiwei Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate pulmonary vessel segmentation remains challenging due to the sparse, tortuous, and multi-scale nature of vascular structures, where small branches are easily lost and topology integrity is difficult to preserve under voxel-wise supervision. Existing deep segmentation models primarily optimize binary masks, lacking explicit geometric constraints, thus struggling to recover continuous tubular morphology and fine vascular connectivity. In this study, we introduce MorVess, a morphology-aware segmentation framework that integrates differentiable geometric priors with large-scale foundation model adaptation to achieve fine-grained vascular parsing. MorVess jointly predicts vessel masks, distance maps, and thickness maps, providing explicit supervision for vascular boundaries, centerline consistency, and smooth diameter transitions. A lightweight 2.5D adapter bridges 3D spatial context and 2D SAM representations, while a global-local fusion block aggregates multi-level semantics and geometric cues for high-fidelity topology reconstruction. Across two challenging pulmonary CT benchmarks, MorVess delivers superior Dice, clDice, and HD95 scores, substantially improving small-vessel recovery and global connectivity. These results demonstrate that embedding geometric intelligence into pretrained vision models offers a principled and scalable pathway toward precise vessel analysis and clinically reliable structural quantification. Our source code is available at this https URL.
[CV-77] Inclusive Interactive Collisions for Multi-View Consistent Compositional 3D Generation
Link: https://arxiv.org/abs/2606.24206
Authors: Chang Liu,Mingwen Shao,Xiang Lv,Xinyuan Chen,Lingzhuang Meng,Qiao Zhang,Zhengyi Gong,Jinghao Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:3D generation has advanced notably with the development of text-to-image diffusion models. However, existing methods still face two practical challenges: (1) They primarily generate a single 3D object and struggle to generate multi-object compositional 3D assets, due to the lack of modeling of reasonable interactions among Gaussian primitives. (2) They often suffer from cross-view inconsistency during 3D optimization, as Score Distillation Sampling inherently operates on each view separately, inevitably resulting in cross-view hallucinations. To address these issues, we propose I2C-3D, a novel optimization-based method that generates multi-view consistent compositional 3D assets with reasonable interactions. Specifically, we propose an Inclusive Interactive Collisions strategy that guides Gaussian primitives to appear naturally in reasonable interaction regions, thereby ensuring that objects in the compositional scene interact in a physically plausible and visually coherent way. Additionally, to enhance multi-view consistency, we devise Multi-View Adaptive Score Distillation Sampling, which distills a multi-view consistency prior and a layout prior from a pre-trained diffusion model by modulating the attention maps of instance tokens and spatial tokens across viewpoints. Benefiting from these designs, I2C-3D not only generates high-fidelity, multi-view consistent compositional 3D assets but also supports flexible 3D editing, facilitating complex scene generation. Extensive experiments demonstrate that I2C-3D outperforms existing methods in generation quality and multi-view consistency.
[CV-78] Towards Fast and Effective Long Video Understanding of Multimodal Large Language Models via Adaptive Quasi-Gaussian Sampling NEURIPS2026
Link: https://arxiv.org/abs/2606.24187
Authors: Kun Zhang,Chenxin Fang,Tao Chen,Baiyang Song,Yunhang Shen,Yiyi Zhou,Rongrong Ji
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2026 submission. 15 pages, 8 figures
Abstract:Long video understanding remains a daunting challenge for Multimodal Large Language Models (MLLMs) due to the excessive computation and memory footprint. Thus, keyframe selection is often adopted to mitigate this shortcoming, which however still suffers from low flexibility and high noise due to its hard sampling principle. In this paper, we define video frame selection as a problem of Quasi-Gaussian Sampling, and propose an adaptive and training-free approach termed AdaQ. Inspired by the 3σ rule of Gaussian distribution, the objective of AdaQ is to achieve the optimal 3σ interval for different examples, i.e., a smaller 3σ interval for the local query and a larger one for the global query, thereby facilitating robust and adaptive frame sampling. To validate AdaQ, we apply it to four MLLMs with three embedding models. The extensive experimental results not only show its obvious performance gains over the default MLLMs and the SOTA keyframe selection methods, e.g., helping Qwen3-VL-8B outperform GPT4o by 15.8% on average by using only 64 frames, but also confirm its superior robustness and high efficiency for long-video understanding, e.g., only 1 hyper-parameter needs to be set. Our code project is given at this https URL.
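To make the 3σ idea above concrete, here is a minimal sketch of selecting frames from a window of three standard deviations around a center frame. It is not the AdaQ algorithm: the abstract does not say how the center or the spread is chosen, so `center`, `sigma`, and `budget` are hypothetical inputs, and the even spacing inside the window is an assumption made for illustration.

```python
import math

def three_sigma_window(num_frames, center, sigma):
    """Frame indices inside [center - 3*sigma, center + 3*sigma], clipped to the video."""
    lo = max(0, math.ceil(center - 3 * sigma))
    hi = min(num_frames - 1, math.floor(center + 3 * sigma))
    return list(range(lo, hi + 1))

def pick_frames(num_frames, center, sigma, budget):
    """Pick up to `budget` evenly spaced frames from the 3-sigma window (assumed policy)."""
    window = three_sigma_window(num_frames, center, sigma)
    if budget <= 0 or not window:
        return []
    if len(window) <= budget:
        return window
    if budget == 1:
        return [window[len(window) // 2]]
    step = (len(window) - 1) / (budget - 1)
    return [window[round(i * step)] for i in range(budget)]

# A "local" query uses a small sigma, a "global" query a large one.
local_frames = pick_frames(num_frames=1000, center=500, sigma=10, budget=4)
global_frames = pick_frames(num_frames=1000, center=500, sigma=100, budget=4)
```

With a small `sigma` the selected frames cluster around the center, while a large `sigma` spreads the same budget over most of the video, which mirrors the local versus global distinction drawn in the abstract.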
[CV-79] Deep Learning Approaches for 3D Medical Scene Completion: From Geometric Modeling to Generative Paradigms
Link: https://arxiv.org/abs/2606.24180
Authors: Afifa Khaled,Said Jadid Abdulkadir,Majdy Mohamed Eltayeb Eltahir
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Three-dimensional scene completion has become a major problem in computer vision and robotics, with diverse applications including autonomous navigation and augmented reality. This study presents a systematic review of the research contributions of the last ten years (2016 to 2026), a period in which the field moved from the voxel semantic completion paradigm represented by SSCNet to the latest paradigm, which combines generative diffusion priors with real-time rendering using Gaussian splatting. We trace the evolution of representation paradigms, including voxel grids, point learning, implicit neural fields, transformer networks, diffusion networks, and the latest paradigm based on rendering-aware 3D Gaussian primitives. We analyze these contributions comprehensively and organize them into a taxonomy, and we discuss the challenges that still need to be addressed. Finally, we present a research agenda that outlines directions for the development of the next-generation system.
[CV-80] Zero-Shot Test-Time Canonicalization using Out-of-Distribution Scoring
Link: https://arxiv.org/abs/2606.24178
Authors: Dominik Lindner,Johann Schmidt,Tom Siegl,Martin Becker,Sebastian Stober
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Pretrained vision models often misclassify inputs that are rotated, scaled, or sheared, even though these affine transformations leave the object class unchanged. Robustness is usually restored either by building equivariance into the architecture or by retraining with augmentation, both of which require changing or retraining the model. Test-time canonicalization instead leaves the classifier untouched. It undoes the transformation of each input, mapping it to a canonical form near the training distribution before classification. Existing canonicalizers, however, rely on a narrow set of logit-based energy scores and bespoke search procedures, leaving the design space of scoring functions and optimizers unexplored. We reframe canonicalization as out-of-distribution (OOD) detection, which lets any OOD score serve as the energy minimized over transformations. Across benchmarks ranging from handwritten characters and sketches to natural images and 3D point clouds, we systematically evaluate around twenty OOD scores and nine search algorithms, finding that distance-based scores paired with random search and local refinement perform best overall. Because canonicalizing an already-aligned input can hurt accuracy, we add a gated mechanism that transforms an input only when its OOD score indicates this is needed, preserving most in-distribution accuracy while retaining the robustness gains on transformed inputs. Code is available at this http URL.
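The recipe described above (score each candidate transformation with an OOD score, search for the minimum, and gate the whole step) can be sketched on a toy problem. Everything here is an assumption for illustration: the input is a small 2D point set, the only transformation is a rotation, and `ood_score` is a stand-in distance to a reference point set rather than any score evaluated in the paper.

```python
import math
import random

def rotate(points, angle):
    """Rotate 2D points counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def ood_score(points, prototype):
    """Toy distance-based score: mean squared distance to a reference point set."""
    total = sum((px - qx) ** 2 + (py - qy) ** 2
                for (px, py), (qx, qy) in zip(points, prototype))
    return total / len(prototype)

def canonicalize(points, prototype, threshold=1e-3, n_random=200, n_refine=50, seed=0):
    """Return (canonicalized points, applied angle); aligned inputs are left untouched."""
    best_angle, best = 0.0, ood_score(points, prototype)
    if best <= threshold:  # gate: the input already looks in-distribution
        return list(points), 0.0
    rng = random.Random(seed)
    for _ in range(n_random):  # random search over rotation angles
        angle = rng.uniform(-math.pi, math.pi)
        score = ood_score(rotate(points, angle), prototype)
        if score < best:
            best_angle, best = angle, score
    step = 0.1
    for _ in range(n_refine):  # local refinement around the best angle found so far
        improved = False
        for angle in (best_angle - step, best_angle + step):
            score = ood_score(rotate(points, angle), prototype)
            if score < best:
                best_angle, best, improved = angle, score, True
        if not improved:
            step /= 2  # shrink the step once neither neighbor helps
    return rotate(points, best_angle), best_angle

prototype = [(1.0, 0.0), (0.0, 2.0), (-1.0, -1.0)]
tilted = rotate(prototype, 0.7)
fixed, angle = canonicalize(tilted, prototype)
```

In this toy setting the search recovers an angle close to -0.7, undoing the rotation, while an input that already matches the prototype passes through the gate unchanged. A real classifier would replace `ood_score` with a score computed from model features.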
[CV-81] Tri-Efficient Transfer Learning for Point Cloud Videos
Link: https://arxiv.org/abs/2606.24175
Authors: Yiding Sun,Dongxu Zhang,Jihua Zhu,Haozhe Cheng,Zhengqiao Li,Pengcheng Li,Chaowei Fang,Yonghao Dong,Lin Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While point cloud foundation models have significantly advanced point cloud video understanding, existing parameter-efficient fine-tuning (PEFT) methods still suffer from two critical limitations: prohibitive annotation costs for large-scale point cloud datasets and severe memory bottlenecks. In this paper, we aim to mine richer supervision signals from existing data rather than blindly scaling datasets. A further key principle is that the memory footprint of fine-tuning must be drastically reduced compared to full fine-tuning, which remains elusive for current PEFT techniques. Driven by these challenges, we identify three core desiderata: data-, parameter-, and memory efficiency, and present PoinTriE, a unified framework that excels along all three dimensions. For pre-training, pseudo-motion trajectories are synthesized via rigid transformations, paired with text corpora and 2D projections derived from raw point clouds. We then propose a Geometric-Motion Duality Network optimized via multimodal contrastive learning, rigid rotation prediction, and motion distribution divergence to produce dense self-supervision. During fine-tuning, we freeze the pretrained backbone and only update a lightweight Spatio-temporal Side Network built with LoRA units. Equipped with a gradient flow masking strategy, PoinTriE simultaneously reduces memory consumption and parameter overhead. Extensive experiments confirm that PoinTriE establishes new state-of-the-art results on action recognition and semantic segmentation tasks.
[CV-82] Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models ECCV2026
Link: https://arxiv.org/abs/2606.24165
Authors: Bin Chen,Yuxiang Cai,Yadan Luo,Yi Zhang,Jianwei Yin,Zhi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:Reducing visual token redundancy is critical for accelerating Multimodal Large Language Models (MLLMs) without degrading cross-modal reasoning performance. Existing token pruning methods typically rely on single-layer signals, such as attention scores or token similarities, which overlook the cross-layer transformation of visual representations and may exhibit positional bias in multimodal token sequences. To address this limitation, we propose a training-free token pruning framework based on Cross-Layer Spectral Evolution (CLSE). Instead of measuring token importance from single-layer feature magnitudes, CLSE quantifies how token representations evolve across Transformer layers in the frequency domain. This evolution reflects the transition from high-frequency structural details to low-frequency semantic abstractions. We observe that tokens with stronger spectral redistribution across layers are more likely to be semantically active and should therefore be preserved. By modeling cross-layer token dynamics, CLSE provides a stable importance criterion that mitigates positional bias. Extensive experiments on both image and video benchmarks demonstrate that CLSE achieves a superior trade-off between efficiency and accuracy under aggressive token reduction. Across multiple MLLMs, CLSE reduces FLOPs, KV cache memory, and latency while maintaining competitive or improved performance.
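One way to picture "spectral redistribution across layers" is to compare a token's normalized power spectrum at an earlier and a later layer. The sketch below does that with a plain DFT over the feature dimension and a total-variation distance. Both choices are assumptions made for illustration; the abstract does not specify how CLSE computes its score.

```python
import cmath

def power_spectrum(features):
    """Normalized power over the non-negative DFT frequencies of a real vector."""
    n = len(features)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(features[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    total = sum(spectrum) or 1.0  # avoid dividing by zero for an all-zero vector
    return [s / total for s in spectrum]

def spectral_shift(feat_early, feat_late):
    """Total-variation distance between the two spectra (0 = same, 1 = disjoint)."""
    p, q = power_spectrum(feat_early), power_spectrum(feat_late)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical token whose energy moves from the highest frequency to the DC term.
early = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
late = [1.0] * 8
shift = spectral_shift(early, late)
```

Under this toy score, a token whose spectrum is unchanged between layers gets 0 and would be a pruning candidate, while the example token above gets a score close to 1 and would be kept.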
[CV-83] Dual-Branch Cross-Projection Debiasing through Diffusion-based Disentanglement
Link: https://arxiv.org/abs/2606.24161
Authors: Xiangqian Zhao,Xinyang Jiang,Zhipeng Xu,Lingfeng He,Zilong Wang,Dongsheng Li,De Cheng,Nannan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Foundation models trained on biased datasets often rely on spurious correlations between target labels and non-causal attributes, resulting in poor generalization on minority groups. Bias mitigation remains challenging due to two fundamental issues. First, when group labels are unavailable, existing group-unsupervised methods typically infer spurious attributes implicitly from model behavior, making it difficult to identify spurious factors that are semantically aligned with real-world biases. Second, even with pseudo spurious supervision, most existing debiasing methods follow a single-branch design that operates within a single shared feature space, where target and spurious attributes are intrinsically entangled. To address the first challenge, we introduce Confidence-guided Bias Concept Mining (CBCM), which leverages diffusion-disentangled, semantically grounded concept representations to identify reliable spurious attributes without attribute annotations. To address the second challenge, we propose Dual-branch Cross-projection Debiasing (DCD), a prompt-tuning framework that separates target and spurious representations into two branches and explicitly removes spurious information through cross null-space projection while preserving target-relevant semantics. Extensive experiments on four benchmark datasets show that our method achieves state-of-the-art worst group accuracy among group-unsupervised approaches, while tuning at most 0.22% of the model parameters. The source code is available in the supplementary materials.
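The null-space projection mentioned above has a simple single-direction form: subtract from a feature its component along a spurious direction, so the result is orthogonal to that direction. The sketch below shows only this basic operation. The vectors are made up, and how DCD builds its two branches and applies the projection across them is not described in the abstract.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_out(feature, direction):
    """Remove the component of `feature` that lies along `direction`."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return list(feature)  # nothing to remove for a zero direction
    coeff = dot(feature, direction) / norm_sq
    return [f - coeff * d for f, d in zip(feature, direction)]

# Hypothetical target feature and spurious direction.
target_feature = [3.0, 4.0, 0.0]
spurious_direction = [1.0, 0.0, 0.0]
cleaned = project_out(target_feature, spurious_direction)
```

Here `cleaned` equals `[0.0, 4.0, 0.0]`: the part aligned with the spurious direction is gone and the rest of the feature is preserved.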
[CV-84] Accelerating Multimodal Large Language Models with Prior-Corrected Token Reduction ECCV2026
Link: https://arxiv.org/abs/2606.24156
Authors: Zengjie Chen,Yuxiang Cai,Jingcai Guo,Taotao Cai,Jianwei Yin,Zhi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026
Abstract:Visual token reduction has emerged as an effective strategy for accelerating Multimodal Large Language Models (MLLMs). Many existing methods prune tokens by ranking text-visual attention scores. However, we show that attention is often dominated by a model-induced prior: even without textual instruction, MLLMs tend to focus on certain task-agnostic regions. Consequently, the attention scores of instruction-conditioned tokens are suppressed, increasing the risk that these tokens are discarded during pruning. To address this issue, we propose Prior-Corrected Token Reduction (PriorTR), a training-free token reduction method that explicitly separates task-conditioned attention from the model-induced prior. PriorTR estimates the attention map of the prior, and contrasts it with the task-conditioned attention distribution to measure the additional usable information contributed by each visual token. Importantly, PriorTR computes both the model-induced prior and the task-conditioned posterior within a single forward pass by introducing a null token that serves as an instruction-agnostic probe in the attention block. This design avoids duplicated propagation. Extensive experiments across multiple multimodal benchmarks and MLLMs demonstrate that PriorTR consistently improves the trade-off between accuracy and efficiency over strong training-free baselines, particularly under aggressive token budgets.
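The contrast between task-conditioned attention and a model-induced prior can be illustrated with a small ranking example. The scoring rule below, each token's pointwise contribution to the KL divergence between the two attention distributions, is an assumption chosen for illustration; the abstract does not state the exact formula PriorTR uses, and the attention values are made up.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def prior_corrected_scores(task_attn, prior_attn, eps=1e-12):
    """Per-token gain of task-conditioned attention over the prior (assumed rule)."""
    p, q = normalize(task_attn), normalize(prior_attn)
    return [pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q)]

def keep_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

# Token 0 draws heavy attention even without an instruction (the prior).
task_attn = [0.50, 0.05, 0.30, 0.15]
prior_attn = [0.70, 0.10, 0.10, 0.10]

raw_keep = keep_top_k(task_attn, 2)
corrected_keep = keep_top_k(prior_corrected_scores(task_attn, prior_attn), 2)
```

Ranking by raw attention keeps tokens 0 and 2, whereas the prior-corrected ranking keeps tokens 2 and 3: token 0 receives less attention under the instruction than under the prior, so it is treated as task-agnostic in this sketch.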
[CV-85] Differential Unfolding: Efficient Unfolding Reconstruction for Video Snapshot Compressive Imaging
Link: https://arxiv.org/abs/2606.24153
Authors: Muyuan Zhang,Jiancheng Zhang,Haijin Zeng,Yin-ping Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While Deep Unfolding Networks (DUNs) dominate video Snapshot Compressive Imaging (SCI), they remain constrained by a uniform design philosophy. Existing methods repeatedly stack high-complexity priors with identical structures, ignoring the fact that optimization trajectories converge toward static states. This results in representation stagnation, where high-cost computations are wasted on minimal feature updates. To address this inefficiency, we present Differential Unfolding (DU), a heterogeneous framework that replaces uniform repetition with dynamic evolution. Central to DU is the Differential Evolutionary Framework (DEF), which partitions the unfolding process into two complementary roles: structural anchoring and differential evolution. In this scheme, high-parameter general stages are sparsely deployed to generate high-fidelity feature foundations. Complementing these, lightweight differential stages employ a Differential Representation Prior (DRP) to propagate and refine these foundational features through a differential mechanism. By integrating Differential Representation Attention (DRA) for evolving attention maps and a Differential Modulated FFN (DM-FFN) for feature rectification, DRP effectively models cross-stage variations with minimal overhead. By focusing computational resources on dynamic evolution rather than static redundancy, DU achieves a superior trade-off between accuracy and efficiency. Extensive experiments verify that our method establishes new state-of-the-art results while significantly slashing computational overhead. this https URL
[CV-86] Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models
链接: https://arxiv.org/abs/2606.24152
作者: Xin Wang,Wenxuan Liu,Tongtong Feng,Wenwu Zhu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 1 figure
Abstract:Existing literature claims that video generation is essentially world modelling. On the one hand, the claim is productive because it pushes generative AI beyond static images and toward temporally extended physical scenes. On the other hand, this claim dangerously relies on the belief that scaling visual prediction alone will automatically yield physical agents. We prefer a more accurate statement: video generation models learn a partial, implicit spatiotemporal world model, but not a fully grounded or controllable one. The reason is as follows: a model may generate a plausible video of a drone crossing a forest or a robot arm manipulating a cup, yet still fail to know which variables are controllable, which constraints belong to a particular body, and which futures remain valid under intervention. The frontier, in essence, is not predictive realism alone; instead, it emphasizes a self-evolving generative nature whose decisive criterion is counterfactual controllability: the capability to ask what would happen under an action, to test whether the generated future can survive embodiment constraints, and to feed the resulting action knowledge back into future imagination (generation). Therefore, in this paper we present a new perspective: autonomous video generation with counterfactual controllability is one promising way to realize self-evolving world models.
[CV-87] Geometry-Aware Style Transfer in 3D Gaussian Splatting ECCV2026
链接: https://arxiv.org/abs/2606.24144
作者: Min Hyeok Bang,Jun Hyeong Kim,Seung-Wook Kim,Se-Ho Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures, accepted at ECCV 2026
Abstract:In this paper, we present a novel geometry-aware style transfer framework for 3D Gaussian splatting (3DGS) that simultaneously transfers appearance attributes and geometric structures. Unlike prior works that primarily focus on color-based stylization and often overlook structural adaptation, our method explicitly incorporates geometry adaptation through a decoupled optimization scheme that alternately updates color and geometry parameters. This strategy alleviates potential interference between color and geometry updates, leading to stable and consistent scene-level geometry transformation. The decoupled optimization is enabled by the proposed geometry-aware contrastive feature matching (GCFM). GCFM integrates RGB, depth, and edge cues into a contrastive objective and is employed in both optimization phases to effectively transfer structural characteristics from style images to Gaussian primitives. Extensive experiments show that our approach achieves superior performance in both qualitative fidelity and quantitative metrics, significantly outperforming existing 3DGS-based stylization methods. Our code is available at this https URL.
[CV-88] Sat2City v2: Native 3D City Asset Generation from a Single Satellite Image
链接: https://arxiv.org/abs/2606.24138
作者: Tongyan Hua,Dongli Wu,Jinjing Zhu,Yinrui Ren,Zhongcheng Hong,Ying-Cong Chen,Hui Xiong,Wufan Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating explicit 3D city assets from a single satellite image is important for digital twins, urban simulation, and geospatial intelligence. Unlike satellite-to-street-view synthesis, the task requires a reusable textured mesh with plausible geometry and controllable appearance rather than a 3D proxy optimized only for rendering a small set of images or videos. The ICCV Sat2City framework made a first step by conditioning cascaded sparse-voxel latent diffusion on satellite-derived height maps, but its appearance was random, its training data were synthetic, and its task-specific VAE did not scale well to noisy real-world reconstructions. We present Sat2City v2, a journal extension that adapts a pretrained native structured-latent 3D foundation model to weakly aligned satellite images and textured meshes. We build a real-world dataset with 16,241 satellite-mesh pairs across 24 regions in 9 cities. Instead of learning a 3D representation from noisy city meshes, Sat2City v2 encodes each mesh into a pretrained native 3D latent space, fine-tunes a satellite-conditioned geometry flow, and uses the decoded shape to anchor satellite-conditioned texturing. This retains Sat2City’s geometry-to-appearance cascade while enabling appearance-controllable generation from the satellite input. Experiments on metric-scale DSM reconstruction and generative city-asset benchmarks for geometry and appearance show that Sat2City v2 achieves the best overall performance among evaluated baselines. Overall, Sat2City v2 advances satellite-to-city generation from rendering-oriented 3D proxies to explicit textured mesh assets, supported by, to the best of our knowledge, the first documented satellite-mesh paired dataset collected from matched geographic crops for this asset-level task. Project page: this https URL
[CV-89] Bengal-HP_RU: A Dataset of Bengal People For Head Pose Estimation
链接: https://arxiv.org/abs/2606.24122
作者: Md. Ahanaf Arif Khan,Md. Tawhidur Rahman,Sangeeta Biswas,Md. Iqbal Aziz Khan,Subrata Pramanik,Sanjoy Kumar Chakravarty,Bimal Kumar Pramanik
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing head pose datasets predominantly feature subjects of Western or East Asian origin, leaving South Asian populations, particularly Bengali individuals, largely underrepresented. We introduce Bengal-HP_RU, the first publicly available head pose dataset centred on Bengali subjects, comprising 12,894 labelled head images annotated with continuous yaw, pitch, and roll values. Images were collected from Wikimedia Commons under free licences and processed through an automated pipeline followed by manual label correction. The dataset is partitioned by Wikimedia uploader identity to prevent data contamination, yielding 10,494 training and 2,400 test images across 296 unique uploaders. Bengal-HP_RU exhibits substantial diversity in subject age, gender, occlusion, illumination, and background, reflecting realistic in-the-wild conditions. The dataset is publicly available at this https URL.
[CV-90] Flood Mapping from RGB imagery using a Vision Foundation Model
链接: https://arxiv.org/abs/2606.24120
作者: Vladyslav Polushko,Tilman Bucher,Ronald Rösch,Thomas März,Markus Rauhut,Andreas Weinmann
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Timely, high-resolution maps of flood extent around settlements are essential for emergency response and damage assessment. We consider airborne RGB imagery for flood mapping as it can be collected rapidly at low cost. To produce flood maps, deep learning models for water segmentation, either CNN-based or small vision transformers, are often used. However, they need large amounts of data to adapt to a change of scenery, i.e., another flooding event. Vision foundation models or large vision transformers are known to generalize across domains. Recently, foundation models for Earth observation became available. They are pretrained on satellite data, whose spatial resolution, viewing geometry, and radiometry differ from nadir RGB imagery. Thus, adaptation is required. We investigate how a satellite-pretrained Earth observation foundation model can be adapted to centimeter-scale floodwater mapping from RGB imagery. Specifically, we fine-tune a model we call Prithvi-2.0-UPN, consisting of the Prithvi-EO-2.0-600M Vision Transformer combined with a UPerNet decoder, for binary water segmentation on two RGB datasets (BlessemFlood21, NeuenahrFlood). In a first experiment, we observe that Prithvi-2.0-UPN reaches state-of-the-art results on BlessemFlood21 and NeuenahrFlood when trained on the respective datasets. In a second experiment, we show that Prithvi-2.0-UPN performs better than state-of-the-art baseline models for transfer to a new flood event (trained on BlessemFlood21, tested on NeuenahrFlood) in a zero-shot setting. However, the performance indicates room for improvement. In this respect, we investigate in a third experiment how performance improves when further fine-tuning the models with small shares of NeuenahrFlood training data: Prithvi-2.0-UPN improves the fastest and almost reaches the performance level obtained when fully trained on NeuenahrFlood, indicating transfer capabilities.
[CV-91] An LMM for Precisely Grounding Elements in Documents
链接: https://arxiv.org/abs/2606.24118
作者: Yijian Lu,Chuangxin Zhao,Kai Sun,Lei Hou,Juanzi Li,Ji Qi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise element grounding that can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data with two pipelines capable of mass-producing high-quality documents with paired metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects. The model develops more real-world functions beyond straightforward localization of single text, such as locating personal information from CVs. Furthermore, we introduce a training paradigm for visually grounded reasoning in which grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.
[CV-92] A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy
链接: https://arxiv.org/abs/2606.24115
作者: Aminu Lawal,Niyoj Oli,Sachin Acharya,Prashnna Gyawali,Maria Carmen Romano,Binod Bhattarai
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the Medical Image Understanding and Analysis (MIUA) 2026 conference
Abstract:Vision-language models (VLMs) are prone to hallucination, which remains a major barrier to their safe deployment in clinical practice. To date, most hallucination detection methods have been evaluated on radiology benchmarks such as MIMIC-CXR and VQA-RAD, while gastrointestinal (GI) endoscopy remains largely underexplored. In this paper, we benchmark nine hallucination detection methods on the Gut-VLM dataset, a GI diagnostic Visual Question Answering (VQA) dataset with 4,392 test VQA pairs, across five VLMs (MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B). The methods span three categories: black-box methods (RadFlag, SelfCheckGPT-NLI), gray-box methods (AvgProb, AvgEnt, MaxProb, MaxEnt, Semantic Entropy, and VASE), and a white-box method (ReXTrust). Our results show that ReXTrust, a white-box method, achieves the highest AUC across all five models, outperforming the strongest alternative method on each VLM by a statistically significant margin (paired permutation test, p < 0.001 in all cases), reaching a peak AUC of 93.0 on MedGemma-4B. White-box hidden-state access provides a consistent advantage of 19.5 AUC points on average (range: 9.5–33.5), with ReXTrust maintaining strong performance even on LLaVA-v1.6-7B (AUC 79.9), where black-box methods and clustering-based gray-box methods collapse to near-chance performance. Among non-white-box methods, token-level gray-box statistics (MaxEnt, MaxProb) are the strongest alternatives, outperforming both clustering-based gray-box methods (Semantic Entropy, VASE) and black-box approaches on average. We further identify confident confabulation, a failure mode in which models hallucinate with high inter-sample consistency or high token-level probability, as a systemic failure for both consistency and uncertainty-based methods.
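To make the token-level gray-box statistics named in the abstract above concrete, here is a minimal sketch of how such scores are commonly computed from a model's output probabilities. The exact definitions used in the benchmark are not given in the abstract, so the formulas below (average and maximum negative log-probability of the generated tokens, average and maximum entropy of the next-token distributions) are an assumption, as are the function names and the toy numbers.

```python
import math

def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def graybox_scores(chosen_probs, full_dists):
    """Token-level uncertainty scores; higher suggests a less reliable answer.

    chosen_probs : probability the model assigned to each generated token
    full_dists   : full next-token distribution at each generation step
    Assumed definitions, not necessarily those used in the benchmark.
    """
    neg_logp = [-math.log(p) for p in chosen_probs]
    entropies = [token_entropy(d) for d in full_dists]
    return {
        "AvgProb": sum(neg_logp) / len(neg_logp),
        "MaxProb": max(neg_logp),
        "AvgEnt": sum(entropies) / len(entropies),
        "MaxEnt": max(entropies),
    }

# Toy three-token answer over a three-word vocabulary.
chosen = [0.9, 0.5, 0.8]
dists = [[0.9, 0.05, 0.05], [0.5, 0.5, 0.0], [0.8, 0.1, 0.1]]
scores = graybox_scores(chosen, dists)
```

Under these definitions, a confidently wrong answer yields low scores on all four statistics, which matches the confident confabulation failure mode the abstract describes for uncertainty-based methods.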
[CV-93] DramaDirector: Geometry-Guided Short Drama Generation
链接: https://arxiv.org/abs/2606.24107
作者: Hengji Zhou,Sijie Liu,Jianrun Chen,Xingchen Zou,Lianghao Xia,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 17 figures, 6 tables. Code is available at this https URL
Abstract:Short dramas, with their rapid shot rhythms, dialogue-driven focus shifts, and demanding cinematographic grounding, pose challenges that prompt-level or text-only video generation pipelines struggle to meet. We study plot-to-short-drama generation, where a global plot and local context are transformed into visually grounded multi-shot videos. We propose DramaDirector, a geometry-grounded framework that lets the planner borrow cinematographic geometry from a gallery of real short-drama shots indexed by depth and pose. DramaDirector decouples each shot into static visual and dynamic narrative conditions, trains the planner with schema-constrained SFT and GRPO under a learned text-visual alignment reward, and retrieves depth-pose references to guide first-frame generation and image-to-video synthesis. We also introduce DramaBoard, a benchmark built from 35 live-action dramas, 2.8K episodes, and 81K shots, with structured storyboards and multi-dimensional evaluation protocols. Experiments show that DramaDirector improves over representative multi-agent and video generation baselines on faithfulness, consistency, and controllability. Our code is released at: this https URL
[CV-94] NavWM: A Unified Navigation World Model for Foresight-Driven Planning ECCV2026
链接: https://arxiv.org/abs/2606.24101
作者: Yanghong Mei,Longteng Guo,Ming-Ming Yu,Guiyu Zhao,Xingjian He,Jing Liu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures, accepted to ECCV 2026
Abstract:Conventional visual navigation policies often struggle with myopic decision-making and mode collapse in complex environments. While world models offer a promising alternative, existing paradigms typically isolate perception, generation, and control, failing to capture their shared spatio-temporal dynamics. In this paper, we propose NavWM, a unified navigation world model that seamlessly integrates latent world reasoning, multimodal action prediction, and controllable visual generation. At its core, NavWM leverages latent world tokens to distill geometric and semantic priors, endowing the agent with robust structural understanding. To overcome the limitations of deterministic policies, we introduce an anchor-based multimodal trajectory forecasting framework that generates a diverse action space. This inherent diversity explicitly empowers the generative world model to act as a robust closed-loop planner, utilizing visual foresight to evaluate and select the optimal path. Extensive experiments across diverse robotics datasets demonstrate that NavWM significantly advances the state-of-the-art, delivering remarkable improvements in both high-fidelity future state generation and zero-shot navigation success.
[CV-95] Beyond Bayer: Task-Optimal Sensor Co-Design for Robust Autonomous-Driving Segmentation
链接: https://arxiv.org/abs/2606.24096
作者: Reeshad Khan,John Gauch
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Robust perception underpins autonomous driving, and most recent progress comes from scaling the model: larger backbones, foundation models, and cooperative multi-agent fusion. We pursue a complementary, upstream question: what should the camera itself measure? Using a differentiable RAW-to-task pipeline, we decompose which sensor degrees of freedom benefit dense prediction. Learning the spectral colour-filter-array (CFA) weights is the dominant lever, improving mIoU by +0.017 (KITTI-360) and +0.023 (ACDC) over a fixed camera. In contrast, point-spread-function (optics) co-design is net-negative (-0.020 mIoU on KITTI-360), a consequence of the data-processing inequality, which also bounds the task information that any downstream model, however large or cooperative, can recover. Noise co-optimisation is marginal, and, counter to intuition, enlarging the CFA tile beyond 2x2 consistently hurts, as the filters are confined to the rank-three sRGB input. Because the intervention is at the sensor, the gains are model-agnostic; we validate robustness on ACDC’s fog, night, rain, and snow, and conclude with a simple recipe: learn the 2x2 CFA weights and keep an identity PSF.
[CV-96] Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent CVPR2026
链接: https://arxiv.org/abs/2606.24094
作者: Wenliang Zhong,Rob Barton,Lucas Goncalves,Kushal Kumar,Feng Jiang,Hehuan Ma,Yuzhi Guo,Vidit Bansal,Karim Bouyarmane,Junzhou Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling, which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce LLM Traversal based on Minimum Spanning Tree that selectively applies LLM reasoning for complex semantic judgments. Our method generalizes across diverse clustering scenarios spanning from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions. Our framework consistently outperforms specialized methods across diverse clustering tasks.
[CV-97] Progressive Pixel-Neighborhood Deformable Cross-Attention for Multispectral Object Detection
链接: https://arxiv.org/abs/2606.24092
作者: Tian Qiu,Jifeng Shen,Xin Zuo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Sensors
Abstract:Effective cross-modal feature alignment and interaction are central challenges in multispectral object detection. Although global cross-attention provides strong long-range modeling ability, its quadratic complexity with respect to feature size limits deployment on resource-constrained platforms. We therefore propose Progressive Pixel-Neighborhood Deformable Cross-Attention for multispectral feature fusion, termed PNAFusion. The proposed framework is motivated by two observations: weak misalignment between visible and thermal images is usually concentrated around local neighborhoods, and semantic correspondence across modalities often follows non-linear spatial mappings that fixed receptive fields cannot model well. To address these issues, PNAFusion incorporates local spatial priors into its architectural design to concentrate feature interaction and alignment on the most relevant neighborhoods. Specifically, a Pixel-Neighborhood Cross-Attention (PNCA) module is introduced to avoid redundant global feature matching and suppress background noise. Meanwhile, an Adaptive Deformable Alignment (ADA) module captures non-linear spatial correspondences through learned pixel-wise offsets. These components are further integrated through an iterative feedback mechanism to progressively refine cross-modal feature alignment. Experiments on FLIR, M3FD, and DroneVehicle show that PNAFusion achieves 84.2, 90.5, and 85.5 mAP@0.5, respectively, under the YOLOv5 detector, and further reaches 86.8 mAP@0.5 on FLIR and 90.8 mAP@0.5 on M3FD when transferred to Co-DETR. Efficiency analysis indicates that PNAFusion reduces allocated GPU memory by 33.0% compared with ICAFusion and reduces theoretical FLOPs from 194.8 G to 156.4 G, although the deformable sampling and iterative refinement introduce additional latency. Our code will be available at this https URL.
[CV-98] End-to-End Radar and Communication Modulation Recognition with Neuromorphic Computing
链接: https://arxiv.org/abs/2606.24075
作者: Xiaohu Li,Chongxiao Qu,Caiyong Lin,Chenxiao Dou,Wei Hua
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Although deep learning-based methods can achieve high accuracy in automatic modulation recognition (AMR) tasks, their high computational cost makes it difficult to strike a balance between accuracy and power consumption, thereby limiting their application on resource-constrained platforms. Neuromorphic architectures that perform spike-driven inference with modest energy budgets have recently been explored for vision and time-series tasks. Motivated by these works, we propose EMRFormer, a novel end-to-end spiking neural network (SNN) architecture that adapts the spike-driven transformer to the constraints of neuromorphic hardware for AMR. The model incorporates an adaptive spike encoder and Integer Leaky Integrate-and-Fire neurons to mitigate the degradation of effective information and enhance SNN representational capacity. By integrating spike-separable Convolution Neural Networks (SSCNN) into Spike-Driven Transformers (SpikeFormer), EMRFormer effectively extracts multi-scale temporal features from the raw IQ waveforms. We validate our approach across various mainstream datasets; the experimental results show that EMRFormer achieves state-of-the-art accuracy, outperforming all the baselines. Furthermore, the model maintains strong performance in low signal-to-noise ratio (SNR) environments and reduces theoretical energy consumption by over 90%. Finally, we evaluate our model on a KA200 neuromorphic chip. The results show that our model achieves up to a 5-fold reduction in power compared to running on a 3090 GPU or an Orin NX. This work demonstrates a promising pathway for AMR on resource-constrained devices.
[CV-99] Fabric Image Demoiréing Benchmark from Synthesis to Restoration ECCV2026
链接: https://arxiv.org/abs/2606.24072
作者: Pengchao Wei,Xiaojie Guo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Fabric moiré is a sampling-induced aliasing artifact caused by the interaction between fine textile patterns and camera sensor grids, producing structured interference that severely degrades image quality. Unlike screen-induced moiré, which stems from strictly periodic display lattices, fabric moiré is intrinsically more challenging due to the broadband and semi-periodic nature of textile weaves. The heavy spectral overlap between intrinsic texture and aliasing components renders fabric demoiréing substantially more ill-posed. Consequently, existing models trained on screen moiré datasets generalize poorly to these complex textile patterns. Despite its practical importance, fabric image demoiréing remains underexplored and lacks standardized benchmarks. We present the first comprehensive benchmark for fabric image demoiréing. To address the difficulty of acquiring pixel-aligned real-world pairs, we develop a physically motivated synthesis framework and construct a large-scale dataset comprising 16,050 paired multi-resolution fabric images with controllable aliasing severity. Furthermore, we customize a baseline model, which establishes promising performance on the proposed benchmark dataset with strong generalization ability. Our benchmark provides a standardized platform for advancing research in fabric image demoiréing.
[CV-100] ObsGraph: Hierarchical Observation Representation for Embodied Reasoning and Exploration
链接: https://arxiv.org/abs/2606.24068
作者: Taekbeom Lee,Youngseok Jang,Jeonghwa Heo,Jeongjun Choi,H. Jin Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Embodied reasoning and exploration are increasingly considered crucial abilities for robots operating in complex and unfamiliar environments. To accomplish tasks in such settings, an agent must identify and acquire the information necessary for the task through exploration. We propose ObsGraph, an observation-centric hierarchical scene graph that unifies scene representation, retrieval, and exploration. It retains visual evidence and organizes it into room-view-object layers: rooms provide coarse semantic anchors, views preserve contextual object covisibility, and objects store fine-grained details. On top of this representation, we perform coarse-to-fine hierarchical retrieval under a bounded budget, and crucially use retrieval outcomes to structure the exploration candidate space (activating room-level exploration, view refinement, or frontier exploration), thereby tightly coupling representation, retrieval, and adaptive multi-scale exploration. Experiments across embodied reasoning and exploration benchmarks demonstrate improved success and efficiency, highlighting the benefits of structured scene representation and more targeted information gathering driven by identified evidence gaps.
[CV-101] Ingredient-Level Food Image Segmentation for Nutrition Awareness
链接: https://arxiv.org/abs/2606.24059
作者: Jonesh Shrestha
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures, 4 tables
Abstract:Food images often contain several visible ingredients, so assigning one dish label to an entire image hides important visual structure. This work studies ingredient-level semantic segmentation on FoodSeg103, where the model predicts an ingredient class for each pixel. Two SegFormer variants were fine-tuned and evaluated under a controlled setup: SegFormer-B0 as the smaller baseline model and SegFormer-B1 as the larger final model. Both models use ImageNet-pretrained MiT backbones with newly initialized 104-class output layers. On the held-out FoodSeg103 test split of 2,135 images, B0 achieved 0.7709 pixel accuracy and 0.2521 mean IoU, while B1 achieved 0.7929 pixel accuracy and 0.3204 mean IoU. B1 improved every saved test metric, including a +0.0683 absolute gain in mean IoU. The system also converts predicted masks into visible ingredient-area percentages, giving a simple visual composition summary of the predicted meal. This summary can serve as a first-pass nutrition-awareness cue by providing a visual alternative to detailed food tracking similar to plate-based meal guidance, but it is not a direct estimate of calories, macronutrients, food mass, volume, density, or true portion size.
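The metrics and the area summary mentioned in the abstract above can be sketched in a few lines. The code below computes pixel accuracy, mean IoU, and visible ingredient-area percentages from label masks stored as nested lists. The class names, the choice to exclude background (label 0) from the percentages, and the convention of averaging IoU only over classes present in the prediction or the target are assumptions for illustration, not details confirmed by the paper.

```python
from collections import Counter

def flat_pairs(pred, target):
    """Pair up predicted and ground-truth labels pixel by pixel."""
    return [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the ground truth."""
    pairs = flat_pairs(pred, target)
    return sum(p == t for p, t in pairs) / len(pairs)

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the target."""
    pairs = flat_pairs(pred, target)
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

def area_percentages(mask, class_names, background=0):
    """Share of non-background pixels covered by each predicted class."""
    counts = Counter(v for row in mask for v in row if v != background)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {class_names[c]: 100.0 * n / total for c, n in counts.items()}

# Toy 2x3 masks with hypothetical classes: 0 background, 1 rice, 2 broccoli.
names = {0: "background", 1: "rice", 2: "broccoli"}
target = [[0, 1, 1], [2, 2, 1]]
pred = [[0, 1, 1], [2, 1, 1]]

acc = pixel_accuracy(pred, target)
miou = mean_iou(pred, target, num_classes=3)
shares = area_percentages(pred, names)
```

The toy case also shows why pixel accuracy can sit well above mean IoU, as in the reported results: one mislabeled pixel barely moves accuracy but halves the IoU of the small broccoli class.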
[CV-102] VisChronos: Revolutionizing Image Captioning Through Real-Life Events
链接: https://arxiv.org/abs/2606.24058
作者: Phuc-Tan Nguyen,Hieu Nguyen,Minh-Triet Tran,Trung-Nghia Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SOICT 2024
Abstract:This paper aims to bridge the semantic gap between visual content and natural language understanding by leveraging historical events in the real world as a source of knowledge for caption generation. We propose VisChronos, a novel framework that utilizes large language models and dense captioning models to identify and describe real-life events from a single input image. Our framework can automatically generate detailed and context-aware event descriptions, enhancing the descriptive quality and contextual relevance of generated captions to address the limitations of traditional methods in capturing contextual narratives. Furthermore, we introduce a new dataset, EventCap (this https URL), specifically constructed using the proposed framework, designed to enhance the model’s ability to identify and understand complex events. The user study demonstrates the efficacy of our solution in generating accurate, coherent, and event-focused descriptions, paving the way for future research in event-centric image understanding.
[CV-103] EPEdit: Redefining Image Editing with Generative AI and User-Centric Design
链接: https://arxiv.org/abs/2606.24057
作者: Hoang-Phuc Nguyen,Dinh-Khoi Vo,Trong-Le Do,Hai-Dang Nguyen,Tan-Cong Nguyen,Vinh-Tiep Nguyen,Tam V. Nguyen,Khanh-Duy Le,Minh-Triet Tran,Trung-Nghia Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SOICT 2024
Abstract:The demand for image manipulation has seen a significant increase recently. Traditional tools like Photoshop and Capture One, while powerful, require considerable expertise to use effectively. Generative AI has introduced alternative platforms, such as Luminar Neo, Pixlr X, and Canva. However, many of these solutions, including resource-heavy models like Stable Diffusion, often require substantial retraining and fine-tuning, leading to high costs for users. To address these challenges, we introduce Efficient Photo Editor (EPEdit), an application that integrates a robust backend framework with a user-friendly front-end interface. EPEdit supports a wide range of creative image editing tasks, including image generation, object replacement, object removal, background modification, changes in object pose or perspective, region-specific editing, and thematic collection design, all guided by masks and prompts. Users can interact with the system through simple text commands or by marking areas for precise adjustments, making it accessible even to those without technical expertise. At its core, EPEdit leverages zero-shot image editing algorithms based on the Stable Diffusion model, removing the need for additional fine-tuning. This approach enables efficient image manipulation and thematic collection creation. User evaluations of image editing, thematic design, and overall system performance demonstrate that EPEdit outperforms existing solutions, offering a user-friendly, cost-effective solution for comprehensive image editing.
[CV-104] DriveStack-VLA: Render-Teacher Alignment for BEV-Based DeepStack Vision-Language-Action Model
链接: https://arxiv.org/abs/2606.24051
作者: Jingke Wang,Zhenru Zhao,Shuangming Lei,Hao Su,Yuehao Huang,Yijia Xie,Kai Tang,Guanglin Xu,AiXue Ye,Yukai Ma,Yong Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language-Action driving models convert a pretrained Vision-Language Model into a driving policy, allowing them to use world knowledge and follow language guidance. However, existing VLA driving models still lack driving-oriented spatial intelligence: their policies are mainly grounded on perspective image tokens and language priors, while precise motion planning requires metric geometry, top-down scene structure, and attention to safety-critical perceptual cues. This limitation makes current models vulnerable to weak visual geometry modeling and perceptual coverage in expert demonstrations. In this paper, we present DriveStack-VLA, a framework built upon a large VLM backbone. To strengthen the spatial grounding of VLA driving, we develop dual visual modeling components. We inject a Bird's-Eye-View representation into the Large Language Model decoder through a DeepStack-style connection, and propose Render-Teacher Alignment to align the perceptual focus of real images with that of rasterized images. Furthermore, to bridge the gap in multimodal trajectory selection, we introduce a head-based self-critique module that ranks sampled trajectories and conditionally refines the best one. DriveStack-VLA achieves 91.6 PDMS on NAVSIMv1, 91.0 EPDMS on NAVSIMv2 (with the human penalty filter enabled), and a driving score of 79.49 with a success rate of 56.36% on the closed-loop Bench2Drive. More visualizations are available on our project page: this https URL.
[CV-105] Token-to-Token Alignment of Text Embeddings for Semantic Blending
链接: https://arxiv.org/abs/2606.24021
作者: Saar Huberman,Ron Mokady,Or Patashnik,Daniel Cohen-Or
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:In modern generative models, images are specified and controlled through text prompts. In practice, images are generated from sequences of tokens derived from these prompts. However, the space of token sequences lacks a consistent accessible structure: semantically similar images may correspond to sequences that differ in wording, ordering, and placement of concepts, while similar token sequences may encode very different semantics. This apparent lack of structure makes it difficult to perform smooth transitions in this space, hindering applications such as image blending and continuous control of edits. We argue that this limitation stems not from the absence of semantic structure, but from misalignment between representations. To address this misalignment, we introduce Token-to-Token alignment, a framework that establishes explicit semantic correspondence between tokens across prompts. Our approach transforms prompts into a structured representation in which semantically corresponding concepts are mapped to consistent positions across prompts, and then aligns their token embeddings based on semantic similarity. Concretely, the method consists of two stages: a structural alignment that rephrases prompts into a shared structured form, followed by an embedding-level alignment that matches token representations across prompts. With this alignment in place, simple linear interpolation becomes a meaningful operation, producing smooth and coherent semantic transitions and enabling applications such as blending and continuous editing. Our results show that text embedding spaces in text-to-image models implicitly encode a continuous semantic structure that becomes accessible once representations are properly aligned, suggesting that semantic control can be achieved by organizing existing representations rather than modifying the generative model.
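The embedding-level stage described in the abstract above (match tokens across prompts, then interpolate) can be sketched as follows. This is a minimal sketch that assumes both prompts have the same number of tokens and uses a greedy one-to-one matching by cosine similarity; the matching procedure, the function names, and the 2-D toy embeddings are hypothetical and stand in for whatever the authors actually use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def align_tokens(src, dst):
    """Greedy one-to-one matching: for each src token, the index of its dst partner.

    Assumes len(src) == len(dst).
    """
    candidates = sorted(
        ((cosine(s, d), i, j) for i, s in enumerate(src) for j, d in enumerate(dst)),
        reverse=True,
    )
    match, used = {}, set()
    for _, i, j in candidates:
        if i not in match and j not in used:
            match[i] = j
            used.add(j)
    return [match[i] for i in range(len(src))]

def blend(src, dst, t):
    """Linear interpolation between aligned token embeddings, t in [0, 1]."""
    order = align_tokens(src, dst)
    return [
        [(1 - t) * a + t * b for a, b in zip(s, dst[j])]
        for s, j in zip(src, order)
    ]

# Toy embeddings: the two prompts mention an animal and a scene in opposite order.
prompt_a = [[1.0, 0.0], [0.0, 1.0]]   # "cat", "beach"
prompt_b = [[0.1, 0.9], [0.9, 0.1]]   # "sunset", "dog"

order = align_tokens(prompt_a, prompt_b)
halfway = blend(prompt_a, prompt_b, 0.5)
```

Without the alignment step, interpolating position by position would mix "cat" with "sunset"; after matching, each interpolated token stays within one semantic slot.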
[CV-106] Cyclic Denoising Reveals Ultrastable Memories in Diffusion Models
Link: https://arxiv.org/abs/2606.24000
Authors: Rishabh Sharma, Stefano Martiniani
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 7 main figures; supplementary material included. Supplementary movies available at the project webpage
Abstract:We introduce cyclic denoising – repeated forward and reverse diffusion at controlled noise amplitudes – as an extraction attack for image diffusion models. Inspired by random organization in disordered solids, cyclic denoising exposes regions of the learned distribution that are largely inaccessible to standard sampling. The dynamics drive samples toward attractors with a broad stability spectrum. The deepest attractors are ultrastable: they regenerate after near-total corruption and persist through thousands of noising-denoising cycles. Many of these attractors correspond to memorized training images, including stock photographs, brand watermarks, and web-crawl artifacts. The attack requires only sampler-level control, with no gradients, weight inspection, prompts, captions, or prior knowledge of the training data. Unlike generate-and-filter attacks, which rely on large-scale prompted generation and post-hoc similarity or membership-inference filtering, our main protocol is fully unconditioned. We demonstrate the phenomenon in Stable Diffusion v1.4 and in a pixel-space DDPM, showing consistent behavior across latent- and pixel-space diffusion models. Across noise amplitudes, we observe a yielding-like transition: low-amplitude cycling produces trivial absorbing fixed points or limit cycles, while larger amplitudes induce rearrangements, basin hopping, and long-lived trapping in structured memorized attractor basins. We also observe hierarchical partial absorption, prompt-stabilized basins, and cross-initial-condition universality of the recovered attractor set. Our results therefore show that cyclic denoising is both a physics-inspired probe of generative landscapes and a practical tool for memorization auditing, with implications for privacy, copyright compliance, and model fingerprinting.
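The noising-denoising cycle described in this abstract can be illustrated on a toy problem. The sketch below is not the authors' setup: it assumes a one-dimensional "training set" of three hypothetical points and replaces the trained diffusion model with the exact posterior-mean denoiser for that empirical distribution. It only shows how repeated forward noising and denoising at a fixed amplitude pulls arbitrary starting points onto stored samples.

```python
import math
import random

def denoise(x_noisy, train_points, sigma):
    """Posterior-mean denoiser for an empirical distribution over train_points
    under Gaussian noise with standard deviation sigma (toy stand-in for a
    trained diffusion model)."""
    log_w = [-(x_noisy - p) ** 2 / (2 * sigma ** 2) for p in train_points]
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    return sum(wi * p for wi, p in zip(w, train_points)) / sum(w)

def cyclic_denoise(x0, train_points, sigma, n_cycles, rng):
    """Apply n_cycles of (add noise at amplitude sigma, then denoise)."""
    x = x0
    for _ in range(n_cycles):
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)  # forward step: add noise
        x = denoise(x_noisy, train_points, sigma)  # reverse step: denoise
    return x

train_points = [-2.0, 0.0, 3.0]  # hypothetical "memorized" samples
rng = random.Random(0)
finals = [cyclic_denoise(rng.uniform(-4.0, 5.0), train_points, 0.2, 50, rng)
          for _ in range(20)]
# Distance from each final state to its nearest stored sample.
nearest_gap = [min(abs(x - p) for p in train_points) for x in finals]
```

In this toy, every trajectory ends essentially on one of the stored points, which plays the role of an attractor. The amplitude-dependent transitions reported in the abstract would need a real model to reproduce.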
[CV-107] 3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy MICCAI2026
Link: https://arxiv.org/abs/2606.23964
Authors: Amirhossein Kardoost, Lion Gleiter, Tingying Peng, Carsten Marr
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: Accepted at MICCAI 2026. Code available at: this https URL
Abstract:Self-supervised learning in fluorescence microscopy often relies on 2D projections, despite the inherently three-dimensional nature of cells. We present a systematic comparison of 2D and 3D masked autoencoders (MAE-2D vs. MAE-3D) on volumetric microscopy data. Under matched architectures and training protocols, MAE-3D consistently outperforms 2D max-projection and slice-based variants on downstream single-cell tasks. We further align visual representations with a pretrained protein language model (ESM2) and show that cross-modal supervision yields larger gains for volumetric models. Channel cross-attention and frequency-domain regularization are critical for leveraging 3D spatial context. On a protein–protein interaction task, MAE-3D achieves a ROC–AUC of 0.865, outperforming prior methods by up to +0.025. For protein localization, our best 3D model attains state-of-the-art $\mathrm{AUC}_{\text{micro}}$ (0.952) and $\mathrm{F1}_{\text{micro}}$ (0.742), improving over previous approaches by +0.003 and +0.010 absolute, respectively. Overall, these results demonstrate the advantages of native 3D modeling and multimodal alignment for representation learning in single-cell microscopy.
[CV-108] DivRL: Disentangled Self-Similarity Rewards for Diverse Subject-Driven Generation ECCV2026
Link: https://arxiv.org/abs/2606.23950
Authors: Qian Wang, Zhenyu Li, Abdelrahman Eldesokey, Peter Wonka
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2026. Project page: this https URL
Abstract:Subject-driven image generation faces an “Identity-Diversity Paradox”, where strong identity preservation often leads to rigid and low-diversity outputs. We propose a post-training framework called DivRL that jointly optimizes identity consistency and structural diversity by leveraging disentangled visual features from a robust similarity model. Specifically, we introduce a Negative Self-Similarity Measure (nSSM) to quantify structural diversity, and Visual Semantic Matching (VSM) to evaluate identity consistency. We propose an “Explore-and-Suppress” strategy that treats VSM as a gated constraint: the model freely explores structurally diverse configurations, and only samples that violate the identity threshold are penalized via a quadratic hinge loss. This converts identity preservation from a competing objective into a feasibility constraint, allowing nSSM and VSM to improve jointly. Experiments demonstrate that, through this gated optimization formulation, our method pushes the model to generate images that are both consistent and diverse, improving structural diversity while maintaining comparable identity consistency.
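A gated constraint with a quadratic hinge penalty, as described in this abstract, can be sketched in a few lines. The abstract does not give the exact reward, so the functional form, the threshold value, and the penalty weight below are assumptions for illustration; the function and parameter names are hypothetical.

```python
def gated_reward(nssm, vsm, identity_threshold=0.8, penalty_weight=10.0):
    """Diversity reward with identity treated as a feasibility constraint.

    nssm: structural-diversity score (assumed: higher means more diverse).
    vsm: identity-consistency score (assumed: higher means better identity).
    Samples at or above the threshold are not penalized at all; samples below
    it pay a quadratic hinge penalty on the size of the violation.
    """
    violation = max(0.0, identity_threshold - vsm)
    return nssm - penalty_weight * violation ** 2

reward_ok = gated_reward(nssm=0.5, vsm=0.9)   # identity satisfied, no penalty
reward_bad = gated_reward(nssm=0.5, vsm=0.7)  # identity violated, penalized
```

Because the penalty is exactly zero inside the feasible region, the diversity term is the only signal there, which is one way to read "explore freely, suppress only violations".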
[CV-109] Trustworthy Image Authentication using Forensic Knowledge Graphs ECCV2026
Link: https://arxiv.org/abs/2606.23917
Authors: Tai D. Nguyen, Matthew C. Stamm
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and Published at ECCV 2026
Abstract:Advances in generative AI have made image falsification highly realistic, demanding trustworthy authentication systems. Existing forensic detectors can target certain forgery types but lack interpretability, while vision-language models (VLMs) provide explanations but cannot exploit forensic traces for reliable detection. We propose Forensic Knowledge Graphs (FKGs), a unified framework that integrates forensic evidence extraction, structured reasoning, and human-interpretable explanation. Our FKG structure encodes forensic traces along with their causal dependencies and links to scene content. To generate accurate FKGs, we introduce a novel forensic authentication network and an Iterative Context Refinement strategy that guides VLMs to produce faithful, grounded explanations. We also present FKG-50K, a dataset of 50,000 realistic forgeries with ground-truth FKGs. Experiments demonstrate that FKG outperforms both forensic detectors and VLMs in detection, forgery identification and localization, and forensic justification.
[CV-110] The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models
Link: https://arxiv.org/abs/2606.23897
Authors: Ahmad Algadhi, Ahmed Alzuhair, Omar Alkhulaif, Muzammil Behzad
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Prompt distillation compresses large vision-language models (VLMs) such as CLIP into lightweight student models by matching teacher predictions on unlabeled domain images. PromptKD (CVPR 2024) established this paradigm with a single PromptSRC-finetuned ViT-L/14 teacher and a ViT-B/16 student. We propose TheProfessor, a multi-teacher extension that distills from a fixed two-teacher ensemble: a domain-finetuned PromptSRC ViT-L/14 teacher and a zero-shot EVA-CLIP-L/14 teacher whose logits are pre-computed per dataset. We evaluate single-teacher PromptKD, equal-probability ensembling, and confidence-weighted ensembling on four base-to-novel datasets: Caltech-101, DTD, UCF101, and EuroSAT. In a 12-run single-seed sweep, confidence-weighted ensembling improves average HM from 87.52 to 89.28 (+1.77 points), while equal averaging improves average HM to 88.88 (+1.37 points). Gains are dataset dependent: they are negligible on Caltech-101 (+0.16 HM for confidence weighting), modest on UCF101 (+0.62), and largest on domain-shifted EuroSAT (+5.78). These results update our earlier Caltech-only analysis and show that multi-teacher prompt distillation is most useful when the second teacher contributes complementary supervision under domain shift.
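The two ensembling variants compared in this abstract can be sketched as follows. The abstract does not define how confidence is measured, so the maximum softmax probability is used here as an assumption. HM is taken to be the usual harmonic mean of base-class and novel-class accuracy, which is also an assumption. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def equal_ensemble(logits_a, logits_b):
    """Average the two teachers' class probabilities with equal weight."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return [(a + b) / 2 for a, b in zip(pa, pb)]

def confidence_weighted_ensemble(logits_a, logits_b):
    """Weight each teacher by its confidence on this sample.
    Confidence = max class probability (an assumption, not from the paper)."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    ca, cb = max(pa), max(pb)
    wa = ca / (ca + cb)
    return [wa * a + (1 - wa) * b for a, b in zip(pa, pb)]

def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base and novel accuracy."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

confident_teacher = [5.0, 0.0, 0.0]  # hypothetical logits
unsure_teacher = [0.0, 0.1, 0.0]
p_eq = equal_ensemble(confident_teacher, unsure_teacher)
p_cw = confidence_weighted_ensemble(confident_teacher, unsure_teacher)
```

With these inputs the confidence-weighted target leans further toward the confident teacher's class than the equal average does.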
[CV-111] REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs
Link: https://arxiv.org/abs/2606.23892
Authors: Yifei Zhao, Qian Lou, Mengxin Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 5 figures. Preprint
Abstract:Vision-language models (VLMs) are increasingly used as perception-reasoning backbones for embodied intelligence in safety-critical physical systems, where perception or reasoning errors can lead to unsafe decisions or actions. Although many red-teaming methods have been developed to probe VLM vulnerabilities, their evaluation remains fragmented across datasets, metrics, and threat models, making direct comparison difficult and obscuring whether observed differences arise from stronger attacks, more vulnerable models, or incompatible evaluation settings. Existing chatbot-centric red-teaming benchmarks mainly standardize jailbreak and content-safety evaluation, but they do not systematically capture physically grounded functional failures or cover red-teaming methods that target physical-world VLMs. This raises the key challenge of comparing diverse attack methods under a unified protocol while targeting the same scenario-specific failures. We introduce REALM, to our knowledge the first unified red-teaming benchmark for physical-world VLMs. REALM integrates 12 red-teaming methods, 3 model-agnostic defenses, and 13 VLMs under a practical black-box threat model with shared datasets and metrics. To align adversarial objectives across attack families, REALM introduces an agentic target-generation pipeline that constructs shared, scenario-specific, and physically grounded attack objectives for each scene, enabling fair comparison of diverse red-teaming methods under aligned adversarial goals. Our evaluation shows that text and typographic injection attacks induce the most failures, multimodal co-optimization yields the strongest visual-perturbation transfer, single-pass attacks approach iterative methods at much lower cost, and model scale alone does not confer adversarial robustness. Code is available at this https URL.
[CV-112] Machine Learning Modeling for Real-Time Melt Pool Monitoring in Laser Powder Bed Fusion Additive Manufacturing: A Hybrid Approach
Link: https://arxiv.org/abs/2606.23851
Authors: Inioluwa Emmanuel, Zhuo Yang, Ho Yeung, Xinyao Zhang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This work investigates the implementation of artificial intelligence and machine learning (AI/ML) for real-time monitoring in laser powder bed fusion (LPBF) additive manufacturing. We developed a binary image classification framework for distinguishing normal and abnormal melt pool images using a balanced dataset of 1,200 images collected from nickel superalloy 625 on the NIST AMMT platform. The study evaluates accuracy and inference time based on control requirements and hardware limitations of open-architecture LPBF machines. We benchmark three transfer learning architectures (ResNet50, EfficientNetB0, and MobileNetV2) against two Random Forest approaches: one trained on EfficientNetB0 feature embeddings (hybrid) and one trained on raw pixel features (baseline). Images are stratified into 80/20 train-test splits, with a further 90/10 validation split on the training set, and undergo standardized resizing, normalization, and label-preserving data augmentation to emulate realistic process variability. Each model is evaluated using accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC), along with training time, inference latency, and CPU/GPU usage to capture deployability constraints relevant to factory-floor monitoring. The hybrid EfficientNetB0-plus-Random Forest approach achieves the best performance on the held-out test set, with an F1 score of 0.9451, accuracy of 0.9458, and AUC of 0.9904, while maintaining roughly one-millisecond per-image inference (1.15 ms). In contrast, purely deep learning models exhibit significantly higher inference times with lower accuracy. These results demonstrate that combining pre-trained convolutional features with classical ensemble methods provides a robust, computationally efficient route to real-time melt pool anomaly detection in data-limited additive manufacturing environments.
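The split sizes implied by this abstract can be checked with a small stratified-split sketch. It assumes that "balanced" means 600 images per class and that the 90/10 validation split is also stratified; neither detail is stated explicitly. The helper name is hypothetical.

```python
import random

def stratified_split(labels, held_fraction, rng):
    """Return (kept_idx, held_idx), holding out held_fraction of each class."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    kept, held = [], []
    for idxs in by_class.values():
        idxs = idxs[:]  # copy so the shuffle does not touch by_class
        rng.shuffle(idxs)
        n_held = round(len(idxs) * held_fraction)
        held.extend(idxs[:n_held])
        kept.extend(idxs[n_held:])
    return kept, held

rng = random.Random(0)
labels = ["normal"] * 600 + ["abnormal"] * 600  # assumed 600 per class

# 80/20 train-test split, stratified by class.
trainval_idx, test_idx = stratified_split(labels, 0.20, rng)

# 90/10 train-validation split on the training portion.
trainval_labels = [labels[i] for i in trainval_idx]
train_pos, val_pos = stratified_split(trainval_labels, 0.10, rng)

split_sizes = (len(train_pos), len(val_pos), len(test_idx))
test_normal = sum(1 for i in test_idx if labels[i] == "normal")
```

Under these assumptions the model sees 864 training, 96 validation, and 240 test images, with each split class-balanced.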
[CV-113] ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation
Link: https://arxiv.org/abs/2606.23835
Authors: Anindya Mondal, Sauradip Nag, Anjan Dutta
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Under review, webpage: this https URL
Abstract:ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without requiring any benchmark-specific training. Our model is built on an existing 3B-parameter unified foundation model and is adapted for object localization tasks using three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy where the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations. ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.
[CV-114] From Spatial to Spectral: An Efficient Frequency-Guided Feature Representation Learner for Small Object Detection
Link: https://arxiv.org/abs/2606.23825
Authors: Yuhan Rui, Shihan Qiao, Yibin Lou, Mingxi Yu, Yutong Wan, Yanqiao Chen, Dongsheng Hou, Zhen Cao, Athena Zhuoming Zhong, Qi Hao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Efficient small object detection is bottlenecked by the inherent feature scarcity of tiny targets, which is further aggravated by operations of spatial-domain detectors that indiscriminately discard critical high-frequency details. Recovering these fragile cues within the spatial domain is notoriously difficult, as it often requires computationally expensive architectural upscaling that inadvertently amplifies background noise. To bridge this gap, we propose a paradigm shift from spatial to spectral feature processing, introducing a holistic solution with the following novelty: (1) A versatile Frequency-Guided Feature Representation framework that generalizes across diverse detector architectures (both CNN and Transformer-based), offering a robust alternative to spatial-only feature extraction; (2) The unified Decompose–Enhance–Reconstruct (DER) operator, instantiated via three lightweight, plug-and-play modules – Wavelet-Difference Gate (WDG), Log-Gabor Enhancer (LGE), and Frequency-Driven Head (FDHead) – to systematically inject frequency-aware modulation into the backbone, neck, and head. This mechanism decouples feature modeling from resolution reduction, capturing discriminative high-frequency components to enable accurate localization with significantly reduced parameter redundancy; (3) Extensive validation on multi-domain benchmarks (VisDrone2019, UAVDT, TinyPerson, DOTAv1) demonstrating consistent gains. Notably, our proposed DERNet series outperforms YOLOv11 models under the same scale while requiring only 1/6 of the parameters, backed by rigorous spectral diagnostics and error decomposition analysis.
[CV-115] Listening makes Vision Clear for VLMs
Link: https://arxiv.org/abs/2606.23763
Authors: Yiyang Chen, Yixin Tan, Binrui Shen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures
Abstract:Recent work typically assesses vision–language consistency using attention distributions of answer-side tokens. However, we observe that the highest-attention regions are not always consistent with the intended semantic token. This probably stems from decoding drift, where language priors from previously generated answer tokens accumulate and mismatch with visual attention. Besides the priors from previous answer tokens, we find that structural tokens, e.g., modality boundary markers, may encompass the entire context and generate high attention to areas unrelated to the target. To avoid these distortions and provide consistency evaluation for large VLMs, we adopt prompt-side semantics and propose the Prompt-Vision Token Activation Map (PV-TAM). PV-TAM further incorporates a filter to remove systematic bias induced by modality boundary markers. Unlike traditional methods that evaluate overlap solely through masks while ignoring activation intensity, our metrics leverage the peak distribution of attention to measure the alignment between prompts and visual regions. In experiments, PV-TAM consistently improves both attention-based and IoU-style localization metrics over answer-side baselines on various datasets.
[CV-116] Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation
Link: https://arxiv.org/abs/2606.23743
Authors: Yitong Li, Junsong Chen, Haopeng Li, Haozhe Liu, Jincheng Yu, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. Although many acceleration methods have been proposed, a central challenge is that the most effective acceleration strategy is highly instance-specific: a recipe that works well for one combination of model, hardware, and inference configuration often does not transfer to another. Different models vary in architecture, numerical sensitivity, and attention concentration patterns. Inference settings differ in spatial and temporal resolution and video duration, while hardware platforms differ in memory hierarchy, supported numerical formats, and kernel throughput. These factors create a large tuning space, making manual performance engineering costly. We present Sol Video Inference Engine, an agentic, native, training-free acceleration framework for video diffusion models. It organizes five broadly applicable techniques, cache, sparse attention, token pruning, quantization, and kernel fusion, into an agentic acceleration stack for instance-specific optimization. For a concrete deployment target defined by a model, hardware platform, and serving configuration, parallel skill agents optimize the implementation of each technique, an agent integrator composes them into a global acceleration stack, and a human validator provides feedback on generation quality. We instantiate this workflow on three video models with different sizes and architectures: 64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video. With little human effort, the full stack achieves more than 2x end-to-end acceleration while maintaining near-lossless VBench quality, demonstrating the effectiveness of the agent framework for video diffusion acceleration.
[CV-117] Systematic Exploration of 4-Expert Heterogeneous Mixture-of-Experts via Automated Pipeline Search
Link: https://arxiv.org/abs/2606.23739
Authors: Yashkumar R Lukhi, Harsh Rameshbhai Moradiya, Radu Timofte, Dmitry Ignatov
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: 8 pages, 2 figures
Abstract:We present an automated large-scale search pipeline for heterogeneous 4-Expert Mixture-of-Experts (MoE4) architectures within the LEMUR neural network dataset ecosystem. Building on a hand-crafted heterogeneous MoE reference model, we replace manual design with a deterministic code-assembly generator that systematically combines base architecture families drawn from the LEMUR database into MoE4 ensembles, each governed by a convolutional gating network with temperature scaling, mixup augmentation, and cosine-annealed learning rate scheduling. Over a 28-day campaign on an NVIDIA RTX 4090, the pipeline generated 4,463 candidate models across 197 batches, of which 1,021 were evaluated successfully. A critical finding emerged from the campaign: due to alphabetical enumeration via this http URL, the entire explored search space (4.8% of the theoretical 23,751 possible 4-family combinations) is anchored to a single family, AirNet. We characterise this coverage bias precisely, identify the root cause in the generator, and propose a stratified random sampling fix. Within the AirNet-anchored scope, ShuffleNet and MobileNetV3 consistently co-produce the highest-accuracy ensembles (mean accuracy up to 0.632), while FractalNet and MNASNet are identified as low-yield families warranting exclusion in future campaigns. The pipeline, analysis artefacts, and corrected generator are released as part of the open-source NNGPT project at this https URL.
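The coverage bias reported in this abstract can be reproduced on placeholder data. The abstract does not state the number of families, but 23,751 equals C(29, 4), so 29 families are assumed. The sketch also assumes lexicographic enumeration of the kind Python's `itertools.combinations` produces, since the call named in the abstract is lost behind a garbled link. Family names other than AirNet are placeholders, and the per-family stratification is only one possible reading of "stratified random sampling".

```python
import itertools
import math
import random

# 29 families is inferred from C(29, 4) = 23,751; names are placeholders.
families = ["AirNet"] + [f"Family{i:02d}" for i in range(1, 29)]
total = math.comb(len(families), 4)

# Taking a prefix of a lexicographic stream never leaves the first family:
# the first C(28, 3) = 3,276 combinations all start with it.
n_explored = 1140  # roughly 4.8% of 23,751
prefix = list(itertools.islice(itertools.combinations(families, 4), n_explored))
anchored = all(combo[0] == "AirNet" for combo in prefix)

def stratified_sample(families, per_family, rng):
    """Draw per_family distinct random 4-combinations containing each family."""
    chosen = set()
    for anchor in families:
        others = [f for f in families if f != anchor]
        picked = 0
        while picked < per_family:
            combo = tuple(sorted([anchor] + rng.sample(others, 3)))
            if combo not in chosen:
                chosen.add(combo)
                picked += 1
    return sorted(chosen)

sample = stratified_sample(families, per_family=40, rng=random.Random(0))
covered = {f for combo in sample for f in combo}
```

A sample of comparable size (1,160 combinations here) then touches every family instead of only those paired with the alphabetically first one.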
[CV-118] I2E: Real-Time Image-to-Event Conversion for High-Performance Spiking Neural Networks AAAI-26
Link: https://arxiv.org/abs/2511.08065
Authors: Ruichen Ma, Liwei Meng, Guanchao Qiao, Ning Ning, Yang Liu, Shaogang Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: AAAI-26 Oral
Abstract:Spiking neural networks (SNNs) promise highly energy-efficient computing, but their adoption is hindered by a critical scarcity of event-stream data. This work introduces I2E, an algorithmic framework that resolves this bottleneck by converting static images into high-fidelity event streams. By simulating microsaccadic eye movements with a highly parallelized convolution, I2E achieves a conversion speed over 300x faster than prior methods, uniquely enabling on-the-fly data augmentation for SNN training. The framework’s effectiveness is demonstrated on large-scale benchmarks. An SNN trained on the generated I2E-ImageNet dataset achieves a state-of-the-art accuracy of 60.50%. Critically, this work establishes a powerful sim-to-real paradigm where pre-training on synthetic I2E data and fine-tuning on the real-world CIFAR10-DVS dataset yields an unprecedented accuracy of 92.5%. This result validates that synthetic event data can serve as a high-fidelity proxy for real sensor data, bridging a long-standing gap in neuromorphic engineering. By providing a scalable solution to the data problem, I2E offers a foundational toolkit for developing high-performance neuromorphic systems. The open-source algorithm and all generated datasets are provided to accelerate research in the field.
[CV-119] Female-RHINO: A Real-Time Scanner-Integrated Framework for Automated Quantitative Uterine MRI Analysis and Structured Reporting
Link: https://arxiv.org/abs/2606.24390
Authors: Deepak Bhatia, Saad Ahmad, Smiti Tripathy, Maria Camila Bustos Vivas, Lieselotte Kratzsch, Anika Knupfer, Jordina Aviles Verdera, Susanne Schulz-Heise, Matthias May, Jana Hutter
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Standardized assessment of uterine MRI remains challenging due to anatomical variability, observer dependence, and the lack of workflow-integrated automated analysis tools. This work presents Female-RHINO: (R)eproductive (H)ealth (I)maging A(N)alysis T(O)ol, a real-time AI-assisted framework for automated quantitative uterine MRI analysis and structured reporting during image acquisition. We present an end-to-end system that integrates inline communication with the MRI scanner and deep learning-based analysis to derive quantitative uterine biomarkers from sagittal T2-weighted pelvic MRI. The framework combines segmentation and anatomical landmark detection models trained and evaluated on more than 500 multi-center datasets spanning diverse protocols, vendors, and patient populations. It performs volumetry, detects and quantifies common incidental findings such as fibroids and Nabothian cysts, and extracts six anatomical landmarks for biometric assessment. Results are compiled into a structured clinician-oriented report with integrated visualizations, without manual interaction. Evaluation on independent retrospective and prospective cohorts demonstrated robust performance across varying acquisition settings. Mean Dice similarity coefficients were 0.82 for the uterus and 0.80 for fibroids, with lower but consistent agreement for Nabothian cysts. Landmark detection achieved a mean radial error of 3.7 mm. End-to-end processing was completed in under 70 seconds, enabling availability of results during the ongoing scan. Prospective deployment yielded immediate, standardized, and reproducible analyses supported by inter-observer agreement. The proposed system enables real-time scanner-integrated AI for automated uterine MRI analysis and reporting, with potential to improve standardization, efficiency, and clinical workflow in pelvic imaging.
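The two evaluation metrics quoted in this abstract have standard definitions, sketched below. Representing masks as sets of voxel indices and landmarks as coordinate tuples is a convenience chosen for this example, not something taken from the paper. The function names are hypothetical.

```python
import math

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks,
    each given as a set of voxel index tuples."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

def mean_radial_error(predicted, reference):
    """Mean Euclidean distance between predicted and reference landmarks,
    in the same units as the coordinates (for example millimetres)."""
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances)

pred_mask = {(0, 0), (0, 1), (1, 1)}
true_mask = {(0, 1), (1, 1), (2, 2)}
dsc = dice(pred_mask, true_mask)  # 2 shared voxels out of 3 + 3

mre = mean_radial_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

A Dice score of 1 means identical masks and 0 means no overlap, so the reported 0.82 and 0.80 indicate substantial but imperfect agreement with the reference segmentations.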
[CV-120] Automated Residual Plot Assessment With the R Package autovi and the Shiny Application autovi.web
Link: https://arxiv.org/abs/2606.24236
Authors: Weihao Li, Dianne Cook, Emi Tanaka, Susan VanderPlas, Klaus Ackermann
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published in the Australian & New Zealand Journal of Statistics
Abstract:Visual assessment of residual plots is a common approach for diagnosing linear models, but it relies on manual evaluation, which does not scale well and can lead to inconsistent decisions across analysts. The lineup protocol, which embeds the observed plot among null plots, can reduce subjectivity but requires even more human effort. In today’s data-driven world, such tasks are well suited for automation. We present a new R package that uses a computer vision model to automate the evaluation of residual plots. An accompanying Shiny application is provided for ease of use. Given a sample of residuals, the model predicts a visual signal strength (VSS) and offers supporting information to help analysts assess model fit.
[CV-121] A Dual Edge Spatial Jacobian Image Graph for Interpretable Diabetic Retinopathy Grading
Link: https://arxiv.org/abs/2606.24168
Authors: Inam Ullah, Imran Razzak, Shoaib Jameel
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:
Abstract:Automated diabetic retinopathy (DR) grading from colour fundus photographs can achieve strong predictive performance, but clinical interpretation requires more than an image-level label. It requires understanding how lesion evidence is distributed around retinal vessels and how this evidence relates to quantitative vascular biomarkers. We present a dual-edge spatial-Jacobian image graph for interpretable DR grading. Each fundus image is represented as a graph node with four aligned evidence streams: AutoMorph vessel information ($X_1$), DR-XAI-style lesion evidence maps ($X_2$), a 128-dimensional lesion-based contrastive image embedding ($X_3$), and AutoMorph morphometric biomarkers ($X_4$). The spatial edge branch ($X_{12}$) encodes vessel-lesion geometry, while the Jacobian branch ($X_{34}$) models embedding-biomarker sensitivity. Lightweight two-token attention fuses both edge families into a final image graph. On 2,910 matched non-augmented APTOS images, the full graph achieves 0.8076 accuracy, 0.8312 quadratic weighted kappa, 0.5915 macro-F1, and 0.9330 adjacent-grade accuracy; referable DR reaches 0.9055 accuracy and 0.9711 AUROC. The framework is positioned as an explainable representation-learning tool for lesion-biomarker hypothesis generation, rather than as a deployment-ready clinical classifier. The code is available at this https URL.
[CV-122] E-MRL: Cross-view Aligned Evidence-driven Multimodal Reinforcement Learning for Reliable 3D Tumor Analysis
Link: https://arxiv.org/abs/2606.23888
Authors: Sijing Li, Zhongwei Qiu, Zhuoya Wang, Boxiang Yun, Zhenyu Yi, Jianwei Xu, Wenqiao Zhang, Yingda Xia, Ling Zhang
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 2 figures
Abstract:While Vision-Language Models (VLMs) show great promise in volumetric medical report generation, they frequently suffer from visual hallucinations and a lack of grounding in 3D CT data. Current Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) strategies typically optimize text fidelity alone, essentially rewarding correct diagnoses derived from language priors rather than genuine visual perception. To address this, we propose cross-view aligned Evidence-driven Multimodal Reinforcement Learning (Evidence-MRL, denoted E-MRL), a reliable RL reasoning framework that formulates the generation process as a Markov Decision Process of “diagnosis-localization-verification”. Unlike standard approaches, our model is explicitly trained to identify a “key evidence slice” alongside the global diagnostic report, grounding its findings in verifiable visual evidence. Crucially, we introduce a novel cross-view consistency reward, which validates the semantic alignment between the gold-standard report and a local visual re-query of the selected key slice, providing additional rewards for correctly localized reasoning. Experiments on large-scale 3D CT tumor datasets demonstrate that E-MRL significantly reduces hallucinations and improves diagnostic accuracy compared to SFT and RL baselines, offering a clinically interpretable solution for visually grounded tumor analysis.
[CV-123] Performance and Interpretability of Convolutional, Transformer, and Hybrid Deep Learning Models in Colorectal Histology Classification
Link: https://arxiv.org/abs/2606.23744
Authors: Reza Bozorgpour
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep learning has become an important tool in computational pathology, enabling automated analysis of histopathological images. While convolutional neural networks (CNNs) have traditionally dominated this field, transformer-based and hybrid architectures have recently demonstrated promising performance. However, comprehensive comparisons of these approaches for colorectal histopathology remain limited. This study evaluated twelve ImageNet-pretrained CNN, transformer, and hybrid architectures using the Kather colorectal histopathology dataset containing 5,000 image tiles from eight tissue classes. All models were trained using a standardized transfer-learning and fine-tuning protocol and assessed using multiple performance metrics, including accuracy, precision, sensitivity, specificity, F1-score, ROC-AUC, Cohen’s kappa, and Matthews correlation coefficient. All evaluated models achieved high classification performance, with accuracies ranging from 93.2% to 97.1%. EVA-02 achieved the highest overall performance (97.1% accuracy, 97.0% F1-score), closely followed by ViT-B/16. Among CNNs, ResNet34 and ConvNeXt-Tiny demonstrated highly competitive performance, achieving accuracies of 96.4% and 96.3%, respectively. Transformer architectures generally produced the strongest results across evaluation metrics, although the performance gap between the best transformer and CNN models was relatively small. Per-class analysis showed consistently strong classification performance across all tissue categories, with Complex Stroma representing the most challenging class. Overall, transformer-based architectures achieved the highest predictive performance, whereas modern CNNs provided a favorable balance between accuracy and model complexity. These findings provide a comprehensive benchmark of major deep learning paradigms for colorectal histopathology classification.
Artificial Intelligence
[AI-0] InSight: Self-Guided Skill Acquisition via Steerable VLAs
Link: https://arxiv.org/abs/2606.24884
Authors: Maggie Wang, Lars Osterberg, Stephen Tian, Ola Shorinwa, Jiajun Wu, Mac Schwager
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project website: this https URL
Abstract:Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., “move gripper to the bowl”, “lift upward”, “pour the bottle”). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: this https URL.
[AI-1] OpenThoughts-Agent: Data Recipes for Agentic Models
Link: https://arxiv.org/abs/2606.24855
Authors: Negin Raoof,Richard Zhuang,Marianna Nezhurina,Etash Guha,Atula Tejaswi,Ryan Marten,Charlie F. Ruan,Tyler Griggs,Alexander Glenn Shaw,Hritik Bansal,E. Kelly Buchanan,Artem Gazizov,Reinhard Heckel,Chinmay Hegde,Sankalp Jajee,Daanish Khazi,Emmanouil Koukoumidis,Xiangyi Li,Hange Liu,Shlok Natarajan,Harsh Raj,Nicholas Roberts,Ethan Shen,Nishad Singhi,Michael Siu,Ashima Suvarna,Hanwen Xing,Patrick Yubeaton,Robert Zhang,Leon Liangyu Chen,Xiaokun Chen,Steven Dillmann,Saadia Gabriel,Xunyi Jiang,Anurag Kashyap,Boxuan Li,Yein Park,Minh Pham,Sujay Sanghavi,Lin Shi,Ke Sun,Yixin Wang,Zhiwei Xu,Erica Zhang,Siyan Zhao,Wanjia Zhao,Jenia Jitsev,Alex Dimakis,Benjamin Feuer,Ludwig Schmidt
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Agentic language models dramatically expand the applications of AI yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at this http URL to support future open research on agentic model training.
[AI-2] World Models in Pieces: Structural Certification for General Agents ICML2026
Link: https://arxiv.org/abs/2606.24842
Authors: Yikai Lu,Yifei Wu,Xinyu Lu,Tongxin Li
Subjects: Artificial Intelligence (cs.AI)
Comments: 30 pages, camera-ready version in ICML 2026
Abstract:In the big-world regime, agents cannot be universally capable and their ability is inevitably specialized across a world model in pieces. Consequently, standard uniform guarantees fail to distinguish between the understanding of critical bottlenecks and irrelevant failures. We first formalize this limitation by proving that general agents are not universal, rendering standard worst-case analysis uninformative. To overcome this, we introduce structural certification, a transition-local framework that maps bounded goal-conditioned performance to entry-wise guarantees on the agent’s internal world model. Our main contribution is constructive. We provide algorithms that filter specific transitions using deep compositional goals and prove that a general agent on these goals has a structural world model with a $\mathcal{O}(1/n) + \mathcal{O}(\delta)$ error bound. Conversely, this bound is tight in the small-$\delta$ regime, whose existence is explicitly guaranteed by our certification. These results enable the certifiable deployment of general agents by localizing the specific transitions where long-horizon planning is reliable.
[AI-3] Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System
Link: https://arxiv.org/abs/2606.24839
Authors: Tian Zheng,Kai-Tai Hsu
Subjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:
Abstract:Agentic data analysis systems produce rich outputs, including code, numerical results, and verbal diagnostics. This makes them more challenging to evaluate than single-turn LLM responses. It is therefore necessary to distinguish genuine disagreement between an agent’s output and a ground-truth answer from grading artifacts. We investigate how reliably automated graders assess such a system and what strategies improve grading quality by applying LAMBDA, a multi-agent data-analysis system, on 153 numerical QRData tasks from DSGym. We develop and evaluate a three-layer human-AI grading cascade: strict regex matching, LLM-based lenient grading, and snippet-based human inspection, which combines non-GenAI and GenAI strategies with different failure profiles. Both automated graders achieve 100% observed precision (0/70 false positives). The lenient grader’s recall is 97% against human labels. A keyword-anchored extraction pipeline raises the strict grader’s recall by 60 percentage points over a last-number heuristic; the lenient grader is architecturally parser-independent. An iterative nudge mechanism raises grading run success from 36% to 97% and lenient-pass rates from 16% to 46%; comparing nudging with and without original-question re-injection shows that re-injection offers no benefit, confirming the nudge as an answer template cue. We further observe in this case study that variable type is the task metadata field most consistently associated with grading pipeline dynamics and observed outcome grades.
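The three-layer cascade described in this abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the function names, the "answer" keyword anchor, the relative tolerance that stands in for the LLM-based lenient grader, and the sample strings are hypothetical and are not the authors' pipeline.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?"

def last_number(text):
    """Baseline heuristic: take the last number that appears in the output."""
    found = re.findall(NUMBER, text)
    return float(found[-1]) if found else None

def keyword_anchored(text, keyword="answer"):
    """Take the first number that follows a keyword (hypothetical anchor)."""
    match = re.search(rf"{keyword}\D*?({NUMBER})", text, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None

def grade(output, truth, rel_tol=0.01):
    """Return 'strict_pass', 'lenient_pass', or 'needs_human'."""
    value = keyword_anchored(output)
    # Layer 1: strict matching on the extracted number.
    if value is not None and value == truth:
        return "strict_pass"
    # Layer 2: stand-in for the lenient grader (assumption: a tolerance check).
    candidate = value if value is not None else last_number(output)
    if candidate is not None and abs(candidate - truth) <= rel_tol * abs(truth):
        return "lenient_pass"
    # Layer 3: anything unresolved is escalated to human inspection.
    return "needs_human"

sample = "The final answer is 0.42 (n = 153)"
print(last_number(sample), keyword_anchored(sample), grade(sample, 0.42))
```

On the sample string the last-number heuristic picks up the trailing count (153) while the keyword anchor recovers 0.42, which is the kind of extraction gap the abstract attributes to the two strategies.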
[AI-4] Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment SIGDIAL2026
Link: https://arxiv.org/abs/2606.24834
Authors: Ali Pourghasemi Fatideh,Wilder Baldwin,Maria Dhakal,Collin McMillan,Sepideh Ghanavati
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures. Accepted to SIGDIAL 2026 (27th Annual Meeting of the Special Interest Group on Discourse and Dialogue)
Abstract:LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a critical gap in assessing the quality and accuracy of these conversations when handling Non-Functional Requirements (NFRs), which are inherently vague, context-dependent, and involve many parts of a program. Evaluating how well these systems support collaborative reasoning about NFRs requires methods that go beyond single-turn accuracy to capture both the correctness of the system’s outputs and the quality of the multi-turn interaction. In this paper, we investigate the accuracy and quality of multi-turn conversations between developers and an LLM-based agent in the domain of Health Insurance Portability and Accountability Act (HIPAA) regulatory compliance. We hired 49 programmers to interact with GitHub Copilot to assess 148 HIPAA-derived NFRs against the iTrust codebase, a system designed to comply with HIPAA regulations, across three dimensions: requirement satisfaction level, reasoning, and code localization. We find that developers tend to agree with LLM assessments, but accuracy against expert ground truth is low. We model user satisfaction and find that longer system responses and more information-providing turns negatively affect user satisfaction, whereas proactive interactions positively affect it. Our findings provide insights for designing LLM-based dialogue systems that support NFR assessment.
[AI-5] Difference-Making without Making a Difference
Link: https://arxiv.org/abs/2606.24832
Authors: Sander Beckers
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Over a series of seven papers, Andreas and Günther have introduced seven definitions of actual causation and have classified them as belonging to three different, competing, types of accounts: factual difference-making, counterfactual difference-making, and regularity-based. I show that their most recent - factual difference-making - definition instantiates all three types, thereby proving that these are distinctions without a difference. I further compare their novel account to the other six accounts on several crucial examples, revealing that this undermines all seven of their accounts.
[AI-6] Solving Inverse Problems of Chaotic Systems with Bidirectional Conditional Flow Matching
Link: https://arxiv.org/abs/2606.24824
Authors: Peiyan Hu,Jian Zhang,Jiashu Pan,Ruiqi Feng,Tao Zhang,Zhi-Ming Ma,Yuan-Sen Ting,Gongjie Li,Tailin Wu
Subjects: Artificial Intelligence (cs.AI)
Comments: 50 pages, 17 figures
Abstract:Modeling chaotic systems is crucial yet challenging. Inverse problems in chaotic dynamics, namely inferring initial conditions from final states, remain largely unsolved because of ill-posedness, non-uniqueness, instability, and potentially chaotic time-reverse dynamics. We address this open problem with Bidirectional Conditional Flow Matching (Bi-CFM), which learns bidirectional mappings between distributions of initial and final states to capture the stochasticity of chaotic evolution and mitigate exponential error accumulation over time. Furthermore, for systems with conservation laws, we extend it to Conservation-constrained Bi-CFM (CBi-CFM). Across the classic Lorenz, Circuit, and high-dimensional Lorenz 96 systems, Bi-CFM improves five distribution-level metrics over baselines while achieving a speedup of more than two orders of magnitude. In the three-body planet-planet scattering problem in planetary dynamics, CBi-CFM better respects conservation laws, with conservation errors comparable to those of the ground truth. Finally, on real observations of globular clusters, collisional million-body systems shaped by $\sim 10^{10}$ years (10 Gyr) of evolution, our method represents an advance in accuracy, establishing a scalable route to solving inverse problems of long-timescale real-world chaotic dynamics.
[AI-7] Grad Detect: Gradient-Based Hallucination Detection in LLMs ICML2026
Link: https://arxiv.org/abs/2606.24790
Authors: Anand Kamat,Daniel Blake,Brent M. Werness
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the 2nd Workshop on Compositional Learning at ICML 2026, Seoul, South Korea. Copyright 2026 by the author(s)
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet they remain prone to generating hallucinations. Detecting these hallucinations is critical for deploying LLMs reliably in high-stakes applications. We present Grad Detect, a gradient-based approach for predicting hallucinations by analyzing layer-wise gradient patterns from a single forward-backward pass during inference. Our method shows that the internal gradient structure of a model carries rich information about the correctness of its output. This information is not accessible through output-level signals alone. We evaluate Grad Detect on several QA benchmarks across both hallucination detection and model abstention prediction, where it consistently outperforms confidence-based and sampling-based baselines. Through comprehensive layer ablation studies across all eleven models from four architectural families, we find that the final five layers concentrate over 97% of the discriminative gradient signal, enabling efficient deployment with minimal performance loss. Grad Detect provides a unified framework for predicting multiple dimensions of LLM reliability, offering strong predictive performance alongside interpretable insights into where and how model failures originate.
[AI-8] BluTrain: A C/CUDA Framework for AI Systems
Link: https://arxiv.org/abs/2606.24780
Authors: Adhitya Charan,Adwaid Suresh,Anuj Kumar,Aparna A,Dhanakumar K,Dharun M S,Dinesh G,Goutham Kumar Reddy K,Harshini V M,Jenifa D,Jona Delcy C A,Kathirvel S,Killi Uma Maheswara Rao,Kiruthik Kanna M,Kurra Vishnu Sai,Madhumithaa G K,Navin Kumar V,Ram Charan Golla,Revathi T,Rishikkanth R,Sanjay Krishna M V,Surendra Vendra
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Progress in deep learning is, at scale, more a matter of systems engineering than of modelling: the behaviour of a model in training (its throughput, its memory footprint, and the numerical fidelity of the result) is determined less by the architecture itself than by how that architecture is expressed on the hardware. To achieve absolute control over this hardware expression while abstracting away systems complexity to make modelling seamless and eliminating the need for repetitive orchestration logic, BluTrain was architected from first principles as a robust, lightweight, and architecture-general training framework in standard C++ and the core CUDA programming model. Every layer is implemented natively: a typed tensor module with reverse-mode autograd, a linear-algebra library, a caching allocator, a multi-mode distributed-execution module, and an MLIR-based deep-learning compiler. In formal evaluations training a 124M-parameter GPT-2 baseline in FP32 on an 8-GPU 6000 Ada system, BluTrain outperforms industry-standard baselines in both throughput (sustaining an average of 407K tokens/s versus PyTorch’s 395K tokens/s) and memory efficiency (achieving up to a 22% footprint reduction), while strictly preserving numerical fidelity and converging to a marginally lower final validation loss. With every layer explicitly open to native tuning, the performance ceiling is the framework’s own to raise.
[AI-9] Context-Aware Prediction of Student Quiz Performance with Multimodal Textbook Features
Link: https://arxiv.org/abs/2606.24770
Authors: Samin Khan
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 2 figures, 2 tables
Abstract:Educational platforms often predict student performance from prior interactions, but the assessment content itself also varies in linguistic and visual complexity. This paper studies whether lightweight content features extracted from CourseKata chapter-review questions improve prediction of end-of-chapter quiz scores beyond a student’s average prior exercise performance. The study combines 2023 CourseKata student response data with chapter-level text features from review-question wording and image features from textbook visuals. Across 4,742 student-chapter observations from 562 class-student IDs, adding content features improves student-grouped five-fold quiz prediction performance by 9.1% relative to a prior-performance baseline. In leave-chapter-out validation, text features reduce prediction error relative to the baseline, while image-containing models have higher error. This paper suggests that a context-aware model adds useful signal about the text and visual features of questions to better predict student quiz performance compared with using past student performance alone.
[AI-10] Can Scale Save Us From Plasticity Loss in Large Language Models?
Link: https://arxiv.org/abs/2606.24752
Authors: J. Fernando Hernandez-Garcia,Tomás Figliolia,Beren Millidge
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The loss of plasticity - the ability of a network to learn new information after having already learned older information - is a fundamental challenge in creating artificial neural networks capable of continual learning. Although this phenomenon has been known for decades, it has mostly been studied in older, relatively small architectures and rarely in natural-language domains. To determine whether loss of plasticity remains a problem in the modern transformer-based LLM paradigm, we study plasticity loss in GPT-style Transformer models trained on a multilingual continual learning problem. Consistent with prior work, we find evidence of plasticity loss across models ranging from 5M to 314M non-embedding parameters, as measured by deterioration on a held-out Vietnamese probing task. We further find that the onset of plasticity loss follows a predictable scaling law, growing sublinearly with model size. These results suggest that larger models may delay the measurable effects of plasticity loss, but that increasing parameter count alone is likely to be insufficient to completely prevent it. We also find evidence of plasticity loss under stationary multilingual training, challenging the view that the phenomenon is exclusive to continual learning with abrupt task changes. Overall, our results suggest that even large Transformer language models trained on natural-language will eventually lose the ability to efficiently adapt to new data after sufficiently long training, in both continual and stationary settings.
[AI-11] Scaling Laws for Task-Specific LLM Distillation
Link: https://arxiv.org/abs/2606.24747
Authors: Lavinia Ghita,Dhruv Desai,Ioana Boier
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 24 pages, 13 figures
Abstract:Large Language Models (LLMs) achieve strong performance across a growing range of domains, yet their scale poses deployment challenges in applications where latency and cost constraints are critical. This paper derives empirical scaling laws for domain-specific LLM compression, quantifying how in-domain and general knowledge performance scale with dataset size, compression ratio, supervision format, and iterative pruning schedule. Using quantitative finance as our application domain, we compare logit-based and LoRA-based distillation under iterative structural pruning, introducing a blended chain-of-thought supervision loss that stabilizes KL-divergence distillation over reasoning traces. In-domain task quality degrades predictably under compression while general-knowledge benchmarks collapse well before the same point; supervision format is the key driver of this tradeoff, with chain-of-thought supervision actively recovering general knowledge that pruning erases. We release the headline dataset FinHeadlineMix, scaling law results, and practical recommendations to provide a reusable framework for domain-specific compression decisions.
[AI-12] Beyond U-Net: A Latent-Representation-Aligned Skip-Free Backbone for Flow-Matching Speech Enhancement
Link: https://arxiv.org/abs/2606.24745
Authors: Wangyi Pu,Michele Scarpiniti
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Generative models, particularly diffusion and score-based approaches, have recently achieved strong performance in speech enhancement, but their iterative sampling process limits real-time deployment. Flow Matching offers an efficient alternative by transporting noisy speech toward clean speech through an ordinary differential equation with few function evaluations. In this work, we propose a skip-free encoder-decoder backbone for flow-matching speech enhancement, guided by Latent Representation Alignment (LRA). Instead of relying on U-Net skip connections, which may transfer noise-correlated low-level features to the decoder, the proposed model aligns its bottleneck and decoder representations with clean latent features extracted from a frozen Descript Audio Codec encoder-decoder without quantization. This codec-aligned supervision promotes compact clean-speech representations while preserving efficient few-step inference. Experiments on WSJ0-CHiME3 and VoiceBank-DEMAND show improved PESQ and perceptual quality, especially on VoiceBank-DEMAND, using only five function evaluations.
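The few-step inference mentioned in this abstract amounts to integrating an ordinary differential equation with a handful of velocity evaluations. The sketch below is a toy illustration under stated assumptions: it uses plain Euler integration, and `toy_velocity` is a hypothetical closed-form field for a straight-line path toward a known target, standing in for the learned network, which is not reproduced here.

```python
def euler_sample(x_start, velocity, steps=5):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps.

    Each step costs one call to `velocity`, so `steps` is the number of
    function evaluations.
    """
    x = list(x_start)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(target):
    """Hypothetical stand-in for a trained model: the velocity of the
    straight-line path that reaches `target` at t=1."""
    def field(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return field

noisy = [0.9, -0.3, 0.4]   # made-up "noisy" feature vector
clean = [0.5, 0.0, 0.1]    # made-up "clean" feature vector
estimate = euler_sample(noisy, toy_velocity(clean), steps=5)
print(estimate)
```

With a straight-line field the five Euler steps land on the target up to rounding; a learned field is only approximately straight, which is why the number of evaluations trades off against quality.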
[AI-13] Decentralised AI Training and Inference with BlockTrain
Link: https://arxiv.org/abs/2606.24722
Authors: Peter Toth
Subjects: Artificial Intelligence (cs.AI)
Comments: First arXiv version. 17 pages
Abstract:Frontier AI training is increasingly shaped by access to dense, centrally controlled accelerator clusters. This creates a structural advantage for hyperscalers and large centralized laboratories, and makes open or independent AI efforts depend on scarce capital, privileged infrastructure, and data-center geography. We present Spheroid BlockTrain, a decentralized training protocol in which a model is partitioned into independently trainable blocks, each optimized on a local objective derived from the same global target and composed at inference into one model. On byte-level WikiText, BlockTrain reaches cross entropy 1.359 (perplexity 3.89), within about 0.04 CE of a same-setup end-to-end Transformer reference, while each active worker trains only one block and avoids full-model optimizer state. A shared six-worker block training run reaches CE 1.385 by averaging same-block updates into one assembled model. HTTP/TCP transport experiments move real serialized checkpoints and updates, including a public-IP three-host run that improves CE from 5.580 to 1.811 while moving 15.22 GB. For inference, the current BlockTrain path uses one block-stack traversal per full output and serves over direct TCP across three public-network GPU hosts up to a 75.80B-parameter logical fp16 shape, outperforming a matched plain-autoregressive TCP pipeline baseline because it emits a full sequence per WAN pipeline traversal rather than one token per traversal.
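The cross-entropy and perplexity figures quoted in this abstract are related by a standard identity: perplexity is the exponential of the cross entropy. The short check below assumes the reported cross entropy is measured in nats per byte, which the abstract does not state explicitly.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity as the exponential of a cross entropy given in nats."""
    return math.exp(cross_entropy_nats)

# Cross entropy reported in the abstract (assumed to be in nats).
print(round(perplexity(1.359), 2))  # 3.89
```

Under the same assumption, a cross-entropy gap of about 0.04 corresponds to a perplexity ratio of roughly exp(0.04), or about 4%.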
[AI-14] TACTFUL: Tactile-Driven Exploration For Object Localization and Identification in Confined Environments IROS2026
Link: https://arxiv.org/abs/2606.24712
Authors: Shivani Kamtikar,Chung Hee Kim,Camilla Tabasso,Tye Brady,Joshua Migdal,Taskin Padir
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: IROS 2026
Abstract:Humans effortlessly locate and identify objects by touch alone, even without vision. In contrast, robotic systems rely heavily on vision and struggle with autonomous tactile exploration and object identification. We present TACTFUL, a vision-free tactile exploration framework that enables a multi-fingered robot to autonomously explore confined workspaces, discover objects through contact, and identify them via tactile reconstruction. Trained entirely on real hardware without simulation, our system learns a single policy that balances global workspace exploration with local surface refinement through a dynamic reward schedule. Our results demonstrate that tactile sensing, when paired with structured learning, can serve as an effective primary modality for object-level reasoning, achieving 77% success with 0.015 m average reconstruction error and outperforming baseline approaches on real-world objects.
[AI-15] FlowPipe: LLM-Enhanced Conditional Generative Flow Networks for Data Preparation Pipeline Construction SIGMOD2027
Link: https://arxiv.org/abs/2606.24679
Authors: Kunyu Ni,Lei Cao,Jie He,Xiaotong Zhang,Jianfeng Jin,Junyu Dong,Yanwei Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by SIGMOD 2027
Abstract:Data preparation pipelines improve data quality in machine learning by transforming raw tables into learning-ready data through sequential cleaning and feature transformation operators. However, automatically constructing such pipelines is computationally difficult because operator sequences are combinatorial and end-to-end evaluation is expensive. Existing state-of-the-art (SOTA) Multi-DQN methods still face three key limitations: decoupled value estimators weaken long-horizon credit assignment, dataset context is only weakly injected into the policy, and exploration is inefficient in a sparse search space with many invalid states. To address these issues, we propose FlowPipe, a unified framework that formulates pipeline synthesis as conditional probabilistic flow generation over a directed acyclic graph. FlowPipe uses Conditional Generative Flow Networks (C-GFlowNets) with a Trajectory Balance objective to connect terminal validation rewards with early pipeline decisions. It further introduces Deep Semantic Modulation through Feature-wise Linear Modulation (FiLM), allowing LLM-derived logical priors to condition the policy’s internal activations according to dataset semantics. In addition, FlowPipe incorporates failure awareness into the flow objective to avoid invalid states and concentrate search on high-potential regions. Experiments on two benchmark suites with 74 real-world datasets show that FlowPipe outperforms SOTA baselines, improving accuracy by 11.96% on average and achieving 12.5x faster training convergence. Source code is available at this https URL.
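The Trajectory Balance objective named in this abstract can be written down for a single trajectory as a squared log-ratio. The sketch below shows the generic, unconditional form of that loss only; the conditioning on dataset context and the failure-aware terms described in the abstract are not modelled, and the operator probabilities and reward are made-up numbers.

```python
import math

def trajectory_balance_loss(log_z, forward_logprobs, backward_logprobs, reward):
    """Squared mismatch between the forward flow and the reward-weighted
    backward flow along one trajectory.

    log_z: learned log partition function (a scalar)
    forward_logprobs: log P_F of each step taken along the trajectory
    backward_logprobs: log P_B of each step, read in reverse
    reward: positive terminal reward of the finished object
    """
    forward = log_z + sum(forward_logprobs)
    backward = math.log(reward) + sum(backward_logprobs)
    return (forward - backward) ** 2

# Hypothetical two-operator pipeline: each operator chosen with probability 0.5.
# Appending operators gives every state a single parent, so log P_B is 0.
forward_steps = [math.log(0.5), math.log(0.5)]
backward_steps = [0.0, 0.0]
reward = 0.8  # made-up terminal validation score

balanced = trajectory_balance_loss(math.log(3.2), forward_steps, backward_steps, reward)
unbalanced = trajectory_balance_loss(0.0, forward_steps, backward_steps, reward)
print(balanced, unbalanced)
```

The loss vanishes when the partition estimate times the forward probability equals the reward (here 3.2 × 0.25 = 0.8), which is how a terminal reward is tied back to every earlier decision in the trajectory.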
[AI-16] Cost-Optimal Decision Diagrams for Stochastic Boolean Function Evaluation
Link: https://arxiv.org/abs/2606.24672
Authors: Xia Zong,Tuomo Lehtonen,Jussi Rintanen
Subjects: Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures
Abstract:In many decision-making scenarios, acquiring information incurs different costs. We consider the problem of constructing a deterministic evaluation strategy that minimizes the expected cost of evaluating a propositional formula under variable costs and a probability distribution over truth assignments. We present a branch-and-bound algorithm with variable-selection heuristics, pruning, and caching. To the best of our knowledge, it is the first practical exact algorithm for this level of generality. Experiments on random instances demonstrate scalability and quantify the efficiency-quality trade-off of a greedy beam-search variant. We additionally evaluate a structured heart-disease diagnosis instance. Finally, we prove that the problem is #P-hard and contained in $\mathrm{PSPACE}$.
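The optimization problem in this abstract can be made concrete on a tiny instance. The sketch below is not the paper's branch-and-bound algorithm: it is a plain exhaustive recursion with memoization, it assumes the variables are independent (the abstract allows a general distribution over truth assignments), and the function name and example numbers are hypothetical.

```python
from functools import lru_cache
from itertools import product

def optimal_expected_cost(formula, costs, probs):
    """Minimum expected cost of determining the value of `formula`.

    formula: function from a tuple of bools to a bool
    costs[i]: cost of observing variable i
    probs[i]: probability that variable i is True (independence assumed)
    """
    n = len(costs)

    def decided(partial):
        # partial[i] is True, False, or None (not yet observed).
        free = [i for i in range(n) if partial[i] is None]
        outcomes = set()
        for bits in product([False, True], repeat=len(free)):
            full = list(partial)
            for i, bit in zip(free, bits):
                full[i] = bit
            outcomes.add(bool(formula(tuple(full))))
        return len(outcomes) == 1

    @lru_cache(maxsize=None)
    def best(partial):
        if decided(partial):
            return 0.0
        options = []
        for i in range(n):
            if partial[i] is None:
                if_true = partial[:i] + (True,) + partial[i + 1:]
                if_false = partial[:i] + (False,) + partial[i + 1:]
                options.append(costs[i]
                               + probs[i] * best(if_true)
                               + (1 - probs[i]) * best(if_false))
        return min(options)

    return best((None,) * n)

# x0 AND x1, where x0 is cheap to observe and x1 is expensive.
print(optimal_expected_cost(lambda v: v[0] and v[1], [1.0, 4.0], [0.5, 0.5]))  # 3.0
```

Observing the cheap variable first costs 1 and settles the conjunction half the time, giving 1 + 0.5 × 4 = 3; starting with the expensive variable would cost 4 + 0.5 × 1 = 4.5.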
[AI-17] LaGO: Latent Action Guidance for Online Reinforcement Learning ICML2026
Link: https://arxiv.org/abs/2606.24669
Authors: Kuan-Yen Liu,Ren-Jyun Huang,Ti-Rong Wu
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures. Accepted at the ICML 2026 Workshop on Large Language Models for Planning (LM4Plan)
Abstract:Large language models (LLMs) have shown strong potential for planning and sequential decision-making, but prior work often relies on using them as direct controllers, which requires precise action generation and can be unreliable in practice. This paper proposes Latent Action Guidance for Online Reinforcement Learning (LaGO), a framework that uses a pretrained LLM as a latent action prior to softly guide online policy optimization, rather than treating the LLM as an explicit planner or controller. Experiments on both a discrete-control benchmark, CLEVR-Robot, and a continuous-control benchmark, Meta-World, demonstrate that LaGO consistently improves both reward and success rate over Vanilla PPO. In particular, LaGO increases the average success rate from 15.1% to 27.2% on CLEVR-Robot and from 2.7% to 15.2% on Meta-World. Our analysis further shows that stronger pretrained LLMs provide more effective guidance, suggesting that LLM knowledge can improve planning and online decision-making.
[AI-18] CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning
Link: https://arxiv.org/abs/2606.24636
Authors: Xinyu Mao,Yuhui Zeng,Xiaokun Liu,Wenyu Qin,Meng Wang,Xin Tao,Pengfei Wan,Xiaohan Xing,Max Meng
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures
Abstract:Cinematographic captioning aims to describe how a video is filmed using professional film-language concepts such as camera movement, shot size, depth of field, composition, and shooting angle. This capability is important for fine-grained video understanding and controllable movie-quality video generation, yet remains underexplored in existing multimodal large language models. Unlike question-answering-based evaluation of cinematic understanding, cinematographic captioning requires a unified open-form description over multiple cinematographic dimensions. This task is challenging for two main reasons: the model must infer professional cinematographic concepts from subtle visual evidence, and it must generate captions that are both comprehensive and accurate. Accordingly, we propose CineCap, a framework that combines structured reasoning with spatio-temporal anchors and reinforcement learning with comprehensiveness, accuracy, and gated coverage rewards. The former grounds professional cinematographic descriptions in explicit visual evidence and organizes them into compact atomic reasoning for supervised fine-tuning, while the latter improves the balance between descriptive completeness and factual correctness. In addition, we construct CineCap Bench, a benchmark of 472 manually annotated video-caption pairs for systematic evaluation. Extensive experiments show that CineCap consistently outperforms strong proprietary and open-source baselines, establishing a new state of the art for cinematographic captioning. The code, model checkpoint, and benchmark are publicly available in this https URL.
[AI-19] SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation ICML2026
Link: https://arxiv.org/abs/2606.24626
Authors: Chenyang Zhu,Jiayu Yao,Kushal Chawla,Youbing Yin,Nathan Wolfe,Pengshan Cai,Jingyu Wu,Spencer Hong,Sangwoo Cho,Shi-Xiong Zhang,Daben Liu,Sambit Sahu,Erin Babinsky
Subjects: Artificial Intelligence (cs.AI)
Comments: Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026
Abstract:As autonomous agents tackle increasingly complex multi-step, multi-agent tasks, their execution trajectories have scaled beyond the constraints of even the largest context windows. Current methods for effectively diagnosing agent failures load the full trajectory into an LLM’s context window, which suffers from attention dilution and fails when agentic traces inevitably exceed context limits. To address this, we introduce SAFARI (Scaling long-horizon Agentic Fault AttRibution via active Investigation), a framework that replaces linear context loading with a tool-augmented diagnostic loop. By equipping LLMs with a specialized toolbox to read and search trajectory segments alongside a persistent Short-Term Memory (STM) for cross-turn reasoning, SAFARI effectively decouples diagnostic accuracy from architectural context limits. Our experiments demonstrate that SAFARI outperforms state-of-the-art results by 20% on the WhoWhen dataset within a 1M token budget, and by 19% on TRAIL GAIA subset on a 25K token budget. Most significantly, SAFARI maintains a 0.58 precision even when the target fault resides 5x beyond the model’s native context window, a scenario where traditional evaluators fail entirely.
[AI-20] When CQs Go Wrong: Challenges in CQ Verification with OE-Assist ESWC
Link: https://arxiv.org/abs/2606.24619
Authors: Anna Sofia Lippolis,Mohammad Javad Saeedizade,Robin Keskisärkkä,Aldo Gangemi,Eva Blomqvist,Andrea Giovanni Nuzzolese
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted poster at this https URL 23rd European Semantic Web Conference (Satellite Event)
Abstract:Competency Questions (CQs) are the central component of CQ-verification, an established process in which an ontology is evaluated against a set of natural language questions to determine whether the intended purpose of the ontology has been properly modelled. However, CQ-verification is often time-consuming and error-prone, as it requires careful interpretation of linguistic nuances and precise alignment with formal ontology constructs. Ambiguities and complexity in CQs can further complicate this process, leading to inconsistent modelling decisions and verification outcomes. In this paper, we investigate what makes a CQ challenging and possible solutions to enhance the users’ performance in the CQ-verification process. We experimented with the data of 19 participants who performed CQ-verification on 20 tasks using an LLM assistant to support ontology evaluation. The results show the necessity of a tool to refine CQs before publishing them to avoid ambiguity or excessive complexity in later phases of the ontology engineering process.
[AI-21] Abstractions of Queries in Ontology-Based Data Access KR2025
链接: https://arxiv.org/abs/2606.24618
作者: Michel Leclère,Marie-Laure Mugnier,Guillaume Pérution-Kihli
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Extended version of a paper published in the proceedings of KR 2025
Abstract:In ontology-based data access (OBDA), multiple data sources are integrated via mappings to an ontology. We consider an OBDA setting based on existential rules and the certain answer semantics. We address the recent issue of query abstraction, which consists of abstracting data queries by translating them to the ontology layer. Since a perfect abstraction may not exist, the notions of minimally complete and maximally sound abstractions have been introduced. We study abstractions within an extension of UCQs with a limited form of inequality and a special predicate marking database constants. While this extension does not lead to an increased complexity of the problems of interest, it is able to express minimally complete abstractions, hence perfect abstractions when they exist. We also characterize maximally sound abstractions by making a new connection with the notion of maximum recovery stemming from data exchange. Related DOI: https://doi.org/10.24963/kr.2025/43
[AI-22] AI Tokenomics: The Economics of Tokens, Computation, and Pricing in Foundation Models
链接: https://arxiv.org/abs/2606.24616
作者: Quanyan Zhu
类目: Artificial Intelligence (cs.AI); Performance (cs.PF); General Economics (econ.GN)
备注:
Abstract:Tokens have become the practical accounting unit for modern foundation model services, linking information processing, computation, memory use, energy expenditure, pricing, and economic value. This paper develops a framework for AI tokenomics: the study of how tokens are generated, consumed, priced, allocated, and optimized across AI systems. We connect token-level technical costs to workflow-level production functions, enterprise resource allocation, measurement and instrumentation methods, and emerging market-design questions. The framework shows that token expenditure and economic value are distinct: value depends on marginal productivity, workflow position, hidden reasoning activity, risk, and downstream propagation effects. The paper concludes by identifying open research directions in hidden-token measurement, empirical calibration, token productivity, dynamic allocation, and token-based markets.
[AI-23] ScaleToT: Generalizing Structured LLM Reasoning for Billion-Scale Low-Activity User Modeling
链接: https://arxiv.org/abs/2606.24605
作者: Tianbao Ma,Chang Xi,Yichuan Zou,Chengen Li,Linxun Chen,Zilong Lu,Yanan Niu,Zhaojie Liu,Han Li,Kun Gai
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate user modeling often depends on rich interaction histories, which are unavailable for billions of low-activity users. Large Language Models (LLMs) can infer latent user states from static profiles, but this reasoning becomes unreliable when profiles are sparse, and applying an LLM to billions of users is prohibitively expensive. We present ScaleToT, which learns structured reasoning from a small LLM-processed subset and extends it to the broader low-activity user population. To improve reasoning reliability, ScaleToT constructs typed user-state chains with a bounded entropy-guided Tree-of-Thought (ToT) refinement procedure. To make this structured reasoning usable from sparse profiles, the teacher-curated chains are used to train a student model on static profiles through supervised fine-tuning (SFT) and Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO). ScaleToT then transfers the student’s reasoning representations to a lightweight profile encoder, providing shared reasoning signals for the remaining users without LLM inference. We evaluate ScaleToT on lifetime value (LTV) prediction in a billion-scale advertising deployment. A randomized online A/B test increased LT30 by 6.738%, while offline reasoning covered only 7.32% of the potential population, greatly reducing compute cost compared with full-population reasoning.
[AI-24] Uncertainty-Aware Longitudinal Forecasting of Alzheimer’s Disease Progression Using Deep Learning
链接: https://arxiv.org/abs/2606.24604
作者: Arya Hariharan,Shreyank N Gowda,Anala M R
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Longitudinal modelling of Alzheimer’s disease progression is clinically useful only if it can describe not just the most likely next diagnosis, but how a patient may evolve over time and how reliable that forecast is. Most deep learning approaches reduce this problem to single-step classification, treating cognitively normal, mild cognitive impairment, and dementia as flat categories while providing limited insight into how uncertainty accumulates across future visits. We propose a probabilistic framework that combines ordinal diagnosis prediction, multi-horizon trajectory generation, and decomposed uncertainty estimation. A Temporal Fusion Transformer encoder is adapted with a CORAL ordinal output layer, asymmetric loss weighting, and converter oversampling to respect disease-stage ordering and improve sensitivity to MCI-to-dementia transitions. Conditioned on the learned patient-context representation, an autoregressive Mixture Density Network generates five-year probabilistic trajectories for diagnosis state, CDR Sum of Boxes, MMSE orientation, and hippocampal volume. On ADNI, the model outperforms linear, recurrent, and transformer baselines for next-visit diagnosis prediction, with the strongest gains on MCI-versus-dementia discrimination. Generated trajectories achieve near-nominal 90% credible interval coverage, widening uncertainty across the forecast horizon, and biomarker dynamics consistent with expected Alzheimer’s disease progression. We further separate aleatoric from epistemic uncertainty using analytic mixture variance and a five-member bootstrap ensemble, which provides the strongest encoder diversity and output-level epistemic signal. Epistemic uncertainty is higher for rare progression archetypes, MCI and dementia patients, and under external evaluation on OASIS-3, where it increases alongside prediction error.
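The abstract above mentions a CORAL ordinal output layer. As a rough illustration of how CORAL-style decoding respects stage ordering (this is not the authors' code; the function name `coral_decode`, the bias values, and the example scores are all hypothetical), a single shared score is compared against K-1 thresholds, and the predicted stage is the number of thresholds exceeded:

```python
import math

# Ordered disease stages named in the abstract (rank 0 < rank 1 < rank 2).
STAGES = ["cognitively normal", "mild cognitive impairment", "dementia"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coral_decode(score, biases):
    """CORAL-style decoding: one shared score, K-1 threshold biases.

    Returns the predicted rank and the list P(stage > k) for each threshold k.
    """
    exceed_probs = [sigmoid(score + b) for b in biases]
    # The predicted rank is the number of thresholds passed with probability > 0.5.
    rank = sum(p > 0.5 for p in exceed_probs)
    return rank, exceed_probs


# Hypothetical, non-increasing biases so that P(stage > 0) >= P(stage > 1).
biases = [1.0, -1.0]
for score in (-2.0, 0.0, 2.0):
    rank, probs = coral_decode(score, biases)
    print(f"score={score:+.1f} -> {STAGES[rank]} "
          f"(P>CN={probs[0]:.2f}, P>MCI={probs[1]:.2f})")
```

Because every threshold shares the same score, the exceedance probabilities stay ordered, which is what lets the output treat the three diagnoses as stages rather than flat categories.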
[AI-25] ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning
链接: https://arxiv.org/abs/2606.24601
作者: Anurag Akula,Satheesh K. Perepu,Abhishek Sarkar,Kaushik Dey
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at RLC 2026 conference
Abstract:Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target domains in MARL; however, the majority of existing approaches impose the constraint that the dimensionalities of the observation space and the global state space must be identical across domains. In this paper, we introduce a method that explicitly accommodates mismatched state-space dimensionalities between source and target domains. The proposed approach, ASALT, incorporates both observation-level and state-level adapters that map the target-domain observations and global states into a shared embedding space, thereby enabling more effective transfer of knowledge across both actors and critics. These adapters can generate embeddings that support efficient strategy transfer across heterogeneous domains. Experimental results on multiple configurations in standard benchmark environments demonstrate that ASALT surpasses existing baselines in terms of sample efficiency and global return in cooperative settings, but its effectiveness depends on the degree of mismatch between source and target domains. Furthermore, our findings indicate that ASALT mitigates negative transfer, which frequently constitutes a major obstacle when transferring policies between domains with differing observation and action spaces.
[AI-26] Toward Self-Evolution-Ready Workflow Harnesses: A Reversible Migration Path and Convertibility Taxonomy for Expert LLM Pipelines
链接: https://arxiv.org/abs/2606.24598
作者: Yimo Lin,Zhen Zhang,Yibin Li
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While expert-validated “LLM + script” workflows deliver significant value, they remain static: they encode hard-won domain knowledge yet fail to adapt execution based on feedback. Existing agent research predominantly targets greenfield agents and synthetic benchmarks, leaving the migration of active legacy workflows unresolved. To bridge this gap, we present a reversible, Strangler-Fig migration path that refactors legacy workflows into composable, typed, and auditable stages. Central to this framework is a three-tier convertibility taxonomy (A/B/C), implemented as a routing stage within the system harness, which diagnoses a workflow’s readiness and routes it accordingly.
[AI-27] LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context
链接: https://arxiv.org/abs/2606.24585
作者: Anastasiia Kucherenko,François Brouchoud,Dimitri Percia David,Andrei Kucharavy
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While the validity of LLMs’ use in the legal context remains subject to ethical and legal debate, legal professionals are already experimenting with personal LLMs, if only for translation and reformulation. However, even such a seemingly innocuous use can introduce biases through case processing speed if LLM assistants selectively refuse assistance on certain topics. To better anticipate such biases, we investigate several modern small LLMs that are most likely to be used as on-device assistants, to assess the impact of overrefusal on legal prompts. Surprisingly, we find that authority-style prefixes (“you are acting as an assistant of the national supreme court”, “[…] defense lawyer”) systematically increase refusal rates by 2–20x over the no-prefix baseline, while a known role-play jailbreak prefix shows mixed effects, sharply increasing refusals in some models and barely shifting them in others. The finding suggests that small on-prem deployable LLMs are unstable under contextual framings that a real institutional user might naturally introduce, and further investigation is essential to minimize opportunities for bias.
[AI-28] Quant Convergence: Bridging Classical Value Investing and Modern Factor Models for Systematic Equity Selection
链接: https://arxiv.org/abs/2606.24575
作者: Augusto Eiji Yamazaki,Hugo Garrido-Lestache Belinchon
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Modern finance relies heavily on complex machine learning models to find patterns in the stock market. However, as these AI models get more complicated, they often memorize short-term market noise instead of finding companies with real, lasting value. We designed this research to test if Benjamin Graham’s classic value investing rules could act as a mathematical “low-pass filter” to keep these modern models in check. We built three different sets of features - pure Graham rules, modern market factors, and a mix of both - and tested them against highly complex models (XGBoost and AutoGluon) using 20 years of S&P 500 data. By applying a strict buy-and-hold strategy over a four-year test period (March 2022 to March 2026), the results showed that more complex algorithms do not always win. While the AutoGluon model captured high returns (222.68%), it suffered a substantial 39.78% drop because it bought volatile tech stocks right before the market crashed. On the other hand, the pure Graham Random Forest achieved the highest overall return (232.13%) with much less risk (1.38 Calmar Ratio). Furthermore, the Combined Random Forest successfully mixed momentum with Graham’s rules, making a 202.91% return while keeping the lowest maximum drop (34.53%) of any model tested. Ultimately, this research proves that Graham’s “margin of safety” isn’t outdated; it is actually a highly effective way to prevent modern AI from taking on too much risk.
[AI-29] GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents
链接: https://arxiv.org/abs/2606.24551
作者: Xiao Zhou,Siyue Zhang,Yilun Zhao,Jinbiao Wei,Tingyu Song,Arman Cohan,Chen Zhao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark of 440 desktop tasks across 18 applications and 12 workflow categories, where screen-only GUI agents and skill-mediated CLI agents receive identical goals, states, and final-state verifiers while being restricted to modality-native actions. In this controlled setting, the strongest GUI agent reaches a 59.1% full pass rate, outperforming the strongest original-skill CLI agent at 48.2%; however, verifier-guided skill augmentation raises CLI success to 69.3%, showing that much of the CLI deficit comes from incomplete skill coverage rather than model capability alone. These results suggest that GUI and CLI expose different execution bottlenecks: GUI agents are limited by reliable grounded interaction over long-horizon workflows, whereas CLI agents are limited by the coverage and scalability of their skill interfaces.
[AI-30] Governed Shared Memory for Multi-Agent LLM Systems
链接: https://arxiv.org/abs/2606.24535
作者: Yanki Margalit,Nurit Cohen-Inger,Erni Avram,Ran Taig,Oded Margalit
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent LLM environments require robust mechanisms for shared knowledge management. This paper formalizes the fleet-memory problem and identifies four foundational failure modes: unauthorized leakage, stale propagation, contradiction persistence, and provenance collapse. To address these, we define explicit systems-level primitives: scoped retrieval, temporal supersession, provenance tracking, and policy-governed memory propagation. These primitives are implemented in MemClaw, a production multi-tenant memory service, and evaluated via ArgusFleet, a reproducible harness testing four governance dimensions. Rather than a baseline comparison, this study measures a live production service, emphasizing real-world architectural insights and negative results. Key evaluation results: (1) Provenance: successfully reconstructed 100% of depth-four derivation chains with correct writer identity at sub-second per-hop latency. (2) Propagation: demonstrated high intra-fleet visibility with zero cross-fleet leakage; under strong write mode, write-to-visible latency was optimized to a single search round-trip. Production architectural issues discovered: (1) Asymmetric scope enforcement: tenant isolation held, but sub-tenant scope was initially bypassed on direct GET-by-id requests for agent-scoped credentials (disclosed and remediated during the study). (2) Pipeline ordering conflict: while contradiction supersession works for admitted writes, a synchronous near-duplicate gate can prematurely reject contradictory writes before the asynchronous contradiction detector can evaluate them. Conclusion: long-context retrieval alone is insufficient for production multi-agent memory. Governed shared memory demands explicit systems-level abstractions, and live evaluation is vital to expose enforcement and pipeline-ordering failures missed by design-only treatments.
[AI-31] A Fair Evaluation of Graph Foundation Models for Node Property Prediction ICML2026
链接: https://arxiv.org/abs/2606.24509
作者: Oleg Platonov,Gleb Bazhenov,Dmitry Eremeev,Liudmila Prokhorenkova
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Accepted at The Workshop on Graph Foundation Models at ICML 2026
Abstract:Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attracted a lot of attention. While many different types of models are called GFMs, particular interest has been paid to GFMs designed for node property prediction tasks, which is one of the most popular settings in Graph ML with lots of real-world applications from fraud detection in financial and social networks to recommendation systems for e-commerce and user-generated content platforms. While a number of GFMs for this task have been recently proposed, the field has not converged to a unified evaluation setting, and different works evaluate their models in widely different ways, preventing reliable comparison of GFMs with each other and with other types of models. In this work, we conduct a fair and rigorous reevaluation of 9 recent GFMs for node property prediction, comparing them to strong Graph Neural Network (GNN) baselines. We find that, among these GFMs, only the most recent ones based on the Prior-data Fitted Networks paradigm outperform well-tuned GNNs in predictive performance, although at a higher inference cost.
[AI-32] CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation
链接: https://arxiv.org/abs/2606.24506
作者: Zhuoren Ye,Tianyu Wo,Dinghao Xue,Mingming Zhang,Yuchen Teng,Chunming Hu,Renyu Yang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:
Abstract:Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely reach peak KV-cache demand at the same time, reserving worst-case KV capacity per model wastes memory; a shared KV-cache pool can instead provision aggregate active demand. However, KV-cache sharing is not sufficient when weights and KV-cache remain in a monolithic GPU memory pool. Static weights compete with dynamic KV-cache, and KV-head-limited attention under cold, low-concurrency traffic exposes only a fraction of replicated KV capacity, leading to low GPU memory utilization and weak long-context support. We present CrossPool, a serving engine for cold MoE models that separates FFN weights and KV-cache into two GPU memory pools: a weights pool that consolidates FFN weights across cold models, and a KV-cache pool that dynamically serves active requests while keeping attention local to KV-cache. CrossPool combines a KV-cache planner and virtualizer, a layer-wise pipeline scheduler that hides hidden-state transfers, and persistent kernels with control lowering to reduce CPU-GPU control overhead. With efficient GPU memory pooling, CrossPool underpins bursty long-context requests and outperforms the state-of-the-art kvcached-based multi-LLM serving system, reducing P99 TBT by up to $10.4\times$.
[AI-33] On the Smallness of the Large Language Models Scaling Exponents
链接: https://arxiv.org/abs/2606.24504
作者: Sauro Succi,Peter V. Coveney,Alex Hansen
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures
Abstract:We discuss reasons why the scaling exponents of current Large Language Model (LLM) applications indicate an unsustainable regime in terms of energy resources. We further show that attributing the smallness of such exponents to a numerical bias due to the neglect of a non-zero value of the loss function in the limit of infinite data (“pedestal effect”) does not remove the unsustainability issue. Finally, the effects of the smoothness (roughness) of the data on the scaling exponents are commented upon based on an analogy with phenomenological models of fluid turbulence.
[AI-34] Red-Teaming the Agentic Red-Team
链接: https://arxiv.org/abs/2606.24496
作者: Dario Pasquini,Michal Bazyli,Taras Fedynyshyn,Artem Sorokin
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: v0.1
Abstract:The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the community has focused on creating more and more capable agents, less attention has been allocated to assessing the security of those systems. In this work, we present the first in-depth security analysis of the most widely used agentic systems for offensive security operations. We show that most of these tools share common design flaws that enable an active adversary to exfiltrate API keys, establish persistent footholds, and fully compromise the operator’s machine, even when the agent operates inside a sandboxed container. To support our analysis, we introduce a full cyber kill chain for such agentic systems, capturing the progression from initial LLM manipulation to lateral movement, persistence, guardrail bypass, and sandbox escape. Building on our security analysis, we derive a robust architecture for agentic offensive-security tools and propose actionable, broadly applicable design principles that mitigate the disclosed attack paths at the architectural level.
[AI-35] G3VLA: Geometric inductive bias for Vision-Language-Action Models
链接: https://arxiv.org/abs/2606.24472
作者: Yue Peng,Yongzhe Zhao,Artur Habuda,Khuyen Pham,Yanheng Zhu,Tran Nguyen Le,Fares Abu-Dakka,Li Guo
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted to CoRL 2026
Abstract:Vision-language-action (VLA) models have made rapid progress in generalist robot manipulation by harnessing semantic knowledge from pretrained vision-language backbones, but their visual tokens remain grounded in 2D image coordinates rather than the calibrated geometry of the robot’s cameras – a mismatch especially pronounced in multi-camera setups, where views are coupled by known intrinsics and extrinsics yet processed as independent images. We propose $G^3$VLA, a camera-aware geometric module that injects calibrated structure into the visual-token stream of a pretrained VLA without altering its action space or imitation objective, combining intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion. Geometric supervision is provided either from ground-truth point maps when available, or from confidence-gated $\pi^3$X teacher predictions, requiring no depth sensors or manual annotations. Instantiated on $\pi_0$, $G^3$VLA yields consistent gains across the LIBERO suites, RoboCasa24, RoboTwin2.0, and real-robot settings, with the largest improvements on spatially and object-sensitive tasks. We further validate on $\pi_{0.5}$ and GR00T 1.5, with results suggesting that geometric transfer is most effective when geometry-aware tokens have direct access to the action generation pathway. Our project page is at this https URL
[AI-36] The Latent Bridge: A Continuous Slow-Fast Channel for Real-Time Game Agents
链接: https://arxiv.org/abs/2606.24470
作者: Bojie Li,Noah Shi
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A real-time agent for general computer use - with games as the most demanding case - must act within tens of milliseconds while still planning over seconds. These two regimes sit at opposite ends of the latency-quality tradeoff. A reasoning VLM (Qwen3-VL-8B-Thinking) deliberates effectively but requires ~1.5 s per response - far too slow for a 15 Hz control loop. In contrast, a reactive VLM (MiniCPM-o 4.5) acts in milliseconds but underperforms on planning-heavy tasks. We couple two frozen models of matched scale (9B reactive, 8B reasoning), leaving the communication channel as the sole trainable component. The standard coupling is a Text Bridge (T): the slow model writes a suffix the fast model reads. We introduce a learned continuous Latent Bridge (L) that projects the slow model’s residuals into the fast model’s input-embedding space in a LLaVA-style manner, avoiding any text round-trip; both are compared against Fast-Only (F). On 7 Atari games and a driving domain (MetaDrive), tuning the action decoder per channel on held-out seeds, the Latent Bridge matches or beats the Text Bridge in every domain: it significantly improves two games (MsPacman +57%, RoadRunner +28%) and is a safe drop-in elsewhere. Combining both channels interferes destructively (RoadRunner -96%), so only one should be used. The benefit is highly predictable: the bridge helps if and only if slow reasoning already beats fast reaction (T > F) - the Latent and Text gains over Fast-Only move together at r=0.93. MetaDrive is the controlled negative, where the Latent Bridge is demonstrably inert because the Text Bridge adds no value. We release replay recordings and reproducible pipelines.
[AI-37] CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
链接: https://arxiv.org/abs/2606.24467
作者: Xiaolin Lin,Jingcun Wang,Olga Kondrateva,Yiyu Shi,Bing Li,Grace Li Zhang
类目: Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2508.02401
Abstract:Long-context large language model (LLM) inference is increasingly constrained by the memory footprint and decoding cost of key-value (KV) caches, limiting sustainable deployment on resource-constrained hardware. Existing KV cache eviction methods typically apply heuristic token scoring over all heads in GQA-based LLMs. These methods ignore the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs. To address this issue, we propose CompressKV, a resource-efficient KV-cache compression framework for GQA-based LLMs. Instead of aggregating attention scores from all heads, CompressKV identifies Semantic Retrieval Heads (SRHs) that capture both the initial and final tokens of a prompt and semantically important mid-context evidence, and uses them to select tokens whose KV pairs should be retained. Furthermore, CompressKV allocates cache budgets across layers according to offline estimates of layer-wise eviction error. Experiments on LongBench and Needle-in-a-Haystack show that CompressKV consistently outperforms existing KV-cache eviction methods across memory budgets. Notably, it preserves over 97% of full-cache performance using only 3% of the KV cache on LongBench question-answering tasks and achieves 90% accuracy with just 0.7% KV storage on Needle-in-a-Haystack. These results demonstrate an improved resource–performance trade-off for long-context LLM inference. Our code is publicly available at: this https URL
[AI-38] NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation IROS
链接: https://arxiv.org/abs/2606.24450
作者: Soham Patil,Avirup Das,Sourabh Bhosale,Spandan Roy
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
Abstract:Perceiving physical contact is fundamental to dexterous manipulation. While robots often rely on dedicated hardware tactile sensors, humans exhibit a remarkable ability to infer contact by integrating visual information with an innate sense of their body’s pose and movement. Inspired by this embodied perceptual skill, we investigate whether a robot can learn to infer contact from vision, an approach that also offers a scalable alternative to tactile hardware specifically for binary contact estimation, which faces practical challenges in cost, fragility, and integration. We present NoContactNoWorries, a transformer-based multimodal framework that fuses RGB-D vision with the robot’s proprioception to infer binary contact states as a pseudo-tactile signal for hand-object interactions. We validate by training a single contact prediction model on multiple objects and show that the inferred contact signal supports downstream reinforcement learning agents for in-hand object reorientation, generalizing to novel objects. Experiments in both simulation and on a real-world robot validate our approach, highlighting the feasibility of inferring contact from vision and proprioception. Project Page: this https URL
[AI-39] ReM-MoA: Reasoning Memory Sustains Mixture-of-Agents Scaling
链接: https://arxiv.org/abs/2606.24437
作者: Heng Ping,Arijit Bhattacharjee,Peiyu Zhang,Shixuan Li,Wei Yang,Ali Jannesari,Nesreen Ahmed,Paul Bogdan
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Mixture-of-Agents (MoA) architectures improve inference-time scaling by organizing multiple LLM agents into layered reasoning pipelines. However, existing MoA variants fail to sustain gains as depth increases, exhibiting degradation, early plateauing, or saturation. We propose ReM-MoA, a memory-augmented MoA framework that sustains scaling through two mechanisms: (1) a Ranked Reasoning Memory that persistently stores and ranks reasoning traces from all layers using a comparative Reviewer Agent, and (2) a Curated Diversified Memory Routing scheme that exposes different agents to distinct combinations of successful and failed traces, preserving exploration diversity while propagating high-quality reasoning. We further introduce an optional multi-domain Reviewer distillation pipeline that improves ranking quality through frontier-model supervision. Across five reasoning benchmarks spanning math, formal logic, code, knowledge, and commonsense, ReM-MoA consistently outperforms prior MoA variants across both depth and width scaling, and its advantage widens with depth, establishing structured cross-layer reasoning memory as a key missing mechanism for scalable multi-agent inference.
[AI-40] Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories
链接: https://arxiv.org/abs/2606.24429
作者: Arsham Khosravani,Audris Mockus
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI coding agents are entering the open-source supply chain, yet their diverse and often invisible traces leave their prevalence poorly understood. We introduce a multi-layered detection framework that integrates configuration-file scanning, commit-message analysis, author-identity matching, and bot-signature lookup across World of Code (180M+ Git repositories), classifying agent traces into four behavioral types. No single method captures more than a fraction of activity: multi-method detection identifies 850,157 Claude Code commits in one snapshot, of which bot-account lookup (the signal most adoption studies rely on) recovers only 28,154 (3.3%), a 30x relative-recall gap, so single-signal prevalence estimates are biased low by at least this factor. Every detection pattern is hand-validated (495 labels) with per-cell precision and Wilson confidence intervals. Across snapshots from December 2024 to April 2026, commit-attributed agents generate over 320,000 commits per month; Claude Code leads (886,122 commits across 17,295 projects) and dominates silent, configuration-file-only adoption (21,078 projects). Compared against an independent pull-request census (AIDev), the two channels capture nearly disjoint agent populations (a PR census misses 79% of commit-detected Claude Code adopters and essentially all Codex adopters) and different kinds of work: PR-deployed cloud agents (Codex, Cursor) surface as feature work, while commit-deployed in-editor agents (Claude Code, OpenHands, Aider) surface as maintenance. The observed work profile follows deployment and detection mode rather than the tool itself, so no single channel is representative.
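The recall figures quoted in the abstract above can be checked directly, and the Wilson interval it mentions is a standard formula. The sketch below is illustrative only: the commit counts come from the abstract, while the function name `wilson_interval` and the 95-of-100 validation sample are hypothetical (the abstract does not give per-pattern label counts).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts quoted in the abstract.
multi_method_commits = 850_157
bot_lookup_commits = 28_154

recall = bot_lookup_commits / multi_method_commits
gap = multi_method_commits / bot_lookup_commits
print(f"bot-lookup recall: {recall:.1%}, relative-recall gap: {gap:.1f}x")

# Hypothetical validation sample: 95 of 100 hand labels confirm a pattern.
low, high = wilson_interval(95, 100)
print(f"precision 95/100, Wilson 95% interval: [{low:.3f}, {high:.3f}]")
```

The Wilson interval is a common choice for small hand-labelled samples because it stays inside [0, 1] even when the observed precision is close to 100%.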
[AI-41] Can Aggregate Invariants Accelerate Continuous Subgraph Matching? Limits, Laws, and a Dynamic Spectral Index
Link: https://arxiv.org/abs/2606.24421
Authors: Minghao Chen,Jiale Zheng
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB); Data Structures and Algorithms (cs.DS)
Comments: 11 pages, 3 figures, 3 tables
Abstract:Spectral filtering recently delivered substantial pruning for static subgraph matching: Laplacian interlacing rejects candidates whose neighborhoods cannot host the query. We study whether such aggregate structural tests can accelerate continuous subgraph matching (CSM) over dynamic graphs, and answer in three parts. First, lazily maintained spectral bounds are infeasible exactly where spectral pruning has value: we characterize the tightest safe rule over a formalized perturbation relaxation and show that even it loses essentially all pruning power within four touching updates. Second, exact maintenance is affordable when selective: pruning utility and recomputation cost are anti-correlated across vertices – hubs provably never prune – so recomputing small-neighborhood spectra on touch sustains exact local spectra at microseconds per update, complete by construction. Third, integrated into a decoupled CSM benchmark against an identical-minus-spectra control, the tests remove up to 51% of candidates or safely skip up to 47% of update enumerations, yet enumeration intermediates remain unchanged – beyond the gates’ skipped first-level bindings, typically zero – across two engines, four real graphs, two stream types, and 77 solved queries; a constructed radius-stratified workload confirms the instrument detects the exception when one exists (-99.9% intermediates, 748x faster). Aggregate tests accelerate what scales with candidate sets – construction, list scans – never adjacency-guided exploration. We distill an intermediate-invariance methodology for evaluating CSM filters and release a reusable dynamic local-spectra index.
[AI-42] Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems
Link: https://arxiv.org/abs/2606.24416
Authors: Bingnan Xiao,Chenhao Yang,Wei Ni,Xin Wang,Tony Q. S. Quek
Subjects: Artificial Intelligence (cs.AI)
Comments: 14 pages, 11 figures
Abstract:Network operators’ changing policies, service requirements, and stringent real-time constraints render existing methods designed with fixed objectives and constraints ineffective. This paper presents Agentic long-term performance optimization (Agentic-LTPO), a nested bilevel optimization framework that can be applied to adaptive physical layer problem configuration. The key idea is to employ agentic AI to generate upper-level configurations in a bilevel optimization structure, where evolving operator policies, environment summaries, and historical experiences are translated into structured lower-level optimization problem configurations. The lower level solves the problems with updated configurations for real-time physical-layer decisions. Considering cell-free MIMO beamforming as a use case, we embody Agentic-LTPO by designing a new multi-agent decision process with retrieval-augmented experience-based verification in the upper level, together with a closed-form beamformer in the lower level. Experiments demonstrate that Agentic-LTPO exhibits strong adaptability to dynamic operator policies and effectively enhances the system’s long-term performance by 57.2% compared to traditional methods.
[AI-43] Cycle-Consistent Neural Explanation of Formal Verification Certificates
Link: https://arxiv.org/abs/2606.24414
Authors: Andoni Rodriguez,Alberto Pozanco,Daniel Borrajo
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages of main text
Abstract:Formal verification produces machine-checkable certificates that attest to the satisfaction or violation of temporal properties, yet these certificates remain opaque to non-specialist stakeholders. We propose a cycle-consistent neural architecture that generates faithful natural language explanations of verification certificates. A forward network NN1 maps certificates to explanations, and an inverse network NN2 reconstructs certificates from explanations; a symbolic verifier closes the loop, providing a differentiable faithfulness proxy. A pointer-generator mechanism ensures lexical grounding by copying state names directly from the certificate. We evaluate on 420 test certificates spanning six verification methods (bounded proof, k-induction, inductive invariant, lasso, reachability, witness pair) in both YES and NO verdict variants, drawn from a financial compliance domain with 207 named states. Our trained architecture, combined with a hybrid inference-time routing strategy, achieves 90.0% cycle-verified soundness, surpassing a multi-LLM few-shot baseline (76.1% for the best of 16 LLM combinations across four frontier models) by 13.9 percentage points. The neural model wins on 10 of 12 verdict/kind categories, with three categories reaching 100% soundness. The architecture offers 860x faster inference (185 ms vs. 160 s per certificate for the full multi-LLM baseline), offline operation, deterministic outputs, and zero per-inference cost. These results demonstrate that trained specialization outperforms general-purpose LLM prompting for structured certificate explanation, while eliminating the deployment constraints of cloud-based inference.
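The headline numbers in this abstract mix percentage points with a latency ratio, so a quick arithmetic check may help when comparing against other entries. All four inputs are taken from the abstract; the variable names are illustrative.

```python
# Soundness figures reported in the abstract (percent).
neural_soundness = 90.0
best_llm_soundness = 76.1

# Absolute gap in percentage points versus the relative improvement.
pp_gain = neural_soundness - best_llm_soundness
relative_gain = pp_gain / best_llm_soundness

# Per-certificate latency reported in the abstract (seconds).
neural_latency_s = 0.185
baseline_latency_s = 160.0
speedup = baseline_latency_s / neural_latency_s

print(f"{pp_gain:.1f} percentage points, {relative_gain:.1%} relative, {speedup:.0f}x speedup")
```

The latency ratio works out to roughly 865, so the quoted 860x reads as a rounded figure rather than a separate measurement.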
[AI-44] Entity Resolution via Batched Oracle Queries
Link: https://arxiv.org/abs/2606.24407
Authors: Lorenzo Balzotti,Donatella Firmani,Luca Gagliardelli,Giovanni Simonini
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:
Abstract:We consider an oracle that processes a limited batch of records at a time and clusters those that refer to the same real-world entity. We study how to interrogate such an oracle to resolve entities in a dataset whose size is far larger than a single batch, and where no batch is guaranteed to contain all records of any given entity. We aim at a pay-as-you-go approach, to have full control over the costs (the number of oracle consults), while achieving the highest possible recall at every step. We formally cast this problem as batched entity resolution, prove that selecting optimal batches is NP-hard, and provide an optimal solution under a natural condition on entity sizes. Finally, we evaluate our approach on six datasets and show its superiority over state-of-the-art baselines.
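The setting described in this abstract can be illustrated with a minimal sketch: each oracle consult clusters one batch, and a union-find structure stitches the per-batch clusters together across overlapping batches. The oracle here is a hypothetical stand-in that reads ground-truth labels, and the batches are fixed by hand; the paper's batch-selection strategy is not reproduced.

```python
from collections import defaultdict

def toy_oracle(batch, truth):
    """Hypothetical oracle: groups the records of one batch by their true entity."""
    groups = defaultdict(list)
    for record in batch:
        groups[truth[record]].append(record)
    return list(groups.values())

def resolve(records, batches, oracle):
    """Merge per-batch oracle clusters into global clusters with union-find."""
    parent = {r: r for r in records}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for batch in batches:                 # one oracle consult per batch
        for group in oracle(batch):
            for other in group[1:]:
                parent[find(other)] = find(group[0])

    clusters = defaultdict(set)
    for r in records:
        clusters[find(r)].add(r)
    return sorted(sorted(c) for c in clusters.values())

# Toy data: no single batch contains all three records of entity A.
truth = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
batches = [["a1", "a2", "b1"], ["a2", "a3", "b2"], ["b1", "b2"]]
clusters = resolve(list(truth), batches, lambda b: toy_oracle(b, truth))
print(clusters)
```

Recall after each consult depends entirely on which records share a batch, which is why the choice of batches is the hard part of the problem.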
[AI-45] ATRIA: Adaptive Traceable ECG Reporting with Iterative Agents
Link: https://arxiv.org/abs/2606.24392
Authors: Donggyun Hong,Kyuhwan Lee,Junmyung Kwon,Yong-Yeon Jo
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing ECG report generation is tightly coupled – interpretation and reporting fused end-to-end, so errors propagate without stage-level recourse – while agent-based systems decouple tasks but remain single-pass, never revisiting earlier outputs. Clinical ECG reporting instead unfolds iteratively, requiring progressive context integration and bidirectional editing. We present ATRIA, a multi-agent ECG reporting system that mirrors the clinician’s iterative workflow: it binds every report claim to its supporting evidence, flags statements unsupported by that evidence, incorporates additional context mid-session, and lets clinicians verify and revise individual findings rather than accept one opaque output. Because its agents use ECG analysis models already in clinical use, the underlying findings are clinically trustworthy; and as a cloud-based web service, ATRIA is ready for immediate deployment. We demonstrate ATRIA through four interaction cases, with a live demo and video available.
[AI-46] PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models
Link: https://arxiv.org/abs/2606.24388
Authors: Simone Gallivanone,Hossein Khodadadi,Mauro Dore,Mauro Medda,Nicola Franco
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The dataset has been released at: this https URL
Abstract:We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents. Our primary goal is to make adversarial data accessible to the research community, given the computational cost and complexity of generating large numbers of attacks. The dataset comprises 47,524 adversarial samples, generated using state-of-the-art attack strategies from recent literature. Our work complements existing efforts by consolidating and extending prior benchmarks from multiple established sources, resulting in 7,826 intents, and introduces an additional category to broaden coverage. This provides realistic evaluation resources for studying model robustness and alignment. Our dataset intends to enable researchers and practitioners to systematically evaluate the robustness and safety of VLMs, fine-tune attack-generation models, and develop or stress-test defensive guardrails under diverse adversarial conditions. By releasing this resource, we aim to lower the barrier to adversarial research and foster more reproducible, comprehensive, and comparable evaluations of VLM safety.
[AI-47] When Helpfulness Overrides Causal Caution: Context-Dependent Suppression and Recovery in LLMs
Link: https://arxiv.org/abs/2606.24370
Authors: Hiroshi Okumura
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 43 pages, 3 figures, 5 tables. SSRN Abstract ID: 6965680
Abstract:Large language models (LLMs) are increasingly integrated into decision-support roles in business and policy contexts. While prior benchmark studies have primarily evaluated LLMs’ causal reasoning capabilities, a more fundamental epistemic dimension has been overlooked: Causal Caution, defined as the propensity to refrain from causal judgment when empirical evidence is insufficient. This study examines the systematic suppression of Causal Caution that occurs when LLMs shift from academic to practical advisory contexts. Using an evaluation rubric inspired by Pearl’s Causal Hierarchy (the PCH score), we conducted experiments on four high-performance LLMs – Claude Sonnet 4.6, Claude Opus 4.7, GPT 5.5, and Gemini 3.1 Pro – across 480 trials. Causal Caution maintenance rates were 91.7–100.0% in academic contexts but dropped to 6.7–18.3% in practical advisory contexts (Fisher’s exact test, p < .001 across all models). Furthermore, when restricted to practical prompts requesting concrete recommendations or explanatory rationales, only 1 of 200 responses (0.5%) maintained Causal Caution. A brief self-correction prompt – “Please reconsider this judgment from the perspective of causal relationships” – restored the expression of Causal Caution to maintenance rates of 71.4–100.0% (McNemar’s test, p < .001 across all models). These results suggest that helpfulness-oriented response patterns may suppress the expression of Causal Caution in practical advisory contexts, with important implications for organizational governance. The findings indicate that this suppression reflects context-dependent variation in expression rather than an underlying capability limitation, suggesting that multi-agent architectures that separate proposal generation from causal auditing may offer a promising governance design.
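For readers unfamiliar with the test named in this abstract, here is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 table. The counts are hypothetical (58 of 60 academic responses and 8 of 60 advisory responses maintaining caution); the abstract reports rates per model but not the underlying tables.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at most as likely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows are contexts, columns are (maintained, not maintained).
p_value = fisher_exact_two_sided(58, 2, 8, 52)
print(f"p = {p_value:.3g}")
```

With a gap this large between the two rows, the p-value falls far below the .001 threshold quoted in the abstract; a balanced table such as [[5, 5], [5, 5]] gives p = 1.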
[AI-48] Accelerating Disaggregated RL for Visual Generative LLMs with Diffusion-Based Parallelism and Trainer-Assisted Generation
Link: https://arxiv.org/abs/2606.24369
Authors: Sijie Wang,Zhengyu Qing,Zhiqiang Tan,Yiming Yin,Yeqing Zhang,Yaoyuan Wang,Qiang Wang,Xiaowen Chu,Shaohuai Shi
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Comments: 14 pages, 18 figures, 1 table
Abstract:Reinforcement learning (RL) has become a dominant post-training paradigm, driving the emergence of high-performance RL systems such as veRL for autoregressive large language models (LLMs). In parallel, diffusion-oriented RL algorithms, e.g., DanceGRPO and FlowGRPO, have rapidly expanded the scope of RL from language reasoning to diffusion-based visual and flow-based generation. However, efficient RL systems for diffusion generative LLMs remain underexplored. Existing implementations, e.g., veRL-Omni, still rely on colocated execution, which simplifies synchronization but couples rollout and training resources, limits heterogeneous deployment, and constrains independent scaling. To this end, we introduce DigenRL, a disaggregated RL framework for diffusion-based generative LLMs that supports flexible resource allocation, accommodates heterogeneous GPUs, and facilitates efficient task scheduling. To maximally reduce the execution bubbles in the disaggregated architecture, we propose: 1) a generation-axis pipeline (GAP) and time-step parallelism (TSP) in the diffusion architecture to enable finer-grained pipelining between rollout and training; 2) an elastic trainer-assisted generation (TAG) approach to enable the trainer GPU resources to dynamically assist in executing rollout generations; and 3) a tightly one-step constrained asynchronous strategy to further utilize the tail bubble in the pipeline. Extensive experiments are conducted on three hardware testbeds with 16-32 GPUs using HunyuanVideo-13B, Wan2.1-14B, FLUX.1-12B, and QwenImage-20B generative models. Experimental results show that DigenRL achieves 1.56-2.10x throughput improvements over state-of-the-art diffusion RL systems, veRL-Omni and GenRL. 
[AI-49] MVG-KAN: Multi-View Geo-Wind Guided KAN for PM2.5 Forecasting
Link: https://arxiv.org/abs/2606.24347
Authors: Cheng Huang,Muyao Guan,Jairus Yougui Railey,Ning Xu,Honghui Xu,Changjiang Zhang,Zhen Zhang,Shiqing Zhang,Cong Bai
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate short-term PM2.5 forecasting is important for public health protection, air-quality early warning, and urban environmental management. However, PM2.5 variation is driven by multiple coupled factors, including stable periodic changes induced by human activities and meteorological regularity, station-specific short-term concentration evolution, and meteorology-driven pollutant dispersion among monitoring stations. Existing spatio-temporal forecasting methods may capture station relationships to some extent, but distance-only, correlation-based, or purely adaptive graphs are often insufficient to comprehensively represent these heterogeneous factors, especially wind-direction-dependent pollutant transport. To address this problem, we propose a Multi-View Geo-Wind Guided KAN model for PM2.5 forecasting, named MVG-KAN, which models station-level PM2.5 evolution from three complementary views: local periodic regularity, station-wise residual temporal dynamics, and meteorological-environment-guided spatial dispersion. Specifically, the periodic-residual forecasting backbone first separates stable daily and weekly patterns from non-periodic residual variations. A Geo-Wind Graph is constructed by combining geographic distance decay with wind-direction- and wind-speed-aware transport, providing a lightweight physically motivated directed spatial prior for residual propagation among stations. In addition, a temporal Kolmogorov-Arnold network (TKAN) residual head is then introduced to learn station-wise nonlinear autoregressive correction from de-periodized PM2.5 residuals and historical multi-pollutant sequences, thereby enhancing the modeling of local residual inertia and pollutant co-variation.
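One way to picture a directed geo-wind edge weight is the sketch below, which multiplies a distance decay by a wind-alignment term and a capped wind-speed factor. This exact formula, the parameter names (`sigma_km`, `ref_speed`), and the planar coordinates are assumptions made for illustration; the abstract does not specify how the Geo-Wind Graph is actually computed.

```python
import math

def geo_wind_weight(src, dst, wind_toward_deg, wind_speed,
                    sigma_km=50.0, ref_speed=5.0):
    """Hypothetical directed edge weight from station src to station dst.

    src, dst: (x_km, y_km) planar coordinates.
    wind_toward_deg: direction the wind blows toward, measured
        counterclockwise from the +x (east) axis. This is an assumed
        convention, not the meteorological "from" convention.
    """
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return 0.0
    bearing = math.atan2(dy, dx)
    # 1 when dst lies exactly downwind of src, 0 when crosswind or upwind.
    alignment = max(0.0, math.cos(math.radians(wind_toward_deg) - bearing))
    decay = math.exp(-dist / sigma_km)               # geographic distance decay
    speed_factor = min(1.0, wind_speed / ref_speed)  # stronger wind, more transport
    return decay * alignment * speed_factor

origin = (0.0, 0.0)
w_east = geo_wind_weight(origin, (10.0, 0.0), wind_toward_deg=0.0, wind_speed=5.0)
w_west = geo_wind_weight(origin, (-10.0, 0.0), wind_toward_deg=0.0, wind_speed=5.0)
print(w_east, w_west)
```

Because the alignment term depends on the direction from source to destination, the resulting graph is directed: a downwind station receives weight from an upwind one but not the reverse.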
[AI-50] What Does ODRL Mean? A Cross-Level Ontological Grounding of Permissions, Prohibitions, and Duties in UFO-L
Link: https://arxiv.org/abs/2606.24344
Authors: Daham M. Mustafa,Christoph Lange,Giancarlo Guizzardi,Diego Collarana,Christoph Quix,Stefan Decker
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: Accepted at FOIS 2026 (16th International Conference on Formal Ontology in Information Systems), Vitória, Brazil; to appear in Frontiers in Artificial Intelligence and Applications, IOS Press. 16 pages, 1 figure, 2 tables
Abstract:ODRL policy evaluators produce verdicts, but say nothing about the normative positions a policy brings into existence, the authority structures those positions presuppose, or who holds the power to declare a norm violated. We formulate the Cross-Level Design Principle: any normative language with violable, consequential norms requires both conduct-level positions (Permission, Duty, Right, No right) and competence-level positions (Power, Subjection, Immunity, Disability). Applying this to ODRL, we establish that prohibition is sanctioned (violation possible and consequential), that permission is underspecified across its behaviour parameter (open vs. closed world), and that the formal semantics covers achievement obligations only. We ground ODRL in UFO-L, mapping each activated rule to a simple legal relator and extending coverage from two to eight legal positions; violation-declaration authority, implicit in every existing evaluator, becomes an explicit Power-Subjection pair. All axioms are mechanically verified in Isabelle/HOL and across a 39-problem benchmark under Vampire, E, and Z3.
[AI-51] ZONOS2 Technical Report
Link: https://arxiv.org/abs/2606.24320
Authors: Gabriel Clark,Sofian Mejjoute,Mohamed Osman,George Close,Beren Millidge
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures, 7 tables. Technical report. Model weights, inference code, and the ZTTS1-Eval benchmark released under Apache 2.0. Code: this https URL ; weights: this https URL ; benchmark: this https URL
Abstract:We present ZONOS2 8B, our latest TTS model, which achieves state-of-the-art naturalness, prosody, and voice cloning fidelity. We improve upon Zonos-v0.1 across scale, data, and training recipe. We scale the model from 1.6B to 8B total parameters (900M active) with a novel mixture-of-experts (MoE) backbone, improving inference latency and throughput. We expand our training corpus from 200K to over 6M hours using a new data processing pipeline, and we simplify our post-training and conditioning recipes to improve naturalness and voice cloning fidelity. We evaluate ZONOS2 8B on quality, speaker similarity, WER, and ZTTS1-Eval, our novel TTS benchmark, where it performs competitively with state-of-the-art systems while maintaining good streaming latency. We release our model weights and example inference code under an Apache 2.0 license on GitHub and Hugging Face.
[AI-52] Prob-BBDM: a Probabilistic Brownian Bridge Diffusion Model for MRI sequence image-to-image translation
Link: https://arxiv.org/abs/2606.24313
Authors: Martin Valls(UFR SFA (Poitiers), XLIM-ASALI, LabCom I3M (Poitiers)),Pascal Bourdon(UFR SFA (Poitiers), LabCom I3M (Poitiers), XLIM-ASALI),Christine Fernandez-Maloigne(LabCom I3M (Poitiers), XLIM-ASALI, UFR SFA (Poitiers)),Guillaume Herpe(CHU Poitiers – Radio, DACTIM-MIS (Poitiers), LabCom I3M (Poitiers)),David Helbert(UFR SFA (Poitiers), XLIM-ASALI, LabCom I3M (Poitiers))
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:AI-driven image-to-image synthesis is rapidly advancing, with growing applications in medical imaging. Multi-modal image analysis plays a crucial role in optimizing examination quality, yet acquiring multiple imaging modalities in clinical settings remains resource-intensive and time-consuming, especially for 3D imaging. To address this challenge, we propose a novel image-to-image translation model based on Brownian Bridge Diffusion Models (BBDM), which synthesizes magnetic resonance imaging (MRI) sequences from 2D axial slices. Our approach integrates a variational encoder-guided diffusion mechanism, leveraging probabilistic image distributions to enhance synthesis quality. Evaluated on the BraTS 2021 dataset, our Probabilistic-BBDM (Prob-BBDM) achieves superior performance across multiple translation tasks, reaching up to 88.46% SSIM and 26.09 dB PSNR, with consistent improvements over baselines. Notably, our diffusion process requires only 4 steps, making it computationally efficient while maintaining high-quality synthesis. To further validate generalizability, we test Prob-BBDM on an external third-party dataset, demonstrating consistent performance across domains. Additionally, we assess the clinical utility of the synthesized slices by using them as input to a pre-trained segmentation model. Tumor segmentation yields a Dice score of 88.71% and an HD95 of 3.49 mm, confirming that the synthesized slices preserve critical diagnostic information. These results highlight the potential of Prob-BBDM for high-quality, efficient, and generalizable MRI synthesis, offering a promising step toward improved medical image translation.
[AI-53] LemonHarness Technical Report
Link: https://arxiv.org/abs/2606.24311
Authors: Kailong Ren,Fubo Sun,Jiachen Liu,Liu Yang,Zimo Yin,Jiaying Li,Congli Yin,Ming He,Yu Huo,Jiawei Liu,Zeping Chen,Yubin Huangfu,Ronghua Li,Yixuan Wu,Xing Su,Yanzhi Xu,Likang Wu,Hongke Zhao,Lei Zhang,Xiaohui Geng,Jianping Fan
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such as modified files difficult to track. This paper presents LemonHarness, an integrated execution framework for long-horizon agents. LemonHarness establishes an explicit execution boundary by constraining state-changing operations within a clearly defined workspace and bringing model invocation, tool execution, and rule knowledge within a single controlled boundary. State-changing operations, including file writes, dependency installation, and temporary artifact creation, are executed through structured tool interfaces, with execution feedback recorded as observations available to subsequent model decisions. The system also introduces a reusable rule knowledge base, which turns recurring execution rules and acceptance criteria into runtime knowledge. LemonHarness further adds a time-aware execution mechanism that exposes elapsed and remaining budget to the model, so it can rebalance exploration, implementation, and validation effort as time pressure shifts and avoid timeouts from long waits or excessive verification. On Terminal-Bench 2.0, LemonHarness_GPT-5.3-CodeX reached 84.49% accuracy over 445 trials; pairing the same framework with the stronger GPT-5.5 backbone raised the average accuracy to 86.52% across five jobs. The results suggest that a unified runtime boundary, callable rule knowledge, and time-aware execution can improve the stability of long-horizon agent execution.
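The time-aware mechanism described in this abstract (exposing elapsed and remaining budget to the model) could look roughly like the sketch below. `TimeBudget` and its methods are hypothetical names chosen for illustration; they are not LemonHarness's interface, which the abstract does not describe.

```python
import time

class TimeBudget:
    """Hypothetical wall-clock budget whose state is surfaced to the agent each step."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock            # injectable so the budget can be tested
        self._start = clock()
        self.total = float(total_seconds)

    def elapsed(self):
        return self._clock() - self._start

    def remaining(self):
        return max(0.0, self.total - self.elapsed())

    def observation(self):
        # Short text that a harness could append to the model's next observation.
        return f"elapsed={self.elapsed():.0f}s remaining={self.remaining():.0f}s"

    def allow(self, estimated_seconds):
        # Refuse steps (long waits, extra verification) that would overrun the budget.
        return estimated_seconds <= self.remaining()

budget = TimeBudget(total_seconds=600)
print(budget.observation())
```

A monotonic clock is used so that system clock adjustments during a long task cannot make the remaining budget jump.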
[AI-54] Tractable Reasoning and Conjunctive Query Answering for Defeasible DL-Lite under Rational Closure
Link: https://arxiv.org/abs/2606.24279
Authors: Giovanni Casini(1 and 2),Umberto Straccia(1) ((1) CNR - ISTI, (2) University of Cape Town)
Subjects: Artificial Intelligence (cs.AI)
Comments: 108 pages, 2 figures, 1 table
Abstract:In Description Logics (DLs), reasoning under Rational Closure (RC) is a well-known and widely accepted non-monotonic formalism to handle defeasible knowledge. In this paper, we study the application of RC to the core and horn variants of the DL-Lite family of lightweight description logics. We analyze both entailment (instance checking) and Conjunctive Query (CQ) answering under RC. Our main contribution is providing a plug-in architecture that builds upon existing standard classical reasoners, establishing that reasoning and CQ answering under RC for DL-Lite can be done efficiently with minimal computational overhead.
[AI-55] Neural Network-Based Parametric Model Reduction for Predicting Turbulent Flow for Different Vehicle Geometries
Link: https://arxiv.org/abs/2606.24265
Authors: Kazuto Ando,Rahul Bale,Akiyoshi Kuroda,Makoto Tsubokura
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Numerical simulations in industrial applications often require performing numerous high-precision computations parameterized by specific experimental conditions. For instance, in vehicle body design, aerodynamic simulations are essential for evaluating the aerodynamic characteristics of various proposed body geometries. However, computational resource constraints often become a bottleneck. Therefore, achieving the desired accuracy while minimizing computational cost is crucial. To address this challenge, model reduction methods have been developed to decrease the degrees of freedom by constraining the possible states of a physical system to a lower-dimensional subspace. In particular, reduction techniques that project the system onto a nonlinear subspace using neural networks have been actively studied. Our previous research developed a reduced-order model that integrates neural-network-based model reduction with a time-evolution method, implemented as a distributed parallel training framework to process high-resolution flow field data efficiently. In this study, we extend this reduction approach by incorporating a variational autoencoder to assess its robustness in high-Reynolds-number flows around multiple vehicle bodies with varying geometries. Specifically, we evaluate the reconstruction accuracy of vortex generation across different spatial and temporal scales using a compact latent representation, with a particular focus on the flow behavior near the rear end of the vehicle body.
[AI-56] Probing the Misaligned Thinking Process of Language Models
Link: https://arxiv.org/abs/2606.24251
Authors: Kaiwen Zhou,Constantin Venhoff,Jonathan Michala,Xin Eric Wang,William Saunders
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models exhibit a growing range of misaligned behaviors such as strategic deception, sandbagging, and self-preservation. As they are increasingly deployed in high-stakes settings, it is critical to reliably detect such behaviors to ensure safe and responsible use. In this work, we propose to monitor misalignment by decomposing it into fine-grained cognitive processes – misalignment indicators – and detecting their presence in a model’s internal activations via linear probes. We develop a taxonomy of 18 indicators spanning different misaligned behaviors, paired with an automated, meta-plan-guided pipeline that generates multi-turn training conversations. To rigorously evaluate generalization, we construct an out-of-distribution suite combining automated behavioral elicitation, established misalignment benchmarks, and natural benign conversations. Across 5 misaligned behaviors, our probes match a strong LLM judge with 0.935 AUROC on out-of-distribution benchmarks while keeping a low false positive rate on benign traffic. We further perform in-depth analysis to understand the probes and the model’s internal representations of misalignment indicators.
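As background for the probe and AUROC figures in this abstract, the sketch below scores activations with a linear probe and computes AUROC as the fraction of (positive, negative) pairs ranked correctly. The weights, activations, and labels are toy values invented for illustration; nothing here reproduces the paper's probes or data.

```python
def linear_probe_score(activation, weights, bias=0.0):
    """Linear probe: a dot product between an activation vector and learned weights."""
    return sum(a * w for a, w in zip(activation, weights)) + bias

def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy 2-dimensional "activations" and a hand-picked probe direction.
weights = [1.0, -1.0]
activations = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [1, 1, 0, 0]   # 1 = indicator present, 0 = benign

scores = [linear_probe_score(a, weights) for a in activations]
print(auroc(scores, labels))
```

AUROC is threshold-free, which is why it is usually reported alongside a false positive rate at a chosen operating point, as the abstract does.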
[AI-57] AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming
Link: https://arxiv.org/abs/2606.24245
Authors: Pingchuan Ma,Zhaoyu Wang,Zimo Ji,Yuguang Zhou,Zhantong Xue,Zongjie Li,Shuai Wang,Xiaoqin Zhang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Large language model (LLM) agents increasingly automate complex tasks by integrating language models with external tools and environments. However, their autonomy poses significant safety risks: agents may execute destructive commands, leak sensitive data, or violate domain constraints. Existing safety approaches face a fundamental tradeoff: hand-crafted rules are interpretable but brittle, with overly conservative rules blocking safe operations (high false positives) while permissive rules miss unsafe behaviors (high false negatives). Neural classifiers lack the interpretability required for safety-critical deployments. We present AutoSpec, a framework that automatically evolves deployed expert-designed safety rules from user safe/unsafe annotations through counterexample-guided inductive synthesis (CEGIS) guided by inductive logic programming (ILP). Starting from the expert rules and a stream of annotated traces, AutoSpec iteratively evaluates rules, mines false-positive and false-negative counterexamples, uses ILP to learn which predicates discriminate them, generates candidate rule edits, and verifies candidates to select the best revision. The key insight is that ILP efficiently identifies predicates that appear frequently in false negatives but rarely in false positives (or vice versa), dramatically pruning the exponential search space of rule edits. This continues until convergence, producing interpretable rules that balance precision and recall. We evaluate AutoSpec on 291 execution traces spanning code execution and embodied agent domains. AutoSpec raises rule F1 to 0.98 and 0.93 across the two domains, achieving up to 94% false positive reduction while maintaining high recall, and converges within 4-5 iterations. The ILP-guided approach achieves up to 4.8x higher F1 than heuristic CEGIS. The learned rules are human-readable, auditable, and generalize to unseen scenarios. 
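The key insight in this abstract, finding predicates that are frequent in false negatives but rare in false positives (or the reverse), can be illustrated with a plain frequency comparison. This is a simple heuristic standing in for the idea, not inductive logic programming and not AutoSpec's implementation; the predicate names and traces are hypothetical.

```python
from collections import Counter

def rank_predicates(false_negatives, false_positives):
    """Rank predicates by how differently they occur in FN versus FP traces.

    Each trace is a collection of predicate names that hold for that trace.
    A score near +1 means the predicate is typical of missed unsafe traces,
    near -1 typical of wrongly blocked safe traces, near 0 uninformative.
    """
    fn_counts = Counter(p for trace in false_negatives for p in set(trace))
    fp_counts = Counter(p for trace in false_positives for p in set(trace))
    n_fn = max(1, len(false_negatives))
    n_fp = max(1, len(false_positives))
    scored = {p: fn_counts[p] / n_fn - fp_counts[p] / n_fp
              for p in set(fn_counts) | set(fp_counts)}
    # Most discriminative first; ties broken by name for determinism.
    return sorted(scored.items(), key=lambda kv: (-abs(kv[1]), kv[0]))

# Hypothetical counterexamples mined from one evaluation round.
false_negatives = [{"writes_outside_workspace", "uses_sudo"}, {"writes_outside_workspace"}]
false_positives = [{"uses_sudo", "reads_file"}, {"reads_file"}]

ranked = rank_predicates(false_negatives, false_positives)
print(ranked)
```

In this toy data `uses_sudo` appears equally often on both sides, so it scores zero and would be a poor candidate for a rule edit, while the other two predicates separate the counterexamples cleanly.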
[AI-58] Towards Federated Long-Tailed Graph Learning: An Energy-Guided Dual Decoupling Approach
Link: https://arxiv.org/abs/2606.24237
Authors: Lianshuai Guo,Zhongzheng Yuan,Xunkai Li,Meixia Qu,Wenyu Wang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Federated Graph Learning facilitates collaborative graph modeling across distributed clients while preserving data privacy. However, real-world data categories frequently exhibit long-tailed distributions. Such statistical scarcity severely degrades performance in two ways: it biases the global model toward majority classes, and it structurally isolates minority nodes by submerging them in heterophilic, head-dominated neighborhoods. While existing methods attempt topology-agnostic statistical compensations, they often fail under data scarcity. Instead of recovering tail nodes, they overfit the structural noise from adjacent dominant classes, leading to representation degradation. To address these limitations, we propose FedEPD, a framework built on a dual decoupling paradigm that separates topological purification from semantic recalibration. Specifically, FedEPD utilizes distribution-aware Dirichlet energy pruning to filter spatial heterophilic edges. It then overcomes Non-IID distribution shifts by extracting robust global prototypes from topologically central nodes, which are incorporated into local representations via a spatial low-pass prototype injection. Furthermore, a two-stage alternating optimization strategy strictly protects majority decision boundaries while improving minority accuracy. Extensive experiments demonstrate that FedEPD achieves state-of-the-art performance across diverse long-tailed benchmarks, yielding absolute improvements of up to 4.97% in Accuracy and 5.48% in Macro-F1.
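For context on the energy term named in this abstract, the sketch below computes the Dirichlet energy of node features on a graph and drops edges whose endpoints differ too much. The fixed threshold is an assumption made for illustration; FedEPD's distribution-aware criterion is not described in the abstract and is not reproduced here.

```python
def edge_energy(x_u, x_v):
    """Squared Euclidean distance between the feature vectors of two endpoints."""
    return sum((a - b) ** 2 for a, b in zip(x_u, x_v))

def dirichlet_energy(features, edges):
    """Sum of edge energies over undirected edges, each listed once.

    With this convention the value equals trace(X^T L X) for the
    unnormalized graph Laplacian L.
    """
    return sum(edge_energy(features[u], features[v]) for u, v in edges)

def prune_edges(features, edges, threshold):
    """Keep only edges whose endpoints are similar (hypothetical fixed threshold)."""
    return [(u, v) for u, v in edges
            if edge_energy(features[u], features[v]) <= threshold]

# Toy graph: nodes 0 and 1 look alike, node 2 belongs to a different class.
features = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
edges = [(0, 1), (1, 2)]

total_energy = dirichlet_energy(features, edges)
kept = prune_edges(features, edges, threshold=0.5)
print(total_energy, kept)
```

High-energy edges connect nodes with dissimilar features, which is why they serve as a proxy for heterophilic links; removing them lowers the total energy and smooths the signal that neighbors pass to each other.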
[AI-59] SP-Mind: An Autonomous Reasoning Agent for Spatial Proteomics Analysis ICML2026
Link: https://arxiv.org/abs/2606.24235
Authors: Yucheng Yuan, Yuanfeng Ji, Zhongxiao Li, Ruijiang Li
Subjects: Artificial Intelligence (cs.AI)
Comments: 23 pages, 6 figures. Accepted to ICML 2026. Equal contribution by Yucheng Yuan and Yuanfeng Ji
Abstract:Spatial proteomics enables single-cell-resolution characterization of protein expression within tissue architecture, playing a critical role in understanding tumor microenvironments and guiding precision medicine. However, current analysis workflows remain fragmented, requiring expert manual orchestration of heterogeneous tools and limiting research scalability and reproducibility. We present SP-Mind, the first autonomous AI agent designed to unify the spatial proteomics analysis pipeline, from raw multiplexed tissue imaging to downstream phenotype discovery. Equipped with expert-curated biological analysis skills and specialized computational tools, SP-Mind converts natural-language queries into end-to-end analytical workflows without task-specific fine-tuning. To rigorously evaluate its capabilities, we introduce SP-Bench, a comprehensive benchmark spanning diverse tissue types, comprising 102 tasks across 18 distinct categories. Through extensive evaluation on SP-Bench and established downstream tasks, SP-Mind achieves state-of-the-art performance compared to existing open-source biomedical agent baselines.
[AI-60] FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning
Link: https://arxiv.org/abs/2606.24231
Authors: Xirui Li, Zhe Liu, Xiaoqing Ye, Wenhua Han, Yifeng Pan, Junyu Han, Hengshuang Zhao
Subjects: Artificial Intelligence (cs.AI)
Comments: Project page: this https URL
Abstract:Multimodal driving planning faces a long-standing tension between two paradigms: scoring-based methods benefit from dense reward supervision but are confined to a fixed action vocabulary, while anchor-based methods generate proposals dynamically yet suffer from sparse supervision constrained to a single ground-truth trajectory. In this work, we propose FlowR2A, which resolves this tension by reframing simulation-based rewards from discriminative targets into generative conditions. By learning the reward-conditioned action distribution from dense trajectory-reward pairs with a flow-matching decoder, FlowR2A unifies the dense supervision of scoring-based methods with the proposal generation of anchor-based methods in a single generative model, forcing the model to internalize the correlation between an action and its outcomes in safety, progress, comfort, and rule compliance. To balance hard safety constraints against soft progress objectives, we introduce fine-grained per-timestep reward conditioning and reward noise augmentation. The generative formulation naturally supports controllable test-time sampling via reward guidance and anchored sampling, producing high-quality proposals. FlowR2A achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks, with multimodal proposals of substantially higher quality than prior methods.
[AI-61] Exploring the relationship between human-centric AI and firm idiosyncratic risks
Link: https://arxiv.org/abs/2606.24224
Authors: Zhen-Yuan Ralph Liu (CUMT), Yu-Ting Wang (NFU), Jia-Jia Yan, Shivam Gupta (NEOMA), Mihalis Giannakis
Subjects: Artificial Intelligence (cs.AI)
Comments: Information Systems Frontiers, 2026
Abstract:Despite the extensive discussions of human-centric AI (HCAI) in Industry 5.0, its effects on firms’ idiosyncratic risks (IR) remain underexplored. This is an imperative issue for firms navigating financial risks during the current technological revolution, as IR reflects investor reactions to heterogeneous corporate AI strategies and implementations by isolating firm-level stock volatility from systematic factors. Integrating situated AI theory with socio-technical systems theory, we conceptualise HCAI as a situated AI strategy that reduces AI-related ethical risks and fosters AI-Human synergies in firms’ business operations, ultimately reducing IR by aligning with stakeholders’ diverse expectations. Moreover, socio-technical factors, namely digitalisation, operational efficiency, executive shareholding, and CEOs with IT background, may moderate the HCAI-IR relationship. Using a multi-source panel dataset of Chinese listed firms from 2015 to 2023, we find that HCAI is associated with lower firm IR. Furthermore, digitalisation and executive shareholding strengthen this risk-reducing effect, whereas operational efficiency and CEOs with IT background surprisingly attenuate it. Our findings offer theoretical contributions and practical insights for both ethical AI governance and firm financial risk management in the AI era.
[AI-62] Navigating User Behavior toward Personalized Multimodal Generation
Link: https://arxiv.org/abs/2606.24196
Authors: Hengji Zhou, Yufeng Liu, Ye Liu, Yong Xu, Lianghao Xia, Liqiang Nie
Subjects: Artificial Intelligence (cs.AI)
Comments: 16 pages, 15 figures, 5 tables. Code is available at this https URL
Abstract:Modern AIGC pipelines deliver high-fidelity images and videos but presuppose a well-formed creation instruction, while end users rarely articulate visual details, leaving generators misaligned with user demand. We study personalized content generation, which turns a user’s interaction history into an executable instruction for downstream synthesis, and identify two obstacles: behavior must be encoded in a form legible to language reasoning, and the model must acquire instruction-writing skill absent from both pretraining and behavior data. We propose NaviGen, which represents each item with a dual identifier coupling a collaborative code and a textual code as a behavioral substrate and a semantic bridge in one token stream. On this representation, a two-stage SFT+RL pipeline first distills preference reasoning and instruction writing from evolutionarily searched supervision, then aligns generation with user intent through hierarchical and self-consistent rewards. Experiments across product, game, and short-video domains show that NaviGen improves personalized image and video generation, strengthens next-item prediction, and yields more specific, relevant, and visually generatable instructions. Our code is anonymously released at: this https URL.
[AI-63] Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment
Link: https://arxiv.org/abs/2606.24173
Authors: Disha Patel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures
Abstract:On-device fault detection enables real-time diagnostics without cloud dependency, but deploying machine learning models on resource-constrained hardware demands careful tradeoffs between accuracy, latency, and model size. We present a benchmark comparing traditional ML methods (Random Forest, XGBoost, SVM, Logistic Regression) against lightweight transformer architectures (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) for binary fault detection across three public datasets: NASA C-MAPSS turbofan degradation, SECOM semiconductor manufacturing, and UCI AI4I 2020 predictive maintenance. We evaluate classification performance (F1-score, AUC), model size, and CPU inference latency, and further assess INT8 dynamic quantization and a two-stage adaptive inference pipeline. Our results reveal that on well-separated sensor data (C-MAPSS), lightweight transformers match traditional ML at 87.8% F1 but at 100x the model size and 9000x the latency. TinyBERT-4L emerges as the most deployment-friendly transformer at 55 MB and 18 ms CPU latency. INT8 quantization reduces size by 25% while preserving 86.9% F1. Our adaptive pipeline, routing 97.9% of predictions through a quantized triage model and only 2.1% to a larger expert, achieves 87.6% F1 at 19.5 ms average latency. On severely imbalanced datasets (SECOM, UCI-PM), both traditional and transformer methods struggle significantly, highlighting fundamental limitations of current approaches for extreme class imbalance in fault detection. All code is publicly available.
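The two-stage adaptive pipeline described above can be pictured as a confidence-gated cascade. The following is a minimal sketch under stated assumptions, not the paper's code: `triage` and `expert` are hypothetical callables that return a fault probability, and the 0.9 confidence threshold is an illustrative value that does not come from the abstract.

```python
from typing import Callable, Tuple


def cascade_predict(
    x,
    triage: Callable[[object], float],   # hypothetical small (quantized) model
    expert: Callable[[object], float],   # hypothetical larger model
    threshold: float = 0.9,              # assumed confidence cut-off
) -> Tuple[int, bool]:
    """Return (label, used_expert) for one input."""
    p_fault = triage(x)
    # Confidence of a binary classifier: distance from the 0.5 boundary.
    confidence = max(p_fault, 1.0 - p_fault)
    if confidence >= threshold:
        return int(p_fault >= 0.5), False
    # Low confidence: escalate to the larger model.
    return int(expert(x) >= 0.5), True


def mean_latency_ms(expert_fraction: float, triage_ms: float, expert_ms: float) -> float:
    """Average latency when every input pays for triage and a fraction also pays for the expert."""
    return triage_ms + expert_fraction * expert_ms
```

With hypothetical latencies of 10 ms for triage and 100 ms for the expert, escalating 2.1% of inputs gives `mean_latency_ms(0.021, 10.0, 100.0)`, about 12.1 ms, so most traffic pays only the triage cost.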
[AI-64] Data Scale Not Latency Shapes Cross-Lingual Encoder Transfer in Streaming ASR
Link: https://arxiv.org/abs/2606.24169
Authors: Nenad Banfic
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Adapting a streaming speech recognition model to a new language requires choosing between two plausible warm starts: a multilingual (ML) encoder or an English-only (EN) encoder. The common intuition is that the multilingual encoder should help most at low data, but it is unclear how long that advantage persists, whether tight streaming latency amplifies it, and whether it survives deployment quantization. We answer these questions with a controlled sweep of a 0.6 B-parameter cache-aware FastConformer transducer across eight European languages, up to five target-language data scales (100 h to 2500 h), three streaming tiers plus offline decoding, and up to four public test sets. The main result is that multilingual initialization is a data-limited advantage, not a latency-limited one. On FLEURS at 160 ms, the mean EN-ML word error rate (WER) gap falls from +4.21 percentage points (pp) at 100 h to +0.20 pp at 2500 h; a power-law fit summarizes this decay, with each doubling of target-language data roughly halving the remaining advantage. Across the three streaming tiers, the across-language mean EN-ML gap is approximately stable at each scale from 100 to 1000 h, and is near zero by 2500 h. Finally, 4-bit weight-only encoder quantization at the matched 560 ms streaming tier reduces the encoder footprint by about 3x, with an average FLEURS WER increase of about 0.5 pp. The resulting guideline is simple: use multilingual initialization in low-data regimes, treat the choice as effectively irrelevant at large data, and make latency and quantization decisions independently.
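The "each doubling roughly halves the gap" reading can be checked against the two endpoint numbers quoted in the abstract. This is only a two-point illustration, assuming a pure power law `gap = a * hours**(-b)`; the paper's fit uses more data scales and may differ.

```python
import math


def power_law_exponent(h1: float, gap1: float, h2: float, gap2: float) -> float:
    """Exponent b of gap = a * hours**(-b) passing through two (hours, gap) points."""
    return math.log(gap1 / gap2) / math.log(h2 / h1)


# Endpoints from the abstract: +4.21 pp at 100 h, +0.20 pp at 2500 h (FLEURS, 160 ms).
b = power_law_exponent(100.0, 4.21, 2500.0, 0.20)

# Factor by which the gap shrinks each time the data doubles.
per_doubling = 2.0 ** (-b)
```

Under this assumption the exponent comes out just under 1 and the per-doubling factor a little above 0.5, which is consistent with the "roughly halving" summary.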
[AI-65] An Introduction to Causal Reinforcement Learning
Link: https://arxiv.org/abs/2606.24160
Authors: Elias Bareinboim, Junzhe Zhang, Sanghack Lee
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Causal inference provides a set of principles and tools that allow one to combine data and knowledge about an environment to reason with questions of counterfactual nature, i.e., what would have happened had reality been different, even when no data of this unrealized reality is currently available. Reinforcement learning provides methods to learn a policy that optimizes a specific measure (e.g., reward, regret) when the agent is deployed in an environment and pursues an exploratory, trial-and-error approach. These two disciplines have evolved independently and with virtually no interaction between them. We note that they operate over different aspects of the same building block, counterfactual relations, which makes them umbilically connected. Based on these observations, novel learning opportunities arise when this connection is explicitly acknowledged and mathematized. To realize this potential, we note that any environment where the RL agent is deployed can be decomposed as a collection of autonomous mechanisms with different causal invariances, parsimoniously modeled as a structural causal model; any standard RL setting implicitly encodes such a model. This formalization allows us to put under a unifying treatment different modes of learning, including online, off-policy, and causal calculus learning, which appear unrelated in the literature. However, these modalities are not exhaustive: we introduce several natural and pervasive classes of learning settings that entail novel dimensions of analysis. Specifically, we introduce and discuss through causal lenses generalized policy learning, where to intervene, imitation learning, and counterfactual learning. These tasks lead to a broader view of counterfactual learning and suggest great potential for studying causal inference and reinforcement learning side by side, which we call causal reinforcement learning (CRL).
[AI-66] The Geometry Behind Diffusion and Flow Matching: Gradient Flows and Geodesics in Wasserstein Space
Link: https://arxiv.org/abs/2606.24157
Authors: Yian Yao, Weiwei Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The space $\mathcal{P}_2(\mathbb{R}^d)$ of probability measures with finite second moment carries a natural geometry: the quadratic Wasserstein distance $W_2$ makes it a complete metric space and, following Otto, a (formal) Riemannian manifold whose geodesics are the optimal-transport interpolations. On this manifold, the gradient flow of the free energy $F(\rho) = \mathrm{KL}(\rho \,\|\, \pi)$ is exactly the Fokker-Planck equation, and its implicit-Euler discretization is the JKO scheme. This is the geometry underlying diffusion models: the forward process descends the free energy, and each denoising step realizes one JKO step, which recovers DDPM, DDIM, NCSN/SMLD, and Energy Matching; this is one scheme, not separate theories. The same manifold supports a second variational principle. Its geodesics, the minimum-action curves of the Benamou-Brenier formula, are precisely the optimal-transport paths that Flow Matching learns. Fixing both endpoints and following the geodesic, generation becomes a deterministic ODE along a straight line, hence far fewer sampling steps. Placing both families of models on one manifold makes their relationship exact: diffusion follows a free-energy gradient flow, an initial-value problem; optimal-transport Flow Matching follows a Wasserstein geodesic, a boundary-value problem. The two reach the same endpoints along different paths.
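For reference, the standard objects named in this abstract can be written out as follows. These are the textbook formulations (Jordan-Kinderlehrer-Otto and Benamou-Brenier), not equations taken from the paper; the potential $V$ with $\pi \propto e^{-V}$, the step size $\tau$, and the optimal map $T$ are notation assumed here.

```latex
\begin{align*}
F(\rho) &= \mathrm{KL}(\rho \,\|\, \pi) = \int \rho \log\frac{\rho}{\pi}\,dx,
  \qquad \pi \propto e^{-V}, \\
\partial_t \rho &= \nabla\cdot\Big(\rho\,\nabla \log\frac{\rho}{\pi}\Big)
  = \Delta\rho + \nabla\cdot(\rho\,\nabla V)
  && \text{(Fokker-Planck as the $W_2$ gradient flow of $F$)}, \\
\rho_{k+1} &= \arg\min_{\rho \in \mathcal{P}_2(\mathbb{R}^d)}
  \Big\{ F(\rho) + \frac{1}{2\tau}\, W_2^2(\rho, \rho_k) \Big\}
  && \text{(one JKO step)}, \\
W_2^2(\rho_0, \rho_1) &= \min_{(\rho_t, v_t)} \int_0^1\!\!\int |v_t(x)|^2\, \rho_t(x)\,dx\,dt
  && \text{(Benamou-Brenier)}, \\
&\qquad \text{s.t. } \partial_t \rho_t + \nabla\cdot(\rho_t v_t) = 0,
  \quad \rho_{t=0} = \rho_0, \quad \rho_{t=1} = \rho_1, \\
\rho_t &= \big((1-t)\,\mathrm{Id} + t\,T\big)_{\#}\,\rho_0
  && \text{(geodesic, when an optimal map $T$ exists)}.
\end{align*}
```

The gradient flow is an initial-value problem in $\rho_0$ alone, whereas the Benamou-Brenier problem fixes both $\rho_0$ and $\rho_1$, which is the initial-value versus boundary-value contrast the abstract draws.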
[AI-67] T2D-Bench: Evidence-Gated Evaluation of LLM Outputs for Type 2 Diabetes Using a Multi-Layer Clinical-Lifestyle Knowledge Graph
Link: https://arxiv.org/abs/2606.24145
Authors: Saba A. Farahani, Hung Cao, Ramesh Jain, Amir M. Rahmani
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures, 2 tables. Accepted as a poster at AMIA 2026 Annual Symposium
Abstract:Large language models (LLMs) can produce clinically fluent recommendations for type 2 diabetes while failing to satisfy guideline constraints or explicitly justify lifestyle-related glycemic claims. We present T2D-Bench, a reproducible benchmark and evidence-gated evaluation framework for testing whether LLM outputs satisfy explicit, graph-checkable evidence requirements. T2D-Bench is built on a multi-layer clinical-lifestyle knowledge graph that combines a biomedical spine (UMLS, DrugBank, SIDER), computable ADA Standards of Care rules, and lifestyle knowledge connected through a mechanistic bridge to glycemic laboratory effects. Across 100 structured vignettes spanning diagnosis, medication safety, and adversarial lifestyle conflicts, baseline outputs failed benchmark-defined evidence-path checks in 35% of cases for GPT-4o-mini and 33% for GPT-4o. The evidence gate detects unsupported omissions and uses constrained revision to bring outputs into verifier-level compliance with benchmark-defined evidence requirements. These results show that computable evidence constraints can make unsupported clinical omissions explicit, measurable, and correctable in diabetes-focused LLM outputs.
[AI-68] OmniPath: A Multi-Modal Agentic Framework for Auditing Wheelchair Accessibility
Link: https://arxiv.org/abs/2606.24129
Authors: ASM Mobarak Hossain, Nadim Mahmud, Vaskar Raychoudhury, Md Osman Gani
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 13 figures. Submitted to IEEE COMPSAC 2026
Abstract:For a wheelchair user, a standard blue line on a map is often a broken promise. While platforms like OpenStreetMap (OSM) successfully capture where a path is, they frequently fail to convey how it physically feels to travel on it. This information barrier is problematic for wheelchair users. To solve this issue, we present OmniPath, a system that moves from passive mapping to proactive environmental auditing. Our framework fuses the network topology of OSM with the submeter precision of high-density aerial LiDAR (USGS 3DEP) to create a high-fidelity 3D model of the pedestrian environment. Rather than simply routing a user, our agent virtually traverses the network, analyzing the surface in 0.5-meter increments. It rigorously quantifies physical friction points, specifically running slope, cross slope, and vertical discontinuities, against ADA compliance standards, calculating a weighted severity score to categorize hazards from "Mild" to "Critical." To ensure real-world reliability, we validated the system against 200 physical ground-truth field surveys across the National Mall using stratified random sampling. The framework demonstrated strong diagnostic reliability for high-severity hazards, achieving F1-scores of 0.60 for the Severe and 0.58 for the Critical categories. By automating this micro-scale inspection, OmniPath identifies the "invisible" barriers that standard maps miss, effectively transforming a static dataset into an accessibility data source that anticipates accessibility challenges before the user ever leaves home.
[AI-69] VeryTrace: Verifying Reasoning Traces through Compilable Formalism and Structured Verification ICML2026
Link: https://arxiv.org/abs/2606.24124
Authors: Ninghan Zhong, Ahmet Ege Tanriverdi, Kaan Kale, Sriram Vishwanath
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at LM4Plan Workshop @ ICML 2026
Abstract:Multi-step reasoning with Chain-of-Thought (CoT) prompting remains fragile: logical errors or hallucinations in early steps silently propagate, producing confident but incorrect conclusions. This paper presents VeryTrace, a zero-shot verification-and-repair framework that formalizes natural-language reasoning traces into a structured, compilable representation. VeryTrace introduces a Domain-Specific Language (DSL) that (i) makes step dependencies explicit, (ii) mechanizes quantitative content as executable expressions, and (iii) structures semantic inferences via deduction schemas. Our hybrid verifier combines deterministic checks for computational correctness, dependency resolution, and constraint satisfaction with targeted LLM audits for non-mechanizable semantic judgments, enabling step-level error localization and repair. Across three diverse domains, competition mathematics (AIME 2025), robotics planning (LLM-BabyBench), and kinship reasoning (CLUTRR), VeryTrace improves accuracy over zero-shot baselines on state-of-the-art LLMs without requiring domain-specific training or in-context examples, demonstrating that formalized trace verification achieves both precision and generalization.
[AI-70] ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection
Link: https://arxiv.org/abs/2606.24112
Authors: Chenhao Dang, Dantong Zhu, Jun Yang, Conghui He, Weijia Li
Subjects: Artificial Intelligence (cs.AI)
Comments: The project is available at this https URL
Abstract:Multimodal misinformation detection is increasingly important because viral posts now combine long multilingual narratives, several images, mixed provenance, and subtle text–image framing errors. Existing benchmarks and methods remain poorly matched to this setting: they usually isolate short captions, single images, binary labels, or one manipulation source, while agentic verification remains costly under realistic evidence search. We present ReMMD, a realistic multilingual multi-image agentic verification framework for multimodal misinformation detection. ReMMD includes ReMMDBench, a real-world multimodal misinformation detection benchmark with 500 samples, 2,756 images, five monolingual languages, two cross-lingual settings, three text-length tiers, multi-image posts, five-way veracity labels, eight distortion labels, evidence provenance, and rationales. It also includes ReMMD-Agent, a persistent-memory verifier that decomposes posts into atomic points, builds a reusable evidence set, and predicts structured L1/L2/L3 outputs. Across proprietary systems, open LVLMs, MMD-Agent, and T2-Agent, ReMMD-Agent obtains the best five-way veracity performance, with 41.80% accuracy and 39.12% macro-F1 using GPT-5.2, while reducing cost by 17.5% relative to MMD-Agent and 79.9% relative to T2-Agent. The project is available at this https URL.
[AI-71] DynaWM: Dynamics-Aware Distillation with World Model and Momentum Targets for Smooth Locomotion over Continuous Stairs IROS
Link: https://arxiv.org/abs/2606.24089
Authors: Haidong Hou, Zhangguo Yu, Hengbo Qi, Jianlin Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures, accepted by IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Abstract:Recent advances in control have enabled bipedal-wheeled robots to traverse slopes and single-step obstacles, yet long staircase traversal remains challenging as current teacher-student frameworks suffer from weakened dynamics-aware representations and incomplete terrain geometry encoding. To bridge this gap, we propose DynaWM, a dynamics-aware representation learning framework. To enhance terrain encoding capability and enable transparent assessment, we introduce a world model as a regularizer to enforce forward-dynamics awareness, preserving comprehensive terrain geometry while facilitating hierarchical encoding visualization. To stabilize knowledge transfer, we employ a momentum target encoder to provide consistent distillation targets, preventing dimensional collapse from non-stationary teacher updates. Evaluation of the learned representations through Principal Component Analysis (PCA) visualization and quantitative metrics reveals that our encoder hierarchically captures terrain geometry with higher terrain encoding capability, leading to enhanced terrain adaptability and motion smoothness. Experimental results in simulation and real hardware demonstrate that our method achieves superior terrain adaptability and motion smoothness, enabling bipedal-wheeled robots to overcome diverse continuous stairs, as shown in Fig. 1.
[AI-72] PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation
Link: https://arxiv.org/abs/2606.24081
Authors: Leyi Sheng, Han Sun, Zhen Sun, Yuntao Yue, Jinlin Wu, Xinlei He, Jiaheng Wei
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:As Text-to-Image (T2I) jailbreak techniques evolve rapidly, existing benchmarks and reproduction workflows often struggle to keep pace. More importantly, T2I jailbreak evaluation is not a single prompt-level test, but a pipeline-level problem shaped by multiple stages, including prompt transformation, image generation, safety filtering, and multimodal judging. This makes results across papers difficult to reliably reproduce and fairly compare. To bridge this gap, we propose PixJail, a self-evolving paper-to-pipeline agent framework for reproducible T2I jailbreak evaluation. Given a T2I jailbreak paper and optional reference code, PixJail rapidly constructs a paper-specific attack module and a runnable evaluation pipeline under a unified contract, while faithfully reproducing the original experimental results. PixJail further maintains a memory bank that stores paper digests, attack evolution patterns, reusable templates, failure cases, and versioned artifacts, enabling future reproduction efforts to reuse prior experience. We reproduce eleven representative T2I jailbreak methods, including both code-available and code-unavailable papers. Under their original settings, our framework accurately recovers prior results with minimal error (2.1% average, 0% median). We hope that PixJail can serve as a unified foundation for future T2I jailbreak reproduction and evaluation, significantly reducing manual effort.
[AI-73] Token Complexity of Certifying Stochastic-Oracle Reliability
Link: https://arxiv.org/abs/2606.24074
Authors: Jie Wang
Subjects: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Information Theory (cs.IT)
Comments: 21 pages, 0 figures
Abstract:Wang \cite{Wang2026} introduced the Stochastic-Oracle Turing Machine (SOTM) framework and defined token complexity as the minimum expected cost of interacting with a stochastic oracle needed to attain a specified solution quality for a task. This paper develops an analogous notion for certifying the reliability of a stochastic oracle on a given domain. Certification token complexity is the minimum expected token cost required, with controlled error probability, to distinguish oracles that meet a target reliability level from those that fall below a lower reliability threshold. We construct an SPRT-based certification SOTM that queries the oracle, computes binary correctness scores, and stops when the accumulated log-likelihood evidence crosses a decision threshold. The SOTM halts almost surely, satisfies the desired two-sided error guarantee over the reliability regions to be certified, and yields an explicit upper bound on certification token complexity in terms of the reliability thresholds, the error bound, and the expected per-turn token cost. We then establish a matching information-theoretic lower bound: even with adaptive queries, every error-bounded certification SOTM must incur the same leading-order expected token cost as the SPRT-based construction as the prescribed error bound tends to zero. Together, these bounds characterize the leading-order certification token complexity in the small-error regime.
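The stopping rule described above follows the shape of Wald's sequential probability ratio test (SPRT) on binary correctness scores. The sketch below is the classical Bernoulli SPRT with Wald's approximate thresholds, offered as an illustration rather than the paper's SOTM construction; `query`, `p_low`, `p_high`, and `max_queries` are names assumed here.

```python
import math
from typing import Callable, Tuple


def sprt_certify(
    query: Callable[[], bool],   # hypothetical: one oracle call scored correct/incorrect
    p_low: float,                # lower reliability threshold (reject at or below this)
    p_high: float,               # target reliability level (accept at or above this)
    alpha: float = 0.05,         # bound on falsely certifying an unreliable oracle
    beta: float = 0.05,          # bound on falsely rejecting a reliable oracle
    max_queries: int = 100_000,  # safety cap, an assumption of this sketch
) -> Tuple[str, int]:
    """Return (verdict, number_of_queries) using Wald's approximate thresholds."""
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    # Log-likelihood-ratio increments for a correct and an incorrect answer.
    step_correct = math.log(p_high / p_low)
    step_wrong = math.log((1.0 - p_high) / (1.0 - p_low))

    llr = 0.0
    for n in range(1, max_queries + 1):
        llr += step_correct if query() else step_wrong
        if llr >= upper:
            return "reliable", n
        if llr <= lower:
            return "unreliable", n
    return "undecided", max_queries
```

In this simplified picture, the expected token cost is the expected number of queries multiplied by the expected per-turn token cost, which is why the thresholds and the error bound drive the complexity.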
[AI-74] Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
Link: https://arxiv.org/abs/2606.24064
Authors: Tianyuan Shi, Canbin Huang, Bei Li, Xin Chen, Xiaojun Quan, Jingang Wang, Qifan Wang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring what to answer rather than how to reason. This trajectory-level imitation encourages memorization of instance-specific steps rather than acquisition of transferable problem-solving skills, limiting generalization to novel problems. We propose Strategy-Guided Policy Optimization (SGPO), which replaces instance-level trajectory imitation with reusable strategy distillation. SGPO extracts structured strategy descriptions from strong-model responses and, for each problem, constructs both autonomous and strategy-guided trajectories to enable direct comparison of the model’s behavior with and without strategic guidance. The framework then addresses two key questions. For how to distill, a token-level forward-KL objective selectively transfers the distributional shift induced by strategy conditioning into the unguided policy, with proximal constraints ensuring stability. For when to distill, adaptive instance-level weighting strengthens guidance when autonomous exploration falls short and reduces it as the model’s own competence grows. Experiments on four mathematical benchmarks across two model families show that SGPO consistently outperforms SFT, on-policy RL, and hybrid-policy baselines, improving the average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct. Analysis reveals that the forward-KL objective provides an inherently selective distillation signal that outperforms direct trajectory imitation, and that strategy distillation exhibits complementary scaling with base model capability.
[AI-75] RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting
Link: https://arxiv.org/abs/2606.24062
Authors: Cheng He, Zhenyu Guan, Xijie Liang, Defu Lian, Jiajia Li, Enhong Chen, Patrick P. C. Lee, Geng Hu, Zehao Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Financial time series forecasting presents structural challenges absent from standard benchmarks. Log-returns are non-stationary, exhibit exceptionally low signal-to-noise ratios (SNR), and are governed by regime-dependent temporal dependencies. We identify a key limitation of state-of-the-art (SOTA) time series models in financial settings: a fixed context window is mismatched to the time-varying optimal look-back of non-stationary price processes. We propose the Regime-Aware Variable-context Expert Network (RAVEN), a Mixture-of-Experts framework designed to adaptively determine the temporal context for each input sample. Instead of relying on a fixed look-back horizon, RAVEN constructs a hierarchy of nested contiguous windows whose lengths are determined by the data itself. Specifically, RAVEN scores patches by learned importance in reverse chronological order and applies the Cumulative Importance Thresholding (CIT) mechanism to derive nested prefix windows, each routed to a scale-specialized expert. A Global Compressed Representation (GCR) branch runs in parallel over the full context, preserving global temporal coherence that local experts cannot guarantee. Because the nested routing induces structured overlap among expert inputs, we introduce Correlation-Aware Weighting (CAW) to align variable-length expert outputs and penalize pairwise cosine similarity prior to aggregation. Experiments on cumulative log-return prediction (HS300, SP500) and fund sales forecasting demonstrate that RAVEN achieves SOTA performance, improving Pearson correlation by 9.2% on HS300 and 20.2% on SP500 and reducing MSE by 18.2% on fund sales forecasting, while achieving the best results in 14 of 16 metrics on four PEMS traffic benchmarks.
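One plausible reading of Cumulative Importance Thresholding is sketched below: walk the patches from newest to oldest, accumulate normalized importance, and cut a window each time a threshold is crossed. This is an interpretation of the abstract only; the function name, the normalization, and the thresholds are assumptions, and the actual RAVEN mechanism may differ.

```python
from typing import List, Sequence


def nested_window_lengths(importance: Sequence[float], thresholds: Sequence[float]) -> List[int]:
    """For each threshold, count how many most-recent patches are needed to reach it.

    `importance` is ordered oldest to newest and must contain positive scores.
    Thresholds are fractions of total importance in (0, 1].
    """
    total = float(sum(importance))
    lengths = []
    for tau in sorted(thresholds):
        cumulative = 0.0
        length = len(importance)  # fall back to the full context
        # Reverse chronological order: newest patch first.
        for k, score in enumerate(reversed(importance), start=1):
            cumulative += score / total
            if cumulative >= tau:
                length = k
                break
        lengths.append(length)
    return lengths
```

Because the thresholds are sorted and the accumulation is monotone, the resulting lengths are non-decreasing, so each window is a prefix of the next when counted back from the newest patch. For example, `nested_window_lengths([1, 1, 2, 4], [0.5, 0.75, 1.0])` yields windows of 1, 2, and 4 patches.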
[AI-76] Ensemble Feature Selection and Harris Hawks Optimization for Explainable Mental Health Risk Prediction in Female Sex Workers
Link: https://arxiv.org/abs/2606.24047
Authors: Ahnaf Atef Choudhury, Md. Parvej Hoque Palash, Shahriar Siddique Ayon, Ramkrishna Saha, Abdullah Al Mamun
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted and presented at the 2026 8th IEEE Symposium on Computers & Informatics (ISCI 2026). To appear in IEEE conference proceedings
Abstract:Mental disorders, especially depression, are among the most significant health issues affecting female sex workers (FSWs). Exposure to violence, stigma, and economic hardship further increases their psychological risk. Current machine learning (ML) models are typically ineffective at capturing the high-dimensional and complex risk patterns that exist in this marginalized group. This paper proposes a hybrid predictive model that combines an ensemble feature selection strategy, based on ANOVA and mutual information, with logistic regression tuned by Harris Hawks optimization, a new application of swarm intelligence to mental health prediction in vulnerable groups. Explainable AI (XAI) methods are used to understand the trauma-related factors associated with model predictions. Applied to a group of 3,005 FSWs, the proposed model outperforms traditional classifiers, with an accuracy of 95.78%, an F1 score of 95.77%, and an AUC of 0.96, and identifies post-traumatic stress, client-related violence, and occupational factors as major contributors to depression. This work bridges the gaps between conventional and ML approaches to develop an XAI tool that enables vulnerable groups to receive early assistance, evidence-based targeted psychosocial care, and health planning.
[AI-77] Rapid FinFET Modelling Using an Autoencoder
Link: https://arxiv.org/abs/2606.24046
Authors: Amit Sarkar, Suman Sau, Swagata Mandal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
Comments:
Abstract:This work presents a machine learning framework that leverages an autoencoder (AE) for the efficient modeling of FinFETs. We first calibrated a BSIM-CMG model to generate a dataset of current-voltage (ID-VG) characteristics. This data was used to train an autoencoder that compresses full I-V curves into a low-dimensional latent space, which intrinsically encodes key device physics. A key innovation is the explicit incorporation of parameters such as drain-to-source voltage (VDS) as input features, enhancing the model's ability to capture bias-dependent variation. The trained model successfully reconstructs full I-V curves and directly extracts critical device metrics including threshold voltage (VTH), subthreshold slope (SS), and peak transconductance (gm). This approach demonstrates that data-driven compact models, built from actual characterization data, can achieve high accuracy with minimal training data, providing a powerful tool for rapid device characterization, modelling and circuit-level simulation.
[AI-78] Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation
Link: https://arxiv.org/abs/2606.24042
Authors: Cláudio Lúcio Do Val Lopes,Lucca Machado da Silva,André de Oliveira Brandão
Categories: Artificial Intelligence (cs.AI)
Comments: IEEE International Conference on Responsible Artificial Intelligence (IRAI) - 2026
Abstract:Recommender systems often induce filter bubbles and semantic homogenization by monolithically optimizing for immediate user engagement. Standard single-objective models, including traditional Deep Q-Networks, are ill-equipped to navigate the trade-offs between platform retention and critical societal values like information diversity and provider fairness. To address these limitations, we introduce a multi-objective reinforcement learning framework that formalizes recommendation as a semantic multi-objective Markov decision process. By integrating high-fidelity semantic embeddings with a Pareto-DQN agent, our architecture treats engagement, diversity, and fairness as distinct, non-aggregable reward signals, avoiding the pitfalls of static reward scalarization. Empirical evaluations on the MovieLens small dataset show that our hypervolume-based action selection disrupts the feedback loops responsible for semantic collapse. By sustaining high state-trajectory variance, the Pareto-DQN effectively maps the Pareto frontier, achieving gains in auxiliary societal objectives with only marginal impacts on engagement. This work provides a path toward intrinsically aligned, responsible recommender systems.
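Note: the abstract of [AI-78] mentions hypervolume-based action selection over non-aggregable reward vectors. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes each action has a single estimated reward vector (engagement, diversity, fairness) and scores it by the volume of the box it spans above a reference point. The names `box_hypervolume` and `select_action`, the reference point, and all numbers are hypothetical.

```python
def box_hypervolume(q, ref):
    """Volume of the axis-aligned box between a reference point and one Q-vector.

    Simplifying assumption: one vector per action, so the hypervolume
    reduces to a product of per-objective margins (clipped at zero).
    """
    volume = 1.0
    for q_i, ref_i in zip(q, ref):
        volume *= max(q_i - ref_i, 0.0)
    return volume


def select_action(q_vectors, ref):
    """Return the index of the action whose Q-vector spans the largest box."""
    return max(range(len(q_vectors)),
               key=lambda a: box_hypervolume(q_vectors[a], ref))


# Hypothetical Q-vectors: (engagement, diversity, fairness) per action.
q_vectors = [
    [0.9, 0.1, 0.1],  # engagement-heavy recommendation
    [0.6, 0.5, 0.5],  # balanced recommendation
    [0.2, 0.9, 0.3],  # diversity-heavy recommendation
]
reference_point = [0.0, 0.0, 0.0]

best = select_action(q_vectors, reference_point)
print(best)  # index of the selected action
```

With these toy numbers the balanced action is selected, because a product of margins penalizes vectors that are strong in one objective and weak in the others; a weighted sum with a large engagement weight would instead pick the first action.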
[AI-79] Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?
Link: https://arxiv.org/abs/2606.24026
Authors: Ayan Antik Khan,Harsh Kohli,Yuekun Yao,Huan Sun,Ziyu Yao
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 4 figures, 14 tables
Abstract:Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized components do remains labor-intensive and difficult to standardize. In this work, we study whether language model (LM) agents can assist with this explanation problem once a circuit has already been identified. We introduce AgenticInterpBench, a benchmark for circuit explanation built from 84 semi-synthetic transformer circuits with 163 component-level annotations. We propose HyVE (Hypothesize, Validate, Explain), an agentic explainer that analyzes each component through an iterative loop of observation, hypothesis generation, and causal validation, eventually producing a component-level explanation and a circuit-level task description. Across four LM backbones, HyVE recovers useful component- and task-level explanations, but no backbone is uniformly best. Our analysis shows that strong backbones usually form observation-grounded hypotheses, while failures more often arise later in the validation loop, through incomplete validation plans, code execution errors, or unresolved hypotheses. A case study on an arithmetic circuit in Llama-3-8B shows that the same formulation can extend beyond semi-synthetic benchmarks to naturally trained models. Overall, LM agents are promising circuit explainers, but reliable validation remains the key obstacle.
[AI-80] Safe and Generalizable Hierarchical Multi-Agent RL via Constraint Manifold Control
Link: https://arxiv.org/abs/2606.24010
Authors: Zihao Guo,Jianing Zhao,Ling Li,Hao Liang,Giuseppe Loianno,Yali Du
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages
Abstract:Multi-agent systems are widely used in safety-critical applications that require coordinated behavior under strict safety constraints. Existing approaches face a fundamental trade-off: learning-based methods achieve strong empirical performance but lack theoretical safety guarantees, while control-theoretic methods enforce safety but often lead to overly conservative and inefficient behaviors. We propose a hierarchical multi-agent reinforcement learning framework that enforces hard safety constraints under mild assumptions at low level via a constraint manifold, while enabling effective coordination through high-level policy learning. Our approach provides theoretical safety guarantees in the multi-agent setting and yields stationary learning dynamics, thereby enabling stable and efficient training. Empirically, our method achieves competitive performance while maintaining nearly perfect safety rates, and generalizes effectively to varying numbers of agents and obstacles.
[AI-81] Fast and Slow Variational Continual Learning
Link: https://arxiv.org/abs/2606.24007
Authors: Subarnaduti Paul,Yohan Jung,Mohammad Emtiyaz Khan,Siddharth Swaroop,Thomas Möllenhoff,Martin Mundt
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Continual learning remains a major challenge for modern deep networks, partly because commonly used optimizers lack inherent mechanisms for continual adaptation. One such natural mechanism is fast and slow adaptation to balance stability and plasticity. This mechanism has deep roots in neuroscience and biology, but there is no consensus on how to best incorporate it in commonly used optimizers. Here, we show that this can be easily done via the VCL framework, where past posteriors are used as priors in the future. Our key idea is to incorporate slow adaptation via merging of past posteriors to slow down the drift in the knowledge as learning progresses. The merged posterior is then used as the prior in the VCL update to implement the fast-weight updates. These steps can be seamlessly implemented in the IVON optimizer, whose form and costs are nearly identical to that of Adam. We call this new optimizer the Continual IVON (CoVON) optimizer and show that it not only consistently improves over existing VCL optimizers, but also performs better than other weight-regularization strategies across domain-incremental learning, continual pre-training, and fine-tuning of large language models.
[AI-82] Learning to Trigger: Reinforcement Learning at the Large Hadron Collider
Link: https://arxiv.org/abs/2606.23993
Authors: Zixin Ding,Shaghayegh Emam,Giovanna Salvi,Cecilia Tosciri,Abhijith Gandrakota,Jennifer Ngadiuba,Nhan Tran,Christian Herwig,David W. Miller,Yuxin Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
Comments:
Abstract:High-throughput scientific facilities such as the Large Hadron Collider depend on real-time event filtering (triggering) under tight constraints on bandwidth, latency, and storage. In practice, trigger menus are largely static and hand-tuned and can become suboptimal as detector conditions, pileup, and background composition drift over time. We cast online threshold tuning as a sequential decision-making problem: a reinforcement learning agent ingests streaming summaries of recent rates and signal-sensitive features and updates trigger thresholds to maximize signal efficiency while tracking a target background rate within a tolerance band. We adapt Group-Filtered Policy Optimization (GFPO) to streaming control and introduce two variants (GFPO-F, GFPO-FR) that enforce background rate feasibility during training. On a benchmark that emulates realistic collider operation, we study two representative triggers: a total transverse energy (H_T) trigger sensitive to pileup variation, and an anomaly-detection (AD) trigger based on reconstruction loss for rare or non-standard signatures. On Monte Carlo streams, our agent increases the fraction of in-tolerance time intervals by 48% (H_T) and 28% (AD), with a cumulative gain of up to 2% in signal efficiency on those in-tolerance intervals. Transferring from simulation to real collision data (CMS Run 283408), the same agent, without fine-tuning, achieves a 56% (H_T) and 28% (AD) in-tolerance improvement over baselines, with further signal-efficiency gain on both triggers. To our knowledge, this is the first demonstration of RL-based trigger control on real Large Hadron Collider collision data. Code is available at this https URL.
[AI-83] Offline Reinforcement Learning for Warehouse SLAM Throughput Control
Link: https://arxiv.org/abs/2606.23978
Authors: Tina Dongxu Li,Mouhacine Benosman,Rajat Kumar,Kevin Tan,Ken Meszaros,Trevor Dardik
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at 2026 14th International Conference on Control, Mechatronics and Automation (ICCMA 2026)
Abstract:We present an offline reinforcement learning (RL) framework for optimizing SLAM throughput control in a warehouse fulfillment environment. SLAM (Scan/Label/Apply/Manifest) throughput directly influences system congestion and operational efficiency. Our RL-based control approach dynamically recommends SLAM throughput settings that adaptively balance throughput maximization with downstream stability through intelligent adjustment of throttling behavior. We include a history-informed state representation, action space abstraction for delayed-impact control, and a reward function that captures both upstream and downstream operational metrics. Our approach is algorithm-agnostic, enabling integration of multiple offline RL methods under a unified architecture. We instantiate our framework with three state-of-the-art offline RL algorithms, and trained the models offline using de-identified historical operational logs from a large-scale warehouse. Policy performance is evaluated using a comprehensive multi-method strategy. These include model-free approaches including immediate reward estimation via regression models and long-horizon Fitted Q Evaluation (FQE), as well as model-based Deep Koopman dynamics evaluation. Empirical results reveal that the CQL policy consistently outperforms alternatives, improving system health by 22.97% and reducing average throttling duration by 3.18%. These findings demonstrate the potential of offline RL for safe and scalable warehouse throughput control optimization.
[AI-84] RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems
Link: https://arxiv.org/abs/2606.23927
Authors: Yarin Yerushalmi Levi,Roy Betser,Amit Giloni,Lidor Erez,Itay Gershon,Oren Rachmil,Sindhu Padakandla,Roman Vainshtein
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Agentic AI systems powered by large language models (LLMs) are rapidly evolving into autonomous decision-making systems, exposing attack vectors beyond those of traditional LLM vulnerabilities. Existing security evaluations are often tied to specific implementations or domains, limiting unified comparison across heterogeneous systems. To address this gap, we introduce RIFT-Bench, a graph representation-driven methodology for dynamic red-teaming that enables unified evaluations across diverse agentic architectures. Building on a novel hierarchical representation, RIFT-Bench operates in two automated phases: Discovery, which extracts system structure, and Scanning, which deploys adaptive adversarial attacks and produces a comprehensive evaluation report. It evaluates the examined system itself, leveraging a broad set of dynamically adaptable adversarial probes across diverse attack vectors and objectives. We demonstrate the effectiveness of the proposed evaluation pipeline across 45 agentic systems spanning a diverse range of implementations, showing that the approach generalizes effectively to heterogeneous agentic architectures. Beyond systems and attacks, RIFT-Bench also supports direct evaluation of mitigation strategies. These key capabilities make RIFT-Bench a scalable foundation for security evaluation of agentic AI systems.
[AI-85] Catastrophic Compositional Generation: Why Vanilla Diffusion Models Fail to Extrapolate
Link: https://arxiv.org/abs/2606.23920
Authors: Duncan Soiffer,Chandler Squires,Yuan Guan,Jason Hartford,Pradeep Ravikumar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The task of compositional generation involves using a conditional generative model, trained only on a subset of the possible conditions, to produce samples from compositionally-defined target distributions such as a geometric combination of the source distributions. In this work, we argue that this task is often infeasible for vanilla conditional diffusion models: we conjecture that no inference-time technique can efficiently produce samples from the target distribution in certain well-motivated settings. This idea is supported by theory-guided generalization arguments and carefully-designed experiments on both synthetic and realistic data. In particular, while recent methods such as Feynman-Kac correction reduce inference-time approximation error, our results show that score estimation error has a more catastrophic effect on performance when the target distribution is out-of-distribution with respect to the sources, highlighting the need for a different approach to this task.
[AI-86] ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation
Link: https://arxiv.org/abs/2606.23898
Authors: Loay Mualem,Vinh Tong,Samir Darouich,Mathias Niepert
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 11 figures
Abstract:Distilling conditional diffusion models aims to transfer the behavior of a large teacher to a smaller student while preserving alignment across conditioning inputs. Unlike recognition tasks, knowledge distillation in conditional diffusion often struggles to transfer knowledge beyond the training distribution, since the predicted noise strongly depends on the conditioning signal. As a result, effective distillation requires exploring a large conditioning space. In practical settings, this creates a major bottleneck. Paired image-condition data may be limited, and generating synthetic images for every available condition is often computationally infeasible, while the pool of conditions, such as text prompts, can be extremely large. Recent work addresses this issue by switching conditions during training, exposing the student to a broader conditioning space without changing the distillation objective. Yet this raises a complementary question: once a large conditioning corpus is available, how should the training effort be allocated? In this work, we introduce ARIA, a framework that adaptively allocates training effort across coarse regions of the conditioning space. By maintaining online estimates of teacher-student discrepancy at the region level, ARIA focuses updates where misalignment persists while preserving the original distillation objective. Empirically, ARIA improves over RC across most architectures and settings, with the clearest gains observed in unseen and underrepresented regimes. We also provide a theoretical analysis showing that the proposed tracking mechanism follows the evolving discrepancy during training under bounded variance and drift assumptions.
[AI-87] JupOtter: Cell-Level Bug Detection in Jupyter Notebooks
Link: https://arxiv.org/abs/2606.23877
Authors: Lukas Ottenhof,Thibaud Lutellier
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at the 42nd International Conference on Software Maintenance and Evolution - ICSME 2026 (Research Papers Track)
Abstract:Jupyter Notebooks are an increasingly popular coding environment used across many domains, especially in Python-based data science and scientific computing. Originally used for prototyping and interactive exploration, notebooks are increasingly used to develop more complex programs, leading to a rapid rise in buggy notebooks on platforms like GitHub. To address this trend, we present JupOtter, a bug detection system designed specifically for Jupyter Notebooks. JupOtter features three novel contributions: (1) a notebook-specific tokenization strategy that preserves cell structure, (2) a cell-level bug prediction technique, and (3) a new labeled dataset, OtterDataset, containing over 21,000 notebooks annotated for fine-grained cell-level bug detection. JupOtter achieves cell-level bug detection F1 scores that surpass static analyzers and large language models in two out of three evaluation datasets.
[AI-88] MGI: Member vs Generated Inference ECCV2026
Link: https://arxiv.org/abs/2606.23872
Authors: Bihe Zhao,Michel Meintz,Juangui Xu,Franziska Boenisch,Adam Dziedzic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ECCV 2026
Abstract:As generative models increasingly produce samples that are indistinguishable from human-created content, it becomes difficult to determine whether a given data point was part of a model’s natural training set or was generated by the model itself, especially when models memorize and reproduce training data. We formalize this challenge as Member vs Generated Inference (MGI): given a sample and a target generative model, infer whether the sample is a true training member or a generated output of that model. Focusing on image generation, we show that existing membership inference methods systematically misclassify generated samples as training members, while attribution-based methods often misclassify true members as generated. This failure arises because both approaches rely on likelihood-related signals that are similarly elevated for training examples and for the model’s own outputs. To address MGI, we propose Data Circuit Breaker (DCB), a three-stage method that combines complementary signals from a generative model’s autoencoder and latent generator to distinguish training members from generated samples. Across multiple generative models, including image autoregressive and diffusion models, DCB consistently addresses the shortcomings of membership inference and attribution methods, remains effective even when models reproduce near-duplicates of training samples, and generalizes to challenging model derivative settings in which new models are trained on generated data.
[AI-89] Are Safety Guarantees in Neural Networks Safe? How to Compute Trustworthy Robustness Certifications
Link: https://arxiv.org/abs/2606.23858
Authors: Merkouris Papamichail,Konstantinos Varsos,Giorgos Flouris,João Marques-Silva
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:A primary challenge in AI safety is the existence of adversarial examples – slightly distorted inputs that cause a neural network (NN) to misclassify. To mitigate this problem, recent research focuses on the computation of robustness certifications, which, for a given input, determine the largest distortion the input may receive without breaking the network’s prediction. Robustness certifications can be interpreted as an axis-aligned hyper-rectangle (multi-dimensional intervals). Most existing approaches focus on maximizing the certification’s volume, but recent intractability results prohibit the computation of volume-optimal certifications in reasonable time. We introduce the apothem measure and show how to compute apothem-optimal certifications in a linear number of calls to a NN verifier (oracle) w.r.t. the input domain’s diameter. Moreover, we prove that we cannot have a volume-optimal, oracle-based algorithm, even if we discard the oracle costs. Also, we introduce dual certifications – an interval including all instances of a class – thus providing apothem-minimum upper bounds to a robustness certification. Further, we present the ParallelepipedoNN system, which we evaluate on the standard MNIST and Fashion MNIST benchmarks. A preliminary comparison with existing work on the same datasets reveals at least two-fold improvement w.r.t. the minimum edge length.
[AI-90] Deciphering Fingerprints of 3D Molecular Surfaces for Accurate Epitope Prediction
Link: https://arxiv.org/abs/2606.23830
Authors: Fang Wu,Weihao Xuan,Jure Leskovec,Yejin Choi,Li Erran Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Molecular surfaces encode the geometric and physicochemical patterns that determine antibody-antigen recognition, central to epitope prediction. However, existing methods rely on sequences or backbone structures and struggle to capture discontinuous, surface-driven epitopes. This study presents SurfBind, a surface-centric learning framework for epitope prediction that operates directly on molecular surface representations. SurfBind integrates geometric and physicochemical cues through a Transformer-based architecture with patch-level surface modeling, binder-aware cross-attention, and a hierarchical coarse-to-fine prediction paradigm. Experiments on challenging epitope identification benchmarks, including SAbDab and DB5.5, demonstrate that SurfBind achieves state-of-the-art performance and strong generalization across unseen antibodies and conformational states, highlighting the value of interaction-aware surface modeling for understanding the crucial mechanisms of protein-protein interactions.
[AI-91] Cryptographic certificates of validity for trustworthy AI
Link: https://arxiv.org/abs/2606.23768
Authors: Murdoch J. Gabbay
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:We propose cryptographic certificates of validity for agentic AI systems. The core idea is to formally specify a correctness or policy condition as a logical predicate, compile this predicate to a witness-checking problem over polynomial constraints, and use a succinct cryptographic proof system (and optionally zero-knowledge) to certify that the condition holds. This offers a middle ground between formal verification of source code, and cryptographic authentication. An agent’s action can be accompanied by an independently checkable proof that it satisfies an agreed formal policy, without requiring the verifier to trust the agent or to re-execute computation. We outline the approach at a high level, give the core mathematical translation, relate the proposal to proof-carrying code, zkVMs, formal methods, and agent governance, and note the specification, auditing, and deployment questions that a full implementation must answer.
[AI-92] Neuromorphic Speech Enhancement with Dual-Branch Spiking Neural Networks INTERSPEECH2026
Link: https://arxiv.org/abs/2606.23761
Authors: Taiyu Meng,Wenbin Jiang,Haoyi Zhang,Yuhan Zhou,Haibing Yin
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 3 figures, 2 tables. Submitted to Interspeech 2026
Abstract:Spiking neural network (SNN)-based neuromorphic speech enhancement has emerged as a promising paradigm due to its energy efficiency, yet it still underperforms classical artificial neural network (ANN)-based approaches owing to binary activations and the lack of well-designed network architectures. To overcome this limitation, we propose a novel dual-branch spiking neural network architecture equipped with a gated spiking unit (GSU), termed GSU-DBNet. Specifically, GSU-DBNet simultaneously models the speech magnitude spectrum and complex spectrum, predicting the corresponding magnitude and complex spectral masks. Meanwhile, a dual-path GSU module is adopted to exploit temporal and frequency information for enhanced spatiotemporal feature representation. Experiments on a popular benchmark dataset show that GSU-DBNet achieves a PESQ score of 3.04 with only 394K parameters, outperforming existing SNN-based methods while using only 4.5%–10.6% of the parameters of representative ANN-based models.
[AI-93] VeriPilot: An LLM-Powered Verilog Debugging Framework
Link: https://arxiv.org/abs/2606.23759
Authors: Yihan Wang,Cheng Liu,Jiazheng Zhang,Lei Zhang,Long Cheng,Xiaowei Li,Huawei Li
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 13 pages, 6 figures
Abstract:Verilog debugging remains one of the most time-consuming stages in digital circuit design. Recent advances in Large Language Models (LLMs) have enabled automated debugging; however, most existing approaches rely solely on test outputs and compiler feedback in an end-to-end manner, limiting their effectiveness on complex bugs. A key challenge is that the root cause of an error may be far removed from its observable outputs, making it difficult for LLMs to trace long dependency chains in code. This challenge is further exacerbated in large codebases, where long context lengths hinder efficient reasoning. To address these limitations, we propose VeriPilot, an LLM-powered debugging framework that leverages golden reference models to enable fine-grained bug localization and repair. VeriPilot goes beyond output-level comparison by aligning internal variable semantics between the Verilog design and its corresponding golden model through LLM-based analysis. It then performs step-by-step signal tracing using Control-Data-Flow Graphs (CDFGs) derived from static analysis, identifying a minimal set of suspicious code regions along with their correct counterparts from the golden model. These structured insights are subsequently provided to the LLM to guide reasoning and automated code repair. Experimental results on the Comprehensive Verilog Design Problems (CVDP) benchmark from NVIDIA demonstrate that VeriPilot improves the repair success rate of GPT-4o from 54.3% to 85.71%, significantly enhancing both bug localization accuracy and repair effectiveness for complex Verilog designs. The source code and benchmark are publicly available on GitHub at this https URL.
[AI-94] Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios
Link: https://arxiv.org/abs/2606.23758
Authors: Xiran Wang,Jian Zhang,Lei Qi,Yang Gao,Yinghuan Shi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Domain generalization learns from multiple source domains to generalize to unseen target domains. However, it often neglects the realistic case of label mismatch between source and target. Open set domain generalization is then proposed to recognize unseen classes in unseen domains. A simple approach trains one-vs-all classifiers to separate each class and detect outliers as unknown. Yet, the imbalance between few positive samples and many negative samples skews the decision boundary towards the positive ones, leading the model to over-reject out-of-distribution data, even from known classes in unseen domains. In this paper, we propose a novel meta-learning strategy called dualistic MEta-learning with joint DomaIn-Class matching (MEDIC), which considers implicit gradient matching towards inter-domain and inter-class task splits simultaneously to find optimal boundaries balanced for both domains and classes. Experimental results show that MEDIC not only outperforms prior methods in open set scenarios, but also maintains competitive closed-set generalization ability.
[AI-95] Synergizing Physically Constrained MCMC and Chemical-Informed Gaussian Processes for Reaction Network Discovery
Link: https://arxiv.org/abs/2606.23757
Authors: Runzhe Liu,Zihao Wang,Wenbo Yang,Shengyang Tao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Extracting interpretable governing equations from sparse, noisy chemical time-series data remains difficult because discrete reaction topology and continuous kinetic parameters are tightly coupled. We present PC-MCMC-CIGP, a reproducible gray-box workflow that combines spike-and-slab topology sampling, hard conservation and thermodynamic screening, and a Chemical-Informed Gaussian Process (CIGP) residual model for parameter calibration and experimental design. The methodological contribution is not a new MCMC or GP family in isolation; rather, it is the integration of these components into a physically constrained workflow with explicit uncertainty-aware acquisition choices. On the H2 + Br2 benchmark, the constrained sampler distinguishes elementary radical pathways from deceptive phenomenological fits in our experiments. On styrene epoxidation, the CIGP optimization loop improves final yield by 12.5% over the reported GP-BO baseline. A new 10-seed acquisition study shows that EI, GWU, PC-EI, uncertainty sampling, discrepancy hunting, and random search have different trade-offs: PC-EI substantially reduces low-yield BO suggestions, while EI-style criteria give the strongest final-yield performance.
[AI-96] Low-power analogue neural networks with trainable nonlinear connections for continuous control
Link: https://arxiv.org/abs/2606.23742
Authors: Ian T. Vidamour,Fernando Aguirre,Thomas J. Hayward,Matthew O. A. Ellis,Charles Swindells,Alexander McDonnell,Martin Trefzer,Finley Robins,Luca Manneschi,Susan Stepney,Tony Kenyon,Oliver J. Sutton,Jack C. Gartside,Ivan Y. Tyukin,Adnan Mehonic,Eleni Vasilaki
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Preprint. Further verification of all simulations is ongoing. Any resulting corrections will be incorporated in a revised version
Abstract:Physical neural networks promise low-power machine learning by computing directly with analogue device physics, but most architectures force nonlinear device responses to act as scalar weights. Inspired by Kolmogorov-Arnold networks, we place trainable nonlinear functions on the connections, making each physical connection a learnable computational element. Realising these functions as analogue band-pass filters on field-programmable analogue arrays, we find that the benefit is task-dependent and follows from the smoothness of the physical basis: the networks represent smooth, continuously valued targets, including robotic kinematics, continuous control, and photovoltaic maximum-power-point tracking, with far fewer nodes and connections than multilayer perceptrons, but offer no parameter-efficiency advantage on classification-like decision boundaries. Trained networks transfer to hardware across approximately 35,000 connections with quantified fidelity, and a dedicated CMOS implementation is projected to operate at approximately 30 microwatts. A memristive realisation reproduces the same behaviour in simulation, indicating that the advantage comes from placing trainable nonlinearity on connections, rather than from a particular device.
[AI-97] A Survey on Federated Causal Discovery and Inference
Link: https://arxiv.org/abs/2606.23741
Authors: Xianjie Guo,Yuwei Wang,Guodu Xiang,Xiaoli Tang,Kui Yu,Han Yu,Qiang Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 27 pages, 4 figures, 2 tables, journal
Abstract:Causal reasoning, which encompasses the discovery of causal structures and the inference of causal effects, is fundamental to data-driven decision making. In practice, data for reliable causal analysis are often distributed across institutions and cannot be centralized due to privacy regulations or communication constraints. Federated learning (FL) addresses this by enabling collaborative analysis without raw data sharing, giving rise to the rapidly growing field of federated causal discovery (FCD) and inference (FCI). However, the interdisciplinary nature of this field and the absence of a comprehensive survey present barriers to entry for researchers. This paper bridges that gap by providing a systematic review through multi-dimensional taxonomies. Grounded in the three core design decisions underlying any FCD solution, namely how structures are learned, how data are partitioned, and what structural knowledge each party obtains, we organize FCD along three axes: methodological paradigm, federation topology, and structural scope. We further examine key practical dimensions, including temporal dynamics, data heterogeneity, missing data, and non-identical variable sets. For FCI, we categorize methods by target estimand (average versus individualized/conditional treatment effects) and by estimation strategy, from classical weighting methods to modern deep generative architectures. Unlike prior works that treat FCD and FCI separately, we formalize their connection as complementary stages of a unified federated causal reasoning pipeline, where FCD supplies the structural knowledge required for valid effect estimation in FCI. Finally, we highlight their shared concerns regarding privacy, communication efficiency, theoretical guarantees, and application domains, and conclude by identifying open challenges for future research.
[AI-98] Weight-Space Geometry of Offline Reasoning Training ICML2026
链接: https://arxiv.org/abs/2606.23740
作者: Aleksandr Nikolich,Igor Kiselev,Vladimir Platonov,Karina Romanova
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted for ICML 2026 workshop
Abstract:Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they are mechanistically distinct or converge to a similar weight update. Training six methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical math rollouts from a single base model (Qwen3-4B) with attention-only LoRA, we analyze the resulting deltas via cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA. We observe: (i) SFT, RFT, and RIFT have nearly colinear weight deltas (cosine = 0.97, top-1 principal angle ~7 deg median over 144 modules) and comparable GSM8K accuracy (87-88%, n=1319; pairwise McNemar p = 0.15); (ii) DFT diverges further in direction than any reward-weighted method despite using the same data; (iii) Offline GRPO adds a substantial component orthogonal to the SFT direction (~67% globally, up to ~86% in late layers) while staying in the SFT loss basin; (iv) DPO sits in a near-orthogonal subspace, shows a mode-connectivity barrier, and collapses late-layer CKA to ~0.46. DPO also reaches the highest accuracy in our protocol on both GSM8K (93.5%, McNemar p < 10^-9 vs. each other method) and AIME26 (30.0% vs. 3.3-10.0%); its training uses a 10x smaller learning rate than the others (the standard convention), so the update-norm and accuracy gaps reflect loss-function and optimizer choices jointly, and a learning-rate-matched DPO comparison is left for future work.
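As a rough illustration of the delta comparison this abstract describes, the sketch below computes the cosine similarity between two flattened weight deltas and the share of one delta that lies orthogonal to the other. The function names and toy vectors are hypothetical, and measuring the orthogonal share by squared norm is an assumption; the paper's exact definition may differ.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened weight deltas."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def orthogonal_fraction(delta, reference):
    """Share of delta's squared norm that is orthogonal to reference
    (assumed definition, not taken from the paper)."""
    scale = dot(delta, reference) / dot(reference, reference)
    residual = [d - scale * r for d, r in zip(delta, reference)]
    return dot(residual, residual) / dot(delta, delta)

# Toy deltas (hypothetical): the second has equal parts along and
# orthogonal to the first.
sft_delta = [1.0, 0.0, 0.0]
other_delta = [1.0, 1.0, 0.0]

cos = cosine_similarity(sft_delta, other_delta)
orth = orthogonal_fraction(other_delta, sft_delta)
```

Under this squared-norm definition the orthogonal share equals one minus the squared cosine, so the two quantities carry the same information for a single pair of vectors.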
[AI-99] A Unified Framework for Runtime Verification and Model-Based Diagnosis in LOLA
链接: https://arxiv.org/abs/2606.23720
作者: Raik Hipler,Martin Leucker,Patrick Rodler
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:We present an integrated framework that unifies runtime verification and model-based diagnosis within the stream specification language LOLA. By encoding system descriptions, component health states, and observations into a single stream-based formalism, the approach enables continuous, online fault localization directly alongside fault detection, without requiring separate toolchains. The framework supports both time-invariant and transient faults, and naturally accommodates nondeterministic observations.
[AI-100] Legal Reasoning Is Not Lawyering: Rethinking Legal Benchmarks for Pro Se Access to Justice ICML2026
链接: https://arxiv.org/abs/2606.23716
作者: Andrew Lou,David Shin
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Both authors contributed equally. Accepted to the AI4Law Workshop at ICML 2026
Abstract:Legal AI benchmark research frequently invokes the assumption that large language models can improve access to justice, including for people who cannot access lawyers in order to understand and exercise their legal rights. We argue that current benchmarks are not equipped to support this assumption because they evaluate legal reasoning over inputs that have already been preprocessed by legal experts, which measures the upper bound of model performance. Access to justice depends on a lower bound: how models perform when inputs come from pro se litigants, whose prompts may contain noisy narratives, buried facts, omissions, folk-legal assumptions, and surface-level errors. These degradations are comparable to conditions under which LLMs are known to degrade in the general machine learning literature, including long-context sensitivity, underspecification, hallucination, and typographical perturbations. We connect evidence from pro se literature with this body of machine learning research and present a small perturbation experiment on LEXam, a legal benchmark, to illustrate the gap between these two bounds. If model development continues to focus on benchmarks that measure only the upper bound, this gap may remain hidden or even widen. We conclude by calling for legal benchmarks that directly measure robustness under pro se-like inputs so that access-to-justice claims about legal AI can become empirically testable.
[AI-101] FP8 is All You Need (Part 2): Efficient Ozaki-Bailey Style FFT Through Tensor-core Garner Reformulation and Kulisch Escape Route
链接: https://arxiv.org/abs/2606.23698
作者: Satoshi Matsuoka
类目: Mathematical Software (cs.MS); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注: There is an accompanying Part (1) paper also submitted to arXiv:2606.06510
Abstract:NVIDIA’s Blackwell Ultra (B300) cuts FP64 vector throughput to ~1.3 TFLOPS per GPU, roughly 30x below B200 and well below the level at which bandwidth-limited FP64 workloads stay memory-bound. The Ozaki Scheme II framework recovers FP64-equivalent throughput by routing dense matrix multiply through FP8 tensor cores with a mantissa-sliced Chinese-remainder reconstruction. A companion Part (1) paper covers dense GEMM, batched GEMV, stencils, and SpMV; this paper adds the fifth canonical primitive, the 3-D FFT. We present Ozaki-Bailey FFT, an emulated 3-D FFT via the Bailey six-step decomposition with both 1-D FFT GEMMs on FP8 tensor cores. Bailey’s small inner factor k ~ sqrt(N) (k=32 for N=1024) puts the kernel in the regime k r^2, where the third TME parameter gamma (reconstruction latency) binds rather than amortising. Garner reconstruction splits into Phase A (inner products on FP8/INT8 tensor cores, ~1 ms for 1024^3 on B300) and Phase B (per-output reduction). We identify Kulisch fixed-point complete arithmetic as a Phase B reformulation that keeps full FP64 accuracy while running entirely on the INT32 SIMT pipe. We derive closed-form bandwidth-parity floors. The native FP64 floor is 1.56*B_HBM (12.5 TF at 8 TB/s): B300’s 1.3 TF sits ~10x below, Rubin’s 33 TF within 4%. The Kulisch escape route needs an INT32 sub-floor 8.25*B_HBM and an FP8 floor 170*B_HBM; B300 meets both. The projection is ~18 ms for 1024^3 at full FP64, essentially the 12.9 ms memory roof. A GPU meets memory-roof FFT parity if it satisfies either the native floor or both Kulisch floors. If the projection holds in practice, B300 becomes viable for full-FP64 FFT through software alone, motivating a libKulisch library and benchmark campaign.
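The bandwidth-parity floors quoted in this abstract are simple multiples of HBM bandwidth, so the arithmetic can be checked directly. The sketch below uses only the coefficients and the 8 TB/s and 1.3 TF figures stated in the abstract. The function names are hypothetical, and reading the INT32 and FP8 floors in tera-operations per second is an assumption based on how the native floor is quoted.

```python
def native_fp64_floor(hbm_tb_per_s):
    """Native FP64 floor in TFLOPS: 1.56 * B_HBM, coefficient from the abstract."""
    return 1.56 * hbm_tb_per_s

def kulisch_floors(hbm_tb_per_s):
    """INT32 sub-floor and FP8 floor: 8.25 * B_HBM and 170 * B_HBM.
    Units are assumed to be tera-operations per second."""
    return 8.25 * hbm_tb_per_s, 170.0 * hbm_tb_per_s

B_HBM = 8.0        # TB/s, the bandwidth quoted in the abstract
B300_FP64 = 1.3    # TFLOPS, the B300 FP64 vector throughput quoted in the abstract

native = native_fp64_floor(B_HBM)          # about 12.5 TF
shortfall = native / B300_FP64             # roughly 10x, matching the abstract
int32_floor, fp8_floor = kulisch_floors(B_HBM)
```

At 8 TB/s this gives a native floor of about 12.5 TF, which is where the abstract's "~10x below" for B300 comes from.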
[AI-102] SemChunk-C: Semantic Segmentation for C Code
链接: https://arxiv.org/abs/2606.23697
作者: Boris Nazarov,Darya Frolova,Shaked Leibzirer,Pavel Kisilev
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 7 pages, 9 tables, 2 figures
Abstract:Semantic segmentation of code written in a C-family language remains a challenging problem, due to the language’s complex syntax, macro expansion, and irregular structural patterns. Existing chunking methods, such as fixed-sized windows, heuristic splitting, and syntax-based tools, often fail to capture meaningful functional units, limiting the efficacy of retrieval and other downstream LLM driven tasks. In this paper, we address the problem of chunking in C-related languages. First, we define a set of code chunk categories. Second, we train an LLM-based classifier to a) identify chunk boundaries, and b) assign each chunk a descriptive functional attribute (a category), which can be useful for downstream tasks. By leveraging the LLM’s ability to capture semantic context within the code, we assume flexible chunk boundaries, allowing to adapt to the specific structure and context of each instance. Third, we introduce SemChunk-C, a family of lightweight language models for semantic chunking of C-related files (.c, .cpp, .h, .cs, etc.). These models are based on the first four Ettin encoders [1] with 17M, 32M, 68M, and 150M parameters. Despite their relatively small size, they are capable of identifying cohesive code units, such as data structures, interface blocks, and other components. Furthermore, we demonstrate the robustness of our approach on real-world code, including challenging constructs such as nested definitions and macros. We test our approach on various datasets, and show that it achieves high boundary accuracy and semantic coherence, matching or outperforming chunkers that are based on much larger code-oriented LLMs. We also validate the improved performance of the downstream tasks on a few curated benchmarks. 
[AI-103] Beyond the Autoregressive Horizon: A Comprehensive Survey of Diffusion Models World Modelling and State Space Models for Code
链接: https://arxiv.org/abs/2606.23690
作者: Kishan Maharaj,Ashita Saxena,Srikanth Tamilselvam
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 14 Pages, 1 Table, 1 Figure
Abstract:Autoregressive (AR) language models have driven significant progress in automated software engineering, enabling powerful code generation and assistance systems. However, the next-token prediction paradigm introduces structural limitations for code reasoning, including restricted global planning, challenges in maintaining long-range dependencies, and limited grounding in program execution semantics. Noting the heavy skewness of existing literature towards AR models, we discuss emerging paradigms that could potentially overcome the logic and scaling bottlenecks of next-token prediction by unlocking next-generation architectural capabilities for code intelligence. Specifically, we discuss the potential of Diffusion Models, which generate code via holistic denoising that captures long-range syntactic constraints often missed by AR models. We also discuss Code World Models (CWMs), which simulate execution states to support reasoning, and State Space Models (SSMs), which provide linear-time efficiency for massive contexts. By connecting these developments with findings from cognitive neuroscience, we outline directions for developing “System 2” code generation agents.
[AI-104] Large-Language-Model Discovery of Quantum LDPC Codes through Structured Concept Evolution
链接: https://arxiv.org/abs/2606.24808
作者: Zidu Liu,Florian Marquardt
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures
Abstract:Quantum computers could outperform classical machines on important problems, but only if the errors that pervade quantum hardware can be corrected at scale. Quantum low-density parity-check (qLDPC) codes offer a promising route to this goal by combining sparse parity checks with finite encoding rate and growing distance, but their construction remains a challenging discrete design problem. Here we introduce structured concept evolution (SCE), a search framework that pairs a large language model with a structured algebraic mutation grammar to discover lifted-product code families, a class of CSS qLDPC codes. Instead of asking the LLM to design codes from first principles, SCE evolves structured concepts consisting of algebraic specifications paired with executable programs that realize them, using hierarchical mutations that modify the group algebra, protograph geometry, or base space. Running SCE, we discover a diverse set of competitive code families, ranging from abelian constructions to families over non-abelian groups beyond those underlying standard designs such as bivariate-bicycle codes, and characterize them under code-capacity depolarizing noise with BP+OSD decoding. These results are obtained with lightweight models (GPT-5.4-mini and GPT-5.4-nano).
[AI-105] DeepBD: A Grounded Agentic Workflow for Variant Prioritization and Diagnosis of Genetic Birth Defects
链接: https://arxiv.org/abs/2606.24779
作者: Shiyu Li,Ziqi Yan,Zhihao Wu,Jielong Lu,Weiran Liao,Jiajun Yu,Genjie Li,Zeyu Chu,Jiajun Bu,Haishuai Wang
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:Birth defects are a major cause of fetal loss, neonatal morbidity and long-term disability. In the subset with suspected genetic etiologies, exome and genome sequencing have moved many cases from variant detection to post-sequencing interpretation: clinicians must rank patient-specific candidate variants under incomplete fetal or infant phenotypes and heterogeneous evidence from population genetics, variant-effect prediction, gene-disease validity, phenotype ontologies, cellular and pathway context, protein structure and clinical literature. We present DeepBD, a grounded agentic workflow for variant prioritization and diagnostic interpretation of genetic birth defects. DeepBD organizes the workflow into LLM-assisted case structuring, a pretrained evidence engine, specialist evidence modules and a grounded diagnostic review layer. The evidence engine learns patient-specific variant scores from structured rule evidence, sequence and variant-effect representations and phenotype-conditioned biological context, whereas specialist modules and the agentic layer provide tool-based refinement, candidate-pool review and diagnosis-oriented synthesis from ranked candidates. Developed using an in-house fetal and infant cohort comprising 18,622 cases, DeepBD achieved Recall@1/3/5/10 of 0.658/0.882/0.912/0.929 on an internal held-out solved-case benchmark, outperforming standalone Exomiser, DeepRare and prompted LLM reranking baselines evaluated on Exomiser-derived top-20 candidate variants. Ablation and overlap analyses show that rule evidence, mechanistic context, and specialist refinement provide complementary signals. These findings support a grounded agentic workflow that separates evidence integration, tool-based refinement, and LLM-assisted diagnostic review for retrospective variant prioritization in genetic birth defects.
[AI-106] Infinitesimal Causality
链接: https://arxiv.org/abs/2606.24621
作者: Sridhar Mahadevan
类目: Category Theory (math.CT); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注: 17 pages
Abstract:This paper introduces a categorical account of infinitesimal causality in Frobenius Markov categories equipped with tangent-bundle semantics. IDC captures the infinitesimal layer in which interventions act as tangent deformations of copy/discard structure. Two distinct Frobenius structures interact: (1) the categorical Frobenius algebra on classical variables encoding copying, comparing, and discarding; and (2) the geometric Frobenius integrability condition, namely involutive closure of the intervention distribution, distinct from the algebraic Frobenius structure. Categorical causal sufficiency is defined as the compatibility of these two notions. A key observation is that, for structural causal models, infinitesimal causality is most naturally formulated in the slice of deterministic mechanisms over exogenous variables, with visible stochastic kernels obtained only after pushforward. Interventions are tangent vectors that deform the Frobenius copy/discard operations; their Lie brackets measure whether this deformation preserves classical information-flow structure. Pearl’s do-calculus is used as a guiding example of intervention identities: ignoring irrelevant interventions corresponds to counit invariance, action/observation exchange to coproduct compatibility with pushforward, and independence to involutive bracket closure of the visible intervention distribution.
[AI-107] Adaptive Machine Learning Framework for UAV Trajectory Optimization in O-RAN
链接: https://arxiv.org/abs/2606.24483
作者: Chenrui Sun,Swarna Bindu Chetty,Gianluca Fontanesi,Mahnaz Arvaneh,Walid Saad,Hamed Ahmadi
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 16 pages, 12 figures, IEEE Transactions on Vehicular Technology
Abstract:The deployment of unmanned aerial vehicles (UAV) as open radio units (O-RUs) in 6G cellular systems presents a promising opportunity to achieve scalable and adaptive network coverage. However, optimizing UAV trajectories in dynamic and unfamiliar environments remains a critical challenge, particularly due to the need for extensive retraining in each new scenario. In this paper, we introduce a novel UAV trajectory optimization framework that integrates enhanced continual transfer learning within the O-RAN architecture. The proposed system maintains a library of pre-trained models and employs a model selection mechanism to identify and transfer knowledge from the most relevant environments, minimizing adaptation time and improving efficiency. When no sufficiently similar model is available, a fallback model empowered by continuous refinements ensures baseline performance. The framework leverages real-world city maps and ray tracing techniques to enhance learning reliability and improve trajectory planning. Simulation results demonstrate that the proposed model selection-based transfer learning approach reduces convergence time by 44% to 56% compared to retraining from scratch, and up to 40% compared to traditional transfer learning without model selection.
[AI-108] Breaking Shortcut Learning for Cross-Trial EEG-Guided Target Speech Extraction via Two-Stage Training INTERSPEECH2026
链接: https://arxiv.org/abs/2606.24164
作者: Wonchul Shin,Inyong Choi,Kyogu Lee
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted by Interspeech 2026
Abstract:Recent end-to-end models for EEG-guided target speech extraction report impressive results, underscoring potential for neuro-steered hearing technologies. However, our analysis reveals that high within-trial performance can be driven by trial-specific EEG structure that acts as shortcuts for target selection, leading to poor generalization on unseen trials. To overcome this gap, we propose TRUST-TSE, a two-stage framework to mitigate shortcut learning. By introducing contrastive pretraining with attended-speaker negative sampling, we encourage the EEG encoder to capture fine-grained EEG–speech alignment while suppressing trial-identity cues. We also employ a confidence-weighted extraction objective based on EEG–source similarity to guide extraction using the learned representations. Experiments on KUL and DTU datasets show that TRUST-TSE outperforms end-to-end baselines under strict cross-trial protocols, addressing a key reliability bottleneck of existing approaches.
[AI-109] DTT-BSR: A Generative-Regression Cascade for Music Source Restoration INTERSPEECH2026
链接: https://arxiv.org/abs/2606.24127
作者: Youran Ni,Shihong Tan,Yuzhu Wang,Gongping Huang
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted by Interspeech 2026
Abstract:Music source restoration (MSR) requires jointly addressing source unmixing and the inversion of non-linear production effects. Current methods struggle to achieve accurate target signal reconstruction while maintaining semantic consistency. To address this limitation, we propose DTT-BSR+, a two-stage cascade MSR system that decouples distribution fitting from signal reconstruction into separate stages. A generative DTT-BSR separator in the first stage produces stems matching the prior of clean sources, and a modified Demucs network in the second stage enhances the first stage output using time-domain and multi-resolution spectral losses. DTT-BSR+ improves multi-mel signal-to-noise ratio (MMSNR) over the single-stage DTT-BSR across all stems, and surpasses the state-of-the-art X-LANCE MSR system on five stems. We also reveal through Fréchet Audio Distance (FAD) decomposition an implicit trade-off between signal reconstruction accuracy and semantic distribution fitting across stems.
[AI-110] Promise and challenges of heart chamber segmentation from non-contrast CT scans using contrastive unpaired image translation: a feasibility study
链接: https://arxiv.org/abs/2606.23879
作者: Jing Wang,Tong Yu,Hao-En Lu,Zixue Zeng,Joseph K. Leader,Xin Meng,Jianbing Zhu,Jiantao Pu
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:
Abstract:Purpose: To evaluate the feasibility and challenges of heart chamber segmentation from non-contrast CT scans using contrastive unpaired image translation and deep learning-based segmentation. Approach: We developed ChameleonNet, a framework utilizing the Contrastive Unpaired Translation (CUT) network with decoupled contrastive learning (DCL) loss to synthesize non-contrast CT from contrast CT scans. Using annotations of four heart chambers (left atrium (LA), left ventricle (LV), right atrium (RA), and right ventricle (RV)) from contrast scans, we trained a Hausdorff distance loss-enhanced nnU-Net on synthesized non-contrast images. The translation model was trained with 35,538 contrast-enhanced and 37,197 non-contrast CT slices. The segmentation model was trained with 292 synthesized non-contrast scans. Performance was evaluated using Dice similarity coefficient (DSC) and 95th Hausdorff distance (HD95) on 36 synthesized non-contrast scans, and volume agreement on 36 real non-contrast CT scans was assessed using Pearson correlation, mean absolute percentage error (MAPE), and mean percentage error (MPE). Results: The segmentation model achieved DSC of 0.94 (0.01), 0.91 (0.04), 0.92 (0.03), 0.93 (0.02), and HD95 of 3.63 (1.49), 5.74 (4.08), 5.18 (1.77), 5.51 (3.21) mm on synthesized non-contrast images for LA, LV, RA, and RV, respectively. On real non-contrast CT scans, Pearson correlations were 0.93, 0.82, 0.87, and 0.89 (all p < 0.001), with MAPE ranging from 9.22% to 20.79%, and MPE ranging from -12.52% to 4.67%. Conclusions: ChameleonNet demonstrated feasibility for heart chamber segmentation from non-contrast CT without manual non-contrast annotations. However, volume errors, particularly for LV and RV, indicate that further refinement and validation are needed before clinical use.
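For readers unfamiliar with the metrics named in this abstract, the sketch below gives the usual textbook definitions of the Dice similarity coefficient, MAPE, and MPE. The function names and toy values are hypothetical, and the MPE sign convention (predicted minus reference) is an assumption; the paper may define it the other way round.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient of two binary masks given as flat 0/1 lists."""
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match.
    return 2.0 * overlap / total if total else 1.0

def mape(predicted, reference):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - r) / r for p, r in zip(predicted, reference)) / len(reference)

def mpe(predicted, reference):
    """Mean (signed) percentage error, in percent; the sign convention is assumed."""
    return 100.0 * sum((p - r) / r for p, r in zip(predicted, reference)) / len(reference)

# Toy data (hypothetical): one over- and one under-estimated volume.
dsc = dice([1, 1, 0, 0], [1, 0, 0, 0])
volume_mape = mape([90.0, 110.0], [100.0, 100.0])
volume_mpe = mpe([90.0, 110.0], [100.0, 100.0])
```

In the toy data the signed errors cancel while the absolute errors do not, which is why a study can report a small MPE alongside a much larger MAPE.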
[AI-111] The Measurable Majority
链接: https://arxiv.org/abs/2606.23853
作者: Lawrence S. Moss,Arthur Paul Pedersen
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Probability (math.PR)
备注:
Abstract:This paper studies strict majority reasoning in finite electorates using so-called social decision frames: finite sets of voters equipped with distinguished families of coalitions interpreted as those voting blocs evaluated to form a strict majority. A coherence criterion for qualitative majority judgments is identified and shown to give an exact characterization for representability of strict majorities by finitely additive measures. In addition, a minimal natural logic for reasoning about strict majorities is shown to be sound and complete. These developments motivate examination of associated combinatorial questions concerning incoherence in finite families of sets; partial results and a conjecture are given. Finally, the results of this paper are applied to correct a classical representation theorem for weak qualitative probability structures due to Patrick Suppes and to establish a May-type characterization for ordinary strict majority rule for social decision frames.
[AI-112] Integrated Sensing and Communications for Real-time Avatar Control in XR over 5G
链接: https://arxiv.org/abs/2606.23771
作者: Nabeel Nisar Bhat,Javad Sameri,Rreze Halili,Rafael Berkvens,Maria Torres Vega,Jeroen Famaey
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Extended Reality (XR) presents a challenging use case for 5G and 6G networks, requiring high data rates and low-latency communication to deliver a truly immersive experience. Moreover, in order to seamlessly translate physical actions to the virtual world, accurate gesture recognition and pose estimation are required. Current XR interaction solutions based on handheld controllers and cameras cannot easily capture full-body poses, inhibit the free use of hands, and require good visibility and a clear line of sight. In this work, we propose a multimodal sensing architecture for XR that combines 5G millimeter-wave (mmWave) integrated sensing and communication (ISAC) and surface electromyography (sEMG) signals. 5G mmWave ISAC can not only be used to deliver content wirelessly to the head-mounted display (HMD), but the same communication signals can also be used to derive coarse body-level gestures and poses of the user, to support real-time avatar control. For fine-grained finger-level gestures, our architecture leverages lightweight sEMG sensors that capture forearm muscle activity. To illustrate the need for both modalities, we present evaluations of both sensing technologies. At the body level (5G), our architecture relies on power-per-beam-pair (PPBP), which can be computed from standard beam management or beam sweeping procedures of the 5G NR standard. PPBP-based sensing achieves 82.2 ± 5.9% average accuracy when evaluated on users not seen during training. For fine-grained finger-level interactions, we show that sEMG carries strong discriminative information, achieving consistently promising performance across different movement settings. Thus, combining the two modalities enables multi-scale gesture recognition, at the body level via existing 5G signals and at the finger level via lightweight sEMG sensors, forming a complete XR framework.
[AI-113] JEDEL: Zero-Shot DNA-Encoded Library Design for Early-Stage Drug Discovery
链接: https://arxiv.org/abs/2606.23745
作者: Zygimantas Jocys,Zhanxing Zhu,Henriette M. G. Willems,Katayoun Farrahi
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We present JEDEL, a framework for generating synthesis-ready DNA-encoded libraries (DELs) directly from three-dimensional pharmacophore representations of active ligands. JEDEL is the first model to map pharmacophore interaction patterns to actionable, scalable synthesis instructions, enabling the design of targeted libraries comprising potentially millions of molecules. Unlike existing generative approaches that produce virtual compounds requiring downstream synthesis planning, JEDEL operates within the space of purchasable building blocks and validated reactions, ensuring that every output is experimentally realizable by construction. JEDEL learns a predictive alignment between pharmacophore geometry and molecular structure and decodes this into combinatorial synthesis routes at scale. Across 18 protein targets, it generates focused libraries that outperform random and diversity-based baselines in predicted binding affinity, pharmacophore recovery, and sample efficiency, without target-specific retraining. JEDEL enables a shift from virtual molecule generation to experimentally deployable library design.
[AI-114] Random coloured digraphs defined by a Markov logic network
链接: https://arxiv.org/abs/2606.23715
作者: Yasmin Tousinejad,Vera Koponen
类目: Logic (math.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:A Markov Logic Network (MLN) is a probabilistic relational model used in Statistical Relational Artificial Intelligence for defining a probability distribution on the set of possible worlds with domain D for an arbitrary finite domain D. An MLN consists of soft constraints with associated weights which are nonnegative real numbers. In this study we consider a language speaking about a property P(x) and a relation R(x, y). We consider an MLN for which every Boolean combination of P(x) and R(x, y) is a soft constraint (with associated weight). Let n denote the size (cardinality) of the domain. We show that, for every choice of weights, if the weights are scaled by 1/n then, for every first-order sentence \varphi, the probability that \varphi holds tends to either 0 or 1 as n \to \infty; that is, a 0-1 law for first-order logic holds. Moreover, the limit probability does not depend on the weights. If we instead use the standard semantics of MLNs, in the case of which the weights are not scaled, then the limit behaviour is more complicated and depends on the weights. With unscaled weights we get 7 qualitatively different cases which depend on the weights. In some cases we have a 0-1 law for first-order logic, in some cases not, but we may still have a convergence law. The influence of the weights on the asymptotic probability of a first-order sentence may be in the form of a sudden "phase transition" from one of the 7 cases to another. The presence of a convergence law has positive implications for inference on large domains.
[AI-115] Audio-visual Contrastive Alignment for Diffusion-based Visual-conditioned Speech Enhancement
链接: https://arxiv.org/abs/2606.23712
作者: Colombe Mboungou(MULTISPEECH),Mostafa Sadeghi(MULTISPEECH),Jean-Eudes Ayilo(MULTISPEECH),Romain Serizel(MULTISPEECH)
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Audio-visual speech enhancement (AVSE) exploits visual cues such as lip movements to recover speech in noisy environments. Recent work introduced diffusion-based unsupervised AVSE, where a speech diffusion model conditioned on visual features via cross-attention is trained and used as a data-driven prior for posterior sampling-based speech enhancement. Despite promising performance over its audio-only counterpart, the impact of explicitly enforcing cross-modal alignment in the fusion remains unclear. In this work, we propose to augment the diffusion training objective with a contrastive audio-visual loss to encourage stronger use of visual information while keeping the posterior sampling framework unchanged. Experiments across matched and mismatched test data show consistent improvements in interference suppression, signal reconstruction, and perceptual quality, with the largest gains at low SNRs. Code is available at this https URL cexauce/AV-CA-DiffUSE
[AI-116] Coordinate-Queryable Neural Field Reconstruction for EEG Spatial Super-Resolution with Unseen-Electrode Generation
链接: https://arxiv.org/abs/2606.23707
作者: Hongjun Liu,Leyu Zhou,Zijianghao Yang,Chao Yao
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:EEG spatial super-resolution (EEGSR) in real deployments is challenged by random channel missingness, unstable electrode quality, and changing visible-channel patterns caused by bad contacts or device variability. Most existing EEGSR methods learn a fixed low-to-high channel mapping under pre-defined input-output layouts, which makes them brittle when missing channels vary at test time. In this paper, we reformulate EEGSR as learning a shared conditional scalp field from partially observed support channels. Specifically, a position-guided encoder summarizes the observed EEG channels and their coordinates into a latent condition, and a conditional implicit neural representation decoder reconstructs target EEG signals by querying this condition at desired electrode coordinates. During inference, the model directly reconstructs unseen electrode signals from the available EEG support and the queried coordinates. To strengthen the constraint of the encoded latent representation on the decoder and thereby construct a more stable scalp field consistent with the observed channels, we further introduce a fidelity-preserving channel corruption training strategy under mixed electrode states. Extensive experiments across multiple EEG datasets demonstrate the effectiveness of our framework for both random missing-channel reconstruction and strict unseen-electrode signal generation. Notably, under the strict held-out-electrode setting on AAD, our method reduces NMSE by 37.5% and improves SNR by 2.12 dB over the strongest baseline, showing its ability to synthesize signals at electrode locations never exposed during training.
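The two reconstruction metrics reported in this abstract, NMSE and SNR, can be sketched as follows. These are common signal-level definitions and the function names and toy signal are hypothetical; the paper may average over channels or trials differently, so treat this as an assumption rather than the authors' evaluation code.

```python
import math

def nmse(reference, estimate):
    """Normalized mean squared error: error energy over reference energy."""
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    signal_energy = sum(r ** 2 for r in reference)
    return error_energy / signal_energy

def snr_db(reference, estimate):
    """Reconstruction SNR in dB; under this definition it equals -10*log10(NMSE)."""
    return -10.0 * math.log10(nmse(reference, estimate))

# Toy single-channel signal (hypothetical) and a reconstruction that is
# off by 0.1 at every sample.
true_signal = [1.0, -1.0, 1.0, -1.0]
reconstructed = [0.9, -1.1, 1.1, -0.9]

channel_nmse = nmse(true_signal, reconstructed)
channel_snr = snr_db(true_signal, reconstructed)
```

With these definitions, lower NMSE and higher SNR both indicate a reconstruction closer to the held-out electrode signal.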
[AI-117] Event-Aligned Analysis of Multi-Rater Pain Assessments Using Continuous Wearable Physiology
Link: https://arxiv.org/abs/2606.23705
Authors: Saba A. Farahani, Elahe Khatibi, Thomas D. Hughes, Ariana M. Nelson, Hung Cao, Amir M. Rahmani
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures. Accepted at IEEE EMBC 2026 (Toronto, Canada, July 26-30, 2026)
Abstract:Pain is assessed differently by patients, nurses, and clinicians, yet most computational approaches assume a single ground-truth label, effectively ignoring who is doing the rating. We introduce a rater-aware, event-aligned framework that converts sparse, rater-specific pain ratings into discrete pain-change events and aligns continuous wearable physiological signals to these events, preserving rater identity throughout. Applied to multimodal wearable data collected during spine-related pain procedures, the framework identifies substantial disagreement across rater groups and provides preliminary, exploratory evidence of rater-dependent physiological differences preceding reported pain increases. These findings suggest that pain-physiology relationships may not be rater-invariant, and that aggregating assessments across raters may mask meaningful physiological patterns. A rater-aware, event-aligned perspective is therefore a promising direction for interpreting wearable data in real-world clinical pain assessment.
[AI-118] Heterogeneous 2D/1D Signal Representation Fusion for Underwater Acoustic Modulation Recognition Under Distribution Shift
Link: https://arxiv.org/abs/2606.23702
Authors: Ronglai Qian, Liang An, Xiaoyan Wang, Qing Fan, Ziwei Huang, Yang Ye
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:
Abstract:Modulation recognition systems rely on heterogeneous signal representations. 2D signal-image modalities such as time-frequency and cyclostationary maps capture structural patterns, while 1D statistical descriptors such as higher-order power spectra encode complementary cues. Under distribution shift, these modalities degrade unevenly, making robust fusion a central challenge for practical deployment. Progress is further limited by the lack of a unified evaluation protocol that systematically separates different shift types. This paper addresses both challenges through a joint benchmark-and-model study in underwater acoustic modulation recognition. UAMR-ShiftBench is the first benchmark to jointly cover in-distribution, low-SNR, unseen-environment, unseen-communication-parameter, and measured sea-trial evaluation under a single matched protocol, with two independent real-world subsets collected during two sea-trial campaigns conducted in March and November in the South China Sea. SCP-TriCA fuses STFT, cyclostationary, and P2/P4 (second- and fourth-order power spectra) modalities hierarchically: the two 2D modalities are first aligned through bidirectional cross-attention, and the 1D statistical modality is then incorporated through a sample-adaptive selective gate. On UAMR-ShiftBench, SCP-TriCA achieves 95.33% in-distribution accuracy and 74.59% simulated OOD average, outperforming the strongest baseline by 5.12 percentage points, and reaches 91.14% and 94.86% on the two sea-trial subsets, exceeding the best baseline by 15.71 and 23.00 percentage points respectively. Ablation results confirm that the gains stem from modality complementarity and the hierarchical fusion design. Code and models are available at this https URL.
[AI-119] Reentrant value fields as delayed coupled reaction-diffusion systems on finite graphs
Link: https://arxiv.org/abs/2605.03940
Authors: Karsten Bohlen
Categories: Dynamical Systems (math.DS); Artificial Intelligence (cs.AI)
Comments:
Abstract:We describe a dynamical system in which a symbolic field is coupled to a geometric field via a bipartite Hilbert-Schmidt kernel. The system is fully described by a retarded functional differential equation (RFDE) on the history space, subject to Lipschitz and small gain conditions. We show that the RFDE is well-posed under constant input and that it admits a compact global attractor. The principal subsystem $(H_L, X_R, P)$, which comprises the two primary fields as well as an executive field, is shown to be globally stable independent of delay, provided that the interfield coupling satisfies $C_{\mathcal{K}}^2 < \mu_L \mu_R$. In addition, we describe design specifications that fulfill the hypotheses of the main Theorem.
Machine Learning
[LG-0] Real vs. Complex Spectral Bases for Neural Operators: The Role of Green’s Function Alignment
Link: https://arxiv.org/abs/2606.24851
Authors: Jason Sulskis, Sathya Ravi
Categories: Machine Learning (cs.LG)
Comments: Submitted to/in consideration for the 62nd Allerton Conference on Communication, Control, and Computing
Abstract:Fourier Neural Operators (FNO) learn solution operators of partial differential equations by parameterizing global convolutions in the complex Fourier domain. For real-valued PDE solutions, the complex FFT carries representational redundancy through conjugate symmetry. We introduce the Hartley Neural Operator (HNO), the exact real-valued mirror of FNO: it replaces the FFT with the purely real Discrete Hartley Transform and learns a single real multiplier per retained spectral mode, with no complex arithmetic. Because the real Hartley spectrum is not halved by conjugate symmetry, HNO retains twice as many frequency corners as FNO but one real weight where FNO carries a complex pair, so the two operators are iso-parametric at equal width and differ only in spectral basis. Our central thesis is that the best basis is a property of the operator. Self-adjoint elliptic operators (Poisson, biharmonic) have real, symmetric Green’s functions that the real Hartley multiplier diagonalizes exactly, and HNO is favored there. Time-dependent operators carry phase, from oscillation in the wave equation to transport in advection, Burgers, and Navier-Stokes, which a real diagonal multiplier cannot represent, so FNO is favored there, and increasingly so with the operator’s phase content, leaving the phaseless heat equation as the borderline case. Training both operators identically and benchmarking across PDE classes, initial-condition families, and boundary conditions, we find an elliptic-versus-time-dependent split that is monotone in operator phase content and matches the Green’s-function theory we develop. Rather than a universal winner, our findings give a predictive rule: match the spectral basis to the symmetry of the solution operator.
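The real-multiplier idea in the abstract above can be illustrated with a minimal 1D sketch. It assumes only the textbook Discrete Hartley Transform, H[k] = sum_n x[n] (cos(2πnk/N) + sin(2πnk/N)), which equals Re(F[k]) − Im(F[k]) for the DFT F of a real signal and is its own inverse up to a factor 1/N. The function names `dht`, `idht`, and `hartley_layer` are hypothetical and this is not the authors' HNO implementation; it only shows what "one real weight per spectral mode, no complex arithmetic" can look like.

```python
import math

def dht(x):
    """Discrete Hartley Transform of a real sequence (O(N^2), for illustration)."""
    n = len(x)
    return [
        sum(x[j] * (math.cos(2 * math.pi * j * k / n) + math.sin(2 * math.pi * j * k / n))
            for j in range(n))
        for k in range(n)
    ]

def idht(h):
    """Inverse DHT: the transform is an involution up to the factor 1/N."""
    n = len(h)
    return [v / n for v in dht(h)]

def hartley_layer(x, weights):
    """Hypothetical spectral layer: scale each Hartley mode by one real weight.

    Setting a weight to zero drops that mode, which mimics mode truncation.
    """
    h = dht(x)
    return idht([w * v for w, v in zip(weights, h)])

signal = [1.0, 2.0, 0.5, -1.0]
print(hartley_layer(signal, [1.0, 0.0, 0.0, 0.0]))  # keeps only the mean mode
```

Keeping only mode 0 returns the signal mean at every position, which is a quick sanity check that the forward and inverse transforms are consistent.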
[LG-1] Dirac-Frenkel dynamics with inertia for nonlinearly parametrized solutions of evolution problems
Link: https://arxiv.org/abs/2606.24769
Authors: Matteo Raviola, Benjamin Peherstorfer
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:Even when Dirac-Frenkel dynamics determine a well-defined evolution in function space, the corresponding parameter dynamics can be non-unique or ill-conditioned for redundant nonlinear parametrizations such as neural networks or mixture models. We propose to add inertia to the Dirac-Frenkel dynamics and show that this allows useful parameter velocity information to persist from the past trajectory in directions that are weakly informed, while well-informed parameter velocity directions continue to follow the Dirac-Frenkel dynamics. We prove that the inertial formulation yields well-posed parameter dynamics and provide a posteriori error bounds. After time discretization, the method requires the solution of the same type of regularized linear least-squares problem as standard Dirac-Frenkel dynamics, but with the previous velocity appearing as an anchor. Numerical experiments demonstrate the increased robustness obtained with inertia.
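A regularized least-squares problem "with the previous velocity appearing as an anchor" could, for instance, take the form min_v ||J v − f||² + λ ||v − v_prev||². That exact form is an assumption made here for illustration, not a formula quoted from the paper. The sketch below solves it through the normal equations (JᵀJ + λI) v = Jᵀf + λ v_prev for a deliberately redundant parametrization (one equation, two parameters); `solve` and `anchored_velocity` are hypothetical helper names.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def anchored_velocity(jac, rhs, v_prev, lam):
    """argmin_v ||J v - f||^2 + lam * ||v - v_prev||^2 (assumed form)."""
    p = len(v_prev)
    gram = [
        [sum(row[i] * row[j] for row in jac) + (lam if i == j else 0.0)
         for j in range(p)]
        for i in range(p)
    ]
    b = [sum(row[i] * f for row, f in zip(jac, rhs)) + lam * v_prev[i]
         for i in range(p)]
    return solve(gram, b)

jac = [[1.0, 1.0]]   # only v1 + v2 is informed by the data
rhs = [2.0]
print(anchored_velocity(jac, rhs, [0.0, 0.0], 0.1))   # anchor at zero
print(anchored_velocity(jac, rhs, [3.0, -1.0], 0.1))  # anchor at previous velocity
```

In this toy case the data constrain only v1 + v2. With a zero anchor the solution is shrunk toward zero, whereas with the previous velocity as anchor the uninformed direction v1 − v2 is carried over unchanged, which mirrors the behavior the abstract describes.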
[LG-2] QC-SMOTE: Quality-Controlled SMOTE for Imbalanced Classification
Link: https://arxiv.org/abs/2606.24625
Authors: Parth Upman, Shreyank N Gowda
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Class imbalance poses a significant challenge in classification, where existing methods such as SMOTE often generate low-quality synthetic samples in regions with noise or class overlap. We propose QC-SMOTE, a quality-controlled oversampling framework that estimates minority sample reliability using a composite neighbourhood trustworthiness score combining local density, safe-level, and isolation from the majority class. Synthetic candidates are generated using an IPQ-guided best-of-K strategy that evaluates midpoint purity and, when required, majority clearance, with allocation guided by sample reliability and boundary informativeness. Generation behaviour adapts across overlap–imbalance regimes, adjusting interpolation range and selection criteria to match local data geometry. Low-quality synthetic samples are replaced with original minority duplicates when neighbourhood purity falls below an adaptive threshold, providing graceful degradation by reverting to duplication in severely noisy regions. Experiments on 30 imbalanced datasets using repeated stratified cross-validation show that QC-SMOTE achieves the strongest average AUC-ROC and Macro F1 among the compared oversampling methods, with particularly clear gains under moderate and severe imbalance. These results demonstrate the importance of quality-aware, geometry-adaptive synthetic sampling for robust imbalanced classification.
[LG-3] Reasoning as Attractor Dynamics: Latent Memory Retrieval via Gibbs-Weighted Energy Minimization ICLR
Link: https://arxiv.org/abs/2606.24543
Authors: Kanishk Awadhiya
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICLR Workshop 2026
Abstract:Large Language Models (LLMs) are traditionally viewed as autoregressive generators. However, from the perspective of collective computation, they function as high-dimensional Dense Associative Memories that store complex reasoning patterns as latent attractors. In this work, we investigate the energy landscape of mathematical reasoning. We posit that correct reasoning chains correspond to deep, wide attractor basins (“flat minima”) in the model’s output distribution, whereas hallucinations manifest as sharp, unstable local minima. To exploit this geometry, we introduce a retrieval mechanism based on a Gibbs measure of the trajectory’s spectral entropy. By sampling multiple reasoning paths and weighting them by their inverse energy ($P \propto e^{-\beta E}$), we approximate the equilibrium distribution of the associative memory, effectively “relaxing” the system into a robust solution. Empirically, this physics-inspired mechanism improves Microsoft Phi-3.5 performance on GSM8K by 5.38% (from 84.7% to 90.1%), demonstrating that inference is better modeled as a dynamic settling process into an attractor basin rather than greedy next-token prediction.
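The weighting rule P ∝ e^{−βE} from the abstract above can be sketched as a weighted vote over sampled answers. The sketch assumes each sampled reasoning path already comes with a scalar energy; how the paper derives that energy from spectral entropy is not reproduced here, and `gibbs_weighted_vote` is a hypothetical name.

```python
import math
from collections import defaultdict

def gibbs_weighted_vote(answers, energies, beta=1.0):
    """Aggregate sampled answers with weights exp(-beta * energy).

    Returns the answer with the largest total weight and the normalized
    weight of every distinct answer. beta = 0 reduces to majority voting.
    """
    if not answers or len(answers) != len(energies):
        raise ValueError("answers and energies must be non-empty and aligned")
    e_min = min(energies)  # shift energies for numerical stability
    totals = defaultdict(float)
    for answer, energy in zip(answers, energies):
        totals[answer] += math.exp(-beta * (energy - e_min))
    z = sum(totals.values())
    probs = {answer: weight / z for answer, weight in totals.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Three sampled paths: two agree on "42" but have higher (assumed) energy.
best, probs = gibbs_weighted_vote(["42", "42", "17"], [1.0, 1.2, 0.2], beta=5.0)
print(best, probs)
```

The inverse temperature β controls how strongly low-energy paths dominate: at β = 0 every path counts equally, and as β grows the vote concentrates on the lowest-energy answer.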
[LG-4] Data Augmentation: A Fourier Analysis Perspective COLT2026
Link: https://arxiv.org/abs/2606.24418
Authors: Behrooz Tahmasebi, Melanie Weber, Stefanie Jegelka
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 42 pages, 1 figure. Published at COLT 2026
Abstract:Data augmentation is a simple and model-agnostic approach for exploiting known invariances in learning problems. Given a group acting on the input space, one augments the training set with transformed copies of each sample. Because it exploits symmetries without modifying the underlying learning algorithm, data augmentation can be applied broadly across learning methods. However, this universality comes at a computational cost: when the group is large, full group-sized augmentation quickly becomes computationally infeasible. This raises a fundamental question: Can partial data augmentation achieve the same statistical benefits as full augmentation in terms of generalization and sample complexity? We develop a general framework for investigating this question using Fourier analysis and the representation theory of finite groups. We show that, for a broad class of classical learning problems, partial data augmentation based on a randomly sampled subset of group elements achieves the same minimax rates as full augmentation, up to an approximation error that vanishes as the subset size increases. Our results provide a theoretical explanation for why partial augmentation can retain the statistical benefits of full augmentation despite enforcing symmetry only approximately, and shed light on a recently raised question in learning with symmetries: whether statistically optimal learning under general group invariances can be achieved using computationally scalable methods. Moreover, we prove a complementary impossibility result: enforcing exact invariance via data augmentation requires averaging over the entire group, and cannot be achieved by any strict subset when the hypothesis space is sufficiently expressive. Together, these results provide a unified perspective on full and partial data augmentation, as well as exact and approximate symmetry enforcement.
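The distinction between exact and approximate symmetry enforcement can be made concrete with a toy example. The sketch below is a simplification and not the paper's setting: instead of augmenting a training set, it averages a fixed predictor over group elements, with the cyclic shifts of a length-4 vector as the assumed group. Averaging over the full group gives an exactly shift-invariant predictor, while a strict subset generally does not. The helper names `shift`, `averaged_predictor`, and `sample_shifts` are hypothetical.

```python
import random

def shift(x, s):
    """Cyclically shift a list by s positions."""
    s %= len(x)
    return x[s:] + x[:s]

def averaged_predictor(f, shifts):
    """Average f over the given group elements (here: shift amounts)."""
    def g(x):
        return sum(f(shift(x, s)) for s in shifts) / len(shifts)
    return g

def sample_shifts(n, k, seed=0):
    """Randomly pick k of the n cyclic shifts (partial augmentation)."""
    return random.Random(seed).sample(range(n), k)

weights = [1.0, 2.0, 3.0, 4.0]

def linear(x):
    return sum(w * xi for w, xi in zip(weights, x))

x = [1.0, 0.0, 0.0, 2.0]
full = averaged_predictor(linear, range(4))                 # whole group
partial = averaged_predictor(linear, sample_shifts(4, 2))   # random subset

print(full(x), full(shift(x, 1)))        # identical: exact invariance
print(partial(x), partial(shift(x, 1)))  # may differ: approximate only
```

For this linear predictor the full average collapses to the mean weight times the sum of the inputs, so any shift of the input yields the same value.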
[LG-5] Natural Identifiers for Privacy and Data Audits in Large Language Models ICLR2026
Link: https://arxiv.org/abs/2606.24408
Authors: Lorenzo Rossi, Bartłomiej Marek, Franziska Boenisch, Adam Dziedzic
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026
Abstract:Assessing the privacy of large language models (LLMs) presents significant challenges. In particular, most existing methods for auditing differential privacy require the insertion of specially crafted canary data during training, making them impractical for auditing already-trained models without costly retraining. Additionally, dataset inference, which audits whether a suspect dataset was used to train a model, is infeasible without access to a private non-member held-out dataset. Yet, such held-out datasets are often unavailable or difficult to construct for real-world cases since they have to be from the same distribution (IID) as the suspect data. These limitations severely hinder the ability to conduct scalable, post-hoc audits. To enable such audits, this work introduces natural identifiers (NIDs) as a novel solution to the above-mentioned challenges. NIDs are structured random strings, such as cryptographic hashes and shortened URLs, naturally occurring in common LLM training datasets. Their format enables the generation of unlimited additional random strings from the same distribution, which can act as alternative canaries for audits and as same-distribution held-out data for dataset inference. Our evaluation highlights that indeed, using NIDs, we can facilitate post-hoc differential privacy auditing without any retraining and enable dataset inference for any suspect dataset containing NIDs without the need for a private non-member held-out dataset.
[LG-6] RE4: Transformation-aware Imitation of Object Interactions Using Manipulation Modes
Link: https://arxiv.org/abs/2606.24403
Authors: Arsh Chawla, Rahul Shome
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, appendix
Abstract:Object interaction tasks have been a focus of advances in imitation learning. End-to-end methods, dominated by diffusion and flow-based variants, have shown leaps in performance while sacrificing interpretability. Object-centric and pose-informed variants have had a role in learning from demonstration in manipulation tasks. In this paper, we revisit a few modern imitation learning benchmarks for object interactions, with the aim of composing a framework that repurposes principled theories of manipulation, preserving both performance and interpretability. For image observations, lightweight training is proposed for model-free pose estimation of the target object, using self-supervision over the demonstration data available for imitation learning. This information is then used to inform a manipulation mode-aware retrieval of a demonstration, a mode-aware transformation, a replan step that connects to the retrieval point while preserving mode constraints, and finally rolling out the transformed demonstration. These compose four key steps of the proposed RE4 framework, evaluated over state-based and image-based benchmarks in Push-T and Robomimic. An adversarial benchmark that evaluates sparse data regions of image-based Push-T showcases the robustness, further bolstered by indications from low-data regime experiments. The current work shows promise in using simple interpretable building blocks to learn manipulation skills.
[LG-7] Parallel Manifold Steering: Efficient Adaptation of Large Associative Memories via Residual Energy Shaping ICLR
Link: https://arxiv.org/abs/2606.24396
Authors: Kanishk Awadhiya
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICLR Workshop 2026
Abstract:Large Transformer models function as Dense Associative Memories (DAMs), retrieving knowledge via high-dimensional attractor dynamics driven by the self-attention mechanism. However, adapting these frozen memory systems to new tasks presents a fundamental “Plasticity-Stability” dilemma. Current methods either risk catastrophic interference by modifying synaptic weights directly (e.g., LoRA) or degrade associative capacity by clogging the retrieval buffer with static prompt tokens (e.g., VPT). In this work, we propose H-Res (Hierarchical Residual Steering), a mechanism that modulates the effective energy landscape of the Transformer without altering its global equilibrium or expanding its sequence length. By formulating adaptation as a control problem on the activation manifold, H-Res learns a state-dependent vector field that steers token trajectories into task-specific basins of attraction. We formally prove that H-Res preserves the attention entropy of the foundation model and facilitates Neural Collapse. Empirically, Manifold Steering outperforms global weight modification by 26% on associative retrieval tasks and eliminates the computational overhead of prompt-based methods, scaling effectively to structured domains.
[LG-8] Managing Task Execution for Unknown Workloads in Batteryless IoT: A Hardware-Agnostic Evaluation
Link: https://arxiv.org/abs/2606.24340
Authors: Samer Nasser, Henrique Duarte Moura, Ritesh Kumar Singh, Maarten Weyn, Jeroen Famaey
Categories: Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Sustainable Computing
Abstract:In recent years, the Internet of Things (IoT) paradigm has been shifting toward batteryless, energy-harvesting architectures. Sustaining reliable operation in these systems requires intelligent management of highly volatile stored energy. As edge applications grow in complexity, traditional energy-aware schedulers struggle with unpredictable workloads due to their reliance on static execution thresholds or pre-measured, hardware-specific task profiles. To overcome this, we propose two novel, hardware-agnostic dynamic scheduling strategies treating applications as a “black box,” requiring no prior energy information: a model-free Reinforcement Learning (RL) agent and an on-the-fly Approximated Prediction (AP) method. We evaluate these methods against an adaptive task rate approach (AsTAR) and optimized static thresholds using a custom-built, physically accurate simulation framework driven by real-world solar data and dynamic LoRa transmission profiles. Rather than claiming universal superiority, our analysis exposes the distinct operational trade-offs of each method: the AP approach delivers lightweight, near-oracle task throughput; the RL agent provides tunable survival-execution balancing; and AsTAR excels at execution pacing across long energy gaps. Finally, we demonstrate that while these advanced strategies provide critical resilience for severely constrained systems with small capacitors, devices with larger energy buffers can efficiently rely on simpler, less computationally expensive static policies.
[LG-9] Deep numerical schemes for systems of Ergodic BSDEs with applications to regime-switching forward utilities
Link: https://arxiv.org/abs/2606.24271
Authors: Guillaume Broux-Quemerais (LMM), Sarah Kaakai (LAGA), Anis Matoussi (LMM), Wissal Sabbagh (LMM)
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:In this paper, we introduce two neural-network-based numerical schemes for solving systems of coupled ergodic Backward Stochastic Differential Equations (eBSDEs), motivated by the approximation of optimal strategies within the framework of forward utilities in a regime-switching stochastic factor model. Our approach builds on the representation of such models through systems of eBSDEs introduced in [HLT20]. We first establish a link between the solution of the system of ergodic BSDEs and that of an associated multidimensional BSDE with random terminal time, given by the hitting time of the positive recurrent stochastic factor. Building on this representation, we introduce a locally additive deep learning scheme obtained by minimizing aggregated local error terms. We then present a new Deep Galerkin Method (DGM) inspired algorithm that minimizes the residual of the associated ergodic PDE system, relying on a representation of the ergodic cost. Finally, we apply this framework to regime-switching forward utilities in a stochastic factor model. We first derive a general consistency SPDE that characterizes regime-switching forward utilities and retrieve their representation with systems of ergodic BSDEs in the homothetic case. Numerical experiments demonstrate the performance of the proposed methods, with a particular focus on the impact on forward preferences of taking into account regime switches.
[LG-10] Project Ariadne: Prompt-Conditioned Route Generation for Synthesis Planning
Link: https://arxiv.org/abs/2606.24184
Authors: Anton Morgunov, Victor S. Batista
Categories: Machine Learning (cs.LG)
Comments: Code is available at this https URL
Abstract:Retrosynthetic planning seeks to connect a target molecule to commercially available starting materials through a multistep route. Classical planners construct such routes by iteratively applying single-step reaction models within a search procedure; constrained variants often require specialized algorithms or architectural changes. Direct route generation reframes retrosynthesis as sequence generation, but existing direct-generation methods still train separate models for different planning specifications. We introduce Ariadne, a decoder-only route generator that represents the target, optional constraints, and route in one prompt-completion sequence. On the RetroCast/PaRoutes mkt-cnv-160 benchmark family, one 24-layer checkpoint follows route-depth and required-starting-material prompts: adding the corresponding prompt fields raises Solv-0 by 13.7 points for depth constraints and 31.2 points for required-leaf constraints. Ariadne also improves over DESP, a bidirectional search planner, on required-leaf Top-10 and Solv-0 in 24 GPU-minutes versus 6.8 GPU-hours. On standard reconstruction, Ariadne is comparable to DMS Explorer XL at about half the reported inference time. Across additional target-only benchmarks, Ariadne’s clearest gains are on route-holdout reconstruction, whereas AiZynthFinder MCTS remains stronger on several Solv-0 comparisons. These results extend sequence generation from specialist retrosynthesis models to prompt-conditioned structural route generation. We release the codebase and training scripts to support further work, but do not introduce Tier-1–3 route checkers; those remain the main bottleneck before models of this kind can become useful to experimental chemists.
[LG-11] AsyncOPD: How Stale Can On-Policy Distillation Be?
Link: https://arxiv.org/abs/2606.24143
Authors: Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjun Kang, Sanghyun Park, Donghoon Kim, Minjae Lee, Minseo Kim, Rishabh Tiwari, Yuchen Zeng, Hyung Il Koo, Kangwook Lee
Categories: Machine Learning (cs.LG)
Comments: Code: this https URL
Abstract:On-policy distillation (OPD) trains a student on its own rollouts guided by teacher feedback and is becoming increasingly important for large language model (LLM) post-training. Like reinforcement learning (RL), however, OPD faces an on-policy systems bottleneck, as rollouts can dominate training time for reasoning workloads. Asynchronous training pipelines can alleviate this bottleneck by decoupling rollout generation from learner updates, but doing so introduces stale-policy data. While prior work has studied stale data in asynchronous RL, its effects in OPD remain underexplored. We present the first systematic study of staleness in asynchronous OPD, focusing on a practical setting where teacher feedback is implemented through local KL losses and full-vocabulary teacher logits are too expensive to store or transfer, necessitating finite teacher-score caches. We first show that KL direction changes the stale-data problem: teacher-weighted forward KL is more robust to stale rollouts, whereas student-weighted reverse KL is vulnerable. Second, for this vulnerable reverse-KL case, we study whether methods designed to stabilize asynchronous RL can mitigate OPD staleness. In our experiments, they do not improve over a simpler OPD-specific surrogate: recomputing the reverse-KL signal under the current student at learner time. Third, we analyze how finite teacher-score caches create a bias-variance tradeoff for sparse and sampled reverse-KL OPD estimators. This motivates multi-sample Monte Carlo (MC), which preserves MC correctability while reducing one-sample variance. Finally, we present and open-source AsyncOPD, a fully asynchronous OPD training pipeline built from these estimator choices. Experiments show that AsyncOPD improves training throughput by $1.6\times$ to $3.8\times$ over strict synchronous training while reaching comparable accuracy.
[LG-12] A Time-Reparameterized Cumulative Intensity Extrapolation Sampler for Discrete Flow Matching
Link: https://arxiv.org/abs/2606.24140
Authors: Feiyang Fu, Hehe Fan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Discrete flow matching (DFM) provides a principled framework for generative modeling on discrete state spaces via continuous-time Markov chain dynamics. In practice, sampling for DFM commonly employs discretizations such as $\tau$-leaping, yet efficient sampling methods under a limited number of function evaluations (NFE) remain less studied. To address this gap, we propose the Time-Reparameterized Cumulative Intensity Extrapolation (TR-CIE) sampler, which aims to improve sampling quality when function evaluations are restricted. TR-CIE consists of two components. First, a schedule-based time reparameterization rescales the time grid according to the noise schedule. Under standard factorized DFM rate parameterizations, this transformation of variables absorbs the schedule-dependent growth term and mitigates stiffness near the terminal sampling stage. Second, we introduce a cumulative-intensity extrapolation updating rule. By reusing cached model outputs from the previous step as a history term, this improves the approximation of stepwise cumulative intensities on the resulting non-uniform time grid. We provide a theoretical analysis that bounds the local approximation error of cumulative intensities and establishes convergence results. The resulting sampler requires one NFE per step and introduces no additional model evaluations compared to the standard $\tau$-leaping sampler. Extensive experiments on synthetic tasks, text generation, and text-to-image benchmarks demonstrate that our method improves sampling quality under limited NFE.
[LG-13] FedUP: One-Shot Federated Unlearning via Centroid-Guided Plug-in Filters
Link: https://arxiv.org/abs/2606.24113
Authors: Feihong Nan, Zhengyi Zhong, Pan Wang, Weidong Bao, Xiongtao Zhang, Quan Wen, Ji Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Federated unlearning (FU) is critical for complying with legal mandates like the right to be forgotten in decentralized systems, yet current methods face a persistent dilemma between non-target knowledge loss and high request latency. To resolve these issues, we propose FedUP, a one-shot federated unlearning framework utilizing lightweight pluggable filters that act as a “knowledge funnel” to screen out target data while preserving original model performance. By freezing original model parameters and training filters at the server side using differentially private (DP)-protected class centroid samples, FedUP bypasses the need for multi-round client-server communication and complex retraining, reducing unlearning latency from minutes to mere seconds. Additionally, the framework’s pluggable architecture ensures inherent reversibility, enabling the seamless restoration of forgotten knowledge by simply removing the filters. Extensive experiments on diverse image and text tasks demonstrate that FedUP effectively reduces non-target knowledge loss and achieves superior unlearning precision and efficiency across various scenarios. Code is available at: this https URL.
[LG-14] NeuroSonic: Conditional Flow Matching for EEG-to-Speech Reconstruction MICCAI2026
Link: https://arxiv.org/abs/2606.24087
Authors: Wenhao Gao, Yifan Wang, Yijia Ma, Carl Yang, Wen Li, Chenyu You
Categories: Machine Learning (cs.LG)
Comments: Accepted by MICCAI 2026
Abstract:Reconstructing continuous speech from scalp electroencephalography (EEG) remains fundamentally challenging. EEG provides a weak, spatially diffuse, and highly variable measurement of distributed cortical activity, whereas speech is organized as a coherent acoustic trajectory with strong harmonic and temporal structure. The resulting mismatch makes waveform regression unstable and causes stochastic multi-step generation to be sensitive to artifact-dependent conditioning and subject variability. We introduce NeuroSonic, a conditional flow-matching framework for EEG-to-speech reconstruction. Instead of predicting waveforms directly or refining them through stochastic denoising, NeuroSonic learns a deterministic probability-flow velocity field that transports a noise-corrupted acoustic state toward clean speech under EEG conditioning. EEG and audio are embedded into a shared token space and processed by a time-conditioned gated Transformer that parameterizes the transport ordinary differential equation. This formulation models trajectory evolution explicitly while avoiding iterative stochastic sampling. We evaluate NeuroSonic on the CineBrain and EAV benchmarks under cross-subject evaluation. Across both datasets, the proposed method improves distributional realism, spectral fidelity, and perceptual quality over representative GAN-, diffusion-, and mean-flow baselines, with up to a 26.3% gain in overall perceptual quality. The performance gap is most evident in artifact-heavy segments, where conditioning variability is strongest. These findings indicate that deterministic conditional transport provides a stable and effective formulation for EEG-driven speech reconstruction. Code is available at this https URL.
[LG-15] Information-Theoretic Classifier-Free Guidance with Adaptive Schedule Optimization
Link: https://arxiv.org/abs/2606.24025
Authors: Haobo Chen, Xiangxiang Xu, Yuheng Bu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models have achieved strong performance in image, text-to-image, and video generation, where conditional generation is often controlled by classifier-free guidance (CFG). CFG improves condition consistency by increasing a guidance weight, but stronger guidance typically reduces diversity and distributional coverage. It remains unclear how this consistency-coverage trade-off should be controlled across the reverse trajectory, since the distribution induced by CFG is not simply the fixed-time tilted distribution given by the guided score field. To address this issue, we propose an information-theoretic framework for CFG schedule optimization. Our approach uses a clean endpoint reference to specify the desired consistency-coverage trade-off, while optimizing the actual distribution induced by the guided sampler toward this reference. We derive trajectory-level formulas to estimate the objective from samples and score evaluations, avoiding explicit density estimation. On ImageNet-512 with EDM-XXL and COCO with SD-XL, the learned schedules achieve competitive or improved trade-offs over constant guidance and allocate guidance selectively across noise levels.
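For readers unfamiliar with the mechanism being scheduled, the standard classifier-free guidance combination can be sketched as follows. The convention used here is guided = uncond + w · (cond − uncond), so that w = 1 recovers the conditional prediction; other papers write the same rule with 1 + w. The `linear_schedule` function is a purely hypothetical stand-in for a time-varying weight and is not the learned schedule from the paper.

```python
def guided_prediction(uncond, cond, weight):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one."""
    return [u + weight * (c - u) for u, c in zip(uncond, cond)]

def linear_schedule(t, w_start, w_end):
    """Hypothetical schedule: weight w_start at t = 1 (pure noise),
    moving linearly to w_end at t = 0 (clean sample)."""
    return w_end + t * (w_start - w_end)

uncond = [0.0, 1.0]   # toy model outputs without the condition
cond = [1.0, 3.0]     # toy model outputs with the condition
for t in (1.0, 0.5, 0.0):
    w = linear_schedule(t, w_start=1.0, w_end=4.0)
    print(t, w, guided_prediction(uncond, cond, w))
```

A constant schedule corresponds to `w_start == w_end`; the question studied in the abstract is how the weight should vary across noise levels instead.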
[LG-16] You Don't Need to Run Every Eval
链接: https://arxiv.org/abs/2606.24020
作者: Yuchen Zeng,Dimitris Papailiopoulos
类目: Machine Learning (cs.LG)
*备注: 42 pages, 23 figures and tables
Abstract:A modern model release reports scores on 40+ benchmarks, and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to run every eval? We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 of 11,172 cells filled, 23.3%) and find it is approximately rank-2: a model’s scores across all 133 benchmarks are largely determined by just two numbers. We confirm this in two ways: scores hidden from the matrix are best recovered using two factors, and two factors already explain over 90% of the variation among models on the benchmarks they share. Building on this, we design BenchPress: a logit-space rank-2 matrix completion method that recovers held-out scores to within 4.6 points, and a confidence layer that says when each prediction can be trusted. Using BenchPress, we find a subset of five benchmarks (GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1) that can recover the rest of a model’s public scorecard to within 3.93 points. For a tighter inference budget, a cheaper set (GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026) can predict a model’s evals to within 4.55 points. We release the score matrix, the BenchPress code, and an interactive tool that predicts any model’s score on any benchmark.
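The idea of logit-space low-rank completion can be illustrated with a small sketch. This is not the released BenchPress code: the alternating-least-squares solver, the ridge term, the clipping constant, and the function name `complete_low_rank` are all assumptions chosen for illustration, and scores are assumed to be rescaled to the interval [0, 1].

```python
import numpy as np

def logit(p, eps=1e-3):
    p = np.clip(p, eps, 1.0 - eps)  # keep the transform finite at 0 and 1
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def complete_low_rank(scores, mask, rank=2, n_iters=100, ridge=1e-2, seed=0):
    """Fill a (models x benchmarks) matrix from observed cells (mask == True)."""
    rng = np.random.default_rng(seed)
    Z = logit(np.where(mask, scores, 0.5))      # unobserved cells are never read
    n_models, n_bench = Z.shape
    U = rng.normal(scale=0.1, size=(n_models, rank))
    V = rng.normal(scale=0.1, size=(n_bench, rank))
    reg = ridge * np.eye(rank)
    for _ in range(n_iters):
        for i in range(n_models):               # ridge regression for each model's factors
            obs = mask[i]
            U[i] = np.linalg.solve(V[obs].T @ V[obs] + reg, V[obs].T @ Z[i, obs])
        for j in range(n_bench):                # ridge regression for each benchmark's factors
            obs = mask[:, j]
            V[j] = np.linalg.solve(U[obs].T @ U[obs] + reg, U[obs].T @ Z[obs, j])
    return sigmoid(U @ V.T)                     # map predictions back to score space
```

Working in logit space keeps predictions inside the valid score range after the inverse transform, which is presumably one motivation for the choice named in the abstract.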
[LG-17] A Comparative Study of Bayesian Contextual Bandits for Real-Time Warehouse Sorter Optimization
链接: https://arxiv.org/abs/2606.23977
作者: Tina Dongxu Li,Mouhacine Benosman,Ken Meszaros,Trevor Dardik
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted at 2026 IEEE International Conference on Mechatronics and Automation (IEEE ICMA 2026)
Abstract:Efficient sorter diversion control of automated material handling systems (MHS) is critical for optimizing operational efficiency in large-scale warehouse environments. In this study, we use an inbound receiving sorter at a high-volume e-commerce warehouse as our primary use case, where the sorter diversion system relies on cost functions with static weight configurations that fail to adapt to highly dynamic system contexts, such as volume mode, congestion level, equipment physical status, and upstream/downstream dependencies. To address this real-time sorter diversion optimization challenge, we conducted a comparative study of three candidate hybrid machine learning frameworks: Linear Regression with Gradient Descent Optimization (LR+GDO), XGBoost with Bayesian Optimization (XGB+BO), and Bayesian Contextual Bandits (BCB). Model training and evaluation were enabled by leveraging a high-fidelity physics-aware emulator to overcome the cold-start problem and allow a safe transition from offline to online learning. We performed comprehensive evaluations including reward model predictive accuracy, contextual sensitivity, action distribution, and projected reward uplift. Our results demonstrate that while tree-based reward models offer slightly better predictive power, the BCB framework achieved overall higher performance with 2.03% reward uplift over the heuristic baseline. Furthermore, BCB exhibits several superior characteristics, such as its decisive time-optimal policy backed by Bang-Bang control theory, continuous online learning capability, strategic balance between exploration and exploitation, and significantly shorter inference latency. These results demonstrate the potential of the BCB framework for real-time control optimization in large-scale warehouse environments, motivating further investigation toward operational deployment.
[LG-18] Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets
链接: https://arxiv.org/abs/2606.23961
作者: Duc Duong,Hoang Anh Duy Le,Jianwen Xie,Anshumali Shrivastava,Zhaozhuo Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Long-context and agentic LLM workloads push the KV cache past any fixed memory budget, forcing the inference stack to permanently evict tokens at every step of a continuous-inference stream. Existing methods all share the same template, a per-step direct-attention score followed by deterministic top-$K$ selection, which converts a single below-cutoff step into an irreversible verdict and permanently erases any subtly important token that direct attention cannot single out from noise. To address this challenge, we propose Nexus Sampling, a training-free eviction method that pairs Nexus scoring, an iterative walk over direct attention that surfaces bridge tokens, with weighted reservoir sampling, which retains tokens with inclusion probability in place of deterministic top-$K$. Theoretically, we show that Nexus Sampling dominates deterministic top-$K$ in long-run survival of subtly important tokens. Empirically, at 80% KV cache eviction, Nexus Sampling matches dense attention within 1% on LongBench while outperforming top-$K$ baselines on retrieval-heavy tasks, with up to 10x smaller per-sequence cache memory.
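Weighted reservoir sampling, the second ingredient named above, can be sketched with the classic Efraimidis-Spirakis A-Res scheme. Whether the paper uses this exact variant is an assumption, and the Nexus scoring that would supply the weights is not reproduced here; `weighted_reservoir_sample` is an illustrative name.

```python
import heapq
import math
import random

def weighted_reservoir_sample(items, weights, k, seed=None):
    """A-Res: one pass over a stream, keeping the k items with the largest keys u**(1/w)."""
    if k <= 0:
        return []
    rng = random.Random(seed)
    heap = []  # min-heap of (log_key, index, item); the smallest key is evicted first
    for index, (item, w) in enumerate(zip(items, weights)):
        if w <= 0:
            continue                      # zero-weight items are never retained
        u = 1.0 - rng.random()            # uniform in (0, 1], so log(u) is finite
        log_key = math.log(u) / w         # log of u**(1/w), same ordering as the key
        if len(heap) < k:
            heapq.heappush(heap, (log_key, index, item))
        elif log_key > heap[0][0]:
            heapq.heapreplace(heap, (log_key, index, item))
    return [item for _, _, item in heap]
```

Unlike deterministic top-$K$, a low-weight item still has a nonzero chance of surviving each pass, which is the property the abstract contrasts with an "irreversible verdict".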
[LG-19] Learning the Koopman Operator using Attention Free Transformers
链接: https://arxiv.org/abs/2606.23957
作者: Mohammed Nagdi,Evangelos-Marios Nikolados,Alexey Yermakov,Mars Gao,Nathan Kutz,Filippo Menolascina
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Molecular Networks (q-bio.MN)
*备注: 28 pages, 10 figures, 9 tables. Code: this https URL
Abstract:Learning Koopman operators with autoencoders enables linear prediction in a latent space, but long-horizon rollouts often drift off the learned manifold, leading to phase and amplitude errors on systems with switching, continuous spectra, or strong transients. We introduce two complementary components that make Koopman predictors more robust. First, we add an attention-free latent memory (AFT) block that aggregates a short window of past latents to produce a corrected latent before each Koopman update. Unlike multi-head attention, AFT operates in linear time and adds only $\approx 30$k parameters ($3d^2 + T^2$, fewer than matched multi-head attention), yet captures the local temporal context needed to suppress error divergence. Second, we propose dynamic re-encoding: lightweight, online change-point triggers (EWMA, CUSUM, and sequential two-sample tests) that detect latent drift and project predictions back onto the autoencoder manifold. Across three benchmark systems (Duffing oscillator, Repressilator, and IRMA), our model consistently reduces error accumulation compared to a Koopman autoencoder and matched-capacity multi-head attention. We also compare against GRU and Transformer autoencoders, evaluated both from initial conditions and with a 50-step context, and find that Koopman+AFT (with optional re-encoding) attains markedly lower long-horizon error while maintaining lower inference latency. We report improvements over horizons up to 1000 steps, together with ablations over trigger policies. The result is a fast, compact predictor that stays on the learned manifold over long horizons.
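One of the change-point triggers listed above, CUSUM, can be sketched as a one-sided cumulative-sum detector. Which statistic the authors monitor is not stated in the abstract; the monitored `values`, the default `drift` and `threshold`, and the function name are assumptions made for illustration.

```python
def cusum_first_alarm(values, target=0.0, drift=0.25, threshold=2.0):
    """Return the index of the first upward-shift alarm, or None if no alarm fires."""
    s = 0.0
    for t, x in enumerate(values):
        # Accumulate evidence above target + drift; reset to zero when it goes negative.
        s = max(0.0, s + (x - target) - drift)
        if s > threshold:
            return t  # a re-encoding step would be triggered here
    return None

# A stream that shifts from 0 to 1 at index 10.
stream = [0.0] * 10 + [1.0] * 10
alarm_at = cusum_first_alarm(stream)
```

The `drift` allowance trades detection delay against false alarms: a larger value ignores small fluctuations but reacts more slowly to a genuine shift.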
[LG-20] DREG: A Layer-Wise Jacobian Regularization as a General-Purpose Penalty
链接: https://arxiv.org/abs/2606.23942
作者: Rowan Martnishn
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present a large-scale empirical study isolating the contributions of the Derivative Regularization penalty (DREG). Across a fully-crossed factorial sweep of 960 experiments spanning 4 activations, 6 regularizers, 8 datasets, and 5 random seeds, we ask: when, where, and why does DREG work? Our results establish three principal findings. First, DREG achieves the highest overall and clean-regime accuracy among all regularizers evaluated (significantly so against the unregularized baseline, Weight Decay, and IGPen; Wilcoxon $p \leq 0.031$). It ranks second in noise robustness behind Spectral Normalization (SN); these two are the only layer-wise regularizers in the study. Second, DREG is globally the best-performing regularizer under GELU, the default activation in modern transformer architectures, particularly on both messy vision and messy NLP benchmarks, suggesting direct applicability to frontier deep learning settings. Third, DREG’s advantage over competing regularizers is most pronounced under data scarcity, consistent with its role as a geometric inductive bias that substitutes for the regularizing effect of data volume. Throughout, DREG is applied with a single fixed hyperparameter $\lambda = 10^{-2.5}$ and no per-dataset tuning, supporting its characterization as a plug-and-play regularizer for neural networks with nontrivial Jacobian structure. These findings are consistent with DREG’s design: concentrating regularization pressure on layers where the activation derivative is largest, rather than constraining the network uniformly.
[LG-21] Flow-Corrected Thompson Sampling for Non-Stationary Contextual Bandits
链接: https://arxiv.org/abs/2606.23933
作者: AmirHossein Naghdi,Ali Baheri
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:We study non-stationary linear contextual bandits where the reward model drifts over time, rendering classical contextual bandit algorithms brittle because historical data becomes systematically biased. We propose Flow-Corrected Thompson Sampling (fcTS), a Bayesian method that reuses experience by transporting past rewards to the present using an explicit drift model and incorporating each transported observation with a confidence weight that reflects transport reliability. This yields a unified template that specializes in (i) linear parameter drift via online slope estimation and reward correction, (ii) periodic variation via phase-aware reuse across cycles, and (iii) recurring regime switches via changepoint detection and regime-specific posterior memory. The resulting posterior updates remain closed-form under a linear Gaussian model and can be implemented efficiently with truncated, incrementally updated sufficient statistics. Across five controlled case studies and a semi-synthetic portfolio-selection benchmark with multiple overlapping non-stationarities, fcTS outperforms standard forgetting-based baselines (discounting, sliding windows, and periodic restarts), with the largest gains in settings exhibiting recurring temporal structure. These results demonstrate that when non-stationarity is structured, correcting and reweighting historical observations can be substantially more sample-efficient than uniformly discarding them.
[LG-22] KLip-PPO: A per-sample KL perspective on PPO-Clip
链接: https://arxiv.org/abs/2606.23932
作者: Riccardo Colletti,Robin Holzinger
类目: Machine Learning (cs.LG)
*备注:
Abstract:Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm for on-policy reinforcement learning. The literature presents it in two forms, a clipped surrogate that bounds the importance ratio between successive policies and a Kullback-Leibler penalty between them. These forms are treated as separate algorithms with their own gradients, their own hyperparameters, and their own reference implementations, and a sizeable body of empirical work compares them. We show that the gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides. PPO-Clip’s implicit per-sample penalty is a step function at the boundary of the trust region, and the shape of this coefficient is the natural design axis for generalising the algorithm. We sketch the resulting follow-up directions in the discussion.
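The "step function at the boundary of the trust region" can be seen directly from the standard PPO-Clip objective. The sketch below computes the per-sample clipped surrogate and a mask of samples whose gradient with respect to the importance ratio is nonzero. It illustrates only the well-known clipped loss, not the paper's KL reformulation, whose closed-form coefficient is not given in the abstract; `ppo_clip_terms` is an illustrative name.

```python
import numpy as np

def ppo_clip_terms(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective and a mask of samples with nonzero gradient."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    objective = np.minimum(unclipped, clipped)
    # The gradient w.r.t. the ratio equals the advantage unless the clipped branch
    # is selected, which happens only when the ratio has already moved past the
    # trust-region boundary in the direction the advantage favours.
    blocked = ((advantage > 0) & (ratio > 1.0 + eps)) | ((advantage < 0) & (ratio < 1.0 - eps))
    return objective, ~blocked
```

For a positive advantage the gradient switches off abruptly once the ratio exceeds `1 + eps`; for a negative advantage it switches off below `1 - eps`. Samples with zero advantage are reported as active by the mask, although their gradient is zero either way.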
[LG-23] Closing the Loop: Formally Verified Law as a Reward Signal for Self-Improving Legal AI
链接: https://arxiv.org/abs/2606.23913
作者: Armin Heydari(Harvard University),Torben Leowald(Columbia University)
类目: Machine Learning (cs.LG)
*备注: 14 pages, no figures
Abstract:This article develops an architecture that creates a formally verifiable reward signal to train legal AI, adapting the "LLM proposes, verifier disposes" paradigm from mathematical AI to the distinctive demands of law. We present an architecture comprising LLM-driven autoformalization into a formal legal calculus extending Catala, a verification kernel, and explanation generation grounded in formal proof traces. For the computational components of law, the architecture provides provable correctness. For open-textured legal analysis, it provides structural guarantees: every required stage of the legal argument is addressed, argumentation is exercised at the correct stages and not omitted, and the deductive links between steps are valid. We demonstrate the architecture on procedural deadline calculations in German law, Commerce Clause analysis in U.S. constitutional law, and cross-jurisdictional sanction proportionality. We further show that the same architecture has a structural advantage for legal AI training: a deterministic external verifier supplies verifiable outcomes for legal problems and thereby closes the traditional reinforcement-learning loop gap in law.
[LG-24] GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series
链接: https://arxiv.org/abs/2606.23880
作者: Mohammad Fesanghary,Abhinav Havaldar
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:From climate teleconnections to gene regulation, modern time-series datasets encompass tens or hundreds of interacting variables, making causal discovery increasingly challenging. Constraint-based methods offer statistical rigor but their nonlinear CI tests are infeasible at scale, while score-based alternatives avoid CI testing but require arbitrary thresholds to binarize continuous edge scores. We propose GRACE (Gated Refinement for Accurate Causal Edge discovery), which refines constraint-based discovery using Hard Concrete gates with $L_0$ regularization: each candidate edge has an independent gate whose values concentrate near 0 or 1, yielding a clean bimodal separation that makes the binary decision robust, unlike the narrow, overlapping score distributions produced by $L_1$ and attention-based methods. A fast linear CI skeleton provides high-recall candidates; a single gated model then prunes false positives by learning which edges genuinely improve prediction, with automatic regularization adapted to problem dimensions and skeleton density. Systematic experiments on synthetic benchmarks, spanning diverse graph topologies (scale-free, Erdős-Rényi, small-world) and dimensionalities up to $d=100$, show that GRACE substantially improves F1 over its base CI method while maintaining high precision, and outperforms attention-based and score-based alternatives. GRACE matches or exceeds expensive nonlinear CI tests at a fraction of the cost ($75\times$ faster). On a real-world river flow dataset, where rainfall confounders, variable propagation lags, and distributional shifts violate standard assumptions, a temporal bootstrap variant of GRACE recovers 9 of 11 causal edges along the Elbe River with only 1 false positive ($F_1 = 0.86$, AUROC $= 0.99$), reducing the skeleton’s 106 false positives by 99%.
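The Hard Concrete gate mentioned above follows the construction of Louizos, Welling, and Kingma (2018). The sketch below samples one gate per candidate edge and computes the probability that a gate is nonzero, the quantity summed in the expected $L_0$ penalty. The hyperparameter defaults are the commonly used values from that paper, and how GRACE sets or adapts them is not stated in the abstract; the function names are illustrative.

```python
import numpy as np

def hard_concrete_sample(log_alpha, rng, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Sample gates in [0, 1] with point masses at exactly 0 and exactly 1."""
    log_alpha = np.asarray(log_alpha, dtype=float)
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log1p(-u) + log_alpha) / beta))
    s_stretched = s * (zeta - gamma) + gamma      # stretch to (gamma, zeta)
    return np.clip(s_stretched, 0.0, 1.0)         # hard-clip back to [0, 1]

def prob_gate_open(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """P(gate != 0): the differentiable surrogate for the L0 norm of one gate."""
    return 1.0 / (1.0 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))
```

Because stretching pushes mass beyond 0 and 1 before clipping, gates with a strongly negative or positive `log_alpha` land exactly on 0 or 1, which is the bimodal separation the abstract refers to.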
[LG-25] Federated Survival Analysis in Healthcare: A Multi-Model Evaluation on Cross-Institutional Heterogeneous Breast Cancer Data
链接: https://arxiv.org/abs/2606.23871
作者: Natalia Moreno-Blasco,Anusha Ihalapathirana,Pekka Siirtola,Miguel Fernandez-de-Retana
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 14 pages, 4 figures
Abstract:Survival analysis is central to clinical decision-making, yet reliable time-to-event models require large, diverse cohorts that are rarely available at a single institution, while privacy regulations restrict the centralization of patient data. Federated learning (FL) offers a privacy-preserving alternative by training shared models without exchanging raw data, but its effectiveness for survival modeling under realistic, heterogeneous conditions remains insufficiently understood. This paper presents a systematic, multi-model evaluation of federated survival analysis on a cross-institutional breast cancer cohort with naturally heterogeneous distributed clients. Three representative survival models, the Cox Proportional Hazards model, DeepSurv, and Random Survival Forest (RSF), are compared across centralized, local, and federated training, and three federated optimization strategies (FedAvg, FedProx, and FedAdam) are assessed for the gradient-based models. Results show that FL consistently outperforms local training and approaches, and occasionally exceeds, centralized performance, while RSF offers the best overall balance of discrimination, calibration, and robustness across heterogeneous clients. We further find that performance depends on the diversity of client distributions, and that FedAvg and FedProx are stronger and more stable than FedAdam. Based on these findings, we derive practical, decision-oriented guidelines mapping data, privacy, interpretability, and resource constraints to recommended model and training-paradigm choices for federated survival modeling in healthcare.
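Of the three federated optimizers compared, FedAvg is the simplest: the server averages client parameters weighted by local sample counts. The sketch below shows only that aggregation step under assumed data structures (a list of name-to-array dictionaries); it is not the evaluation code from the paper.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Average parameter dictionaries, weighting each client by its number of samples."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return {
        name: sum(w * params[name] for w, params in zip(weights, client_params))
        for name in client_params[0]
    }

# Two hypothetical hospitals with 100 and 300 patients.
clients = [{"w": np.array([0.0, 0.0])}, {"w": np.array([4.0, 8.0])}]
global_params = fedavg(clients, [100, 300])
```

FedProx and FedAdam change the local objective and the server update respectively, while keeping the same exchange of parameters rather than raw patient data.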
[LG-26] Exact Schur-Sylvester Dimensionality Reductions for Non-Smooth Stochastic Complexity and Manifold Sampling
链接: https://arxiv.org/abs/2606.23867
作者: Trenton Lau,Gary P. T. Choi
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST)
*备注:
Abstract:The exact computation of the Normalized Maximum Likelihood (NML) codelength for regular non-smooth estimators (e.g., Lasso) has been historically limited by the cubic scaling walls of manifold-constrained projection and volume integration. At each step of the geometric Propose-and-Project Metropolis–Hastings (PPMH) sampler, evaluating the projection operator requires inverting an $(N+k) \times (N+k)$ generalized KKT matrix, while calculating the volume factor requires the determinant of an $(N-k) \times (N-k)$ Gram matrix. This paper presents an exact, mathematically equivalent formulation that bypasses both bottlenecks by utilizing the block Schur complement and Sylvester’s determinant identity. We prove that the computational complexity of both operations collapses from $\mathcal{O}(N^3)$ to $\mathcal{O}(k^3 + N^2 k)$ per step. We generalize this reduction to Sparse Support Vector Machines (SVMs), Elastic Net, and Group Lasso. Finally, we provide a rigorous numerical stability analysis and evaluate the sampler’s efficiency using the Effective Sample Size (ESS) per second. Our empirical benchmarks on high-dimensional datasets confirm a constant speedup exceeding $14{,}100\times$ while maintaining double-precision numerical equivalence, rendering exact non-smooth NML estimation highly tractable for large-scale statistical inference.
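Sylvester's determinant identity, $\det(I_N + AB) = \det(I_k + BA)$ for $A \in \mathbb{R}^{N \times k}$ and $B \in \mathbb{R}^{k \times N}$, is what lets a large determinant be replaced by a small one. The sketch below checks the identity numerically on random matrices. How the authors map the Gram-matrix determinant onto this form is not described in the abstract, so the matrices here are purely illustrative.

```python
import numpy as np

def sylvester_logdets(A, B):
    """Return log|det(I_N + A B)| and log|det(I_k + B A)| with their signs."""
    n, k = A.shape
    big = np.linalg.slogdet(np.eye(n) + A @ B)      # N x N determinant
    small = np.linalg.slogdet(np.eye(k) + B @ A)    # k x k determinant
    return big, small

rng = np.random.default_rng(0)
N, k = 200, 5
A = rng.normal(size=(N, k)) / np.sqrt(N)
B = rng.normal(size=(k, N)) / np.sqrt(N)
(sign_big, logdet_big), (sign_small, logdet_small) = sylvester_logdets(A, B)
```

When $k \ll N$, only the small form needs to be evaluated in practice; the large one is computed here solely to confirm that the two agree.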
[LG-27] Sesame: Structure-Aware Molecular Generation via Spatial Density-Map Conditioning
链接: https://arxiv.org/abs/2606.23856
作者: Konstantin Yatsenko,Arvind Thiagarajan
类目: Machine Learning (cs.LG)
*备注: 24 pages, 4 figures, preprint
Abstract:Generative molecular models for drug design are a promising direction with much active research. In the next phase of computational drug design, such models will need to understand small molecule structure and protein-ligand interactions, and they will need to possess the machinery to generate molecules de novo. Incorporating each feature poses a critical challenge. Equally important, yet often treated as secondary, is the ability to grow a molecule from a partial starting point (a scaffold or fragment supplied by a chemist), which is the central operation of lead optimization. We present Sesame (Spatial Evoformer for a Structure-Aware Molecular Engine), a diffusion-based molecular generation model that leverages a novel spatial pairformer module to condition on partial molecular structure and the surrounding protein pocket, both expressed as continuous spatial density maps. This single conditioning mechanism supports both de novo generation and fragment-conditioned lead optimization, letting a medicinal chemist prune a hit to a scaffold and have Sesame grow it in productive ways. In addition to this module, we also introduce a diffusion framework for joint denoising of atom types, bond types, and positions, along with a trajectory finetuning scheme that trains on the model’s own sampling rollouts to improve generation quality. Sesame is trained on a large corpus of ligand-only and protein-ligand datasets.
[LG-28] The Degeneracy Distillery
链接: https://arxiv.org/abs/2606.23838
作者: T. Lucas Makinen,Deaglan J. Bartlett,Niall Jeffrey,Benjamin D. Wandelt
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 30 pages, 10 figures. Supporting code found at this https URL
Abstract:When two or more parameters or labels produce similar data, they are degenerate, or hard to distinguish. Degeneracies render both label prediction and inverse problems difficult, since both machine learning algorithms and probabilistic samplers rely on the distinguishability of data and its gradients with respect to parameters. However, identifying degeneracies in physical models or real-world datasets can be elucidating about the choice of model or the underlying process that produces the data. We present the degeneracy distillery, a method that (1) detects and (2) resolves degenerate parameter combinations (a) automatically and (b) symbolically, from parameter-data (or parameter-simulation) pairs alone, through estimation and flattening of the Fisher information matrix. By exploring the information geometry of the likelihood, we characterize degeneracies as an intrinsic property of the physical model, requiring no realised data observation. We demonstrate our approach on a range of synthetic and real-world problems, discovering symbolic coordinate transformations that identify the combinations of parameters of a model which yield independent effects on the data. The resulting coordinates flatten the Fisher information in expectation globally, in contrast to posterior-based methods that flatten only at a single point, and substantially reduce the simulation budget required for downstream neural posterior estimation. In test cases we require up to $10\times$ fewer simulations for posterior estimation at matched validation calibration whilst simultaneously gaining physical insight on the system.
[LG-29] Reconstructing GRACE Terrestrial Water Storage with Spatio-Temporal Graph Neural Networks: An Application to South America
链接: https://arxiv.org/abs/2606.23833
作者: Lukas Arzoumanidis,Lara Johannsen,Klara Middendorf,Annette Eicker,Youness Dehbi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Terrestrial water storage (TWS) integrates snow, soil moisture, surface water, and groundwater and is a key indicator of how climate variability and human activity reshape the global water cycle. The GRACE and GRACE-FO satellite missions provide the only direct, globally consistent observations of TWS change, but their record begins only in 2002, which is too short for many climate-scale analyses. We present a deep learning application that reconstructs monthly GRACE-like TWS anomalies (TWSA) back to 1940 by learning the relationship between daily ERA5 meteorological forcing (precipitation, evapotranspiration, runoff) and monthly GRACE observations. In contrast to prior reconstruction approaches based on grid-cell-wise regression, CNNs, or LSTMs, we adapt a multi-variate time series graph neural network (MTGNN) architecture, which was originally developed for mobility and traffic forecasting on urban sensor networks, to this satellite-geodesy task. Spatial dependencies are encoded in a static, interpretable hybrid adjacency matrix that combines geodesic proximity with lagged correlations of climatic time series, capturing both local hydrological coupling and large-scale teleconnections. The reconstruction achieves a grid-cell Pearson correlation of 0.69, a basin-mean correlation of 0.94, and a near-zero bias, and it reproduces the spatial fingerprints of the 2015/16 El Niño and 2020/21 La Niña events. A systematic comparison with established reconstruction approaches (GTWS-MLrec, RM-REC, GRAiCE) shows that the graph-based model is statistically competitive at basin scale, reaching a correlation within 0.025 of the best baseline while using only roughly half to a tenth of the predictors the other models require, and reveals characteristic weaknesses in arid regions in all models. The complete implementation is publicly available at this http URL
[LG-30] One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen with a Parameter-Free Compression Baseline
链接: https://arxiv.org/abs/2606.23767
作者: Wietse Stienstra
类目: Machine Learning (cs.LG)
*备注: 15 pages, 1 table. Code, pre-registrations and per-pair outputs: this https URL
Abstract:Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors’ own protocol – different pair subsets, weightings, model-selection, and decision rates. We argue this is the wrong comparison and run the right one: a same-hands re-evaluation in which every method is run by us on the identical 102 pairs, with one strict rule – no tuning and a decision forced on every pair. As a clean reference point we introduce a deliberately minimal baseline: sorted-conditional compression, which feeds quantized, sorted, first-differenced data to an off-the-shelf compressor (bz2) and has zero fitted parameters. Under the common ruler the ranking differs sharply from the literature. Our baseline reaches 74.7% weighted accuracy (p = 3.7e-7); on the same 100 pairs that SLOPE is evaluated on it scores 76.0%, a 1.2-point gap below the authors’ own forced-decision SLOPE (77.2%) that is well inside noise (McNemar p = 0.39). A faithful re-run of RECI lands at 70.7% – inside the original authors’ reported error bar, not the 77.5% often quoted (which we trace to a mis-copied cell). SLOPE’s published 82.4% is a decided-subset figure: scoring the authors’ own stored output only on the pairs its significance test chose to answer reproduces 81.7%. Under the common ruler the methods cluster in the low-to-mid 70s and the zero-parameter compressor ties the strongest of them. We document the mechanisms that inflate published figures (test-set model selection, significance-gated abstention) and contribute two further results: compression score magnitude is a model-free confounding flag (p = 2.8e-68), and a pre-registered falsification test fails in an instructive way that bounds the method’s theoretical interpretation. Code, pre-registrations, and per-pair outputs are released.
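The compression pipeline described above (quantize, sort, first-difference, compress with bz2) can be sketched as a code-length function. This is a guess at the shape of the computation, not the released code: the 256-level quantization, the int16 encoding, and the function name are assumptions, and the rule that turns two code lengths into a causal-direction decision is not given in the abstract, so the sketch stops at the code length.

```python
import bz2
import numpy as np

def sorted_conditional_codelength(x, y, levels=256):
    """Compressed size in bytes of y after sorting by x, quantizing, and differencing."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_sorted = y[np.argsort(x, kind="stable")]        # order y by its putative cause
    span = y_sorted.max() - y_sorted.min()
    if span == 0:
        q = np.zeros(len(y_sorted), dtype=np.int16)
    else:
        q = np.round((y_sorted - y_sorted.min()) / span * (levels - 1)).astype(np.int16)
    diffs = np.diff(q, prepend=q[:1])                 # first differences, same length as q
    return len(bz2.compress(diffs.astype(np.int16).tobytes()))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
smooth_len = sorted_conditional_codelength(x, x ** 3)                           # y is a function of x
noise_len = sorted_conditional_codelength(x, rng.uniform(-1.0, 1.0, size=500))  # y unrelated to x
```

When `y` depends smoothly on `x`, the sorted differences are small and repetitive, so the compressed stream is short; for unrelated data the differences look random and compress poorly.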
[LG-31] Verifiable Foundation Models for Robot Safety
链接: https://arxiv.org/abs/2606.23754
作者: Davide Corsi,Kyungmin Kim,Roy Fox
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable for existing verification tools. In this paper, we present FEARL (Foundation-Enabled Assured Robot Learning), a framework that addresses this tension through a modular architectural decomposition. FEARL separates the policy into a large Controller (C) responsible for high-dimensional perception and task reasoning, and a small Safety module (S) that receives low-dimensional observations from dedicated safety sensors together with a bounded context embedding from C and produces the final action. Since many robot safety requirements, such as collision avoidance and workspace boundary constraints, can be expressed over these safety sensor observations, formal verification can be applied to S rather than to the full foundation-model backbone. This makes formal analysis tractable with existing tools while preserving the Controller’s expressive power for task reasoning. To show that the decomposed policy remains capable of solving diverse tasks, we evaluate FEARL on three simulated robotic domains using multiple Controller backbones and training procedures, including pretrained off-the-shelf vision-language-action models. We further transfer the learned policy from one of our simulated tasks to a physical robot, suggesting that the low-dimensional safety interface supports practical sim-to-real transfer.
[LG-32] EnerInfer: Energy-Aware On-Device LLM Inference
链接: https://arxiv.org/abs/2606.23001
作者: Bohua Zou,Nian Liu,Binqi Sun,Matteo Mascherin,Debayan Roy,Yutao Liu,Yu Peng,Ning Jia,Haibo Chen
类目: Software Engineering (cs.SE); Machine Learning (cs.LG); Operating Systems (cs.OS)
*备注:
Abstract:On-device LLM inference is increasingly attractive for privacy-preserving, reliable, and cost-effective deployment, yet its energy and thermal costs remain a critical bottleneck. Existing systems primarily optimize for decoding speed, implicitly assuming that faster execution is always preferable. We show instead that on-device LLM inference often has exploitable configuration slack: modestly lowering NPU and memory frequencies preserves quality of experience (QoE) while substantially improving energy efficiency and reducing heat. Realizing this opportunity in production is challenging. The most energy-efficient NPU/DDR setting varies with the model, inference engine, platform, and runtime conditions, with no stable ranking across configurations. Commercial devices further lack component-level power sensing, and shell temperature evolves with request arrivals, response lengths, and thermal history. To address these challenges, we propose EnerInfer, the first on-device LLM inference framework that jointly manages energy efficiency, throughput, and thermal comfort for LLM workloads. EnerInfer replaces per-model profiling and sensor-heavy control with disaggregated, model-structure-aware prediction and ranking-driven online feedback. It predicts throughput and power for unseen LLMs across NPU/DDR frequency settings, selects QoE-satisfying efficient configurations under runtime interference, and uses lightweight limited-horizon thermal prediction to dynamically switch between energy-optimized and thermally constrained inference. Evaluations on real-world LLMs show that EnerInfer improves energy efficiency by up to 65%, 12%, and 24% on phones, a laptop, and a development board, respectively, without QoE violation. 
[LG-33] New Bounds for the Last Iterate of the Stochastic subGradient Method
链接: https://arxiv.org/abs/2606.24879
作者: Guglielmo Beretta,Tommaso Cesari,Roberto Colomboni,Andrea Paudice
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We study the last iterate of the stochastic subgradient method (SsGM) for one-dimensional convex Lipschitz objectives. For a fixed horizon $n$, we consider the standard fixed stepsizes $\eta = \Theta(1/\sqrt{n})$. We prove that, for such stepsize policies, under additive i.i.d. subgradient noise with uniformly bounded variance, the last iterate features an optimization error of order $1/\sqrt{n}$, thereby removing the extra $\log n$ factor present in existing generic bounds. On the other hand, we show that without the i.i.d. assumption, the optimization error can be of order $(\log n)/\sqrt{n}$. Thus, under the uniformly bounded variance assumption alone, the last iterate of SsGM is suboptimal even in dimension one, resolving negatively an open problem posed in Koren and Segal, COLT, 2020.
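The setting can be made concrete with a small simulation: a one-dimensional convex Lipschitz objective, a fixed stepsize proportional to $1/\sqrt{n}$, and additive i.i.d. Gaussian noise on the subgradient. The objective $f(x) = |x|$, the noise level, and the function name are illustrative assumptions; the sketch demonstrates the method, not the paper's bounds.

```python
import numpy as np

def last_iterate_ssgm(subgrad, x0, n, noise_std=1.0, scale=1.0, seed=0):
    """Run n steps of the stochastic subgradient method and return the last iterate."""
    rng = np.random.default_rng(seed)
    eta = scale / np.sqrt(n)                 # fixed stepsize of order 1/sqrt(n)
    x = float(x0)
    for _ in range(n):
        g = subgrad(x) + noise_std * rng.normal()   # noisy subgradient, i.i.d. noise
        x -= eta * g
    return x

# f(x) = |x| has subgradient sign(x) and minimum value 0 at x = 0.
x_last = last_iterate_ssgm(np.sign, x0=1.0, n=10_000)
error = abs(x_last)
```

Returning the last iterate rather than the running average is the point of the study: averaging is known to give the $1/\sqrt{n}$ rate, while the question is whether the final point alone does too.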
[LG-34] Model selection with proper scoring rules on data sets of time series
Link: https://arxiv.org/abs/2606.24715
Authors: Giorgio Corani, Stefano Damato, Dario Azzimonti, Lorenzo Zambon
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We consider the problem of model selection between probabilistic models on data sets of time series. Given a proper scoring rule, we use the term "score" for the average value of the scoring rule on the test set of an individual time series. For model selection, we need to aggregate the scores across multiple time series. Three summary statistics are commonly used for model selection: mean score, median score, and mean rank. Results in previous papers show that these statistics can yield conflicting decisions; we show that the conflicting conclusions are due to the skewness of the distribution of scores. We also show that as the test set of each time series of the data set increases, the different model selection criteria progressively converge to the same conclusion. However, for short test sets, only the mean score identifies the true model as the best. We illustrate these phenomena with an analysis on intermittent time series, including the data set of the M5 competition, where we underline the importance of having a large test set. In such experiments, we further notice that model selection based on mean ranks remains unchanged using different scaling factors.
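The three summary statistics named in the abstract can be sketched as follows. The score values, the model names, the lower-is-better orientation, and the tie handling (ties broken by model order) are assumptions made for illustration only.

```python
from statistics import mean, median


def summarize(scores):
    """Compute mean score, median score, and mean rank per model.

    scores maps a model name to its per-series scores; lower is assumed
    to be better, and ties are broken by model order for simplicity.
    """
    models = list(scores)
    n_series = len(scores[models[0]])
    ranks = {m: [] for m in models}
    for i in range(n_series):
        ordered = sorted(models, key=lambda m: scores[m][i])
        for rank, m in enumerate(ordered, start=1):
            ranks[m].append(rank)
    return {
        m: {
            "mean_score": mean(scores[m]),
            "median_score": median(scores[m]),
            "mean_rank": mean(ranks[m]),
        }
        for m in models
    }


# Hypothetical skewed scores: model A wins narrowly on four series
# but loses badly on the fifth.
example = {
    "A": [1.0, 1.0, 1.0, 1.0, 20.0],
    "B": [1.2, 1.2, 1.2, 1.2, 2.0],
}
summary = summarize(example)
```

On this made-up data the mean score prefers B, while the median score and the mean rank both prefer A, which is the kind of disagreement a skewed score distribution can produce.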
[LG-35] A Physics-Informed Fourier-Wavelet Transformer for Multiscale Computational Fluid Dynamics Surrogate Modeling
Link: https://arxiv.org/abs/2606.24696
Authors: Somyajit Chakraborty, Ming Pan, Xizhong Chen
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:
Abstract:Physics-informed surrogate models can accelerate computational fluid dynamics simulations. However, many existing methods reproduce global flow patterns more reliably than localized multiscale structures. This study presents a physics-informed Fourier-wavelet transformer for next-step velocity-field reconstruction in real-world flow benchmarks. The proposed formulation combines hybrid Fourier-wavelet spectral encoding with physics-biased self-attention based on partial differential equation residual diagnostics. It also uses self-supervised pretraining through Masked Physics Prediction and Equation Consistency Prediction. The experiments are conducted on two real benchmark cases: cylinder-wake flow and fluid-structure interaction. All approaches are evaluated under a shared local protocol and compared with spectral, transformer-based, operator-learning, and physics-informed neural-network baselines. On the cylinder-wake benchmark, the proposed model achieves the best aggregate accuracy, with an all-channel normalized mean-squared error of 0.05875 and an all-channel Pearson correlation coefficient of 0.97019. On the fluid-structure-interaction benchmark, it gives the lowest all-channel normalized mean-squared error of 2.70 \times 10^{-4}, compared with 4.02 \times 10^{-4} for the strongest baseline. Component-wise field comparisons and scale-separated diagnostics further show stronger recovery of localized wake structures, including near-body, wake-core, and far-wake features. The results demonstrate improved real-world flow reconstruction while maintaining a practical accuracy-cost tradeoff.
[LG-36] Extended pseudo-spectral physics-informed neural networks for phase-field models
Link: https://arxiv.org/abs/2606.24660
Authors: Callum Marsh, Radek Erban, Andreas Munch
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Numerical Analysis (math.NA); Biological Physics (physics.bio-ph)
Comments: 20 pages, 10 figures, Data available: this https URL
Abstract:Phase-field models play a central role in the continuum description of phase separation, in which the bulk free-energy density and the interfacial thickness parameter determine pattern formation and microstructural evolution. In practice, these constitutive quantities are rarely known a priori and must be inferred from limited dynamical observations. In this work, an extended pseudo-spectral physics-informed neural network (ESPINN) framework is developed for the inverse identification of phase-field models from transient snapshot data. It enables the simultaneous recovery of both the bulk chemical potential and unknown gradient coefficients. Numerical experiments on the one-dimensional Cahn-Hilliard equation demonstrate accurate and statistically stable reconstruction in the noiseless regime, with substantial constitutive information recoverable from even a single snapshot pair. In the presence of noise, reconstruction accuracy degrades gracefully, and increasing the number of snapshots improves robustness by reducing variance across runs. These results establish ESPINN as a data-efficient and physically consistent approach for learning free-energy structure in continuum models of phase separation.
[LG-37] An Agnostic Machine Learning Model of Photosynthetic Habitability
Link: https://arxiv.org/abs/2606.24458
Authors: Callum Gray, Cassandra Hall, Stefano Santabarbara, Klaus Schmidt-Rohr, Andrew Ringham, Edward Gillen, Thomas J. Haworth, Christopher D. P. Duffy
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 17 pages main body, 5 figures. Submitted to MNRAS
Abstract:The search for exoplanet biosignatures is guided by whether planetary environments can sustain photosynthesis. As such, the Photosynthetic Habitable Zone (PHZ) was recently proposed, as the overlap between the canonical habitable zone and the orbital range where stellar irradiance is sufficient to drive photosynthesis. Existing PHZ estimates rely on empirical light-response curves from Earth phytoplankton, and thus include implicit Earth-centric biases. We introduce an agnostic PHZ derived from a generalized model of photosynthesis grounded in thermodynamics and redox chemistry, without reference to model organisms. The model is built on a generic photochemical reaction in which photon capture couples oxidation of a donor molecule to the reduction of CO2. The optical properties and CO2 reduction rate are optimized against irradiance spectra for exoplanets orbiting main-sequence stars, using a genetic algorithm that mimics evolution by natural selection. Our simulations predict that photosynthetic organisms compensate for reduced flux by evolving larger light-harvesting structures. As a result, photosynthetic viability declines only linearly with orbital distance, despite stellar flux falling off quadratically. As such, the agnostic PHZ expands well beyond previous Earth-based estimates. Earth-like (visible light) oxygenic photosynthesis is flux-limited at the outer habitable zone for cool M-dwarf stars; however, both anoxygenic photosynthesis and a hypothetical, NIR-driven oxygenic photosynthesis are viable across the entire habitable zone for M, K, and G stars. This implies that M-dwarf exoplanets could sustain robust oxygenic photosynthesis, though it would be different to that found on Earth, presenting reflectance biosignatures in the NIR band rather than the visible.
[LG-38] PROTECT-90: A Fault Dataset for Power System Protection
Link: https://arxiv.org/abs/2606.24298
Authors: Julian Oelhaf, Georg Kordowich, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, 3 tables. Accepted for publication at IEEE PES ISGT Europe 2026. Author accepted manuscript. Final published version will be available via IEEE Xplore
Abstract:The increasing interest in data-driven methods for power system protection is accompanied by a lack of standardized, publicly available high-voltage waveform datasets that enable transparent and reproducible evaluation. To address this gap, this paper introduces the PROTECT-90 dataset, an open electromagnetic transient (EMT)-simulated reference benchmark for high-voltage fault studies with consistent digital-fault-recorder-like measurements, publicly released with this work. The dataset comprises 9,022 physically consistent short-circuit simulation episodes generated on a standardized 90 kV double-line topology with systematically documented domain randomization of grid operating points, line parameters, and fault conditions. For each episode, synchronized three-phase voltage and current waveforms are recorded at eight measurement locations and released together with structured, machine-readable metadata describing fault type, fault location, inception time, and operating conditions. All modeling assumptions, parameter ranges, and data-generation procedures are explicitly documented to ensure transparency and cross-study comparability. By combining physically grounded EMT simulation, balanced scenario coverage, and open accessibility, PROTECT-90 establishes a standardized foundation for reproducible benchmarking of protection-oriented signal processing and learning-based methods.
[LG-39] Uniform Sampling from High-dimensional Spectral Norm Balls
Link: https://arxiv.org/abs/2606.24134
Authors: Michael R. Metel
Categories: Probability (math.PR); Machine Learning (cs.LG)
Comments:
Abstract:Motivated by an application in machine learning optimization, this paper focuses on the challenges of sampling a matrix uniformly from the unit spectral norm ball. It is proven that all singular values of sampled matrices converge to 1 almost surely as the matrix dimensions increase. This result provides the theoretical justification for a proposed simple sampling method applicable for large dimension sizes matching matrices found in modern large language models. Experimental results demonstrate both the convergence of the singular values, as well as the exact and proposed approximate sampling methods.
[LG-40] Low-rank Updates in Slowly Time-varying Graphs for Spatial-Temporal Signal Interpolation
Link: https://arxiv.org/abs/2606.24011
Authors: Saghar Bagheri, Gene Cheung, Tim Eadie, Antonio Ortega
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:A crucial assumption in graph signal processing (GSP) is the existence of an underlying graph that captures the pairwise similarities between nodes, allowing filters to be designed based on this graph for tasks such as denoising. For spatial-temporal data in which node-to-node similarities evolve over time, a static spatial graph is insufficient. In this paper, to represent slowly time-varying pairwise relationships, we model the graph changes in two consecutive adjacency matrices, P = W^{(2)} - W^{(1)}, across time as a low-rank matrix. Specifically, given an initial adjacency matrix W^{(1)} at time t=1, we jointly interpolate a signal x_2 and estimate W^{(2)} at t=2 using both a graph signal smoothness prior for x_2 and a low-rank prior on P. We alternate optimization steps. With W^{(2)} fixed, x_2 is interpolated by solving a linear system. Alternatively, holding x_2 fixed, W^{(2)} is updated via proximal gradient descent (PGD). The proximal mapping of the rank term Gamma(W^{(2)} - W^{(1)}) is approximated in linear time using a fast orthogonal matching pursuit (OMP) algorithm that selects a sparse combination of atoms from a dictionary \mathcal{R} formed by the outer products of W^{(1)}'s eigenvectors. We unroll iterations of our algorithm into layers to build a lightweight neural network for limited data-driven parameter tuning. Experiments show that our joint optimization achieves better signal interpolation compared to existing time-varying graph models.
[LG-41] Stochastic Expectation Maximization for Robust State-Space Radio Interferometric Imaging
Link: https://arxiv.org/abs/2606.23944
Authors: Nawel Arab, Mohammed Nabil El Korso, Isabelle Vin, Pascal Larzabal
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:State–space models provide a flexible framework for analyzing dynamical systems, yet they often rely on Gaussian assumptions that fail to capture heavy-tailed or outlier-prone measurement noise. We propose a robust estimation scheme for linear state–space models subject to compound-Gaussian noise, as encountered for instance in radio interferometry affected by radio-frequency interference (RFI). The method relies on a Stochastic Approximation Expectation–Maximization (SAEM) algorithm in which the standard E-step is replaced by Monte Carlo sampling of the latent states and noise texture through closed-form Gibbs updates, enabling tractable inference despite the heavy-tailed likelihood. Numerical experiments show that the proposed method significantly improves reconstruction fidelity and robustness to RFI, outperforming a Gaussian EM algorithm and even an oracle RTS smoother. These results highlight the benefits of heavy-tailed state–space modeling and SAEM-based inference in interference-dominated imaging scenarios.
[LG-42] Prediction of Viscoelastic Droplet Impact Dynamics Using a Vision Transformer-Based Approach
Link: https://arxiv.org/abs/2606.23940
Authors: Diego A. de Aguiar, Cassio M. Oishi
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:
Abstract:Droplet impact on solid surfaces is a complex fluid dynamics problem with applications in spray cooling, inkjet printing, and pharmaceutical processing. Although numerical simulations are widely used to investigate these dynamics, their computational cost becomes significant when multiple parametric variations are considered. In this work, we investigate the use of a Video Vision Transformer (ViViT) architecture to predict the temporal evolution of viscoelastic droplets impacting solid surfaces using volume fraction fields obtained from the Volume of Fluid (VOF) method. In Newtonian fluids, impact dynamics are mainly characterized by the Reynolds number Re, representing the ratio of inertial to viscous forces, and the Weber number We, representing the ratio of inertial to surface tension forces. For viscoelastic fluids, additional parameters are required to account for elastic effects, namely the solvent viscosity ratio \beta and the Weissenberg number Wi, increasing simulation complexity and cost. Instead of simulating the entire droplet dynamics, the proposed approach uses only the initial 10% to 20% of the simulation to predict the remaining evolution. Depending on the prediction configuration, this strategy reduces computational cost by approximately 80% to 90% compared to full numerical simulations. The ViViT produces physically consistent predictions across different parameters and prediction horizons, successfully capturing both spreading and bouncing regimes while preserving geometric features and structural similarity. Since volume fraction fields can also be extracted from experimental videos, the proposed framework could be extended to incorporate experimental data during training, potentially improving the physical fidelity of the predicted dynamics.
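For reference, the four dimensionless groups mentioned in the abstract are commonly written as below. The choice of characteristic scales is an assumption here (droplet diameter D, impact velocity U, density \rho, surface tension \sigma, relaxation time \lambda, solvent and polymer viscosities \eta_s and \eta_p, with total viscosity \eta_0 = \eta_s + \eta_p); the paper may use a different convention, for example the droplet radius.

```latex
\mathrm{Re} = \frac{\rho U D}{\eta_0}, \qquad
\mathrm{We} = \frac{\rho U^{2} D}{\sigma}, \qquad
\mathrm{Wi} = \frac{\lambda U}{D}, \qquad
\beta = \frac{\eta_s}{\eta_s + \eta_p}
```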
[LG-43] Constrained Variable Projection for Structured Problems
Link: https://arxiv.org/abs/2606.23939
Authors: Emanuele Zangrando, Sara Venturini, Francesco Rinaldi, Francesco Tudisco
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract:Variable projection is a classical technique for separable nonlinear least-squares problems, in which variables that enter linearly are eliminated exactly, yielding a reduced nonlinear problem. By expressing this framework as a particular instance of a broader class of bilevel optimization problems, we develop a constrained variable-projection framework for data-science models, where the remaining variables are subject to convex constraints and the eliminated variables arise from a lower-level least-squares problem. In particular, by interpreting variable projection as a collapsed bilevel optimization problem, we derive exact reduced-gradient formulas compatible with automatic differentiation and propose a conditional-gradient algorithm for the resulting constrained reduced problem. We establish convergence guarantees under standard smoothness and compactness assumptions, and discuss extensions to structured lower-level variables. Numerical experiments on sparse autoencoding, dictionary learning, blind deconvolution, and few-shot learning suggest that the method can improve wall-clock efficiency and data efficiency relative to natural joint-optimization baselines.
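The classical idea of variable projection can be illustrated on a toy separable model y ≈ c·exp(-a·t): the linear coefficient c is eliminated in closed form, leaving a reduced problem in the nonlinear parameter a alone, here restricted to an interval. This is only a sketch under stated assumptions: the model, the data, and the function names are hypothetical, and a golden-section search stands in for the conditional-gradient algorithm proposed in the paper.

```python
import math


def eliminate_linear(a, ts, ys):
    """For fixed a, solve the inner least-squares problem for c exactly
    and return (c, reduced squared residual)."""
    phi = [math.exp(-a * t) for t in ts]
    c = sum(y * p for y, p in zip(ys, phi)) / sum(p * p for p in phi)
    residual = sum((y - c * p) ** 2 for y, p in zip(ys, phi))
    return c, residual


def fit_constrained(ts, ys, lo, hi, iters=80):
    """Minimize the reduced objective over the interval [lo, hi] using
    golden-section search (assumes the reduced objective is unimodal there)."""
    ratio = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        m1 = hi - ratio * (hi - lo)
        m2 = lo + ratio * (hi - lo)
        if eliminate_linear(m1, ts, ys)[1] < eliminate_linear(m2, ts, ys)[1]:
            hi = m2  # minimizer lies in [lo, m2]
        else:
            lo = m1  # minimizer lies in [m1, hi]
    a = (lo + hi) / 2
    c, _ = eliminate_linear(a, ts, ys)
    return a, c


# Noiseless synthetic data generated with a = 1.5 and c = 2.0.
ts = [0.1 * i for i in range(30)]
ys = [2.0 * math.exp(-1.5 * t) for t in ts]
a_hat, c_hat = fit_constrained(ts, ys, lo=0.1, hi=5.0)
```

The outer search only ever sees the one-dimensional reduced objective; the linear variable is recomputed exactly at every evaluation, which is the defining feature of the variable-projection approach.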
[LG-44] Hessian-augmented Supervised Learning for Hamilton-Jacobi-Bellman PDEs
Link: https://arxiv.org/abs/2606.23827
Authors: Matías Gómez-Aedo, Behzad Azmi, Yuyang Huang, Dante Kalise, Karl Kunisch
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract:A data-driven method is developed for approximating value functions in deterministic optimal control problems with nonlinear control-affine dynamics. The Pontryagin Maximum Principle optimality system is solved from multiple initial conditions to generate training data consisting of values, gradients, and Hessians of the value function, where Hessian information is obtained from a matrix Riccati equation along optimal trajectories. These quantities augment a weighted least-squares regression over sparse polynomial bases on hyperbolic cross index sets, with gradients and Hessians contributing additional linear equations per sample and substantially reducing sample complexity compared to value-only regression. Feedback laws are recovered analytically from the learned value function. In high dimensions, a partial Hessian strategy controls the cost of data generation. The approach is validated on problems of increasing state dimension, where second-order data augmentation is shown to improve approximation accuracy and closed-loop performance, with up to an order-of-magnitude reduction in the number of training samples required relative to lower-order methods.
[LG-45] Machine Learning and Deep Learning for Exoplanet Detection and Atmospheric Characterization with JWST and the Upcoming Ariel Mission
Link: https://arxiv.org/abs/2606.23766
Authors: Muallim Yakubu, Vwavware Oruaode Jude
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
Comments: 15 pages
Abstract:The detection and atmospheric characterization of exoplanets have entered a new data-intensive era driven by the James Webb Space Telescope and the upcoming Ariel mission. Modern surveys produce millions of light curves and high-resolution spectra that overwhelm traditional pipelines, motivating the rapid integration of Machine Learning and Deep Learning methods into the exoplanet workflow. This review synthesizes the latest progress in applying ML/DL techniques to exoplanet detection (transit identification, candidate vetting, false-positive rejection) and atmospheric characterization (retrieval, detrending, cross-correlation, surrogate modelling) in the context of JWST and Ariel. We start with classical algorithms such as Random Forests and Convolutional Neural Networks, move through Transformers and Recurrent architectures, then survey modern simulation-based inference using Neural Posterior Estimation and Flow Matching Posterior Estimation with normalizing or continuous normalizing flows. We discuss benchmark efforts, including the Ariel Machine Learning Data Challenges (2019 to 2025) hosted with NeurIPS, and key JWST case studies such as the WASP-39b Early Release Science programme. Results indicate that DL approaches consistently match or exceed traditional pipelines in both speed and accuracy, while ML-driven retrievals reduce inference time from CPU-hours to seconds and can accelerate nested-sampling retrievals by factors of 3-8 without compromising Bayesian evidence. We identify outstanding challenges (interpretability, calibration of uncertainties under noisy data, hybrid modelling, and the generalization of models across instruments and planet populations) and outline a research roadmap spanning the JWST era and beyond into Ariel's launch in 2029.
[LG-46] Computational references are not experiments: pre-registered validation of machine-learned sodium-cathode voltages
Link: https://arxiv.org/abs/2606.23725
Authors: Krishna Teja Vepa
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:
Abstract:Machine-learning screens for battery materials are trained and judged almost entirely against computed reference voltages, and those references carry their own systematic errors. We report a case in which this matters quantitatively: our own screening stack (a graph-network voltage screen, a prior-art triage layer, and a local PBE+U bench) fails pre-registered validation against experiment-anchored literature values. Verdict thresholds, failure modes, and the primary metric were committed before analysis. On an operator-audited set of known Na-ion cathodes (n = 6 after one documented exclusion; verdict unchanged at n = 7), the raw held-out mean absolute error was 0.67 V; the pre-registered conservative metric (the upper 95% confidence bound of the cross-validated bias-corrected error) was 1.09 V; and the residual was strongly voltage-dependent (r = -0.94), so no additive calibration is valid. On the two compounds where prediction, database reference, and experiment could all be compared, the Materials Project PBE+U reference sat about 0.54 V below measurement: the reference, not the model, dominated the error. A prior-art screen found at least 70% of the targeted Na substitution space already published. We retire the screen, bound what "verified" means for our DFT ledger, and pre-register a calibration audit of it against four benchmark Li couples.
[LG-47] A Hybrid Quantum-Classical Approach for Melt Pool Prediction in Laser Powder Bed Fusion
Link: https://arxiv.org/abs/2606.23719
Authors: Matthew M. Sato, Kincho H. Law
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, to be presented at the ASME IDETC/CIE 2026 Conference
Abstract:Laser powder bed fusion (LPBF) is a promising additive manufacturing technique that suffers from quality assurance concerns. Predicting melt pools from process parameters is crucial for assessing quality prior to manufacturing but remains a difficult problem because of the complex physical processes underlying LPBF. Quantum computers present a new computing paradigm, providing a new approach to information processing using quantum entanglement and superposition. This paper presents a practical demonstration of a hybrid quantum-classical model that leverages quantum computing to improve process parameter feature extraction with a quantum feature encoder. To make the quantum approach computationally feasible for large datasets, we first employ a clustering algorithm to reduce the number of expensive quantum computations. These quantum features are then processed by a classical neural network to predict the melt pool morphology, allowing for more accurate predictions of melt pools. We demonstrate the method using a quantum simulator, analyze the effect of measurement shot noise on the predictive performance of the network, and verify the results using quantum hardware. Finally, by examining which quantum features are most important, we provide insights that can inform the future design of more effective quantum encoding circuits. Ultimately, the performance improvement over purely classical networks validates the hybrid approach, demonstrating an engineering application of quantum computing using noisy intermediate-scale quantum (NISQ) devices.
[LG-48] Dimensionality Reduction of QAOA Parameter Space with Kernel PCA for Max-Cut
Link: https://arxiv.org/abs/2606.23718
Authors: Sidharth Brahmandam, Vayd Ramkumar
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, submitted to IEEE Quantum Week Conference
Abstract:The Quantum Approximate Optimization Algorithm (QAOA) is a leading variational algorithm for combinatorial optimization on near term quantum devices. As circuit depth increases, the number of optimization parameters grows, making the search landscape increasingly nonlinear and difficult to optimize. Previous studies have shown that optimal QAOA parameters often lie on a low dimensional manifold that can be approximated using Principal Component Analysis (PCA) at shallow circuit depths. However, the effectiveness of PCA decreases at higher depths because the underlying parameter manifold becomes increasingly nonlinear. In this work, we investigate Kernel Principal Component Analysis (KPCA) with a radial basis function kernel as a nonlinear dimensionality reduction technique for QAOA parameter optimization. The model is trained using 200 graphs from each of 3 graph families, namely Erdos-Renyi, Barabasi-Albert, and Watts-Strogatz, with graph sizes ranging from 7 to 10 nodes. Performance is evaluated on 30 test graphs containing 12 nodes at circuit depths 1, 2, 4, and 8. Experimental results demonstrate that KPCA consistently outperforms PCA at deeper circuit depths across all graph families. At depth 8, KPCA achieves approximation ratios above 0.86, while PCA declines to approximately 0.81 to 0.83. Both methods reduce the number of quantum circuit evaluations by more than 93 percent relative to unrestricted QAOA optimization. These findings suggest that nonlinear kernel methods more effectively capture the structure of the QAOA parameter manifold and provide a practical approach for scaling variational quantum optimization to deeper circuits.
[LG-49] WiFi-Based People Counting Using Beam-Steerable Antennas: A Test-bed Study
Link: https://arxiv.org/abs/2606.23710
Authors: Riccardo Bersan, Anay Ajit Deshpande, Sanaz Kianoush, Daniele Piazza, Stefano Savazzi
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:Ubiquitous perception through RF signals is a pivotal opportunity for future technology: it enables personalized services such as smart living, remote healthcare, automated logistics or interaction through free-space gestures. The ubiquity of Wi-Fi and cellular networks presents a promising platform for the development of innovative sensing tools. Future standards will also introduce dedicated sensing features which, for example, will allow routers to work as frequency modulated continuous wave radios targeting radar applications. Most of the current chip designs support ad-hoc firmware for CSI extraction with MIMO arrangements of the transmitter (TX) and receiver (RX) antennas and OFDM subcarriers. The CSI describes the phase shift and amplitude attenuation of multiple propagation paths on each subcarrier. The latest IEEE 802.11be standard (Wi-Fi 7) offers a wider subcarrier bandwidth of 160MHz (up to 320MHz), providing at least 120 usable pilot subcarriers for CSI or CIR estimation. Additionally, Wi-Fi signals have been recently exploited to track daily human movements and behaviors, while Wi-Fi signal variations have been shown to differ between different people and can consequently be used for their re-identification.