This post lists the latest papers retrieved from Arxiv.org on 2026-04-22. It is updated automatically and grouped into six areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each morning.

Tip: if a given day's list is not updated on time, either Arxiv published no new papers that day or the update script failed. Fixes are made the same day whenever possible.

Contents

Overview (2026-04-22)

639 papers were posted today, including:

  • Natural Language Processing: 103 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 189 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 120 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 141 papers (Machine Learning (cs.LG))
  • Multiagent Systems: 15 papers (Multiagent Systems (cs.MA))
  • Information Retrieval: 15 papers (Information Retrieval (cs.IR))
  • Human-Computer Interaction: 40 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems

[MA-0] Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Quick Read: This paper targets the high development cost, low efficiency, and error-proneness of manually engineering industrial-grade executable visual workflows. In current practice, developers must hand-write a prompt for each step and repeatedly revise the logic as requirements change; while large language models (LLMs) can understand high-level intent, they struggle to generate stable, correct, directly deployable workflows. The authors therefore introduce the Chat2Workflow benchmark together with a robust agentic framework, whose key idea is to improve the accuracy and reliability of generated workflows through multi-round interaction and error correction, advancing natural-language-driven workflow construction for industrial settings.

Link: https://arxiv.org/abs/2604.19667
Authors: Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang
Affiliations: Zhejiang University; Tencent
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Work in progress

Abstract:At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve-making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at this https URL.

[MA-1] AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories

Quick Read: This paper tackles the difficulty of attributing performance gains in AI Virtual Cells research, where systematic ablations are hindered by under-standardized biological repositories tightly coupled to domain-specific data and formats. Existing coding agents can turn ideas into implementations but lack a verifier that reproduces strong baselines and rigorously tests each component's contribution. The key to the proposed AblateCell, a reproduce-then-ablate agent, is to first reproduce baselines end-to-end by auto-configuring environments, resolving dependency and data issues, and rerunning official evaluations while emitting verifiable artifacts, and then to run closed-loop ablations by adaptively selecting experiments from a graph of isolated repository mutations under a reward that balances performance impact against execution cost. On three single-cell perturbation prediction repositories (CPA, GEARS, BioLORD), AblateCell achieves an 88.9% end-to-end workflow success rate and 93.3% accuracy in recovering ground-truth critical components, clearly outperforming human experts and heuristics and offering a new paradigm for scalable, repository-grounded verification and attribution on biological codebases.

Link: https://arxiv.org/abs/2604.19606
Authors: Xue Xia, Chengkai Yao, Mingyu Tsoi, Xinjie Mao, Wenxuan Huang, Jiaqi Wei, Hao Wu, Cheng Tan, Lang Yu, Yuejin Yang, Siqi Sun, Zhangyang Gao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 25 pages, 5 figures

Abstract:Systematic ablations are essential to attribute performance gains in AI Virtual Cells, yet they are rarely performed because biological repositories are under-standardized and tightly coupled to domain-specific data and formats. While recent coding agents can translate ideas into implementations, they typically stop at producing code and lack a verifier that can reproduce strong baselines and rigorously test which components truly matter. We introduce AblateCell, a reproduce-then-ablate agent for virtual cell repositories that closes this verification gap. AblateCell first reproduces reported baselines end-to-end by auto-configuring environments, resolving dependency and data issues, and rerunning official evaluations while emitting verifiable artifacts. It then conducts closed-loop ablation by generating a graph of isolated repository mutations and adaptively selecting experiments under a reward that trades off performance impact and execution cost. Evaluated on three single-cell perturbation prediction repositories (CPA, GEARS, BioLORD), AblateCell achieves 88.9% (+29.9% to human expert) end-to-end workflow success and 93.3% (+53.3% to heuristic) accuracy in recovering ground-truth critical components. These results enable scalable, repository-grounded verification and attribution directly on biological codebases.

[MA-2] TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems

Quick Read: This paper addresses the difficulty of reaching consensus amid diverse viewpoints in open-ended teamwork, where conventional answer-aggregation methods tend to suppress minority perspectives rather than resolve disagreements. The key to the proposed multi-agent system, TeamFusion, is to instantiate a proxy agent conditioned on each team member's preferences, conduct structured discussions that surface agreements and disagreements, and synthesize more consensus-oriented deliverables that drive iterative refinement, significantly outperforming direct-aggregation baselines across tasks and team configurations.

Link: https://arxiv.org/abs/2604.19589
Authors: Jiale Liu, Victor S. Bursztyn, Lin Ai, Haoliang Wang, Sunav Choudhary, Saayan Mitra, Qingyun Wu
Affiliations: Pennsylvania State University; Adobe Research; Columbia University; AG2ai, Inc.
Subjects: Multiagent Systems (cs.MA)
Comments: 22 pages

Abstract:In open-ended domains, teams must reconcile diverse viewpoints to produce strong deliverables. Answer aggregation approaches commonly used in closed domains are ill-suited to this setting, as they tend to suppress minority perspectives rather than resolve underlying disagreements. We present TeamFusion, a multi-agent system designed to support teamwork in open-ended domains by: 1. Instantiating a proxy agent for each team member conditioned on their expressed preferences; 2. Conducting a structured discussion to surface agreements and disagreements; and 3. Synthesizing more consensus-oriented deliverables that feed into new iterations of discussion and refinement. We evaluate TeamFusion on two teamwork tasks where team members can assess how well their individual views are represented in team decisions and how consensually strong the final deliverables are, finding that it outperforms direct aggregation baselines across metrics, tasks, and team configurations.

[MA-3] FOCAL: Filtered On-device Continuous Activity Logging for Efficient Personal Desktop Summarization

Quick Read: This paper tackles efficient, privacy-friendly on-device organization of desktop interaction streams into tasks, facing two core challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. The key to the proposed multi-agent system FOCAL (Filtered On-device Continuous Activity Logging) is a unified filter-plan-log architecture: a lightweight Filter Agent suppresses noise, a text-only Brain Agent performs task attribution, a Record Agent carries out selective visual reasoning, and a task-isolated Memory Agent produces context-coherent summaries. This design sharply reduces VLM call counts and total token consumption while improving Key Information Recall (KIR) and Task Accuracy, and remains robust under task interruptions.

Link: https://arxiv.org/abs/2604.19541
Authors: Haoran Yin, Zhiyuan Wen, Jiannong Cao, Bo Yuan, Ruosong Yang
Affiliations: The Hong Kong Polytechnic University; China Mobile Communications Company Limited Research Institute
Subjects: Multiagent Systems (cs.MA); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Desktop interaction streams provide a continuous, privacy-sensitive record of interleaved user tasks. Transforming these streams into task-organized personal logs on-device faces two main challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. We present FOCAL (Filtered On-device Continuous Activity Logging), a privacy-first multi-agent system utilizing a unified filter-plan-log architecture. It cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for context-coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under A→B→A task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on-device summarization of instruction-free desktop streams into multi-perspective personal logs.

[MA-4] Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems

Quick Read: This paper addresses cross-session cognitive collaboration in multi-agent systems, i.e., how multiple large language model (LLM) agents can share, evaluate, and combine each other's cognitive state over long-horizon tasks to support sustained teamwork and knowledge accumulation. The core challenges are: (1) agents must decide field by field whether to accept peer information rather than accepting messages wholesale; (2) every claim must be traceable to its source so returning claims are recognized rather than re-learned; and (3) the usefulness of memory that survives session restarts depends on how it was stored, not how it is retrieved. The key solution is a new protocol layer called "semantic infrastructure", specified by the Mesh Memory Protocol (MMP) through four composable primitives: CAT7 defines a fixed seven-field Cognitive Memory Block (CMB) schema; SVAF evaluates each field against the receiver's role-indexed anchors to realize P1; inter-agent lineage, carried as parents and ancestors of content-hash keys, realizes P2; and remix stores only the receiver's role-evaluated understanding of each accepted CMB, ensuring P3's persistence and semantic consistency. MMP is deployed across three production settings where autonomous agents collaborate as independent mesh peers.

Link: https://arxiv.org/abs/2604.19540
Authors: Hongwei Xu
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 23 pages, 2 figures, 2 listings, 1 table. MMP v0.2.3 specification at this https URL (CC BY 4.0). Reference implementations on npm (@sym-bot/sym, @sym-bot/mesh-channel; Apache 2.0)

Abstract:Teams of LLM agents increasingly collaborate on tasks spanning days or weeks: multi-day data-generation sprints where generator, reviewer, and auditor agents coordinate in real time on overlapping batches; specialists carrying findings forward across session restarts; product decisions compounding over many review rounds. This requires agents to share, evaluate, and combine each other’s cognitive state in real time across sessions. We call this cross-session agent-to-agent cognitive collaboration, distinct from parallel agent execution. To enable it, three problems must be solved together. (P1) Each agent decides field by field what to accept from peers, not accept or reject whole messages. (P2) Every claim is traceable to source, so returning claims are recognised as echoes of the receiver’s own prior thinking. (P3) Memory that survives session restarts is relevant because of how it was stored, not how it is retrieved. These are protocol-level properties at the semantic layer of agent communication, distinct from tool-access and task-delegation protocols at lower layers. We call this missing protocol layer “semantic infrastructure,” and the Mesh Memory Protocol (MMP) specifies it. Four composable primitives work together: CAT7, a fixed seven-field schema for every Cognitive Memory Block (CMB); SVAF, which evaluates each field against the receiver’s role-indexed anchors and realises P1; inter-agent lineage, carried as parents and ancestors of content-hash keys and realising P2; and remix, which stores only the receiver’s own role-evaluated understanding of each accepted CMB, never the raw peer signal, realising P3. MMP is specified, shipped, and running in production across three reference deployments, where each session runs an autonomous agent as a mesh peer with its own identity and memory, collaborating with other agents across the network for collective intelligence.
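The lineage primitive (P2) described above can be sketched in a few lines: each block's key is a hash of its content, and parents plus accumulated ancestor keys let a receiver recognize an echo of its own earlier claim. The block fields and helper names here are illustrative stand-ins, not MMP's actual CAT7 schema.

```python
import hashlib
import json

def cmb_key(block: dict) -> str:
    """Content-hash key: digest of the block's canonical JSON (illustrative, not the MMP spec)."""
    canonical = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def derive(claim: str, parents: list) -> dict:
    """Create a block whose lineage carries parent keys and every ancestor key."""
    parent_keys = [cmb_key(p) for p in parents]
    ancestors = sorted(set(parent_keys) | {a for p in parents for a in p["ancestors"]})
    return {"claim": claim, "parents": parent_keys, "ancestors": ancestors}

root = {"claim": "obs-1", "parents": [], "ancestors": []}
child = derive("interpretation of obs-1", [root])
echo = derive("obs-1 restated by a peer", [child])

# The originating agent can recognise the returning claim as an echo of its own prior thinking:
print(cmb_key(root) in echo["ancestors"])  # True
```

Because ancestor keys accumulate down the chain, the check stays a simple set membership no matter how many hops the claim has travelled.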

[MA-5] Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

Quick Read: This paper addresses the poor real-world generalization of existing fall-risk monitoring systems for the elderly, namely weak context awareness, high false alarm rates, environmental noise, and data scarcity. The key idea is to reformulate fall detection and fall prediction as anomaly detection problems and to address them with an agentic AI system featuring goal-directed, proactive, and autonomous decision-making, which dynamically selects relevant tools and integrates them into adaptive decision-making workflows so that subtle deviations in movement patterns can be identified early, improving the accuracy and practicality of risk warnings.

Link: https://arxiv.org/abs/2604.19538
Authors: Farbod Zorriassatine, Ahmad Lotfi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 6 pages, 3 figures

Abstract:Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.

[MA-6] Assessing VLM-Driven Semantic-Affordance Inference for Non-Humanoid Robot Morphologies AAMAS2026

Quick Read: This paper addresses the underexplored ability of vision-language models (VLMs) to infer object affordances for non-humanoid robotic systems, filling a gap in deploying these models across diverse robot morphologies. The key contribution is a hybrid dataset combining annotated real-world robotic affordance data with VLM-generated synthetic scenarios, together with an empirical analysis across object categories and robot morphologies. The analysis reveals that VLMs lean toward conservative predictions (low false positive but high false negative rates), especially for novel tool use and unconventional manipulations, suggesting that complementary approaches are needed to mitigate over-conservatism while preserving the safety benefit of low false positives.

Link: https://arxiv.org/abs/2604.19509
Authors: Jess Jones, Raul Santos-Rodriguez, Sabine Hauert
Affiliations: University of Bristol
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: AAMAS 2026 (main track), 9 pages, 4 figures

Abstract:Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding human-object interactions, but their application to robotic systems with non-humanoid morphologies remains largely unexplored. This work investigates whether VLMs can effectively infer affordances for robots with fundamentally different embodiments than humans, addressing a critical gap in the deployment of these models for diverse robotic applications. We introduce a novel hybrid dataset that combines annotated real-world robotic affordance-object relations with VLM-generated synthetic scenarios, and perform an empirical analysis of VLM performance across multiple object categories and robot morphologies, revealing significant variations in affordance inference capabilities. Our experiments demonstrate that while VLMs show promising generalisation to non-humanoid robot forms, their performance is notably inconsistent across different object domains. Critically, we identify a consistent pattern of low false positive rates but high false negative rates across all morphologies and object categories, indicating that VLMs tend toward conservative affordance predictions. Our analysis reveals that this pattern is particularly pronounced for novel tool use scenarios and unconventional object manipulations, suggesting that effective integration of VLMs in robotic systems requires complementary approaches to mitigate over-conservative behaviour while preserving the inherent safety benefits of low false positive rates.

[MA-7] Large Language Models Exhibit Normative Conformity

Quick Read: This paper studies the conformity bias of large language models (LLMs) in LLM-based multi-agent systems (LLM-MAS) and its potential impact on group decision-making. Whereas prior work treats "conformity" as a single phenomenon, the key innovation here is importing the social-psychological distinction between informational conformity and normative conformity to understand LLM conformity at the mechanism level. The authors design two new task types to separate the two kinds of conformity and show experimentally that most of the evaluated LLMs exhibit both; moreover, by manipulating subtle aspects of the social context, the target of an LLM's normative conformity can be steered, revealing that group decision-making in LLM-MAS may be vulnerable to a small number of malicious users. Analysis of internal vectors further suggests that although the two behaviors look identical externally, informational and normative conformity may be driven by distinct internal mechanisms, offering key insight into how "norms" are implemented in LLMs and how they shape group dynamics.

Link: https://arxiv.org/abs/2604.19301
Authors: Mikako Bito, Keita Nishimoto, Kimitaka Asatani, Ichiro Sakata
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:The conformity bias exhibited by large language models (LLMs) can pose a significant challenge to decision-making in LLM-based multi-agent systems (LLM-MAS). While many prior studies have treated “conformity” simply as a matter of opinion change, this study introduces the social psychological distinction between informational conformity and normative conformity in order to understand LLM conformity at the mechanism level. Specifically, we design new tasks to distinguish between informational conformity, in which participants in a discussion are motivated to make accurate judgments, and normative conformity, in which participants are motivated to avoid conflict or gain acceptance within a group. We then conduct experiments based on these task settings. The experimental results show that, among the six LLMs evaluated, up to five exhibited tendencies toward not only informational conformity but also normative conformity. Furthermore, intriguingly, we demonstrate that by manipulating subtle aspects of the social context, it may be possible to control the target toward which a particular LLM directs its normative conformity. These findings suggest that decision-making in LLM-MAS may be vulnerable to manipulation by a small number of malicious users. In addition, through analysis of internal vectors associated with informational and normative conformity, we suggest that although both behaviors appear externally as the same form of “conformity,” they may in fact be driven by distinct internal mechanisms. Taken together, these results may serve as an initial milestone toward understanding how “norms” are implemented in LLMs and how they influence group dynamics.

[MA-8] Explicit Trait Inference for Multi-Agent Coordination

Quick Read: This paper addresses common coordination failures of LLM-based multi-agent systems (MAS) on complex tasks, such as goal drift, error cascades, and misaligned behavior. The key is Explicit Trait Inference (ETI), a psychologically grounded method that infers and tracks partners' traits from interaction histories along two established dimensions, warmth (e.g., trust) and competence (e.g., skill), to guide agent decisions. Experiments show ETI reduces payoff loss by 45-77% in controlled settings (economic games) and improves performance by 3-29% on the more realistic MultiAgentBench, with gains closely tied to trait-inference quality, demonstrating that LLM agents can reliably infer others' traits from interaction and leverage structured awareness of others for effective coordination.

Link: https://arxiv.org/abs/2604.19278
Authors: Suhaib Abdurahman, Etsuko Ishii, Katerina Margatina, Divya Bhargavi, Monica Sunkara, Yi Zhang
Affiliations: AWS Agentic AI Labs; University of Southern California
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:LLM-based multi-agent systems (MAS) show promise on complex tasks but remain prone to coordination failures such as goal drift, error cascades, and misaligned behaviors. We propose Explicit Trait Inference (ETI), a psychologically grounded method for improving coordination. ETI enables agents to infer and track partner characteristics along two established psychological dimensions–warmth (e.g., trust) and competence (e.g., skill)–from interaction histories to guide decisions. We evaluate ETI in controlled settings (economic games), where it reduces payoff loss by 45-77%, and in more realistic, complex multi-agent settings (MultiAgentBench), where it improves performance by 3-29% depending on the scenario and model, relative to a CoT baseline. Additional analysis shows that gains are closely linked to trait inference: ETI profiles predict agents’ actions, and informative profiles drive improvements. These results highlight ETI as a lightweight and robust mechanism for improving coordination in diverse multi-agent settings, and provide the first systematic evidence that LLM agents can (i) reliably infer others’ traits from interaction histories and (ii) leverage structured awareness of others’ traits for coordination.

[MA-9] BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications

Quick Read: This paper addresses two dilemmas in developing Visual Analytics (VA) applications: traditional development yields tightly coupled, fragile monoliths that are hard to maintain, while current AI code generation produces code quickly but without structure or auditability, falling short of the controllability and provenance complex VA applications require. The key to the proposed mixed-initiative workspace, BONSAI, is a modular four-layer architecture (hardware, services, orchestration, application) combined with a structured four-phase process (plan, design, monitor, review) through which human developers and AI agents co-develop at different layers, with all contributions structurally bounded and fully tracked, balancing development flexibility with interpretability and maintainability.

Link: https://arxiv.org/abs/2604.19247
Authors: Thilo Spinner, Matthias Miller, Fabian Sperrle-Roth, Mennatallah El-Assady
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 9 pages paper, 2 pages references, 10 figures

Abstract:Developing Visual Analytics (VA) applications requires integrating complex machine learning models with expressive interactive interfaces. Developers face a stark trade-off: building tightly-coupled monoliths plagued by fragile interdependencies, or relying on restrictive, simplistic frameworks. Meanwhile, unconstrained, single-shot AI code generation promises speed but yields unstructured, unauditable chaos. The core challenge is combining the control and expressiveness of custom development with the efficiency of AI generation under strict constraints. To address this, we introduce BONSAI, a mixed-initiative workspace for the multi-agent co-development of VA applications. BONSAI utilizes a modular four-layer architecture (hardware, services, orchestration, application) that allows human and AI developers to independently contribute reusable components. The workspace incorporates this architecture into a structured four-phase development process (plan, design, monitor, and review), ensuring distributed agency and full provenance, where all human and AI contributions are structurally bounded and tracked. We evaluate BONSAI through case studies demonstrating the efficient creation of novel tools and the rapid reconstruction of complex VA applications directly from research paper descriptions. Ultimately, this paper contributes a conceptual workflow, a scalable architecture, and an integrated system that successfully balances AI’s generative speed with the structural rigor required for complex VA development.

[MA-10] ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies

Quick Read: This paper addresses the inability of decentralized autonomous agent economies to meter and settle compute costs: agents depend on API compute tokens for reasoning, acting, and delegation, but such tokens are account-bound, vendor-locked, and absent from on-chain ledgers, so existing payment rails like x402 can move fiat-backed value yet cannot quote, escrow, or settle workflows in units aligned with actual compute consumption. The core solution is ClawCoin, a compute-cost-indexed unit of account and settlement asset built from four key layers: a robust basket index over standardized prices, an oracle publishing signed fresh attestations, a NAV-based mint/redeem vault with coverage thresholds and rate limits, and an on-chain settlement layer for multi-hop delegation. Experiments across single-agent, multi-agent, workflow, and procurement scenarios show that ClawCoin stabilizes execution capacity, reduces cross-agent quote dispersion, eliminates partial settlements, and sustains cooperative market dynamics beyond what fiat-denominated baselines achieve, indicating that compute-indexed units of account can markedly improve decentralized agent coordination.

Link: https://arxiv.org/abs/2604.19026
Authors: Shaoyu Li, Chaoyu Zhang, Hexuan Yu, Y. Thomas Hou, Wenjing Lou
Affiliations: Virginia Tech
Subjects: Multiagent Systems (cs.MA); Cryptography and Security (cs.CR)
Comments:

Abstract:Autonomous AI agents live or die by the API tokens they consume: without paid inference capacity they cannot reason, act, or delegate. Compute-token cost has become the binding resource of the emerging agent economy, yet it is non-transferable: it is account-bound, vendor-specific, and absent from on-chain ledgers. Existing payment rails such as x402 move fiat-backed value between agents, but they do not represent the quantity agents actually burn. As a result, agents can transport purchasing power but cannot quote, escrow, or settle workflows in a unit aligned with compute cost. We present ClawCoin, a tokenized, compute-cost-indexed unit of account and settlement asset for decentralized agent economies. ClawCoin combines four layers: a robust basket index over standardized prices; an oracle publishing signed fresh attestations; a NAV-based mint/redeem vault with coverage thresholds and rate limits; and an on-chain settlement layer for multi-hop delegations. We implement a prototype on an Ethereum-compatible L2 and evaluate it using a multi-agent simulator and the OpenClaw testbed. Across single-agent, multi-agent, workflow, and procurement experiments, ClawCoin stabilizes execution capacity under cost shocks, reduces cross-agent quote dispersion, eliminates partial settlements, and sustains cooperative market dynamics that fiat-denominated baselines cannot. These results suggest that compute-indexed units of account can improve decentralized agent coordination.
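The abstract names a "robust basket index over standardized prices" without specifying the estimator; a median over per-vendor price relatives is one minimal robust choice, sketched here with hypothetical vendor data (the function name and weighting are assumptions, not ClawCoin's actual index).

```python
from statistics import median

def basket_index(current_prices: dict, base_prices: dict) -> float:
    """Robust compute-cost index: median of per-vendor standardized prices,
    i.e. each vendor's current price relative to its base-period price.
    Illustrative only; the paper does not specify its estimator."""
    relatives = [current_prices[v] / base_prices[v] for v in current_prices]
    return median(relatives)

base = {"vendor_a": 10.0, "vendor_b": 8.0, "vendor_c": 12.0}
now = {"vendor_a": 11.0, "vendor_b": 8.0, "vendor_c": 120.0}  # vendor_c spikes 10x

# The median ignores the outlier spike, so the index tracks the typical vendor.
print(basket_index(now, base))  # 1.1
```

A mean-based index would jump with the single spiking vendor; the median keeps the unit of account stable under such cost shocks, which is the robustness property the index layer needs.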

[MA-11] Gated Coordination for Efficient Multi-Agent Collaboration in Minecraft Game

Quick Read: This paper addresses coordination noise, interrupted local execution, and overuse of public interaction in long-horizon open-world multi-agent systems, which arise when local anomalies are treated by default as communication triggers. The core solution is a partitioned information architecture that explicitly separates private execution state from public coordination state, with two key mechanisms: an event-triggered working memory based on system-verified outcomes that maintains compact, low-noise local state representations, and a cost-sensitive gated escalation mechanism that decides whether to initiate cross-region communication by jointly weighing node criticality, local recovery cost, and downstream task impact, turning communication from a default reaction into a selective decision.

Link: https://arxiv.org/abs/2604.18975
Authors: HuaDong Jian, Chenghao Li, Haoyu Wang, Jiajia Shuai, Jinyu Guo, Yang Yang, Chaoning Zhang
Affiliations: University of Electronic Science and Technology of China
Subjects: Multiagent Systems (cs.MA)
Comments:

Abstract:In long-horizon open-world multi-agent systems, existing methods often treat local anomalies as automatic triggers for communication. This default design introduces coordination noise, interrupts local execution, and overuses public interaction in cases that could be resolved locally. To address this issue, we propose a partitioned information architecture for MLLM agents that explicitly separates private execution states from public coordination states. Building on this design, we introduce two key mechanisms. First, we develop an event-triggered working memory based on system-verified outcomes to maintain compact and low-noise local state representations. Second, we propose a cost-sensitive gated escalation mechanism that determines whether cross-region communication should be initiated by jointly considering node criticality, local recovery cost, and downstream task impact. In this way, communication is transformed from a default reaction into a selective decision. Experiments conducted on long-term construction tasks in open environments demonstrate that, compared to baseline models based on strong communication and planned structures, the introduction of gated communication and a partitioned information architecture results in superior performance in terms of blueprint completion quality and execution chain length. It also improves local self-recovery, reduces ineffective escalations, and increases the utility of public communication.
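The cost-sensitive gate described above can be illustrated with a toy scoring rule over the three factors the abstract names; the linear form, weights, and threshold here are assumptions for illustration, not the paper's actual mechanism.

```python
def should_escalate(criticality: float, recovery_cost: float, downstream_impact: float,
                    threshold: float = 1.0, weights: tuple = (1.0, 1.0, 1.0)) -> bool:
    """Gated escalation sketch (hypothetical linear scoring): initiate cross-region
    communication only when criticality and downstream impact outweigh the cost
    of simply recovering locally."""
    wc, wr, wd = weights
    score = wc * criticality + wd * downstream_impact - wr * recovery_cost
    return score > threshold

# A low-impact anomaly that is cheap to handle locally stays private...
print(should_escalate(criticality=0.2, recovery_cost=0.9, downstream_impact=0.1))  # False
# ...while a critical node with costly local recovery triggers public communication.
print(should_escalate(criticality=0.9, recovery_cost=0.2, downstream_impact=0.8))  # True
```

The point of the gate is exactly this asymmetry: communication becomes a decision whose expected benefit must clear a bar, rather than a reflex fired by every local anomaly.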

[MA-12] Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems

Quick Read: This paper addresses the limited generalization of adaptive multi-agent systems (MAS) on complex tasks, whose optimization targets narrow task settings and thus falls short of a general-purpose solution. A large-scale empirical study reveals two core phenomena: topological overfitting, i.e., failure to generalize across domains, and illusory coordination, i.e., reasonable surface-level accuracy while agent interactions diverge from ideal MAS behavior. The key takeaway is to refocus on improving MAS generalization and to design evaluation protocols that go beyond final-answer correctness, better reflecting coordination quality and robustness in diverse environments.

Link: https://arxiv.org/abs/2604.18951
Authors: Namyoung So, Seokgyu Jang, Taeuk Kim
Affiliations: Department of Computer Science, Hanyang University, Seoul, Republic of Korea
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: 27 pages, 4 figures. Equal contribution for the first two authors

Abstract:Adaptive multi-agent systems (MAS) are increasingly adopted to tackle complex tasks. However, the narrow task coverage of their optimization raises the question of whether they can function as general-purpose solvers. To address this gap, we conduct an extensive empirical study of adaptive MAS, revealing two key findings: (1) topological overfitting – they fail to generalize across different domains; and (2) illusory coordination – they achieve reasonable surface-level accuracy while the underlying agent interactions diverge from ideal MAS behavior, raising concerns about their practical reliability. These findings highlight the pressing need to prioritize generalization in MAS development and motivate evaluation protocols that extend beyond simple final-answer correctness.

[MA-13] HadAgent: Harness-Aware Decentralized Agentic AI Serving with Proof-of-Inference Blockchain Consensus

Quick Read: This paper addresses the severe resource waste of traditional Proof-of-Work (PoW) blockchain consensus, which produces no useful output, at a time when large language model (LLM) agents are driving unprecedented demand for GPU compute. The core solution, HadAgent, adopts a new consensus mechanism, Proof-of-Inference (PoI), in which nodes earn block-creation rights by executing deterministic LLM inference tasks; because verification only requires re-executing a single forward pass under identical conditions, cross-node verification runs at consensus speed. The design uses a three-lane block structure (DATA, MODEL, PROOF) for fine-grained tamper detection and a two-tier node architecture: trusted nodes serve inference results through optimistic execution for efficiency, while non-trusted nodes undergo full consensus verification. Heartbeat probes, anomaly detection via deterministic recomputation, and automated trust management form a self-correcting feedback loop that isolates malicious or unreliable nodes and promotes honest ones to trusted status.

Link: https://arxiv.org/abs/2604.18614
Authors: Landy Jimenez, Mariah Weatherspoon, Bingyu Shen, Yi Sheng, Jianming Liu, Boyang Li
Affiliations: Kean University; University of South Florida
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments: 9 pages, 5 figures

Abstract:Proof-of-Work (PoW) blockchain consensus consumes vast computational resources without producing useful output, while the rapid growth of large language model (LLM) agents has created unprecedented demand for GPU computation. We present HadAgent, a decentralized agentic AI serving system that replaces hash-based mining with Proof-of-Inference (PoI), a consensus mechanism in which nodes earn block-creation rights by executing deterministic LLM inference tasks. Because verification requires only re-executing a single forward pass under identical conditions, cross-node verification operates at consensus speed. HadAgent organizes validated records into a three-lane block body with dedicated DATA, MODEL, and PROOF channels, each protected by an independent Merkle root for fine-grained tamper detection. A two-tier node architecture classifies secondary nodes as trusted or non-trusted based on historical behavior: trusted nodes serve inference results in real time through optimistic execution, while non-trusted nodes must undergo full consensus verification. A harness layer monitors node behavior through heartbeat probes, anomaly detection via deterministic recomputation, and automated trust management, creating a self-correcting feedback loop that isolates malicious or unreliable participants. Experiments on a prototype implementation demonstrate 100% detection rate and 0% false positive rate for tampered records, sub-millisecond validation latency for record and hub operations, and effective harness convergence that excludes adversarial nodes within two rounds while promoting honest nodes to trusted status within five rounds.
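The per-lane Merkle roots can be sketched as follows; the record contents are made up, and the leaf/pair hashing conventions are one common choice rather than HadAgent's specification.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(records: list) -> bytes:
    """Merkle root over a lane's records; an odd node is paired with itself.
    Conventions (leaf hashing, odd-node duplication) are illustrative choices."""
    level = [sha256(r) for r in records] or [sha256(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# One independent root per lane gives fine-grained tamper detection:
lanes = {"DATA": [b"record-1", b"record-2"], "MODEL": [b"weights-hash"], "PROOF": [b"inference-proof"]}
roots = {name: merkle_root(recs) for name, recs in lanes.items()}

tampered = merkle_root([b"record-1", b"record-X"])
print(tampered != roots["DATA"])  # True: only the DATA lane's root changes
```

Because each lane carries its own root, a validator can localize tampering to the DATA, MODEL, or PROOF channel without rehashing the whole block body.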

[MA-14] Opinion polarization from compression-based decision making where agents optimize local complexity and global simplicity

Quick Read: This paper investigates how social polarization forms, focusing on how individual cognition shapes complex collective behavior. The key is a novel agent-based model integrating two psychological and social mechanisms: the drive for distinctiveness within a group (optimal distinctiveness theory) and the cognitive tendency to compress complex information. The model uses Shannon entropy to quantify local diversity and global simplification, simulating how individuals balance these two opposing drives during interactions. Results show that such simple psychological rules reproduce the heterogeneous opinion clusters observed in the real world, and that polarization depends on key parameters: moderate local group sizes (consistent with Dunbar's number) most readily polarize, higher cognitive compression increases unpredictability, and lower compression yields more consistent group structures, deepening the dynamic understanding of polarization in human societies.

Link: https://arxiv.org/abs/2604.18755
Authors: Alina Dubovskaya, David J. P. O'Sullivan, Michael Quayle
Affiliations: Unknown
Subjects: Physics and Society (physics.soc-ph); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO)
Comments:

Abstract:Understanding social polarization requires integrating insights from psychology, sociology, and complex systems science. Agent-based modeling provides a natural framework to combine perspectives from different fields and explore how individual cognition shapes collective outcomes. This study introduces a novel agent-based model that integrates two cognitive and social mechanisms: the desire to be unique within a group (optimal distinctiveness theory) and the tendency to simplify complex information (cognitive compression). In the model, virtual agents interact in pairs and decide whether to adopt each other’s opinions by balancing two opposing drives: maximizing opinion diversity within their local social group while simplifying the overall opinion landscape, with both evaluated using Shannon entropy. We show that the combination of these mechanisms can reproduce real-world patterns, such as the emergence of distinct heterogeneous opinion clusters. Moreover, unlike many existing models where opinions become fixed once opinion groups form, individuals in our model continue to adjust their opinions after clusters emerge, leading to ongoing variation within and between opinion groups. Computational experiments reveal that polarization emerges when local group sizes are moderate (consistent with Dunbar’s number), while smaller groups cause fragmentation and larger ones hinder distinct cluster formation. Higher cognitive compression increases unpredictability, while lower compression produces more consistent group structures. These results demonstrate how simple psychological rules can generate complex, realistic social behavior and advance understanding of polarization in human societies.
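The Shannon-entropy quantity the agents evaluate can be made concrete with a small helper: entropy of the opinion distribution is maximal when opinions are evenly spread (diversity) and zero when the group is unanimous (simplicity). This is a generic entropy computation, not the paper's full decision rule.

```python
from collections import Counter
from math import log2

def opinion_entropy(opinions: list) -> float:
    """Shannon entropy (in bits) of the opinion distribution in a group."""
    counts = Counter(opinions)
    n = len(opinions)
    # The trailing +0.0 normalizes -0.0 to 0.0 for the unanimous case.
    return -sum((c / n) * log2(c / n) for c in counts.values()) + 0.0

# A uniform split over four opinions gives the maximum entropy log2(4) = 2 bits:
print(opinion_entropy(["a", "b", "c", "d"]))  # 2.0
# A unanimous group has zero entropy:
print(opinion_entropy(["a", "a", "a", "a"]))  # 0.0
```

In the model's terms, an agent weighing "maximize local diversity" against "simplify the global landscape" is comparing this quantity computed over its local group against the same quantity computed over the whole population.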

Natural Language Processing

[NLP-0] Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views ACL2026

【Quick Read】: This paper addresses the weakness of large language models (LLMs) on multi-step logical reasoning: existing approaches either refine the reasoning chain purely in natural language or attach an external symbolic solver, making it hard to exploit the strengths of both views jointly. The key to the solution is the hypothesis that LLMs contain a shared internal logical subspace that simultaneously aligns the natural-language and symbolic views of the reasoning process while remaining independent of surface forms. The authors learn this highly correlated low-dimensional subspace via Canonical Correlation Analysis on paired residual activations, and design a training-free steering mechanism that moves the model's reasoning along this logical subspace, fusing reasoning signals from both views. Experiments on four logical reasoning benchmarks show accuracy gains of up to 11 percentage points and good generalization to out-of-domain problems.

Link: https://arxiv.org/abs/2604.19716
Authors: Feihao Fang, My T. Thai, Yuanyuan Lei
Institutions: University of Illinois Urbana-Champaign; University of Florida
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026

Click to view abstract

Abstract:Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers the LLM's reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well on out-of-domain problems.
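The cross-view alignment idea can be illustrated with a toy sketch. Note this is not the paper's method: proper CCA whitens each view before correlating, whereas this dependency-free stand-in simply power-iterates on the raw cross-covariance to find one dominant shared direction; all function names here are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def cross_cov(X, Y):
    """Cross-covariance of paired activation lists X (n x dx) and Y (n x dy)."""
    n, dx, dy = len(X), len(X[0]), len(Y[0])
    mx = [sum(x[j] for x in X) / n for j in range(dx)]
    my = [sum(y[j] for y in Y) / n for j in range(dy)]
    return [[sum((X[i][a] - mx[a]) * (Y[i][b] - my[b]) for i in range(n)) / n
             for b in range(dy)] for a in range(dx)]

def top_shared_direction(X, Y, iters=50):
    """Power iteration on C C^T: one dominant cross-view direction.
    (Real CCA whitens each view first; this sketch skips that step.)"""
    C = cross_cov(X, Y)
    Ct = [list(row) for row in zip(*C)]
    v = [1.0] * len(C)
    for _ in range(iters):
        w = matvec(C, matvec(Ct, v))
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v

def steer(hidden, direction, alpha=1.0):
    """Training-free steering: nudge a hidden state along the shared direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

Given paired activations whose shared variation lies along one axis, the recovered direction concentrates on that axis, and `steer` adds it to a residual-stream state at inference time.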

[NLP-1] Epistemic orientation in parliamentary discourse is associated with deliberative democracy

【Quick Read】: This paper addresses the difficulty of quantifying epistemic orientation in political discourse and its unclear impact on democratic deliberation and governance quality. The key to the solution is a scalable measure, the Evidence-Minus-Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity, which captures the contrast between evidence-based and intuition-based reasoning in political speech. Applying the measure to 15 million parliamentary speech segments from seven countries between 1946 and 2025, the study finds that EMI is positively associated with the quality of deliberative democracy and with governance transparency, showing that the epistemic character of political discourse matters for both democratic quality and effective governance.

Link: https://arxiv.org/abs/2604.19699
Authors: Segun Aroyehun, Stephan Lewandowsky, David Garcia
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence–Minus–Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.
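The paper derives EMI from LLM ratings combined with embedding-based semantic similarity; a minimal embedding-side sketch of the "evidence minus intuition" contrast might look as follows. The anchor vectors stand in for hypothetical evidence- and intuition-prototype embeddings; the real pipeline's details differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def emi_score(speech_vec, evidence_anchor, intuition_anchor):
    """Evidence-Minus-Intuition contrast: positive values lean toward
    evidence-based reasoning, negative toward intuition-based reasoning."""
    return cosine(speech_vec, evidence_anchor) - cosine(speech_vec, intuition_anchor)
```

A speech embedded close to the evidence anchor scores near +1; one close to the intuition anchor scores near -1, giving a continuous axis suitable for aggregation over millions of segments.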

[NLP-2] An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA ACL

【Quick Read】: This paper addresses users' need to refine answers to open-ended questions over multiple iterations, a dynamic interaction that existing QA benchmarks do not explicitly support. The core challenge is generating additional insights from a document collection that help improve, extend, or rethink an initial answer, deepening and enriching the question answering experience. The key to the solution is InsightGen, a two-stage approach: it first builds a thematic representation of the document collection via clustering, then selects related context through neighborhood selection over the thematic graph, and finally uses large language models (LLMs) to generate diverse, relevant insights that complement and deepen the initial answer.

Link: https://arxiv.org/abs/2604.19685
Authors: Saransh Sharma, Pritika Ramu, Aparna Garimella, Koyel Mukherjee
Institutions: Adobe Research, India; University of Maryland, College Park
Subjects: Computation and Language (cs.CL)
Comments: Paper accepted at ACL Findings 2026

Click to view abstract

Abstract:Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.

[NLP-3] Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

【Quick Read】: This paper investigates whether function vectors (FVs) in multilingual large models are language-agnostic, focusing on cross-lingual transfer in machine translation. The key to the solution is testing whether FVs extracted from a single English→Target direction transfer to unseen languages and improve the rank of correct translation tokens. The central finding is that these FVs consistently improve translation across multiple unseen target languages; ablating the FV degrades multilingual translation while leaving unrelated tasks largely intact, indicating strong cross-lingual generalization, and the FVs further transfer across model variants (base to instruction-tuned) and across granularities (word-level to sentence-level).

Link: https://arxiv.org/abs/2604.19678
Authors: Nurkhan Laiyk, Gerard I. Gállego, Javier Ferrando, Fajri Koto
Institutions: Mohamed bin Zayed University of Artificial Intelligence; Universitat Politècnica de Catalunya; Cantina Labs
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Function vectors (FVs) are vector representations of tasks extracted from model activations during in-context learning. While prior work has shown that multilingual model representations can be language-agnostic, it remains unclear whether the same holds for function vectors. We study whether FVs exhibit language-agnosticity, using machine translation as a case study. Across three decoder-only multilingual LLMs, we find that translation FVs extracted from a single English→Target direction transfer to other target languages, consistently improving the rank of correct translation tokens across multiple unseen languages. Ablation results show that removing the FV degrades translation across languages with limited impact on unrelated tasks. We further show that base-model FVs transfer to instruction-tuned variants and partially generalize from word-level to sentence-level translation.
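Function vectors are typically estimated from in-context-learning activations; a crude sketch of the extract-and-steer pattern, under the simplifying assumption that an FV is a mean activation difference at one layer, could be the following. The function names and the difference-of-means estimator are assumptions for illustration, not the paper's exact procedure.

```python
def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def extract_fv(task_acts, baseline_acts):
    """Crude function-vector estimate: mean activation under task prompts
    minus mean baseline activation at one layer (a simplification)."""
    t, b = mean_vec(task_acts), mean_vec(baseline_acts)
    return [ti - bi for ti, bi in zip(t, b)]

def apply_fv(hidden, fv, scale=1.0):
    """Add the FV to a hidden state, e.g. when prompting an unseen
    target language, to inject the translation task representation."""
    return [h + scale * f for h, f in zip(hidden, fv)]
```

The language-agnosticity question then becomes whether an FV extracted from English→Target activations, added via `apply_fv`, still raises the rank of correct tokens for other target languages.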

[NLP-4] Pause or Fabricate? Training Language Models for Grounded Reasoning

【Quick Read】: This paper addresses ungrounded reasoning in large language models: when inputs are incomplete, models still produce plausible-looking but unreliable conclusions without the necessary premises. The key to the solution is Grounded Reasoning via Interactive Reinforcement Learning (GRIL), which decomposes reasoning into a "clarify and pause" stage that checks whether the available information is sufficient, and a grounded reasoning stage that solves the task only once the premises are established. Stage-specific reward functions penalize hallucination, enabling the model to detect its inferential boundaries, stop proactively, and resume reasoning after clarification, substantially improving the reliability and efficiency of reasoning.

Link: https://arxiv.org/abs/2604.19656
Authors: Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan, Jin Ma, Xu Tan, Yao Hu, Daoxin Zhang, Wenqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Institutions: Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments: Code: this https URL

Click to view abstract

Abstract:Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions – a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness – the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.

[NLP-5] The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text

【Quick Read】: This paper asks how prompt design and model selection affect the accuracy with which generative AI predicts fan experience ratings from open-ended survey text. The key finding is that prompt customization yields a real gain (from 67% to 69% within-one-point agreement on GPT 4.1), whereas model swaps (to GPT 5.2 or GPT 4.1-mini) fail to help reliably and can degrade performance; more importantly, the linguistic character of the input text influences accuracy far more than any prompt or model change, indicating that the performance ceiling stems mainly from the gap between what the text contains and what fans actually decide, not from technical settings. The practical implication is to target prompt engineering at correctable biases in how the model reads text, rather than swapping models or over-engineering prompts.

Link: https://arxiv.org/abs/2604.19645
Authors: Andrew Hong, Jason Potteiger, Luis E. Zapata
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 42 pages, 7 figures, 10 tables

Click to view abstract

Abstract:An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order of magnitude more by the linguistic character of the text than by the choice of prompt or model. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a difference between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text. Prompt customization moved the first part; model selection moved neither reliably. The result is not that “prompt engineering helps a little” but that prompt engineering helps in a specific and predictable way, on the part of the ceiling it can reach.

[NLP-6] Micro Language Models Enable Instant Responses

【Quick Read】: This paper addresses the problem that resource-constrained edge devices (such as smartwatches and smart glasses) cannot continuously run even small language models (100M-1B parameters), while cloud inference introduces multi-second latencies that break the experience of a responsive assistant. The key to the solution is micro language models (μLMs): ultra-compact 8M-30M-parameter models that instantly generate the first 4-8 contextually grounded words of a response on-device, while a cloud model completes the rest, masking cloud latency. The authors design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs, with three error-correction methods for structured, graceful recovery when the local opener goes wrong. Experiments show that μLMs can initiate responses that much larger models complete seamlessly, demonstrating orders-of-magnitude asymmetric collaboration and unlocking responsive AI for extremely resource-constrained devices.

Link: https://arxiv.org/abs/2604.19642
Authors: Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer, Shyamnath Gollakota
Institutions: University of Washington; Meta AI
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device, while a cloud model completes it; thus, masking the cloud latency. We show that useful language generation survives at this extreme scale with our models matching several 70M-256M-class existing models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that μLMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at this https URL.

[NLP-7] SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models ACL2026

【Quick Read】: This paper addresses the insufficient ability of multimodal large language models (MLLMs) to proactively handle safety hazards in embodied interactive environments. Existing evaluations rely mainly on hazard recognition in disembodied question answering (QA), which does not reflect a model's capacity for risk mitigation in real physical settings. The key to the solution is SafetyALFRED, an embodied safety evaluation built on the ALFRED benchmark and augmented with six categories of real-world kitchen hazards, which for the first time jointly evaluates hazard recognition and active risk mitigation via embodied planning. The results reveal a significant capability gap between QA evaluation and embodied tasks, showing that static QA is insufficient for measuring physical safety and motivating a shift toward embodied safety benchmarks centered on corrective actions.

Link: https://arxiv.org/abs/2604.19638
Authors: Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai
Institutions: University of Michigan; Boise State University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Work accepted at ACL 2026 Findings

Click to view abstract

Abstract:Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under this https URL

[NLP-8] The “Small World of Words” German Free-Association Norms

【Quick Read】: This paper addresses the lack of large-scale free-association norms for German, a gap that has limited cognitive-science research on German language, semantics, and culture. The key to the solution is constructing and releasing the German free-association norms of the multilingual Small World of Words (SWOW) project (SWOW-DE), covering 5,877 German cue words, with rigorous data collection, participant characterization, and a comprehensive preprocessing pipeline to ensure data quality. The norms robustly predict performance in classic paradigms such as lexical decision, relatedness judgments, and psycholinguistic word ratings, compare favorably with existing German resources, and reveal both shared and language-specific association patterns across languages, providing an unprecedented high-quality resource for linguistic, psychological, and cross-cultural research.

Link: https://arxiv.org/abs/2604.19620
Authors: Samuel Aeschbach, Rui Mata, Kaidi Lõo, Simon De Deyne, Dirk U. Wulff
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Free-association norms provide essential empirical data for investigating linguistic, semantic, and cultural phenomena in the cognitive sciences. Although large-scale norms exist for languages such as English, Dutch, Spanish, and Mandarin Chinese, no comparable resource has been available for German. To address this gap, we present free-association norms for 5,877 German cue words as part of the German version of the multilingual Small World of Words (SWOW) project. We describe the data collection procedures, participant characteristics, and our comprehensive preprocessing pipeline before introducing the resulting SWOW-DE data set. Using data from three established psycholinguistic paradigms, we show that SWOW-DE norms robustly predict performance in lexical decision tasks, relatedness judgments, and psycholinguistic word ratings. Furthermore, we demonstrate that SWOW-DE responses compare favorably with existing German resources and provide a preliminary cross-linguistic comparison revealing both shared and language-specific association patterns, highlighting promising directions for future research. Overall, SWOW-DE represents the largest collection of German free associations to date and offers a unique resource for linguistic, psychological, and cross-cultural research.

[NLP-9] Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

【Quick Read】: This paper addresses the repeated-generation consistency of large language models (LLMs) producing exercise prescriptions, in order to assess their reliability for clinical deployment. The key to the solution is generating 20 repeated outputs from three mainstream LLMs (GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash) under identical temperature=0 settings and analyzing their behavior along four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. The study finds that although the models differ in semantic similarity (GPT-4.1 highest at 0.955), what truly determines consistency is not the raw score but the underlying generative behavior: GPT-4.1 produced entirely unique yet semantically stable outputs, whereas Gemini 2.5 Flash's high similarity was inflated by heavy repetition (only 27.5% unique outputs), showing that single-output evaluation cannot capture these deeper behavioral differences. Model selection should therefore rest on behavioral stability under repeated generation rather than static performance metrics, a key consideration for clinical-grade LLM applications.

Link: https://arxiv.org/abs/2604.19598
Authors: Kihyuk Lee
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 2 figures, 6 tables, and 2 supplementary materials

Click to view abstract

Abstract:This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
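The distinction between output reproducibility and semantic similarity can be made concrete with two toy metrics. The paper measures semantic similarity with embeddings; the token-level Jaccard overlap below is only a stand-in, but it shows how duplicated outputs inflate a similarity score while the unique-output fraction exposes the repetition.

```python
from itertools import combinations

def unique_fraction(outputs):
    """Output reproducibility: share of distinct strings among repeats."""
    return len(set(outputs)) / len(outputs)

def jaccard(a, b):
    """Token-overlap similarity; a crude stand-in for embedding similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def mean_pairwise_similarity(outputs):
    """Mean similarity over all pairs of repeated generations."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A model that emits the same text twice out of three runs scores high on pairwise similarity for those duplicates while its unique fraction drops, mirroring the GPT-4.1 vs. Gemini 2.5 Flash contrast reported above.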

[NLP-10] RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

【Quick Read】: This paper addresses grammatical error detection and correction in legal text, specifically for the low-resource Romanian legal domain. The core challenge is that legal text demands high accuracy, yet the high-quality annotated data needed to train such correction models is extremely scarce for Romanian, with no corpus dedicated to legal settings. The key to the solution is RoLegalGEC, the first parallel corpus for the Romanian legal domain, containing 350,000 examples of grammatical errors in legal passages together with error annotations, on which the authors evaluate a range of neural models (including knowledge-distillation Transformers, sequence-tagging architectures, and pre-trained text-to-text Transformers) for accurate detection and correction, providing an important foundation for future research on Romanian legal text processing.

Link: https://arxiv.org/abs/2604.19593
Authors: Mircea Timpuriu, Dumitru-Clementin Cercel
Institutions: National University of Science and Technology POLITEHNICA Bucharest
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The importance of clear and correct text in legal documents cannot be understated, and, consequently, a grammatical error correction tool meant to assist a professional in the law must have the ability to understand the possible errors in the context of a legal environment, correcting them accordingly, and implicitly needs to be trained in the same environment, using realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of the Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We consider that the set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.

[NLP-11] A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry LREC COLING2026

【Quick Read】: This paper addresses the neglect of minority-language oral heritage in natural language processing (NLP), in particular the lack of computational-linguistic resources for extemporaneous poetry, a performative genre that relies on real-time improvisation and metrical-rhetorical competence. The key to the solution is A Bolu, the first structured corpus of improvised poetry in the Logudorese variant of Sardinian, comprising 2,835 stanzas and 141,321 tokens, analyzed with a multidimensional approach combining descriptive statistical indices and computational-linguistic techniques. The results show recurring patterns in the production of Sardinian improvising poets that support Parry and Lord's theory of formulaicity, offering a new lens on oral creativity and promoting NLP tools that are more inclusive of and sensitive to less widely spoken languages.

Link: https://arxiv.org/abs/2604.19584
Authors: Silvio Calderaro, Johanna Monti
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at the DIALRES Workshop, LREC-COLING 2026

Click to view abstract

Abstract:The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry - a performative genre based on real-time improvisation, metrical-rhetorical competence - remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. This is the context in which A Bolu was created, the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord’s theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also offers a significant contribution to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.

[NLP-12] A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

【Quick Read】: This paper addresses the redundancy and rapidly growing token cost caused by retaining raw environment feedback across long-horizon, multi-turn terminal-centric agentic tasks, which limits the efficiency and scalability of long-horizon reasoning. The key to the solution is TACO, a plug-and-play, self-evolving compression framework for terminal agents that automatically discovers and iteratively refines compression rules from interaction trajectories, enabling efficient, task-aware compression of terminal environment feedback that substantially reduces token overhead while preserving or even improving agent performance.

Link: https://arxiv.org/abs/2604.19572
Authors: Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng, Ruibin Yuan, Wei Zhang, Riza Batista-Navarro, Jian Yang, Chenghua Lin
Institutions: University of Manchester; MAP; HKUST(GZ); HKUST; Beihang University
Subjects: Computation and Language (cs.CL)
Comments: 23 pages

Click to view abstract

Abstract:As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods difficult to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (i.e., SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models, and further improves accuracy by around 2%-3% under the same token budget. These results demonstrate the effectiveness and generalization of self-evolving, task-aware compression for terminal agents.

[NLP-13] Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps ACL2026

【Quick Read】: This paper addresses hallucination detection in Speech Large Language Models (SpeechLLMs), where existing methods rely on gold-standard outputs that are costly or impractical to obtain, and detection methods built for text-based LLMs fail to capture audio-specific signals. The key to the solution is four attention-derived metrics, AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY, and TEXTENTROPY, designed to capture pathological attention patterns associated with hallucination, on which lightweight logistic regression classifiers are trained for efficient inference-time detection. Experiments on automatic speech recognition (ASR) and speech-to-text translation show the approach outperforms uncertainty-based and prior attention-based baselines, generalizes both in-domain and out-of-domain, and performs even better with only about 100 attention heads, confirming attention patterns as an effective tool for hallucination detection in SpeechLLMs.

Link: https://arxiv.org/abs/2604.19565
Authors: Jonas Waldendorf, Bashar Awwad Shiekh Hasan, Evgenii Tsymbalov
Institutions: University of Edinburgh; Amazon AGI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to Findings of ACL 2026

Click to view abstract

Abstract:Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY, and TEXTENTROPY, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.
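Two of the four metrics have natural one-line definitions over an attention distribution. The exact formulations in the paper may differ; this sketch assumes a single head's attention row and a known set of audio token positions, both hypothetical simplifications.

```python
import math

def audio_ratio(attn_row, audio_idx):
    """AUDIORATIO-style metric: share of one head's attention mass
    that lands on audio positions (audio_idx)."""
    total = sum(attn_row)
    return sum(attn_row[i] for i in audio_idx) / total

def normalized_entropy(weights):
    """AUDIOENTROPY/TEXTENTROPY-style metric: entropy of an attention
    distribution over a span, normalized to [0, 1]; near 1 means the
    head attends diffusely, near 0 means it is sharply focused."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(probs))
```

Features like these, collected across heads and layers, would then feed the lightweight logistic regression classifier described above.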

[NLP-14] Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics

【Quick Read】: This paper addresses the vulnerability of construction workers to heat stress in hot environments and the scarcity of tools that translate real-time physiological data into actionable safety intelligence. The key to the solution is developing and evaluating two deep learning models, a baseline Long Short-Term Memory (LSTM) network and an attention-based LSTM, trained on physiological metrics such as heart rate, heart rate variability (HRV), and oxygen saturation collected with Garmin Vivosmart 5 smartwatches. The attention-based LSTM performs best, reaching 95.40% testing accuracy while significantly reducing false positives and false negatives, and its interpretable results are suitable for integration into IoT-enabled safety systems and Building Information Modeling (BIM) dashboards, advancing informatics-driven, proactive safety management in the construction industry.

Link: https://arxiv.org/abs/2604.19559
Authors: Syed Sajid Ullah, Amir Khan
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Construction workers are highly vulnerable to heat stress, yet tools that translate real-time physiological data into actionable safety intelligence remain scarce. This study addresses this gap by developing and evaluating deep learning models, specifically a baseline Long Short-Term Memory (LSTM) network and an attention-based LSTM, to predict heat stress among 19 workers in Saudi Arabia. Using Garmin Vivosmart 5 smartwatches to monitor metrics such as heart rate, HRV, and oxygen saturation, the attention-based model outperformed the baseline, achieving 95.40% testing accuracy and significantly reducing false positives and negatives. With precision, recall, and F1 scores of 0.982, this approach not only improves predictive performance but also offers interpretable results suitable for integration into IoT-enabled safety systems and BIM dashboards, advancing proactive, informatics-driven safety management in the construction industry.

[NLP-15] Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment ACL2026

【Quick Read】: This paper addresses Actor-Observer Asymmetry (AOA) induced by role-playing in multi-agent frameworks: an agent attributes failures to external factors during self-reflection but to internal faults during mutual auditing, leading to inconsistent decisions. The key to the solution is ReTAS (Reasoning via Thesis-Antithesis-Synthesis), which trains agents through dialectical alignment, combining dialectical chain-of-thought with Group Relative Policy Optimization, to synthesize conflicting viewpoints into an objective consensus and thereby enforce perspective-invariant reasoning, significantly improving fault identification and resolution in ambiguous scenarios.

Link: https://arxiv.org/abs/2604.19548
Authors: Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu
Institutions: National University of Singapore; Sichuan University; University of Minnesota Twin Cities; Harbin Institute of Technology, Shenzhen; University of Oxford
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: ACL 2026 Main Conference. Project page: this https URL

Click to view abstract

Abstract:Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.

[NLP-16] Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment

【Quick Read】: This paper addresses two core problems in Emotion-Cause Pair Extraction in Conversations (ECPEC): existing methods model ECPEC as independent pairwise classification, overlooking the semantic distinction between emotion diffusion and cause explanation, and they fail to capture many-to-many, globally consistent emotion-cause causality in dialogue. The key to the solution is revisiting ECPEC from a semantic perspective: emotion-oriented and cause-oriented semantics are decoupled into two complementary representation spaces to better reflect their distinct conversational roles, and ECPEC is then formulated as a global alignment problem between emotion-side and cause-side representations, solved with optimal transport to enable many-to-many, globally consistent emotion-cause matching. Based on this idea, the authors propose the unified framework SCALE, which instantiates the decoupling-and-alignment principle within a shared conversational structure and achieves state-of-the-art performance on multiple benchmark datasets.

链接: https://arxiv.org/abs/2604.19547
作者: Tianxiang Ma,Weijie Feng,Xinyu Wang,Zhiyong Cheng
机构: Hefei University of Technology (合肥工业大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Emotion-Cause Pair Extraction in Conversations (ECPEC) aims to identify the set of causal relations between emotion utterances and their triggering causes within a dialogue. Most existing approaches formulate ECPEC as an independent pairwise classification task, overlooking the distinct semantics of emotion diffusion and cause explanation, and failing to capture globally consistent many-to-many conversational causality. To address these limitations, we revisit ECPEC from a semantic perspective and seek to disentangle emotion-oriented semantics from cause-oriented semantics, mapping them into two complementary representation spaces to better capture their distinct conversational roles. Building on this semantic decoupling, we naturally formulate ECPEC as a global alignment problem between the emotion-side and cause-side representations, and employ optimal transport to enable many-to-many and globally consistent emotion-cause matching. Based on this perspective, we propose a unified framework SCALE that instantiates the above semantic decoupling and alignment principle within a shared conversational structure. Extensive experiments on several benchmark datasets demonstrate that SCALE consistently achieves state-of-the-art performance. Our codes are released at this https URL.
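The abstract does not give SCALE's exact transport formulation, but the optimal-transport step it describes can be sketched with standard entropy-regularized Sinkhorn iterations over emotion-side and cause-side utterance representations. All matrices, marginals, and hyperparameters below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sinkhorn(cost, reg=0.5, n_iters=500):
    """Entropy-regularized optimal transport (Sinkhorn iterations).

    cost: (n_emotions, n_causes) pairwise cost matrix.
    Returns a transport plan whose marginals match the uniform
    emotion-side and cause-side distributions, enabling many-to-many,
    globally consistent matching rather than independent pairwise
    decisions.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # emotion-side marginal
    b = np.full(m, 1.0 / m)          # cause-side marginal
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return np.diag(u) @ K @ np.diag(v)

# Toy example: 2 emotion utterances, 3 candidate cause utterances.
rng = np.random.default_rng(0)
emo = rng.normal(size=(2, 4))        # emotion-side representations
cau = rng.normal(size=(3, 4))        # cause-side representations
cost = 1.0 - (emo @ cau.T) / (
    np.linalg.norm(emo, axis=1, keepdims=True)
    * np.linalg.norm(cau, axis=1).reshape(1, -1))  # cosine distance
plan = sinkhorn(cost)
print(plan.sum(axis=1))  # rows sum to the emotion-side marginal [0.5, 0.5]
```

High-mass entries of `plan` would then be read off as emotion-cause pairs, with the marginal constraints providing the global consistency the abstract emphasizes.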

[NLP-17] Bangla Key2Text: Text Generation from Keywords for a Low Resource Language LREC2026

【Quick Read】: This paper addresses keyword-driven text generation for a low-resource language (Bangla): automatically producing coherent, relevant natural-language text from a given set of keywords. The key to the solution is a large-scale dataset of Bangla keyword-text pairs (Bangla Key2Text) containing 2.6 million samples, built by applying a BERT-based keyword-extraction pipeline to millions of Bangla news texts to obtain structured labels for supervised sequence-to-sequence training. Experiments show that task-specific fine-tuned models (mT5 and BanglaT5) substantially outperform zero-shot large language models on keyword-conditioned generation, validating the effectiveness of the dataset and method.

Link: https://arxiv.org/abs/2604.19508
Authors: Tonmoy Talukder, G M Shahariar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, uses this http URL

Abstract:This paper introduces Bangla Key2Text, a large-scale dataset of 2.6 million Bangla keyword–text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword–text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, mT5 and BanglaT5, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.

[NLP-18] Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

【Quick Read】: This paper tackles a limitation of current automated peer-review research: its reliance on rating-prediction tasks means existing benchmarks fail to capture the argumentative value and critical content of review text. The key to the solution is Beyond Rating, a holistic framework that evaluates AI reviews along five dimensions (Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood), introduces a Max-Recall strategy to accommodate valid expert disagreement, and provides a high-confidence human-annotated dataset filtered to remove procedural noise. Experiments show that traditional n-gram metrics do not reflect human preferences, whereas text-centric metrics (especially the recall of weakness arguments) correlate strongly with rating accuracy, indicating that aligning the focus of AI critiques with human experts is a prerequisite for reliable automated scoring.

Link: https://arxiv.org/abs/2604.19502
Authors: Bowen Li, Haochen Ma, Yuxin Wang, Jie Yang, Xinchi Chen, Xuanjing Huang, Yining Zheng, Xipeng Qiu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 38 pages, 8 figures, 4 tables

Abstract:The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification–its arguments, questions, and critique–rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics–particularly the recall of weakness arguments–correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.

[NLP-19] Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

【Quick Read】: This paper addresses authorship attribution: accurately identifying the author of a text of unknown origin from its textual features. Traditional methods such as Burrows's Delta have limitations when handling word-frequency vectors and cannot fully capture differences between probability distributions. The key to the solution is two new measures, Rank-Turbulence Delta and Jensen-Shannon Delta, which generalize the classical Delta using distance functions designed for probability distributions and thereby improve classification accuracy; the paper also proposes a token-level decomposition that makes every Delta distance numerically interpretable, facilitating close reading and validation of results. Experiments on corpora in English, German, French, and Russian show that Jensen-Shannon Delta matches or exceeds classical Burrows's Delta, demonstrating the method's effectiveness and generality.

Link: https://arxiv.org/abs/2604.19499
Authors: Dmitry Pronin, Evgeny Kazartsev
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Under review at Digital Scholarship in the Humanities. Code available at: this https URL

Abstract:This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows’s classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 755 works by 180 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows’s Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus.
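The paper's precise definition of Jensen-Shannon Delta is not reproduced here, but its core ingredient — treating uncentred word-frequency vectors as probability distributions and comparing them with a distribution-aware distance — can be sketched as a plain Jensen-Shannon divergence. Each term of the sum is a per-token contribution, in the spirit of the token-level decomposition the authors describe; the toy documents are invented.

```python
import math
from collections import Counter

def js_divergence(text_a, text_b):
    """Jensen-Shannon divergence between the word-frequency
    distributions of two tokenized texts (base-2 logs, so the
    value lies in [0, 1]: 0 for identical distributions, 1 for
    disjoint vocabularies)."""
    ca, cb = Counter(text_a), Counter(text_b)
    na, nb = sum(ca.values()), sum(cb.values())
    js = 0.0
    for w in set(ca) | set(cb):
        p = ca[w] / na                 # relative frequency in text A
        q = cb[w] / nb                 # relative frequency in text B
        m = 0.5 * (p + q)              # mixture distribution
        if p > 0:
            js += 0.5 * p * math.log2(p / m)   # per-token contribution
        if q > 0:
            js += 0.5 * q * math.log2(q / m)
    return js

doc1 = "the whale the sea the ship".split()
doc2 = "the whale the harpoon".split()
print(js_divergence(doc1, doc1))  # 0.0 — identical distributions
print(js_divergence(doc1, doc2))  # > 0 — diverging vocabularies
```

For attribution, a disputed text would be assigned to the candidate author whose reference profile minimizes this distance; inspecting the per-token terms shows which words drive the decision.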

[NLP-20] EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

【Quick Read】: This paper addresses a design question in reinforcement learning (RL) for post-training large language models (LLMs): whether a learned critic should be used as a baseline to reduce variance in policy optimization. Classical theory favors critic-based algorithms such as Proximal Policy Optimization (PPO) for their variance reduction in advantage estimation, yet critic-free methods such as GRPO are widely adopted for their simplicity and strong performance. The paper shows that in sparse-reward settings a learned critic can inject estimation noise exceeding the state signal it captures, increasing rather than reducing advantage variance. The key to the solution is to cast baseline selection as a Kalman filtering problem and use explained variance (EV), computable from a single batch, to decide whether the critic truly reduces variance: when EV is positive the critic helps; otherwise the method should switch to batch-mean advantage estimation. Building on this, the authors propose EVPO, an adaptive policy-optimization method that evaluates and switches the baseline at every step, provably achieving variance no greater than the better of the two baselines, and consistently outperforming PPO and GRPO on classical control, agentic interaction, and mathematical reasoning tasks.

Link: https://arxiv.org/abs/2604.19485
Authors: Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu, Jiazheng Zhang, Shihan Dou, Songyang Gao, Zhenhua Han, Binghai Wang, Rui Zheng, Xuanjing Huang, Tao Gui, Yansong Feng
Affiliations: Peking University; Fudan University; Shanghai Qiji Zhifeng Co., Ltd.; Shanghai AI Lab
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.
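A minimal sketch of the gating idea, assuming the standard definition EV = 1 − Var(returns − values)/Var(returns); the toy batches and the bare zero threshold are illustrative, not the paper's implementation.

```python
from statistics import pvariance, mean

def explained_variance(values, returns):
    """EV = 1 - Var(returns - values) / Var(returns).
    EV > 0: the critic's predictions track the returns and reduce
    advantage variance; EV <= 0: the critic injects more noise than
    signal."""
    var_ret = pvariance(returns)
    if var_ret == 0:
        return 0.0
    residuals = [r - v for v, r in zip(values, returns)]
    return 1.0 - pvariance(residuals) / var_ret

def advantages(values, returns):
    """EVPO-style gate: critic baseline when EV > 0 (PPO-like),
    batch-mean baseline otherwise (GRPO-like)."""
    if explained_variance(values, returns) > 0:
        return [r - v for v, r in zip(values, returns)]
    mu = mean(returns)
    return [r - mu for r in returns]

# A critic that tracks returns well (positive EV) ...
good_critic = [0.9, 0.1, 0.8, 0.2]
returns = [1.0, 0.0, 1.0, 0.0]
# ... versus a noisy critic on the same batch (negative EV).
noisy_critic = [0.0, 1.0, 0.9, 0.1]
print(explained_variance(good_critic, returns))   # > 0 → use critic
print(explained_variance(noisy_critic, returns))  # <= 0 → batch mean
```

With the noisy critic, the gate falls back to subtracting the batch mean, so per-step advantage variance never exceeds the better of the two baselines.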

[NLP-21] Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

【Quick Read】: This paper addresses the difficulty of mapping continuous fundamental-frequency (F₀) contours in Seoul Korean to discrete tonal categories, caused by the variability of F₀ realizations in real-world speech. The key to the solution is the Dual-Glob framework, a deep supervised contrastive learning method that enforces structural consistency between clean samples and augmented views in a shared latent space, thereby capturing the holistic shape of F₀ contours rather than relying on local predictive models. This substantially improves classification of fine-grained pitch-accent patterns and supports data-driven research in intonational phonology under the Autosegmental-Metrical (AM) model.

Link: https://arxiv.org/abs/2604.19477
Authors: Hyunjung Joo, GyeongTaek Lee
Affiliations: Rutgers University; Gachon University; Hanyang Institute for Phonetics and Cognitive Sciences of Language
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Comments:

Abstract:The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous F₀ contours to these invariant categories due to variable F₀ realizations in real-world speech. Our paper proposes Dual-Glob, a deep supervised contrastive learning framework to robustly classify fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic F₀ contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this aim, we introduce the first large-scale benchmark dataset, consisting of manually annotated 10,093 Accentual Phrases in Seoul Korean. Experimental results show that our Dual-Glob significantly outperforms strong baseline models with state-of-the-art accuracy (77.75%) and F1-score (51.54%). Therefore, our work supports AM-based intonational phonology using data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous F₀ contours.
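Dual-Glob's architecture is not specified in the abstract; as a rough illustration of the supervised contrastive objective it builds on (pulling views of the same pitch-accent class together in a shared latent space, pushing other classes apart), here is a generic SupCon-style loss over toy, L2-normalized "contour embeddings". The embeddings and labels are invented for the example.

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, pull embeddings
    with the same pitch-accent label together and push other labels
    apart. Embeddings are assumed L2-normalized, so the dot product
    is cosine similarity."""
    n = len(embeddings)
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total, count = 0.0, 0
    for i in range(n):
        # Softmax denominator over all other samples.
        denom = sum(math.exp(dot(embeddings[i], embeddings[j]) / tau)
                    for j in range(n) if j != i)
        for p in range(n):
            if p != i and labels[p] == labels[i]:   # positive pair
                sim = math.exp(dot(embeddings[i], embeddings[p]) / tau)
                total += -math.log(sim / denom)
                count += 1
    return total / count

# Toy L2-normalized contour embeddings: clean and augmented views of
# two accentual-phrase classes (labels 0 and 1).
embs = [(1.0, 0.0), (0.96, 0.28), (0.0, 1.0), (0.28, 0.96)]
labels = [0, 0, 1, 1]
print(supcon_loss(embs, labels))
```

The loss is small here because same-class views are already close; with mismatched labels (e.g. `[0, 1, 0, 1]`) the same geometry yields a much larger loss, which is the gradient signal that shapes the latent space.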

[NLP-22] LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues ACL2026

【Quick Read】: This paper addresses the insufficient precision of generative AI for legal issue identification, particularly how to leverage large language models (LLMs) to improve accuracy and interpretability in resource-constrained justice settings. The key to the solution is LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework with two core components: a neural module that uses an LLM to transform legal text into question-answer pairs capturing diverse analytical factors, and a symbolic module that models relevance over these discrete features with sparse linear models, learning explicit algebraic weights to identify the most informative reasoning factors. Compared with end-to-end neural approaches, LePREC remains data-efficient while achieving high interpretability, and in experiments improves over advanced LLM baselines such as GPT-4o and Claude by 30-40%, confirming the effectiveness of correlation-based factor-issue analysis for legal issue identification.

Link: https://arxiv.org/abs/2604.19464
Authors: Fanyu Wang, Xiaoxi Kang, Paul Burgess, Aashish Srivastava, Chetan Arora, Adnan Trakic, Lay-Ki Soon, Md Khalid Hossain, Lizhen Qu
Affiliations: Monash University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2026 Main Conference

Abstract:More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs’ capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, annotated by senior legal experts, which reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neural component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.

[NLP-23] Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning ICLR2026

【Quick Read】: This paper investigates the gap between formalization faithfulness and proof validity in formal logical reasoning with generative AI: a model can produce syntactically valid Lean 4 proofs (high compilation rates) without faithfully translating natural-language statements into the corresponding axiom systems. The key to the solution is designing and comparing two reasoning pipelines: unified generation and a two-stage pipeline. Experiments show that under unified generation, models prefer reporting failure over forcing invalid proofs, with no evidence of systematic "formalization gaming"; the two-stage setup, however, reveals two covert modes of unfaithfulness: GPT-5 fabricates axioms at the proving stage (detectable via cross-stage comparison), while DeepSeek-R1 mistranslates premises at the formalization stage (producing internally consistent but externally wrong outputs that evade detection). This shows that high compilation rates or accuracy cannot be equated with faithful reasoning, and multi-faceted verification is needed to safeguard formalization quality.

Link: https://arxiv.org/abs/2604.19459
Authors: Kyuhee Kim, Auguste Poiroux, Antoine Bosselut
Affiliations: EPFL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: 25 pages, 4 figures, 22 tables. Published at the VerifAI-2 Workshop, ICLR 2026 (non-archival). Code and data: this https URL

Abstract:Formal verification guarantees proof validity but not formalization faithfulness. For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two-stage pipeline reveals two distinct modes of unfaithfulness: GPT-5 fabricates axioms during proof generation, a reactive fallback detectable via cross-stage comparison, while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at this https URL.

[NLP-24] The Order in the Horse’s Heart: A Case Study in LLM-Assisted Stylometry for the Discovery of Biblical Allusion in Modern Literary Fiction

【Quick Read】: This paper addresses automatic detection of biblical allusions in literary works, especially the identification of non-explicit, implicit literary intertextuality. The key to the solution is a dual-track pipeline: a bottom-up embedding track that uses inverse document frequency to identify rare vocabulary shared with the King James Bible (KJV), embeds occurrences in local context for sense disambiguation, and passes candidate passage pairs through cascaded LLM review; and a top-down register track in which an LLM reads Cormac McCarthy's prose undirected, catching allusions that word or phrase rarity alone cannot distinguish. Results from both tracks are cross-validated by a long-context model holding entire novels alongside the KJV for global consistency, and manually checked against published scholarship, ultimately identifying 349 allusions carrying a textual echo and substantially advancing automated, scalable analysis of literary intertextuality.

Link: https://arxiv.org/abs/2604.19447
Authors: Ewan Cameron
Affiliations: Heriot-Watt University
Subjects: Computation and Language (cs.CL)
Comments: 39 pages, 1 figure

Abstract:We present a dual-track pipeline for detecting biblical allusions in literary fiction and apply it to the novels of Cormac McCarthy. A bottom-up embedding track uses inverse document frequency to identify rare vocabulary shared with the King James Bible, embeds occurrences in their local context for sense disambiguation, and passes candidate passage pairs through cascaded LLM review. A top-down register track asks an LLM to read McCarthy’s prose undirected to any specific biblical passage for comparison, catching allusions not distinguished by word or phrase rarity. Both tracks are cross-validated by a long-context model that holds entire novels alongside the KJV in a single pass, and every finding is checked against published scholarship. Restricting attention to allusions that carry a textual echo–shared phrasing, reworked vocabulary, or transplanted cadence–and distinguishing literary allusions proper from signposted biblical references (similes naming biblical figures, characters overtly citing scripture), the pipeline surfaces 349 allusions across the corpus. Among a target set of 115 previously documented allusions retrieved through human review of the academic literature, the pipeline independently recovers 62 (54% recall), with recall varying by connection type from 30% (transformed imagery) to 80% (register collisions). We contextualise these results with respect to the value-add from LLMs as assistants to mechanical stylometric analyses, and their potential to facilitate the statistical study of intertextuality in massive literary corpora.

[NLP-25] What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM -Guided Evolutionary Search ACL2026

【Quick Read】: This paper addresses the poorly understood mechanisms behind differences in optimization performance among large language models (LLMs) used in evolutionary search: why LLMs with similar zero-shot problem-solving ability behave very differently as optimizers. The key to the solution is a large-scale trajectory analysis that reveals the nature of LLM optimization behavior: strong optimizers act as local refiners, producing frequent incremental improvements while progressively localizing the search in high-performing regions of semantic space, whereas weak optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. The study further shows that solution novelty by itself does not predict performance and is beneficial only when the search remains localized, underscoring the importance of trajectory analysis for understanding and improving LLM-driven optimization systems and offering actionable insights for model design and training.

Link: https://arxiv.org/abs/2604.19440
Authors: Xinhao Zhang, Xi Chen, François Portet, Maxime Peyrard
Affiliations: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
Subjects: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: 9 pages, 8 figures, Accepted at Findings of ACL 2026

Abstract:Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.

[NLP-26] VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing ICASSP2026

【Quick Read】: This paper addresses object hallucination (OH) in large vision-language models (LVLMs), where models generate descriptions containing objects not actually present in the input image — a failure that is especially dangerous in accuracy-critical applications such as medical imaging and autonomous driving. The key to the solution is Visual Contrastive Editing (VCE), a label-free post-hoc method that analyzes the model's response to contrastive visual perturbations, uses Singular Value Decomposition (SVD) to identify and isolate the activation subspaces driving hallucination, and applies targeted edits to the model's parameters to attenuate their influence, effectively reducing hallucination without changing the model's original computational efficiency.

Link: https://arxiv.org/abs/2604.19412
Authors: Yanbin Huang, Yisen Li, Guiyao Tie, Xiaoye Qu, Pan Zhou, Hongfei Wang, Zhaofan Zou, Hao Sun, Xuelong Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: ICASSP 2026

Abstract:Large vision-language models (LVLMs) frequently suffer from Object Hallucination (OH), wherein they generate descriptions containing objects that are not actually present in the input image. This phenomenon is particularly problematic in real-world applications such as medical imaging and autonomous driving, where accuracy is critical. Recent studies suggest that the hallucination problem may stem from language priors: biases learned during pretraining that cause LVLMs to generate words based on their statistical co-occurrence. To mitigate this problem, we propose Visual Contrastive Editing (VCE), a novel post-hoc method that identifies and suppresses hallucinatory tendencies by analyzing the model’s response to contrastive visual perturbations. Using Singular Value Decomposition (SVD), we decompose the model’s activation patterns to isolate hallucination subspaces and apply targeted parameter edits to attenuate its influence. Unlike existing approaches that require fine-tuning or labeled data, VCE operates as a label-free intervention, making it both scalable and practical for deployment in resource-constrained settings. Experimental results demonstrate that VCE effectively reduces object hallucination across multiple benchmarks while maintaining the model’s original computational efficiency.
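As a hedged sketch of the SVD-based editing idea — not the authors' actual procedure, whose details the abstract does not give — one can take activation differences under contrastive visual perturbations, extract a low-rank stand-in for a "hallucination subspace" via SVD, and project a layer's weights onto its orthogonal complement. All matrices below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical activation differences between contrastive visual
# inputs (e.g. original vs. perturbed image), one row per probe.
diffs = rng.normal(size=(32, 16))

# SVD isolates the dominant directions of the contrastive response,
# serving here as the "hallucination subspace".
_, _, Vt = np.linalg.svd(diffs, full_matrices=False)
k = 2
U_hal = Vt[:k].T                      # (16, k) orthonormal basis

# Edit: project a weight matrix so its outputs carry no component in
# the subspace (a label-free, post-hoc parameter change).
P = np.eye(16) - U_hal @ U_hal.T      # projector onto the complement
W = rng.normal(size=(16, 16))         # hypothetical layer weight
W_edited = P @ W

# Any output of the edited layer is orthogonal to the subspace.
x = rng.normal(size=16)
print(np.abs(U_hal.T @ (W_edited @ x)).max())  # ~0 (numerical noise)
```

Because the edit is a fixed matrix multiplication baked into the weights, inference cost is unchanged, matching the efficiency claim in the abstract.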

[NLP-27] Lost in Translation: Do LVLM Judges Generalize Across Languages? ACL2026

【Quick Read】: This paper addresses the poorly understood cross-lingual generalization of automatic evaluators (such as reward models) used to align and evaluate large vision-language models (LVLMs), which existing work assesses almost exclusively on English-centric benchmarks. The key to the solution is MM-JudgeBench, the first large-scale multilingual multimodal judge benchmark, with over 60K pairwise preference instances spanning 25 typologically diverse languages, comprising a general vision-language preference evaluation subset and a chart-centric visual-text reasoning subset to support systematic evaluation of LVLM judges across settings; the authors additionally release a multilingual training set disjoint from the evaluation data to support domain adaptation. Empirical results show that model size and architecture are poor predictors of multilingual robustness, exposing fundamental limitations of current reward modeling and underscoring the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.

Link: https://arxiv.org/abs/2604.19405
Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman, Shafiq Joty, Enamul Hoque, Jimmy Huang
Affiliations: York University; University of Alberta; Nanyang Technological University; Salesforce AI Research
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ACL 2026 Findings

Abstract:Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.

[NLP-28] Does Self-Consistency Improve the Recall of Encyclopedic Knowledge? ACL2026

【Quick Read】: This paper addresses the unclear effect of self-consistency on encyclopedic knowledge recall in generative AI models, a capability that previously lacked a targeted evaluation ground. The key to the solution is constructing a dedicated knowledge-recall split of the popular MMLU benchmark using a data-driven heuristic from prior work, and validating it by showing that performance patterns on the symbolic-reasoning and knowledge-recall subsets mirror those of GSM8K and MedMCQA, respectively. On this solid ground, the study finds that self-consistency consistently improves performance on both symbolic reasoning and knowledge recall, even though its underlying Chain-of-Thought (CoT) prompting is primarily effective for symbolic reasoning, ultimately achieving 89% accuracy on MMLU with GPT-4o, the best result to date.

Link: https://arxiv.org/abs/2604.19395
Authors: Sho Hoshino, Ukyo Honda, Peinan Zhang
Affiliations: CyberAgent
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026

Abstract:While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve an 89% accuracy on MMLU, the best performance to date with the use of GPT-4o.
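Self-consistency itself reduces to sampling several chains of thought and majority-voting their extracted final answers; a minimal sketch (the sampled answers are invented, and answer extraction from CoT text is assumed to have happened upstream):

```python
from collections import Counter

def self_consistency(samples):
    """Majority vote over final answers extracted from multiple
    sampled chains of thought; the modal answer wins."""
    votes = Counter(samples)
    answer, _ = votes.most_common(1)[0]
    return answer

# Five sampled completions for one MMLU-style question: the modal
# answer wins even though individual samples disagree.
sampled_answers = ["B", "B", "C", "B", "A"]
print(self_consistency(sampled_answers))  # B
```

The paper's question is whether this vote helps when the answer depends on recalled facts rather than multi-step derivation, and the new MMLU split isolates exactly that case.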

[NLP-29] Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain? ACL2026

【Quick Read】: This paper addresses the performance gap between small specialized models and large general-purpose models, particularly under the scarcity of non-English medical data. The key to the solution is domain adaptation via continual pre-training and model merging: a high-quality German medical corpus, FineMed-de, is constructed from FineWeb2 and used to continually pre-train and merge three well-known LLMs (7B to 24B parameters), producing the DeFineMed model family. Empirical results show that specialization dramatically improves 7B-scale performance on German medical benchmarks, and the adapted Qwen2.5-based models achieve an approximately 3.5-fold increase in win-rate against the much larger Mistral-Small-24B-Instruct on instruction-following tasks, demonstrating that domain-adapted small models are an efficient, competitive option for medical applications.

Link: https://arxiv.org/abs/2604.19394
Authors: Niclas Doll, Jasper Schulze Buschhoff, Shalaka Satheesh, Hammam Abdelwahab, Héctor Allende-Cid, Katrin Klug
Affiliations: Fraunhofer IAIS; Lamarr Institute
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, San Diego, California, July 2 - 7, 2026) as a main conference paper

Abstract:This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from 7B to 24B parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances 7B model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately 3.5-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized 7B models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.

[NLP-30] DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing ACL2026

【Quick Read】: This paper addresses the bottleneck caused by the quadratic O(N²) complexity of standard attention in long-context inference with large language models, along with the limitation of existing KV-cache compression methods, which trade generation quality for lower memory pressure without reducing the high cost of floating-point arithmetic. The key to the solution is the DASH-KV framework, which reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing and designs an asymmetric encoding architecture that maps queries and keys differently to match their precision and reuse characteristics; a dynamic mixed-precision mechanism further retains full-precision computation for critical tokens to balance efficiency and accuracy, ultimately reducing inference complexity to linear O(N) while matching full attention on the LongBench benchmark.

Link: https://arxiv.org/abs/2604.19351
Authors: Jinyu Guo, Zhihan Zhang, Yutong Li, Jiehui Xie, Md. Tamim Iqbal, Dongshen Han, Lik-Hang Lee, Sung-Ho Bae, Jie Zou, Yang Yang, Chaoning Zhang
Affiliations: University of Electronic Science and Technology of China; Bangladesh University of Engineering and Technology; The Hong Kong Polytechnic University; Kyung Hee University
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2026 (Findings)

Abstract:The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH-KV significantly outperforms state-of-the-art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at this https URL
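DASH-KV's asymmetric hashing encoders are learned, which is not reproduced here; as an illustrative stand-in, random-hyperplane (sign) hashing shows how attention candidate selection via Hamming distance over cached key codes works, with exact softmax attention run only over the selected subset. All dimensions and tensors are toy values, and this symmetric hash is an assumption, not the paper's encoder.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_bits, n_keys, top_k = 32, 16, 256, 8

# Random-hyperplane hashing as a stand-in for the paper's learned
# asymmetric encoders: sign(x @ H) gives an n_bits binary code.
H = rng.normal(size=(d, n_bits))
def hash_code(x):
    return (x @ H > 0).astype(np.uint8)

keys = rng.normal(size=(n_keys, d))
values = rng.normal(size=(n_keys, d))
key_codes = hash_code(keys)           # computed once, cached with KV

def sparse_attention(q):
    """Select top_k keys by Hamming distance in code space, then run
    exact softmax attention over that subset only — candidate scoring
    uses cheap bit comparisons instead of float dot products."""
    q_code = hash_code(q)
    hamming = (key_codes != q_code).sum(axis=1)
    idx = np.argsort(hamming)[:top_k]
    scores = keys[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())  # stable softmax
    w /= w.sum()
    return w @ values[idx]

out = sparse_attention(rng.normal(size=d))
print(out.shape)  # (32,)
```

The paper's dynamic mixed-precision step would, on top of this, keep full-precision attention for tokens deemed critical rather than routing everything through the hash.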

[NLP-31] Are Large Language Models Economically Viable for Industry Deployment? ACL2026

【Quick Read】: This paper addresses the "Deployment-Evaluation Gap" facing generative AI in industrial deployment: current model evaluation focuses excessively on accuracy while neglecting key operational and economic criteria such as energy consumption, latency, and hardware utilization. To fill this gap, the authors propose EDGE-EVAL, an industry-oriented benchmarking framework that systematically introduces five deployment metrics — Economic Break-Even (N_break), Intelligence-Per-Watt (IPW), System Density (ρ_sys), Cold-Start Tax (C_tax), and Quantization Fidelity (Q_ret) — to comprehensively characterize model economics and efficiency on real hardware (NVIDIA Tesla T4 GPUs). Its core contribution is extending evaluation from a single accuracy dimension to a multi-dimensional space of deployment constraints, revealing an efficiency frontier in which 2B-parameter-class models dominate larger baselines on economic and ecological dimensions, and finding that QLoRA, despite reducing memory footprint, can increase adaptation energy for small models by up to 7x, challenging prevailing assumptions about quantization-aware training in edge deployment.

Link: https://arxiv.org/abs/2604.19342
Authors: Abdullah Mohammad, Sushant Kumar Ray, Pushkar Arora, Rafiq Ali, Ebad Shabbir, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem
Affiliations: DSEU-Okhla, New Delhi, India; University of Delhi, New Delhi, India; Macquarie University, Sydney, Australia; Center for SDGC, Stanford University, California, USA
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ACL 2026 (Industry Track)

Abstract:Generative AI-powered by Large Language Models (LLMs)-is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization-not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap-the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL-an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics-Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density (ρsys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret)-capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier-models in the 2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly-while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models-challenging prevailing assumptions about quantization-aware training in edge deployment.
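摘要只给出了指标名称而未给出公式,下面按字面含义写一个纯属假设的示意性计算(`break_even_requests`、`intelligence_per_watt` 的具体形式均非论文原文):

```python
import math

def break_even_requests(fixed_cost, revenue_per_request, cost_per_request):
    """示意性的经济盈亏平衡点 Nbreak:累计毛利覆盖固定部署成本所需的请求数(假设公式)。"""
    margin = revenue_per_request - cost_per_request
    if margin <= 0:
        return math.inf  # 每次请求都亏损,永远无法回本
    return math.ceil(fixed_cost / margin)

def intelligence_per_watt(accuracy, avg_power_watts):
    """示意性的每瓦智能度 IPW:任务准确率除以平均功耗(假设公式)。"""
    return accuracy / avg_power_watts

print(break_even_requests(100.0, 2.0, 1.0))   # 每次请求净赚 1 元,100 次请求后回本
print(intelligence_per_watt(0.8, 70.0))
```

这类指标的意义在于把“准确率之外”的部署约束(成本、能耗)变成可直接比较的标量,对应摘要中 2B 级模型在经济与生态维度占优的结论。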

[NLP-32] Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation KR’26

【速读】: 该论文旨在解决议会辩论摘要在自动化生成过程中难以保证论点忠实性(faithfulness)的问题,即如何有效评估由大型语言模型(LLM)生成的议会辩论摘要是否准确传达了原始辩论中的论证结构与立场。其解决方案的关键在于提出一个基于计算论证(computational argumentation)的正式评估框架,该框架将论证结构锚定于待审议的政策提案,并聚焦于摘要对支持或反对政策结果的推理链条的忠实保留这一形式化属性,从而提升评估结果与人类判断之间的一致性。

链接: https://arxiv.org/abs/2604.19331
作者: Eoghan Cunningham,Derek Greene,James Cross,Antonio Rago
机构: University College Dublin (都柏林大学); King’s College London (伦敦国王学院)
类目: Computation and Language (cs.CL)
备注: Accepted at KR’26 In The Wild Track. Camera ready to follow

点击查看摘要

Abstract:Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.

[NLP-33] RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)过程中,由于对内部表示的层间角色理解不足而导致的适应层选择依赖经验性策略的问题。其解决方案的关键在于将隐藏状态的演化建模为高维几何轨迹,并引入无需参数和训练的Ramer-Douglas-Peucker(RDP)算法,通过简化多边形路径识别出代表全局结构转变的关键断点(geometric pivots),并将这些几何特征作为直接决策信号用于指导LoRA微调中适配层的选择。该方法实现了更优的性能表现,验证了利用表示轨迹内在几何结构进行层选择的有效性与可解释性。

链接: https://arxiv.org/abs/2604.19321
作者: Yusuf Çelebi,Yağız Asker,Özay Ezerceli,Mahmoud ElHussieni,Selva Taş,Reyhan Bayraktar,Fatma Betül Terzioğlu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.
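Ramer-Douglas-Peucker 本身是标准的折线简化算法,下面给出一个可在任意维度点列上运行、返回“关键断点”索引的最小实现示意(仅演示原理,与论文的具体选层流程无关;玩具轨迹在第 2→3 层之间有一次明显的全局转折):

```python
import numpy as np

def point_line_dist(p, a, b):
    # p 到过 a、b 两点直线的距离(任意维度)
    ab = b - a
    denom = np.dot(ab, ab)
    if denom == 0:
        return float(np.linalg.norm(p - a))
    proj = ab * (np.dot(p - a, ab) / denom)   # (p-a) 在 ab 方向上的投影
    return float(np.linalg.norm((p - a) - proj))

def rdp_indices(points, eps):
    """返回 Ramer-Douglas-Peucker 简化后保留的断点索引(含首尾)。"""
    def rec(i, j):
        if j <= i + 1:
            return set()
        dists = [point_line_dist(points[t], points[i], points[j])
                 for t in range(i + 1, j)]
        t_max = i + 1 + int(np.argmax(dists))
        if dists[t_max - i - 1] > eps:
            # 偏离弦最远的点超过阈值:保留它并在两侧递归
            return rec(i, t_max) | {t_max} | rec(t_max, j)
        return set()
    return sorted({0, len(points) - 1} | rec(0, len(points) - 1))

# 玩具"表示轨迹":每层隐藏状态的二维坐标,层 2 到层 3 发生跳变
traj = np.array([[0, 0], [1, 0.1], [2, -0.1], [3, 5], [4, 5.1], [5, 5]], float)
print(rdp_indices(traj, eps=1.0))  # [0, 2, 3, 5]
```

按论文思路,这些断点索引即对应“发生全局结构转变”的层,可直接作为 LoRA 适配层的候选集合。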

[NLP-34] Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在知识广度和推理能力上的固有局限性,从而限制其在实际应用中的有效性问题。尽管SLMs具有较低的计算成本、延迟和隐私风险,但其性能仍难以满足复杂任务需求。论文提出的关键解决方案是引入代理范式(agent paradigm),特别是通过工具使用(tool use)和多代理协作(multi-agent collaboration)来系统性弥补小模型的不足。研究发现,单代理系统在性能与成本之间取得了最佳平衡,而多代理系统虽具协同潜力,但引入额外开销且收益有限,因此强调以代理为中心的设计对于资源受限场景下的高效、可信部署至关重要。

链接: https://arxiv.org/abs/2604.19299
作者: Xinlin Wang,Mats Brorsson
机构: Proximus Luxembourg S.A.(Proximus卢森堡公司); University of Luxembourg(卢森堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of sub-10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.

[NLP-35] Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs ACL2026

【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, LLMs)中存在的跨语言与同语言偏见问题,即模型在处理不明确指明地理位置的提问时,会无意识地表现出对特定地区(如美国)的偏好或基于人口规模的地域倾向。其解决方案的关键在于构建了一个名为LocQA的测试集,包含12种语言的2,156个地点模糊问题(locale-ambiguous questions),这些问题不提供任何地理线索,仅通过提问语言暗示可能的地域背景。通过分析模型对LocQA的回答,作者量化了两种结构化偏见:一是跨语言偏见(inter-lingual bias),表现为无论以何种语言提问,模型均倾向于选择美国相关的答案;二是同语言偏见(intra-lingual bias),表现为在同一种语言下,模型更可能选择人口较多地区的答案。这一方法为评估和理解LLMs中不同训练阶段引发的偏见提供了可量化的基准。

链接: https://arxiv.org/abs/2604.19292
作者: Guy Mor-Lan,Omer Goldman,Matan Eyal,Adi Mayrav Gilady,Sivan Eiger,Idan Szpektor,Avinatan Hassidim,Yossi Matias,Reut Tsarfaty
机构: Google Research(谷歌研究); Google(谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 main conference

点击查看摘要

Abstract:Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models’ inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs’ responses to LocQA locale-ambiguous questions thus reveal models’ implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs’ desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.

[NLP-36] HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在协作写作场景中面临的新型安全威胁——即恶意用户通过不完整的草稿诱导模型生成有害内容的“草稿驱动型越狱攻击”(draft-based co-authoring jailbreak attacks)。此类攻击利用了LLMs在辅助写作时自动补全和润色的功能,使模型可能被操控输出危险信息。解决方案的关键在于提出一种基于偏好优化的安全-效用平衡对齐方法(safety-utility balanced alignment approach),通过训练模型在拒绝有害补全的同时保持对良性草稿的有效协助能力,从而在保障安全性与维持协作写作性能之间取得平衡。实验表明,该方法显著降低了有害输出,且未损害模型的协同写作能力。

链接: https://arxiv.org/abs/2604.19274
作者: Euntae Kim,Soomin Han,Buru Chang
机构: Korea University(高丽大学); Sogang University(西江大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models-filling incomplete drafts with dangerous content-to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains-including Explosives, Drugs, Weapons, and Cyberattacks-and features prompts with realistic structure and domain-specific cues to assess the model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and our alignment method significantly reduces harmful outputs without degrading performance on co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at this https URL

[NLP-37] CulturALL: Benchmarking Multilingual and Multicultural Competence of LLM s on Grounded Tasks

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在多语言与跨文化能力评估中普遍存在的局限性问题,即现有基准测试主要聚焦于通用语言理解或表层文化常识,而忽视了对“具身任务”(grounded tasks)的评测——这类任务要求模型在真实、情境丰富的场景中进行推理。为填补这一空白,作者提出CulturALL,一个全面且具有挑战性的基准测试框架,其核心创新在于采用人机协同构建机制:由专家标注者确保任务难度和事实准确性,同时利用LLMs降低人工工作量;并通过整合多样化的数据源实现对全球51个地区16个主题的广泛覆盖,从而系统评估LLMs在多语言、多文化背景下完成具身任务的能力。

链接: https://arxiv.org/abs/2604.19262
作者: Peiqin Lin,Chenyang Lyu,Wenjiang Luo,Haotian Ye,Md Mehrab Hossain,Chunlan Ma,Shaoxiong Ji,Younes Samih,Bo Zeng,Fan Jiang,Yuanbin Cao,Dilda Duisenbek,Adrian Neo Sau Xun,Daria Pozdniakova,Liubou Misevich,Nevena Marinković,Ngoc Gia Linh Nguyen,Thi Khanh Linh Do,Sarakmatak Sophy,Baotian Hu,Guanhua Chen,Gongbo Tang,Alham Fikri Aji,Longyue Wang,Weihua Luo
机构: Alibaba Group; Beijing Language and Culture University; LMU Munich; ELLIS Institute Finland; University of Turku; IBM Research AI, UAE; MBZUAI; Harbin Institute of Technology, Shenzhen; Southern University of Science and Technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks – where models must reason within real-world, context-rich scenarios – largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs’ multilingual and multicultural competence on grounded tasks. CulturALL is built via a human–AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.

[NLP-38] Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework

【速读】: 该论文旨在解决叙事质量评估的复杂性问题,尤其是如何在主观性强的文学评价中引入可量化的客观指标。其解决方案的关键在于构建一个基于语言维度的定量评估框架,通过提取33个分类为词汇(lexical)、句法(syntactic)和语义(semantic)三类的量化语言特征,实现对叙事文本的自动评估。实验表明,该方法能有效区分专业编辑与自出版文本,并在人类标注数据集上显著优于传统故事级评价指标,验证了语言特征在叙事质量评估中的有效性。

链接: https://arxiv.org/abs/2604.19261
作者: Alessandro Maisto
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9TH International Workshop on Computational Models of Narrative (CMN '26) - 8-11 June 2026 - Madrid. 15 Pages

点击查看摘要

Abstract:The evaluation of narrative quality remains a complex challenge, as it involves subjective factors such as plot, character development, and emotional impact. This work proposes a quantitative approach to narrative assessment by focusing on the linguistic dimension as a primary indicator of quality. The paper presents a methodology for the automatic evaluation of narrative based on the extraction of a comprehensive set of 33 quantitative linguistic features categorized into lexical, syntactic, and semantic groups. To test the model, an experiment was conducted on a specialized corpus of 23 books, including canonical masterpieces and self-published works. Through a similarity matrix, the system successfully clustered the narratives, distinguishing almost perfectly between professionally edited and self-published texts. Furthermore, the methodology was validated against a human-annotated dataset; it significantly outperforms traditional story-level evaluation metrics, demonstrating the effectiveness of quantitative linguistic features in assessing narrative quality.
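摘要中“提取量化语言特征 + 相似度矩阵聚类”的流程可以用两个最简特征做如下示意(论文使用的是词汇/句法/语义三类共 33 个特征,这里的分句与分词方式均为简化假设):

```python
import numpy as np

def lexical_features(text):
    # 两个示意性词汇特征,代表论文中更完整的特征集
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    tokens = text.lower().split()
    ttr = len(set(tokens)) / len(tokens)   # 类符-形符比(词汇多样性)
    asl = len(tokens) / len(sentences)     # 平均句长
    return np.array([ttr, asl])

def similarity_matrix(texts):
    feats = np.array([lexical_features(t) for t in texts])
    # 逐特征 z 标准化后,计算文本两两之间的余弦相似度
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-9)
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
    return unit @ unit.T
```

对整部作品的语料计算这样的相似度矩阵后,即可像论文那样做层次聚类,观察专业编辑文本与自出版文本是否自然分开。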

[NLP-39] ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning

【速读】: 该论文旨在解决现有参数高效微调(Parameter-efficient fine-tuning, PEFT)方法在大型语言模型(Large Language Models, LLMs)微调中因局部权重扰动导致适应能力受限的问题。当前主流方法如低秩适应(Low-Rank Adaptation, LoRA)通过在独立层中插入低秩扰动来实现微调,这种分布式权重空间扰动方式限制了跨层信息交互与全局优化潜力。论文提出ShadowPEFT框架,其核心创新在于引入一个深度共享的影子模块(shadow module),在每个Transformer层中维护并迭代演化一个并行的影子状态,从而将微调过程从局部权重扰动转变为统一的层空间精炼(layer-space refinement)。该设计实现了参数解耦、跨深度复用与可选离线部署,显著提升了微调效率与灵活性,尤其适用于边缘计算场景,并在生成与理解任务上达到或超越LoRA和DoRA的性能表现。

链接: https://arxiv.org/abs/2604.19254
作者: Xianming Li,Zongxi Li,Tsz-fung Andrew Lee,Jing Li,Haoran Xie,Qing Li
机构: PolyU (香港理工大学); Lingnan University (岭南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly to individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly for progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.
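摘要描述的“每层维护并迭代演化一个并行影子状态、再用它精炼层输出”的思想可示意如下(结构与维度均为假设,仅体现“参数跨深度共享、主干冻结”这一设计,并非论文的实际架构):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, shadow = 16, 4

# 唯一一组跨层共享的影子模块参数(可训练);主干各层参数保持冻结
W_in  = rng.normal(scale=0.1, size=(hidden, shadow))
W_s   = rng.normal(scale=0.1, size=(shadow, shadow))
W_out = rng.normal(scale=0.1, size=(shadow, hidden))

def layer_with_shadow(h, s, frozen_layer):
    s = np.tanh(s @ W_s + h @ W_in)        # 影子状态随深度迭代演化
    return frozen_layer(h) + s @ W_out, s  # 用影子状态精炼该层输出

# 玩具主干:三个冻结的"层"(此处用恒等映射代替真实 Transformer 层)
h, s = rng.normal(size=hidden), np.zeros(shadow)
for _ in range(3):
    h, s = layer_with_shadow(h, s, lambda x: x)
```

注意三组权重在所有层之间复用,这正是摘要所说“集中式、层空间精炼”与 LoRA 逐权重独立扰动的区别。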

[NLP-40] Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

【速读】: 该论文旨在解决人类与大语言模型(Large Language Models, LLMs)交互中“修复”(repair)机制的缺失问题,即在对话中如何处理可解与不可解数学问题时,LLM是否能主动发起或有效响应用户提出的修复请求。其解决方案的关键在于系统性地分析多轮对话中不同LLM对修复行为的反应模式,揭示各模型在修复过程中的不一致性与不可预测性,从而识别出每种模型特有的“不可靠性”特征,为提升人机交互中的语用适应性和对话稳定性提供实证依据。

链接: https://arxiv.org/abs/2604.19245
作者: Clara Lachenmaier,Hannah Bultmann,Sina Zarrieß
机构: Bielefeld University, Germany
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Repair, an important resource for resolving trouble in human-human conversation, remains underexplored in human-LLM interaction. In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.

[NLP-41] Headlines You Won’t Forget: Can Pronoun Insertion Increase Memorability?

【速读】: 该论文旨在解决新闻标题中特定语言特征(即通过第一人称和第二人称代词实现的直接指称)如何影响信息记忆保留的问题,并探索利用大语言模型(Large Language Models, LLMs)对现有文本进行目标化修改以插入此类特征是否可行,同时保持原意不变。其解决方案的关键在于:首先,采用认知心理学中的受控实验设计,在总计240名参与者、7680条记忆判断数据的基础上评估代词插入对记忆效果的影响;其次,系统测试LLMs自动修订文本的适用性,发现多数修改在内容准确性、情感保留及语言自然度方面存在问题,表明当前LLM驱动的文本改写仍需谨慎应用,尤其在需要保持语义完整性与传播效力的场景中。

链接: https://arxiv.org/abs/2604.19189
作者: Selina Meyer(1),Magdalena Abel(2),Michael Roth(1) ((1) Natural Language Understanding Lab, University of Technology Nuremberg, (2) Cognitive Psychology Lab, University of Technology Nuremberg)
机构: 未知
类目: Computation and Language (cs.CL)
备注: To be published at the 15th edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2026)

点击查看摘要

Abstract:For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted and their immediate contexts. Additional data and fine-grained analysis is needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: Crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or resulting in unnatural writing style. We make our collected data available for future work.

[NLP-42] SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization ACL2026

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的摘要候选排名策略存在不稳定性,以及传统指标(如ROUGE)在区分高质量摘要方面能力不足的问题。其解决方案的关键在于提出SCURank框架,该框架通过引入摘要内容单元(Summary Content Units, SCUs)来评估摘要的信息丰富度和语义重要性,从而实现更稳定、更准确的摘要质量排序,尤其在多LLM蒸馏场景中显著提升了摘要抽象性和整体性能。

链接: https://arxiv.org/abs/2604.19185
作者: Bo-Jyun Wang,Ying-Jia Lin,Hung-Yu Kao
机构: National Cheng Kung University (国立成功大学); Chang Gung University (长庚大学); National Tsing Hua University (国立清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 Findings

点击查看摘要

Abstract:Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at this https URL.
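“按重要性加权的 SCU 覆盖度给候选摘要打分并排序”的核心逻辑可示意如下(这里用子串包含代替真正的蕴含判断模型,SCU、权重与候选摘要均为虚构样例):

```python
def scu_score(summary, scu_weights, entails):
    """摘要覆盖的重要性加权 SCU 占比。entails(summary, scu) 判断摘要是否表达该 SCU。"""
    total = sum(scu_weights.values())
    covered = sum(w for scu, w in scu_weights.items() if entails(summary, scu))
    return covered / total

# 玩具蕴含判断:真实系统应使用 NLI 模型,这里用小写子串包含代替
entails = lambda summary, scu: scu.lower() in summary.lower()

scus = {"the dam failed": 2.0, "residents were evacuated": 1.0}
cands = ["The dam failed and residents were evacuated.",
         "Heavy rain fell overnight."]
ranked = sorted(cands, key=lambda s: scu_score(s, scus, entails), reverse=True)
print(ranked[0])  # 覆盖全部 SCU 的候选排在最前
```

排序得到的最优候选即可作为蒸馏小模型(如 BART)的训练目标,这正是摘要所说“信息中心式排序”的用法。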

[NLP-43] Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

【速读】: 该论文旨在解决在黑盒访问条件下对大语言模型(Large Language Models, LLMs)进行不确定性量化的问题,其中每个查询只能获取少量响应样本。核心挑战在于:如何准确估计有效语义字母表大小(即采样响应中表达的不同语义数量),以作为下游风险的代理指标。传统基于频率的估计方法在小样本下易低估稀有语义模式,而单纯依赖图谱特征又难以精确刻画语义覆盖率。解决方案的关键是提出SHADE(Soft-Hybrid Alphabet Dynamic Estimator),其融合了广义Good-Turing覆盖估计与基于蕴含加权图构建的归一化拉普拉斯矩阵的热核迹(heat-kernel trace)。该方法通过自适应融合规则——高覆盖时采用凸组合,低覆盖时使用LogSumExp融合以突出未充分观测的语义模式——并引入有限样本修正稳定基数估计,最终转化为覆盖调整后的语义熵得分。实验证明,在样本极度受限的情况下,SHADE显著优于现有方法,表明混合语义占用估计在严苛采样预算下具有显著优势。

链接: https://arxiv.org/abs/2604.19162
作者: Hongxing Pan,Yingying Guo,Wenqing Kuang,Jiashi Lu
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computation and Language (cs.CL); Applications (stat.AP)
备注: 7 pages, 1 figure, 3 tables

点击查看摘要

Abstract:This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size–that is, the number of distinct meanings expressed in the sampled responses–provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
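SHADE 融合的两个信号(Good-Turing 覆盖率与归一化拉普拉斯的热核迹)可各用几行代码示意(语义聚类标签与图权重为玩具数据;融合规则与有限样本修正见摘要,此处不展开):

```python
import numpy as np
from collections import Counter

def good_turing_coverage(cluster_labels):
    """Good-Turing 覆盖率估计:C = 1 - f1/n,f1 为只出现一次的语义簇数。"""
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    f1 = sum(1 for c in counts.values() if c == 1)
    return max(0.0, 1.0 - f1 / n)

def heat_kernel_trace(W, t=1.0):
    """归一化拉普拉斯 L = I - D^{-1/2} W D^{-1/2} 的热核迹 tr(exp(-tL))。"""
    d = np.maximum(W.sum(axis=1), 1e-12)
    Dinv = np.diag(d ** -0.5)
    L = np.eye(len(W)) - Dinv @ W @ Dinv
    eigvals = np.linalg.eigvalsh(L)
    return float(np.exp(-t * eigvals).sum())

# 5 条采样回答聚成 3 个语义簇,其中 1 个簇只出现一次
print(good_turing_coverage(["a", "a", "b", "b", "c"]))  # 0.8
# 两条回答之间有一条蕴含边的玩具加权图
print(heat_kernel_trace(np.array([[0.0, 1.0], [1.0, 0.0]])))
```

覆盖率低意味着可能还有未观测到的语义模式,此时 SHADE 改用 LogSumExp 融合以放大图谱信号,这对应摘要中“采样预算极小时收益最大”的结论。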

[NLP-44] Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

【速读】: 该论文旨在解决现有印地语自动语音识别(ASR)基准测试中存在的两大问题:一是普遍采用脚本化、干净语音及以排行榜为导向的评估方式,导致模型对特定数据集过拟合;二是严格使用单一参考词错误率(Word Error Rate, WER)评估方法,未能充分考虑印度语言中自然存在的拼写变体,尤其是混合英语词汇的非标准化拼写。为应对这些问题,论文提出 Voice of India 基准数据集,这是一个闭源、基于未脚本电话对话构建的多语言 ASR 测试集,覆盖 15 种主要印度语言,涵盖 139 个区域集群,包含 306,230 条语音片段(总计 536 小时),并标注了反映真实拼写差异的转录文本。其关键创新在于引入真实场景下的多样性语音数据与包容性标注策略,同时进行细粒度地理和多因素分析(如音频质量、语速、性别、设备类型),从而揭示当前 ASR 系统在实际应用中的性能瓶颈,为提升印度语系 ASR 的鲁棒性和泛化能力提供实证依据和改进方向。

链接: https://arxiv.org/abs/2604.19151
作者: Kaushal Bhogale,Manas Dhir,Amritansh Walecha,Manmeet Kaur,Vanshika Chhabra,Aaditya Pareek,Hanuman Sidh,Sagar Jain,Bhaskar Singh,Utkarsh Singh,Tahir Javed,Shobhit Banga,Mitesh M. Khapra
机构: Indian Institute of Technology, Madras, India (印度理工学院马德拉斯分校); Josh Talks, India (Josh Talks)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.

[NLP-45] How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLM s for Quantitative Reasoning ACL2026

【速读】: 该论文旨在解决生成式 AI(Generative AI)在定量推理任务中,模型如何从推理轨迹(reasoning trace)中有效读取并整合信息以生成可靠答案的问题。此前的研究多聚焦于调整推理过程本身,但对答案生成阶段如何利用推理内容缺乏深入理解。作者通过分析答案到推理的注意力机制(answer-to-reasoning attention),发现正确答案对应的注意力模式具有向前漂移(forward drift)和持续聚焦关键语义锚点(key semantic anchors)的良性自读取特征,而错误答案则表现出分散且不规则的注意力分布。基于此,论文提出一种无需训练的控制方法——Self-Reading Quality (SRQ) 分数,其结合几何度量(用于过程控制)与语义度量(用于内容监控),筛选高质量推理轨迹构建引导向量,在推理阶段引导模型趋向良性自读取行为,从而提升准确性。

链接: https://arxiv.org/abs/2604.19149
作者: Haoyang Chen,Yi Liu,Jianzhi Shao,Tao Zhang,Chengfu Huo,Wei Hu
机构: Nanjing University (南京大学); Alibaba Group (阿里巴巴集团); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in the Findings of ACL 2026

点击查看摘要

Abstract:Thinking LLMs produce reasoning traces before answering. Prior activation steering work has mainly targeted shaping these traces. It remains less understood how answer tokens actually read and integrate the reasoning to produce reliable outcomes. Focusing on quantitative reasoning, we analyze the answer-to-reasoning attention and observe a benign self-reading pattern aligned with correctness, characterized by a forward drift of the reading focus along the reasoning trace and a persistent concentration on key semantic anchors, whereas incorrect solutions exhibit a diffuse and irregular attention pattern. We interpret this as internal certainty during answer decoding, where the model commits to a viable solution branch and integrates key evidence. Following this, we propose a training-free steering method driven by Self-Reading Quality (SRQ) scores combining geometric metrics for process control with semantic metrics for content monitoring. SRQ selects data to build steering vectors that guide inference toward benign self-reading and away from uncertain and disorganized reading. Experiments show that our method yields consistent accuracy gains.
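摘要中“阅读焦点沿推理轨迹向前漂移”可以用注意力质心的逐 token 位移来度量,示意如下(度量的具体形式是假设的,并非论文公布的 SRQ 几何指标定义):

```python
import numpy as np

def forward_drift(attn):
    """attn: [答案 token 数, 推理 token 数] 的注意力权重矩阵。
    返回阅读焦点(注意力质心)在答案生成过程中的平均前移步长,正值表示向前漂移。"""
    pos = np.arange(attn.shape[1])
    centers = (attn * pos).sum(axis=1) / attn.sum(axis=1)  # 每个答案 token 的阅读焦点
    return float(np.diff(centers).mean())

# 三个答案 token 依次聚焦推理轨迹的第 0、1、2 个位置:焦点稳定前移
print(forward_drift(np.eye(3)))        # 1.0
# 反向阅读(焦点后退)则得到负的漂移量
print(forward_drift(np.eye(3)[::-1]))  # -1.0
```

按摘要的发现,正确解答对应的注意力模式表现为这种正向漂移加上对关键语义锚点的持续聚焦,错误解答则漂移紊乱。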

[NLP-46] ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在机器翻译(Machine Translation, MT)中因采用“先思考后翻译”范式而导致的推理成本高、延迟大问题。现有方法虽能通过显式的推理轨迹提升翻译质量,但其多步推理机制显著增加了计算开销。解决方案的关键在于提出一种两阶段的反射内化算法(ReflectMT),采用“先翻译后反思”的新范式:第一阶段通过强化学习训练模型生成高质量的反思与修正能力,增强语义理解与任务特定知识;第二阶段将反思过程中获得的知识内化为模型自身能力,使得推理阶段无需显式推理步骤即可直接输出高质量翻译。实验表明,该方法在WMT24数据集上首次翻译即优于多步推理模型DeepSeek-R1,在GPT-based评估中提升2.16分,同时token消耗降低94.33%。

链接: https://arxiv.org/abs/2604.19144
作者: Kunquan Li,Yingxue Zhang,Fandong Meng,Jinsong Su
机构: Xiamen University (厦门大学); WeChat AI, Tencent Inc (微信AI,腾讯公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a “think-first-then-translate” paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two-stage reflection internalization algorithm for machine translation that employs a “translate-first-think-later” paradigm. Our approach develops the model’s “translate-reflect-refine” capability through reinforcement learning. In the first stage, we cultivate the model’s capacity for high-quality reflection and refinement, thereby enhancing its semantic comprehension and task-specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high-quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model’s first-pass translations during inference outperform multi-step reasoning LRMs such as DeepSeek-R1 in both automatic metrics and GPT-based evaluation, achieving a 2.16-point improvement in GPT-based translation quality evaluation while reducing token consumption by 94.33%.

[NLP-47] The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在对齐训练(如基于人类反馈的强化学习 RLHF 和宪法 AI)过程中产生的“言语习惯”(verbal tics)现象问题,即模型输出中反复出现的程式化表达(如奉承性开场白、伪共情语句和高频词汇),这可能削弱对话的真实性和自然度。解决方案的关键在于提出一个可量化的评估指标——言语习惯指数(Verbal Tic Index, VTI),通过标准化 API 评估框架对八种先进 LLM 在英语与中文环境下进行大规模测试(10,000 条提示词,160,000 条响应),系统分析其与奉承倾向、词汇多样性及人类感知自然度之间的关系,并揭示多轮对话中言语习惯的累积效应与跨语言差异,从而为优化对齐策略、减少“对齐代价”(alignment tax)提供实证依据和改进方向。

链接: https://arxiv.org/abs/2604.19139
作者: Shuai Wu,Xue Li,Yanna Feng,Yufang Li,Zhijun Wang,Ran Wang
机构: Google DeepMind (谷歌深潜); OpenAI (OpenAI); Anthropic (Anthropic); xAI (xAI); ByteDance (字节跳动); Moonshot AI (Moonshot AI); DeepSeek (DeepSeek); Xiaomi (小米)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 17 figures, 8 tables. Technical report

点击查看摘要

Abstract:As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics – repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers (“That’s a great question!”, “Awesome!”) to pseudo-empathetic affirmations (“I completely understand your concern”, “I’m right here to catch you”) and overused vocabulary (“delve”, “tapestry”, “nuanced”). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the “alignment tax” of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.
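论文将 VTI 定义为量化套话普遍程度的复合指标,但摘要未给出公式。下面是一个极简的 Python 示意:用“套话短语命中率”与词汇多样性(type-token ratio)加权合成一个玩具版指数。短语表、权重与合成方式均为假设,并非论文的原始定义。

```python
def verbal_tic_index(text, tic_phrases, w_tic=0.7, w_div=0.3):
    """玩具版复合指标:套话短语命中率 + (1 - 词汇多样性)。

    tic_phrases 与权重均为示意,论文中 VTI 的真实定义与此不同。
    """
    words = text.lower().split()
    if not words:
        return 0.0
    lower = text.lower()
    # 套话短语在文本中出现的比例
    hits = sum(1 for p in tic_phrases if p in lower) / len(tic_phrases)
    # type-token ratio 作为词汇多样性的粗糙代理
    ttr = len(set(words)) / len(words)
    return w_tic * hits + w_div * (1.0 - ttr)

tics = ["that's a great question", "i completely understand", "delve"]
flat = "That's a great question! Let me delve into this nuanced tapestry."
plain = "Gradient clipping bounds the update norm before each optimizer step."
print(verbal_tic_index(flat, tics) > verbal_tic_index(plain, tics))  # → True
```

套话密集的回答得分更高,可作为理解“复合指标”思路的最小例子。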

[NLP-48] Construction of Knowledge Graph based on Language Model

【速读】: 该论文旨在解决传统知识图谱(Knowledge Graph, KG)构建方法依赖人工标注导致效率低下,以及基于深度学习的方法泛化能力较弱的问题。其解决方案的关键在于利用预训练语言模型(Pre-trained Language Models, PLM)强大的语言理解与生成能力,实现从文本数据中自动抽取实体和关系等关键信息,从而提升KG构建的自动化水平与泛化性能;此外,论文提出了一种基于轻量级大语言模型(Large Language Model, LLM)的新型超关系知识图谱构建框架LLHKG,实验证明该框架在保持高效性的同时,构建能力可媲美GPT3.5。

链接: https://arxiv.org/abs/2604.19137
作者: Qiubai Zhu,Qingwang Wang,Haibin Yuan,Wei Chen,Tao Shen
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 figures. To be published in the proceedings of the 2025 13th International Conference on Information Systems and Computing Technology (ISCTech 2025)

点击查看摘要

Abstract:Knowledge Graph (KG) can effectively integrate valuable information from massive data, and thus has been rapidly developed and widely used in many fields. Traditional KG construction methods rely on manual annotation, which often consumes a lot of time and manpower. Moreover, KG construction schemes based on deep learning tend to have weak generalization capabilities. With the rapid development of Pre-trained Language Models (PLMs), they have shown great potential in the field of KG construction. This paper provides a comprehensive review of recent research advances in constructing KGs with PLMs. We explain how PLMs can utilize their language understanding and generation capabilities to automatically extract key information for KGs, such as entities and relations, from textual data. In addition, we propose a new Hyper-Relational Knowledge Graph construction framework based on a lightweight Large Language Model (LLM), named LLHKG, and compare it with previous methods. Under our framework, the KG construction capability of a lightweight LLM is comparable to GPT-3.5.

[NLP-49] Do Emotions Influence Moral Judgment in Large Language Models?

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在道德判断中情感影响机制不明确的问题,尤其是情绪如何系统性地调节道德可接受性。其解决方案的关键在于构建一个情感诱导(emotion-induction)流程,将特定情绪注入道德情境,并通过多数据集和多种LLM的评估,量化情绪对道德判断的影响。研究发现,情绪效价具有方向性作用:正向情绪提升道德可接受性,负向情绪则降低它,且这种效应足以在高达20%的案例中逆转二元道德判断;同时,模型能力越强,对情绪干扰的敏感性越低。此外,某些具体情绪(如悔恨)表现出与效价预测相反的行为(即“悔恨悖论”),而人类标注者未表现出此类系统性偏差,揭示了当前LLM在情感-道德对齐方面的显著差距。

链接: https://arxiv.org/abs/2604.19125
作者: Mohammad Saim,Tianyu Jiang
机构: University of Cincinnati (辛辛那提大学)
类目: Computation and Language (cs.CL)
备注: 18 pages, 14 figures, 6 tables

点击查看摘要

Abstract:Large language models have been extensively studied for emotion recognition and moral reasoning as distinct capabilities, yet the extent to which emotions influence moral judgment remains underexplored. In this work, we develop an emotion-induction pipeline that infuses emotion into moral situations and evaluate shifts in moral acceptability across multiple datasets and LLMs. We observe a directional pattern: positive emotions increase moral acceptability and negative emotions decrease it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and with susceptibility scaling inversely with model capability. Our analysis further reveals that specific emotions can sometimes behave contrary to what their valence would predict (e.g., remorse paradoxically increases acceptability). A complementary human annotation study shows humans do not exhibit these systematic shifts, indicating an alignment gap in current LLMs.

[NLP-50] Detoxification for LLM: From Dataset Itself ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)毒性问题的根源——即预训练数据集本身所含的有害内容,而非仅在训练后阶段或推理时进行干预。传统方法如训练后去毒或可控解码无法彻底消除模型固有的毒性倾向,而本文提出HSPD(Hierarchical Semantic-Preserving Detoxification)流水线,其关键在于使用SoCD(Soft Contrastive Decoding)技术对原始语料库进行语义保留式的毒性片段定位与重写,从而在不破坏数据语义的前提下实现源头去毒,生成可直接用于微调的干净语料库。实验表明,该方法在多个主流模型上均显著降低毒性概率(TP)和预期最大毒性(EMT),验证了其在保持数据可用性的同时有效抑制下游毒性行为的能力。

链接: https://arxiv.org/abs/2604.19124
作者: Wei Shao,Yihang Wang,Gaoyu Zhu,Ziqiang Cheng,Lei Yu,Jiafeng Guo,Xueqi Cheng
机构: State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
类目: Computation and Language (cs.CL)
备注: Accepted to Main Conference of ACL 2026

点击查看摘要

Abstract:Existing detoxification methods for large language models mainly focus on post-training stage or inference time, while few tackle the source of toxicity, namely, the dataset itself. Such training-based or controllable decoding approaches cannot completely suppress the model’s inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we attempt to detoxify directly on raw corpora with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics, in our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, yielding a detoxified corpus that can drop-in replace the original for fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and allowing seamless source-level mitigation, thereby reducing the cost of later model behavior adjustment. (Code is available at: this https URL)
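摘要中的 SoCD 属于对比解码(contrastive decoding)一族。论文未给出公式,下面按文献中常见的对比解码配方给出一个一般性示意:用“干净”专家模型的 log 概率减去毒性倾向反专家的 log 概率,alpha 控制惩罚的软化程度,使反专家偏好的 token 被压低。数值与 alpha 均为虚构,并非论文的 SoCD 实现。

```python
import numpy as np

def soft_contrastive_logits(expert_logits, antiexpert_logits, alpha=0.5):
    """一般性对比解码示意:压低毒性倾向“反专家”偏好的 token。

    alpha 软化惩罚强度;这不是论文 SoCD 的原始公式,仅为常见配方。
    """
    log_p_e = expert_logits - np.logaddexp.reduce(expert_logits)
    log_p_a = antiexpert_logits - np.logaddexp.reduce(antiexpert_logits)
    return log_p_e - alpha * log_p_a

expert = np.array([2.0, 1.0, 0.1])   # 干净模型偏好 token 0
anti   = np.array([0.1, 1.0, 3.0])   # 毒性模型偏好 token 2
scores = soft_contrastive_logits(expert, anti)
print(int(np.argmax(scores)))  # → 0,反专家偏好的 token 2 被压低
```

这一视角有助于理解“在不破坏语义的前提下定位并改写毒性片段”时,解码端如何引导改写方向。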

[NLP-51] SAHM: A Benchmark for Arabic Financial and Shariah-Compliant Reasoning

【速读】: 该论文旨在解决阿拉伯语金融自然语言处理(Natural Language Processing, NLP)领域研究严重滞后的问题,尤其是在可信金融和伊斯兰金融助手需求迫切的背景下。其解决方案的关键在于构建了一个名为SAHM的文档 grounded 基准测试集与指令微调数据集,涵盖7类任务,包括AAOIFI标准问答、法特瓦(fatwa)驱动的问答与多项选择题、会计与商业考试、金融情感分析、抽取式摘要及事件-因果推理等,共计14,380个由专家验证的数据实例。该基准不仅覆盖了权威监管、法学与企业来源的内容,还通过任务特定指标与基于评分量表的开放输出评估方法,系统性地揭示了当前大语言模型(LLM)在阿拉伯语金融推理中的能力局限——特别是识别类任务表现优于生成与因果推理任务,尤其在事件-因果推理上差距显著。研究团队进一步发布了完整基准、评估框架及指令微调模型,以推动可信阿拉伯语金融NLP的发展。

链接: https://arxiv.org/abs/2604.19098
作者: Rania Elbadry,Sarfraz Ahmad,Ahmed Heakl,Dani Bouch,Momina Ahsan,Muhra AlMahri,Marwa Elsaid khalil,Yuxia Wang,Salem Lahlou,Sophia Ananiadou,Veselin Stoyanov,Jimin Huang,Xueqing Peng,Preslav Nakov,Zhuohan Xie
机构: MBZUAI(穆巴达拉人工智能大学); INSAIT(智能科学与技术研究所); The University of Manchester(曼彻斯特大学); The Fin AI(金融人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages

点击查看摘要

Abstract:English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.

[NLP-52] HoWToBench: Holistic Evaluation for LLMs’ Capability in Human-level Writing using Tree of Writing ACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在千字级、开放式写作任务中评估不足的问题,特别是传统基于参考文本的指标和当前流行的LLM-as-a-judge方法在多维写作能力评估中存在的隐式不一致性与偏差。其解决方案的关键在于提出Tree-of-Writing(ToW),一种通过树状结构显式建模子特征聚合权重的评估框架,从而更合理地整合不同评价维度,并有效缓解评估偏差;实验表明,ToW在中文写作基准HowToBench上实现了0.93的皮尔逊相关系数(Pearson correlation)与人工评分的一致性,且对文本扰动具有鲁棒性。

链接: https://arxiv.org/abs/2604.19071
作者: Andrew Zhuoer Feng,Cunxiang Wang,Yu Luo,Lin Fan,Yilin Zhou,Zikang Wang,Xiaotao Gu,Jie Tang,Hongning Wang,Minlie Huang
机构: Tsinghua University (清华大学); Z.ai
类目: Computation and Language (cs.CL)
备注: 49 pages, 6 figures, 19 tables, ACL 2026 main

点击查看摘要

Abstract:Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLMs’ performance on thousand-word-level, open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow by explicitly modeling the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates the biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we detect that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showing that performance cannot be improved simply by piling up input-side information.

[NLP-53] RN-R1-Zero: Text-rich Network Reasoning via LLM s with Reinforcement Learning Only

【速读】: 该论文旨在解决文本丰富网络(Text-rich Networks, TRNs)中的零样本推理问题,即模型在无任务特定监督的情况下,如何有效融合文本语义与图结构信息以进行关系推理。传统图神经网络依赖固定标签空间和监督目标,而基于大语言模型(Large Language Model, LLM)的方法则常忽视图上下文或依赖从更大模型蒸馏得到的链式思维数据,限制了泛化能力。解决方案的关键在于提出TRN-R1-Zero框架,该框架仅通过强化学习进行后训练,采用一种邻接感知的组相对策略优化目标(Neighbour-aware Group Relative Policy Optimisation),并引入新颖的边际增益指标(margin gain metric)动态调整奖励,从而引导模型聚焦于邻域信号的信息量,实现对图结构关系的有效建模。此方法无需监督微调或来自大型推理模型的链式思维数据,且能在节点级训练基础上实现边级与图级任务的零样本推理,展现出强大的跨域迁移能力。

链接: https://arxiv.org/abs/2604.19070
作者: Yilun Liu,Ruihong Qiu,Zi Huang
机构: The University of Queensland (昆士兰大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at this https URL.
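摘要提到用 margin gain(边际增益)衡量邻域信号的信息量,并据此动态调整奖励。具体公式论文摘要未给出,以下为一个假设性的极简示意:当加入邻居信号后,正确类的判别边际提升时,奖励被相应放大;所有数值与 lam 系数均为虚构。

```python
def margin(p):
    """正确类概率相对其余类最大概率的边际(玩具约定:下标 0 为金标签)。"""
    return p[0] - max(p[1:])

def neighbour_aware_reward(p_with, p_without, base_reward, lam=0.5):
    """边际增益调节奖励的示意(非论文原始公式):邻居信号确实有帮助时
    奖励被放大,没有帮助时被削弱。"""
    gain = margin(p_with) - margin(p_without)
    return base_reward + lam * gain

p_without = [0.4, 0.35, 0.25]   # 仅用节点文本的预测分布
p_with    = [0.7, 0.2, 0.1]     # 加入邻域信号后的预测分布
print(round(neighbour_aware_reward(p_with, p_without, 1.0), 3))  # → 1.225
```

这种奖励塑形把“邻居是否提供信息”显式地反馈给强化学习目标,与摘要中“引导模型走向关系推理”的动机一致。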

[NLP-54] Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference

【速读】: 该论文旨在解决神经网络自然语言推理(Natural Language Inference, NLI)模型对数据集人工特征(dataset artifacts)的过拟合问题,而非真正实现逻辑推理。实验表明,仅基于假设(hypothesis-only)的模型在SNLI数据集上仍能达到57.7%的准确率,说明存在强伪相关性;且基线模型38.6%的错误源于此类人工特征。为缓解这一问题,作者提出Product-of-Experts (PoE) 训练方法,其核心在于通过降低那些偏见模型过度自信样本的权重来减少对人工特征的依赖。该方法在几乎不损失原始准确率(89.10% vs. 89.30%)的前提下,将模型对偏见的依赖程度降低了4.71%(偏见一致性从49.85%降至45%),并通过消融实验确定最优超参数λ=1.5以平衡去偏效果与准确性。尽管如此,行为测试仍发现模型在否定和数值推理任务中存在不足。

链接: https://arxiv.org/abs/2604.19069
作者: Aby Mammen Mathew
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 4 tables. Single-author paper

点击查看摘要

Abstract:Neural NLI models overfit dataset artifacts instead of truly reasoning. A hypothesis-only model achieves 57.7% on SNLI, showing strong spurious correlations, and 38.6% of the baseline’s errors result from these artifacts. We propose Product-of-Experts (PoE) training, which downweights examples where biased models are overconfident. PoE nearly preserves accuracy (89.10% vs. 89.30%) while cutting bias reliance by 4.71% (bias agreement 49.85% to 45%). An ablation finds that lambda = 1.5 best balances debiasing and accuracy. Behavioral tests still reveal issues with negation and numerical reasoning.
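文献中常见的 PoE 去偏配方是:训练时将主模型与冻结的 hypothesis-only 偏置模型的 log-softmax 相加后计算交叉熵,推理时只用主模型;偏置模型越自信的样本,主模型得到的梯度越小。以下 numpy 示意仅演示这一效应,logits 与标签均为虚构,不代表论文的具体实现细节。

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def poe_nll(main_logits, bias_logits, label):
    """PoE 训练损失示意:主模型与冻结偏置模型在 log 空间相乘(相加)。

    偏置模型已经答对的样本,组合后的 NLL 变小,主模型受到的训练
    信号也随之变小;推理时丢弃偏置模型,只用主模型。
    """
    combined = log_softmax(main_logits) + log_softmax(bias_logits)
    return -log_softmax(combined)[label]

main = np.array([1.0, 0.2, -0.5])
confident_bias = np.array([5.0, -2.0, -2.0])  # 偏置模型对标签 0 非常自信
weak_bias = np.array([0.0, 0.0, 0.0])         # 无信息的偏置模型
# 偏置模型自信时,标签 0 上的组合 NLL 更小,主模型的更新被降权
print(poe_nll(main, confident_bias, 0) < poe_nll(main, weak_bias, 0))  # → True
```

这正对应摘要中“downweights examples where biased models are overconfident”的机制。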

[NLP-55] Cell-Based Representation of Relational Binding in Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在话语层面如何实现实体、关系与属性绑定的机制不明确的问题。其核心解决方案是提出一种基于细胞的绑定表示(Cell-based Binding Representation, CBR),即通过低维线性子空间来编码这种绑定关系:每个“细胞”对应一个实体-关系索引对,属性信息则在推理过程中从相应细胞中检索。研究通过受控多句数据和部分最小二乘回归(Partial Least Squares regression)识别出该子空间,并发现其在投影空间中呈现网格状几何结构,且不同上下文间的CBR表示可通过激活空间中的平移向量关联,从而支持跨上下文迁移。激活修补实验进一步证实,操纵该子空间会系统性改变关系预测结果并破坏模型性能,为LLMs依赖CBR进行关系绑定提供了因果证据。

链接: https://arxiv.org/abs/2604.19052
作者: Qin Dai,Benjamin Heinzerling,Kentaro Inui
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each “cell” corresponds to an entity–relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.
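为说明“从属性 token 激活中线性解码实体/关系索引”的思路,下面用合成激活做一个玩具实验:假设激活近似落在由实体方向与关系方向张成的网格上,再拟合一个线性解码器还原索引。其中用普通最小二乘代替论文中的偏最小二乘(PLS)回归,方向向量、维度与噪声尺度均为假设。

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                                            # 玩具隐藏维度
E = rng.normal(size=d)                            # 假设的实体索引方向
R = rng.normal(size=d)                            # 假设的关系索引方向

# 合成“激活”:实体索引 e ∈ {0,1,2},关系索引 r ∈ {0,1},外加小噪声
pairs = [(e, r) for e in range(3) for r in range(2)] * 20
X = np.stack([e * E + r * R + 0.05 * rng.normal(size=d) for e, r in pairs])
Y = np.array(pairs, dtype=float)

# 用普通最小二乘代替论文的 PLS 回归,拟合线性解码器
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = np.rint(X @ W).astype(int)
acc = (pred == np.array(pairs)).all(axis=1).mean()
print(acc)  # 在这份可分的玩具数据上接近 1.0
```

若索引确实线性可解码(如论文所述),这样的线性回归应能近乎完美地还原 (e, r) 网格。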

[NLP-56] SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning ACL2026

【速读】: 该论文旨在解决当前Mixture-of-Experts与低秩适配(LoRA)结合方法在多任务学习中面临的两个核心问题:一是现有MoE-LoRA方法中的路由机制不够精确,无法显式匹配输入语义与专家能力,导致专家专业化程度不足;二是统一的权重融合策略难以根据任务复杂度动态调整更新强度,忽视了不同任务间的差异性。解决方案的关键在于提出SAMoRA(Semantic-Aware Mixture of LoRA Experts)框架,其核心创新包括:(1)设计语义感知路由器(Semantic-Aware Router),显式对齐文本语义与最优专家,实现精准路由;(2)引入任务自适应缩放机制(Task-Adaptive Scaling),依据具体任务需求动态调节专家贡献;(3)提出一种新的正则化目标,协同促进专家专业化与有效缩放,从而提升模型的任务适应性和泛化能力。

链接: https://arxiv.org/abs/2604.19048
作者: Boyan Shi,Wei Chen,Shuyuan Zhao,Junfeng Shen,Shengnan Guo,Shaojiang Wang,Huaiyu Wan
机构: Beijing Jiaotong University (北京交通大学); Guilin University of Electronic Technology (桂林电子科技大学); China; Chinese Academy of Sciences (中国科学院); Nanjing Institute of Software Technology (南京软件研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1)Imprecise Routing in the current MoE-LoRA method fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization. (2)Uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, A Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task-Adaptive Scaling mechanism is designed to regulate expert contributions based on specific task requirements dynamically. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms the state-of-the-art methods and holds excellent task generalization capabilities. Code is available at this https URL
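SAMoRA 的具体网络结构未在摘要中给出。下面是一个一般性的 MoE-LoRA 前向过程示意:路由器按输入表示给各 LoRA 专家打分,一个任务相关的标量再整体缩放混合后的低秩更新;维度、专家数与初始化均为玩具设定,不代表 SAMoRA 的真实架构。

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def moe_lora_forward(x, W0, As, Bs, router_W, task_scale):
    """一般性 MoE-LoRA 前向示意(非 SAMoRA 原始结构):
    路由器从输入表示产生专家权重,task_scale 按任务调节更新强度。"""
    gates = softmax(router_W @ x)                       # 语义路由权重
    delta = sum(g * (B @ (A @ x)) for g, A, B in zip(gates, As, Bs))
    return W0 @ x + task_scale * delta

d, r, n_exp = 8, 2, 3
rng = np.random.default_rng(1)
x = rng.normal(size=d)
W0 = np.eye(d)
As = [rng.normal(size=(r, d)) for _ in range(n_exp)]
Bs = [np.zeros((d, r)) for _ in range(n_exp)]           # LoRA 的 B 零初始化
y = moe_lora_forward(x, W0, As, Bs,
                     router_W=rng.normal(size=(n_exp, d)), task_scale=1.5)
print(np.allclose(y, x))  # → True,B=0 时低秩更新为零,与 LoRA 初始化一致
```

“任务自适应缩放”在此对应 task_scale:复杂任务可用较大的缩放,简单任务用较小的缩放。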

[NLP-57] AlignCultura: Towards Culturally Aligned Large Language Models? ACL

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成内容时缺乏文化一致性(cultural alignment)的问题,即模型可能产生刻板印象、不敏感或误导性响应,无法体现联合国教科文组织(UNESCO)倡导的文化多样性原则与“有益、无害、诚实”(Helpful, Harmless, Honest, HHH)范式。为应对这一挑战,作者提出了一种两阶段的解决方案——Align-Cultura:第一阶段构建了基于UNESCO文化分类体系的HHH-English数据集CULTURAX,通过查询重构(Query Construction)、领域扩展(特别是低频标签)和SimHash防数据泄露机制实现高质量样本生成;第二阶段则利用该数据集对通用模型、文化微调模型及开源大模型(如Qwen3-8B和DeepSeek-R1-Distill-Qwen-7B)进行系统评估。关键创新在于将UNESCO文化分类与HHH伦理框架融合,并通过两阶段拒绝采样策略确保响应的文化适切性,实证表明文化微调模型可提升联合HHH指标4%-6%,减少文化失败率18%,并显著降低数据泄露风险至0.3%。

链接: https://arxiv.org/abs/2604.19016
作者: Gautam Siddharth Kashyap,Mark Dras,Usman Naseem
机构: Macquarie University (麦克奎里大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL Mains 2026

点击查看摘要

Abstract:Cultural alignment in Large Language Models (LLMs) is essential for producing contextually aware, respectful, and trustworthy outputs. Without it, models risk generating stereotyped, insensitive, or misleading responses that fail to reflect cultural diversity w.r.t. the Helpful, Harmless, and Honest (HHH) paradigm. Existing benchmarks represent early steps toward cultural alignment; yet, none currently enables systematic evaluation of cultural alignment in line with UNESCO’s principles of cultural diversity w.r.t. the HHH paradigm. To address this gap, we built Align-Cultura, a two-stage pipeline for cultural alignment. Stage I constructs CULTURAX, an HHH-English dataset grounded in the UNESCO cultural taxonomy, through Query Construction, which reclassifies prompts, expands underrepresented domains (or labels), and prevents data leakage with SimHash. Then, Response Generation pairs prompts with culturally grounded responses via two-stage rejection sampling. The final dataset contains 1,500 samples spanning 30 subdomains of tangible and intangible cultural forms. Stage II benchmarks CULTURAX on general-purpose models, culturally fine-tuned models, and open-weight LLMs (Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B). Empirically, culturally fine-tuned models improve joint HHH by 4%-6%, reduce cultural failures by 18%, achieve 10%-12% efficiency gains, and limit leakage to 0.3%.
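摘要提到用 SimHash 防止数据泄漏,即过滤近重复的 prompt。以下是一个教科书式的 64 位 SimHash 玩具实现(分词方式、哈希函数与位宽均为示意,非论文实现):近重复文本的指纹汉明距离明显小于无关文本。

```python
import hashlib

def simhash(text, bits=64):
    """玩具版 64 位 SimHash:按词特征累加各比特的 ±1 投票后取符号。

    近重复文本只改动少数词,指纹只在少数比特上翻转。
    """
    v = [0] * bits
    for word in text.lower().split():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

a = simhash("describe a traditional festival celebrated in this region")
b = simhash("describe a traditional festival celebrated in that region")
c = simhash("explain the accounting treatment of lease liabilities")
print(hamming(a, b) < hamming(c, a))  # 近重复对的距离更小
```

去重时通常设定一个汉明距离阈值,小于阈值的一对 prompt 视为泄漏候选而被剔除。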

[NLP-58] Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection ACL2026

【速读】: 该论文旨在解决事实验证系统对“半真半假”(half-truths)的识别盲区问题,即那些语句本身事实正确但因关键上下文缺失而具有误导性的陈述。传统验证方法通常聚焦于显性虚假信息,忽视了通过省略内容进行操纵的隐蔽性。解决方案的关键在于提出RADAR框架——一种基于角色锚定的多智能体辩论机制,其中“政客”(Politician)与“科学家”(Scientist)在共享检索到的证据上进行对抗性推理,由中立的“法官”(Judge)进行仲裁,并引入双阈值早期终止控制器以自适应判断何时达到充分推理并作出判定。该设计通过角色分工和动态控制策略,在噪声环境中有效提升对缺失上下文的敏感度,同时降低推理成本,实现了更准确、高效的遗漏信息检测。

链接: https://arxiv.org/abs/2604.19005
作者: Yixuan Tang,Yirui Zhang,Hang Feng,Anthony K.H. Tung
机构: National University of Singapore (新加坡国立大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026

点击查看摘要

Abstract:Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at this https URL.
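双阈值提前终止控制器的具体规则论文摘要未给出;下面是一个假设性的极简示意:当裁判(Judge)的判定置信度连续若干轮高于上阈值(可以下判)或低于下阈值(辩论无进展)时停止辩论。阈值与耐心轮数均为虚构超参数。

```python
def should_stop(conf_history, high=0.9, low=0.2, patience=2):
    """双阈值提前终止示意(阈值为假设):置信度连续 patience 轮
    高于 high(判定已稳)或低于 low(继续辩论收益甚微)时终止。"""
    if len(conf_history) < patience:
        return False
    recent = conf_history[-patience:]
    return all(c >= high for c in recent) or all(c <= low for c in recent)

print(should_stop([0.5, 0.6]))           # → False,未触及任一阈值
print(should_stop([0.5, 0.92, 0.95]))    # → True,置信度已稳定在高位
```

这样的控制器解释了摘要中“adaptively decides when sufficient reasoning has been reached”以及降低推理成本的来源。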

[NLP-59] When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

【速读】: 该论文旨在解决当前大型推理模型(Large Reasoning Models, LRMs)安全评估中忽视推理过程中逐步演化危害的问题。现有方法仅关注最终输出,而未捕捉到有害行为在多步推理链中如何逐步显现,例如抑制拒绝、合理化合规、分解有害任务及隐藏风险等阶段。为此,作者提出HarmThoughts基准,其核心创新在于构建了一个包含16种有害推理行为的细粒度分类体系,将这些行为划分为四个功能类别,以刻画危害传播路径而非单纯识别危害结果。该数据集包含56,931句来自4个模型家族的推理轨迹文本,并附有句子级行为标签,从而支持对推理过程中的安全状态进行精细化监测与诊断,填补了现有安全评测在过程层面的空白。

链接: https://arxiv.org/abs/2604.19001
作者: Ishita Kakkar,Enze Zhang,Rheeya Uppaal,Junjie Hu
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces – a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. \ourdataset is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each annotated with fine-grained sentence-level behavioral labels. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white-box and black-box detectors on the task of identifying harmful reasoning behaviours on HarmThoughts. Our results show that existing detectors struggle with fine-grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring. HarmThoughts is available publicly at: this https URL

[NLP-60] R2-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

【速读】: 该论文旨在解决扩散大语言模型(Diffusion Large Language Models, dLLMs)在实际推理过程中存在的高延迟问题,其核心瓶颈在于解码过程中的冗余性。具体而言,这种冗余包括由置信度聚类和位置歧义引起的空间冗余,以及因重复掩码已稳定预测 token 所导致的时间冗余。解决方案的关键在于提出 R²-dLLM 框架,从推理和训练两个维度统一减少冗余:在推理阶段引入无需训练的解码规则,通过聚合局部置信度与 token 预测并提前固化稳定 token 来避免冗余步骤;在训练阶段设计一种冗余感知的监督微调流程,使模型对高效解码轨迹进行对齐,从而降低对人工设定阈值的依赖。实验表明,R²-dLLM 在保持生成质量的同时,可将解码步骤数最多减少 75%。

链接: https://arxiv.org/abs/2604.18995
作者: Zhenbang Du,Kejing Xia,Xinrui Zhong,Yonggan Fu,Nicolai Oswald,Binfei Ji,Brucek Khailany,Pavlo Molchanov,Yingyan Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose R^2-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that R^2-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.
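针对摘要中“将时间上已稳定的预测固化、避免反复重掩码”的思路,下面给出一个与具体模型无关的玩具示意:若某个位置的预测 token 在最近 k 个去噪步中都未变化,则将其最终确定,不再参与重掩码。k 为假设超参数,非论文设定。

```python
def finalize_stable(step_preds, k=3):
    """时间冗余消减示意:最近 k 步预测均未变化的位置被固化。

    step_preds:每个去噪步的预测 token id 列表,各步等长;
    返回可固化的位置下标集合。
    """
    if len(step_preds) < k:
        return set()
    last = step_preds[-1]
    stable = set(range(len(last)))
    for prev in step_preds[-k:]:
        stable &= {i for i, t in enumerate(prev) if t == last[i]}
    return stable

steps = [
    [7, 3, 9, 1],
    [7, 4, 9, 1],
    [7, 4, 9, 2],
    [7, 4, 9, 2],
]
print(sorted(finalize_stable(steps, k=3)))  # → [0, 1, 2],位置 3 仍在波动
```

被固化的位置不再消耗后续解码步,这正是“减少解码步数”收益的直观来源。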

[NLP-61] STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming ACL2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对恶意提示(jailbreak prompts)时易被诱导产生有害或不当响应的安全漏洞问题。解决方案的关键在于提出一种名为STAR-Teaming的黑盒自动化红队框架,其核心创新是将多智能体系统(Multi-Agent System, MAS)与策略-响应多重网络(Strategy-Response Multiplex Network)相结合,并通过网络驱动优化方法高效采样攻击策略。该方法将高维嵌入空间重构为可解析的结构化网络,不仅提升了对LLM战略脆弱性的可解释性,还通过语义社区组织搜索空间,避免冗余探索,从而在更低计算成本下显著提升攻击成功率(Attack Success Rate, ASR)。

链接: https://arxiv.org/abs/2604.18976
作者: MinJae Jung,YongTaek Lim,Chaeyun Kim,Junghwan Kim,Kihyun Kim,Minwoo Kim
机构: DATUMO INC
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 Findings

点击查看摘要

Abstract:While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM’s strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at this https URL.

[NLP-62] Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在社交媒体分析任务中缺乏系统性评估与可复现基准的问题。具体而言,研究聚焦于三个核心任务:社交媒体作者身份验证(Social Media Authorship Verification)、帖子生成(Social Media Post Generation)以及用户属性推断(User Attribute Inference),并针对每个任务设计了严谨的评估框架与指标。其解决方案的关键在于构建了一个统一且全面的评估体系,涵盖多样化的用户和帖子采样策略、针对“已见数据”偏差的缓解机制(如使用2024年1月后新收集的数据进行泛化测试)、真实用户感知实验以衡量生成内容的可信度,以及基于标准化分类体系(IAB Tech Lab 2023 和 U.S. SOC)的属性标注与基准对比。该方法不仅提升了评估的科学性与公平性,也为后续LLM驱动的社会媒体分析研究提供了可复现的基准资源。

链接: https://arxiv.org/abs/2604.18955
作者: Ramtin Davoudi,Kartik Thakkar,Nazanin Donyapour,Tyler Derr,Hamid Karimi
机构: Utah State University (犹他州立大学); Vanderbilt University (范德比尔特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate “seen-data” bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users’ perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.

[NLP-63] A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition

【速读】: 该论文旨在解决命名实体识别(Named Entity Recognition, NER)模型在处理用户生成内容(User-Generated Content, UGC)时性能显著下降的问题。现有方法多聚焦于针对噪声症状的局部修复,如新词、别名漂移或类别不平衡等,但未能有效泛化,因其忽略了UGC中固有的结构稀疏性。研究发现,表面噪声的根本原因是信息密度(Information Density, ID)低;通过控制实体稀有性和标注一致性进行分层实验,作者证实ID是独立的关键因素,并提出注意力谱分析(Attention Spectrum Analysis, ASA)来量化ID降低导致的“注意力钝化”现象。解决方案的核心是提出一种无需修改模型架构的通用框架——窗口感知优化模块(Window-Aware Optimization Module, WOM),该模块利用大语言模型(LLM)识别信息稀疏区域,并通过选择性回译定向增强语义密度,从而在标准UGC数据集上实现最高达4.5%的F1分数提升,且在WNUT2017上达到新的SOTA性能。

链接: https://arxiv.org/abs/2604.18944
作者: Jiang Xiaobo,Dinghong Lai,Song Qiu,Yadong Deng,Xinkai Zhan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation – employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to ``attention blunting,‘’ ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and utilizes selective back-translation to directionally enhance semantic density without altering model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.
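WOM 的第一步是定位信息稀疏区域。论文摘要未给出信息密度的具体计算方式,下面以“滑动窗口内内容词占比”作为假设性的代理指标,给出一个可运行的最小示意(函数名、窗口大小与阈值均为本文虚构,非论文实现):

```python
def sparse_windows(tokens, is_content, win=5, threshold=0.4):
    """滑动窗口计算信息密度(此处假设为内容词占比),
    返回低于阈值的 (起点, 终点, 密度) 区间,供后续定向增强使用。"""
    spans = []
    for i in range(max(1, len(tokens) - win + 1)):
        window = tokens[i:i + win]
        density = sum(1 for t in window if is_content(t)) / len(window)
        if density < threshold:
            spans.append((i, i + win, density))
    return spans

# 示例:以简单停用词表近似"非内容词"
stopwords = {"the", "a", "of", "is", "in"}
tokens = "the a of entity the of a is the a".split()
low_density = sparse_windows(tokens, lambda t: t not in stopwords)
```

实际系统中,这些低密度区间即可交给 LLM 做选择性回译,以增强局部语义密度。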

[NLP-64] Disparities In Negation Understanding Across Languages In Vision-Language Models

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)中存在的肯定偏倚(affirmation bias)问题,即模型在面对包含否定语义的图像描述时,倾向于选择阳性表述(如“X存在”),而非正确表达否定内容(如“无X”)。这一偏差在多语言场景下尤为显著,因不同语言的否定结构在形态学、词序和助词化等方面差异显著,导致现有针对英语的解决方案难以公平地适用于所有语言群体。论文提出并构建了首个经人工验证的多语言否定基准,覆盖七种类型学多样的语言(包括拉丁语系与非拉丁语系),系统评估了CLIP、SigLIP及MultiCLIP三种VLMs的表现,并测试了SpaceVLM这一否定修正方法。关键发现在于:MultiCLIP在各语言中表现出最一致的高准确率,而SpaceVLM虽对部分语言(如英语、希腊语、西班牙语和他加禄语)改善显著,但其效果受语言类型特征影响明显,揭示出语言属性(如形态复杂性、书写系统和否定结构)与模型改进之间存在交互关系,凸显了构建多语言基准对实现全球公平部署VLMs的重要性。

链接: https://arxiv.org/abs/2604.18942
作者: Charikleia Moraitaki,Sarah Pan,Skyler Pulling,Gwendolyn Flusche,Kumail Alhamoud,Marzyeh Ghassemi
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions (“X is present”) even when the correct description contains negation (“no X”). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages - particularly English, Greek, Spanish, and Tagalog - while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.

[NLP-65] Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

【速读】: 该论文旨在解决如何利用语音发音编码(Speech Articulatory Coding, SPARC)特征来线性预测表面肌电图(sEMG)包络的问题,特别是在出声、模仿和默念三种不同言语模式下。其解决方案的关键在于采用弹性网正则化的多变量时间响应函数(elastic-net multivariate temporal response function, mTRF)方法,并通过句子级别的交叉验证,证明SPARC特征在所有电极和所有言语模式中均显著优于单热编码的音素特征;同时,通过方差分解分析表明SPARC具有显著的独特解释能力,且mTRF权重模式揭示了电极位置与发音运动之间的解剖学可解释关系,这些关系在不同言语模式间保持一致,从而支持SPARC作为基于sEMG的无声言语建模中的稳健且可解释的中间表征。

链接: https://arxiv.org/abs/2604.18920
作者: Chenqian Le,Ruisi Li,Beatrice Fumagalli,Xupeng Chen,Amirhossein Khalilian-Gourtani,Tianyu He,Adeen Flinker,Yao Wang
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We test whether Speech Articulatory Coding (SPARC) features can linearly predict surface electromyography (sEMG) envelopes across aloud, mimed, and subvocal speech in twenty-four subjects. Using elastic-net multivariate temporal response function (mTRF) with sentence-level cross-validation, SPARC yields higher prediction accuracy than phoneme one-hot representations on nearly all electrodes and in all speech modes. Aloud and mimed speech perform comparably, and subvocal speech remains above chance, indicating detectable articulatory activity. Variance partitioning shows a substantial unique contribution from SPARC and a minimal unique contribution from phoneme features. mTRF weight patterns reveal anatomically interpretable relationships between electrode sites and articulatory movements that remain consistent across modes. This study focuses on representation/encoding analysis (not end-to-end decoding) and supports SPARC as a robust and interpretable intermediate target for sEMG-based silent-speech modeling.

[NLP-66] Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data

【速读】: 该论文旨在解决现有主题建模方法在应用于外部结果分析(如员工士气)时,难以同时实现可解释性、主题特异性(与具体行为或特征的对齐度)和极性立场一致性(主题内部不混杂正负面评价)的问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)生成满足上述三重属性的主题,并构建一个以主题特异性和极性立场一致性为核心评价指标的评估框架,从而显著提升主题在实际应用中的解释力和预测效度。

链接: https://arxiv.org/abs/2604.18919
作者: Yura Yoshida,Masato Kanai,Masataka Nakayama,Haruki Ohsawa,Yukiko Uchida,Arata Yuminaga,Gakuse Hoshina,Nobuo Sayama
机构: Accenture Japan(埃森哲日本); Kyoto University(京都大学); Openwork; Integral
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Analyzing topics extracted from text data in relation to external outcomes is important across fields such as computational social science and organizational research. However, existing topic modeling methods struggle to simultaneously achieve interpretability, topic specificity (alignment with concrete actions or characteristics), and polarity stance consistency (absence of mixed positive and negative evaluations within a topic). Focusing on leadership analysis using corporate review data, this study proposes a method leveraging large language models to generate topics that satisfy these properties, along with an evaluation framework tailored to external outcome analysis. The framework explicitly incorporates topic specificity and polarity stance consistency as evaluation criteria and examines automated evaluation methods based on existing metrics. Using employee reviews from OpenWork, a major corporate review platform in Japan, the proposed method achieves improved interpretability, specificity, and polarity consistency compared to existing approaches. In analyses of external outcomes such as employee morale, it also produces topics with higher explanatory power. These results suggest that the proposed method and evaluation framework provide a generalized approach for topic analysis in applications involving external outcomes.

[NLP-67] MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation ACL2026

【速读】: 该论文旨在解决当前多语言大语言模型(Multilingual Large Language Models, LLMs)在处理语法性别(grammatical gender)和形态一致性的能力不足问题,尤其是在形态丰富的语言中,性别对动词变位、代词及第一人称表达的影响尚未被充分研究。其解决方案的关键在于构建了一个名为MORPHOGEN的大型基准数据集,该数据集聚焦于三种类型学上差异显著的语法性别语言(法语、阿拉伯语和印地语),并通过核心任务GENFORM要求模型将第一人称句子重写为相反性别但保持语义与结构不变,从而提供一种基于形态学的诊断性评估工具,揭示现有模型在性别敏感生成方面的局限性与潜在改进方向。

链接: https://arxiv.org/abs/2604.18914
作者: Mehul Agarwal,Aditya Aggarwal,Arnav Goel,Medha Hira,Anubha Gupta
机构: SBILab, Indraprastha Institute of Information Technology Delhi(SBILab,德里印地普拉斯特拉信息科技学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, accepted to ACL 2026 (Main)

点击查看摘要

Abstract:While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B-70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.

[NLP-68] LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval ACL2026

【速读】: 该论文旨在解决知识图谱(Knowledge Graph, KG)与大语言模型(Large Language Models, LLMs)集成过程中多跳检索(multi-hop retrieval)的效率、可扩展性与可解释性难以平衡的问题。现有系统在处理大规模KG时,往往面临计算资源消耗高、检索路径不透明或难以扩展至十亿边级别等挑战。解决方案的关键在于提出LogosKG框架,其核心创新是基于符号化知识图谱表示形式,将实体和关系分解为结构化的三元组表示,并通过硬件友好的操作实现高效的k跳遍历;同时引入度感知分区(degree-aware partitioning)、跨图路由(cross-graph routing)和按需缓存(on-demand caching)机制,显著提升了系统在超大规模KG上的可扩展性与执行效率,且保持了检索结果的准确性与可解释性。

链接: https://arxiv.org/abs/2604.18913
作者: He Cheng,Yifu Wu,Saksham Khatwani,Maya Kruse,Dmitriy Dligach,Timothy A. Miller,Majid Afshar,Yanjun Gao
机构: LARK Lab, University of Colorado Anschutz; University of Colorado Boulder; Loyola University Chicago; Harvard Medical School; Boston Children’s Hospital; University of Wisconsin-Madison
类目: Computation and Language (cs.CL)
备注: Accepted to the ACL 2026 Main Conference. 9 pages

点击查看摘要

Abstract:Knowledge graphs (KGs) are increasingly integrated with large language models (LLMs) to provide structured, verifiable reasoning. A core operation in this integration is multi-hop retrieval, yet existing systems struggle to balance efficiency, scalability, and interpretability. We introduce LogosKG, a novel, hardware-aligned framework that enables scalable and interpretable k-hop retrieval on large KGs by building on symbolic KG formulations and executing traversal as hardware-efficient operations over decomposed subject, object, and relation representations. To scale to billion-edge graphs, LogosKG integrates degree-aware partitioning, cross-graph routing, and on-demand caching. Experiments show substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity. With proven performance in KG retrieval, a downstream two-round KG-LLM interaction demonstrates how LogosKG enables large-scale, evidence-grounded analysis of how KG topology, such as hop distribution and connectivity, shapes the alignment between structured biomedical knowledge and LLM diagnostic reasoning, thereby opening the door for next-generation KG-LLM integration. The source code is publicly available at this https URL, and an online demo is available at this https URL.
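LogosKG 的核心操作是在大规模 KG 上做 k 跳检索。其硬件对齐的分解表示、度感知分区与缓存细节以论文源码为准,下面仅用纯 Python 邻接表给出 k 跳遍历本身的最小示意(数据与函数名均为本文虚构):

```python
from collections import defaultdict

def k_hop(triples, seeds, k):
    """从 seeds 出发,沿 (主语, 关系, 宾语) 三元组做 k 跳遍历,
    返回途经的全部三元组作为可解释的检索证据。"""
    adj = defaultdict(list)
    for s, r, o in triples:
        adj[s].append((r, o))
    frontier, visited = set(seeds), set(seeds)
    evidence = []
    for _ in range(k):
        nxt = set()
        for node in frontier:
            for r, o in adj[node]:
                evidence.append((node, r, o))
                if o not in visited:
                    nxt.add(o)
        visited |= nxt
        frontier = nxt
    return evidence

# 两跳示例:aspirin -> fever -> flu
triples = [("aspirin", "treats", "fever"), ("fever", "symptom_of", "flu")]
paths = k_hop(triples, seeds={"aspirin"}, k=2)
```

论文的贡献在于把这一遍历改写成对分解后的主语/关系/宾语表示的硬件友好批量操作,并配合分区与路由扩展到十亿边规模。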

[NLP-69] Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

【速读】: 该论文旨在解决如何从大型语言模型(Large Language Models, LLMs)的残差流(residual streams)中几何上可恢复有害意图(harmful intent)的问题,从而实现对模型输出潜在风险行为的高效检测。其核心贡献在于揭示了有害意图在不同模型架构和对齐变体中均表现为一种稳定可识别的几何结构:多数层中为线性方向,而在投影方法失效的层中则体现为角度偏差(angular deviation)。关键解决方案包括三种有效的方向探测策略——软AUC优化的线性方向、类别均值探针以及监督的角度偏差策略,其中后者尤其重要,因其在中间层仍保持检测能力,且与投影方法所得方向差异达73°,表明其代表了一种表征上独立的检测路径。此外,研究发现有害意图的表征与拒绝行为(refusal behavior)功能解耦,即使在被“手术移除”拒绝机制的模型中依然可检,说明该表征是语言理解过程中的固有属性,而非对齐调整的结果。这一发现对安全评估具有重要启示:高AUROC值(>0.97)可能高估实际可操作检测性能,应结合TPR@1% FPR指标以更准确衡量安全性评估效果。

链接: https://arxiv.org/abs/2604.18901
作者: Isaac Llorente-Saguer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 25 pages, 7 figures, 11 tables. Code at this https URL

点击查看摘要

Abstract:Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), under single-turn, English evaluation, we characterise this geometry through six direction-finding strategies. Three succeed: a soft-AUC-optimised linear direction reaches mean AUROC 0.98 and TPR@1%FPR 0.80; a class-mean probe reaches 0.98 and 0.71 at 1ms fitting cost; a supervised angular-deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction (73° from projection-based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held-out HarmBench and JailbreakBench with worst-case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains ≥ 0.98 and cross-variant transfer stays within 0.018 of own-direction performance. This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@1% FPR should accompany AUROC in safety-adjacent evaluation.

[NLP-70] Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在形式化数学推理任务中,特别是针对等式理论(equational theories)的蕴含关系判定问题——即判断一个等式定律是否在所有原群(magmas)上由另一个等式定律推导得出。这是一个一般情况下不可判定的问题,但对 FALSE 类型实例可通过有限模型搜索实现可判定性。研究通过系统性地设计、测试和分析超过40种提示(prompt)变体(长度从0到4,878字节不等),在三个语言模型(gpt-oss-120b、Llama 3.3 70B 和 Gemma 4 31B)上进行评估,发现尽管投入大量工程努力,模型在硬样本上的准确率存在一个“单提示上限”:gpt-oss-120b 的平衡硬准确率稳定在约60–79%之间,相比无提示基线为59.75%。其关键解决方案在于识别出导致性能饱和的三大机制:(1) TRUE 情况的数学不可判定性限制了任何有限提示所能编码的信息;(2) 复杂规则系统显著降低弱模型(如 Llama 3.3 70B)的表现(提示超过2KB时 TRUE 召回率跌至0%);(3) 提示顺序效应与模型注意力机制交互产生非单调且脆弱的影响。最终最优提示(AN45c,2,252字节)在 hard3 数据集上达到79.25%准确率,其中 TRUE 召回率达95.9%,FALSE 召回率为63.4%,较基线提升19.5个百分点。

链接: https://arxiv.org/abs/2604.18897
作者: Manuel Israel Cazares
机构: Bytepro AI
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Companion repository: this https URL | Zenodo DOI: https://doi.org/10.5281/zenodo.19598433 | v15: final Contributor Network data (n=52, competition close April 20, 2026)

点击查看摘要

Abstract:We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas – a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60–79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at this https URL.
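摘要中“FALSE 实例可通过有限模型搜索判定”的含义是:只要找到一个满足前提定律但违反结论定律的小原群(magma),即可证伪蕴含。下面给出该思路的最小示意(定律用回调函数表示,仅枚举到 2 元原群;与竞赛的实际实现无关):

```python
import itertools

def satisfies(op, law, size):
    # 对论域内所有元素组合检查等式定律是否成立
    return all(law(op, x, y) for x in range(size) for y in range(size))

def find_countermodel(law_a, law_b, max_size=2):
    """枚举小规模原群的运算表,寻找满足 law_a 但违反 law_b 的反模型;
    找到即说明 law_a 不蕴含 law_b(FALSE 实例)。"""
    for size in range(1, max_size + 1):
        for table in itertools.product(range(size), repeat=size * size):
            op = lambda x, y, t=table, s=size: t[x * s + y]
            if satisfies(op, law_a, size) and not satisfies(op, law_b, size):
                return size, table
    return None

commutative = lambda op, x, y: op(x, y) == op(y, x)   # x*y = y*x
idempotent = lambda op, x, y: op(x, x) == x           # x*x = x
# 交换律不蕴含幂等律:存在 2 元反模型(如常值运算或异或)
counter = find_countermodel(commutative, idempotent)
```

TRUE 实例(蕴含确实成立)则无法用任何有限枚举确认,这正是摘要所述单提示上限的数学根源。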

[NLP-71] Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

【速读】: 该论文旨在解决强化学习中多模态推理(multimodal reasoning)时存在的“推理-答案不一致”(reasoning-answer inconsistency)问题,即模型虽然最终答案正确,但其推理过程可能依赖于不完整推导、弱证据或自相矛盾的陈述。为提升推理的有效性与可靠性,论文提出通过轨迹监督(trajectory supervision)来优化策略学习。解决方案的关键在于引入分组排序奖励机制(Groupwise Ranking Reward),该机制在一次评审中对同一提示下通过验证器(verifier)的多个正确轨迹进行排序,并据此重新分配奖励,从而更高效地区分强弱正确的推理路径,相比生成式奖励(Generative Rewards, GRs)降低评审(judge)开销并提升稳定性,最终显著提高可靠条件下的准确率(从47.4%提升至54.7%)。

链接: https://arxiv.org/abs/2604.18892
作者: Mengzhao Jia,Zhihan Zhang,Meng Jiang
机构: University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs), and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.
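分组排序奖励的骨架是:对同一提示下通过验证器的全部正确轨迹排序,再按名次重新分配奖励。论文中的排序器与分配函数另有设计,此处以线性再分配为假设给出示意(rank_fn 为待注入的假设接口):

```python
def groupwise_ranking_reward(trajectories, rank_fn, base_reward=1.0):
    """rank_fn 返回轨迹下标从强到弱的排序;
    名次越靠前的正确轨迹获得越高的奖励(线性再分配,仅为示意)。"""
    order = rank_fn(trajectories)
    n = len(order)
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = base_reward * (n - rank) / n
    return rewards

# 示例:假设排序器判定轨迹 2 推理最强,其次是 0、1
trajs = ["traj_a", "traj_b", "traj_c"]
rewards = groupwise_ranking_reward(trajs, rank_fn=lambda ts: [2, 0, 1])
```

与对每条轨迹单独打分的生成式奖励相比,组内一次性排序只需一次评审调用,这是其开销更低的来源。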

[NLP-72] Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成引用时频繁出现虚构但看似可信的文献引用问题,即“引用幻觉”(citation hallucination)。研究发现,作者姓名字段的错误率显著高于其他引用字段,且不同字段间的幻觉信号不具备通用性。解决方案的关键在于利用模型内部神经元层面的上下文嵌入变换跟踪(CETT)值,通过弹性网正则化结合稳定性选择识别出一组稀疏的、领域特异性的幻觉神经元(field-specific hallucination neurons, FH-neurons),并通过因果干预验证其作用:增强这些神经元会加剧幻觉,抑制则能提升多字段下的引用准确性,尤其在某些字段中效果更显著。该方法仅依赖模型内部信号即可实现轻量级检测与缓解,无需外部知识库或复杂修改。

链接: https://arxiv.org/abs/2604.18880
作者: Yuefei Chen,Yihao Quan,Xiaodong Lin,Ruixiang Tang
机构: Rutgers University (罗格斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
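识别 FH-neurons 的骨架是“稀疏拟合 + 稳定性选择”:在多次自助重采样上运行稀疏模型,只保留被反复选中的神经元(特征)。论文使用弹性网正则化模型作为稀疏拟合器,下面用一个按绝对协方差取 top-k 的占位拟合器给出可运行示意(top_cov 为本文虚构的占位,非论文所用弹性网):

```python
import random

def stability_selection(X, y, fit_sparse, n_boot=50, threshold=0.6):
    """在 n_boot 次自助重采样上运行稀疏拟合器 fit_sparse,
    返回选中频率不低于 threshold 的特征(神经元)下标。"""
    n, d = len(X), len(X[0])
    counts = [0] * d
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        for j in fit_sparse([X[i] for i in idx], [y[i] for i in idx]):
            counts[j] += 1
    return [j for j, c in enumerate(counts) if c / n_boot >= threshold]

def top_cov(X, y, k=1):
    # 占位拟合器:按与标签的绝对协方差取前 k 个特征
    n, d = len(X), len(X[0])
    my = sum(y) / n
    scores = []
    for j in range(d):
        mx = sum(row[j] for row in X) / n
        cov = sum((row[j] - mx) * (yi - my) for row, yi in zip(X, y)) / n
        scores.append((abs(cov), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

random.seed(0)
y = [i % 2 for i in range(40)]
X = [[float(yi), random.random(), random.random()] for yi in y]  # 特征 0 携带信号
stable = stability_selection(X, y, top_cov)
```

稳定性选择的好处是:单次拟合偶然选中的噪声特征会在重采样间被稀释,留下的稀疏集合才被视为候选幻觉神经元。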

[NLP-73] LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在巴西法律文本分类任务中表现不佳的问题,特别是其对特定法律领域(如行政法)的识别能力严重不足。解决方案的关键在于构建首个公开的巴西法律文本分类基准 LegalBench-BR,并采用基于 LoRA(Low-Rank Adaptation)的微调策略,在仅更新 0.3% 模型参数的前提下,实现了高达 87.6% 的准确率和 0.87 的宏 F1 分数,显著优于商用模型(如 GPT-4o mini 和 Claude 3.5 Haiku),尤其在行政法类别上弥补了它们近乎零 F1 的缺陷。该方法证明了领域适配微调对于法律 NLP 任务的重要性,并可在消费级 GPU 上实现零边际推理成本的高性能部署。

链接: https://arxiv.org/abs/2604.18878
作者: Pedro Barbosa de Carvalho Neto
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 1 figure. Preprint. First public benchmark for Brazilian legal text classification. Dataset and model available on Hugging Face

点击查看摘要

Abstract:We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted fine-tuning eliminates. These results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even when the task is a simple 5-class problem, and that LoRA fine-tuning on a consumer GPU closes the gap at zero marginal inference cost. We release the full dataset, model, and pipeline to enable reproducible research in Portuguese legal NLP.

[NLP-74] Human-Guided Harm Recovery for Computer Use Agents

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在真实计算机系统中执行动作时,因预防机制失效而导致的有害状态问题,其核心挑战是“有害状态后的恢复”(harm recovery)——即如何在对齐人类偏好的前提下,最优地将代理从有害状态引导回安全状态。解决方案的关键在于:首先通过用户研究识别出人类偏好的恢复维度并构建自然语言评判标准(rubric),其次基于1,150条成对判断数据训练一个奖励模型(reward model),用于在测试阶段对代理生成的多个候选恢复方案进行重排序,从而提升恢复轨迹的质量;同时提出BackBench基准测试集以系统评估恢复能力。这一方法标志着Agent安全研究从单纯预防转向“预防+后处理”的新范式。

链接: https://arxiv.org/abs/2604.18847
作者: Christy Li,Sky CH-Wang,Andi Peng,Andreea Bobu
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); Abridge; humans; Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent’s ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods – ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.
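其测试时流程可概括为 best-of-n 重排序:代理脚手架生成多个候选恢复方案,由训练好的奖励模型打分后取最优。以下为该流程的框架性示意(propose 与 reward_model 均为假设接口,具体奖励模型见论文):

```python
def select_recovery_plan(state, propose, reward_model, n=4):
    """从代理采样 n 个候选恢复方案,用奖励模型按人类偏好打分,
    返回得分最高的方案(best-of-n 重排序)。"""
    candidates = [propose(state) for _ in range(n)]
    scores = [reward_model(state, plan) for plan in candidates]
    return candidates[scores.index(max(scores))]

# 玩具示例:奖励模型偏好更有针对性的恢复方案
plans = iter(["wipe disk", "restore file", "reboot", "email admin"])
best = select_recovery_plan(
    state="deleted_config",
    propose=lambda s: next(plans),
    reward_model=lambda s, p: 1.0 if p == "restore file" else 0.0,
)
```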

[NLP-75] Semantic Needles in Document Haystacks: Sensitivity Testing of LLM -as-a-Judge Similarity Scoring

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文档对比较中对细微语义变化的敏感性问题,即如何系统性地评估LLM在面对不同类型的语义扰动时,其相似度评分行为是否稳定且可解释。解决方案的关键在于提出一个可扩展、多因素实验框架,通过模拟“针在 haystack 中”的场景,在控制变量条件下系统测试五种LLM在数千个文档对上的表现,其中扰动类型包括否定、连词互换、命名实体替换,同时调节上下文类型(相关或无关)、扰动位置和文档长度。该框架揭示了LLM存在文档内位置偏差、上下文一致性影响评分分布以及模型特异性指纹等关键现象,从而为LLM语义相似度评估提供了结构化、可复现的审计工具,超越单纯依赖语义变化本身的研究范式。

链接: https://arxiv.org/abs/2604.18835
作者: Sinan G. Aksoy,Alexandra A. Sabrio,Erik VonKaenel,Lee Burke
机构: Pacific Northwest National Laboratory (太平洋西北国家实验室); Washington University in St. Louis (圣路易斯华盛顿大学); Humana Inc. (Humana公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 8 figures

点击查看摘要

Abstract:We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length across all combinations, testing five LLMs on tens of thousands of document pairs. Our analysis reveals several striking findings. First, LLMs exhibit a within-document positional bias distinct from previously studied candidate-order effects: most models penalize semantic differences more harshly when they occur earlier in a document. Second, when the altered sentence is surrounded by topically unrelated context, it systematically lowers similarity scores and induces bipolarized scores that indicate either very low or very high similarity. This is consistent with an interpretive frame account in which topically-related context may allow models to contextualize and downweight the alterations. Third, each LLM produces a qualitatively distinct scoring distribution, a stable “fingerprint” that is invariant to perturbation type, yet all models share a universal hierarchy in how leniently they treat different perturbation types. Together, these results demonstrate that LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity in ways that go beyond the semantic change itself, and that the proposed framework offers a practical, LLM-agnostic toolkit for auditing and comparing scoring behavior across current and future models.
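该框架的“数万文档对”规模来自全因子设计:扰动类型、上下文类型、针的位置与文档长度全部交叉组合。条件矩阵可用 itertools.product 直接枚举(下列取值水平仅为示意,并非论文的实际设定):

```python
import itertools

perturbations = ["negation", "conjunction_swap", "entity_replacement"]
contexts = ["original", "unrelated"]
positions = ["start", "middle", "end"]
lengths = [200, 500, 1000]

# 每个条件对应一批 (原文档, 扰动文档) 对,交由各 LLM 打相似度分
conditions = list(itertools.product(perturbations, contexts, positions, lengths))
```

全因子交叉使得位置偏差、上下文一致性等效应可以在控制其余因素的前提下被分离出来。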

[NLP-76] Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models ACL2026

【速读】: 该论文旨在解决科学可行性评估(scientific feasibility assessment)中如何有效利用大语言模型(LLM)进行判断的问题,即在给定假设的情况下,模型能否准确预测该假设是否可行,并提供合理解释。其解决方案的关键在于将可行性评估建模为诊断推理任务,并系统性地考察不同形式的实验证据(包括实验描述与结果数据)对模型决策可靠性的影响。研究发现,在控制知识条件下,提供结果证据(outcome evidence)比仅提供实验描述更可靠,且能提升准确性;而实验文本在上下文不完整时可能引入脆弱性,导致性能下降。这一发现明确了实验证据在LLM驱动可行性评估中的适用边界和优化路径。

链接: https://arxiv.org/abs/2604.18786
作者: Seyedali Mohammadi,Manas Gaur,Francis Ferraro
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026

点击查看摘要

Abstract:Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.

[NLP-77] Mango: Multi-Agent Web Navigation via Global-View Optimization

【速读】: 该论文旨在解决现有网页代理(Web Agent)在复杂网站中因从根URL开始探索而导致的效率低下问题,尤其是在缺乏全局网站结构认知的情况下,容易陷入导航陷阱、探索无关分支或无法在有限预算内到达目标信息。解决方案的关键在于提出Mango方法,其核心是将URL选择建模为多臂老虎机(Multi-Armed Bandit, MAB)问题,并采用Thompson采样策略自适应分配导航预算至候选URL;同时引入情景记忆(episodic memory)组件存储历史导航轨迹,使代理能够从过往尝试中学习,从而动态优化起始点选择和导航路径。

链接: https://arxiv.org/abs/2604.18779
作者: Weixi Tong,Yifeng Di,Tianyi Zhang
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website’s structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi-agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi-armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT-5-mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open-source and closed-source models as backbones. Our data and code are open-source and available at this https URL.
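Mango 将 URL 选择建模为多臂老虎机,并用 Thompson 采样在候选起始 URL 间分配导航预算。下面以 Beta-Bernoulli 后验给出最小示意(URL、成功率与轮数均为本文虚构的模拟设定):

```python
import random

def thompson_select(stats):
    """stats: url -> (成功次数, 失败次数)。
    从每个 URL 的 Beta(成功+1, 失败+1) 后验采样,返回采样值最大的 URL。"""
    best_url, best_sample = None, -1.0
    for url, (succ, fail) in stats.items():
        sample = random.betavariate(succ + 1, fail + 1)
        if sample > best_sample:
            best_url, best_sample = url, sample
    return best_url

def update(stats, url, success):
    succ, fail = stats[url]
    stats[url] = (succ + 1, fail) if success else (succ, fail + 1)

# 模拟 200 轮导航:/docs 分支成功率最高,预算应逐渐向它倾斜
random.seed(0)
stats = {"/docs": (0, 0), "/forum": (0, 0), "/api": (0, 0)}
true_rate = {"/docs": 0.8, "/forum": 0.2, "/api": 0.1}
for _ in range(200):
    url = thompson_select(stats)
    update(stats, url, random.random() < true_rate[url])
```

相比固定从根 URL 开始,这类后验采样会把有限的导航预算自适应地集中到历史上更可能抵达目标的入口。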

[NLP-78] An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

Quick read: This paper tackles the difficulty of detecting jailbreak behaviour in large language models (LLMs): strongly aligned models produce harmful outputs only rarely, so evaluating a single output underestimates their vulnerability. The key idea is multi-sample auditing. Evaluating generations under different sampling budgets shows that single-output evaluation systematically understates jailbreak risk, while moderately increasing the number of sampled generations markedly improves detection sensitivity. Further analysis shows that detection signals partially generalise across models, with stronger transfer within related model families, and that lexical (e.g., TF-IDF) detectors capture a mixture of behavioural signals and topic-specific cues rather than purely harmful behaviour. The paper therefore advocates moderate multi-sample auditing as a more reliable and practical approach to jailbreak detection.

Link: https://arxiv.org/abs/2604.18775
Authors: Hanrui Luo,Shreyank N Gowda
Affiliations: University of Nottingham
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of output based jailbreak detection under realistic conditions using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. We evaluate both a lexical TF-IDF detector and a generation inconsistency based detector across different sampling budgets. Our results show that single output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behaviour. The most significant improvements occur when moving from a single generation to moderate sampling, while larger sampling budgets yield diminishing returns. Cross generator experiments demonstrate that detection signals partially generalise across models, with stronger transfer observed within related model families. A category level analysis further reveals that lexical detectors capture a mixture of behavioural signals and topic specific cues, rather than purely harmful behaviour. Overall, our findings suggest that moderate multi sample auditing provides a more reliable and practical approach for estimating model vulnerability and improving jailbreak detection in large language models. Code will be released.
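The headline finding follows from simple sampling statistics: if a model emits a harmful output with small per-sample probability p, the chance of observing at least one harmful generation in k independent samples is 1 - (1 - p)^k, with diminishing per-sample gains as k grows. A quick illustration (the harm rate here is hypothetical, not a number from the paper):

```python
def detect_prob(p, k):
    """Probability that at least one of k independent samples
    is harmful, given per-sample harm probability p."""
    return 1 - (1 - p) ** k

p = 0.05  # hypothetical rare-harm rate for a strongly aligned model
curve = {k: round(detect_prob(p, k), 3) for k in (1, 5, 10, 50)}
print(curve)  # single-sample auditing sees only p; moderate k sees much more
```

The marginal gain of the (k+1)-th sample is p(1 - p)^k, which shrinks geometrically, matching the paper's observation that moderate sampling captures most of the benefit.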

[NLP-79] Model-Agnostic Meta Learning for Class Imbalance Adaptation ACL2026

Quick read: This paper targets the pervasive class-imbalance problem in natural language processing (NLP) tasks, which significantly hinders robust performance across domains and applications. The proposed solution is a unified framework, Hardness-Aware Meta-Resample (HAMR), whose key components are a bi-level optimization strategy and neighborhood-aware resampling: the former dynamically estimates instance-level weights that prioritize genuinely challenging samples and minority classes, while the latter amplifies training focus on hard examples and their semantically similar neighbors. Experiments show that HAMR substantially improves minority-class performance on imbalanced datasets across multiple domains and consistently outperforms strong baselines.

Link: https://arxiv.org/abs/2604.18759
Authors: Hanshu Rao,Guangzeng Han,Xiaolei Huang
Affiliations: University of Memphis
Subjects: Computation and Language (cs.CL)
Comments: Accepted to Findings of ACL 2026

View abstract

Abstract:Class imbalance is a widespread challenge in NLP tasks, significantly hindering robust performance across diverse domains and applications. We introduce Hardness-Aware Meta-Resample (HAMR), a unified framework that adaptively addresses both class imbalance and data difficulty. HAMR employs bi-level optimizations to dynamically estimate instance-level weights that prioritize genuinely challenging samples and minority classes, while a neighborhood-aware resampling mechanism amplifies training focus on hard examples and their semantically similar neighbors. We validate HAMR on six imbalanced datasets covering multiple tasks and spanning biomedical, disaster response, and sentiment domains. Experimental results show that HAMR achieves substantial improvements for minority classes and consistently outperforms strong baselines. Extensive ablation studies demonstrate that our proposed modules synergistically contribute to performance gains and highlight HAMR as a flexible and generalizable approach for class imbalance adaptation. Code is available at this https URL.

[NLP-80] Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation ACL2026

Quick read: This paper addresses machine translation from a low-resource language (Coptic) into English, where the core challenge is the scarcity of parallel corpora and lexical resources. The key idea is an in-context learning framework with syntactic augmentation: inputs are enriched with several representations of Universal Dependencies parses, including raw parser outputs, parses verbalized in plain English, and targeted instructions for difficult constructions. Experiments show that combining this syntactic information with retrieved bilingual dictionary entries significantly improves translation across model sizes, achieving new state-of-the-art results for Coptic-English translation.

Link: https://arxiv.org/abs/2604.18758
Authors: Abhishek Purushothama,Emma Thronson,Alexia Guo,Amir Zeldes
Affiliations: Georgetown University
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026 Findings camera-ready

View abstract

Abstract:Low-resource machine translation requires methods that differ from those used for high-resource languages. This paper proposes a novel in-context learning approach to support low-resource machine translation of the Coptic language to English, with syntactic augmentation from Universal Dependencies parses of input sentences. Building on existing work using bilingual dictionaries to support inference for vocabulary items, we add several representations of syntactic analyses to our inputs, specifically exploring the inclusion of raw parser outputs, verbalizations of parses in plain English, and targeted instructions of difficult constructions identified in sub-trees and how they can be translated. Our results show that while syntactic information alone is not as useful as dictionary-based glosses, combining retrieved dictionary items with syntactic information achieves significant gains across model sizes, achieving new state-of-the-art translation results for Coptic.

[NLP-81] Towards Understanding the Robustness of Sparse Autoencoders

Quick read: This paper studies the vulnerability of large language models (LLMs) to optimization-based jailbreak attacks, which exploit internal gradient structure to generate adversarial inputs. The key idea is to integrate pretrained sparse autoencoders (SAEs) into the transformer residual stream at inference time, without modifying model weights or blocking gradient flow. This parameter-free, non-invasive intervention preserves performance on normal tasks while significantly reducing jailbreak success rates (by up to 5x) and suppressing cross-model attack transferability. Its effectiveness depends on sparsity strength and the layer of deployment, supporting a "representational bottleneck" hypothesis: sparse projection reshapes the optimization geometry that attackers rely on.

Link: https://arxiv.org/abs/2604.18756
Authors: Ahson Saiyed,Sabrina Sadiekh,Chirag Agarwal
Affiliations: University of Virginia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

View abstract

Abstract:Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
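Architecturally, the intervention replaces a residual-stream activation x with its SAE reconstruction decode(encode(x)), where the encoder keeps only a few active features. Below is a toy top-k SAE sketch under made-up weights; real SAEs are pretrained, much wider, and include bias terms, so this only shows the shape of the computation.

```python
def topk_sae(x, W_enc, W_dec, k=2):
    """Replace activation x with a sparse reconstruction: apply a ReLU
    encoder, keep the k largest feature activations, then decode."""
    feats = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_enc]
    cutoff = sorted(feats, reverse=True)[k - 1] if k <= len(feats) else 0.0
    sparse = [f if f >= cutoff and f > 0 else 0.0 for f in feats]
    # decode: x_hat[d] = sum_j sparse[j] * W_dec[j][d]
    return [sum(s * W_dec[j][d] for j, s in enumerate(sparse))
            for d in range(len(x))]

# Hypothetical 3-feature dictionary over a 2-dimensional residual stream.
x = [1.0, 0.2]
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(topk_sae(x, W_enc, W_dec, k=2))
```

The L0 sparsity knob from the paper's dose-response finding corresponds to k here: lowering k tightens the bottleneck at the cost of reconstruction fidelity on clean inputs.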

[NLP-82] Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

Quick read: This paper addresses structural flaws in the Token-to-Token (T2T) editing mechanism of masked diffusion language models such as LLaDA2.1: the trigger fails when no single alternative crosses the confidence threshold, replacements are computed under contexts that may themselves contain errors, and the uniform perturbations used in training do not match the coherent, semantically plausible mistakes the model actually makes at inference. The key idea is a new Token-to-Mask (T2M) remasking rule: instead of overwriting a suspect token with a new guess, the position is reset to the mask state so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, introduces no new parameters, and modifies only the editing rule. Combined with three detection heuristics, it improves accuracy on 8 benchmarks for tasks requiring exact token-level output, with its largest gain of +5.92 points on CMATH, where 79.9% of baseline errors are attributed to "last-mile corruption" (correct reasoning followed by a garbled final answer) and T2M repairs 41.3% of these cases.

Link: https://arxiv.org/abs/2604.18738
Authors: Lin Yao
Affiliations: Shanghai Jiao Tong University; Zhongguancun Academy
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose Token-to-Mask (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to last-mile corruption (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.
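The T2M rule itself is easy to state: a suspect committed token is reset to the mask id rather than overwritten with a new guess, so the next denoising step re-predicts it in context. Here is a model-free sketch contrasting the two editing rules; the mask id, confidence scores, and thresholds are placeholders, and the real method operates inside the denoising loop of a masked diffusion model.

```python
MASK = -1  # placeholder mask id

def t2t_edit(tokens, alt_tokens, alt_conf, thresh=0.9):
    """Token-to-Token baseline: overwrite a committed token only when
    some single alternative token is confident enough."""
    return [a if c >= thresh else t
            for t, a, c in zip(tokens, alt_tokens, alt_conf)]

def t2m_remask(tokens, suspect_score, thresh=0.5):
    """Token-to-Mask: reset suspect positions to MASK instead of guessing;
    the next denoising step re-predicts them from an in-distribution context."""
    return [MASK if s >= thresh else t
            for t, s in zip(tokens, suspect_score)]

tokens = [11, 12, 13, 14]
suspicious = [0.1, 0.9, 0.2, 0.7]      # hypothetical detector output
print(t2m_remask(tokens, suspicious))  # positions 1 and 3 are remasked
```

Note that T2M only needs a per-position suspicion score (supplied by the paper's detection heuristics), not a confident replacement token, which is exactly the failure mode of the T2T trigger.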

[NLP-83] Investigating Counterfactual Unfairness in LLMs towards Identities through Humor ACL2026

Quick read: This paper examines the social biases that generative AI reveals when handling humor, using counterfactual unfairness to expose asymmetric model responses toward different identity groups in humor generation, intent recognition, and social-impact prediction. The key contribution is an analysis framework spanning three tasks (humor generation refusal, speaker intention inference, and relational/societal impact prediction) together with interpretable bias metrics that capture asymmetric patterns under identity swaps: when the identities of speaker and addressee are exchanged, models show significant differences in joke refusal rates, frequency of maliciousness judgments, and social-harm ratings, quantifying how stereotyping and sensitivity coexist within these models.

Link: https://arxiv.org/abs/2604.18729
Authors: Shubin Kim,Yejin Son,Junyeong Park,Keummin Ka,Seungbeen Lee,Jaeyoung Lee,Hyeju Jang,Alice Oh,Youngjae Yu
Affiliations: Yonsei University; KAIST; Seoul National University; Indiana University Indianapolis
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Main Conference. The first two authors contributed equally. The last three authors are co-corresponding authors

View abstract

Abstract:Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model’s responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These patterns highlight how sensitivity and stereotyping coexist in generative models, complicating efforts toward fairness and cultural alignment.

[NLP-84] Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP ACL2026

Quick read: This survey addresses the "script barrier" in natural language processing (NLP), where differences in writing systems limit cross-lingual transfer. The key idea is transliteration: converting text of a target language into the source language's script increases lexical overlap and facilitates knowledge transfer between languages in language models. The paper systematically reviews the settings and methods for incorporating transliteration into cross-lingual NLP and offers concrete recommendations for choosing a transliteration strategy based on task requirements, language characteristics, and resource constraints, with the goal of improving the effectiveness and efficiency of modern large language models (LLMs) in multilingual scenarios.

Link: https://arxiv.org/abs/2604.18722
Authors: Thanmay Jayakumar,Deepon Halder,Raj Dabre
Affiliations: Nilekani Centre at AI4Bharat; Indian Institute of Technology Madras; Indian Institute of Engineering, Science and Technology, Shibpur
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, ACL 2026 (Findings)

View abstract

Abstract:Cross-lingual transfer in NLP is often hindered by the ``script barrier’’ where differences in writing systems inhibit transfer learning between languages. Transliteration, the process of converting the script, has emerged as a powerful technique to bridge this gap by increasing lexical overlap. This paper provides a comprehensive survey of the application of transliteration in cross-lingual NLP. We present a taxonomy of key motivations to utilize transliterations in language models, and provide an overview of different approaches of incorporating transliterations as input. We analyze the evolution and effectiveness of these methods, discussing the critical trade-offs involved, and contextualize their need in modern LLMs. The review explores various settings that show how transliteration is beneficial, including handling code-mixed text, leveraging language family relatedness, and pragmatic gains in inference efficiency. Based on this analysis, we provide concrete recommendations for researchers on selecting and implementing the most appropriate transliteration strategy based on their specific language, task, and resource constraints.

[NLP-85] Characterizing AlphaEarth Embedding Geometry for Agentic Environmental Reasoning

Quick read: This paper asks how the geometric structure of Earth-observation foundation-model embeddings affects downstream environmental reasoning. Although these models encode land-surface information into dense high-dimensional vectors (e.g., Google AlphaEarth's 64-dimensional embeddings), their manifold geometry has been poorly understood, limiting precise reasoning over such embeddings. The key contributions are twofold. First, a large-scale empirical characterization shows the embedding space is non-Euclidean (effective dimensionality of only 13.3, local intrinsic dimensionality of about 10, sharply rotating tangent spaces, and low local-global alignment), so conventional linear assumptions break down. Second, an agentic system built on a FAISS index uses local geometry to guide retrieval rather than relying on vector arithmetic, enabling physically coherent environmental reasoning; experiments confirm it substantially outperforms parametric-only models on multi-step comparison tasks (mean score 4.28).

Link: https://arxiv.org/abs/2604.18715
Authors: Mashrekur Rahman,Samuel J. Barrett,Christina Last
Affiliations: Dartmouth College; LGND AI; TipplyAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Earth observation foundation models encode land surface information into dense embedding vectors, yet the geometric structure of these representations and its implications for downstream reasoning remain underexplored. We characterize the manifold geometry of Google AlphaEarth’s 64-dimensional embeddings across 12.1 million Continental United States samples (2017–2023) and develop an agentic system that leverages this geometric understanding for environmental reasoning. The manifold is non-Euclidean: effective dimensionality is 13.3 (participation ratio) from 64 raw dimensions, with local intrinsic dimensionality of approximately 10. Tangent spaces rotate substantially, with 84% of locations exceeding 60\textdegree and local-global alignment (mean |\cos\theta| = 0.17 ) approaching the random baseline of 0.125. Supervised linear probes indicate that concept directions rotate across the manifold, and compositional vector arithmetic using both PCA-derived and probe-derived directions yields poor precision. Retrieval instead produces physically coherent results, with local geometry predicting retrieval coherence ( R^2 = 0.32 ). Building on this characterization, we introduce an agentic system with nine specialized tools that decomposes environmental queries into reasoning chains over a FAISS-indexed embedding database. A five-condition ablation (120 queries, three complexity tiers) shows that embedding retrieval dominates response quality ( \mu = 3.79 \pm 0.90 vs.\ 3.03 \pm 0.77 parametric-only; scale 1–5), with peak performance on multi-step comparisons ( \mu = 4.28 \pm 0.43 ). A cross-model benchmark show that geometric tools reduce Sonnet 4.5’s score by 0.12 points but improve Opus 4.6’s by 0.07, with Opus achieving higher geometric grounding (3.38 vs.\ 2.64), suggesting that the value of geometric characterization scales with the reasoning capability of the consuming model.
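The effective-dimensionality figure quoted above (13.3 out of 64 raw dimensions) is a participation ratio over the covariance eigenvalue spectrum, PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues). A minimal sketch with sanity checks, using a hypothetical decaying spectrum rather than the paper's data:

```python
def participation_ratio(eigenvalues):
    """Effective dimensionality of a covariance spectrum:
    PR = (sum of eigenvalues)**2 / (sum of squared eigenvalues)."""
    s = sum(eigenvalues)
    return s * s / sum(v * v for v in eigenvalues)

# Sanity checks: a flat spectrum uses all dimensions, a single spike uses one.
print(participation_ratio([1.0] * 64))          # 64.0
print(participation_ratio([1.0] + [0.0] * 63))  # 1.0

# A geometrically decaying 64-dimensional spectrum collapses to ~9 effective dims.
decaying = [0.8 ** i for i in range(64)]
print(round(participation_ratio(decaying), 2))
```

The same quantity applied to AlphaEarth's empirical covariance yields the 13.3 reported in the abstract; it is a global summary and says nothing about the local tangent-space rotation the paper also measures.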

[NLP-86] Probing for Reading Times ACL2026

Quick read: This paper asks whether language-model representations capture cognitive signals from human reading, by comparing how well representations from each model layer predict human eye-tracking measures (e.g., first fixation and gaze duration). The key method is regularized linear regression on two eye-tracking corpora spanning five languages, comparing layer-wise representations against scalar predictors such as surprisal, information value, and logit-lens surprisal. The findings: early-layer representations outperform scalar surprisal for early-pass measures (first fixation and gaze duration), suggesting human-like processing signatures are already present in low-level structural or lexical representations, whereas late-pass measures (e.g., total reading time) are still best predicted by scalar surprisal despite its highly compressed form. Combining scalar surprisal with early-layer representations yields further gains, revealing a functional alignment between model depth and the temporal stages of human reading.

Link: https://arxiv.org/abs/2604.18712
Authors: Eleftheria Tsipidi,Samuel Kiegeland,Francesco Ignazio Re,Tianyang Xu,Mario Giulianelli,Karolina Stanczak,Ryan Cotterell
Affiliations: ETH Zürich; Toyota Technological Institute at Chicago; University College London
Subjects: Computation and Language (cs.CL)
Comments: ACL 2026 (main conference)

View abstract

Abstract:Probing has shown that language model representations encode rich linguistic information, but it remains unclear whether they also capture cognitive signals about human processing. In this work, we probe language model representations for human reading times. Using regularized linear regression on two eye-tracking corpora spanning five languages (English, Greek, Hebrew, Russian, and Turkish), we compare the representations from every model layer against scalar predictors – surprisal, information value, and logit-lens surprisal. We find that the representations from early layers outperform surprisal in predicting early-pass measures such as first fixation and gaze duration. The concentration of predictive power in the early layers suggests that human-like processing signatures are captured by low-level structural or lexical representations, pointing to a functional alignment between model depth and the temporal stages of human reading. In contrast, for late-pass measures such as total reading time, scalar surprisal remains superior, despite its being a much more compressed representation. We also observe performance gains when using both surprisal and early-layer representations. Overall, we find that the best-performing predictor varies strongly depending on the language and eye-tracking measure.
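The probes are ordinary regularized linear regressions from representations (or scalar predictors) to reading-time measures. As a minimal sketch, here is closed-form ridge regression for a single centered feature on synthetic data; the real probes regress full hidden-state vectors onto eye-tracking measures with cross-validated regularization.

```python
def ridge_1d(xs, ys, alpha=1.0):
    """Closed-form ridge for one centered feature:
    w = sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Synthetic "predictor -> gaze duration" data with true slope 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.0, -2.0, 0.0, 2.0, 4.0]
print(ridge_1d(xs, ys, alpha=0.0))   # 2.0: with no penalty, OLS recovers the slope
print(ridge_1d(xs, ys, alpha=10.0))  # shrunk toward zero by the ridge penalty
```

Comparing a probe on a single scalar (surprisal) against one on a high-dimensional layer representation is what lets the paper ask which layers carry reading-time-relevant information beyond surprisal.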

[NLP-87] Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

Quick read: This paper addresses the disconnect between data-extraction risk in large language model (LLM) APIs and conventional indistinguishability-based privacy measures (such as differential privacy bounds or low membership-inference distinguishability). The authors show that indistinguishability is neither sufficient nor necessary to prevent extraction; the two notions are formally incomparable. They therefore propose a new definition, (l, b)-inextractability, requiring that any black-box adversary needs at least 2^b expected queries to induce the API to emit a protected substring of length l. The definition is instantiated via a worst-case extraction game, with a rank-based upper bound on targeted exact extraction risk, extended to untargeted and approximate extraction. The key contribution is a computable extraction-risk estimator that captures risk over multiple attack trials and prefix adaptations, gives tight estimates for standard greedy extraction, and upper-bounds probabilistic extraction risk under any decoding configuration. Empirical results demonstrate its advantage over existing estimators and yield practical mitigation guidance across training, API access, and decoding strategies.

Link: https://arxiv.org/abs/2604.18697
Authors: Ruixuan Liu,David Evans,Li Xiong
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by SP 2026

View abstract

Abstract:Indistinguishability properties such as differential privacy bounds or low empirically measured membership inference are widely treated as proxies to show a model is sufficiently protected against broader memorization risks. However, we show that indistinguishability properties are neither sufficient nor necessary for preventing data extraction in LLM APIs. We formalize a privacy-game separation between extraction and indistinguishability-based privacy, showing that indistinguishability and inextractability are incomparable: upper-bounding distinguishability does not upper-bound extractability. To address this gap, we introduce (l, b) -inextractability as a definition that requires at least 2^b expected queries for any black-box adversary to induce the LLM API to emit a protected l -gram substring. We instantiate this via a worst-case extraction game and derive a rank-based extraction risk upper bound for targeted exact extraction, as well as extensions to cover untargeted and approximate extraction. The resulting estimator captures the extraction risk over multiple attack trials and prefix adaptations. We show that it can provide a tight and efficient estimation for standard greedy extraction and an upper bound on the probabilistic extraction risk given any decoding configuration. We empirically evaluate extractability across different models, clarifying its connection to distinguishability, demonstrating its advantage over existing extraction risk estimators, and providing actionable mitigation guidelines across model training, API access, and decoding configurations in LLM API deployment. Our code is publicly available at: this https URL.
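One way to read the (l, b)-inextractability threshold: if each black-box query independently emits the protected l-gram with probability at most q, the expected number of queries to extract it is 1/q, so b bits of inextractability correspond to q <= 2^-b. A small illustrative helper; this is a simplification of the paper's game-based definition, which quantifies over adaptive adversaries rather than assuming independent queries.

```python
from math import log2

def inextractability_bits(q):
    """Security bits implied by per-query extraction success probability q:
    expected queries to extract = 1/q = 2**b, so b = -log2(q)."""
    return -log2(q)

# A protected string emitted with probability 2**-20 per query needs
# about a million expected queries: 20 bits of inextractability.
print(inextractability_bits(2 ** -20))  # 20.0
```

Under this reading, mitigations that lower the per-query emission probability (decoding constraints, output filtering) directly raise b.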

[NLP-88] Owner-Harm: A Missing Threat Model for AI Agent Safety

Quick read: This paper addresses a long-overlooked gap in AI agent safety evaluation: harm inflicted on the deployer (Owner-Harm), where an agent damages its own deployer's business interests or information security while carrying out tasks, e.g., leaking credentials, tampering with calendar data, or publishing operational information without authorization. Existing benchmarks focus on generic criminal harm (cybercrime, harassment, etc.) and miss this commercially consequential threat category. The core contribution is a formal Owner-Harm threat model plus a Symbolic-Semantic Defense Generalization (SSDG) framework, which combines a gating mechanism with a deterministic post-audit verifier for layered detection, raising the detection rate of owner-harming behaviour (TPR from 14.8% to 85.3%). The analysis also exposes the limits of environment-bound symbolic rules when transferring across tool vocabularies, showing that structured semantic alignment, not simple text concatenation, is the basis of effective detection.

Link: https://arxiv.org/abs/2604.18658
Authors: Dongcheng Zhang,Yiqing Jiang
Affiliations: BlueFocus Communication Group; Tongji University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages. Companion manuscript on per-decision proof-obligation synthesis (LSVJ-S) in preparation

View abstract

Abstract:Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-world incidents illustrate the gap: Slack AI credential exfiltration (Aug 2024), Microsoft 365 Copilot calendar-injection leaks (Jan 2024), and a Meta agent unauthorized forum post exposing operational data (Mar 2026). We propose Owner-Harm, a formal threat model with eight categories of agent behavior damaging the deployer. We quantify the defense gap on two benchmarks: a compositional safety system achieves 100% TPR / 0% FPR on AgentHarm (generic criminal harm) yet only 14.8% (4/27; 95% CI: 5.9%-32.5%) on AgentDojo injection tasks (prompt-injection-mediated owner harm). A controlled generic-LLM baseline shows the gap is not inherent to owner-harm (62.7% vs. 59.3%, delta 3.4 pp) but arises from environment-bound symbolic rules that fail to generalize across tool vocabularies. On a post-hoc 300-scenario owner-harm benchmark, the gate alone achieves 75.3% TPR / 3.3% FPR; adding a deterministic post-audit verifier raises overall TPR to 85.3% (+10.0 pp) and Hijacking detection from 43.3% to 93.3%, demonstrating strong layer complementarity. We introduce the Symbolic-Semantic Defense Generalization (SSDG) framework relating information coverage to detection rate. Two SSDG experiments partially validate it: context deprivation amplifies the detection gap 3.4x (R = 3.60 vs. R = 1.06); context injection reveals structured goal-action alignment, not text concatenation, is required for effective owner-harm detection.

[NLP-89] Unlocking the Edge deployment and on-device acceleration of multi-LoRA enabled one-for-all foundational LLM ACL2026

Quick read: This paper tackles the engineering challenges of deploying large language models (LLMs) on edge devices such as smartphones: high memory footprint, high inference latency, and limited runtime flexibility. The key idea is a hardware-aware, efficient inference framework that injects application-specific low-rank adaptations (LoRAs) as runtime inputs into a single frozen inference graph, enabling dynamic task switching without recompilation or extra memory overhead. A multi-stream decoding mechanism concurrently generates stylistically different responses (formal, polite, or jovial) in a single forward pass, cutting latency by up to 6x; combined with a Dynamic Self-Speculative Decoding (DS2D) strategy, INT4 quantization, and architecture-level optimizations, the system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks, demonstrating the feasibility and commercial viability of multi-use-case LLMs on mobile devices.

Link: https://arxiv.org/abs/2604.18655
Authors: Sravanth Kodavanti,Sowmya Vajrala,Srinivas Miriyala,Utsav Tiwari,Uttam Kumar,Utkarsh Kumar Mahawar,Achal Pratap Singh,Arya D,Narendra Mutyala,Vikram Nelvoy Rajendiran,Sharan Kumar Allur,Euntaik Lee,Dohyoung Kim,HyeonSu Lee,Gyusung Cho,JungBae Kim
Affiliations: Samsung Research Institute Bangalore, India; Samsung Electronics, Suwon, South Korea
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ACL 2026

View abstract

Abstract:Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with SM8650 and SM8750 Qualcomm chipsets respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations - such as formal, polite, or jovial responses - within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to 2.3x speedup in decode time. Combined with quantization to INT4 and architecture-level optimizations, our system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of Generative AI in mobile platforms.
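The runtime-LoRA idea rests on the usual low-rank update y = W x + B (A x): the base weight W stays frozen inside the compiled graph, and only the small (A, B) adapter pair is swapped per task. A toy sketch with hypothetical shapes; the paper's framework feeds the adapters as graph inputs on Qualcomm NPUs, which this plain-Python version does not model.

```python
def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x). W is the frozen base weight; swapping
    the small (A, B) pair at runtime switches tasks with no recompilation."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # rank-r bottleneck: A is r x d, B is d x r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy 2x2 identity base with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # 1 x 2
B = [[0.5], [0.5]]      # 2 x 1
print(lora_forward([2.0, 4.0], W, A, B))  # [5.0, 7.0]
```

Because only A and B change between applications, the memory cost of supporting many use cases is the sum of the tiny adapters, not of full model copies.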

[NLP-90] Two-dimensional early exit optimisation of LLM inference

Quick read: This paper targets the high computational cost of large language models (LLMs) on classification tasks, aiming for efficient inference without a significant loss of accuracy. The core challenge is avoiding unnecessary deep-layer computation. The key idea is a two-dimensional (2D) early-exit strategy that coordinates sentence-wise and layer-wise exiting: the model processes input incrementally, sentence by sentence, and decides at each layer whether to exit early, yielding multiplicative computational savings across the layer and sentence dimensions. Experiments show the method outperforms early-exit strategies that optimize either dimension alone across several mainstream LLMs, with 1.4-2.3x speed-ups on simpler tasks, and that it is model-agnostic and broadly applicable.

Link: https://arxiv.org/abs/2604.18592
Authors: Jan Hůla,David Adamczyk,Tomáš Filip,Martin Pavlíček,Petr Sosík
Affiliations: Institute for Research and Applications of Fuzzy Modelling, University of Ostrava; Institute of Computer Science, Faculty of Philosophy and Science, Silesian University in Opava
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B-8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4–2.3 \times over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to complementary efficiency methods such as quantization and pruning. Our findings indicate that 2D early exit strategies excel when semantic information accumulates predictably across input structure, suggesting possible applicability to sequence-processing tasks beyond sentiment classification.
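The multiplicative saving comes from combining two stopping rules: stop adding sentences once the prediction is confident, and stop deepening layers at each prefix. A toy control-flow sketch follows; the `confidence` callback stands in for the paper's lightweight classification adapters, and the exact scan order here is an illustrative simplification.

```python
def classify_2d_early_exit(sentences, confidence, conf_thresh=0.9, n_layers=32):
    """Process input sentence-by-sentence while deepening layer-by-layer;
    exit as soon as any (sentence prefix, layer) cell is confident enough.
    confidence(k, l) returns (label, score) after k sentences and l layers."""
    cost = 0
    label = None
    for k in range(1, len(sentences) + 1):    # sentence-wise dimension
        for l in range(1, n_layers + 1):      # layer-wise dimension
            cost += 1                         # one adapter evaluation
            label, score = confidence(k, l)
            if score >= conf_thresh:
                return label, cost            # 2D early exit
    return label, cost                        # fell through to full compute

# Hypothetical oracle: confident once 2 sentences and 4 layers are seen.
oracle = lambda k, l: ("positive", 0.95 if k >= 2 and l >= 4 else 0.5)
print(classify_2d_early_exit(["s1", "s2", "s3"], oracle))
```

With the oracle above, the classifier stops after 36 adapter calls instead of the full 3 x 32 = 96, which is the kind of multiplicative saving the paper measures on real models.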

[NLP-91] Who Shapes Brazil's Vaccine Debate? Semi-Supervised Modeling of Stance and Polarization in YouTube's Media Ecosystem

Quick read: This paper investigates how online misinformation, political polarization, and declining institutional trust undermined vaccination efforts during the COVID-19 pandemic, focusing on the long-term vaccine discourse missing from non-English contexts such as Brazil. The key contribution is a semi-supervised stance-detection framework combining self-labeling and self-training, applied to classify nearly 1.4 million Brazilian YouTube comments, enabling fine-grained tracking of how pro- and anti-vaccine narratives evolved and circulated across the full span of the national immunization program. The method substantially improves the robustness of stance classification and reveals structural vulnerabilities of science-communication and digital-native media channels within the health-information ecosystem.

Link: https://arxiv.org/abs/2604.18586
Authors: Geovana S. de Oliveira,Ana P. C. Silva,Fabricio Murai,Carlos H. G. Ferreira
Affiliations: Universidade Federal de Ouro Preto; Universidade Federal de Minas Gerais; Worcester Polytechnic Institute
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Paper accepted at WebSci’26

View abstract

Abstract:Vaccination remains a cornerstone of global public health, yet the COVID-19 pandemic exposed how online misinformation, political polarization, and declining institutional trust can undermine immunization efforts. Most of the prior computational studies that analyzed vaccine discourse on social platforms focus on English-language data, specific vaccines, or short time windows, impairing our understanding of long-term dynamics in high-impact, non-English contexts like Brazil, home to one of the world’s most comprehensive immunization systems. We here present the largest longitudinal study of Brazil’s vaccine discourse on YouTube, leveraging a semi-supervised stance detection framework that combines self-labeling and self-training to classify nearly 1.4 million comments. By integrating stance with temporal patterns, engagement metrics, and channel taxonomy (legacy media, science communicators, digital-native outlets), we map how pro- and anti-vaccine narratives evolve and circulate within a hybrid media ecosystem. Our results show that semi-supervised learning substantially improves stance classification robustness, enabling fine-grained tracking of public attitudes across Brazil’s full immunization schedule. Polarization spikes during epidemiological crises, especially COVID-19, but becomes fragmented across vaccines and interaction patterns in the post-pandemic period. Notably, science communication and digital-native channels emerge as the primary loci of both supportive and oppositional engagement, revealing structural vulnerabilities in contemporary health communication. Thus, our work advances computational methods for large-scale stance modeling while offering actionable evidence for public health agencies, platform governance, and online information ecosystems.

[NLP-92] Scaling Test-Time Compute for Agentic Coding

Quick read: This paper addresses the central challenge of test-time scaling for long-horizon coding agents: how to represent, select from, and reuse prior rollout trajectories rather than simply generating more attempts. Conventional methods suit short outputs that can be directly compared and ranked, but each agentic attempt produces a long trajectory of actions, observations, errors, and partial progress that is hard to reuse directly. The key idea is a framework built on compact trajectory representations: each rollout is converted into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal details. This representation supports two forms of inference-time scaling: parallel scaling via Recursive Tournament Voting (RTV), which narrows a population of rollout summaries through small-group comparisons, and sequential scaling via Parallel-Distill-Refine (PDR), which conditions new rollouts on summaries distilled from prior attempts, substantially improving frontier coding agents on SWE-Bench Verified and Terminal-Bench v2.0.

Link: https://arxiv.org/abs/2604.16529
Authors: Joongwon Kim,Wannan Yang,Kelvin Niu,Hongming Zhang,Yun Zhu,Eryk Helenowski,Ruan Silva,Zhengxing Chen,Srinivasan Iyer,Manzil Zaheer,Daniel Fried,Hannaneh Hajishirzi,Sanjeev Arora,Gabriel Synnaeve,Ruslan Salakhutdinov,Anirudh Goyal
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 70 pages, 26 figures, 12 tables

View abstract

Abstract:Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, by using our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
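RTV's selection loop, recursively narrowing a population of rollout summaries through small-group contests, can be sketched independently of the judging model. In the sketch below the judge is a stub that picks by a numeric score; in the paper the judge is an LLM comparing structured summaries, and the group size and scores here are hypothetical.

```python
def recursive_tournament(candidates, judge, group_size=4):
    """Narrow a population to one winner by repeated small-group contests.
    judge(group) returns the preferred member of a group."""
    while len(candidates) > 1:
        groups = [candidates[i:i + group_size]
                  for i in range(0, len(candidates), group_size)]
        candidates = [judge(g) for g in groups]  # one winner per group
    return candidates[0]

# Stub judge: each candidate is (summary_name, hypothetical quality score).
rollouts = [(f"summary-{i}", s) for i, s in enumerate([3, 9, 1, 7, 5, 8, 2, 6, 4])]
winner = recursive_tournament(rollouts, judge=lambda g: max(g, key=lambda c: c[1]))
print(winner[0])  # summary-1, the highest-scoring rollout
```

Keeping comparisons within small groups is what makes this practical for long summaries: each judge call sees only `group_size` candidates, so context length stays bounded as the population grows.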

信息检索

[IR-0] ECLASS-Augmented Semantic Product Search for Electronic Components

【速读】:该论文旨在解决工业电子元器件检索中因自然语言查询与属性驱动的产品描述之间存在词汇不匹配(vocabulary mismatch)而导致的传统检索方法(如BM25)效果不佳的问题。其核心解决方案是采用大语言模型(LLM)辅助的稠密检索(dense retrieval)策略,并引入ECLASS标准中的分层语义信息增强产品表征,通过嵌入(embedding)方法实现更精准的语义匹配。关键创新在于将标准化的层级元数据融入检索流程,显著提升了检索效果——在专家查询下Hit_Rate@5达到94.3%,远超BM25的31.4%,同时优于基础模型网络搜索基线,在有效性和效率上均取得突破。

链接: https://arxiv.org/abs/2604.19664
作者: Nico Baumgart,Markus Lange-Hegermann,Jan Henze
机构: 未知
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Efficient semantic access to industrial product data is a key enabler for factory automation and emerging LLM-based agent workflows, where both human engineers and autonomous agents must identify suitable components from highly structured catalogs. However, the vocabulary mismatch between natural-language queries and attribute-centric product descriptions limits the effectiveness of traditional retrieval approaches, e.g., BM25. In this work, we present a systematic evaluation of LLM-assisted dense retrieval for semantic product search on industrial electronic components, and investigate the integration of hierarchical semantics from the ECLASS standard into embedding-based retrieval. Our results show that dense retrieval combined with re-ranking substantially outperforms classical lexical methods and foundation model web-search baselines. In particular, the proposed approach achieves a Hit_Rate@5 of 94.3 %, compared to 31.4 % for BM25 on expert queries, while also exceeding foundation model baselines in both effectiveness and efficiency. Furthermore, augmenting product representations with ECLASS semantics yields consistent performance gains across configurations, demonstrating that standardized hierarchical metadata provides a crucial semantic bridge between user intent and sparse product descriptions.
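摘要中的 Hit_Rate@5 指标本身可按如下方式计算(查询结果与目标 ID 均为演示假设):

```python
def hit_rate_at_k(ranked_lists, relevant_ids, k=5):
    """ranked_lists: 每条查询按得分降序的候选ID列表;relevant_ids: 每条查询的目标ID。
    只要目标出现在前 k 个结果中即记一次命中。"""
    hits = sum(1 for ranked, rel in zip(ranked_lists, relevant_ids) if rel in ranked[:k])
    return hits / len(relevant_ids)

rate = hit_rate_at_k([["c1", "c7", "c3"], ["c9", "c2", "c4"]], ["c3", "c8"], k=5)
# rate == 0.5:第一条查询命中,第二条未命中
```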

[IR-1] From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

【速读】:该论文旨在解决推荐系统中反事实解释(Counterfactual Explanations, CEs)方法评估缺乏统一标准的问题。现有CE方法在不同数据集、推荐模型、评估指标和解释格式下进行测试,导致结果难以复现与公平比较。为解决此问题,作者系统性地复现、重实现并重新评估了11种前沿CE方法,涵盖原生解释器(如LIME-RS、SHAP、PRINCE等)和基于图神经网络(GNN)的特定解释器,并提出一个统一的基准框架,从解释格式(隐式 vs. 显式)、评估层级(项级 vs. 列表级)和扰动范围(用户交互向量 vs. 用户-物品交互图)三个维度对解释器进行全面评测。关键在于构建标准化的评估协议,引入有效性、稀疏性和计算复杂度等指标,并将项级评估扩展至Top-K列表级解释,从而揭示不同方法在多种设置下的性能差异,特别是显式解释格式下有效性与稀疏性之间的权衡关系,以及图基解释器在大规模推荐图上的可扩展性局限,进而修正了先前关于CE生成方法鲁棒性和实用性的结论。

链接: https://arxiv.org/abs/2604.19663
作者: Quang-Huy Nguyen,Thanh-Hai Nguyen,Khac-Manh Thai,Duc-Hoang Pham,Huy-Son Nguyen,Cam-Van Thi Nguyen,Masoud Mansoury,Duc-Trong Le,Hoang-Quynh Le
机构: VNU University of Engineering and Technology (河内国立大学工程与技术学院); Delft University of Technology (代尔夫特理工大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. Here, a unified benchmarking framework is proposed to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item-level and list-level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: this https URL.
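反事实解释的核心操作是寻找能翻转推荐结果的最小交互修改,最简单的单点情形可示意如下(recommend 函数与数据均为假设的占位,并非任一被评测方法的实现):

```python
def find_counterfactual(history, recommend):
    """history: 用户交互过的物品列表;recommend: 接收历史并返回 Top-1 推荐的函数。
    尝试逐个移除单条交互,找到足以改变推荐结果的最小(单元素)反事实集合。"""
    target = recommend(history)
    for item in history:
        if recommend([x for x in history if x != item]) != target:
            return [item]
    return None  # 单点移除不足以翻转时,实际方法会进一步搜索更大的子集

# 占位推荐器:只要历史中出现过 "a1" 就推荐 "A",否则推荐 "B"
toy_recommend = lambda h: "A" if "a1" in h else "B"
cf = find_counterfactual(["x", "a1", "y"], toy_recommend)
# cf == ["a1"]
```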

[IR-2] Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)对学术同行评审(peer review)核心功能影响的不确定性问题,特别是其如何改变评审报告在语言形式、评价焦点及推荐信号等方面的特征。解决方案的关键在于采用多维度分析方法:首先通过自动化标注技术识别评审语句中的具体评价维度(如原创性、可复现性等),其次利用最大似然估计法识别可能由LLM生成或修改的评审文本,进而从细粒度层面量化LLM介入对评审内容长度、流畅性、标准化程度以及深层评价维度变化的影响。研究发现,LLM使评审文本更长、更流畅且更注重摘要和表面清晰度,但削弱了对原创性、可复现性和批判性推理等深层质量要素的关注。

链接: https://arxiv.org/abs/2604.19578
作者: Wenqing Wu,Chengzhi Zhang,Yi Zhao,Tong Bao
机构: Nanjing University of Science and Technology (南京理工大学); Anhui University (安徽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: Scientometrics

点击查看摘要

Abstract:With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is improving the quality of academic manuscripts, such as clarity, originality and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.
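文中考察的表层语言特征(平均句长、平均词长)可以用很短的代码得到;示例文本为假设,真实流程当然会用更完善的分词与断句:

```python
def surface_stats(review_text):
    """返回评审文本的平均句长(每句词数)与平均词长(去标点后的字符数)。"""
    sentences = [s for s in review_text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = review_text.split()
    avg_sent_len = len(words) / len(sentences)
    avg_word_len = sum(len(w.strip(".,!?")) for w in words) / len(words)
    return avg_sent_len, avg_word_len

s_len, w_len = surface_stats("The paper is clear. The method is novel and sound.")
# s_len == 5.0,w_len == 3.9
```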

[IR-3] Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference

【速读】:该论文旨在解决当前基于Late-interaction的模型(如ColBERT)在生物医学和临床检索任务中,虽然具备可解释的token级交互得分,但其解释能力较为浅层的问题——即无法判断模型是否以稳定、可复用且上下文敏感的方式学习了临床概念,从而难以诊断模型误解、识别不合理远距离生物医学概念,或指导针对性的数据补充与反馈。解决方案的关键在于提出Diagnosable ColBERT框架,通过将ColBERT的token嵌入对齐到一个基于临床知识的参考潜在空间,并引入专家提供的概念相似性约束,使文档编码成为可检查的模型理解证据,从而实现更直接的错误诊断和更规范化的数据优化策略,而无需依赖大规模诊断查询集。

链接: https://arxiv.org/abs/2604.19566
作者: François Remy
机构: Parallia AI
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reliable biomedical and clinical retrieval requires more than strong ranking performance: it requires a practical way to find systematic model failures and curate the training evidence needed to correct them. Late-interaction models such as ColBERT provide a first solution thanks to the interpretable token-level interaction scores they expose between document and query tokens. Yet this interpretability is shallow: it explains a particular document–query pairwise score, but does not reveal whether the model has learned a clinical concept in a stable, reusable, and context-sensitive way across diverse expressions. As a result, these scores provide limited support for diagnosing misunderstandings, identifying unreasonably distant biomedical concepts, or deciding what additional data or feedback is needed to address this. In this short position paper, we propose Diagnosable ColBERT, a framework that aligns ColBERT token embeddings to a reference latent space grounded in clinical knowledge and expert-provided conceptual similarity constraints. This alignment turns document encodings into inspectable evidence of what the model appears to understand, enabling more direct error diagnosis and more principled data curation without relying on large batteries of diagnostic queries.
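ColBERT 的 token 级晚期交互打分通常采用 MaxSim:每个查询 token 取其与所有文档 token 的最大相似度再求和。下面是纯 Python 的示意(二维向量为演示假设,且假定已做 L2 归一化):

```python
def colbert_score(query_embs, doc_embs):
    """query_embs/doc_embs: token 向量列表(假定已 L2 归一化)。
    得分 = sum_i max_j <q_i, d_j>,即逐查询 token 的 MaxSim 求和。"""
    total = 0.0
    for q in query_embs:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_embs)
    return total

score = colbert_score([[1.0, 0.0], [0.0, 1.0]],
                      [[1.0, 0.0], [0.6, 0.8]])
# score == 1.0 + 0.8 == 1.8
```

论文正是认为这种逐对得分只能解释单次匹配,因而要再引入与临床知识对齐的参考潜在空间。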

[IR-4] LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

【速读】:该论文旨在解决基于Transformer的点击率(CTR)模型在参数规模扩展时带来的计算与存储开销急剧增长问题,这一问题加剧了模型扩容目标与工业部署约束之间的差距。其解决方案的关键在于提出LoopCTR框架,采用“循环缩放”(loop scaling)范式,通过递归复用共享模型层来增加训练阶段的计算量,从而实现计算资源与参数数量的解耦;该框架还结合夹心结构(sandwich architecture)、超连接残差(Hyper-Connected Residuals)和专家混合(Mixture-of-Experts),并在每个循环深度引入过程监督(process supervision),将多循环收益编码至共享参数中,最终实现“训练多循环、推理零循环”的策略——即单次前向传播无需任何循环即可超越所有基线模型,显著提升了工业场景下的部署效率与性能表现。

链接: https://arxiv.org/abs/2604.19550
作者: Jiakai Tang,Runfeng Zhang,Weiqiu Wang,Yifei Liu,Chuan Wang,Xu Chen,Yeqiu Yang,Jian Wu,Yuning Jiang,Bo Zheng
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Alibaba Group (阿里巴巴集团)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Scaling Transformer-based click-through rate (CTR) models by stacking more parameters brings growing computational and storage overhead, creating a widening gap between scaling ambitions and the stringent industrial deployment constraints. We propose LoopCTR, which introduces a loop scaling paradigm that increases training-time computation through recursive reuse of shared model layers, decoupling computation from parameter growth. LoopCTR adopts a sandwich architecture enhanced with Hyper-Connected Residuals and Mixture-of-Experts, and employs process supervision at every loop depth to encode multi-loop benefits into the shared parameters. This enables a train-multi-loop, infer-zero-loop strategy where a single forward pass without any loop already outperforms all baselines. Experiments on three public benchmarks and one industrial dataset demonstrate state-of-the-art performance. Oracle analysis further reveals 0.02–0.04 AUC of untapped headroom, with models trained with fewer loops exhibiting higher oracle ceilings, pointing to a promising frontier for adaptive inference.
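循环缩放"参数共享、递归复用"的思想可用下面的占位草图说明(共享层为演示用的假设函数,并非论文的网络结构):

```python
def forward(x, shared_layer, n_loops):
    """同一共享层被额外复用 n_loops 次:参数量不变,只增加训练期计算量。
    "训练多循环、推理零循环"即训练时 n_loops>0,上线推理取 n_loops=0。"""
    h = shared_layer(x)
    for _ in range(n_loops):
        h = shared_layer(h)
    return h

layer = lambda v: [0.5 * e + 1.0 for e in v]   # 占位"共享层"
train_out = forward([4.0], layer, n_loops=2)   # 训练:多循环 -> [2.25]
infer_out = forward([4.0], layer, n_loops=0)   # 推理:零循环 -> [3.0]
```

论文在每个循环深度加过程监督,使多循环的收益被"压"进共享参数里,零循环的单次前向才得以超越基线。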

[IR-5] Enhancing Unsupervised Keyword Extraction in Academic Papers through Integrating Highlights with Abstract

【速读】:该论文旨在解决学术论文中关键词自动提取(Automatic Keyword Extraction)的性能优化问题,特别是如何更有效地利用文本结构信息以提升提取准确率。传统方法多依赖摘要(Abstract)和参考文献,而本文创新性地引入“亮点”(Highlights)这一新兴文本模块——其作为研究核心发现与贡献的简明总结,具有高度凝练且富含语义信息的特点。解决方案的关键在于系统评估三种输入场景:仅使用摘要、仅使用亮点、以及两者结合,并通过四个无监督模型在计算机科学(CS)和图书馆与信息科学(LIS)数据集上的实验证明,融合摘要与亮点能显著提升关键词提取性能,表明亮点内容可有效补充摘要信息,增强关键词覆盖度与语义相关性。

链接: https://arxiv.org/abs/2604.19505
作者: Yi Xiang,Chengzhi Zhang
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: Scientometrics

点击查看摘要

Abstract:Automatic keyword extraction from academic papers is a key area of interest in natural language processing and information retrieval. Although previous research has mainly focused on utilizing abstract and references for keyword extraction, this paper focuses on the highlights section - a summary describing the key findings and contributions, offering readers a quick overview of the research. Our observations indicate that highlights contain valuable keyword information that can effectively complement the abstract. To investigate the impact of incorporating highlights into unsupervised keyword extraction, we evaluate three input scenarios: using only the abstract, the highlights, and a combination of both. Experiments conducted with four unsupervised models on Computer Science (CS) and Library and Information Science (LIS) datasets reveal that integrating the abstract with highlights significantly improves extraction performance. Furthermore, we examine the differences in keyword coverage and content between abstract and highlights, exploring how these variations influence extraction outcomes. The data and code are available at this https URL.
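"摘要+亮点拼接后做无监督抽取"的流程可用最简单的词频法示意(停用词表与示例文本均为假设;论文实际评测的是四种无监督模型,而非词频法):

```python
from collections import Counter

STOP = {"we", "the", "are", "is", "use", "and", "of", "a", "to"}

def top_keywords(text, k=2):
    """极简词频式无监督关键词抽取,仅用于演示输入拼接的影响。"""
    tokens = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in tokens if w and w not in STOP)
    return [w for w, _ in counts.most_common(k)]

abstract = "We study keyword extraction. Keyword models use the abstract."
highlights = "Highlights improve keyword extraction. Extraction gains are large."
kws = top_keywords(abstract + " " + highlights)   # 拼接摘要与亮点作为输入
# kws 中包含 "keyword" 与 "extraction"
```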

[IR-6] CAST: Modeling Semantic-Level Transitions for Complementary-Aware Sequential Recommendation

【速读】:该论文旨在解决顺序推荐(Sequential Recommendation, SR)中因依赖稀疏共购买统计而误将虚假相关性(如流行度偏差)当作真实互补关系的问题,同时克服现有基于语义的方法在聚合语义编码时丢失细粒度语义细节的局限。其解决方案的关键在于提出一种基于语义级转移的互补感知框架(Complementary-Aware Semantic Transition, CAST),通过两个核心模块实现:一是语义级转移模块,在离散语义码空间中直接建模动态语义转移,从而捕捉被传统聚合表示所模糊的细粒度语义依赖;二是互补先验注入模块,将大语言模型(LLM)验证的互补先验知识引入注意力机制,优先强化互补模式而非仅依赖共现统计。该方法显著提升了推荐准确性和效率,实验表明在多个电商数据集上Recall和NDCG分别提升最高达17.6%和16.0%,且训练速度加快65倍。

链接: https://arxiv.org/abs/2604.19414
作者: Qian Zhang,Lech Szymanski,Haibo Zhang,Jeremiah D. Deng
机构: University of Otago (奥塔哥大学); University of New South Wales (新南威尔士大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Sequential Recommendation (SR) aims to predict the next interaction of a user based on their behavior sequence, where complementary relations often provide essential signals for predicting the next item. However, mainstream models relying on sparse co-purchase statistics often mistake spurious correlations (e.g., due to popularity bias) for true complementary relations. Identifying true complementary relations requires capturing the fine-grained item semantics (e.g., specifications) that simple co-occurrence statistics would be unable to model. While recent semantics-based methods utilize discrete semantic codes to represent items, they typically aggregate semantic codes into coarse item representations. This aggregation process blurs specific semantic details required to identify complementarity. To address these critical limitations and effectively leverage semantics for capturing reliable complementary relations, we propose a Complementary-Aware Semantic Transition (CAST) framework that introduces a new modeling paradigm built upon semantic-level transitions. Specifically, a semantic-level transition module is designed to model dynamic transitions directly in the discrete semantic code space, effectively capturing fine-grained semantic dependencies often lost in aggregated item representations. Then, a complementary prior injection module is designed to incorporate LLM-verified complementary priors into the attention mechanism, thereby prioritizing complementary patterns over co-occurrence statistics. Experiments on multiple e-commerce datasets demonstrate that CAST consistently outperforms the state-of-the-art approaches, achieving up to 17.6% Recall and 16.0% NDCG gains with 65x training acceleration. This validates its effectiveness and efficiency in uncovering latent item complementarity beyond statistics. The code will be released upon acceptance.

[IR-7] IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

【速读】:该论文旨在解决现有金融自然语言处理(Natural Language Processing, NLP)基准主要基于西方金融文本(如美国证券交易委员会(SEC)文件、美国财报和英文金融新闻)所导致的非西方监管框架评估缺失问题。解决方案的关键在于构建并公开发布IndiaFinBench——首个面向印度金融监管文本的大语言模型(Large Language Model, LLM)评估基准,包含406个专家标注的问答对,源自印度证券交易委员会(SEBI)和印度储备银行(RBI)的192份文档,涵盖四种任务类型:监管解读(174项)、数值推理(92项)、矛盾检测(62项)和时间推理(78项)。该基准通过模型二次验证(矛盾检测kappa=0.918)和人工标注一致性测试(kappa=0.611)确保标注质量,并在零样本条件下评估12个模型,揭示数值推理任务最具区分度(模型间准确率差异达35.9个百分点),且存在三个统计显著不同的性能层级,为评估LLM在印度本土金融语境下的能力提供了可靠标准。

链接: https://arxiv.org/abs/2604.19298
作者: Rajveer Singh Pall
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 24 pages, 4 figures, 11 tables. Dataset and evaluation code at this https URL

点击查看摘要

Abstract:We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at this https URL
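摘要中用于划分性能层级的自助法显著性检验(bootstrap,论文为 10,000 次重采样)思路可示意如下;逐题对错数据为假设,重采样次数也缩减以便演示:

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, seed=0):
    """对两个模型的逐题正确性(0/1 列表)做配对自助法,
    返回准确率差 (A - B) 的近似 95% 置信区间。"""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # 有放回地重采样题目下标
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]

a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # 假设的模型 A 逐题对错
b = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]   # 假设的模型 B 逐题对错
ci_low, ci_high = paired_bootstrap_ci(a, b)
```

若置信区间不包含 0,则两模型的准确率差在该置信水平下显著,层级划分即据此得出。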

[IR-8] CS3: Efficient Online Capability Synergy for Two-Tower Recommendation

【速读】:该论文旨在解决多阶段推荐系统中轻量级两塔模型(two-tower model)因架构孤立而导致的表征能力受限、嵌入空间对齐不足以及跨特征交互弱化的问题。现有方法如晚期交互(late interaction)和知识蒸馏(knowledge distillation)虽可缓解上述问题,但常引入额外延迟或难以适配在线学习场景。其解决方案的关键在于提出一种名为 Capability Synergy (CS3) 的高效在线框架,通过三个核心机制实现:(1) 周期自适应结构(Cycle-Adaptive Structure)实现塔内特征去噪与自我修正;(2) 跨塔同步机制(Cross-Tower Synchronization)通过轻量级相互感知提升嵌入空间对齐;(3) 级联模型共享(Cascade-Model Sharing)复用下游模型知识以增强跨阶段一致性,从而在保持毫秒级延迟的前提下显著提升推荐效果。

链接: https://arxiv.org/abs/2604.19269
作者: Lixiang Wang,Shaoyun Shi,Peng Wang,Wenjin Wu,Peng Jiang
机构: Kuaishou Technology(快手科技)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:To balance effectiveness and efficiency in recommender systems, multi-stage pipelines commonly use lightweight two-tower models for large-scale candidate retrieval. However, the isolated two-tower architecture restricts representation capacity, embedding-space alignment, and cross-feature interactions. Existing solutions such as late interaction and knowledge distillation can mitigate these issues, but often increase latency or are difficult to deploy in online learning settings. We propose Capability Synergy (CS3), an efficient online framework that strengthens two-tower retrievers while preserving real-time constraints. CS3 introduces three mechanisms: (1) Cycle-Adaptive Structure for self-revision via adaptive feature denoising within each tower; (2) Cross-Tower Synchronization to improve alignment through lightweight mutual awareness between towers; and (3) Cascade-Model Sharing to enhance cross-stage consistency by reusing knowledge from downstream models. CS3 is plug-and-play with diverse two-tower backbones and compatible with online learning. Experiments on three public datasets show consistent gains over strong baselines, and deployment in a large-scale advertising system yields up to 8.36% revenue improvement across three scenarios while maintaining ms-level latency.
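作为背景,两塔检索的基本打分骨架如下草图所示(两座塔均为占位函数;CS3 的三个机制即叠加在这一骨架之上):

```python
def two_tower_score(user_feats, item_feats, user_tower, item_tower):
    """用户塔与物品塔各自独立编码,线上以内积打分(物品向量可预先建近邻索引)。"""
    u = user_tower(user_feats)
    v = item_tower(item_feats)
    return sum(a * b for a, b in zip(u, v))

user_tower = lambda f: [0.5 * x for x in f]   # 占位用户塔
item_tower = lambda f: [x + 1.0 for x in f]   # 占位物品塔
s = two_tower_score([2.0, 4.0], [1.0, 0.0], user_tower, item_tower)
# u=[1.0, 2.0], v=[2.0, 1.0] -> s == 4.0
```

两塔彼此独立正是摘要所说表征能力与跨特征交互受限的根源,也是 CS3 引入跨塔同步的动机。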

[IR-9] GraphRAG -IRL: Personalized Recommendation with Graph-Grounded Inverse Reinforcement Learning and LLM Re-ranking

【速读】:该论文旨在解决个性化推荐中模型难以同时捕捉用户序列偏好、应对反馈稀疏性和语义模糊性的问题,尤其针对纯提示驱动(prompt-based)大语言模型(LLM)在排序任务中存在的校准不足、候选列表顺序敏感性和流行度偏差等局限。其核心解决方案是提出一种混合推荐框架GraphRAG-IRL,关键在于融合三重机制:基于物品、类别与概念的异构知识图谱构建特征表示,利用最大熵逆强化学习(Maximum Entropy IRL)进行校准的预排序,以及通过角色引导(persona-guided)的LLM对短候选列表进行语义增强重排序,并将二者结果进行融合。实验证明,该方法显著优于监督基线,在MovieLens和KuaiRand数据集上NDCG@10提升达15.7%–16.8%,且各模块具有超加性增益,体现了结构化知识与生成式推理协同优化的有效性。

链接: https://arxiv.org/abs/2604.19128
作者: Siqi Liang,Xiawei Wang,Yudi Zhang,Jiaying Zhou
机构: Purdue University (普渡大学); University of California, Davis (加州大学戴维斯分校); Iowa State University (爱荷华州立大学); University of Minnesota (明尼苏达大学)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Personalized recommendation requires models that capture sequential user preferences while remaining robust to sparse feedback and semantic ambiguity. Recent work has explored large language models (LLMs) as recommenders and re-rankers, but pure prompt-based ranking often suffers from poor calibration, sensitivity to candidate ordering, and popularity bias. These limitations make LLMs useful semantic reasoners, but unreliable as standalone ranking engines. We present GraphRAG-IRL, a hybrid recommendation framework that combines graph-grounded feature construction, inverse reinforcement learning (IRL), and persona-guided LLM re-ranking. Our method constructs a heterogeneous knowledge graph over items, categories, and concepts, retrieves both individual and community preference context, and uses these signals to train a Maximum Entropy IRL model for calibrated pre-ranking. An LLM is then applied only to a short candidate list, where persona-guided prompts provide complementary semantic judgments that are fused with IRL rankings. Experiments show that GraphRAG-IRL is a strong standalone recommender: IRL-MLP with GraphRAG improves NDCG@10 by 15.7% on MovieLens and 16.6% on KuaiRand over supervised baselines. The results also show that IRL and GraphRAG are superadditive, with the combined gain exceeding the sum of their individual improvements. Persona-guided LLM fusion further improves ranking quality, yielding up to 16.8% NDCG@10 improvement over the IRL-only baseline on MovieLens ml-1m, while score fusion on KuaiRand provides consistent gains of 4–6% across LLM providers.
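摘要提到将 IRL 预排序与 persona 引导的 LLM 重排序做融合;下面借用通用的倒数排名融合(RRF)作示意,并非论文中的具体融合公式,排序数据亦为假设:

```python
def fuse_rankings(rank_a, rank_b, k=60):
    """RRF 式融合:每个物品的得分为其在各排序中 1/(k+rank) 的累加,再按总分重排。"""
    scores = {}
    for ranking in (rank_a, rank_b):
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

irl_rank = ["i3", "i1", "i2"]   # 假设的 IRL 预排序
llm_rank = ["i1", "i2", "i3"]   # 假设的 persona 引导 LLM 重排序
fused = fuse_rankings(irl_rank, llm_rank)
# fused == ["i1", "i3", "i2"]
```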

[IR-10] hink Before Writing: Feature-Level Multi-Objective Optimization for Generative Citation Visibility

【速读】:该论文旨在解决生成式 AI(Generative AI)问答引擎中内容可见性优化的问题,传统搜索引擎优化(SEO)方法无法有效适配生成式答案引擎(Generative Answer Engines)依赖选择性引用(Selective Citation)而非排序检索的机制。现有生成式引擎优化(GEO)方法主要基于词元级(Token-level)文本重写,存在可解释性差、难以控制引用可见性与内容质量之间权衡的局限。其解决方案的关键在于提出 FeatGEO——一个特征级(Feature-level)、多目标优化框架,将网页抽象为可解释的结构、内容和语言属性(Structural, Content, and Linguistic Properties),在特征空间中进行高层优化,并利用语言模型将特征配置转化为自然语言,从而实现高层优化与底层生成的解耦。实验表明,FeatGEO 在三个生成式引擎上均显著提升引用可见性并维持或改善内容质量,且学习到的特征配置具有跨不同规模语言模型的泛化能力。

链接: https://arxiv.org/abs/2604.19113
作者: Zikang Liu,Peilan Xu
机构: Nanjing University of Information Science and Technology(南京信息工程大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:Generative answer engines expose content through selective citation rather than ranked retrieval, fundamentally altering how visibility is determined. This shift calls for new optimization methods beyond traditional search engine optimization. Existing generative engine optimization (GEO) approaches primarily rely on token-level text rewriting, offering limited interpretability and weak control over the trade-off between citation visibility and content quality. We propose FeatGEO, a feature-level, multi-objective optimization framework that abstracts webpages into interpretable structural, content, and linguistic properties. Instead of directly editing text, FeatGEO optimizes over this feature space and uses a language model to realize feature configurations into natural language, decoupling high-level optimization from surface-level generation. Experiments on GEO-Bench across three generative engines demonstrate that FeatGEO consistently improves citation visibility while maintaining or improving content quality, substantially outperforming token-level baselines. Further analyses show that citation behavior is more strongly influenced by document-level content properties than by isolated lexical edits, and that the learned feature configurations generalize across language models of different scales.

[IR-11] RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora ACL2026

【速读】:该论文旨在解决现有问答(QA)评估基准与真实世界检索增强生成(RAG)系统之间存在的不匹配问题:传统基准假设文档间差异显著且重叠极少,而实际应用中的语料库(如金融报告、法律条文和专利文献)具有高度冗余性和强文档相似性,导致标准评估方法无法准确衡量检索器性能。解决方案的关键在于提出 RARE(Redundancy-Aware Retrieval Evaluation)框架,其核心创新包括:(i) 将文档分解为原子事实以实现对冗余信息的精确追踪;(ii) 引入 CRRF(Criteria-Responsive Rank Fusion)机制,通过分项评分并基于排序融合的方式提升大语言模型(LLM)生成数据的质量与可靠性。这一方法使评估结果更贴近真实部署环境,揭示了当前基准未能捕捉到的检索器鲁棒性差距。

链接: https://arxiv.org/abs/2604.19047
作者: Hanjun Cho,Jay-Yoon Lee
机构: Allganize; Seoul National University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to ACL 2026 (Main Conference)

点击查看摘要

Abstract:Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. On the other hand, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. RAG benchmark data usually requires multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores criteria separately and fuses decisions by rank, improving the reliability of generated data. Applying RARE to Finance, Legal, and Patent corpora, we introduce RedQA, where a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.
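按摘要的描述推测,RARE 将文档拆成原子事实之后,检索评估只需检查"检索集合整体是否覆盖回答所需的全部事实",从而不惩罚冗余文档之间的互换。一个极简示意(事实标注与文档均为假设):

```python
def perf_recall(retrieved, doc_facts, required_facts):
    """冗余感知的召回判定:检索文档集合所覆盖的原子事实只要包含
    全部所需事实即算命中,而不要求命中某个特定的标注文档。"""
    covered = set()
    for doc in retrieved:
        covered |= doc_facts.get(doc, set())
    return required_facts <= covered

doc_facts = {"d1": {"f1", "f2"}, "d2": {"f2", "f3"}, "d3": {"f1", "f3"}}
ok = perf_recall(["d1", "d2"], doc_facts, required_facts={"f1", "f2", "f3"})
# ok is True:d1 与 d2 合并覆盖了 f1-f3,尽管 d3 同样能提供部分事实
```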

[IR-12] STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation ACL2026

【速读】:该论文旨在解决Temporal Knowledge Graph (TKG)外推任务中两个关键挑战:一是由于TKG的结构演化表示与大语言模型(LLM)语义空间之间浅层对齐导致的空间-时间信息丢失;二是LLM微调过程中TKG演化结构特征的逐步稀释。解决方案的关键在于提出Spatial-Temporal Knowledge Adapter (STK-Adapter),其核心创新包括三个MoE(Mixture of Experts)模块:Spatial-Temporal MoE用于捕捉TKG中的空间结构和时间模式,Event-Aware MoE建模事件链内的复杂时序语义依赖关系,Cross-Modality Alignment MoE通过TKG引导的注意力专家实现深层次跨模态对齐,从而有效增强TKG推理能力并提升跨数据集泛化性能。

链接: https://arxiv.org/abs/2604.19042
作者: Shuyuan Zhao,Wei Chen,Weijie Zhang,Xinrui Hou,Junfeng Shen,Boyan Shi,Shengnan Guo,Youfang Lin,Huaiyu Wan
机构: Beijing Jiaotong University (北京交通大学); Guilin University of Electronic Technology (桂林电子科技大学)
类目: Information Retrieval (cs.IR)
备注: Accepted by ACL 2026

点击查看摘要

Abstract:Temporal Knowledge Graph (TKG) extrapolation aims to predict future events based on historical facts. Recent studies have attempted to enhance TKG extrapolation by integrating TKG’s evolving structural representations and textual event chains into Large Language Models (LLMs). Yet, two main challenges limit these approaches: (1) The loss of essential spatial-temporal information due to shallow alignment between TKG’s graph evolving structural representation and the LLM’s semantic space, and (2) the progressive dilution of the TKG’s evolving structural features during LLM fine-tuning. To address these challenges, we propose the Spatial-Temporal Knowledge Adapter (STK-Adapter), which flexibly integrates the evolving graph encoder and the LLM to facilitate TKG reasoning. In STK-Adapter, a Spatial-Temporal MoE is designed to capture spatial structures and temporal patterns inherent in TKGs. An Event-Aware MoE is employed to model intricate temporal semantic dependencies within event chains. In addition, a Cross-Modality Alignment MoE is proposed to facilitate deep cross-modality alignment by TKG-guided attention experts. Extensive experiments on benchmark datasets demonstrate that STK-Adapter significantly outperforms state-of-the-art methods and exhibits strong generalization capabilities in cross-dataset tasks. The code is available at this https URL.
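文中三个模块均采用 MoE(Mixture of Experts)结构;其核心的 softmax 门控 + top-k 专家加权可示意如下(专家与门控 logits 均为占位,并非 STK-Adapter 的实际实现):

```python
import math

def moe_forward(x, experts, gate_logits, top_k=2):
    """按门控 logits 选出 top-k 专家,softmax 归一化后加权其输出。"""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:top_k]
    weights = [math.exp(gate_logits[i]) for i in top]
    z = sum(weights)
    return sum((w / z) * experts[i](x) for w, i in zip(weights, top))

experts = [lambda v: 2.0 * v, lambda v: v + 1.0, lambda v: -v]   # 占位专家
out = moe_forward(1.0, experts, gate_logits=[1.0, 1.0, -5.0], top_k=2)
# 前两位专家各占 0.5 权重:out == 0.5*2.0 + 0.5*2.0 == 2.0
```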

[IR-13] Personalized Benchmarking: Evaluating LLM s by Individual Preferences ACL2026

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)对齐评估中忽视个体用户偏好这一关键问题。现有基准测试通过平均所有用户的偏好来计算模型的总体排名,忽略了不同用户在不同情境下的个性化需求。其解决方案的核心在于构建个性化LLM评估框架,利用ELO评分和Bradley-Terry模型为115名活跃Chatbot Arena用户分别计算模型排名,并结合主题建模与写作风格分析揭示用户特征与模型偏好之间的关联。研究发现,个体模型排名与聚合排名存在显著差异(Bradley-Terry相关系数ρ=0.04,ELO相关系数ρ=0.43),且用户在主题兴趣和沟通风格上表现出高度异质性,进一步证明了基于主题与风格特征的紧凑组合可有效预测用户特定的模型偏好,从而推动个性化LLM评估体系的发展。

链接: https://arxiv.org/abs/2604.18943
作者: Cristina Garbacea,Heran Wang,Chenhao Tan
机构: University of Chicago (芝加哥大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using ELO ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLM models diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only \rho = 0.04 (57% of users show near-zero or negative correlation) and ELO ratings showing moderate correlation ( \rho = 0.43 ). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLM models according to individual user preferences.
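文中的个性化 ELO 排名本质上是对单个用户的对战记录跑标准 ELO 更新;示意如下(K 值与对战数据均为演示假设):

```python
def elo_update(r_winner, r_loser, k=32):
    """标准 ELO:按胜者的期望胜率调整双方评分。"""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

def user_elo(battles, init=1000.0):
    """battles: 某个用户的 (胜者模型, 负者模型) 对战序列,返回其个性化评分表。"""
    ratings = {}
    for winner, loser in battles:
        rw, rl = ratings.get(winner, init), ratings.get(loser, init)
        ratings[winner], ratings[loser] = elo_update(rw, rl)
    return ratings

ratings = user_elo([("gpt", "llama"), ("gpt", "llama"), ("llama", "claude")])
# 该用户视角下 gpt > llama > claude
```

对全体用户的对战做同一计算得到的是聚合排名;论文发现二者常常严重背离。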

[IR-14] Dual-View Training for Instruction-Following Information Retrieval

【速读】:该论文旨在解决指令遵循的信息检索(Instruction-following Information Retrieval, IF-IR)问题,即现有检索系统在面对用户明确约束(如属性要求、排除条件或输出偏好)时,难以区分仅语义相关但不满足指令的文档与真正符合指令的文档。其核心挑战在于传统检索模型主要依赖语义相关性训练,忽视了对指令意图的敏感性。解决方案的关键在于提出一种基于极性反转(polarity reversal)的双视角数据合成策略:给定一个查询、一个满足指令的相关文档和一个语义相关但违反指令的硬负样本,利用大语言模型(LLM)生成一个互补指令,使得这两个文档在新指令下互换相关性标签。通过这种方式,模型被迫在同一文档对上重新评估其相关性,从而学习到如何依据指令而非固定主题线索进行判断,显著提升了检索系统对指令的理解能力与执行准确性。

链接: https://arxiv.org/abs/2604.18845
作者: Qingcheng Zeng,Puxuan Yu,Aman Mehta,Fuheng Zhao,Rajhans Samdani
机构: Northwestern University (西北大学); Snowflake Inc. (雪花科技)
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Instruction-following information retrieval (IF-IR) studies retrieval systems that must not only find documents relevant to a query, but also obey explicit user constraints such as required attributes, exclusions, or output preferences. However, most retrievers are trained primarily for semantic relevance and often fail to distinguish documents that match the topic from those that satisfy the instruction. We propose a dual-view data synthesis strategy based on polarity reversal: given a query, a document that is relevant under the instruction, and a hard negative that matches the query but violates the instruction, we prompt an LLM to generate a complementary instruction under which the two documents swap relevance labels. By presenting the same document pair under complementary instructions that invert their relevance labels, the training signal forces the retriever to reconsider the same candidate set through the instruction, rather than relying on fixed topical cues. On a 305M-parameter encoder, our method improves performance on the FollowIR benchmark by 45%, surpassing general-purpose embedding models of comparable or larger scale. Through head-to-head comparisons at matched data budgets, we further show that data diversity and instruction supervision play complementary roles: the former preserves general retrieval quality, while the latter improves instruction sensitivity. These results highlight the value of targeted data synthesis for building retrieval systems that are both broadly capable and instruction-aware.
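摘要描述的"极性反转"双视角合成,其产出的训练样本结构可粗略示意如下(互补指令在论文中由 LLM 生成,此处示例文本均为手写假设):

```python
from dataclasses import dataclass

@dataclass
class Example:
    query: str
    instruction: str
    positive: str  # relevant under this instruction
    negative: str  # topically matching but violating the instruction

def dual_view(ex, complementary_instruction):
    """Polarity reversal: under the complementary instruction the same
    two documents swap relevance labels, so the retriever must judge
    the pair through the instruction rather than topical cues."""
    flipped = Example(ex.query, complementary_instruction,
                      positive=ex.negative, negative=ex.positive)
    return [ex, flipped]

# Hypothetical instance; the complementary instruction would be LLM-generated:
ex = Example(
    query="laptops for travel",
    instruction="only consider laptops under 1 kg",
    positive="Review of a 0.9 kg ultralight laptop ...",
    negative="Review of a 2.5 kg gaming laptop ...",
)
pair = dual_view(ex, "only consider laptops over 2 kg")
print(pair[1].positive.startswith("Review of a 2.5 kg"))  # → True
```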

人机交互

[HC-0] “We are currently clean on OPSEC”: Why JD Can’t Encrypt

【速读】:该论文试图解决的问题是:在高度敏感的军事通信场景中,即便使用了端到端加密工具(如Signal),为何仍会发生信息泄露?其核心在于揭示技术手段与社会因素之间的复杂互动,强调仅依赖加密机制无法保障整体信息安全性。解决方案的关键在于通过应用pi演算(applied pi-calculus)对安全设施配置进行形式化建模,证明即使在理论上安全的加密环境中,由于权力结构失衡、操作人员与官员间的不对等关系,以及加密工具可能引发的“虚假安全感”导致的信息过度共享行为,仍会破坏实际的保密性。论文进一步指出,这种技术滥用不仅限于特定事件,更可能因政治急躁和流程简化而引发地缘政治风险,从而呼吁从“人-技术-制度”协同视角重新审视加密系统的部署与使用策略。

链接: https://arxiv.org/abs/2604.19711
作者: Maurice Chiodo,Toni Erskine,Dennis Müller,James G. Wright
机构: 未知
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 31 pages

点击查看摘要

Abstract:We analyse the 2025 Signalgate leak of sensitive US military information by the Trump administration, addressing why confidentiality was violated (messages leaked to the press) in spite of encryption (Signal), to deepen the socio-technical considerations when designing and deploying encryption. First, we use applied pi-calculus to formally model the boutique secure facility setup requested by the US Defence Secretary, to prove that a leak would not be prevented. We then examine how using a secure channel might still not give overall information security, as, in this case, power imbalances between personnel and officials led to the application of cryptography that compromised their operational security. We look at how cryptographic tools may have instilled a false sense of security, and led officials to “overshare”. We then apply this analysis to the Trump administration’s general desire to burn through political, legal, and now technical process, and demonstrate geopolitical harms that may arise from such ineffective use of cryptography in a brief use case. We conclude that, even with advancements in usability of cryptographic tools, genuine message security is still out of reach of the “average user”.

[HC-1] Remindful: Designing Reminder Systems for Caregiver Interpretation in Dementia Care

【速读】:该论文旨在解决当前数字提醒系统在痴呆症照护中仅提供单向提示、缺乏对护理者长期参与度和行为模式理解支持的问题。其解决方案的关键在于设计了一个以护理者为中心的提醒平台 Remindful,通过引入面向护理者的警报、摘要与回顾功能,将提醒数据转化为可解释的上下文信息,从而提升护理者对居家照护中日常规律的认知与判断能力。研究强调提醒交互数据具有高度情境依赖性,因此系统应被视为辅助基础设施而非中立的行为传感器,需保留不确定性并支持真实家庭环境中的情境化解读。

链接: https://arxiv.org/abs/2604.19574
作者: Joy Lai,Alex Mihailidis
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Digital reminder systems are widely used in dementia care to support everyday tasks, but they are typically designed for one-way prompting rather than helping caregivers interpret engagement over time. We present Remindful, a caregiver-informed reminder platform that extends task prompting with caregiver-facing alerts, summaries, and review features to support awareness in home-based dementia care. Drawing on formative caregiver interviews, lived-experience advisor input, and in-home deployments with two caregiver-PLwD dyads, we examine how reminder-based caregiver awareness functions in practice. Our findings show that reminder systems can support caregiver reassurance, household coordination, and awareness of routines over time, but that reminder interaction data is highly context-dependent. Household participation, prompt attribution, routine mismatch, accessibility barriers, and technical failures all shaped what reminder logs could reasonably mean. We argue that reminder systems should not be treated as neutral behavioral sensors, but designed as assistive infrastructures for caregiver interpretation that preserve uncertainty and support contextual sensemaking in real homes.

[HC-2] InvestChat: Exploring Multimodal Interaction via Natural Language, Touch, and Pen in an Investment Dashboard

【速读】:该论文旨在解决新手投资者在股票市场探索过程中因信息复杂、交互方式单一而导致的参与度低和理解困难的问题。解决方案的关键在于设计并实现了一个多模态平板应用InvestChat,该应用通过多个协同视图与大语言模型(Large Language Model, LLM)驱动的聊天功能相结合,支持自然语言、触控和手写笔输入等多种交互方式,从而提升用户在投资决策过程中的沉浸感与操作灵活性。研究结果表明,多模态输入的融合显著增强了用户参与度,且自然语言交互被证实为最有效的交互方式。

链接: https://arxiv.org/abs/2604.19537
作者: Sarah Lykke Tost,Adson Lucas de Paiva Sales,Henrik Østergaard,Vaishali Dhanoa,Gabriela Molina León
机构: Aarhus University (奥胡斯大学); TU Wien (维也纳工业大学)
类目: Human-Computer Interaction (cs.HC)
备注: Poster accepted at AVI 2026

点击查看摘要

Abstract:We designed and implemented InvestChat, a multimodal tablet-based application that supports stock market exploration with multiple coordinated views and an LLM-powered chat. We evaluated the application with 12 novice investors. Our findings suggest that combining natural language, touch, and pen input during stock market exploration facilitates user engagement. Participants leveraged the modalities in complementary ways, enjoying the freedom of choice and finding natural language most effective.

[HC-3] Translating Ethical Frameworks Into User-Centred Anti-Social Behaviour Interventions

【速读】:该论文旨在解决英国英格兰与威尔士地区2025年记录的一百万起反社会行为(Anti-Social Behaviour, ASB)案件所引发的社区凝聚力削弱问题,其核心挑战在于现行法定指导方针多采用惩罚性干预措施,缺乏技术手段支持且未将伦理框架嵌入政府系统设计中。解决方案的关键在于将ASB干预重构为一个人机交互(Human-Computer Interaction)问题,通过两个数字设计方案嵌入伦理框架以增强公众责任感并预防ASB:一是基于英国公众意见研究提炼出的伦理主题(包括惩罚比例性、个性化和责任归属),二是开发了基于二维码(QR)的公众报告界面与前置式网络意识课程,后者在实施惩罚前提供教育引导。该方法结合结构化访谈与在线调查验证了伦理框架及QR接口的有效性,表明技术干预可作为对传统惩罚手段的补充而非替代,从而实现更平衡、可持续的治理策略。

链接: https://arxiv.org/abs/2604.19492
作者: Rachel Hill,Tom Owen,Julian Hough
机构: Swansea University (斯旺西大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted for publication in HCII 2026 (Springer CCIS). This is the author preprint version. 11 pages, 4 figures

点击查看摘要

Abstract:In 2025 one million Anti-Social Behaviour (ASB) cases were recorded in England & Wales, impacting community cohesion. Statutory guidance presents punitive interventions that lack technological input and does not often root ethical frameworks within government system design. This work takes a novel approach in framing ASB intervention as a human-computer interaction problem by embedding an ethical framework into two digital designs, aiming to increase public responsibility and prevent ASB. The first design is extracted from UK public opinion research; its ethical themes include punitive proportionality, personalisation, and responsibility. The second is a set of digital interventions comprising QR-based public reporting interfaces and a web-based ASB awareness course that precedes punitive escalation. Our methodology involves structured interviews and online surveys. Results positively evaluated the framework and QR interfaces. Such outcomes could inform the expansion of technological intervention utilisation that does not replace existing punitive approaches, but balances them.

[HC-4] Fairness Audits of Institutional Risk Models in Deployed ML Pipelines

【速读】:该论文旨在解决高等教育机构中部署的机器学习风险模型(如早期预警系统,EWS)在资源分配中的公平性问题,特别是这些模型如何因性别、年龄和居住状态等因素导致系统性误判与资源错配。其解决方案的关键在于提出并实施一种基于复制(replica-based)的审计方法:通过使用机构提供的训练数据和设计规范重建EWS模型,从训练数据、模型预测到后处理阶段全面评估公平性指标,从而识别出各环节中不公平现象的产生与累积机制。研究发现,年轻、男性及国际学生被过度标记为高风险,而年龄较大和女性学生虽有相似辍学风险却常被低估,且后处理阶段将概率分布简化为分位数层级进一步放大了这种偏差,强调了在算法审计中同时考察构念效度(construct validity)与统计公平性的重要性。

链接: https://arxiv.org/abs/2604.19468
作者: Kelly McConvey,Dipto Das,Maya Ghai,Angelina Zhai,Rosa Lee,Shion Guha
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Fairness audits of institutional risk models are critical for understanding how deployed machine learning pipelines allocate resources. Drawing on multi-year collaboration with Centennial College, where our prior ethnographic work introduced the ASP-HEI Cycle, we present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. Our audit reveals systematic misallocation: younger, male, and international students are disproportionately flagged for support, even when many ultimately succeed, while older and female students with comparable dropout risk are under-identified. Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. This work provides a replicable methodology for auditing institutional ML systems and shows how disparities emerge and compound across stages, highlighting the importance of evaluating construct validity alongside statistical fairness. It contributes one empirical thread to a broader program investigating algorithms, student data, and power in higher education.

[HC-5] Discerning Authorship in Online Health Communities: Experience Trust and Transparency Implications for Moderating AI

【速读】:该论文旨在解决在线健康社区中因大型语言模型(Large Language Models, LLMs)生成医疗建议而可能引发的社区信任危机问题,特别是当用户无法识别建议是否由AI生成时,这种不可见性可能削弱社区成员对内容的信任。其解决方案的关键在于通过增强透明度与改进LLM的自我调节能力(self-moderation),来提升用户对AI生成建议的接受度和利用效率。研究发现,尽管用户普遍难以辨别AI与人类撰写的建议,但健康状况背景显著影响判断,且缺乏可靠的判断线索导致错误的启发式评估;因此,推动AI使用透明化并优化LLM的自检机制,是增强社区对AI辅助建议信任的核心路径。

链接: https://arxiv.org/abs/2604.19429
作者: Yefim Shulman,Agnieszka Kitkowska,Mark Warner
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:For online health communities, community trust is paramount. Yet, advances in Large Language Models (LLMs) generating advice may erode this trust, especially if users cannot identify whether LLMs have been used. We investigate the feasibility of community-based detection of health advice authorship and how self-moderation of LLMs could help enhance advice utilization. In an online experiment, we evaluate people’s ability to distinguish AI-generated from human-written advice across two health conditions, considering lived experience with a condition, AI-recognition training, and user attitudes towards transparency and trust around AI use. Our results indicate the need for transparency coupled with trust. We find little evidence of people’s ability to discern advice authorship. However, we find a consistent effect of the health condition. Our qualitative findings identify unreliable signals, resulting in flawed heuristic evaluations of the advice. Our findings point to opportunities to improve the self-moderation of LLM-based AI and aid community-based AI moderation.

[HC-6] seneca: A Personalized Conversational Planner

【速读】:该论文旨在解决知识工作者在自我调节、优先级设定和反思过程中面临的工具支持不足问题,现有数字待办事项应用缺乏目标表征,纸质规划框架难以个性化,而对话式人工智能系统则缺少持久性和问责机制;更关键的是,这些工具普遍未能识别用户表达需求与深层需求之间的差异。解决方案的核心在于提出一个名为seneca的个性化AI辅助规划框架,其关键创新是整合三类方法的优势:通过对话代理(conversational agent)引导反思并提出澄清性问题,借助持久化数据库(persistent database)追踪目标与行为模式,以及利用处理器(processor)实现信息同步,从而构建一个具备自适应性、持续性与目标对齐能力的智能规划系统。

链接: https://arxiv.org/abs/2604.19425
作者: Simon Bohnen,Gabriel Garbers,Lukas Ellinger,Georg Groh
机构: Technical University of Munich (慕尼黑工业大学)
类目: Human-Computer Interaction (cs.HC)
备注: accepted to the CHI '26 Workshop on Tools for Thought

点击查看摘要

Abstract:Knowledge work demands sustained self-regulation, prioritization, and reflection-yet existing planning tools only partially support these needs. Digital to-do list applications feature task persistence but lack goal representation. Paper-based planning frameworks offer effective planning strategies but cannot adapt to individual users. Conversational AI systems enable flexible reflection but lack persistence and accountability. Moreover, none of these tools address a fundamental challenge: users’ expressed demands often diverge from their underlying needs. This paper introduces seneca, a conceptual framework for a personalized, AI-assisted planner that integrates the complementary strengths of these three approaches. seneca combines a conversational agent that scaffolds reflection and asks clarifying questions, a persistent database that tracks goals and behavioral patterns, and a processor that synchronizes information between them. We describe this architecture and outline a phased evaluation strategy combining automated testing with simulated users and longitudinal human studies measuring goal attainment, planning realism, and goal-value alignment.

[HC-7] Seeing Your Mindless Face: How Viewing One’s Live Self Interrupts Mindless Short-Form Video Scrolling

【速读】:该论文旨在解决短时视频(short-form videos)成瘾性消费引发的“大脑退化”(brain rot)问题,即用户在无意识状态下持续浏览短视频导致自我控制能力下降的现象。解决方案的关键在于引入自指线索(self-related cues),作为一种内在的自我反思策略,通过在使用过程中周期性地呈现不同形式的自指提示(如实时摄像头画面、自拍照片、姓名文字和黑屏),来打断用户的无意识观看行为,并促进其主动停止使用。研究发现,尽管黑屏被设计为对照条件,却反而激发了最高的应用使用意愿,表明用户更倾向于接受隐含的自我觉察提示而非显性的视觉刺激,这为移动场景中基于自我意识干预的设计提供了实证依据与优化方向。

链接: https://arxiv.org/abs/2604.19424
作者: Kyungjin Kim,Minjeong Kim,Soobeen Jeong,Jiyeon So,Hayeon Song
机构: Sungkyunkwan University (成均馆大学); Yonsei University (延世大学)
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 8 pages, 4 figures, 1 table. Accepted to Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26)

点击查看摘要

Abstract:The widespread, addictive consumption of short-form videos, which allegedly causes “brain rot,” has become an urgent public concern. This study proposes that self-related cues serve as an intrinsic, self-reflective strategy that enhances self-control over media overuse. We developed an app that de-immerses users by periodically displaying different self-related cues (live camera, selfie, name in text, and black screen) and tested their effects in a laboratory experiment (N=84). Overall, findings show that self-related cues effectively disrupt mindless viewing, enabling users to voluntarily stop short-form video consumption. Interestingly, the black screen, intended as a control, elicited the greatest intention to use the app: Participants noted in the follow-up interview that they preferred the subtler reflection on a black screen over the explicit image from a live camera. The findings offer practical design guidelines for implementing self-awareness interventions in mobile contexts, including which modalities work best and how real-time contextual anchoring enhances effectiveness.

[HC-8] Allow Me Into Your Dream: A Handshake-and-Pull Protocol for Sharing Mixed Realities in Spontaneous Encounters

【速读】:该论文旨在解决公共场景下混合现实(Mixed Reality, MR)系统中缺乏社会可读协议的问题,即用户如何在不破坏自然交互体验的前提下,安全、明确地进入他人的MR空间。现有方案如AirDrop和SharePlay虽能实现设备间共享,但其复杂度远低于MR共享场景,且无法满足社交语境下的即时性和共识需求。解决方案的关键在于提出TouchPort协议,通过一个具身化手势——握手并拉拽动作——将原本多阶段的共享流程(发现、授权、确认、允许、空间共位、同步对象、权限管理)压缩为单一操作,同时完成意图表达、同意协商与临时共享层的建立,从而实现从孤立现实到自发共享现实的平滑过渡,并为普适MR环境中的伦理共识与交互设计提供新范式。

链接: https://arxiv.org/abs/2604.19423
作者: Botao Amber Hu,Yilan Elan Tao,Bernhard Riecke,Yue Li
机构: Reality Design Lab (现实设计实验室); Simon Fraser University (西蒙弗雷泽大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学)
类目: Human-Computer Interaction (cs.HC)
备注: Submitted for UIST 2026

点击查看摘要

Abstract:Mixed reality systems support shared anchors and co-located interaction, yet they lack a socially legible protocol for entering another person’s mixed reality in public settings. We frame this as a protocol problem: co-located MR sharing requires a staged sequence – Discover, Consent, Confirm, Allow, Spatial Colocation, Sync Objects, Permission Management – each demanding user understanding and agreement. Using AirDrop and Apple Vision Pro SharePlay as a baseline, we show that MR encounter complexity far exceeds file transfer, yet must feel equally effortless. We present TouchPort, an embodied sharing protocol that collapses this multi-stage sequence into a single gesture: a handshake and pull that simultaneously signals intent, negotiates consent, and initiates a temporary shared encounter layer between otherwise separate mixed realities. Through three implied scenarios, we demonstrate the protocol’s expressive range in the transition from isolated to spontaneously shared realities. We discuss how embodied gestures can address the consent problem in ubiquitous MR and examine the ethical tensions of encounter protocols for MR futures.

[HC-9] Secure Storage and Privacy-Preserving Scanpath Comparison via Garbled Circuits in Eye Tracking

【速读】:该论文旨在解决眼动追踪(eye tracking)数据在虚拟现实(VR)和移动平台日益增长背景下,扫描路径(scanpath)比较分析缺乏隐私保护能力的问题。现有方法难以在不泄露原始数据的前提下实现安全的相似性计算,限制了其在真实场景中的应用。解决方案的关键在于提出一种基于混淆电路(garbled circuit, GC)的隐私保护框架,在半诚实模型下支持两种配置:一是双方协作计算相似度而不暴露输入数据;二是服务器辅助模式下加密扫描路径可存储与处理,数据所有者保持离线状态。所有解密和比较操作均在GC内部完成,实验表明MultiMatch、ScanMatch和SubsMatch等算法在加密状态下仍能保持与明文结果高度一致的准确性,且运行时间和通信开销可控,验证了该方案在实际部署中的可行性与扩展潜力。

链接: https://arxiv.org/abs/2604.19422
作者: Suleyman Ozdel,Amr Nader,Yasmeen Abdrabou,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注: Accepted at Proceedings of the ACM on Human-Computer Interaction (PACMHCI), Vol. 10, Article ETRA008, to be presented at ETRA 2026. 24 pages (including appendix)

点击查看摘要

Abstract:With the growing use of eye tracking on VR and mobile platforms, gaze data is increasing. While scanpath comparison is important to gaze behavior analysis, existing methods lack privacy-preserving capabilities for real-world use. We present a garbled-circuit (GC)-based approach enabling secure storage and privacy-preserving scanpath comparison under the semi-honest model. It supports two configurations: (1) a two-party setting where the data owner and processor jointly compute similarity scores without revealing their inputs, and (2) a server-assisted setting where encrypted scanpaths are stored and processed while the data owner remains offline. All decryption and comparison operations are executed inside the GC. Experiments on three eye-tracking datasets evaluate fidelity, runtime, and communication, and show secure results for MultiMatch, ScanMatch, and SubsMatch closely match plaintext outcomes, with manageable runtime and communication overhead. Tests under various network conditions indicate that the design remains feasible for real-world privacy-preserving scanpath analysis and can be extended to other GC-based behavioral algorithms.
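论文中在混淆电路内部执行的扫描路径比较,其明文类比可用 AOI 标签序列上的归一化编辑距离来示意(仅为草图:实际的 MultiMatch/ScanMatch 等算法更复杂,且在论文方案中该计算完全发生在 GC 内、双方互不暴露输入):

```python
def scanpath_similarity(a, b):
    """Normalized edit-distance similarity over two AOI-label sequences:
    1.0 means identical scanpaths, 0.0 means maximally different."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1,  # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return 1.0 - d[n][m] / max(n, m, 1)

# Two viewers' AOI sequences (hypothetical labels A-D):
print(scanpath_similarity("ABCD", "ABCD"))  # → 1.0
print(scanpath_similarity("ABCD", "ABDD"))  # → 0.75 (one substitution)
```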

[HC-10] MER 2026: From Discriminative Emotion Recognition to Generative Emotion Understanding

【速读】:该论文旨在解决情感识别研究从传统的判别式(discriminative)向生成式(generative)范式转变过程中所面临的挑战,特别是如何利用多模态大语言模型(Multimodal Large Language Models, MLLMs)实现细粒度、可解释的情感理解。其解决方案的关键在于构建一个系统化的多任务挑战框架——MER2026,包含四个并行赛道:MER-Cross关注双人交互场景下的情感分析,MER-FG聚焦于细粒度情绪识别,MER-Prefer旨在预测人类对不同情感描述的偏好,MER-PS则基于生理信号进行情绪识别。这一设计充分挖掘了MLLMs在词汇丰富性和跨模态理解上的优势,推动情感计算向更精细、更具解释性与实际应用价值的方向发展。

链接: https://arxiv.org/abs/2604.19417
作者: Zheng Lian,Xiaojiang Peng,Kele Xu,Ziyu Jia,Xinyi Che,Zebang Cheng,Fei Ma,Laizhong Cui,Yazhou Zhang,Xin Liu,Liang Yang,Jia Li,Fan Zhang,Erik Cambria,Guoying Zhao,Bjorn W. Schuller,Jianhua Tao
机构: Institute of Automation, CAS (中国科学院自动化研究所); Shenzhen Technology University (深圳技术大学); National University of Defense Technology (国防科技大学); Sichuan University (四川大学); Shenzhen University (深圳大学); Guangdong Lab of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳)); Tianjin University (天津大学); Shanghai Jiao Tong University (上海交通大学); Dalian University of Technology (大连理工大学); Hefei University of Technology (合肥工业大学); The Chinese University of Hong Kong (香港中文大学); Nanyang Technological University (南洋理工大学); University of Oulu (奥卢大学); Technical University of Munich (慕尼黑工业大学); Tsinghua University (清华大学)
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:MER2026 marks the fourth edition of the MER series of challenges. The MER series provides valuable data resources to the research community and offers tasks centered on recent research trends, establishing itself as one of the largest challenges in the field. Throughout its history, the focus of MER has shifted from discriminative emotion recognition to generative emotion understanding. Specifically, MER2023 concentrated on discriminative emotion recognition, restricting the emotion recognition scope to fixed basic labels. In MER2024 and MER2025, we transitioned to generative emotion understanding and introduced two new tasks: fine-grained emotion recognition and descriptive emotion analysis, aiming to leverage the extensive vocabulary and multimodal understanding capabilities of Multimodal Large Language Models (MLLMs) to facilitate fine-grained and explainable emotion recognition. Building on this trajectory, MER2026 continues to follow these research trends and contains four tracks: MER-Cross shifts the focus from individual to dyadic interaction scenarios; MER-FG centers on fine-grained emotion recognition; MER-Prefer aims to predict human preferences regarding different emotion descriptions; MER-PS focuses on emotion recognition based on physiological signals. More details regarding the dataset and baselines are available at this https URL.

[HC-11] Understanding Password Preferences, Memorability, and Security through a Human-Centered Lens

【速读】:该论文旨在解决用户自创密码安全性不足的问题,尤其是在安全性和可用性之间存在权衡的背景下,探索AI生成密码的有效性及用户感知。其解决方案的关键在于通过眼动追踪技术揭示用户在密码创建、选择与记忆过程中的行为特征,并发现视觉注意力对密码熵(entropy)具有显著正向影响——即用户对上下文线索的视觉关注程度越高,生成的密码质量越强。这表明密码安全性不仅取决于生成工具本身(如AI模型或规则随机生成器),还受用户与界面交互时的注意力分配驱动,从而为基于注意力机制的安全设计提供了新思路。

链接: https://arxiv.org/abs/2604.19410
作者: Duru Paker,Suleyman Ozdel,Enkelejda Kasneci
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at ETRA 2026 (ACM Symposium on Eye Tracking Research and Applications), Short Paper. 8 pages

点击查看摘要

Abstract:Passwords remain the primary authentication method, yet user-created passwords are often the weakest due to the security-usability trade-off. Although AI-based password generators are emerging, little is known about their effectiveness and user perceptions. This eye-tracking study examined how behavior during password creation, selection, and memorization relates to objective and subjective password quality. Four password models, three AI-based (DeepSeek-API, ChatGPT-API, PassGPT) and one rule-based random generator, generated suggestions from participants’ self-generated passwords across four website contexts. Eye movements were recorded throughout the experiment. Results confirm the expected trade-off between AI-generated password strength and human memorability but also reveal a novel behavioral link. Despite stronger AI-generated passwords, participants favored self-generated ones. Notably, visual attention to contextual cues was significantly correlated with higher password entropy. This suggests that security is shaped not only by the generation tool but also by users’ visual engagement with contextual cues, highlighting the potential of attention-driven security design.
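摘要中衡量密码强度所用的"熵"通常指如下粗略估计(长度乘以所用字符池大小的对数;这只是常见近似,并非论文使用的确切度量):

```python
import math
import string

def entropy_bits(password):
    """Back-of-the-envelope password entropy: length * log2(pool size),
    where the pool is the union of character classes actually used."""
    pool = 0
    if any(c in string.ascii_lowercase for c in password):
        pool += 26
    if any(c in string.ascii_uppercase for c in password):
        pool += 26
    if any(c in string.digits for c in password):
        pool += 10
    if any(c in string.punctuation for c in password):
        pool += len(string.punctuation)  # 32 printable ASCII symbols
    return len(password) * math.log2(pool) if pool else 0.0

print(round(entropy_bits("sunshine"), 1))      # lowercase only → 37.6 bits
print(round(entropy_bits("R7#kQm9!xB2p"), 1))  # four classes → 78.7 bits
```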

[HC-12] VIVA Stimuli: A Web-Based Platform for Eye Tracking Stimuli

【速读】:该论文旨在解决眼动追踪研究中因实验刺激呈现方式不统一而导致的可复现性问题。现有工具往往依赖特定硬件或编程技能,难以实现跨实验室的一致性刺激呈现,从而影响研究结果的验证与比较。解决方案的关键在于提出一个基于网页的标准化刺激呈现平台 VIVA Stimuli,其支持多种任务类型(如注视、平滑追随、认知负荷等),兼容各类眼动追踪设备(包括屏幕型和可穿戴式 VOG、LFI 传感器及 EOG 设备),并通过 ArUco 标记与 WebSocket 架构实现时间同步,同时提供可视化实验流程编辑器以促进协议共享和跨实验室精确复制。

链接: https://arxiv.org/abs/2604.19397
作者: Suleyman Ozdel,Virmarie Maquiling,Kadir Burak Buldu,Yasmeen Abdrabou,Enkelejda Kasneci
机构: Technical University of Munich(慕尼黑工业大学); Munich Center for Machine Learning(慕尼黑机器学习中心)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at the METR Workshop, ETRA 2026 (ACM Symposium on Eye Tracking Research and Applications). 7 pages, 4 figures

点击查看摘要

Abstract:Reproducibility in eye-tracking research is increasingly important as researchers conduct diverse experiments and seek to validate or replicate findings. However, exact replication remains challenging due to differences in laboratory practices and experimental setups. Inconsistent stimulus presentation can yield divergent metrics from identical oculomotor behavior, yet the stimulus layer remains largely unstandardized. Existing tools often require programming expertise or depend on specific hardware vendors. We introduce VIVA Stimuli, a web-based platform for standardized eye-tracking stimulus presentation. It provides configurable task types, including fixation, smooth pursuit, cognitive load, blink, slippage, content display, and questionnaires within a unified environment. The platform supports any eye-tracking technology, including wearable and screen-based VOG trackers, LFI sensors, and EOG devices. ArUco markers enable synchronization for trackers with scene cameras, while a WebSocket architecture ensures temporal synchronization for those without. A visual experiment flow editor allows protocols to be exported and shared, enabling identical stimulus replication across laboratories.

[HC-13] Mind2Drive: Predicting Driver Intentions from EEG in Real-world On-Road Driving

【速读】:该论文旨在解决在真实道路环境下基于脑电图(EEG)信号实现驾驶员意图预测的难题,其核心挑战在于EEG信号的非平稳性以及认知-运动准备过程的复杂性。解决方案的关键在于提出并验证了一个集成于真实电动车辆中的同步多传感器EEG驱动意图预测框架,通过在32次实际驾驶会话中采集数据,并系统评估12种深度学习架构,发现TSCeption模型在平均准确率(0.907)和宏F1分数(0.901)上表现最优;同时研究表明,最小化预处理优于复杂的伪迹处理流程,且预测性能在400–600 ms窗口内达到峰值,对应于驾驶操作前的关键神经准备阶段,从而实现了早期、稳定的EEG驱动意图解码。

链接: https://arxiv.org/abs/2604.19368
作者: Ghadah Alosaimi,Hanadi Alhamdan,Wenke E,Stamos Katsigiannis,Amir Atapour-Abarghouei,Toby P. Breckon
机构: Imam Mohammad Ibn Saud Islamic University (伊玛目穆罕默德本沙特伊斯兰大学); Durham University (杜伦大学); Princess Nourah bint Abdulrahman University (努拉公主大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages, 4 figures, 6 tables, conference

点击查看摘要

Abstract:Predicting driver intention from neurophysiological signals offers a promising pathway for enhancing proactive safety in advanced driver assistance systems, yet remains challenging in real-world driving due to EEG signal non-stationarity and the complexity of cognitive-motor preparation. This study proposes and evaluates an EEG-based driver intention prediction framework using a synchronised multi-sensor platform integrated into a real electric vehicle. A real-world on-road dataset was collected across 32 driving sessions, and 12 deep learning architectures were evaluated under consistent experimental conditions. Among the evaluated architectures, TSCeption achieved the highest average accuracy (0.907) and Macro-F1 score (0.901). The proposed framework demonstrates strong temporal stability, maintaining robust decoding performance up to 1000 ms before manoeuvre execution with minimal degradation. Furthermore, additional analyses reveal that minimal EEG preprocessing outperforms artefact-handling pipelines, and prediction performance peaks within a 400-600 ms interval, corresponding to a critical neural preparatory phase preceding driving manoeuvres. Overall, these findings support the feasibility of early and stable EEG-based driver intention decoding under real-world on-road conditions. Code: this https URL.

[HC-14] Co-Refine: AI-Powered Tool Supporting Qualitative Analysis

【速读】:该论文旨在解决定性编码(qualitative coding)在大规模数据集分析中因时间漂移(temporal drift)导致的编码一致性下降问题,从而影响研究可信度。现有计算机辅助定性数据分析(CAQDAS)工具虽能支持数据管理,但缺乏实时检测编码偏移的流程。解决方案的关键在于提出一个名为Co-Refine的AI增强型编码平台,其核心是一个三阶段审计流水线:第一阶段基于嵌入(embedding)计算确定性指标以衡量数学一致性;第二阶段将大语言模型(LLM)的判断限定在±0.15范围内,确保其输出与确定性分数对齐;第三阶段基于历史模式生成代码定义,形成持续反馈循环。该方法证明了确定性评分可有效约束LLM输出,实现可靠、实时的定性分析审计信号。

链接: https://arxiv.org/abs/2604.19309
作者: Athikash Jeyaganthan,Kai Xu,Franziska Becker,Steffen Koch
机构: University of Nottingham (诺丁汉大学); University of Stuttgart (斯图加特大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures. Includes details on system architecture, a three-stage audit pipeline, and a formative user study

点击查看摘要

Abstract:Qualitative coding relies on a researcher’s application of codes to textual data. As coding proceeds across large datasets, interpretations of codes often shift (temporal drift), reducing the credibility of the analysis. Existing Computer-Assisted Qualitative Data Analysis (CAQDAS) tools provide support for data management but offer no workflow for real-time detection of these drifts. We present Co-Refine, an AI-augmented qualitative coding platform that delivers continuous, grounded feedback on coding consistency without disrupting the researcher’s workflow. The system employs a three-stage audit pipeline: Stage 1 computes deterministic embedding-based metrics for mathematical consistency; Stage 2 grounds LLM verdicts within \pm0.15 of the deterministic scores; and Stage 3 produces code definitions from previous patterns to create a deepening feedback loop. Co-Refine demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals for qualitative analysis.
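摘要中 Stage 2 "将 LLM 判定限定在确定性分数 ±0.15 内"的约束,一种可能的实现是简单的区间钳制(仅为对摘要的示意性解读,并非作者的确切实现):

```python
def ground_verdict(llm_score, deterministic_score, band=0.15):
    """Clamp the LLM's consistency verdict to within +/- band of the
    deterministic embedding-based score, keeping the audit signal
    anchored to the mathematical baseline (scores assumed in [0, 1])."""
    lo = max(0.0, deterministic_score - band)
    hi = min(1.0, deterministic_score + band)
    return min(max(llm_score, lo), hi)

print(round(ground_verdict(0.90, 0.50), 2))  # → 0.65, pulled down to the band edge
print(round(ground_verdict(0.55, 0.50), 2))  # → 0.55, already inside the band
```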

[HC-15] When Transparency Falls Short: Auditing Platform Moderation During a High-Stakes Election

【速读】:该论文试图解决的问题是:在重大政治事件期间,社交媒体平台是否以及如何调整其内容审核(content moderation)策略以应对系统性风险。研究发现,八个主要欧洲社交平台在2024年欧洲议会选举前后八个月内的自报审核行为并未表现出显著变化,表明平台可能未采取实质性调整措施,或现有透明度数据库的结构未能充分揭示潜在调整。解决方案的关键在于强化监管执行力度和改进数据获取机制,从而确保平台履行其在维护民主进程中的责任,凸显当前基于自我监管模式的局限性。

链接: https://arxiv.org/abs/2604.19285
作者: Benedetta Tessa,Gautam Kishore Shahi,Amaury Trujillo,Stefano Cresci
机构: University of Pisa (比萨大学); University of Duisburg-Essen (杜伊斯堡-埃森大学); Istituto Italiano di Tecnologia (意大利技术研究院)
类目: ocial and Information Networks (cs.SI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:During major political events, social media platforms encounter increased systemic risks. However, it is still unclear if and how they adjust their moderation practices in response. The Digital Services Act Transparency Database provides-for the first time-an opportunity to systematically examine content moderation at scale, allowing researchers and policymakers to evaluate platforms’ compliance and effectiveness, especially at high-stakes times. Here we analyze 1.58 billion self-reported moderation actions by the eight largest social media platforms in Europe over an eight-month period surrounding the 2024 European Parliament elections. We found that platforms did not exhibit meaningful signs of adaptation in moderation strategies as their self-reported enforcement patterns did not change significantly around the elections. This raises questions about whether platforms made any concrete adjustments, or whether the structure of the database may have masked them. On top of that, we reveal that initial concerns regarding platforms’ transparency and accountability still persist one year after the launch of the Transparency Database. Our findings highlight the limits of current self-regulatory approaches and point to the need for stronger enforcement and better data access mechanisms to ensure that online platforms meet their responsibilities in protecting the democratic processes.

[HC-16] Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

Summary: This paper addresses the problem that current evaluations of Large Language Models (LLMs) on medical question answering rely mainly on semantic-similarity metrics, which fail to reflect true medical accuracy or the associated health-equity risks. The key to the solution is VB-Score (Verification-Based Score), a new evaluation framework that independently assesses medical QA models along four dimensions: entity recognition, semantic similarity, factual consistency, and structured information completeness, providing a more comprehensive and objective measure of medical reliability and equity risk.

Link: https://arxiv.org/abs/2604.19281
Authors: Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang
Institutions: University of Texas at San Antonio; Khulna University of Engineering and Technology
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted in the Ninth Annual ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) 2026

Abstract:The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model’s answers match semantically, and therefore do not provide a true indication of the model’s medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models’ semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what’s known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.
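The abstract names four independently evaluated components but does not give an aggregation formula. The sketch below is a hypothetical illustration of component-wise scoring in the spirit of VB-Score: the function name, component keys, the unweighted-mean aggregation, and the semantic-vs-entity "gap" diagnostic are all illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a component-wise QA score (not the paper's formula).
# Each component score is assumed to lie in [0, 1].

def vb_score(components: dict) -> dict:
    """Aggregate four component scores into an overall score plus a diagnostic."""
    required = {"entity_recognition", "semantic_similarity",
                "factual_consistency", "structural_completeness"}
    missing = required - components.keys()
    if missing:
        raise ValueError(f"missing components: {missing}")
    overall = sum(components[k] for k in required) / len(required)
    # Surface the failure mode the abstract highlights: high semantic
    # similarity can mask weak entity-level accuracy.
    gap = components["semantic_similarity"] - components["entity_recognition"]
    return {"overall": round(overall, 3),
            "semantic_entity_gap": round(gap, 3)}

report = vb_score({
    "entity_recognition": 0.55,
    "semantic_similarity": 0.90,
    "factual_consistency": 0.70,
    "structural_completeness": 0.65,
})
print(report)  # a large positive gap flags semantically fluent but entity-weak answers
```

A reporting scheme like this keeps the per-component discrepancy visible instead of collapsing everything into a single similarity number, which is the core argument of the paper.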

[HC-17] Designing Transparent AI-Mediated Language Support for Intergenerational Family Communication

Summary: This paper addresses the challenge that intergenerational linguistic differences pose to effective and intimate family communication, proposing translation visibility as a means of supporting cross-generational understanding. The key to the solution is GenSync, a chat-based interface offering three translation-visibility modes: no translation, black-box translation, and transparent translation that displays both the original and the interpreted message. Results show that transparent translation significantly improved conversational quality, intimacy, and usability, whereas black-box translation often disrupted conversational flow, highlighting the central role of translation visibility as a form of interpretive mediation in intergenerational language support.

Link: https://arxiv.org/abs/2604.19276
Authors: Sora Kang, Youjin Hwang, Joonhwan Lee
Institutions: Seoul National University
Subjects: Human-Computer Interaction (cs.HC)
Comments: Designing Interactive Systems Conference (DIS Companion '26), June 13–17, 2026, Singapore, Singapore

Abstract:Intergenerational linguistic differences pose challenges to effective and intimate family communication. This paper presents GenSync, a chat-based interface that supports intergenerational understanding through different forms of translation visibility. We conducted a controlled within-subjects study with 16 family dyads (32 participants), comparing three conditions: no translation, black-box translation, and transparent translation that displays both original and interpreted messages. The results show that translation visibility plays a critical role in shaping conversational experiences. Transparent translation supported conversational quality, intimacy, and usability, while black-box translation often disrupted conversational flow. These findings position intergenerational language support as a form of interpretive mediation and contribute design implications for AI-mediated communication in socially sensitive contexts.

[HC-18] Warmth and Competence in the Swarm: Designing Effective Human-Robot Teams

Summary: This paper addresses the limited understanding of how humans socially perceive robot teams in human-robot collaboration, focusing on how swarm behaviors (broadcast duration, separation distance, individual speed) shape perceptions of warmth and competence. The key to the solution is manipulating swarm behaviors in a collective search task and applying the competence-warmth framework. The results show that longer broadcast durations increased perceived warmth, larger separation distances increased perceived competence, and individual robot speed had no significant effect. More importantly, these social perceptions predicted participants' team preferences more strongly than task performance did, indicating that designing robot swarms requires attending to both technical performance and social factors to achieve effective human-robot teaming.

Link: https://arxiv.org/abs/2604.19270
Authors: Genki Miyauchi, Roderich Groß, Chaona Chen
Institutions: Unknown
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 4 figures, camera-ready version for ANTS 2026

Abstract:As groups of robots increasingly collaborate with humans, understanding how humans perceive them is critical for designing effective human-robot teams. While prior research examined how humans interpret and evaluate the abilities and intentions of individual agents, social perception of robot teams remains relatively underexplored. Drawing on the competence-warmth framework, we conducted two studies manipulating swarm behaviors in completing a collective search task and measured the social perception of swarm behaviors when human participants are either observers (Study 1) and operators (Study 2). Across both studies, our results show that variations in swarm behaviors consistently influenced participants’ perceptions of warmth and competence. Notably, longer broadcast durations increased perceived warmth; larger separation distances increased perceived competence. Interestingly, individual robot speed had no effect on either of the perceptions. Furthermore, our results show that these social perceptions predicted participants’ team preferences more strongly than task performance. Participants preferred robot teams that were both warm and competent, not those that completed tasks most quickly. These findings demonstrate that human-robot interaction dynamically shapes social perception, underscoring the importance of integrating both technical and social considerations when designing robot swarms for effective human-robot collaboration.

[HC-19] OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting

Summary: This paper addresses the difficulty users face in expressing multifaceted intents through linear text prompts in LLM-driven, prompt-based interactive systems, where existing approaches lack expressiveness and editability for complex, multi-layered tasks. The key to the solution is Object-Oriented Prompting (OOPrompt), a paradigm that treats prompts as structured, manipulable artifacts that users can create, edit, iterate on, and reuse, unifying and generalizing several existing prompt-interaction mechanisms while improving the flexibility and efficiency of prompt engineering.

Link: https://arxiv.org/abs/2604.19114
Authors: Tengyou Xu, Detao Ma, Xiang ‘Anthony’ Chen
Institutions: University of California, Los Angeles
Subjects: Human-Computer Interaction (cs.HC)
Comments: 20 pages, 8 figures, to appear in EICS 2026

Abstract:The rise of large language models (LLMs) has given rise to a class of prompt-based interactive systems where users primarily express their input in natural language. However, composing a prompt as a linear text string becomes unwieldy when capturing users’ multifaceted intents. We present Object-Oriented Prompting (OOPrompt), an emergent interaction paradigm that enables users to create, edit, iterate, and reuse prompts as structured, manipulable artifacts, unifying and generalizing several existing point systems. We first outlined a design space from existing work and built an early prototype, which we deployed as a probe in a formative study with 20 participants. Their feedback informed an expanded OOPrompt design space. We then developed the full OOPrompt prototype and conducted a validation study to further understand OOPrompt’s added values and trade-offs. We expect the OOPrompt design space to provide theoretical and empirical guidance to the design and engineering of prompt-based, LLM-enabled interactive systems.

[HC-20] Revisiting Framing Codebooks with AI: Employing Large Language Models as Analytical Collaborators in Deductive Content Analysis

Summary: This paper addresses the limitations of traditional codebook construction and revision in news framing analysis: theory-driven codebooks often prove hard to apply to large-scale, cross-cultural, evolving news corpora because of ambiguous rules, hard-to-resolve borderline cases, and lagging theoretical frameworks. The key to the solution is a workflow that combines theoretical frameworks with data-driven exploration, positioning Large Language Models (LLMs) as analytic collaborators rather than automated classifiers: through researcher-model dialogue, the workflow externalizes decision rules, surfaces latent dimensions, and supports iterative codebook refinement, enhancing methodological flexibility and interpretive power while preserving researchers' authority over frame construction.

Link: https://arxiv.org/abs/2604.19111
Authors: Diego Gomez-Zara, Hernán Valdivieso, Jorge Pérez, Denis Parra, Sebastián Valenzuela
Institutions: University of Notre Dame; Pontificia Universidad Católica de Chile; Cero.ai; Millennium Institute for Foundational Research on Data; Millennium Nucleus on Digital Inequalities and Opportunities
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Codebooks are central to framing research, providing theoretically grounded criteria for analyzing news content. While traditionally codebooks are built from theoretical frameworks and researchers’ knowledge, applying these codebooks to large news corpora often exposes ambiguities, borderline cases, and underspecified rules that are difficult to resolve through theory alone. Moreover, news corpora evolve over time and differ across cultures, necessitating that researchers revisit the theoretical frameworks underlying these codebooks. In this article, we propose a workflow that uses Large Language Models (LLMs) to augment the creation and refinement of framing codebooks by combining theoretical frameworks with data-driven exploration. Rather than treating LLMs as automated classifiers, this approach positions them as analytic collaborators that help externalize decision rules, surface latent dimensions, and support iterative revisions of codebooks through dialogues between researchers and their data. We illustrate this workflow using a dataset of Latin American news coverage, demonstrating how the application of LLMs’ capabilities has led to the surfacing of latent patterns, the generation of frame distinctions, and the adaptation of frameworks to new contexts. This method provides an LLM-assisted strategy that supports methodology creativity while preserving researchers’ interpretative authority.

[HC-21] Relational AI in Education: Reciprocity Participatory Design and Indigenous Worldviews

Summary: This paper addresses the concern that current applications of generative AI in education overemphasize efficiency, automation, and individualized assistance, potentially weakening the social, constructive, and relational character of learning, while existing AIED research has yet to articulate how AI can be designed to sustain the social and ecological relationships through which learning occurs. The key to the solution is re-centering education as a relational practice: learner-AI interactions are framed as context-specific relationships with clearly defined purposes and boundaries rather than as substitutes for human interaction. Grounded in participatory design and in Indigenous worldviews (Aboriginal Australian, Native American, and Mesoamerican traditions) that foreground reciprocity and relational accountability, the paper proposes a reciprocity-centered design orientation that shifts AIED design from purely technical optimization toward supporting collaborative learning and community sustainability, including when not to use AI, how to define pedagogical boundaries, and how to promote responsible AIED innovation.

Link: https://arxiv.org/abs/2604.19099
Authors: Roberto Martinez-Maldonado, Vanessa Echeverria, Jenna Hawes, YJ Kim, Zara Maddigan, Mikaela Milesi, Todd Nelson, Yi-Shan Tsai
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted

Abstract:Education is not merely the transmission of information or the optimisation of individual performance; it is a fundamentally social, constructive, and relational practice. However, recent advances in generative artificial intelligence (GenAI) increasingly emphasise efficiency, automation, and individualised assistance, risking the weakening of relational learning processes. Despite growing adoption, AI in education (AIED) research has yet to fully articulate how AI can be designed in ways that sustain the social and ecological relationships through which learning occurs. In this paper, we re-centre education as relational and frame learner-AI interactions as context-specific relationships with clearly defined purposes and boundaries, rather than positioning them as substitutes for, or replacements of, human interaction. Grounded in participatory design practices and inspired by Indigenous worldviews (including Aboriginal Australian, Native American, and Mesoamerican traditions) that foreground reciprocity and relational accountability, we argue that meaningful educational AI should support learning with others rather than replace them. We advance this perspective by: i) conceptualising AIED as a relational design problem grounded in reciprocity; ii) articulating key tensions introduced by GenAI in education; and iii) outlining design directions that expand the AIED design space toward reciprocity, including when not to use AI, how to define pedagogical boundaries, and how to support responsible uses of AIED innovations that sustain communities and natural environments.

[HC-22] Cultural Newcomers Dining Across Borders: Need-Based Design Envision of Mixed Media Integration in MR for Foreign Menu Understanding and Ordering

Summary: This paper addresses the cognitive barriers and social anxiety that cultural newcomers (CNs), including new immigrants and international students, experience when ordering in foreign restaurants due to cultural unfamiliarity and language barriers; existing translation tools fall short in information delivery and daily interaction support. The key to the solution is a mixed media ordering assistant integrating images, video, and 3D models: through participatory design, the authors identify users' core needs and conceptualize the system along four dimensions (key features, user interaction, media hierarchy, and information presentation) to reduce cultural barriers, linguistic barriers, and cognitive load, thereby improving CNs' dining experience.

Link: https://arxiv.org/abs/2604.19088
Authors: Ying Zhang, Daoxin Chen
Institutions: Carnegie Mellon University; Indiana University Bloomington
Subjects: Human-Computer Interaction (cs.HC)
Comments: 10 pages, 5 figures

Abstract:Cultural newcomers (CNs), including new immigrants and international students, often encounter cognitive barriers and social anxiety, exacerbated by unfamiliar cultural terminology in daily interactions. This research examines these challenges in the context of ordering in foreign restaurants. Current translation tools have significant limitations in their information delivery with current media presentation methods. This research investigates the challenges and needs of CNs in ordering scenarios in a foreign restaurant through interview sessions (N = 13) and explored their expectation of mixed media integration (Image, Video, 3D Model) through a participatory design session that featured an immersive restaurant experience to support brainstorming. Based on qualitative analysis of participants’ needs and expectations, the mixed media ordering assistant is conceptualized across 4 key dimensions: Key Features, User interaction, Media hierarchy, and Information presentation, with the objective of alleviating cultural barrier, linguistic barrier, cognitive load and improving the dining experience for CNs.

[HC-23] Analysis of AWW (Anganwadi Workers) Training Content ILA (Incremental Learning Approach) Modules Following CDT (Component Display Theory)

Summary: This paper addresses frontline health workers' (Anganwadi Workers, AWWs) insufficient grounding in nutrition and epidemiology, unevenly distributed training resources, and the lack of sustained refresher training, all of which limit their capability and efficiency. The key to the solution is a content analysis that breaks training modules into content types such as facts, concepts, procedures, and principles, and frames learning objectives using Component Display Theory (CDT), from which structured, customizable pedagogies are derived. Ultimately, a gamified, immersive Android learning app is being developed to deliver flexible, frequent, job-relevant refresher training, enabling precise delivery and continuous reinforcement of training content.

Link: https://arxiv.org/abs/2604.19032
Authors: Arka Majhi, Satish B. Agnihotri
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computer Science and Game Theory (cs.GT)
Comments: Ph.D. Seminar report submitted for the degree of Doctor of Philosophy

Abstract:POSHAN Abhiyan envisages capacity building of AWWs or frontline health workers through 21 training modules of ILA (Incremental Learning Approach), modularising the net learning content into smaller learning topics to help them perform their daily activities. It envisions building skilled AWWs, strengthening supervisory hierarchies, and improving coordination between AWWs (ICDS) services and health programs to achieve common goals such as increasing awareness, improving access to health and nutrition services, and reducing deaths and malnutrition. To better understand the contents of ILA literature, we conducted a content analysis by further breaking down the modules into content types such as facts, concepts, procedures, and principles. Then we framed learning objectives for teaching AWWs. We applied CDT (Component Display Theory by David Merrill) to map the contents with the desired learning objective, following the Specification of Objective chart. In this way, one can easily develop pedagogies from a new training literature. The challenges in framing learning objectives and pedagogies are: The AWWs do not have a (formal/scientific) nutrition and epidemiology background. Therefore, it is important to teach them through examples, familiar to them. AWWs are not evenly and structurally trained across districts. Training materials should be customized based on language, location, and prior knowledge. Delayed refresher courses render them underprepared for their jobs. To overcome these problems, we are developing an Android app based on gamified learning to provide refresher training to AWWs. Conducting content analysis, framing learning objectives, and developing pedagogical approaches will help conceptualize the gamified application.

[HC-24] Physical and Augmented Reality based Playful Activities for Refresher Training of ASHA Workers in India

Summary: This paper addresses the ineffectiveness of conventional training in child-immunization knowledge for community health workers (CHWs, i.e., ASHAs) in India. Prior research shows that traditional pedagogy does little to build their capacity, and with smartphones becoming widespread, ICT-based training strategies are needed. The key to the solution is designing and comparing two refresher training tools, a physical card game and an Augmented Reality (AR) card game. The results show that the AR version, being more interactive and intuitive, significantly outperforms the conventional approach in learning outcomes and knowledge retention, offering an innovative path to strengthening frontline health workers' capacity.

Link: https://arxiv.org/abs/2604.18959
Authors: Arka Majhi, Satish B. Agnihotri, Aparajita Mondal
Institutions: Indian Institute of Technology Bombay
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: Accepted in the Asian HCI Symposium 2022

Abstract:Recent health surveys in India highlight the alarming child malnutrition levels and lower rates of complete child immunization in many parts of India. Previous researches report that the conventional training pedagogy of the CHWs (Community Healthcare Workers) or the ASHAs (Accredited Social Health Activists) in India is ineffective in enhancing their capacity. Considering that the CHWs are getting equipped with smartphones, it calls for a rethinking of their training pedagogy using the ICT approach. Two refresher training tools were developed to make learning the child immunization schedule more exciting and conceptually engaging for ASHAs. The physical and AR (Augmented Reality) versions of designed card games were compared for effectiveness and knowledge retention, pre, and post-intervention through questionnaire tests conducted immediately before and after playing multiple sessions. The AR-based play was found to be better in learning and knowledge retention with more engagement, mainly due to its interactive and intuitive nature of play.

[HC-25] Relationships Between Trust Compliance and Performance for Novice Programmers Using AI Code Generation

Summary: This paper explores the relationships between novice programmers' trust in AI-driven Development Environments (AIDEs), their coding performance, and their compliance with AI suggestions while programming under time pressure. The findings show that trust did not directly drive compliance; rather, greater compliance was associated with strong performance, and strong performance in turn increased subsequent trust, forming a performance-to-trust feedback loop. The key implication is that instructional design should focus on improving coding performance to indirectly build trust, and on guiding users toward appropriate use of generative AI to promote desirable interaction outcomes, rather than treating trust as the mediating lever.

Link: https://arxiv.org/abs/2604.18948
Authors: Nicholas Gardella, Matthew L. Bolton, Sara L. Riggs
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Objective. To explore how novice programmers’ trust in Artificial Intelligence-driven Development Environments (AIDEs) relates to their coding performance and AI compliance while programming under time pressure. Background. Computer programming has undergone rapid upheaval due to state-of-the-art AIDEs, which provide clever automation for many aspects of software development. A longstanding interest of researchers of automation more generally has been the attitude of trust. Decades of research seek to explain how influencing trust can help to achieve desirable outcomes in different domains, but very limited work has provided similar focus on trust in AIDEs. Method. We collected subjective measures of trust along with objective measures of performance and AIDE compliance from a diverse group of 27 novice programmers between two study locations. Results. Our results corroborated traditional understandings of how trust changes through experiences. However, we did not find a relationship between trust and subsequent compliance during programming tasks. Greater compliance was associated with strong performance, and strong performance led to greater subsequent trust. Conclusion. Our findings raise new questions about the utility of trust in the context of interacting with AIDEs and generative AI. We call for further research into the effect of trust on compliance to recommendations from imperfect AI. Application. This work can inform the design of training and educational content for generative AI use within and beyond software development. Instructional designers should consider risks of AI misuse and disuse and focus on promoting desirable interaction outcomes, regardless of trust’s connection to them.

[HC-26] Choose Your Own Adventure: Non-Linear AI-Assisted Programming with EvoGraph

Summary: This paper addresses the mismatch between today's linear, chat-based AI programming assistants and the iterative, branching nature of programming itself: developers struggle to explore alternative paths, manage prompting sequences, and trace code changes. The key to the solution is EvoGraph, an IDE plugin that automatically records AI interactions and code changes as a lightweight, interactive development graph, letting developers compare, merge, and revisit prior AI-assisted programming states through graph manipulation, thereby supporting safe exploration, efficient iteration, and deeper reflection on problem-solving progress.

Link: https://arxiv.org/abs/2604.18883
Authors: Vassilios Exarhakos, Jinghui Cheng, Jin L.C. Guo
Institutions: McGill University; Polytechnique Montreal
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Current AI-assisted programming tools are predominantly linear and chat-based, which deviates from the iterative and branching nature of programming itself. Our preliminary study with developers using AI assistants suggested that they often struggle to explore alternatives, manage prompting sequences, and trace changes. Informed by these insights, we created EvoGraph, an IDE plugin that integrates AI interactions and code changes as a lightweight and interactive development graph. EvoGraph automatically records a branching AI-assisted coding history and allows developers to manipulate the graph to compare, merge, and revisit prior collaborative AI programming states. Our user study with 20 participants revealed that EvoGraph addressed developers’ challenges identified in our preliminary study while imposing lower cognitive load. Participants also found the graph-based representation supported safe exploration, efficient iteration, and reflection on AI-generated changes. Our work highlights design opportunities for tools to help developers make sense of and act on their problem-solving progress in the emerging AI-mediated programming context.

[HC-27] The Triadic Loop: A Framework for Negotiating Alignment in AI Co-hosted Livestreaming

Summary: This paper addresses the mismatch between the single-user, single-AI dyadic assumption underlying most alignment research and real multi-user social settings such as livestreaming platforms, where streamer, AI co-host, and audience form real-time triadic feedback loops that instruction-following alignment cannot handle. The key to the solution is the Triadic Loop, a conceptual framework that reconceptualizes alignment as temporally reinforced, bidirectional adaptation across three sub-loops (streamer ↔ AI co-host, AI co-host ↔ audience, and streamer ↔ audience), introduces "strategic misalignment" as a mechanism for sustaining community engagement, and proposes three relational evaluation constructs grounded in established instruments, enabling the continued maintenance of social coherence in participatory media environments.

Link: https://arxiv.org/abs/2604.18850
Authors: Katherine Wang, Nadia Berthouze, Aneesha Singh
Institutions: University College London
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 6 pages, 1 figure, Proceedings of the Human-AI Interaction Alignment Workshop at CHI 2026 (CHI26 BiAlign Workshop)

Abstract:AI systems are increasingly embedded in multi-user social environments, yet most alignment frameworks conceptualize interaction as a dyadic relationship between a single user and an AI system. Livestreaming platforms challenge this assumption: interaction unfolds among streamers and audiences in real time, producing dynamic affective and social feedback loops. In this paper, we introduce the Triadic Loop, a conceptual framework that reconceptualizes alignment in AI co-hosted livestreaming as a temporally reinforced process of bidirectional adaptation among three actors: streamer ↔ AI co-host, AI co-host ↔ audience, and streamer ↔ audience. Unlike instruction-following paradigms, bidirectional alignment requires each actor to continuously reshape the others, meaning misalignment in any sub-loop can destabilize the broader system. Drawing on literature from multi-party interaction, collaborative AI, and relational agents, we articulate how AI co-hosts function not only as mediators but as performative participants and community members shaping collective meaning-making. We further propose “strategic misalignment” as a mechanism for sustaining community engagement and introduce three relational evaluation constructs grounded in established instruments. The framework contributes a model of dynamic multi-party alignment, an account of cross-loop reinforcement, and design implications for AI co-hosts that sustain social coherence in participatory media environments.

[HC-28] AffectCity: An Empirical Investigation of Complexity Transparency and Materiality in Shaping Affective Perception of Building Facades

Summary: This paper addresses the empirically underspecified mechanisms by which facade attributes shape human affect, focusing on how quantifiable properties, namely complexity, transparency (window-to-wall ratio), and materiality (proportion of natural versus artificial surfaces), relate to arousal and valence. The key to the solution is a validated pipeline linking machine-vision-derived surface metrics to human affective responses. Empirically, perceived complexity emerges as the dominant affective predictor, with a nonlinear amplification effect at higher complexity levels; perceived materiality significantly mediates the relationship between machine metrics and affective responses, making human perceptual evaluation a necessary intermediate layer; and affective responses are context-dependent, with valence in particular showing low cross-context stability (ICC = 0.332). These results move facade research from descriptive morphological analysis toward perception-grounded predictive modelling, providing an empirical basis for affect-informed design of urban environments.

Link: https://arxiv.org/abs/2604.18768
Authors: Chenxi Wang, Haining Ding, Michal Gath-Morad
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Buildings shape how people feel, yet the mechanisms through which specific facade properties drive affective states remain empirically underspecified. Here we introduce the Cambridge Facade Affect Dataset (CFAD), 86 orthogonally rectified facade images annotated with continuous arousal and valence ratings from 85 participants, and establish a validated pipeline linking machine-vision-derived surface metrics to human affective responses. Focusing on three quantifiable attributes, complexity, transparency (window-to-wall ratio), and materiality (proportion of natural versus artificial surface composition), we show that perceived complexity is the dominant affective predictor, with significant positive associations for both arousal (beta = 0.507, p 0.001) and valence (beta = 0.376, p 0.001) and a curvilinear amplification at higher complexity levels. Transparency exhibits an inverted-U relationship with valence, while increasing surface artificiality suppresses arousal and reduces pleasantness consistent with biophilic response theory. Critically, machine-derived metrics show limited direct predictive power over affective outcomes; mediation analyses reveal that human perceptual evaluation functions as a necessary intermediate layer, with perceived materiality significantly mediating the machine-valence relationship (indirect effect = -0.205, p = 0.003). Cross-context validation demonstrates moderate stability of complexity and materiality ratings across image-based and in-situ conditions, while affective responses, particularly valence, exhibit significant context-dependence (ICC = 0.332). These findings advance facade research from descriptive morphological analysis toward predictive, perception-grounded modelling, and provide an empirically validated basis for affect-informed design of the urban environment.
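The inverted-U relationship the abstract reports (for transparency vs. valence) is the kind of pattern detected by adding a quadratic term to a regression and checking the sign of the curvature coefficient. The sketch below illustrates that idea only: it is not the paper's analysis code, and the data are synthetic.

```python
# Illustrative sketch: fit y = a*x^2 + b*x + c by ordinary least squares
# (normal equations) and read the curvature sign. a < 0 indicates an
# inverted-U relationship. All data below are synthetic.

def solve3(m, v):
    """Solve a 3x3 linear system m @ x = v by Gauss-Jordan elimination."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def quad_fit(xs, ys):
    """Least-squares coefficients (a, b, c) for y = a*x^2 + b*x + c."""
    s = lambda p, q=0: sum((x ** p) * (y ** q) for x, y in zip(xs, ys))
    m = [[s(4), s(3), s(2)],   # normal equations built from moments of x
         [s(3), s(2), s(1)],
         [s(2), s(1), s(0)]]
    v = [s(2, 1), s(1, 1), s(0, 1)]
    return solve3(m, v)

# Synthetic inverted-U: "valence" peaks at mid-level "transparency".
xs = [i / 10 for i in range(11)]
ys = [1.0 - (x - 0.5) ** 2 for x in xs]
a, b, c = quad_fit(xs, ys)
print(a < 0)  # negative curvature: inverted U detected
```

In practice such analyses add the quadratic term alongside covariates and test its significance; the toy fit above only shows the mechanics of recovering the curvature sign.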

[HC-29] Input Visualizations to Track Health Data by Older Adults with Multiple Chronic Conditions

Summary: This paper addresses the lack of active engagement and personal meaning that older adults living with multiple chronic conditions (MCC) experience when collecting health data: with conventional approaches (digital tools or handwritten notes), insight emerges only through later review, offering little support for everyday sensemaking or emotional connection. The key to the solution is data-input visualization using physical tokens, whose tangible, expressive, and personalizable nature lets users recognize trends and discover serendipitous insights during collection itself, while making tracking more enjoyable and self-expressive. Through two-week interview and diary studies, the authors show how older adults integrate such visualizations into daily routines, deepening reflection on their health data and strengthening their motivation to track.

Link: https://arxiv.org/abs/2604.18741
Authors: Shri Harini Ramesh, Foroozan Daneshzand, Matteo Sotelo, Mahsa Sinaei, Fateme Rajabiyazdi
Institutions: University of Calgary; Simon Fraser University; Carleton University
Subjects: Human-Computer Interaction (cs.HC)
Comments: Accepted to the EuroVis 2026 conference

Abstract:Older adults living with multiple chronic conditions (MCC) can considerably benefit from collecting and reflecting on their health data. Many older adults collect their health data using various approaches, such as digital tools or handwritten notebooks. However, in these approaches, the act of collecting data does not itself yield insights; sensemaking and reflection happen only if individuals later review their accumulated records. The daily process of data collection thus offers limited opportunity for individuals to actively engage with their data or find the process personally meaningful and enjoyable. Personal data input visualizations using physical tokens offer a promising solution that can help individuals recognize evolving patterns while collecting data and discover meaningful insights more serendipitously and engagingly. Yet, there is a limited understanding of whether and how older adults living with MCC might adopt physical input visualizations to collect data and reflect on their health, and how the tangible, expressive, and personalizable nature of this process supports their sensemaking and reflection. In this paper, we present the results of our interview and diary studies in which older adults living with MCC inputted health data using physical tokens over two weeks. Our findings highlight the diverse and unique needs of older adults for tracking personal health data, illustrating how they adapt strategies and personalize physical input visualizations to align with their individual needs. We demonstrate how older adults integrated input visualizations into daily routines and leveraged tangible markers to reflect on patterns and behaviors, while enjoying the process of tracking and focusing on personal expression and meaningful reflection. Finally, we provide design considerations for supporting older adults with MCC when inputting health data through physical tokens.

[HC-30] Students Know AI Should Not Replace Thinking but How Do They Regulate It? The TACO Framework for Human-AI Cognitive Partnership

Summary: This paper addresses the difficulty of managing the boundary between cognitive support and cognitive substitution when students use generative AI in educational practice: although students widely recognize that "AI should not replace thinking," this ethical awareness does not translate into structured regulation during actual learning. The key to the solution is TACO (Think-Ask-Check-Own), a process-oriented framework that shifts attention from ethical awareness to mechanisms of cognitive regulation, giving learners actionable strategies for managing cognitive boundaries so that AI can serve sustainably as a dynamic cognitive partner in education.

Link: https://arxiv.org/abs/2604.18737
Authors: Cecilia Ka Yuk Chan
Institutions: Unknown
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:As generative artificial intelligence becomes increasingly embedded in educational practice, a central concern is whether students use AI as cognitive support or as a substitute for thinking. Prior research shows that learners recognise this boundary conceptually and acknowledge that “AI should not replace thinking.” However, whether such awareness translates into structured regulation during actual AI use remains unclear. Drawing on data from Hong Kong secondary students, this study examines how learners perceive their management of the boundary between assistance and outsourcing in practice. Findings show that awareness did not consistently translate into regulation; ethical belief did not necessarily lead to strategic execution; and conceptual endorsement did not guarantee operational behaviour. These findings suggest that the challenge is not teaching students that AI should not replace thinking, as they already know this, but providing them with structured mechanisms to regulate how AI is used within learning processes. In response, the study introduces the TACO framework (Think-Ask-Check-Own), a process-oriented model designed to operationalise the boundary between cognitive support and cognitive substitution. By shifting attention from ethical awareness to cognitive regulation, the study contributes a learner-grounded approach to sustaining AI as a dynamic cognitive partner in education.

[HC-31] SPRITE: From Static Mockups to Engine-Ready Game UI

【速读】:该论文旨在解决游戏用户界面(Game UI)开发中将风格化设计稿转化为可交互引擎资产时面临的挑战,尤其是现有“截图转代码”工具在处理不规则几何形状和深层视觉层级结构时的局限性。其解决方案的关键在于提出SPRITE管道,通过结合视觉语言模型(Vision-Language Models, VLMs)与结构化的YAML中间表示,显式建模复杂容器关系与非矩形布局,从而实现高保真度的UI重建与高效原型迭代。

链接: https://arxiv.org/abs/2604.18591
作者: Yunshu Bai,RuiHao Li,Hao Zhang,Chien Her Lim,Ming Yan,Mengtian Li
机构: Shanghai University (上海大学); MiAO Worlds
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: CHI EA '26

点击查看摘要

Abstract:Game UI implementation requires translating stylized mockups into interactive engine entities. However, current “Screenshot-to-Code” tools often struggle with the irregular geometries and deep visual hierarchies typical of game interfaces. To bridge this gap, we introduce SPRITE, a pipeline that transforms static screenshots into editable engine assets. By integrating Vision-Language Models (VLMs) with a structured YAML intermediate representation, SPRITE explicitly captures complex container relationships and non-rectangular layouts. We evaluated SPRITE against a curated Game UI benchmark and conducted expert reviews with professional developers to assess reconstruction fidelity and prototyping efficiency. Our findings demonstrate that SPRITE streamlines development by automating tedious coding and resolving complex nesting. By facilitating rapid in-engine iteration, SPRITE effectively blurs the boundaries between artistic design and technical implementation in game development. Project page: this https URL
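
SPRITE 的 YAML 中间表示用于显式建模容器嵌套与非矩形布局。下面是一个纯示意性的层级中间表示草图(`ui_ir` 的字段名与 `flatten` 函数均为本文为说明而假设的,并非论文的真实 schema),以深度优先遍历把容器树展平为引擎可消费的实体列表:

```python
# Hypothetical nested intermediate representation of a game UI screenshot.
# Field names ("type", "shape", "label", "children") are illustrative only.
ui_ir = {
    "type": "panel", "shape": "circle", "children": [
        {"type": "button", "shape": "rect", "label": "Start", "children": []},
        {"type": "panel", "shape": "polygon", "children": [
            {"type": "icon", "shape": "rect", "label": "gem", "children": []},
        ]},
    ],
}

def flatten(node, depth=0):
    """Walk the container hierarchy depth-first, yielding (depth, type, label)
    tuples that an engine importer could turn into scene entities."""
    yield (depth, node["type"], node.get("label"))
    for child in node["children"]:
        yield from flatten(child, depth + 1)
```

实际系统中,类似的遍历结果可进一步映射为引擎场景树中的可编辑节点。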

[HC-32] Critical Thinking in the Age of Artificial Intelligence: A Survey-Based Study with Machine Learning Insights

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在教育、职业及日常问题解决中广泛应用背景下,其对人类批判性思维(Critical Thinking)影响的复杂性问题。研究发现,AI使用行为与批判性思维表现之间并非线性关系,而是呈现出多样化模式:部分用户表现出对AI的高度依赖,伴随耐心下降和独立思考能力减弱;而另一些用户则能平衡利用AI支持与自主推理。解决方案的关键在于推动人机协同机制的设计应聚焦于促进反思(Reflection)、验证(Verification)与持续认知努力(Sustained Cognitive Effort),而非简单替代人类思维过程。这要求AI系统不仅提供答案,更需激发用户主动参与逻辑推理与自我校验的行为倾向。

链接: https://arxiv.org/abs/2604.18590
作者: M Murshidul Bari,Akif Islam,Mohd Ruhul Ameen,Abu Saleh Musa Miah,Jungpil Shin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 5 Figures, 2 Tables, Submitted to International Conference On Power, Electronics, Communications, Computing, and Intelligent Infrastructure 2026

点击查看摘要

Abstract:The growing use of artificial intelligence (AI) in education, professional work, and everyday problem-solving has raised important questions about its effect on human reasoning. While AI can improve efficiency, save time, and support learning, repeated dependence on it may also encourage cognitive offloading, reduce productive struggle, and weaken independent critical thinking. This paper investigates the relationship between AI-use behavior and critical-thinking performance through an interview-based survey combined with short logic and reasoning tasks. The findings reveal a mixed pattern: participants largely viewed AI as a tool for speed, convenience, and learning support, yet many also reported reduced patience for sustained effort. Objective reasoning performance varied considerably across individuals, and the analyses suggest that reduced patience and stronger dependence-related tendencies are more closely associated with lower reasoning performance than background characteristics alone. Exploratory clustering further indicates that AI users do not form a single homogeneous group, but instead reflect tentative behavioral profiles, including over-reliant users, mixed-strategy users, and balanced support-seekers. Although the findings are exploratory, they indicate that AI does not affect critical thinking in a uniformly negative or positive way. Instead, its influence appears to depend on the manner in which it is used. The paper therefore argues that effective human-AI collaboration should support reflection, verification, and sustained cognitive effort rather than substitute for them.

[HC-33] CentaurTA Studio: A Self-Improving Human-Agent Collaboration System for Thematic Analysis

【速读】:该论文旨在解决主题分析(Thematic Analysis)在规模化应用中面临的两大挑战:一是人工流程劳动密集、效率低下;二是全自动方法缺乏可控性与可解释的评估机制。其核心解决方案是提出一个名为CentaurTA Studio的基于Web的人机协同系统,关键创新在于三方面:(1) 采用两阶段人类反馈流程,将模拟器草拟与专家验证分离以提升可控性;(2) 引入持续提示优化机制,将已验证的反馈提炼为可复用的对齐原则;(3) 基于评分量表(rubric-based evaluation)的早期停止策略实现过程控制。实验表明,该系统在开放式编码和主题构建任务中均达到最高准确率(最高92.12%),且人机一致性达良好水平(平均κ=0.68),同时仅需约25分钟完成10轮迭代即达最优性能,显著优于纯专家精调方式。

链接: https://arxiv.org/abs/2604.18589
作者: Lei Wang,Min Huang,Eduard Dragut
机构: Temple University (坦普尔大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Thematic analysis is difficult to scale: manual workflows are labor-intensive, while fully automated pipelines often lack controllability and transparent evaluation. We present **CentaurTA Studio**, a web-based system for self-improving human–agent collaboration in open coding and theme construction. The system integrates (1) a two-stage human feedback pipeline separating simulator drafting and expert validation, (2) persistent prompt optimization that distills validated feedback into reusable alignment principles, and (3) rubric-based evaluation with early stopping for process control. Across three domains, CentaurTA achieves the strongest performance in both Open Coding and Theme Construction, reaching up to 92.12% accuracy and consistently outperforming baseline systems. Agreement between the rubric-based LLM judge and human annotators reaches substantial reliability (average κ = 0.68). Ablation studies show that removing the feedback loop reduces performance from 90% to 81%, while eliminating the Critic or early stopping degrades accuracy or increases interaction cost. The full system reaches peak performance within 10 iterative rounds (about 25 minutes), demonstrating improved efficiency over expert-only refinement.
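
摘要中以 Cohen's κ(平均 0.68)衡量 LLM 评审与人工标注者的一致性。作为背景说明,Cohen's kappa 的标准计算可用如下简短草图表示(通用公式示意,与 CentaurTA 的具体实现无关):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who labeled the same items."""
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected agreement if both annotators labeled at random with their
    # own marginal label frequencies
    pe = sum(ca[l] * cb[l] for l in ca.keys() | cb.keys()) / (n * n)
    return 1.0 if pe == 1.0 else (po - pe) / (1.0 - pe)
```

κ 对随机一致做了修正:κ=1 表示完全一致,κ=0 表示与随机一致无异。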

[HC-34] Modelling and Analysing Behaviours and Emotions via Complex User Interactions

【速读】:该论文旨在解决如何从日益增长的海量社会数据中提取有效信息,以更好地理解并优化数据驱动数字系统中的用户体验问题。其核心挑战在于缺乏对数字画像(digital profiling)与系统状态(system status)之间映射关系的深入理解。解决方案的关键在于构建一个基于复杂数字系统的新型概念框架,利用纵向数据集,通过从用户文本中提取人格特质和情绪特征,来预测系统的运行状态。该框架依托于一项针对2000名在线奖学金项目学生的数字行为和社会网络行为的实证研究,融合了心理学、语言学与人机交互(Human-Computer Interaction, HCI)的跨学科视角,从而填补了现有研究在数字行为建模与系统状态预测之间的空白。

链接: https://arxiv.org/abs/1902.07683
作者: Mohamed Mostafa
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 176 pages; PhD thesis accepted at Cardiff Metropolitan University, UK (February 2019)

点击查看摘要

Abstract:Over the past 15 years, the volume, richness and quality of data collected from the combined social networking platforms have increased beyond all expectation, giving researchers from a variety of disciplines the opportunity to use it in their research. Perhaps more impactfully, it has provided the foundation for a range of new products and services, transforming industries such as advertising and marketing, as well as bringing the challenges of sharing personal data into the public consciousness. But how do we make sense of the ever-increasing volume of big social data so that we can better understand and improve the user experience in increasingly complex, data-driven digital systems? This link with usability and the user experience of data-driven systems bridges into the wider field of HCI, attracting interdisciplinary researchers as demand grows for consumer technologies, software and systems, and as social networks become integrated into our everyday lives. The fact that the data posted on social networks tends to be largely textual provides a further link to linguistics, psychology and psycholinguistics, helping us better understand the relationship between human behaviours offline and online. In this thesis, we present a novel conceptual framework based on a complex digital system, using collected longitudinal datasets to predict system status based on the personality traits and emotions extracted from text posted by users. The system framework was built using a dataset collected from an online scholarship system in which 2000 students had their digital behaviour and social network behaviour collected for this study. We contextualise this research project with a wider review and critical analysis of the current psycholinguistics, artificial intelligence and human-computer interaction literature, which reveals a gap in mapping and understanding digital profiling against system status.

[HC-35] Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

【速读】:该论文旨在解决统一自动语音识别(ASR)系统在离线(offline)与低延迟流式(streaming)解码场景下性能差异的问题,即如何训练一个单一模型同时在两种模式下均表现优异。其解决方案的关键在于提出一种基于块限制注意力(chunk-limited attention)与动态分块卷积(dynamic chunked convolutions)的统一RNNT(RNN-Transducer)训练框架,并引入一种高效的Triton实现的模式一致性正则化(mode-consistency regularization for RNNT, MCR-RNNT),通过强制不同训练模式下的预测一致性来缩小两种解码方式之间的性能差距。实验表明,该方法在保持离线性能的同时显著提升了低延迟流式识别准确率,并具备良好的扩展性。

链接: https://arxiv.org/abs/2604.19079
作者: Andrei Andrusenko,Vladimir Bataev,Lilit Grigoryan,Nune Tadevosyan,Vitaly Lavrukhin,Boris Ginsburg
机构: NVIDIA(英伟达)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
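
论文中的模式一致性正则化(MCR-RNNT)鼓励离线与流式两种解码模式的预测相互一致。其思想可用两种模式输出分布之间的对称 KL 散度作概念性示意(以下仅为草图,并非论文的 Triton 实现):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mode_consistency_loss(p_offline, p_streaming):
    """Symmetric KL between offline-mode and streaming-mode token
    distributions; zero iff the two modes agree exactly."""
    return 0.5 * (kl(p_offline, p_streaming) + kl(p_streaming, p_offline))
```

训练时将该项加到 RNNT 损失上,即可在不牺牲离线性能的前提下拉近两种模式的行为。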

计算机视觉

[CV-0] Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

【速读】:该论文旨在解决现有虚拟试衣(Virtual Try-On)方法在复杂真实场景下性能不足的问题,如极端姿态、严重光照变化、运动模糊等挑战性条件下的鲁棒性差、生成结果缺乏细节真实感、多图像组合能力有限以及部署延迟高等问题。解决方案的关键在于构建一个端到端的集成系统设计,涵盖优化的模型架构、可扩展的数据引擎、稳健的基础设施和多阶段训练范式,从而实现高成功率、逼真的视觉效果、灵活的多图合成能力(支持最多6张参考图)及近实时推理速度,最终在淘宝App上实现了工业级大规模部署,服务数百万用户。

链接: https://arxiv.org/abs/2604.19748
作者: Mengting Chen,Zhengrui Chen,Yongchao Du,Zuan Gao,Taihang Hu,Jinsong Lan,Chao Lin,Yefeng Shen,Xingjian Wang,Zhao Wang,Zhengtao Wu,Xiaoli Xu,Zhengze Xu,Hao Yan,Mingzhou Zhang,Jun Zheng,Qinye Zhou,Xiaoyong Zhu,Bo Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, model evaluation report

点击查看摘要

Abstract:Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.

[CV-1] AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

【速读】:该论文旨在解决稀疏视图三维重建(Sparse-view 3D reconstruction)中因输入视角有限、几何一致性差以及难以扩展至大规模或多样化场景而导致的挑战。现有基于扩散模型的方法虽能合成新视角,但通常仅依赖一到两个捕获帧进行条件控制,限制了几何一致性并制约了可扩展性。其解决方案的关键在于提出 AnyRecon 框架,通过构建一个预置的捕获视图缓存(prepend capture view cache)实现持久化的全局场景记忆,并移除时间压缩以保持大视角变化下的帧级对应关系;同时引入一种几何感知的条件策略,利用显式三维几何记忆和基于几何的捕获视图检索机制,使生成与重建过程相互耦合,从而提升大规模场景下的重建质量与效率。此外,结合四步扩散蒸馏与上下文窗口稀疏注意力机制,有效降低计算复杂度至线性级别,实现了对不规则输入、大视角间隙和长轨迹的鲁棒且可扩展的重建。

链接: https://arxiv.org/abs/2604.19747
作者: Yutian Chen,Shi Guo,Renbiao Jin,Tianshuo Yang,Xin Cai,Yawen Luo,Mingxin Yang,Mulin Yu,Linning Xu,Tianfan Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Webpage: this https URL

点击查看摘要

Abstract:Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction methods. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.

[CV-2] CityRAG : Stepping Into a City via Spatially-Grounded Video Generation

【速读】:该论文旨在解决生成具有空间一致性且可导航的物理场景模拟问题,即在任意天气条件和动态物体配置下重建真实环境,以支持自动驾驶和机器人仿真等下游应用。解决方案的关键在于提出CityRAG模型,该模型利用大规模地理注册数据作为上下文来锚定生成内容到物理场景,同时保留对复杂运动与外观变化的学习先验;其核心创新在于基于时间未对齐的训练数据,使模型能够语义上解耦场景的静态结构与其瞬时属性(如天气、光照、动态物体),从而实现长时间、物理一致的视频生成与复杂轨迹导航。

链接: https://arxiv.org/abs/2604.19741
作者: Gene Chou,Charles Herrmann,Kyle Genova,Boyang Deng,Songyou Peng,Bharath Hariharan,Jason Y. Zhang,Noah Snavely,Philipp Henzler
机构: Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this http URL

点击查看摘要

Abstract:We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

[CV-3] Generalization at the Edge of Stability

【速读】:该论文旨在解决现代神经网络在大学习率下训练时(即“稳定性边缘”区域)为何能实现更好泛化性能的机制问题,这一现象虽被广泛观察到,但其理论基础尚不清晰。解决方案的关键在于将随机优化器建模为随机动力系统,并发现其在混沌优化轨迹中收敛至一个分形吸引子集(fractal attractor set),该吸引子具有更低的内在维度(intrinsic dimension)。基于此,作者提出了一种新的维度概念——“尖锐度维度”(sharpness dimension),并据此推导出一个与泛化误差相关的理论边界。该边界表明,泛化能力不仅依赖于Hessian矩阵的迹或谱范数等传统指标,更取决于完整的Hessian谱及其部分行列式的结构,从而揭示了此前未被捕捉到的复杂性。实验验证了该理论在多层感知机(MLP)和Transformer架构中的有效性,并进一步解释了近期观察到的“grokking”现象。

链接: https://arxiv.org/abs/2604.19740
作者: Mario Tuci,Caner Korkmaz,Umut Şimşekli,Tolga Birdal
机构: INRIA, CNRS, Département d’Informatique de l’Ecole Normale Supérieure / PSL, France; Imperial College London, United Kingdom
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Project page: this https URL

点击查看摘要

Abstract:Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the “sharpness dimension”, and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
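
论文的“尖锐度维度”受 Lyapunov 维度理论启发。作为背景,经典的 Kaplan–Yorke(Lyapunov)维度可直接由 Lyapunov 指数谱计算,示意如下(注意:这只是经典公式的草图,并非论文中尖锐度维度的实际定义):

```python
def kaplan_yorke_dimension(lyapunov_exponents):
    """Kaplan-Yorke estimate of attractor dimension from a Lyapunov spectrum:
    the largest index j whose partial sum of sorted exponents stays
    non-negative, plus an interpolated fractional part."""
    exps = sorted(lyapunov_exponents, reverse=True)
    s = 0.0
    for j, lam in enumerate(exps):
        if s + lam < 0:
            return j + s / abs(lam)  # fractional part from linear interpolation
        s += lam
    return float(len(exps))  # partial sums never go negative
```

混沌吸引子的维度通常是分数,低于相空间的整数维;这与论文中“优化轨迹收敛到内在维度更低的分形吸引子”的直觉一致。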

[CV-4] Generative Drifting for Conditional Medical Image Generation

【速读】:该论文旨在解决条件医学图像生成中推理效率、患者特异性保真度与分布层面合理性之间的平衡难题,尤其在高维3D医学成像任务中尤为突出。其解决方案的关键在于提出一种名为GDM(Generative Drifting Model)的生成漂移框架,将确定性医学图像预测重构为多目标学习问题,通过吸引-排斥漂移机制最小化生成器前向映射与目标分布间的差异,并借助多层级特征库(基于医学基础编码器构建)实现跨全局、局部及空间表示的稳定亲和力估计与漂移场计算,同时引入共享输出空间中的梯度协调策略以优化分布级与保真度导向目标之间的平衡,从而在单步推理下实现高性能的3D医学图像生成。

链接: https://arxiv.org/abs/2604.19736
作者: Zirong Li,Siyuan Mei,Weiwen Wu,Andreas Maier,Lina Gölz,Yan Xia
机构: Friedrich-Alexander-University Erlangen-Nuremberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学); Sun-Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conditional medical image generation plays an important role in many clinically relevant imaging tasks. However, existing methods still face a fundamental challenge in balancing inference efficiency, patient-specific fidelity, and distribution-level plausibility, particularly in high-dimensional 3D medical imaging. In this work, we propose GDM, a generative drifting framework that reformulates deterministic medical image prediction as a multi-objective learning problem to jointly promote distribution-level plausibility and patient-specific fidelity while retaining one-step inference. GDM extends drifting to 3D medical imaging through an attractive-repulsive drift that minimizes the discrepancy between the generator pushforward and the target distribution. To enable stable drifting-based learning in 3D volumetric data, GDM constructs a multi-level feature bank from a medical foundation encoder to support reliable affinity estimation and drifting field computation across complementary global, local, and spatial representations. In addition, a gradient coordination strategy in the shared output space improves optimization balance under competing distribution-level and fidelity-oriented objectives. We evaluate the proposed framework on two representative tasks, MRI-to-CT synthesis and sparse-view CT reconstruction. Experimental results show that GDM consistently outperforms a wide range of baselines, including GAN-based, flow-matching-based, and SDE-based generative models, as well as supervised regression methods, while improving the balance among anatomical fidelity, quantitative reliability, perceptual realism, and inference efficiency. These findings suggest that GDM provides a practical and effective framework for conditional 3D medical image generation.
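
吸引-排斥漂移的基本思想是:生成样本被拉向目标分布的样本,同时相互排斥以避免塌缩。下面给出一个一维的概念性草图(核函数与归一化方式均为本文假设,并非 GDM 的实际实现):

```python
import math

def drift(x, generated, targets, bandwidth=1.0):
    """Attractive-repulsive drift at a generated point x (1-D sketch):
    pulled toward target samples, pushed away from other generated samples,
    with Gaussian kernel weights."""
    k = lambda a, b: math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))
    attract = sum(k(x, t) * (t - x) for t in targets)
    repel = sum(k(x, g) * (x - g) for g in generated if g != x)
    return (attract + repel) / (len(targets) + len(generated))
```

在 GDM 中,这类漂移场并非在像素空间直接计算,而是在医学基础编码器构建的多层级特征空间中估计。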

[CV-5] VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

【速读】:该论文旨在解决当前开源视觉语言动作模型(Vision-Language-Action, VLA)训练流程碎片化的问题,即多数开源项目仅聚焦于动作训练阶段,且常采用不兼容的预训练流水线,导致整体训练缺乏端到端一致性与可控性。解决方案的关键在于提出VLA Foundry——一个统一的开源框架,整合了大语言模型(LLM)、视觉语言模型(VLM)与VLA的训练流程,提供从语言预训练到动作专家微调的全流程控制能力。该框架支持从零开始训练和基于Hugging Face预训练模型(如Qwen3-VL)的迁移训练,并通过实证验证其有效性:在LBM Eval仿真环境中,完全自研模型性能达到先前闭源工作的水平,而基于Qwen3-VL骨干网络的多任务桌面操作策略则显著优于基线。

链接: https://arxiv.org/abs/2604.19728
作者: Jean Mercat,Sedrick Keh,Kushal Arora,Isabella Huang,Paarth Shah,Haruki Nishimura,Shun Iwase,Katherine Liu
机构: TRI(丰田研究院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 32 pages, 16 figures, technical report

点击查看摘要

Abstract:We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM–VLM–VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work and substituting in the Qwen3-VL backbone leads to a strong multi-task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at this https URL and all multi-task model weights are released on this https URL. Additional qualitative videos are available on the project website this https URL.

[CV-6] ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

【速读】:该论文旨在解决人类视频生成中因有限多视角数据导致的外观(appearance)、运动(motion)与相机视角(viewpoint)难以协同建模的问题。现有方法通常将这些因素分开处理,导致控制能力受限或视觉质量下降。其解决方案的关键在于采用“图像优先”(image-first)范式:首先利用预训练图像生成模型学习高质量的人类外观表征,并将其作为视频合成的先验,从而解耦外观建模与时间一致性;进一步结合基于SMPL-X的运动引导和无需训练的时间一致性精修模块(基于预训练视频扩散模型),实现姿态与视角可控的高质量、时序一致视频生成。

链接: https://arxiv.org/abs/2604.19720
作者: Zhengwentai Sun,Keru Zheng,Chenghong Li,Hongjie Liao,Xihe Yang,Heyuan Li,Yihao Zhi,Shuliang Ning,Shuguang Cui,Xiaoguang Han
机构: Taited; Shandong University (山东大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at this https URL.

[CV-7] A Network-Aware Evaluation of Distributed Energy Resource Control in Smart Distribution Systems

【速读】:该论文旨在解决高比例分布式能源资源(Distributed Energy Resources, DERs)接入配电网时,分布式控制策略在实际通信网络条件下的性能评估难题。现有研究多基于理想化通信假设,难以反映真实场景中通信延迟等动态对控制效果的影响。解决方案的关键在于构建一个耦合线性化配电网模型与ns-3级包级下行链路仿真(packet-level downlink emulation)的协同仿真框架,用于评估一种典型的虚拟电厂(Virtual Power Plant, VPP)调度算法——该算法采用原对偶优化方法,同时实现馈线首端有功功率跟踪和电压调节目标;并通过仅在承载对偶变量更新的下行链路引入每DER的包延迟及“保持最后值”策略,量化通信行为对控制性能的影响。结果表明,理想通信下控制器可精准跟踪参考功率并维持电压合规,而引入现实下行延迟后则出现显著的馈线功率振荡和频繁电压越限,凸显了通信动态对分布式控制性能的关键影响。

链接: https://arxiv.org/abs/2604.19715
作者: Houchao Gan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Distribution networks with high penetration of Distributed Energy Resources (DERs) increasingly rely on communication networks to coordinate grid-interactive control. While many distributed control schemes have been proposed, they are often evaluated under idealized communication assumptions, making it difficult to assess their performance under realistic network conditions. This work presents an implementation-driven evaluation of a representative virtual power plant (VPP) dispatch algorithm using a co-simulation framework that couples a linearized distribution-system model with packet-level downlink emulation in ns-3. The study considers a modified IEEE 37-node feeder with high photovoltaic penetration and a primal–dual VPP dispatch that simultaneously targets feeder-head active power tracking and voltage regulation. Communication effects are introduced only on the downlink path carrying dual-variable updates, where per-DER packet delays and a hold-last-value strategy are modeled. Results show that, under ideal communication, the dispatch achieves close tracking of the feeder-head power reference while maintaining voltages within the prescribed limits at selected buses. When realistic downlink delay is introduced, the same controller exhibits large oscillations in feeder-head power and more frequent voltage limit violations. These findings highlight that distributed DER control performance can be strongly influenced by communication behavior and motivate evaluation frameworks that explicitly incorporate network dynamics into the assessment of grid-interactive control schemes.
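
原对偶调度在下行延迟与“保持最后值”策略下的失稳现象,可以用一个标量玩具系统来示意(纯示意,与论文的馈线模型和算法细节无关):

```python
def primal_dual_track(p_ref=1.0, steps=200, alpha=0.5, delay=0):
    """Scalar toy model: a DER output p tracks the reference p_ref via
    primal-dual gradient steps on L = 0.5*p**2 + lam*(p_ref - p).
    The primal side only sees the dual variable through a delayed,
    hold-last-value downlink (`delay` steps old)."""
    p, lam = 0.0, 0.0
    lam_hist = [lam]
    errs = []
    for _ in range(steps):
        lam_seen = lam_hist[max(0, len(lam_hist) - 1 - delay)]  # stale downlink
        p -= alpha * (p - lam_seen)   # primal descent step
        lam += alpha * (p_ref - p)    # dual ascent on the tracking constraint
        lam_hist.append(lam)
        errs.append(abs(p - p_ref))
    return errs
```

delay=0 时跟踪误差迅速收敛;delay 较大(如 20 步)时,陈旧的对偶变量会引发持续增长的振荡,与论文观察到的馈线功率振荡现象在定性上一致。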

[CV-8] SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

【速读】:该论文旨在解决现有视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶场景中因自回归生成框架导致的动作生成延迟高以及鲁棒性不足的问题。解决方案的关键在于提出SpanVLA框架,其核心创新包括:一是引入基于历史轨迹初始化的流匹配(flow-matching)动作专家策略,有效利用视觉语言模型(Vision-Language Model, VLM)的推理指导来规划未来轨迹,显著降低推理时间;二是设计一种基于GRPO(Generalized Reward Policy Optimization)的后训练方法,使模型不仅能从正向驾驶样本中学习,还能学习规避典型负向行为及恢复行为,从而提升整体性能与鲁棒性。

链接: https://arxiv.org/abs/2604.19710
作者: Zewei Zhou,Ruining Yang,Xuewei (Tony) Qi,Yiluan Guo,Sherry X. Chen,Tao Feng,Kateryna Pistunova,Yishan Shen,Lili Su,Jiaqi Ma
机构: University of California, Los Angeles (加州大学洛杉矶分校); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often struggle with high latency in action generation under an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework, integrating an autoregressive reasoning module and a flow-matching action expert. First, SpanVLA introduces an efficient bridge to leverage the vision and reasoning guidance of the VLM to efficiently plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method to enable the VLA model not only to learn from positive driving samples but also to learn how to avoid the typical negative behaviors and learn recovery behaviors. We further introduce mReasoning, a new real-world driving reasoning dataset, focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of the SpanVLA model. Additionally, the qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
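
流匹配(flow matching)动作专家的训练目标,是让模型在插值点上预测从初始样本指向目标样本的速度场。以下为一维条件流匹配损失的草图(示意性实现,并非 SpanVLA 的实际代码):

```python
import random

def flow_matching_loss(model, x0, x1, n_samples=100):
    """Conditional flow-matching objective (1-D sketch): at interpolated
    points x_t between x0 and x1, the model should predict the constant
    velocity (x1 - x0) of the straight-line probability path."""
    total = 0.0
    for _ in range(n_samples):
        t = random.random()                 # sample a time in [0, 1)
        x_t = (1 - t) * x0 + t * x1         # point on the interpolation path
        v_target = x1 - x0                  # target velocity field
        total += (model(x_t, t) - v_target) ** 2
    return total / n_samples
```

推理时只需沿学到的速度场做少量积分步即可得到动作轨迹,这正是 SpanVLA 相比自回归逐 token 生成能显著降低延迟的原因之一。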

[CV-9] Face Anything: 4D Face Reconstruction from Any Image Sequence WWW ATC

【速读】:该论文旨在解决从图像序列中高保真重建与跟踪动态人脸的难题,其核心挑战在于非刚性形变、表情变化和视角差异同时发生,导致几何结构与像素对应关系估计存在显著歧义。解决方案的关键在于提出一种基于**规范人脸点预测(canonical facial point prediction)**的统一方法,该表示将每个像素映射到共享规范空间中的归一化面部坐标,从而将密集跟踪与动态重建转化为规范空间中的重建问题,实现了时序一致的几何结构和可靠的像素对应关系。通过联合预测深度与规范坐标,该方法在单一前馈模型中实现了精确的深度估计、稳定的时序重建、稠密的3D几何以及鲁棒的人脸点跟踪,其架构基于Transformer设计,并利用非刚性对齐至规范空间的多视角几何数据进行训练,最终在重建与跟踪任务上均达到当前最优性能。

链接: https://arxiv.org/abs/2604.19702
作者: Umut Kocasari,Simon Giebenhain,Richard Shaw,Matthias Nießner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL , Video: this https URL

点击查看摘要

Abstract:Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained using multi-view geometry data that non-rigidly warps into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3× lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.

[CV-10] Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在专业领域(尤其是STEM领域)中跨模态推理能力评估困难的问题。现有基准测试常因模态冗余导致单模态捷径(unimodal shortcuts),且仅关注最终答案准确性,忽视了推理过程本身。为应对这一挑战,作者提出StepSTEM——一个涵盖数学、物理、化学、生物和工程的研究生级别基准,包含283道问题,通过严格的构建流程确保文本与视觉输入之间的严格互补性;其关键创新在于提出一种通用的步骤级评估框架,利用动态规划对齐预测推理步骤与多个参考解答,从而实现对文本链式思维(chain-of-thought)和图文交错推理的细粒度评估。实验表明当前MLLMs仍严重依赖文本推理,即使最先进的模型如Gemini 3.1 Pro和Claude Opus 4.6也仅达到38.29%准确率,凸显了真实跨模态STEM推理的巨大提升空间。

链接: https://arxiv.org/abs/2604.19697
作者: Jing Jin,Hao Liu,Yan Bai,Yihang Lou,Zhenke Wang,Tianrun Yuan,Juntong Chen,Yongkang Zhu,Fanhu Zeng,Xuanyu Zhu,Yige Xu
机构: Central South University (中南大学); Meituan Inc (美团); Peking University (北京大学); University of Chinese Academy of Sciences (中国科学院大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at this https URL.
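
摘要中提到用动态规划将预测的推理步骤与多条参考解答对齐。下面是一个纯 Python 的示意草图(对齐规则、相似度函数与归一化方式均为笔者假设,并非论文原始算法):采用单调对齐,匹配得相似度分、跳步不罚分、匹配不交叉,最终对多条参考取最高分。

```python
def step_similarity(pred: str, ref: str) -> float:
    # 玩具相似度:词集合的 Jaccard 重叠;实际系统可换成嵌入模型或 LLM 评审
    a, b = set(pred.lower().split()), set(ref.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def align_score(pred_steps, ref_steps):
    # 单调 DP 对齐:匹配得相似度分,跳过某步不罚分,匹配不可交叉
    n, m = len(pred_steps), len(ref_steps)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + step_similarity(pred_steps[i - 1], ref_steps[j - 1])
            dp[i][j] = max(match, dp[i - 1][j], dp[i][j - 1])
    return dp[n][m] / max(n, m)  # 用较长的推理链归一化

def best_over_references(pred_steps, reference_solutions):
    # 对多条参考解答取最高对齐分,作为步骤级得分
    return max(align_score(pred_steps, ref) for ref in reference_solutions)
```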

[CV-11] IR-Flow: Bridging Discriminative and Generative Image Restoration via Rectified Flow

【速读】:该论文旨在解决图像恢复任务中判别式方法(discriminative methods)因期望学习导致细节缺失,以及生成式方法(generative methods)存在多步采样效率低和噪声残差耦合问题的困境。其解决方案的关键在于提出IR-Flow框架,该框架基于修正流(Rectified Flow)构建统一建模机制,通过多层次数据分布流增强模型对不同退化层级的学习与适应能力,并引入累积速度场(cumulative velocity fields)以学习跨退化水平的传输轨迹,引导中间状态向干净目标演化;同时设计多步一致性约束以保证轨迹连贯性并提升少步恢复性能。该方法直接建立退化与干净图像域之间的线性传输流,实现快速推理并提升对分布外退化的鲁棒性。

链接: https://arxiv.org/abs/2604.19680
作者: Zihao Fan,Xin Lu,Jie Xiao,Dong Li,Jie Huang,Xueyang Fu
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In image restoration, single-step discriminative mappings often lack fine details via expectation learning, whereas generative paradigms suffer from inefficient multi-step sampling and noise-residual coupling. To address this dilemma, we propose IR-Flow, a novel image restoration method based on Rectified Flow that serves as a unified framework bridging the gap between discriminative and generative paradigms. Specifically, we first construct multilevel data distribution flows, which expand the ability of models to learn from and adapt to various levels of degradation. Subsequently, cumulative velocity fields are proposed to learn transport trajectories across varying degradation levels, guiding intermediate states toward the clean target, while a multi-step consistency constraint is presented to enforce trajectory coherence and boost few-step restoration performance. We show that directly establishing a linear transport flow between degraded and clean image domains not only enables fast inference but also improves adaptability to out-of-distribution degradations. Extensive evaluations on deraining, denoising and raindrop removal tasks demonstrate that IR-Flow achieves competitive quantitative results with only a few sampling steps, offering an efficient and flexible framework that maintains an excellent distortion-perception balance. Our code is available at this https URL.
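
作为示意,下面用纯 Python 勾勒修正流"在退化域与干净域之间建立线性传输流"的核心思想(仅为笔者的简化示意,非论文实现;真实方法由神经网络回归速度场,并配合多级退化流与一致性约束):给定真实速度场 v = x_clean − x_degraded,从退化图像出发,一步欧拉积分即可恢复干净图像,这正是少步推理可行的原因。

```python
def interpolate(x_deg, x_clean, t):
    # 直线传输路径上 t ∈ [0, 1] 时刻的中间状态
    return [(1 - t) * d + t * c for d, c in zip(x_deg, x_clean)]

def true_velocity(x_deg, x_clean):
    # 修正流的回归目标:沿直线路径恒定的速度场
    return [c - d for d, c in zip(x_deg, x_clean)]

def euler_restore(x_deg, velocity_fn, steps=1):
    # 用 steps 步欧拉法从 t=0 积分 dx/dt = v(x, t) 到 t=1
    x, dt = list(x_deg), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

训练时即以速度场的回归误差(如 MSE)为损失;速度场恒定的直线路径使得单步欧拉积分就能精确走完全程。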

[CV-12] MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

【速读】:该论文旨在解决当前联合音视频生成框架中可控性不足的问题,即现有方法通常仅支持视频单模态控制,导致跨模态对齐效果不佳。其解决方案的关键在于提出MMControl,通过引入双流条件注入机制,将视觉(如参考图像、深度图、姿态序列)和听觉(如参考音频)控制信号共同注入到联合音视频扩散Transformer中,并结合模态特定的引导缩放策略,使模型在结构约束下同时生成身份一致的视频与音色一致的音频,从而实现对角色身份、声音特质、身体姿态及场景布局的细粒度、可组合控制。

链接: https://arxiv.org/abs/2604.19679
作者: Liyang Li,Wen Wang,Canyu Zhao,Tianjian Feng,Zhiyue Zhao,Hao Chen,Chunhua Shen
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
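
摘要中的"模态特定引导缩放"可以用组合式无分类器引导(classifier-free guidance)的常见形式来示意(具体组合规则为笔者假设,并非论文公式):在无条件预测的基础上,按各模态各自独立的系数叠加条件残差,用户便可在推理时独立调节每路视觉/声学条件的影响强度。

```python
def guided_prediction(uncond, cond_preds, scales):
    # uncond: 无条件预测(扁平化的浮点列表)
    # cond_preds: {模态名: 该模态单独条件下的预测}
    # scales: {模态名: 引导系数},缺省为 1.0
    out = list(uncond)
    for mod, pred in cond_preds.items():
        s = scales.get(mod, 1.0)
        # 按模态系数叠加条件残差 s * (cond - uncond)
        out = [o + s * (p - u) for o, p, u in zip(out, pred, uncond)]
    return out
```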

[CV-13] MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention

【速读】:该论文旨在解决当前基于扩散模型的医学图像分割方法中存在的计算开销大和参数化受限问题。现有方法依赖迭代采样过程,导致效率低下,且多采用UNet结构限制了模型表达能力。其解决方案的关键在于提出MedFlowSeg,一种条件流匹配(conditional flow matching)框架,将医学图像分割建模为学习一个时变向量场,该向量场可将简单先验分布映射至目标分割分布,从而实现一步确定性推理;同时引入双条件机制——包括双分支空间注意力模块(Dual-Branch Spatial Attention)以注入多尺度结构信息,以及频域感知注意力模块(Frequency-Aware Attention)通过差异感知融合与时间依赖调制建模空间与频域表示间的跨域交互,有效提升了模型对全局解剖结构与细粒度边界细节的捕捉能力。

链接: https://arxiv.org/abs/2604.19675
作者: Zhi Chen,Runze Hu,Le Zhang
机构: University of Birmingham (伯明翰大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are often constrained by UNet-based parameterizations. In this work, we introduce MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. This formulation enables one-step deterministic inference while preserving the expressiveness of generative modeling. We further develop a dual-conditioning mechanism to incorporate structured priors into the learned flow. Specifically, we propose a Dual-Branch Spatial Attention module that injects multi-scale structural information into the flow field, and a Frequency-Aware Attention module that models cross-domain interactions between spatial and spectral representations via discrepancy-aware fusion and time-dependent modulation. Together, these components provide an effective parameterization of conditional flows that capture both global anatomical structure and fine-grained boundary details. We provide extensive empirical validation across multiple medical imaging modalities, demonstrating that MedFlowSeg achieves state-of-the-art performance while significantly reducing computational cost compared to diffusion-based methods. Our results highlight the potential of flow matching as a theoretically grounded and computationally efficient alternative for generative medical image segmentation.

[CV-14] InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

【速读】:该论文旨在解决训练具身智能体(embodied agents)理解三维场景时缺乏大规模、真实且语义丰富的交互数据的问题。现有真实世界动作捕捉数据成本高且受限于受控环境,而合成数据则依赖简单几何启发式方法,忽略了复杂的场景上下文信息。解决方案的关键在于提出InHabit——一个全自动、可扩展的3D场景中人类交互数据生成框架,其核心遵循“渲染-生成-提升”(render-generate-lift)原则:首先渲染3D场景,接着利用视觉-语言模型(vision-language model)提出语境合理的动作建议,再通过图像编辑模型插入人体,最后借助优化过程将编辑结果转化为与场景几何对齐的物理合理SMPL-X人体模型。该方法首次在Habitat-Matterport3D上生成了包含78K样本的大规模逼真3D人-场景交互数据集,显著提升了基于RGB的3D人体-场景重建与接触估计性能,并在感知用户研究中优于当前最优方法。

链接: https://arxiv.org/abs/2604.19673
作者: Nikita Kister,Pradyumna YM,István Sárándi,Jiayi Wang,Anna Khoreva,Gerard Pons-Moll
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); ELLIS PhD Program (ELLIS博士项目); International Max Planck Research School for Intelligent Systems (国际马克斯·普朗克智能系统研究学校); German Federal Ministry of Education and Research (德国联邦教育与研究部); Deutsche Forschungsgemeinschaft (德国研究基金会); Carl Zeiss Foundation (卡尔·蔡司基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.

[CV-15] CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

【速读】:该论文旨在解决开放词汇语义分割(open-vocabulary semantic segmentation)中多类场景下的掩码生成不稳定问题,具体表现为:不同类别提示(category prompt)独立生成的掩码缺乏统一的可比证据尺度,导致掩码重叠覆盖和类间竞争不稳;同时,同一概念的不同同义表达会激活不一致的语义与空间证据,引发类内漂移(intra-class drift),进一步加剧类间冲突,降低整体推理稳定性。解决方案的关键在于提出 CoCo-SAM3(Concept-Conflict SAM3),其核心机制是显式解耦推理过程为类内增强(intra-class enhancement)与类间竞争(inter-class competition)两阶段:首先对同义提示的证据进行对齐与聚合以强化概念一致性,随后在统一可比尺度上执行类间竞争,实现所有候选类别的像素级直接比较,从而稳定多类推理并有效缓解类间冲突。该方法无需额外训练即可在八个开放词汇语义分割基准上实现一致性能提升。

链接: https://arxiv.org/abs/2604.19648
作者: Yanhui Chen,Baoyao Yang,Siqi Liu,Jingchao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.
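
按摘要的描述,推理被显式解耦为"类内增强 → 类间竞争"两步。下面给出一个玩具化的纯 Python 草图(同义提示证据用简单平均聚合、竞争用逐像素 argmax,均为笔者的示意性假设,并非论文的具体实现):

```python
def aggregate_synonyms(evidence_maps):
    # 类内增强:对同一概念的多个同义提示产生的证据图取平均
    # evidence_maps: 若干张 HxW 的得分网格(嵌套列表)
    h, w = len(evidence_maps[0]), len(evidence_maps[0][0])
    return [[sum(m[i][j] for m in evidence_maps) / len(evidence_maps)
             for j in range(w)] for i in range(h)]

def compete(class_evidence):
    # 类间竞争:在统一可比尺度上,所有候选类别逐像素直接比较
    # class_evidence: {类别名: HxW 证据网格};返回 HxW 标签网格
    names = list(class_evidence)
    h, w = len(class_evidence[names[0]]), len(class_evidence[names[0]][0])
    return [[max(names, key=lambda c: class_evidence[c][i][j])
             for j in range(w)] for i in range(h)]
```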

[CV-16] CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

【速读】:该论文旨在解决当前扩散模型在人类-物体交互(Human-Object Interaction, HOI)视频合成中面临的两大核心问题:一是敏感区域(如手部和面部)的结构稳定性不足,二是物理上不合理的接触行为(如手与物体穿插)。解决方案的关键在于提出一个端到端框架CoInteract,其核心创新包括两个互补设计:其一为“人感知混合专家(Human-Aware Mixture-of-Experts, MoE)”,通过空间监督路由机制将token分配至轻量化的区域专用专家模块,在保持极低参数开销的前提下提升细粒度结构保真度;其二为“空间结构协同生成(Spatially-Structured Co-Generation)”,采用双流训练范式联合建模RGB外观流与辅助HOI结构流,利用交互几何先验约束共享骨干网络权重,推理时移除HOI分支以实现零额外计算开销的高质量视频生成。

链接: https://arxiv.org/abs/2604.19636
作者: Xiangyang Luo,Xiaozhe Xin,Tao Feng,Xu Guo,Meiguang Jin,Junfeng Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page: this https URL

点击查看摘要

Abstract:Synthesizing human–object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand–object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.

[CV-17] CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

【速读】:该论文旨在解决生成式 AI (Generative AI) 在图形设计图像处理中缺乏显式分层结构的问题,使得下游编辑受限。现有方法依赖多阶段流水线(如布局预测、抠图和修复),存在误差累积和可控性差的缺陷。其解决方案的关键在于提出一种混合生成框架,将位图图像分解为可编辑的文本、背景与贴纸层:利用视觉-语言模型解析文本区域并输出文本渲染协议以实现高保真重建与灵活再编辑;背景与贴纸层则通过支持RGBA通道的多分支扩散架构生成;同时引入ParserReward并结合Group Relative Policy Optimization优化生成质量,使其更符合人类设计偏好。

链接: https://arxiv.org/abs/2604.19632
作者: Weidong Chen,Dexiang Hong,Zhendong Mao,Yutao Cheng,Xinyan Liu,Lei Zhang,Yongdong Zhang
机构: University of Science and Technology of China (中国科学技术大学); ByteDance Intelligent Creation (字节跳动智能创作); Harbin Institute of Technology (哈尔滨工业大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.

[CV-18] MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation

【速读】:该论文旨在解决动态场景图生成(Dynamic Scene Graph Generation, DSGG)中面临的三个核心挑战:细粒度关系建模不足、语义表示利用不充分以及尾部关系(tail relationships)建模能力弱。解决方案的关键在于提出一种运动引导的语义对齐方法(Motion-guided Semantic Alignment, MoSA),其核心创新包括:1)设计运动特征提取器(Motion Feature Extractor, MFE)以编码物体对间的运动属性(如距离、速度、运动持续性与方向一致性);2)通过运动引导交互模块(Motion-guided Interaction Module, MIM)将运动特征与空间关系特征融合,生成具备运动感知的关系表征;3)引入跨模态动作语义匹配机制(Action Semantic Matching, ASM),对齐视觉关系特征与关系类别的文本嵌入,增强语义区分能力;4)采用类别加权损失策略,提升对低频尾部关系的学习效果。实验表明,MoSA在Action Genome数据集上取得了最优性能。

链接: https://arxiv.org/abs/2604.19631
作者: Xuejiao Wang,Bohao Zhang,Changbo Wang,Gaoqi He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dynamic Scene Graph Generation (DSGG) aims to structurally model objects and their dynamic interactions in video sequences for high-level semantic understanding. However, existing methods struggle with fine-grained relationship modeling, semantic representation utilization, and the ability to model tail relationships. To address these issues, this paper proposes a motion-guided semantic alignment method for DSGG (MoSA). First, a Motion Feature Extractor (MFE) encodes object-pair motion attributes such as distance, velocity, motion persistence, and directional consistency. Then, these motion attributes are fused with spatial relationship features through the Motion-guided Interaction Module (MIM) to generate motion-aware relationship representations. To further enhance semantic discrimination capabilities, the cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories. Finally, a category-weighted loss strategy is introduced to emphasize learning of tail relationships. Extensive and rigorous testing shows that MoSA performs optimally on the Action Genome dataset.
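
摘要列出的物体对运动属性(距离、速度、运动持续性、方向一致性)可以从逐帧质心轨迹直接示意性地计算。下面的具体定义为笔者假设,仅作说明,并非论文 MFE 模块的实现:

```python
import math

def motion_attributes(traj_a, traj_b):
    # traj_*: 同一组帧上两个物体的 (x, y) 质心轨迹
    dists = [math.dist(a, b) for a, b in zip(traj_a, traj_b)]
    # 相对速度:物体间距离的逐帧变化
    rel_vel = [d1 - d0 for d0, d1 in zip(dists, dists[1:])]
    # 物体 a 的逐帧位移向量,用于持续性与方向统计
    steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(traj_a, traj_a[1:])]
    moving = [s for s in steps if math.hypot(*s) > 1e-6]
    persistence = len(moving) / len(steps) if steps else 0.0  # 运动持续性
    def cos(u, v):
        nu, nv = math.hypot(*u), math.hypot(*v)
        return (u[0] * v[0] + u[1] * v[1]) / (nu * nv) if nu and nv else 0.0
    # 方向一致性:相邻位移向量夹角余弦的均值
    consistency = (sum(cos(u, v) for u, v in zip(moving, moving[1:]))
                   / max(len(moving) - 1, 1))
    return {"mean_dist": sum(dists) / len(dists),
            "mean_rel_vel": sum(rel_vel) / len(rel_vel) if rel_vel else 0.0,
            "persistence": persistence,
            "direction_consistency": consistency}
```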

[CV-19] GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

【速读】:该论文旨在解决从单张图像中重建物理上合理的三维人-场景交互(Human-Scene Interaction, HSI)的问题,当前方法存在效率与准确性之间的权衡:基于优化的方法虽能精确建模接触关系但速度慢(约20秒),而前馈式方法虽快速却缺乏显式的交互推理,常产生漂浮或穿插伪影。其解决方案的关键在于提出一种名为GRAFT(Geometric Refinement And Fitting Transformer)的可学习人-场景交互先验模型,通过预测“交互梯度”(Interaction Gradients)——即修正参数更新量——来迭代优化人体网格,从而在保持几何合理性的同时实现高效推理。GRAFT将交互状态编码为紧凑的、以身体锚定的token,并利用几何探针(Geometric Probes)捕捉人体与周围表面的空间关系;再通过轻量级Transformer循环更新人体姿态并重新探测场景,确保最终结果既符合学习到的先验知识又贴合观测几何结构。该方法可在端到端重建中使用图像特征,也可仅依赖几何信息作为可迁移的插件式先验,显著提升前馈方法的交互质量且无需重新训练。

链接: https://arxiv.org/abs/2604.19624
作者: Pradyumna YM,Yuxuan Xue,Yue Chen,Nikita Kister,István Sárándi,Gerard Pons-Moll
机构: 1. Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); 2. University of Tübingen (图宾根大学); 3. Chinese Academy of Sciences (中国科学院); 4. MPI-IMPRS for Intelligent Systems (马克斯·普朗克智能系统研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human–scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ~50× lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred in 64.8% of three-way user study. Project page: this https URL.

[CV-20] SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets

【速读】:该论文旨在解决边缘-云协同推理中因上行信道带宽受限而导致的传输比特数约束问题,即在有限的通信预算下如何高效选择要上传的输入片段以最大化服务器端的推理准确率。传统方法仅依赖注意力机制的重要性评分进行内容筛选,但研究发现这种策略存在本质局限:一方面,高重要性单元未必构成最优组合;另一方面,单纯依靠空间均匀采样也能获得良好性能,说明覆盖多样性本身具有独立价值。解决方案的关键在于提出SAGE(Semantic Attention-Guided Evidence),一种无需训练、融合重要性过滤与嵌入空间多样性采样的新方法,从而在显著减少传输证据单元数量的同时逼近服务器端的最佳性能,实验表明其在ImageNet-1K上可达到93%的服务器天花板准确率,且传输量不足常规方法的一半。

链接: https://arxiv.org/abs/2604.19623
作者: Inhyeok Choi,Hyuncheol Park
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 11pages, 9 figures

点击查看摘要

Abstract:Edge-cloud hybrid inference offloads difficult inputs to a powerful remote model, but the uplink channel imposes hard per-request constraints on the number of bits that can be transmitted. We show that selecting transmitted content based solely on attention-based importance, the standard approach in collaborative inference, is inherently limited under hard budgets. Two findings support this claim. First, replacing high-importance units with low-importance but complementary ones improves server accuracy. This shows that what matters is not individual importance but how well the transmitted set covers diverse aspects of the input. Second, spatially uniform selection without any content information achieves competitive accuracy at moderate budgets. This confirms that spatial coverage alone carries independent value. Based on this analysis, we propose SAGE (Semantic Attention-Guided Evidence), a principled, training-free method that combines importance filtering with embedding-diversity sampling. SAGE achieves 93% of the server ceiling in offloaded accuracy while transmitting fewer than half of the available evidence units on ImageNet-1K, substantially outperforming importance-only composition.
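
"重要性过滤 + 嵌入多样性采样"的选择规则可以用如下纯 Python 草图示意(候选池比例、最远点式贪心选择等细节为笔者假设,并非论文的具体算法):先按重要性保留一个候选池,再在嵌入空间中贪心挑选与已选集合距离最远的单元,直到用满传输预算。

```python
import math

def sage_select(units, budget, pool_frac=0.5):
    # units: (重要性, 嵌入向量) 列表;返回被选中单元的下标
    order = sorted(range(len(units)), key=lambda i: -units[i][0])
    # 第一步:重要性过滤,保留候选池
    pool = order[:max(budget, int(len(units) * pool_frac))]
    chosen = [pool[0]]  # 以重要性最高的单元作为种子
    while len(chosen) < min(budget, len(pool)):
        def min_dist(i):
            # 候选单元到已选集合的最小嵌入距离
            return min(math.dist(units[i][1], units[c][1]) for c in chosen)
        # 第二步:贪心最远点选择,最大化嵌入多样性
        nxt = max((i for i in pool if i not in chosen), key=min_dist)
        chosen.append(nxt)
    return chosen
```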

[CV-21] Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding

【速读】:该论文旨在解决3D场景理解领域长期依赖专用骨干网络(backbone)且缺乏与主流Transformer生态融合的问题,从而限制了新方法的迁移和软硬件优化红利的利用。其核心解决方案是提出一种轻量级、通用的Volume Transformer(Volt)模型,通过将3D场景划分为体素patch token并引入3D旋转位置编码(rotary positional embeddings),实现对全局自注意力机制的直接应用;关键创新在于设计了一套数据高效的训练策略——结合强3D增强、正则化与卷积教师蒸馏(distillation),有效缓解因标注数据稀缺导致的捷径学习(shortcut learning)问题,并通过多数据集联合训练显著提升模型泛化能力,最终在室内和室外3D语义分割任务中达到当前最优性能,验证了Volt作为通用骨干网络的潜力。

链接: https://arxiv.org/abs/2604.19609
作者: Kadir Yilmaz,Adrian Kruse,Tristan Höfer,Daan de Geus,Bastian Leibe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self-attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome this, we introduce a data-efficient training recipe based on strong 3D augmentations, regularization, and distillation from a convolutional teacher, making Volt competitive with state-of-the-art methods. We then scale supervision through joint training on multiple datasets and show that Volt benefits more from increased scale than domain-specific 3D backbones, achieving state-of-the-art results across indoor and outdoor datasets. Finally, when used as a drop-in backbone in a standard 3D instance segmentation pipeline, Volt again sets a new state of the art, highlighting its potential as a simple, scalable, general-purpose backbone for 3D scene understanding.
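
摘要提到将旋转位置编码(RoPE)扩展到 3D。一种常见做法是把通道均分为三组,分别用 x/y/z 体素坐标旋转各组内的相邻通道对;下面的纯 Python 草图即按此示意(通道划分与频率设置均为笔者假设,并非 Volt 的具体实现):

```python
import math

def rope_1d(vec, pos, base=10000.0):
    # 标准 RoPE:对相邻通道对按位置相关的角度做二维旋转
    out = list(vec)
    for k in range(0, len(vec) - 1, 2):
        theta = pos / (base ** (k / len(vec)))  # 简化的频率设置
        c, s = math.cos(theta), math.sin(theta)
        out[k]     = vec[k] * c - vec[k + 1] * s
        out[k + 1] = vec[k] * s + vec[k + 1] * c
    return out

def rope_3d(vec, xyz):
    # 3D 扩展:三个等分通道组分别用 x / y / z 坐标旋转
    assert len(vec) % 6 == 0, "通道维需能被 6 整除(三组、每组成对)"
    g = len(vec) // 3
    groups = [vec[i * g:(i + 1) * g] for i in range(3)]
    rotated = [rope_1d(grp, p) for grp, p in zip(groups, xyz)]
    return [v for grp in rotated for v in grp]
```

旋转不改变向量范数,因而注意力的数值尺度不受位置注入影响;与可学习的绝对位置编码相比,这种方式让注意力得分只依赖相对体素偏移。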

[CV-22] PC2Model: ISPRS benchmark on 3D point cloud to model registration

【速读】:该论文旨在解决点云到三维模型(point cloud-to-model, PC2Model)配准任务中因现实扫描数据的稀疏性、噪声、杂波和遮挡等问题导致的数据驱动方法性能受限的问题。解决方案的关键在于提出一个名为PC2Model的基准数据集,采用混合设计——结合模拟点云与部分真实扫描及其对应的3D模型,从而在受控条件下提供精确的地面真值,同时引入传感器和环境伪影以增强泛化能力,支持从模拟到真实场景的系统性迁移能力分析和鲁棒训练评估。

链接: https://arxiv.org/abs/2604.19596
作者: Mehdi Maboudi,Said Harb,Jackson Ferrao,Kourosh Khoshelham,Yelda Turkan,Karam Mawas
机构: Technische Universität Braunschweig (不伦瑞克工业大学); University of Melbourne (墨尔本大学); Oregon State University (俄勒冈州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ISPRS Congress 2026, Toronto

点击查看摘要

Abstract:Point cloud registration involves aligning one point cloud with another or with a three-dimensional (3D) model, enabling the integration of multimodal data into a unified representation. This is essential in applications such as construction monitoring, autonomous driving, robotics, and virtual or augmented reality (VR/AR).With the increasing accessibility of point cloud acquisition technologies, such as Light Detection and Ranging (LiDAR) and structured light scanning, along with recent advances in deep learning, the research focus has increasingly shifted towards downstream tasks, particularly point cloud-to-model (PC2Model) registration. While data-driven methods aim to automate this process, they struggle with sparsity, noise, clutter, and occlusions in real-world scans, which limit their performance. To address these challenges, this paper introduces the PC2Model benchmark, a publicly available dataset designed to support the training and evaluation of both classical and data-driven methods. Developed under the leadership of ICWG II/Ib, the PC2Model benchmark adopts a hybrid design that combines simulated point clouds with, in some cases, real-world scans and their corresponding 3D models. Simulated data provide precise ground truth and controlled conditions, while real-world data introduce sensor and environmental artefacts. This design supports robust training and evaluation across domains and enables the systematic analysis of model transferability from simulated to real-world scenarios. The dataset is publicly accessible at: this https URL.

[CV-23] Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping

【速读】:该论文旨在解决高分辨率遥感制图中因局部视觉特征依赖导致的跨域泛化能力弱以及大尺度地物覆盖预测碎片化的问题,同时克服全球地理空间基础模型(geospatial foundation models)与高分辨率视觉特征直接融合时因语义-空间鸿沟引发的特征干扰和空间结构退化问题。其解决方案的关键在于提出结构-语义解耦调制(Structure-Semantic Decoupled Modulation, SSDM)框架,通过两条互补的跨模态注入路径实现:一是结构先验调制分支,将全局表示中的宏观感受野先验引入高分辨率编码器的自注意力模块,以整体结构约束引导局部特征提取,抑制高频细节噪声和类内差异引起的预测碎片化;二是全局语义注入分支,显式对齐全局上下文与深层高分辨率特征空间,并通过跨模态融合直接补充全局语义信息,显著提升复杂地物的语义一致性和类别级判别力。

链接: https://arxiv.org/abs/2604.19591
作者: Jienan Lyu,Miao Yang,Jinchen Cai,Yiwen Hu,Guanyi Lu,Junhao Qiu,Runmin Dong
机构: Sun Yat-Sen University (中山大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalizability and often leads to fragmented predictions of large-scale land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive field priors from global representations into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal integration, thereby significantly enhancing the semantic consistency and category-level discrimination of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing cross-modal fusion approaches. By unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.

[CV-24] SmartPhotoCrafter: Unified Reasoning Generation and Optimization for Automatic Photographic Image Editing

【速读】:该论文旨在解决传统摄影图像编辑依赖用户具备充足美学理解以提供明确指令的问题,而这类指令往往模糊、不完整或对非专业用户难以获取。解决方案的关键在于提出SmartPhotoCrafter,一种将图像编辑建模为紧密耦合的推理-生成(reasoning-to-generation)过程的方法:首先通过Image Critic模块实现图像质量理解并识别缺陷,再由Photographic Artist模块执行针对性增强以提升图像吸引力,从而无需显式的人类指令。该方法采用多阶段训练流程,包括基础预训练、基于推理引导的多编辑监督微调以及推理与生成协同强化学习,有效实现了高质量、保真的图像增强,并在色调敏感性和语义一致性方面优于现有生成模型。

链接: https://arxiv.org/abs/2604.19587
作者: Ying Zeng,Miaosen Luo,Guangyuan Li,Yang Yang,Ruiyang Fan,Linxiao Shi,Qirui Yang,Jian Zhang,Chengcheng Liu,Siming Zheng,Jinwei Chen,Bo Li,Peng-Tao Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: tech report

点击查看摘要

Abstract:Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies by the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to-generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: this https URL.

[CV-25] TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing

【速读】:该论文旨在解决语言驱动的3D Gaussian Splatting(3DGS)编辑中,由于缺乏对编辑后的2D视觉证据与3D高斯分布之间语义对应关系的显式建模,导致局部编辑精度不足和结构一致性差的问题。现有方法多聚焦于提升多视角一致性,但未从根本上解决编辑语义在跨视图间如何准确映射到3D空间的问题。其解决方案的关键在于将编辑任务建模为一个多视角非平衡语义传输(multi-view unbalanced semantic transport)问题:通过建立可见高斯点与视图特定编辑原型之间的对应关系,显式刻画2D编辑证据与3D高斯之间的语义关联;进一步恢复一个跨视角共享的规范3D编辑场(canonical 3D edit field),以统一引导3D外观更新,并利用传输残差抑制非目标区域的错误编辑,从而减少编辑泄漏并提升局部控制精度。

链接: https://arxiv.org/abs/2604.19571
作者: Yanhui Chen,Jiahong Li,Jingchao Wang,Junyi Lin,Zixin Zeng,Yang Shi
机构: Guangdong University of Technology (广东工业大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Language-driven 3D Gaussian Splatting (3DGS) editing provides a more convenient approach for modifying complex scenes in VR/AR. Standard pipelines typically adopt a two-stage strategy: first editing multiple 2D views, and then optimizing the 3D representation to match these edited observations. Existing methods mainly improve view consistency through multi-view feature fusion, attention filtering, or iterative recalibration. However, they fail to explicitly address a more fundamental issue: the semantic correspondence between edited 2D evidence and 3D Gaussians. To tackle this problem, we propose TransSplat, which formulates language-driven 3DGS editing as a multi-view unbalanced semantic transport problem. Specifically, our method establishes correspondences between visible Gaussians and view-specific editing prototypes, thereby explicitly characterizing the semantic relationship between edited 2D evidence and 3D Gaussians. It further recovers a cross-view shared canonical 3D edit field to guide unified 3D appearance updates. In addition, we use transport residuals to suppress erroneous edits in non-target regions, mitigating edit leakage and improving local control precision. Qualitative and quantitative results show that, compared with existing 3D editing methods centered on enhancing view consistency, TransSplat achieves superior performance in local editing accuracy and structural consistency.

[CV-26] RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中长期上下文推理与精确边界划分(boundary delineation)的双重挑战,同时克服现有基于 Transformer 和扩散模型方法因二次计算复杂度导致的推理延迟过高问题。其解决方案的关键在于提出 RF-HiT(Rectified Flow Hierarchical Transformer),该模型采用类 hourglass 结构的 Transformer 主干网络和多尺度层次编码器,结合可学习插值实现跨分辨率特征融合,从而在保持线性计算复杂度的前提下,仅需三步即可完成高效推理;这一设计显著提升了效率-性能平衡,在 ACDC 和 BraTS 2021 数据集上分别达到 91.27% 和 87.40% 的平均 Dice 分数,优于或相当于更复杂的模型。

链接: https://arxiv.org/abs/2604.19570
作者: Ahmed Marouane Djouama,Abir Belaala,Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Cosimo Distante,Abdenour Hadid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.
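As a point of reference for the few-step inference described above, a generic rectified-flow sampler is just plain Euler integration of a learned velocity field from noise (t=0) toward data (t=1). The sketch below illustrates that sampling paradigm only; it is not the RF-HiT implementation, and `velocity_fn` is a hypothetical stand-in for the paper's conditioned transformer.

```python
import numpy as np

def rectified_flow_sample(velocity_fn, x_noise, num_steps=3):
    """Few-step Euler integration of a rectified-flow ODE dx/dt = v(x, t),
    moving from noise at t=0 toward data at t=1."""
    x = x_noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x
```

Because rectified flow trains the velocity toward straight transport paths, even a handful of Euler steps can land close to the target, which is what makes the three-step inference regime above plausible.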

[CV-27] EgoSelf: From Memory to Personalized Egocentric Assistant

【速读】:该论文旨在解决个性化第一人称视角(egocentric)助手在长期用户数据整合方面的挑战,以实现更有效的个性化服务。其关键解决方案在于提出EgoSelf系统,该系统包含基于图结构的交互记忆(graph-based interaction memory)和专门设计的个性化学习任务:前者从历史观察中构建时序与语义关联的交互事件网络,提取用户特定行为模式;后者将个性化建模转化为预测问题,通过图结构中个体用户的历史行为预测未来交互,从而实现精准个性化。

链接: https://arxiv.org/abs/2604.19564
作者: Yanshuo Wang,Yuan Xu,Xuesong Li,Jie Hong,Yizhou Wang,Chang Wen Chen,Wentao Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from an individual user’s historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at this https URL.

[CV-28] Paparazzo: Active Mapping of Moving 3D Objects CVPR

【速读】:该论文旨在解决当前3D地图构建(3D mapping)流程普遍假设环境静态所带来的局限性,即难以准确捕捉和重建移动物体的问题。其解决方案的关键在于提出了一种名为Paparazzo的无学习(learning-free)方法,该方法能够鲁棒地预测目标物体的运动轨迹,并识别出最具信息量的观测视角,从而规划出最优自身路径以实现对移动对象的有效感知与重建。

链接: https://arxiv.org/abs/2604.19556
作者: Davide Allegro,Shiyao Li,Stefano Ghidoni,Vincent Lepetit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

点击查看摘要

Abstract:Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object’s motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target’s trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding. Project page: this https URL

[CV-29] Evaluating Histogram Matching for Robust Deep learning-Based Grapevine Disease Detection

【速读】:该论文旨在解决光照变化对基于深度学习的田间植物病害检测鲁棒性造成显著限制的问题(illumination variability limiting deep learning robustness for field-based plant disease detection)。其关键解决方案是提出一种双阶段融合直方图匹配(Histogram Matching, HM)策略:首先将HM作为预处理步骤用于图像归一化,其次将其作为数据增强技术引入训练过程以控制性地增加多样性。实验表明,该方法在真实田间冠层图像上显著提升了模型鲁棒性,尤其在非均匀冠层样本中效果明显,有效缓解了因光照不一致导致的域偏移问题(domain gap)。

链接: https://arxiv.org/abs/2604.19510
作者: Ruben Pascual,Inés Hernández,Salvador Gutiérrez,Javier Tardaguila,Pedro Melo-Pinto,Daniel Paternain,Mikel Galar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Variability in illumination is a primary factor limiting deep learning robustness for field-based plant disease detection. This study evaluates Histogram Matching (HM), a technique that transforms the pixel intensity distribution of an image to match a reference profile, to mitigate this in grapevine classification, distinguishing among healthy leaves, downy mildew, and spider mite damage. We propose a dual-stage integration of HM: (i) as a preprocessing step for normalization, and (ii) as a data augmentation technique to introduce controlled training variability. Experiments using 1,469 RGB images (comprising homogeneous leaf-focused and heterogeneous canopy samples) to train ResNet-18 models demonstrate that this combination significantly enhances robustness on real-world canopy images. While leaf-focused samples showed marginal gains, the canopy subset improved markedly, indicating that balancing normalization with histogram-based diversification effectively bridges the domain gap caused by uncontrolled lighting.
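For readers unfamiliar with Histogram Matching itself, the classical CDF-matching transform the paper builds on can be sketched in a few lines of NumPy. This is a single-channel illustration of the generic technique only; the study's actual reference-profile choice and augmentation pipeline are not reproduced here.

```python
import numpy as np

def match_histograms_gray(image, reference):
    """Map the intensity distribution of `image` onto that of `reference`
    via the classical CDF-matching rule (single channel)."""
    src_values, src_idx, src_counts = np.unique(
        image.ravel(), return_inverse=True, return_counts=True)
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical CDFs of the source and reference images.
    src_cdf = np.cumsum(src_counts) / image.size
    ref_cdf = np.cumsum(ref_counts) / reference.size

    # For each source quantile, look up the reference intensity at the
    # same quantile; this yields the monotone matching transform.
    matched_values = np.interp(src_cdf, ref_cdf, ref_values)
    return matched_values[src_idx].reshape(image.shape)
```

In practice, libraries such as scikit-image ship a multi-channel version (`skimage.exposure.match_histograms`), which would be the natural building block for both the normalization and augmentation stages described above.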

[CV-30] Seeing Candidates at Scale: Multimodal LLM s for Visual Political Communication on Instagram

【速读】:该论文旨在解决视觉政治传播(Visual Political Communication, VPC)分析中自动化识别政治人物与图像中人数统计的难题,尤其聚焦于社交媒体平台Instagram在2021年德国联邦大选期间的内容。其解决方案的关键在于对比传统计算机视觉模型(FaceNet512、RetinaFace、Google Cloud Vision)与新兴多模态大语言模型(GPT-4o)在上述任务中的性能表现,结果表明GPT-4o在人脸识别和人数计数上均取得更高准确率(宏F1分数分别为0.89和0.86),凸显了先进生成式AI系统在扩展和精细化政治传播视觉内容分析方面的潜力。

链接: https://arxiv.org/abs/2604.19489
作者: Michael Achmann-Denkler,Mario Haim,Christian Wolff
机构: University of Regensburg (雷根斯堡大学); Ludwig-Maximilians-Universität (路德维希马克西米利安慕尼黑大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: An earlier version was presented at #SMSociety 2024 (London)

点击查看摘要

Abstract:This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.
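The macro F1-score reported above is the unweighted mean of per-class F1 scores, so each politician class counts equally regardless of how often it appears. A minimal NumPy version of the metric, for readers who want to reproduce the comparison:

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Macro F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))
```

This is equivalent to scikit-learn's `f1_score(..., average='macro')`, which the study's tooling may or may not have used.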

[CV-31] Deep sprite-based image models: An analysis

【速读】:该论文旨在解决图像集合中重复模式识别这一看似简单但仍未完全解决的问题,尤其聚焦于基于精灵(sprite)的图像分解模型在聚类和图像分解中的应用。其核心挑战在于现有模型虽具一定的可解释性,却需针对特定数据集定制且难以扩展至包含多个对象的图像。解决方案的关键在于深入分析此类模型的设计细节,识别其核心组件,并基于聚类基准进行系统评估,最终提出一种深度精灵图像分解方法:该方法在标准CLEVR基准上性能媲美最先进的无监督类别感知图像分割方法,具备随对象数量线性扩展的能力,能显式识别对象类别,并以高度可解释的方式完整建模图像。

链接: https://arxiv.org/abs/2604.19480
作者: Zeynep Sonat Baltacı,Romain Loiseau,Mathieu Aubry
机构: LIGM, CNRS, Univ Gustave Eiffel, ENPC, Institut Polytechnique de Paris, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While foundation models drive steady progress in image segmentation and diffusion algorithms compose always more realistic images, the seemingly simple problem of identifying recurrent patterns in a collection of images remains very much open. In this paper, we focus on sprite-based image decomposition models, which have shown some promise for clustering and image decomposition and are appealing because of their high interpretability. These models come in different flavors, need to be tailored to specific datasets, and struggle to scale to images with many objects. We dive into the details of their design, identify their core components, and perform an extensive analysis on clustering benchmarks. We leverage this analysis to propose a deep sprite-based image decomposition method that performs on par with state-of-the-art unsupervised class-aware image segmentation methods on the standard CLEVR benchmark, scales linearly with the number of objects, identifies explicitly object categories, and fully models images in an easily interpretable way.

[CV-32] TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation ICLR2026

【速读】:该论文旨在解决从复杂时间描述中生成高质量视频的问题,尤其是当描述包含多个顺序动作时,现有方法面临动作保真度与时间一致性之间的权衡困境:使用多个短提示可提升动作准确性但破坏时间连贯性,而单一复杂提示虽保持一致性却削弱对提示的遵循能力。作者将问题归因于两个关键因素:一是视频内容与提示之间的时间错位,二是运动相关视觉对象与其文本条件间的注意力冲突耦合。解决方案的核心是提出一种无需训练的新型注意力机制——时间分离注意力(Temporal-wise Separable Attention, TS-Attn),该机制通过动态调整注意力分布,实现多事件场景下的时间感知和全局一致性,且可无缝集成至多种预训练文生视频模型中,在不显著增加推理时间的前提下大幅提升生成质量(StoryEval-Bench得分提升最高达33.5%)。

链接: https://arxiv.org/abs/2604.19473
作者: Hongyu Zhang,Yufan Deng,Zilin Pan,Peng-Tao Jiang,Bo Li,Qibin Hou,Zhiyang Dou,Zhen Dong,Daquan Zhou
机构: Peking University, Shenzhen Graduate School (北京大学深圳研究生院); Zhejiang University (浙江大学); Nankai University (南开大学); Massachusetts Institute of Technology (麻省理工学院); Nanjing University (南京大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026, code available at: this https URL

点击查看摘要

Abstract:Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at this https URL.

[CV-33] LoViF 2026 Challenge on Real-World All-in-One Image Restoration: Methods and Results CVPR

【速读】:该论文旨在解决真实世界中多类退化场景下的统一图像复原问题,即在模糊、低光照、雾霾、雨雪等复杂且混合的退化条件下,实现高效、鲁棒的图像恢复。其解决方案的关键在于构建一个统一的基准框架(LoViF Challenge),用于评估模型在多种退化类型下的泛化能力与鲁棒性,并通过收集来自全球124名参赛者中的9个有效提交方案,系统分析当前先进方法的有效性,从而为真实世界低层视觉任务提供可比较的性能基准和研究方向指引。

链接: https://arxiv.org/abs/2604.19445
作者: Xiang Chen,Hao Li,Jiangxin Dong,Jinshan Pan,Xin Li,Xin He,Naiwei Chen,Shengyuan Li,Fengning Liu,Haoyi Lv,Haowei Peng,Yilian Zhong,Yuxiang Chen,Shibo Yin,Yushun Fang,Xilei Zhu,Yahui Wang,Chen Lu,Kaibin Chen,Xu Zhang,Xuhui Cao,Jiaqi Ma,Ziqi Wang,Shengkai Hu,Yuning Cui,Huan Zhang,Shi Chen,Bin Ren,Lefei Zhang,Guanglu Dong,Qiyao Zhao,Tianheng Zheng,Chunlei Li,Lichao Mou,Chao Ren,Wangzhi Xing,Xin Lu,Enxuan Gu,Jingxi Zhang,Diqi Chen,Qiaosi Yi,Bingcai Wei,Mingyu Liu,Pengyu Wang,Ce Liu,Miaoxin Guan,Boyu Chen,Hongyu Li,Jian Zhu,Xinrui Luo,Ziyang He,Jiayu Wang,Yichen Xiang,Huayi Qi,Haoyu Bian,Yiran Li,Sunlichen Zhou
机构: Nanjing University of Science and Technology; Naval Aviation University; Xiaohongshu Inc; Fujian Normal University; Quanzhou Institute of Equipment Manufacturing, Chinese Academy of Sciences; Wuhan University; Sichuan University; Griffith University; Sensetime; Dalian University of Technology; Massey University; Hong Kong Polytechnic University; Technical University of Munich; Guangdong University of Technology; University of Electronic Science and Technology of China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR Workshops 2026; this https URL

点击查看摘要

Abstract:This paper presents a review for the LoViF Challenge on Real-World All-in-One Image Restoration. The challenge aimed to advance research on real-world all-in-one image restoration under diverse real-world degradation conditions, including blur, low-light, haze, rain, and snow. It provided a unified benchmark to evaluate the robustness and generalization ability of restoration models across multiple degradation categories within a common framework. The competition attracted 124 registered participants and received 9 valid final submissions with corresponding fact sheets, significantly contributing to the progress of real-world all-in-one image restoration. This report provides a detailed analysis of the submitted methods and corresponding results, emphasizing recent progress in unified real-world image restoration. The analysis highlights effective approaches and establishes a benchmark for future research in real-world low-level vision.

[CV-34] DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval CVPR2026

【速读】:该论文旨在解决开放集三维物体检索(open-set 3D object retrieval, 3DOR)中因模型对已知类别过拟合而导致的泛化能力不足问题,尤其是在使用基于CLIP的多视图特征提取方法时缺乏细粒度区分能力。解决方案的关键在于提出DINO Eats CLIP (DEC)框架,其核心创新包括:1)采用冻结的DINO编码器进行多视图特征提取,并引入分块与适应模块(Chunking and Adapting Module, CAM),通过动态整合局部视图关系提升特征鲁棒性;2)设计虚拟特征合成(Virtual Feature Synthesis, VFS)模块,利用CLIP预对齐的视觉-语言空间为未见类别生成虚拟特征,从而缓解模型对已知类别的偏好,增强开放集下的判别能力。

链接: https://arxiv.org/abs/2604.19432
作者: Xinwei He,Yansong Zheng,Qianru Han,Zhichuan Wang,Yuxuan Cai,Yang Zhou,Jingbo Xia,Yulong Wang,Jinhai Xiang,Xiang Bai
机构: Huazhong Agricultural University (华中农业大学); Huazhong University of Science and Technology (华中科技大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP’s strong generalization ability, its lack of fine-grainedness prompted us to explore the potential of a more recent self-supervised encoder-DINO. To address this, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat it, we then design a module named Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP’s broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.
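The mean-pooling baseline the abstract mentions — averaging per-view embeddings from a frozen backbone into one descriptor — is simple enough to sketch directly. Below is a toy NumPy version with cosine-similarity retrieval, where random features stand in for the frozen-DINO view embeddings (an illustration of the baseline only, not the full DEC pipeline).

```python
import numpy as np

def build_3d_descriptor(view_features):
    """Mean-pool per-view embeddings (V, D) into one 3D descriptor (D,),
    then L2-normalize it for cosine-similarity retrieval."""
    desc = view_features.mean(axis=0)
    return desc / np.linalg.norm(desc)

def retrieve(query_views, gallery_views_list):
    """Rank gallery objects by cosine similarity to the query descriptor
    and return the index of the best match."""
    q = build_3d_descriptor(query_views)
    sims = [q @ build_3d_descriptor(g) for g in gallery_views_list]
    return int(np.argmax(sims))
```

The paper's CAM module replaces the `mean(axis=0)` step with chunk-wise, dynamically weighted integration, precisely because this static average overfits to known-class view patterns.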

[CV-35] TESO: Online Tracking of Essential Matrix by Stochastic Optimization CVPR2026

【速读】:该论文旨在解决立体相机校准参数在长时间运行中精度下降的问题,这对自主系统感知的稳定性至关重要。解决方案的关键在于提出一种名为在线追踪本质矩阵的随机优化方法(Online Tracking of Essential Matrix by Stochastic Optimization, TESO),其核心机制包括:基于核相关性构建的鲁棒损失函数,用于处理初步匹配点中的异常值;以及在本质矩阵流形上的自适应在线随机优化策略,实现对校准参数的实时更新。TESO具有低计算和内存开销、超参少、无需数据驱动训练等优势,适用于资源受限的在线感知系统,并在多个数据集上验证了其高精度与无偏性。

链接: https://arxiv.org/abs/2604.19420
作者: Jaroslav Moravec,Radim Šára,Akihiro Sugimoto
机构: Czech Technical University in Prague (布拉格捷克技术大学); National Institute of Informatics (日本信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 (Oral)

点击查看摘要

Abstract:Maintaining long-term accuracy of stereo camera calibration parameters is important for autonomous systems’ perception. This work proposes Online Tracking of Essential Matrix by Stochastic Optimization (TESO). The core mechanisms of TESO are: 1) a robust loss function based on kernel correlation over tentative correspondences, 2) an adaptive online stochastic optimization on the essential manifold. TESO has low CPU and memory requirements, relies on a few hyperparameters, and eliminates the need for data-driven training, enabling the usage in resource-constrained online perception systems. We evaluated the influence of TESO on geometric precision, rectification quality, and stereo depth consistency. On the large-scale MAN TruckScenes dataset, TESO tracks rotational calibration drift with 0.12 deg precision in the Y-axis (critical for stereo accuracy) while the X- and Z-axes are five times more precise. Tracking applied to sequences with simulated drift shows similar precision with respect to the reference as tracking applied to no-drift sequences, indicating the tracker is unbiased. On the KITTI dataset, TESO revealed systematic inconsistencies in extrinsic parameters across stereo pairs, confirming previous published findings. We verified that intrinsic decalibration affected these errors, as evidenced by the conflicting behavior of the rectification and depth metrics. After correcting the reference calibration, TESO improved its rotation precision around the Y-axis 20 times to 0.025 deg and its depth accuracy 50 times. Despite its lightweight design, direct optimization of the proposed TESO loss function alone achieves accuracy comparable to that of neural network-based single-frame methods.

[CV-36] GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes

【速读】:该论文旨在解决如何从车辆自身传感器(ego-centric sensors)中学习几何一致、以场景为中心的鸟瞰图(Bird’s-eye-view, BEV)语义环境地图(包括动态目标)的问题,尤其是在缺乏密集人工标注的情况下。传统方法依赖于仅从车载传感器进行BEV标注,存在标签模糊性和时间不一致性问题,尤其在动态交通参与者(如移动车辆和行人)的建模上表现不佳。解决方案的关键在于利用训练阶段的时间同步航空影像(aerial imagery)作为监督信号,通过BEV对齐的航空图像裁剪提供直观且密集的语义标注,显著降低人工标注成本并提升标注准确性;同时,借助域适应的航空教师模型生成BEV伪标签,并结合可选的伪航空BEV重建任务增强模型的可解释性与鲁棒性,最终通过学习从车载传感器合成伪航空BEV图像,实现对未覆盖区域的扩展建模,支持轻量级人工标注与不确定性感知的伪标签策略。

链接: https://arxiv.org/abs/2604.19411
作者: Joshua Niemeijer,Alaa Eddine Ben Zekri,Reza Bahmanyar,Philipp M. Schmälzle,Houda Chaabouni-Chouayakh,Franz Kurz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird’s-eye-view (BEV) semantic environment maps-including dynamic agents-from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.

[CV-37] HP-Edit: A Human-Preference Post-Training Framework for Image Editing CVPR2026

【速读】:该论文旨在解决如何高效地将人类偏好对齐(Human Preference Alignment)应用于基于扩散模型的图像编辑任务中,当前面临的主要挑战是缺乏适用于多样化编辑需求的可扩展人类偏好数据集和专用框架。解决方案的关键在于提出HP-Edit框架,其核心创新包括:利用少量人工评分数据与预训练视觉大语言模型(VLM)构建自动化的、符合人类偏好的评估器HP-Scorer;该评估器不仅用于高效构建大规模偏好数据集,还作为强化学习中的奖励函数,用于后训练阶段优化编辑模型性能。此外,作者还引入RealPref-50K真实世界数据集和RealPref-Bench基准,以系统性地验证方法的有效性。

链接: https://arxiv.org/abs/2604.19406
作者: Fan Li,Chonghuinan Wang,Lina Lei,Yuping Qiu,Jiaqi Xu,Jiaxiu Jiang,Xinran Qin,Zhikai Chen,Fenglong Song,Zhixin Wang,Renjing Pei,Wangmeng Zuo
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Harbin Institute of Technology (哈尔滨工业大学); Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR2026

点击查看摘要

Abstract:Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer–an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.

[CV-38] VecHeart: Holistic Four-Chamber Cardiac Anatomy Modeling via Hybrid VecSets

【速读】:该论文旨在解决心脏解剖建模中难以准确捕捉多腔室结构间复杂相互关系的问题,尤其是现有前馈隐式方法在单对象建模上的局限性以及对器官间关联建模的忽视。其解决方案的关键在于提出VecHeart框架,通过引入混合部件Transformer(Hybrid Part Transformer),利用部件特定的可学习查询(part-specific learnable queries)和交错注意力机制(interleaved attention),有效建模四腔心结构间的复杂依赖关系;同时结合解剖完整性掩码(Anatomical Completion Masking)与模态对齐策略(Modality Alignment),使模型能够从部分、稀疏或噪声观测中推断完整的心脏结构,即使某些解剖部位缺失也能完成重建,从而实现高保真度的四腔心结构重建与生成。

链接: https://arxiv.org/abs/2604.19403
作者: Yihong Chen,Pascal Fua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate cardiac anatomy modeling requires the model to be able to handle intricate interrelations among structures. In this paper, we propose VecHeart, a unified framework for holistic reconstruction and generation of four-chamber cardiac structures. To overcome the limitations of current feed-forward implicit methods, specifically their restriction to single-object modeling and their neglect of inter-part correlations, we introduce Hybrid Part Transformer, which leverages part-specific learnable queries and interleaved attention to capture complex inter-chamber dependencies. Furthermore, we propose Anatomical Completion Masking and Modality Alignment strategies, enabling the model to infer complete four-chamber structures from partial, sparse, or noisy observations, even when certain anatomical parts are entirely missing. VecHeart also seamlessly extends to 3D+t dynamic mesh sequence generation, demonstrating exceptional versatility. Experiments show that our method achieves state-of-the-art performance, maintaining high-fidelity reconstruction across diverse challenging scenarios. Code will be released.

[CV-39] HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition CVPR2026

【速读】:该论文旨在解决卫星图像合成(Satellite Image Composition)中跨域差异导致的辐射特性不一致问题,尤其是在数据增强、灾害模拟和城市规划等遥感应用中,如何实现高质量且语义一致的图像融合。其解决方案的关键在于提出了一种无需训练的扩散模型框架 HarmoniDiff-RS,核心创新包括:通过潜在空间均值偏移(Latent Mean Shift)操作迁移源域与目标域之间的辐射特征以实现域对齐;引入分阶段潜在融合策略(Timestep-wise Latent Fusion),利用早期反演潜在表示增强谐调性、晚期潜在表示保障语义一致性,从而生成一组候选合成图像;并设计一个轻量级和谐分类器自动筛选最优结果。该方法在新构建的 RSIC-H 基准数据集上验证了有效性,展现出在大规模遥感合成与仿真任务中的潜力。

链接: https://arxiv.org/abs/2604.19392
作者: Xiaoqi Zhuang,Jefersson A. Dos Santos,Jungong Han
机构: The University of Sheffield (谢菲尔德大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures, CVPR 2026 findings. Code is available at this https URL

点击查看摘要

Abstract:Satellite image composition plays a critical role in remote sensing applications such as data augmentation, disaster simulation, and urban planning. We propose HarmoniDiff-RS, a training-free diffusion-based framework for harmonizing composite satellite images under diverse domain conditions. Our method aligns the source and target domains through a Latent Mean Shift operation that transfers radiometric characteristics between them. To balance harmonization and content preservation, we introduce a Timestep-wise Latent Fusion strategy by leveraging early inverted latents for high harmonization and late latents for semantic consistency to generate a set of composite candidates. A lightweight harmony classifier is trained to further automatically select the most coherent result among them. We also construct RSIC-H, a benchmark dataset for satellite image harmonization derived from fMoW, providing 500 paired composition samples. Experiments demonstrate that our method effectively performs satellite image composition, showing strong potential for scalable remote-sensing synthesis and simulation tasks. Code is available at: this https URL.
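该摘要提到的 Latent Mean Shift 核心思想是在潜在空间中把源域的辐射统计量迁移到目标域。下面给出一个极简示意(并非论文的精确实现,具体操作细节为假设):按通道对齐一阶和二阶统计量,类似 AdaIN 风格的特征迁移。

```python
import numpy as np

def latent_mean_shift(src, tgt):
    """Toy stand-in for a latent mean-shift operation (assumed form):
    match per-channel mean and std of the source latents to the target.
    src, tgt: arrays of shape (C, H, W)."""
    src_mu = src.mean(axis=(1, 2), keepdims=True)
    src_sd = src.std(axis=(1, 2), keepdims=True) + 1e-8
    tgt_mu = tgt.mean(axis=(1, 2), keepdims=True)
    tgt_sd = tgt.std(axis=(1, 2), keepdims=True)
    # Whiten the source per channel, then re-color with target statistics.
    return (src - src_mu) / src_sd * tgt_sd + tgt_mu
```

迁移后,合成区域的每通道均值与目标域一致,从而缓解辐射特性差异。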

[CV-40] Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

【速读】:该论文旨在解决组合图像检索(Composed Image Retrieval, CIR)中因噪声三元组对应关系(Noisy Triplet Correspondence, NTC)导致的模型性能下降问题,尤其是现有鲁棒学习方法依赖“小损失假设”而失效的问题。其核心挑战在于NTC中的语义模糊性(如部分匹配)使噪声难以识别,进而引发模型陷入自依赖的恶性循环,造成“表示污染”。解决方案的关键在于提出一种名为Air-Know的“专家-代理-分流”解耦范式:通过外部先验仲裁(EPA)利用多模态大语言模型(MLLMs)构建高精度锚点数据集;借助专家知识内化(EKI)使轻量级代理“仲裁者”学习专家判别逻辑;并采用双流协调(DSR)机制基于匹配置信度分流训练数据,分离出纯净对齐流与表示反馈协调流,从而打破噪声干扰的闭环,实现鲁棒的CIR性能提升。

链接: https://arxiv.org/abs/2604.19386
作者: Zhiheng Fu,Yupeng Hu,Qianyun Yang,Shiqi Zhang,Zhiwei Chen,Zixu Li
机构: Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the “small loss hypothesis”, but the unique semantic ambiguity in NTC, such as “partial matching”, invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self-dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic “representation pollution”. To address this critical challenge, we propose a novel “Expert-Proxy-Diversion” decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) External Prior Arbitration (EPA), which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high-precision anchor dataset; (2) Expert Knowledge Internalization (EKI), which efficiently guides a lightweight proxy “arbiter” to internalize the expert’s discriminative logic; (3) Dual Stream Reconciliation (DSR), which leverages the EKI’s matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.
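摘要批评的“小损失假设”(small loss hypothesis)本身很简单:把一个 batch 中损失最小的一部分样本视为干净样本。下面是该传统做法的最小示意(仅为说明被 Air-Know 质疑的基线机制,比例参数为假设):

```python
import numpy as np

def small_loss_select(losses, keep_ratio=0.7):
    """Conventional small-loss filtering that Air-Know argues fails under NTC:
    treat the keep_ratio fraction of samples with the smallest loss as clean.
    Returns a boolean mask over the batch (True = presumed clean)."""
    losses = np.asarray(losses, dtype=float)
    k = max(1, int(round(keep_ratio * len(losses))))
    thresh = np.sort(losses)[k - 1]  # k-th smallest loss as the cut-off
    return losses <= thresh
```

在“部分匹配”造成语义模糊时,干净样本也可能产生较大损失,因此这种基于损失排序的筛选会失效,这正是该论文引入外部专家仲裁的动机。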

[CV-41] PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving CVPR

【速读】:该论文旨在解决多模态3D全景分割(mm-3DPS)在域偏移(domain shift)下泛化能力不足的问题,尤其针对自动驾驶场景中因光照、天气或传感器差异导致某一模态(如LiDAR或RGB)性能下降时模型鲁棒性差的挑战。现有基于伪标签(pseudo-labeling)的无监督域自适应(UDA)方法虽能为未标注目标域数据提供监督信号,但受限于对跨模态互补性的强依赖以及仅保留高置信度区域所引发的掩码碎片化问题,难以满足全景分割对完整对象语义覆盖的需求。其解决方案的关键在于提出首个专为mm-3DPS设计的UDA框架PanDA:一是引入非对称多模态增强策略,通过选择性丢弃区域模拟域偏移以提升单传感器退化下的表征鲁棒性;二是设计双专家伪标签精炼模块(dual-expert pseudo-label refinement module),从2D和3D模态中提取域不变先验信息,从而增强伪标签的完整性与可靠性,显著优于当前3D语义分割的UDA基线方法。

链接: https://arxiv.org/abs/2604.19379
作者: Yining Pan,Shijie Li,Yuchen Wu,Xulei Yang,Na Zhao
机构: Singapore University of Technology and Design; Institute for Infocomm Research (I2R), A*STAR, Singapore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2026

点击查看摘要

Abstract:This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, show that PanDA significantly surpasses state-of-the-art UDA baselines for 3D semantic segmentation.

[CV-42] IonMorphNet: Generalizable Learning of Ion Image Morphologies for Peak Picking in Mass Spectrometry Imaging CVPR

【速读】:该论文旨在解决质谱成像(Mass Spectrometry Imaging, MSI)中峰点提取(peak picking)这一基础预处理步骤的泛化性差问题,现有方法通常依赖于数据集特定的超参数调优,难以适应不同的采集协议。其解决方案的关键在于提出IonMorphNet——一种空间结构感知的离子图像表示模型,通过在53个公开MSI数据集上定义六类代表性空间模式进行无监督训练,使模型能够自动识别离子图像中的结构特征并执行无需额外调参的峰点提取。该方法利用标准图像骨干网络(如ConvNeXt V2-Tiny)实现跨数据集的高性能峰点检测(mSCF1提升+7%),同时进一步证明了空间信息引导的通道压缩可有效支持基于patch的肿瘤分类任务,在三个肿瘤分类任务中达到或超过像素级光谱分类器的性能(Balanced Accuracy提升达+7.3%),验证了所提取离子图像的空间语义价值。

链接: https://arxiv.org/abs/2604.19369
作者: Philipp Weigand,Niels Nawrot,Nikolas Ebert,Carsten Hopf,Oliver Wasenmüller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2026

点击查看摘要

Abstract:Peak picking is a fundamental preprocessing step in Mass Spectrometry Imaging (MSI), where each sample is represented by hundreds to thousands of ion images. Existing approaches require careful dataset-specific hyperparameter tuning, and often fail to generalize across acquisition protocols. We introduce IonMorphNet, a spatial-structure-aware representation model for ion images that enables fully data-driven peak picking without any task-specific supervision. We curate 53 publicly available MSI datasets and define six structural classes capturing representative spatial patterns in ion images to train standard image backbones for structural pattern classification. Once trained, IonMorphNet can assess ion images and perform peak picking without additional hyperparameter tuning. Using a ConvNeXt V2-Tiny backbone, our approach improves peak picking performance by +7 % mSCF1 compared to state-of-the-art methods across multiple datasets. Beyond peak picking, we demonstrate that spatially informed channel reduction enables a 3D CNN for patch-based tumor classification in MSI. This approach matches or exceeds pixel-wise spectral classifiers by up to +7.3 % Balanced Accuracy on three tumor classification tasks, indicating meaningful ion image selection. The source code and model weights are available at this https URL.

[CV-43] Detection of T-shirt Presentation Attacks in Face Recognition Systems

【速读】:该论文旨在解决人脸识别系统在面对新型展示攻击(presentation attacks)时的安全性问题,特别是针对一种新型的T-shirt展示攻击(T-shirt Face Presentation Attack, TFPA),此类攻击利用印有目标人脸图像的T恤进行欺骗,可能绕过现有防伪检测机制。解决方案的关键在于提出一种基于空间一致性检查的检测方法:通过结合先进的面部和人体检测器,分析检测到的人脸与人体位置关系,从而可靠地识别出由T恤造成的异常空间分布,实现对这类新型攻击的有效检测。

链接: https://arxiv.org/abs/2604.19365
作者: Mathias Ibsen,Loris Tim Ide,Christian Rathgeb,Christoph Busch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face recognition systems are often used for biometric authentication. Nevertheless, it is known that without any protective measures, face recognition systems are vulnerable to presentation attacks. To tackle this security problem, methods for detecting presentation attacks have been developed and shown good detection performance on several benchmark datasets. However, generalising presentation attack detection methods to new and novel types of attacks is an ongoing challenge. In this work, we employ 1,608 T-shirt attacks of the T-shirt Face Presentation Attack (TFPA) database using 100 unique presentation attack instruments together with 152 bona fide presentations. In a comprehensive evaluation, we show that this type of attack can compromise the security of face recognition systems. Furthermore, we propose a detection method based on spatial consistency checks in order to detect said T-shirt attacks. Precisely, state-of-the-art face and person detectors are combined to analyse the spatial positions of detected faces and persons based on which T-shirt attacks can be reliably detected.
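摘要所述的空间一致性检查可以用一个简单的几何规则来说明:真实人脸应位于人体检测框的上部,而印在 T 恤上的人脸会落在躯干区域。下面是一个假设性的启发式示意(阈值与具体规则均为假设,并非论文的精确判据):

```python
def plausible_face_position(face_box, person_box, max_rel_center=0.35):
    """Hypothetical spatial-consistency check for T-shirt attack detection:
    a genuine face's vertical center should lie in the upper part of its
    person box; a printed face lands near the torso.
    Boxes are (x1, y1, x2, y2) with y growing downward."""
    px1, py1, px2, py2 = person_box
    fx1, fy1, fx2, fy2 = face_box
    face_cy = (fy1 + fy2) / 2.0
    # Relative vertical position of the face center inside the person box:
    # 0.0 = top of the person, 1.0 = bottom.
    rel = (face_cy - py1) / max(py2 - py1, 1e-6)
    return rel <= max_rel_center
```

当返回 False 时,可将该人脸标记为疑似 T 恤展示攻击,再交由后续模块处理。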

[CV-44] Attend what matters: Leverag ing vision foundational models for breast cancer classification using mammograms

【速读】:该论文旨在解决Vision Transformer (ViT) 在医学影像辅助诊断任务中表现受限的问题,尤其是在乳腺癌筛查的乳房X光图像分类任务中。其核心挑战包括:(1)医学图像分辨率高、异常区域小,导致token数量过多,使基于softmax的注意力机制难以精准定位相关区域;(2)医学图像分类具有细粒度特性,类间差异小、类内差异大,标准交叉熵训练难以实现有效区分。为此,作者提出一个包含三个关键组件的框架:(1)利用目标检测模型引导的感兴趣区域(Region of Interest, RoI)进行token压缩,提升注意力聚焦能力;(2)通过RoI间的对比学习(基于难负样本训练)增强细粒度判别力;(3)采用预训练的DINOv2 ViT模型替代全局CLIP特征,以获取更具定位感知和细粒度信息的特征表示。实验证明该方法在公开乳房X光数据集上显著优于现有基线,展现出临床大规模筛查的应用潜力。

链接: https://arxiv.org/abs/2604.19350
作者: Samyak Sanghvi,Piyush Miglani,Sarvesh Shashikumar,Kaustubh R Borgavi,Veenu Singla,Chetan Arora
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViT) have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest (RoI) based token reduction using an object detection model to guide attention; (2) contrastive learning between selected RoIs to enhance fine-grained discrimination through hard-negative based training; and (3) a DINOv2-pretrained ViT that captures localization-aware, fine-grained features instead of global CLIP representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: this https URL

[CV-45] RAFT-MSF: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow

【速读】:该论文旨在解决单目场景流估计(monocular scene flow estimation)中现有方法普遍局限于两帧输入、导致时序建模能力不足及对遮挡区域鲁棒性差的问题。其核心解决方案是提出一种自监督的多帧框架RAFT-MSF++,通过递归融合时序特征来联合估计深度和场景流;关键创新在于引入几何-运动特征(Geometry-Motion Feature, GMF),该特征紧凑编码耦合的运动与几何线索,并在迭代过程中持续更新以实现有效的时序推理。此外,为提升GMF在遮挡区域的信息传播能力,还设计了相对位置注意力机制以注入空间先验信息,并结合遮挡正则化模块从可见区域传播可靠运动信息,从而显著增强模型在模糊区域的鲁棒性。

链接: https://arxiv.org/abs/2604.19349
作者: Xunpei Sun,Zuoxun Hou,Yi Chang,Gang Chen,Wei-Shi Zheng
机构: Sun Yat-sen University (中山大学); Beijing Institute of Space Mechanics and Electricity (北京空间机电研究所); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Monocular scene flow estimation aims to recover dense 3D motion from image sequences, yet most existing methods are limited to two-frame inputs, restricting temporal modeling and robustness to occlusions. We propose RAFT-MSF++, a self-supervised multi-frame framework that recurrently fuses temporal features to jointly estimate depth and scene flow. Central to our approach is the Geometry-Motion Feature (GMF), which compactly encodes coupled motion and geometry cues and is iteratively updated for effective temporal reasoning. To ensure the robustness of this temporal fusion against occlusions, we incorporate relative positional attention to inject spatial priors and an occlusion regularization module to propagate reliable motion from visible regions. These components enable the GMF to effectively propagate information even in ambiguous areas. Extensive experiments show that RAFT-MSF++ achieves 24.14% SF-all on the KITTI Scene Flow benchmark, with a 30.99% improvement over the baseline and better robustness in occluded regions. The code is available at this https URL.

[CV-46] Geometry-Guided Self-Supervision for Ultra-Fine-Grained Recognition with Limited Data

【速读】:该论文旨在解决数据受限场景下超细粒度视觉分类(Ultra-FGVC)任务中难以区分高度相似物体的问题。其核心挑战在于,传统方法往往忽略物体间细微但关键的几何特征差异,导致分类性能受限。解决方案的关键在于提出一种通用的自监督框架——几何属性探索网络(Geometric Attribute Exploration Network, GAEor),通过从骨干网络获取视觉反馈来增强与几何相关的细节信息,并将这些细节的相对极坐标嵌入最终特征表示中,从而提取出类别特异性的几何属性作为新的识别线索。实验表明,GAEor 在五个主流 Ultra-FGVC 基准上均取得了显著优于现有方法的性能。

链接: https://arxiv.org/abs/2604.19345
作者: Shijie Wang,Yadan Luo,Zijian Wang,Haojie Li,Zi Huang,Mahsa Baktashmotlagh
机构: Shandong University of Science and Technology (山东科技大学); The University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper investigates the intrinsic geometrical features of highly similar objects and introduces a general self-supervised framework called the Geometric Attribute Exploration Network (GAEor), which is designed to address the ultra-fine-grained visual categorization (Ultra-FGVC) task in data-limited scenarios. Unlike prior work that often captures subtle yet critical distinctions, GAEor generates geometric attributes as novel alternative recognition cues. These attributes are determined by various details within the object, aligned with its geometric patterns, such as the intricate vein structures in soybean leaves. Crucially, each category exhibits distinct geometric descriptors that serve as powerful cues, even among objects with minimal visual variation – a factor largely overlooked in recent research. GAEor discovers these geometric attributes by first amplifying geometry-relevant details via visual feedback from a backbone network, then embedding the relative polar coordinates of these details into the final representation. Extensive experiments demonstrate that GAEor sets new state-of-the-art records on five widely-used Ultra-FGVC benchmarks.
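摘要提到把物体细节的“相对极坐标”嵌入最终表示。以细节点相对其质心的极坐标为例,可以这样示意(坐标约定为假设,非论文的具体实现):

```python
import numpy as np

def relative_polar_coords(points):
    """Encode 2D detail locations as polar coordinates relative to their
    centroid, the kind of geometric attribute the abstract describes.
    points: (N, 2) array of (x, y). Returns (N, 2) array of (r, theta)."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    d = points - center
    r = np.hypot(d[:, 0], d[:, 1])        # radial distance to centroid
    theta = np.arctan2(d[:, 1], d[:, 0])  # angle in radians
    return np.stack([r, theta], axis=1)
```

这样的 (r, θ) 描述对平移不变,可作为与外观特征互补的几何线索拼接进最终表示。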

[CV-47] Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data

【速读】:该论文旨在解决超细粒度视觉分类(Ultra-FGVC)中因训练样本有限而导致的判别性全局特征(如极相似品种间的叶片轮廓)难以建模的问题。当前方法往往忽视了这类复杂形态结构的全局线索,从而限制了识别性能。解决方案的关键在于提出一种分而治之的全局认知网络(DHCNet),其核心机制是将全局线索分解为空间关联的细微差异,并通过逐步分析局部区域的微小变化(从较小局部块到较大局部块)来构建全局认知过程。该方法利用未受扰动的局部区域引导对打乱后局部块拓扑结构的感知,从而建立差异的空间关联;同时,在训练过程中在线优化这些从局部区域提取的全局线索,并将其作为监督信号用于微调识别模型参数,增强模型对全局特征的敏感性,显著降低对大规模标注数据的依赖。

链接: https://arxiv.org/abs/2604.19339
作者: Shijie Wang,Zijian Wang,Yadan Luo,Haojie Li,Zi Huang,Mahsa Baktashmotlagh
机构: Shandong University of Science and Technology (山东科技大学); The University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultra-fine-grained visual categorization (Ultra-FGVC) aims to classify highly similar subcategories within fine-grained objects using limited training samples. However, holistic yet discriminative cues, such as leaf contours in extremely similar cultivars, remain under-explored in current studies, thereby limiting recognition performance. Though crucial, modeling holistic cues with complex morphological structures typically requires massive training samples, posing significant challenges in data-limited scenarios. To address this challenge, we propose a novel Divide-and-Conquer Holistic Cognition Network (DHCNet) that implements a divide-and-conquer strategy by decomposing holistic cues into spatially-associated subtle discrepancies and progressively establishing the holistic cognition process, significantly simplifying holistic cognition while reducing dependency on training data. Technically, DHCNet begins by progressively analyzing subtle discrepancies, transitioning from smaller local patches to larger ones using a self-shuffling operation on local regions. Simultaneously, it leverages the unaffected local regions to potentially guide the perception of the original topological structure among the shuffled patches, thereby aiding in the establishment of spatial associations for these discrepancies. Additionally, DHCNet incorporates the online refinement of these holistic cues discovered from local regions into the training process to iteratively improve their quality. As a result, DHCNet uses these holistic cues as supervisory signals to fine-tune the parameters of the recognition model, thus improving its sensitivity to holistic cues across the entire objects. Extensive evaluations demonstrate that DHCNet achieves remarkable performance on five widely-used Ultra-FGVC datasets.
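摘要中的 self-shuffling 操作可以用“保留部分局部块不动、打乱其余块”来示意:未受扰动的区域为恢复原拓扑提供锚点。下面是一个极简草图(保留比例等细节为假设):

```python
import numpy as np

def shuffle_patches(img, patch, rng, keep_ratio=0.5):
    """Sketch of a self-shuffling augmentation (details assumed): split an
    image into patch x patch tiles, keep a random subset in place, and
    permute the remaining tiles. img: (H, W), H and W divisible by patch."""
    h, w = img.shape
    gh, gw = h // patch, w // patch
    # Cut the image into a flat list of tiles.
    tiles = img.reshape(gh, patch, gw, patch).swapaxes(1, 2).reshape(gh * gw, patch, patch)
    idx = np.arange(gh * gw)
    n_move = int((1 - keep_ratio) * gh * gw)
    movable = rng.choice(gh * gw, size=n_move, replace=False)
    idx[movable] = rng.permutation(movable)  # permute only the movable tiles
    tiles = tiles[idx]
    # Reassemble the tiles into an image.
    return tiles.reshape(gh, gw, patch, patch).swapaxes(1, 2).reshape(h, w)
```

打乱只重排局部块而不改变像素内容,因此类别相关的细微差异仍然保留,只是其空间排列被扰动。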

[CV-48] Silicon Aware Neural Networks

【速读】:该论文旨在解决如何将基于深度学习的可微分逻辑门网络(Differentiable Logic Gate Networks, DLGNs)高效地映射到定制硅芯片上,以实现高速、低功耗的图像分类任务。其核心挑战在于如何在保持模型性能的同时,优化硬件实现的面积和功耗。解决方案的关键在于提出一种一对一映射方法,将训练好的DLGN模型转换为数字CMOS标准单元库的门级网表,并设计了一种新颖的损失函数,通过最小化每个神经元的预期面积来优化电路面积,从而间接降低功耗。此外,论文首次在仿真中实现了DLGN作为硅电路的布局,在SkyWater 130nm工艺下完成定制硬宏设计并进行后版图功耗分析,验证了该方案在MNIST数据集上以97%准确率每秒执行4180万次分类,功耗仅为83.88 mW的可行性与高效性。

链接: https://arxiv.org/abs/2604.19334
作者: Sebastian Fieldhouse,Kea-Tiong Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Recent work in the machine learning literature has demonstrated that deep learning can train neural networks made of discrete logic gate functions to perform simple image classification tasks at very high speeds on CPU, GPU and FPGA platforms. By virtue of being formed by discrete logic gates, these Differentiable Logic Gate Networks (DLGNs) lend themselves naturally to implementation in custom silicon - in this work we present a method to map DLGNs in a one-to-one fashion to a digital CMOS standard cell library by converting the trained model to a gate-level netlist. We also propose a novel loss function whereby the DLGN can optimize the area, and indirectly power consumption, of the resulting circuit by minimizing the expected area per neuron based on the area of the standard cells in the target standard cell library. Finally, we also show for the first time an implementation of a DLGN as a silicon circuit in simulation, performing layout of a DLGN in the SkyWater 130nm process as a custom hard macro using a Cadence standard cell library and performing post-layout power analysis. We find that our custom macro can perform classification on MNIST with 97% accuracy 41.8 million times a second at a power consumption of 83.88 mW.
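摘要提出的面积损失思想是:每个神经元对候选逻辑门保持一个(softmax)分布,其“期望面积”即各门概率与对应标准单元面积的加权和。下面是一个示意性实现(损失的具体形式为假设,单元面积为虚构数值):

```python
import numpy as np

def expected_area_loss(gate_logits, cell_areas):
    """Expected-area regularizer sketch: gate_logits has shape
    (neurons, gates); cell_areas gives the standard-cell area of each
    candidate gate. Returns the mean expected area per neuron."""
    # Numerically stable softmax over the gate dimension.
    z = gate_logits - gate_logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Probability-weighted area per neuron, averaged over the network.
    return float((p @ np.asarray(cell_areas, dtype=float)).mean())
```

将该项加入训练目标,可使网络在保持精度的同时偏向选择面积更小的门,从而间接降低功耗。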

[CV-49] PLaMo 2.1-VL Technical Report

【速读】:该论文旨在解决轻量化视觉语言模型(Vision Language Model, VLM)在本地和边缘设备部署时,因计算资源受限而难以实现高效、精准的多模态理解任务的问题,尤其聚焦于日语环境下视觉问答(Visual Question Answering, VQA)与视觉定位(Visual Grounding)能力的优化。解决方案的关键在于构建一个面向边缘场景的轻量级模型架构——PLaMo 2.1-VL(含8B和2B参数版本),并配套开发大规模合成数据生成流水线及全面的日语训练与评估资源,从而在工厂工具识别任务和基础设施异常检测两个真实应用场景中实现高精度零样本推理与微调性能提升,显著优于同类开源模型。

链接: https://arxiv.org/abs/2604.19324
作者: Tommi Kerola,Yuya Masuda,Takashi Masuko,Toshiki Nakanishi,Daisuke Nishino,Kuniyuki Takahashi,Hanqin Wang,Yoshihiro Yamada
机构: Preferred Networks, Inc. (PFN)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 35 pages, 9 figures

点击查看摘要

Abstract:We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.

[CV-50] Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset

【速读】:该论文旨在解决概念瓶颈模型(Concept Bottleneck Models, CBMs)在医学图像分类任务中因概念层与标签之间存在不一致性而导致的性能瓶颈问题。具体而言,当数据集中存在相同概念特征却对应不同诊断标签的情况时,基于硬概念(hard concepts)的CBM会遭遇无法突破的准确率上限。解决方案的关键在于引入粗糙集理论(rough set theory),系统性地识别并量化了Derm7pt dermoscopy数据集中概念层面的不一致性——发现其中16.4%的概念配置(共50个)导致30.3%的图像出现冲突,从而推导出92.1%的理论准确率上限。进一步通过对边界区域图像进行对称或非对称过滤,构建出完全一致的子集Derm7pt+,消除了硬性准确率限制,并在此基础上验证了多种骨干网络架构下的CBM性能,确立了可复现的基准结果。

链接: https://arxiv.org/abs/2604.19323
作者: Gonzalo Nápoles,Isel Grau,Yamisleydi Salgueiro
机构: Tilburg University (蒂尔堡大学); Eindhoven University of Technology (埃因霍温理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) route predictions exclusively through a clinically grounded concept layer, binding interpretability to concept-label consistency. When a dataset contains concept-level inconsistencies, identical concept profiles mapped to conflicting diagnosis labels create an unresolvable bottleneck that imposes a hard ceiling on achievable accuracy. In this paper, we apply rough set theory to the Derm7pt dermoscopy benchmark and characterize the full extent and clinical structure of this inconsistency. Among 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). This yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy for CBMs that exclusively operate with hard concepts. In addition, we characterize the conflict-severity distribution, identify the clinical features most responsible for boundary ambiguity, and evaluate two filtering strategies with quantified effects on dataset composition and CBM interpretability. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, we present a hard CBM evaluated across 19 backbone architectures from the EfficientNet, DenseNet, ResNet, and Wide ResNet families. Under symmetric filtering, explored for completeness, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set, with a concept accuracy of 0.70. Under asymmetric filtering, EfficientNet-B7 leads across all four metrics, reaching a label F1 score of 0.82 and concept accuracy of 0.70. These results establish reproducible baselines for concept-consistent CBM evaluation on dermoscopic data.
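摘要所述的“准确率上限”可以直接从数据推出:对每个概念配置(concept profile),任何只看概念的模型最多预测对该配置下的多数标签。下面的小函数按此逻辑计算上限(示意实现,输入格式为假设):

```python
from collections import Counter, defaultdict

def accuracy_ceiling(samples):
    """Rough-set style accuracy ceiling: samples sharing a concept profile
    but carrying different labels force errors. For each profile only the
    majority label can be predicted correctly, so the ceiling is the summed
    majority counts divided by the dataset size.
    samples: iterable of (concept_tuple, label) pairs."""
    groups = defaultdict(Counter)
    total = 0
    for concepts, label in samples:
        groups[tuple(concepts)][label] += 1
        total += 1
    best = sum(max(counts.values()) for counts in groups.values())
    return best / total
```

对 Derm7pt 的 7 个 dermoscopic 概念应用同样的统计,即可得到论文报告的 92.1% 上限。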

[CV-51] Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes CVPR2026

【速读】:该论文旨在解决现有基于卷积神经网络(Convolutional Neural Networks, CNNs)的多视角人群跟踪方法在真实复杂场景中性能受限的问题,尤其是当场景规模扩大、遮挡情况加剧时,现有方法难以有效泛化。其解决方案的关键在于提出一种基于Transformer架构的多视角人群跟踪模型MVTrackTrans,该模型通过引入相机视图间与地面平面之间的交互机制,增强跨视角信息融合能力,从而提升在大规模、长时间序列场景下的跟踪精度与鲁棒性。同时,为更贴近实际应用,作者还构建并标注了两个大规模真实场景多视角跟踪数据集MVCrowdTrack和CityTrack,以推动该领域向更具实用性的方向发展。

链接: https://arxiv.org/abs/2604.19318
作者: Qi Zhang,Jixuan Chen,Kaiyi Zhang,Xinquan Yu,Antoni B. Chan,Hui Huang
机构: Shenzhen University (深圳大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026

点击查看摘要

Abstract:Multi-view crowd tracking estimates each person’s tracking trajectories on the ground of the scene. Recent research works mainly rely on CNNs-based multi-view crowd tracking architectures, and most of them are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and only contain tens of frames in the evaluation stage, it is difficult for the current methods to be applied to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. Besides, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which contain a much larger scene size over a longer time period. Compared with existing methods on the two large and new datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code are available at: this https URL.

[CV-52] Framelet-Based Blind Image Restoration with Minimax Concave Regularization

【速读】:该论文旨在解决图像复原中的盲图像去模糊(blind image deblurring)问题,即在未知点扩散函数(point spread function, PSF)的情况下,同时估计清晰的原始图像和模糊核。由于该问题具有病态性(ill-posed),直接求解困难。解决方案的关键在于引入一种结合最小最大凹惩罚(minimax concave penalty, MCP)与重加权 ℓ₁-范数正则化的总变差(total variation, TV)优化模型:MCP能更精确地逼近 ℓ₀-范数以增强梯度稀疏性,从而更好地保留边缘结构;而重加权 ℓ₁-范数进一步降低估计偏差,提升对细节和纹理的恢复质量。该方法有效缓解了传统 ℓ₀-正则化带来的非凸性和计算复杂度问题,提升了盲去模糊的实际性能。

链接: https://arxiv.org/abs/2604.19314
作者: Heng Zhang,Reza Parvaz,Rui Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Recovering corrupted images is one of the most challenging problems in image processing. Among various restoration tasks, blind image deblurring has been extensively studied due to its practical importance and inherent difficulty. In this problem, both the point spread function (PSF) and the underlying latent sharp image must be estimated simultaneously. This problem cannot be solved directly due to its ill-posed nature. One powerful tool for solving such problems is total variation (TV) regularization. The \ell_0 -norm regularization within the TV framework has been widely adopted to promote sparsity in image gradients or transform domains, leading to improved preservation of edges and fine structures. However, the use of the \ell_0 -norm results in a highly nonconvex and computationally intractable optimization problem, which limits its practical applicability. To overcome these difficulties, we employ the minimax concave penalty (MCP), which promotes enhanced sparsity and provides a closer approximation to the \ell_0 -norm. In addition, a reweighted \ell_1 -norm regularization is incorporated to further reduce estimation bias and improve the preservation of fine image details and textures. After introducing the proposed model, a numerical algorithm is developed to solve the resulting optimization problem. The effectiveness of the proposed approach is then demonstrated through experimental evaluations on several test images.
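摘要采用的最小最大凹惩罚(MCP)有标准闭式定义:当 |x| ≤ γλ 时为 λ|x| − x²/(2γ),否则取常数 γλ²/2,因此在零附近近似 ℓ₁、远离零时趋于平坦(近似 ℓ₀)。下面按该标准定义实现(参数取值仅为示例):

```python
import numpy as np

def mcp(x, lam, gamma):
    """Minimax concave penalty (standard definition):
        rho(x) = lam*|x| - x^2/(2*gamma)   if |x| <= gamma*lam
                 gamma*lam^2/2             otherwise
    lam > 0 controls the slope near zero; gamma > 1 controls concavity."""
    a = np.abs(np.asarray(x, dtype=float))
    inner = lam * a - a**2 / (2.0 * gamma)
    return np.where(a <= gamma * lam, inner, gamma * lam**2 / 2.0)
```

对大幅值系数惩罚不再增长,这正是 MCP 相比 ℓ₁ 能降低估计偏差、相比 ℓ₀ 又保持可优化性的原因。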

[CV-53] DR-MMSearchAgent : Deepening Reasoning in Multimodal Search Agents

【速读】:该论文旨在解决生成式多模态智能体(Agentic multimodal models)在复杂任务执行中常见的过早交互崩溃(premature interaction collapse)问题,其根源在于:1)终端奖励仅附加在序列最后一个token上,导致优势信号无法区分具有探索性行为的不同轨迹;2)上下文冗余度过高,阻碍智能体吸收有效反馈。解决方案的关键在于提出Deepening Reasoning MMSearchAgent框架,通过利用整个批次中轨迹的结构相似性,从完整轨迹中提取优势信号,从而鼓励不同长度轨迹的生成(即使包含相同正确答案);同时引入差异化高斯奖励机制动态校准交互容忍度,以提升信息可靠性并减少冗余。

链接: https://arxiv.org/abs/2604.19264
作者: Shengqin Wang,Wentao Yan,Huichi Zhou,Yihang Chen,Kun Shao,Zhizhong Zhang,Yuan Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, it is observed that such agents often suffer from premature interaction collapse, caused by two primary reasons: 1) the terminal reward often appending on the last token prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent, the framework leverages the structural proximity to derive advantage signals from the whole rollout trajectories in an entire batch, such that trajectories of different lengths are further encouraged to be generated, even when containing the same correct answer. Additionally, differentiated gaussian rewards are employed to dynamically calibrate interaction tolerance, thereby ensuring information reliability and reducing redundancy. To support multi-turn interaction training, we have constructed a multi-step deep-reasoning dataset containing 3,602 high-quality QA pairs with at least 3 reasoning steps. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming MMSearch-R1 by 8.4% on FVQA-test.

[CV-54] Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection

【速读】:该论文旨在解决工业质量检测中多类缺陷检测的两个核心挑战:一是传统方法需为每类缺陷单独训练模型,导致计算和内存开销大;二是当异质缺陷类别被联合建模时,因类别间特征扰动导致模型鲁棒性下降。其解决方案的关键在于提出FPFNet(Feature Perturbation Pool-based Fusion Network),通过引入随机特征扰动池(stochastic feature perturbation pool)与多层特征融合策略协同优化:前者通过注入高斯噪声、F-Noise和F-Drop等多样噪声模式丰富训练分布,提升对域偏移和未见缺陷形态的鲁棒性;后者利用残差连接与归一化机制融合编码器与解码器的层次化特征表示,有效捕捉跨尺度关系并保留精细空间细节,从而实现统一框架下的高性能缺陷检测,且不增加额外可学习参数或计算复杂度。

链接: https://arxiv.org/abs/2604.19259
作者: Yuanchan Xu,Wenjun Zang,Ying Wu
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-class defect detection constitutes a critical yet challenging task in industrial quality inspection, where existing approaches typically suffer from two fundamental limitations: (i) the necessity of training separate models for each defect category, resulting in substantial computational and memory overhead, and (ii) degraded robustness caused by inter-class feature perturbation when heterogeneous defect categories are jointly modeled. In this paper, we present FPFNet, a Feature Perturbation Pool-based Fusion Network that synergistically integrates a stochastic feature perturbation pool with a multi-layer feature fusion strategy to address these challenges within a unified detection framework. The feature perturbation pool enriches the training distribution by randomly injecting diverse noise patterns – including Gaussian noise, F-Noise, and F-Drop – into the extracted feature representations, thereby strengthening the model’s robustness against domain shifts and unseen defect morphologies. Concurrently, the multi-layer feature fusion module aggregates hierarchical feature representations from both the encoder and decoder through residual connections and normalization, enabling the network to capture complex cross-scale relationships while preserving fine-grained spatial details essential for precise defect localization. Built upon the UniAD architecture [You et al., 2022], our method achieves state-of-the-art performance on two widely adopted benchmarks: 97.17% image-level AUROC and 96.93% pixel-level AUROC on MVTec-AD, and 91.08% image-level AUROC and 99.08% pixel-level AUROC on VisA, surpassing existing methods by notable margins while introducing no additional learnable parameters or computational complexity.
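“特征扰动池”的机制是每次前向随机抽取一种噪声模式注入特征图。下面是一个示意草图:高斯噪声与按通道丢弃模仿摘要中的 Gaussian noise 与 F-Drop(F-Noise / F-Drop 的确切定义此处未复现,参数均为假设):

```python
import numpy as np

def perturb_features(feat, rng, sigma=0.1, drop_p=0.1):
    """Sketch of a stochastic feature perturbation pool: draw one
    perturbation per call and apply it to a (C, H, W) feature map.
    Mode set and parameters are illustrative assumptions."""
    mode = rng.choice(["identity", "gaussian", "drop"])
    if mode == "gaussian":
        # Additive Gaussian noise on every feature value.
        return feat + rng.normal(0.0, sigma, feat.shape)
    if mode == "drop":
        # Randomly zero out whole channels (a dropout-like perturbation).
        keep = rng.random((feat.shape[0], 1, 1)) > drop_p
        return feat * keep
    return feat
```

训练时在编码特征上调用该函数,即可在不增加可学习参数的前提下丰富训练分布。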

[CV-55] Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images CVPR2026

【速读】:该论文旨在解决现有3D车辆生成方法依赖合成数据导致域差距(domain gap)的问题,以及生成模型常出现姿态任意、尺度未定义,从而在自动驾驶场景中缺乏视觉一致性的缺陷。其核心解决方案是提出Unposed-to-3D框架,关键在于通过两阶段训练策略:第一阶段利用带位姿的图像训练图像到3D重建网络;第二阶段移除相机监督,引入相机预测头从无位姿图像中估计相机参数,并结合可微渲染提供自监督光度反馈,使模型仅凭无标注图像即可学习真实三维几何结构。此外,通过尺度感知模块和和谐化模块确保生成车辆具有真实世界尺寸并适配目标驾驶场景的光照与外观,从而实现仿真就绪的高质量3D资产生成。

链接: https://arxiv.org/abs/2604.19257
作者: Hongyuan Liu,Bochao Zou,Qiankun Liu,Haochen Yu,Qi Mei,Jianfei Jiang,Chen Liu,Cheng Bi,Zhao Wang,Xueyang Zhang,Yifei Zhan,Jiansheng Chen,Huimin Ma
机构: University of Science and Technology Beijing (北京科技大学); Li Auto Inc (理想汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026

点击查看摘要

Abstract:Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.

[CV-56] AlloSR2: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows

【速读】:该论文旨在解决真实世界图像超分辨率(Real-SR)中因有限低分辨率-高分辨率(LR-HR)配对数据导致的“先验坍塌”(prior collapse)问题,以及单步生成过程中由于缺乏多步优化而引发的轨迹偏移(trajectory drift)和伪影生成问题。解决方案的关键在于提出AlloSR²框架,其核心创新包括:1)基于信噪比(Signal-to-Noise Ratio, SNR)引导的轨迹初始化,通过将低分辨率潜在特征的退化程度与预训练流模型的最佳锚定时间步对齐,建立物理合理的初始状态;2)流锚定轨迹一致性(Flow-Anchored Trajectory Consistency, FATC),在中间状态施加速度级监督以确保单步推理路径的稳定性与无曲率特性;3)同构轨迹匹配(Allomorphic Trajectory Matching, ATM),采用自对抗对齐策略最小化超分流与生成流在统一向量场中的分布差异,从而维持高保真度的生成真实性。该方法在合成与真实世界基准上均实现了最优性能,兼顾恢复保真度与生成真实性,并具备极高的效率。

链接: https://arxiv.org/abs/2604.19238
作者: Zihan Wang,Xudong Huang,Junbo Qiao,Wei Li,Jie Hu,Xinghao Chen,Shaohui Lin
机构: Computer Science and Technology College, East China Normal University (华东师范大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-world image super-resolution (Real-SR) has been revolutionized by leveraging the powerful generative priors of large-scale diffusion and flow-based models. However, fine-tuning these models on limited LR-HR pairs often precipitates “prior collapse”, in which the model sacrifices its inherent generative richness to overfit specific training degradations. This issue is further exacerbated in one-step generation, where the absence of multi-step refinement leads to significant trajectory drift and artifact generation. In this paper, we propose AlloSR^2, a novel framework that rectifies one-step SR trajectories via allomorphic generative flows to maintain high-fidelity generative realism. Specifically, we utilize Signal-to-Noise Ratio (SNR) Guided Trajectory Initialization to establish a physically grounded starting state by aligning the degradation level of LR latent features with the optimal anchoring timestep of the pre-trained flow. To ensure a stable, curvature-free path for one-step inference, we propose Flow-Anchored Trajectory Consistency (FATC), which enforces velocity-level supervision across intermediate states. Furthermore, we develop Allomorphic Trajectory Matching (ATM), a self-adversarial alignment strategy that minimizes the distributional discrepancy between the SR flow and the generative flow in a unified vector field. Extensive experiments on both synthetic and real-world benchmarks demonstrate that AlloSR^2 achieves state-of-the-art performance in one-step Real-SR, offering a superior balance between restoration fidelity and generative realism while maintaining extreme efficiency.

[CV-57] Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

【速读】:该论文旨在解决基于人类偏好信号的视觉生成模型后训练中,强化学习方法(尤其是Group Relative Policy Optimization, GRPO)因奖励信用分配粗粒度而导致的优化信号不准确问题。现有GRPO框架通常将多个奖励模型(如视觉质量、运动一致性、文本对齐等)融合为单一静态标量,并在整个扩散轨迹上均匀传播,忽略了不同去噪步骤的阶段性作用,从而产生时间错位或不兼容的优化信号。其解决方案的关键在于提出Objective-aware Trajectory Credit Assignment (OTCA)框架,通过两个核心组件实现细粒度信用分配:一是轨迹级信用分解(Trajectory-Level Credit Decomposition),用于估计各去噪步骤的相对重要性;二是多目标信用分配(Multi-Objective Credit Allocation),动态加权并整合多源奖励信号于整个去噪过程。OTCA联合建模时间维度与目标维度的信用结构,将粗粒度奖励监督转化为具有时序感知能力的结构化训练信号,更贴合扩散生成的迭代特性,显著提升图像与视频生成质量。

链接: https://arxiv.org/abs/2604.19234
作者: Rui Li,Ke Hao,Yuanzhi Liang,Haibin Huang,Chi Zhang,Yun Gu,Xuelong Li
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Jiao Tong University (上海交通大学); TeleAI (TeleAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning, particularly Group Relative Policy Optimization (GRPO), has emerged as an effective framework for post-training visual generative models with human preference signals. However, its effectiveness is fundamentally limited by coarse reward credit assignment. In modern visual generation, multiple reward models are often used to capture heterogeneous objectives, such as visual quality, motion consistency, and text alignment. Existing GRPO pipelines typically collapse these rewards into a single static scalar and propagate it uniformly across the entire diffusion trajectory. This design ignores the stage-specific roles of different denoising steps and produces mistimed or incompatible optimization signals. To address this issue, we propose Objective-aware Trajectory Credit Assignment (OTCA), a structured framework for fine-grained GRPO training. OTCA consists of two key components. Trajectory-Level Credit Decomposition estimates the relative importance of different denoising steps. Multi-Objective Credit Allocation adaptively weights and combines multiple reward signals throughout the denoising process. By jointly modeling temporal credit and objective-level credit, OTCA converts coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation. Extensive experiments show that OTCA consistently improves both image and video generation quality across evaluation metrics.

[CV-58] Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery

【速读】:该论文旨在解决高分辨率航空与卫星图像中小目标检测难题,其核心挑战包括密集目标分布、多变拍摄角度、微小目标尺寸及显著的类别间差异。现有基于固定切片大小的分割策略虽能扩展小目标的有效感受野,但因冗余计算导致推理效率低下。解决方案的关键在于提出自适应切片辅助超推理(ASAHI)框架,通过三个协同模块实现:(1) 基于分辨率感知的自适应切片算法,动态生成6或12个重叠块以优化切片数量;(2) 切片辅助微调(SAF)策略,融合全分辨率与切片图像块构建增强训练数据;(3) Cluster-DIoU-NMS(CDN)后处理模块,结合Cluster-NMS的几何合并效率与DIoU-NMS的中心距离敏感抑制机制,提升复杂场景下的重复框消除鲁棒性。该方法在VisDrone2019和xView数据集上均达到SOTA性能,并将推理时间降低20–25%。

链接: https://arxiv.org/abs/2604.19233
作者: Francesco Moretti,Yi Jin,Guiqin Mario
机构: Polytechnic University of Turin (都灵理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose \textbfAdaptive Slicing-Assisted Hyper Inference (ASAHI), a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1)an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2)a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3)a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView, demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
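ASAHI 的核心思路——按分辨率自适应决定切片数量(6 或 12)并保持相邻切片重叠——可以用下面的示意函数来理解。其中面积阈值、3×2 / 4×3 的网格划分方式以及重叠率都是假设参数,并非论文原始数值:

```python
import math

def adaptive_slices(width, height, area_threshold=2_000_000, overlap=0.2):
    """按分辨率自适应返回 6 或 12 个重叠切片框 (x0, y0, x1, y1)。"""
    # 分辨率高于阈值时用 4x3=12 片,否则 3x2=6 片(阈值为示意假设)
    n_cols, n_rows = (4, 3) if width * height > area_threshold else (3, 2)

    def axis_boxes(length, n):
        # 取满足 (n-1)*stride + tile >= length 的最小切片尺寸
        tile = math.ceil(length / ((n - 1) * (1 - overlap) + 1))
        stride = (length - tile) / (n - 1) if n > 1 else 0
        return [(round(i * stride), round(i * stride) + tile) for i in range(n)]

    return [(x0, y0, x1, y1)
            for (y0, y1) in axis_boxes(height, n_rows)
            for (x0, x1) in axis_boxes(width, n_cols)]

small = adaptive_slices(1280, 720)    # 低分辨率 → 6 片
large = adaptive_slices(4000, 3000)   # 高分辨率 → 12 片
```

相比固定切片尺寸,这种"固定切片数"的划分让每张图的总计算量可控,同时末尾切片恰好贴合图像边界、不产生越界填充。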

[CV-59] Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

【速读】:该论文旨在解决行人重识别(Person Re-Identification, ReID)中因主流感知驱动范式依赖大量标注数据、难以捕捉身份因果线索而导致的表征脆弱性问题。解决方案的关键在于提出一种全新的推理驱动范式 ReID-R,其核心创新是将思维链(Chain-of-Thought, CoT)引入 ReID 流程:首先通过无标签的 CoT 式预训练实现身份感知特征理解(判别性推理热身),再结合非平凡采样策略构建场景泛化数据,并利用高质量奖励信号引导模型聚焦于与身份相关的线索,从而实现精准推理与可解释的结果输出。

链接: https://arxiv.org/abs/2604.19218
作者: Quan Zhang,Jingze Wu,Jialong Wang,Xiaohua Xie,Jianhuang Lai,Hongbo Chen
机构: Sun Yat-sen University (中山大学); Alibaba Cloud Computing (阿里云计算)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Learning identity-discriminative representations with multi-scene generality has become a critical objective in person re-identification (ReID). However, mainstream perception-driven paradigms tend to fit massive annotated data rather than understand identity-causal cues, yielding representations that are fragile against multiple disruptions. In this work, ReID-R is proposed as a novel reasoning-driven paradigm that achieves explicit identity understanding and reasoning by incorporating chain-of-thought into the ReID pipeline. Specifically, ReID-R makes a two-stage contribution: (i) Discriminative reasoning warm-up, where a model is trained in a CoT label-free manner to acquire identity-aware feature understanding; and (ii) Efficient reinforcement learning, which proposes a non-trivial sampling to construct scene-generalizable data. On this basis, ReID-R leverages high-quality reward signals to guide the model toward focusing on ID-related cues, achieving accurate reasoning and correct responses. Extensive experiments on multiple ReID benchmarks demonstrate that ReID-R achieves identity discrimination competitive with superior methods using only 14.3K non-trivial data (20.9% of the existing data scale). Furthermore, benefiting from its inherent reasoning, ReID-R can provide high-quality interpretations of its results.

[CV-60] Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite Soil and Climate Data

【速读】:该论文旨在解决传统作物产量预测方法因依赖静态数据源而难以准确反映环境变量间动态复杂关系的问题,从而影响全球粮食安全与政策制定的科学性。其解决方案的关键在于提出一种基于注意力机制的多模态深度学习框架(Attention-Based Multi-Modal Deep Learning Framework, ABMMDLF),该框架融合多年卫星遥感影像、高分辨率气象时间序列数据及初始土壤属性等多源异构信息,利用卷积神经网络(Convolutional Neural Networks, CNN)提取空间特征,并通过时序注意力机制自适应加权关键物候期,实现对时空维度上作物生长过程的精准建模,实验表明该方法在预测精度上显著优于基线模型(R²达0.89)。

链接: https://arxiv.org/abs/2604.19217
作者: Gopal Krishna Shyam,Ila Chandrakar
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 Figures

点击查看摘要

Abstract:Crop yield prediction is one of the most important challenges, crucial to world food security and policy-making decisions. Conventional forecasting techniques are limited in accuracy because they utilize static data sources that do not reflect the dynamic and intricate relationships among environmental variables over time [5,13]. This paper presents the Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction. Unlike traditional models that use only one of these factors, our model combines multi-year satellite imagery, high-resolution meteorological time series, and initial soil properties [12,21]. The core architecture uses Convolutional Neural Networks (CNN) to extract spatial features and a Temporal Attention Mechanism to adaptively weight important phenological periods, conditioned on those spatial features. Experiments show that the proposed framework achieves an R^2 score of 0.89, far surpassing the baseline models.
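摘要中的时序注意力机制(对关键物候期自适应加权)本质上是对各时间步打分、softmax 归一化后加权汇聚。下面用最简单的线性打分形式做一个示意实现,其中维度与参数均为假设:

```python
import numpy as np

def temporal_attention(features, w, b=0.0):
    """对 T 个时间步的特征做注意力加权汇聚(示意实现)。

    score_t = w·h_t + b,alpha = softmax(score),输出为 alpha 加权和。
    """
    scores = features @ w + b              # (T,) 每个时间步一个分数
    scores = scores - scores.max()         # 数值稳定的 softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    pooled = alpha @ features              # (D,) 加权汇聚后的特征
    return pooled, alpha

T, D = 10, 4                               # 假设 10 个时间步、4 维特征
rng = np.random.default_rng(1)
h = rng.normal(size=(T, D))
w = rng.normal(size=D)
pooled, alpha = temporal_attention(h, w)
```

训练中 w 由梯度学习得到;推理时 alpha 本身也提供了"模型关注哪些物候期"的可解释信号。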

[CV-61] An Object-Centered Data Acquisition Method for 3D Gaussian Splatting using Mobile Phones

【速读】:该论文旨在解决移动设备上基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的三维重建中数据采集困难的问题,尤其针对以物体为中心的场景。其核心解决方案在于提供设备端捕获引导并记录机载传感器信号,用于离线重建:通过校准步骤将设备姿态对齐至基准坐标系以获得相对位姿,并将相机光轴映射到以物体为中心的球面网格上实现均匀视角索引;同时,为减少极区采样偏差,实时计算面积加权的球面覆盖度并引导用户运动,从而提升视角覆盖的全面性和均匀性。实验表明,该方法在使用更少输入图像的情况下,相比RealityScan和自由采集策略能获得更优的重建质量。

链接: https://arxiv.org/abs/2604.19216
作者: Yuezhe Zhang,Luqian Bai,Mengting Yu,Lei Wei,Shuai Wan,Yifan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data acquisition through mobile phones remains a challenge for 3D Gaussian Splatting (3DGS). In this work we target the object-centered scenario and enable reliable mobile acquisition by providing on-device capture guidance and recording onboard sensor signals for offline reconstruction. After the calibration step, the device orientations are aligned to a baseline frame to obtain relative poses, and the optical axis of the camera is mapped to an object-centered spherical grid for uniform viewpoint indexing. To curb polar sampling bias, we compute area-weighted spherical coverage in real-time and guide the user’s motion accordingly. We compare the proposed method with RealityScan and the free-capture strategy. Our method achieves superior reconstruction quality using fewer input images compared to free capture and RealityScan. Further analysis shows that the proposed method is able to obtain more comprehensive and uniform viewpoint coverage during object-centered acquisition.
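论文中的"面积加权球面覆盖度"可以这样理解:把以物体为中心的球面划分为纬度带 × 方位格,每个格子的权重正比于它在单位球上的面积(一个纬度带的面积 ∝ cos θ₀ − cos θ₁),从而抑制极区格子被过度计入。以下为示意实现,网格划分粒度为假设值:

```python
import math

def band_weights(n_bands):
    """将极角 [0, pi] 均分为 n_bands 个纬度带,返回归一化面积权重。"""
    edges = [i * math.pi / n_bands for i in range(n_bands + 1)]
    # 单位球上纬度带面积 ∝ cos(theta0) - cos(theta1);除以 2 使权重和为 1
    return [(math.cos(edges[i]) - math.cos(edges[i + 1])) / 2
            for i in range(n_bands)]

def coverage(visited, n_bands, n_cols):
    """visited: 已覆盖网格单元 (band, col) 的集合;每带再均分 n_cols 个方位格。"""
    w = band_weights(n_bands)
    return sum(w[b] / n_cols for (b, _c) in visited)

weights = band_weights(6)
full = {(b, c) for b in range(6) for c in range(12)}   # 全部覆盖
total = coverage(full, 6, 12)
```

极区格子的权重明显小于赤道附近的格子,因此用户在极区多拍几张并不会虚高覆盖度,这正是"抑制极区采样偏差"的含义。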

[CV-62] When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide

【速读】:该论文旨在解决安全关键领域(如工业缺陷检测、自动驾驶和医疗诊断)中人工智能(AI)系统因缺乏可靠性而导致的部署难题,尤其是模型在高准确率下仍可能产生无法被识别的错误预测(假阴性),从而引发灾难性后果的问题。解决方案的关键在于提出一种基于后处理解释的可靠性指标,通过计算类别特定判别热图(class-specific discriminative heatmaps)与类别无关热图(class-agnostic heatmaps)之间的交并比(IoU)差异作为可靠性得分,并引入对抗增强方法以放大该差异,从而有效识别潜在的假阴性输出。实验表明,该方法在两个工业缺陷检测基准上显著提升了假阴性检测能力,结合对抗增强可实现100%召回率,尽管以牺牲真阴性为代价,从而推动了从传统端到端系统向“数据-模型-解释-输出”可信部署范式的转变。

链接: https://arxiv.org/abs/2604.19206
作者: Hang-Cheng Dong,Yuhao Jiang,Yibo Jiao,Lu Zou,Kai Zheng,Bingguo Liu,Dong Ye,Guodong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The deployment of AI systems in safety-critical domains, such as industrial defect inspection, autonomous driving, and medical diagnosis, is severely hampered by their lack of reliability. A single undetected erroneous prediction can lead to catastrophic outcomes. Unfortunately, there is often no alternative but to place trust in the outputs of a trained AI system, which operates without an internal safeguard to flag unreliable predictions, even in cases of high accuracy. We propose a post-hoc explanation-based indicator to detect false negatives in binary defect detection networks. To our knowledge, this is the first method to proactively identify potentially erroneous network outputs. Our core idea leverages the difference between class-specific discriminative heatmaps and class-agnostic ones. We compute the difference in their intersection over union (IoU) as a reliability score. An adversarial enhancement method is further introduced to amplify this disparity. Evaluations on two industrial defect detection benchmarks show our method effectively identifies false negatives. With adversarial enhancement, it achieves 100% recall, albeit with a trade-off for true negatives. Our work thus advocates for a new and trustworthy deployment paradigm: data-model-explanation-output, moving beyond conventional end-to-end systems to provide critical support for reliable AI in real-world applications.
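按摘要的思路,可靠性得分来自"类别相关热图与类别无关热图的 IoU 差异"。下面给出一种可能的极简读法:二值化两张热图后计算 IoU,取 1 − IoU 作为得分。二值化阈值与 1 − IoU 的具体取法均为示意假设,并非论文原始定义:

```python
import numpy as np

def heatmap_iou(a, b, thr=0.5):
    """按阈值二值化后计算两张热图的交并比。"""
    A, B = a >= thr, b >= thr
    union = np.logical_or(A, B).sum()
    if union == 0:
        return 1.0                     # 两图都为空时视为完全一致
    return np.logical_and(A, B).sum() / union

def reliability_score(class_specific, class_agnostic, thr=0.5):
    """两张热图差异越大(IoU 越小),得分越高,输出越可能不可靠。"""
    return 1.0 - heatmap_iou(class_specific, class_agnostic, thr)

h1 = np.zeros((8, 8)); h1[:4] = 1.0    # 上半区激活
h2 = np.zeros((8, 8)); h2[4:] = 1.0    # 下半区激活
score_same = reliability_score(h1, h1)  # 完全一致 → 0.0
score_diff = reliability_score(h1, h2)  # 完全不重叠 → 1.0
```

对这个得分设定阈值,就得到一个后处理式的"是否信任本次输出"的开关,对应论文倡导的 data-model-explanation-output 部署范式。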

[CV-63] SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting CVPR2026

【速读】:该论文旨在解决从2D草图中实时生成与编辑高质量、几何一致的3D Gaussian头部模型(3D Gaussian representations)的问题。现有方法难以利用稀疏且深度模糊的草图信息来推断密集且具高保真度的3D结构,尤其在实时约束下更为困难。解决方案的关键在于提出SketchFaceGS框架,其核心是采用前馈式的粗到精架构:首先通过基于Transformer的UV特征预测模块从草图重建出几何一致的粗略UV特征图,再借助3D UV特征增强模块融合高频细节以生成高保真3D头部;在编辑方面,引入UV Mask Fusion技术与分层特征融合策略,实现精确、实时、多视角的修改能力。

链接: https://arxiv.org/abs/2604.19202
作者: Bo Li,Jiahao Kang,Yubo Ma,Feng-Lin Liu,Bin Liu,Fang-Lue Zhang,Lin Gao
机构: Shandong Technology and Business University (山东工商大学); Nanchang Hangkong University (南昌航空大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); University of New South Wales (新南威尔士大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 as a Highlight. Jittor implementation: this https URL . © 2026 IEEE. Personal use of this material is permitted

点击查看摘要

Abstract:3D Gaussian representations have emerged as a powerful paradigm for digital head modeling, achieving photorealistic quality with real-time rendering. However, intuitive and interactive creation or editing of 3D Gaussian head models remains challenging. Although 2D sketches provide an ideal interaction modality for fast, intuitive conceptual design, they are sparse, depth-ambiguous, and lack high-frequency appearance cues, making it difficult to infer dense, geometrically consistent 3D Gaussian structures from strokes - especially under real-time constraints. To address these challenges, we propose SketchFaceGS, the first sketch-driven framework for real-time generation and editing of photorealistic 3D Gaussian head models from 2D sketches. Our method uses a feed-forward, coarse-to-fine architecture. A Transformer-based UV feature-prediction module first reconstructs a coarse but geometrically consistent UV feature map from the input sketch, and then a 3D UV feature enhancement module refines it with high-frequency, photorealistic detail to produce a high-fidelity 3D head. For editing, we introduce a UV Mask Fusion technique combined with a layer-by-layer feature-fusion strategy, enabling precise, real-time, free-viewpoint modifications. Extensive experiments show that SketchFaceGS outperforms existing methods in both generation fidelity and editing flexibility, producing high-quality, editable 3D heads from sketches in a single forward pass.

[CV-64] Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing CVPR

【速读】:该论文旨在解决人脸活体检测(Face Anti-Spoofing, FAS)中跨域泛化能力不足的问题,尤其在未见环境下的鲁棒性挑战。现有基于视觉-语言模型(Vision-Language Models, VLMs)的方法虽利用语义监督提升性能,但存在计算资源消耗大、推理延迟高以及依赖底层视觉特征质量的局限性。解决方案的关键在于重新评估纯视觉基础模型(vision-only foundation models)的潜力,并通过系统性基准测试15种预训练模型(包括监督CNN、监督ViT及自监督ViT),发现自监督视觉Transformer(如DINOv2 with Registers)能有效抑制注意力伪影并捕捉细微的伪造线索。进一步结合专为FAS设计的数据增强策略(FAS-Aug、Patch-wise Data Augmentation, PDA)与注意力加权补丁损失(Attention-weighted Patch Loss, APL),所提出的纯视觉基线在MICO协议下达到SOTA性能,在数据受限的Limited Source Domains (LSD)协议中也优于现有方法,同时保持优异的计算效率。

链接: https://arxiv.org/abs/2604.19196
作者: Mika Feng,Pierre Gallin-Martel,Koichi Ito,Takafumi Aoki
机构: Tohoku University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

点击查看摘要

Abstract:Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: this https URL .

[CV-65] How Far Are Video Models from True Multimodal Reasoning ?

【速读】:该论文旨在解决当前通用视频模型在实现真正多模态推理能力方面存在的评估盲区问题,即现有基准测试因任务设计简单、评价指标碎片化而无法严谨衡量模型在复杂场景下的跨模态推理性能。其解决方案的关键在于提出CLVG-Bench(Context Learning in Video Generation Benchmark)这一系统性评估框架,该框架包含超过1,000条人工标注的高质量元数据,覆盖物理模拟、逻辑推理和交互情境等6大类47子类复杂场景,并结合自适应视频评估器(Adaptive Video Evaluator, AVE),通过极少标注即可对齐人类专家感知,提供可解释的文本反馈,从而精准量化模型在零样本条件下进行多模态推理的能力短板,揭示逻辑推理与物理接地是当前SOTA视频模型的核心瓶颈。

链接: https://arxiv.org/abs/2604.19193
作者: Xiaotian Zhang,Jianhui Wei,Yuan Wang,Jie Tan,Yichen Li,Yan Zhang,Ziyi Chen,Daoan Zhang,Dezhi YU,Wei Xu,Songtao Jiang,Zuozhu Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models’ zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short with logically grounded and interactive generation tasks (achieving success rates 25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedbacks and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and code are released here.

[CV-66] Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement

【速读】:该论文旨在解决医学影像中异常检测的问题,尤其在标注异常样本稀缺的情况下,如何有效识别罕见病理状态。解决方案的关键在于提出了一种融合自监督表征学习与流形密度估计的混合框架:首先利用预训练模型(可能为特定领域)将医学图像嵌入到潜在特征空间;随后通过均值漂移密度增强(Mean Shift Density Enhancement, MSDE)迭代地将样本移向高似然区域以优化表征;最后在主成分分析(PCA)降维后的潜空间中,基于高斯密度估计和马氏距离计算异常分数。该方法遵循单类学习范式,仅需正常样本即可训练,实验证明其在多个医学影像数据集上达到领先性能,展现出作为临床辅助决策工具的潜力。

链接: https://arxiv.org/abs/2604.19191
作者: Pritam Kar,Gouri Lakshmi S,Saptarshi Bej
机构: Indian Institute of Science Education and Research (印度科学教育与研究学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Anomaly detection in medical imaging is essential for identifying rare pathological conditions, particularly when annotated abnormal samples are limited. We propose a hybrid anomaly detection framework that integrates self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this domain. Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined via Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are subsequently computed using Gaussian density estimation in a PCA-reduced latent space, where Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and requires only normal samples for training. Extensive experiments on seven medical imaging datasets demonstrate state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five datasets, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results underscore the potential of the proposed framework as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.
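摘要中的打分流程(PCA 降维 → 高斯密度估计 → 马氏距离)可以用如下示意代码理解。主成分数、正则项等均为假设参数,MSDE 的均值漂移增强步骤这里未包含:

```python
import numpy as np

def fit_gaussian_pca(train_feats, k=2):
    """中心化后用 SVD 取前 k 个主成分,在低维潜空间拟合高斯分布。"""
    mu = train_feats.mean(axis=0)
    X = train_feats - mu
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:k].T                                    # (D, k) PCA 投影矩阵
    Z = X @ P
    cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(k)  # 正则化协方差
    return mu, P, np.linalg.inv(cov), Z.mean(axis=0)

def mahalanobis_score(x, mu, P, cov_inv, z_mu):
    """异常分数:样本在潜空间中到正常分布中心的马氏距离。"""
    d = (x - mu) @ P - z_mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 8))
normal[:, :2] *= 5.0                                # 主要方差集中在前两维
params = fit_gaussian_pca(normal)
inlier = mahalanobis_score(normal[0], *params)
outlier = mahalanobis_score(np.array([30.0, 30.0] + [0.0] * 6), *params)
```

符合单类学习范式:拟合只用正常样本,推理时分数超过某个阈值即判为异常。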

[CV-67] MSDS: Deep Structural Similarity with Multiscale Representation

【速读】:该论文旨在解决现有基于深度特征的感知相似性模型在图像质量评估(IQA)中普遍局限于单一空间尺度的问题,即隐含假设固定分辨率下的结构相似性足以刻画人类视觉感知,而忽视了空间尺度对深度特征相似性建模的重要影响。解决方案的关键在于提出一种最小化的多尺度扩展框架——多尺度深度结构相似性(MSDS),通过在图像金字塔的不同层级独立计算DeepSSIM,并采用一组可学习的全局权重进行分数融合,从而实现深度特征表示与跨尺度整合的解耦。实验表明,该方法在多个基准数据集上均显著优于单尺度基线模型,且计算复杂度几乎不变,验证了空间尺度作为非可忽略因素在深度感知相似性中的作用。

链接: https://arxiv.org/abs/2604.19159
作者: Danling Kang,Xue-Hua Chen,Bin Liu,Keke Zhang,Weiling Chen,Tiesong Zhao
机构: Fuzhou University (福州大学); Fuzhou Institute of Technology (福州职业技术学院); Henan Normal University (河南师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep-feature-based perceptual similarity models have demonstrated strong alignment with human visual perception in Image Quality Assessment (IQA). However, most existing approaches operate at a single spatial scale, implicitly assuming that structural similarity at a fixed resolution is sufficient. The role of spatial scale in deep-feature similarity modeling thus remains insufficiently understood. In this letter, we isolate spatial scale as an independent factor using a minimal multiscale extension of DeepSSIM, referred to as Deep Structural Similarity with Multiscale Representation (MSDS). The proposed framework decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.
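MSDS 的核心流程——逐金字塔层独立计算相似度,再用一组全局权重融合——可用如下示意代码理解。这里用全局统计量版的简化 SSIM 代替 DeepSSIM 的深度特征相似度,可学习权重以 softmax 归一化后给定,均为示意假设:

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """全局统计量版 SSIM(简化示意,非逐窗口实现)。"""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx**2 + my**2 + c1) * (x.var() + y.var() + c2)
    return num / den

def downsample2(x):
    """2x2 平均池化,构建图像金字塔的下一层。"""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def msds(x, y, weights):
    """逐层计算相似度,再以 softmax 归一化的全局权重融合。"""
    scores = []
    for _ in range(len(weights)):
        scores.append(ssim_global(x, y))
        x, y = downsample2(x), downsample2(y)
    w = np.exp(weights - np.max(weights))
    w /= w.sum()
    return float(np.dot(w, scores))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
same = msds(img, img, np.zeros(3))                        # 同图 → 接近 1
deg = msds(img, img + 0.3 * rng.random((32, 32)), np.zeros(3))
```

训练中 weights 作为可学习参数随数据拟合,对应论文中"轻量级全局权重"这一设计:除这几个标量外不引入额外复杂度。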

[CV-68] ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶系统中部署时面临的巨大计算开销问题,尤其是多视角摄像头和多帧视频输入导致的冗余信息处理瓶颈。现有基于单图像的token剪枝方法未能利用驾驶场景中的时空冗余特性,无法有效压缩计算资源。解决方案的关键在于提出一种无需训练、即插即用的框架ST-Prune,其核心由两个互补模块构成:Motion-aware Temporal Pruning(MTP)通过引入运动波动性和时间新近性作为软约束,在多样性选择目标中优先保留动态轨迹和当前帧内容,从而减少时间冗余;Ring-view Spatial Pruning(RSP)则利用环形视角相机几何结构,惩罚双向跨视角相似性,消除重复投影和残余背景,进一步降低空间冗余。二者协同实现完整的时空剪枝流程,在严格压缩率下保留关键场景信息,显著优于现有方法。

链接: https://arxiv.org/abs/2604.19145
作者: Lin Sha,Haiyun Guo,Tao Wang,Cong Zhang,Min Huang,Jinqiao Wang,Qinghai Miao
机构: University of Chinese Academy of Sciences (中国科学院大学); Carizon; Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes new state-of-the-art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
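ST-Prune 所依赖的"多样性选择目标"可以用一个贪心最远点采样直观理解:每步保留与已保留集合最大余弦相似度最小的候选 token。以下为示意实现,未包含论文中的运动/时序软约束(MTP)与环视跨视角惩罚(RSP):

```python
import numpy as np

def diversity_select(tokens, keep):
    """归一化后按"与已选集合的最大余弦相似度最小"贪心挑选 token。"""
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    selected = [0]                        # 以第一个 token 作为种子(示意)
    max_sim = t @ t[0]                    # 各候选到已选集合的最大相似度
    while len(selected) < keep:
        max_sim[selected] = np.inf        # 已选的不再参与比较
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, t @ t[nxt])
    return sorted(selected)

rng = np.random.default_rng(0)
toks = rng.normal(size=(100, 16))         # 假设 100 个 16 维视觉 token
kept = diversity_select(toks, keep=10)    # 90% 剪枝率下保留 10 个
```

论文中的软约束相当于在 max_sim 上叠加运动性、时效性与跨视角相似度的加权项,使选择在"多样"之外进一步偏向动态轨迹与当前帧内容。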

[CV-69] Denoising Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation CVPR2026 ATC

【速读】:该论文旨在解决传统扩散模型(diffusion models)和流模型(flow-based models)在图像生成过程中计算资源分配不均的问题,即这些模型通常对图像的所有区域采用统一的时间步长(timestep)和函数评估次数,忽略了自然图像中不同区域的去噪难易程度差异——某些区域易于去噪,而其他区域则需要更多细化或上下文信息。解决方案的关键在于提出一种基于补丁(patch)级别的噪声尺度控制机制,引入一个显式控制训练期间最大局部信息量的时间步采样器,并进一步通过轻量级的每补丁难度头(difficulty head)实现动态计算资源分配。最终形成的Patch Forcing(PF)框架能够在空间和时间维度上同时调整噪声水平,优先处理较易区域以提供上下文支持给困难区域,从而显著提升图像生成质量,且与表示对齐和引导方法正交,具备良好的可扩展性。

链接: https://arxiv.org/abs/2604.19141
作者: Johannes Schusterbauer,Ming Gui,Yusong Li,Pingchuan Ma,Felix Krause,Björn Ommer
机构: CompVis @ LMU Munich, Munich Center for Machine Learning (MCML)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, Code: this https URL

点击查看摘要

Abstract:Diffusion- and flow-based models usually allocate compute uniformly across space, updating all patches with the same timestep and number of function evaluations. While convenient, this ignores the heterogeneity of natural images: some regions are easy to denoise, whereas others benefit from more refinement or additional context. Motivated by this, we explore patch-level noise scales for image synthesis. We find that naively varying timesteps across image tokens performs poorly, as it exposes the model to overly informative training states that do not occur at inference. We therefore introduce a timestep sampler that explicitly controls the maximum patch-level information available during training, and show that moving from global to patch-level timesteps already improves image generation over standard baselines. By further augmenting the model with a lightweight per-patch difficulty head, we enable adaptive samplers that allocate compute dynamically where it is most needed. Combined with noise levels varying over both space and diffusion time, this yields Patch Forcing (PF), a framework that advances easier regions earlier so they can provide context for harder ones. PF achieves superior results on class-conditional ImageNet, remains orthogonal to representation alignment and guidance methods, and scales to text-to-image synthesis. Our results suggest that patch-level denoising schedules provide a promising foundation for adaptive image generation.
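论文 timestep sampler 的具体形式摘要中没有给出。下面是笔者基于摘要描述写的一个假设性示意(`gap` 为虚构参数):先为每个样本采样全局 timestep,再加入受限的 patch 级扰动,并用下界裁剪保证任何 patch 暴露的信息量(即最低噪声水平)不超过固定上限:

```python
import torch

def sample_patch_timesteps(batch, n_patches, gap=0.2):
    """Hypothetical patch-level timestep sampler (not the paper's code).

    t in [0, 1]; smaller t means less noise, i.e. more information.
    Patch timesteps are perturbed around a global t, but clamped so no
    patch drops below t_global - gap, bounding the maximum patch-level
    information visible during training.
    """
    t_global = torch.rand(batch, 1)                            # (B, 1)
    offset = (torch.rand(batch, n_patches) * 2.0 - 1.0) * gap  # (B, P)
    floor = (t_global - gap).clamp(min=0.0)
    t_patch = (t_global + offset).clamp(max=1.0)
    t_patch = torch.maximum(t_patch, floor)
    return t_patch, t_global
```

这样训练时模型不会见到"某些 patch 几乎无噪声"这种推理阶段不存在的过强条件,与摘要中避免 overly informative training states 的动机一致。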

[CV-70] Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

【速读】:该论文旨在解决零样本草图驱动的3D形状检索(Zero-Shot Sketch-Based 3D Shape Retrieval, ZS-SBSR)问题,其核心挑战在于缺乏类别监督信号以及草图输入的高度抽象性和稀疏性,导致现有方法在零样本场景下性能受限。解决方案的关键在于利用预训练扩散模型(如Stable Diffusion)固有的开放词汇能力(open-vocabulary capability)和强形状偏置(shape bias),并通过一种多模态特征增强策略,在不进行昂贵再训练的前提下提升对草图的语义理解和轮廓敏感度:具体包括注入来自CLIP视觉编码器的全局与局部视觉特征、融合可学习软提示(soft prompts)与BLIP生成的硬文本描述以增强语义引导,并采用Circle-T损失函数动态强化正样本对吸引力,从而有效对齐草图与3D形状表示,显著改善检索性能。

链接: https://arxiv.org/abs/2604.19135
作者: Hang Cheng,Fanhe Dong,Long Zeng
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature-enhanced strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing its ability to capture semantic context and concentrate on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.

[CV-71] BALTIC: A Benchmark and Cross-Domain Strategy for 3D Reconstruction Across Air and Underwater Domains Under Varying Illumination

【速读】:该论文旨在解决机器人感知中跨环境(空气与水)条件下鲁棒三维重建的挑战,尤其关注不同介质和光照条件对重建精度的影响。其解决方案的关键在于构建一个受控的基准测试平台 BALTIC,涵盖空气与水两种介质、三种光照条件(自然光、人工光和混合光),并系统评估多种现代3D重建方法(如基于结构光的重建、神经辐射场和3D高斯泼溅)。实验通过定制水箱配合单目相机与HTC Vive追踪系统获取精确位姿真值,并利用少量空气中采集的图像对水下序列进行跨域增强,从而验证在纹理一致条件下,仅需简单预处理(如白平衡校正)的3D高斯泼溅方法即可达到与专用水下方法相当的性能,但其鲁棒性在复杂真实场景中会下降。

链接: https://arxiv.org/abs/2604.19133
作者: Michele Grimaldi,David Nakath,Oscar Pizarro,Jonatan Scharff Willners,Ignacio Carlucho,Yvan R. Petillot
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust 3D reconstruction across varying environmental conditions remains a critical challenge for robotic perception, particularly when transitioning between air and water. To address this, we introduce BALTIC, a controlled benchmark designed to systematically evaluate modern 3D reconstruction methods under variations in medium and lighting. The benchmark comprises 13 datasets spanning two media (air and water) and three lighting conditions (ambient, artificial, and mixed), with additional variations in motion type, scanning pattern, and initialization trajectory, resulting in a diverse set of sequences. Our experimental setup features a custom water tank equipped with a monocular camera and an HTC Vive tracker, enabling accurate ground-truth pose estimation. We further investigate cross-domain reconstruction by augmenting underwater image sequences with a small number of in-air views captured under similar lighting conditions. We evaluate Structure-from-Motion reconstruction using COLMAP in terms of both trajectory accuracy and scene geometry, and use these reconstructions as input to Neural Radiance Fields and 3D Gaussian Splatting methods. The resulting models are assessed against ground-truth trajectories and in-air references, while rendered outputs are compared using perceptual and photometric metrics. Additionally, we perform a color restoration analysis to evaluate radiometric consistency across domains. Our results show that under controlled, texture-consistent conditions, Gaussian Splatting with simple preprocessing (e.g., white balance correction) can achieve performance comparable to specialized underwater methods, although its robustness decreases in more complex and heterogeneous real-world environments.

[CV-72] PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment CVPR2026

【速读】:该论文旨在解决现有面部重演(facial reenactment)方法在表现力(expressiveness)与细粒度控制(fine-grained controllability)之间难以平衡的问题:整体式建模通常牺牲控制精度以换取自然表情,而强调控制的方法则往往在保真度和情感-物理运动解耦上表现不佳。其解决方案的关键在于提出 PortraitDirector 框架,采用分层运动解耦与重组策略(Hierarchical Motion Disentanglement and Composition),将面部运动分解为两个独立的语义层——空间层(Spatial Layer)负责物理动作(如头部姿态和局部表情),通过情绪过滤模块(Emotion-Filtering Module)去除情感干扰;语义层(Semantic Layer)提取全局情感信号。二者再被融合成一个表达性强的运动潜在表示(motion latent),从而实现高保真、可控且可解释的面部重演效果。

链接: https://arxiv.org/abs/2604.19129
作者: Chaonan Ji,Jinwei Qi,Sheng Xu,Peng Zhang,Bang Zhang
机构: Tongyi Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR2026

点击查看摘要

Abstract:Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via Emotion-Filtering Module leveraging an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512 x 512 face reenactment at 20 FPS with an end-to-end latency of 800 ms on a single 5090 GPU.

[CV-73] Robust Continual Unlearning against Knowledge Erosion and Forgetting Reversal

【速读】:该论文旨在解决重复执行数据遗忘(machine unlearning)过程中出现的两个关键问题:知识侵蚀(Knowledge Erosion)和遗忘反转(Forgetting Reversal)。前者指在多次遗忘操作后,模型对保留数据的准确率逐步下降;后者指先前被遗忘的数据在后续阶段重新变得可识别。为应对这些问题,作者提出 SAFER(StAbility-preserving Forgetting with Effective Regularization)框架,其核心在于通过保持保留数据表示的稳定性并施加遗忘数据的负对数几率边缘约束(negative logit margins),从而实现持续遗忘下的性能稳定。实验表明,SAFER 能有效缓解上述两种现象,在多轮遗忘中维持模型整体性能。

链接: https://arxiv.org/abs/2604.19108
作者: Eun-Ju Park,Youjin Shin,Simon S. Woo
机构: Sungkyunkwan University (成均馆大学); The Catholic University of Korea (韩国天主教大学); Secure Machines Lab
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As a means to balance the growth of the AI industry with the need for privacy protection, machine unlearning plays a crucial role in realizing the "right to be forgotten" in artificial intelligence. This technique enables AI systems to remove the influence of specific data while preserving the rest of the learned knowledge. Although it has been actively studied, most existing unlearning methods assume that unlearning is performed only once. In this work, we evaluate existing unlearning algorithms in a more realistic scenario where unlearning is conducted repeatedly, and in this setting, we identify two critical phenomena: (1) Knowledge Erosion, where the accuracy on retain data progressively degrades over unlearning phases, and (2) Forgetting Reversal, where previously forgotten samples become recognizable again in later phases. To address these challenges, we propose SAFER (StAbility-preserving Forgetting with Effective Regularization), a continual unlearning framework that maintains representation stability for retain data while enforcing negative logit margins for forget data. Extensive experiments show that SAFER mitigates not only knowledge erosion but also forgetting reversal, achieving stable performance across multiple unlearning phases.
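SAFER 的损失函数形式摘要中未给出。下面是笔者虚构的一个 PyTorch 最小示意(超参 `margin`、`lam` 均为假设):对保留数据施加表示稳定项(与冻结参考模型的特征对齐),对遗忘数据施加负 logit 边缘约束(把真实类 logit 压到其他类最大 logit 之下至少 margin):

```python
import torch
import torch.nn.functional as F

def safer_loss(logits_f, labels_f, feat_r, feat_r_ref, margin=2.0, lam=1.0):
    """Hypothetical continual-unlearning objective (not the paper's code).

    Forget term: hinge loss pushing the true-class logit at least
    `margin` below the best other class (a negative logit margin).
    Retain term: keep retain-data features close to a frozen reference
    model (representation stability).
    """
    true = logits_f.gather(1, labels_f[:, None]).squeeze(1)
    # Mask out the true class, then take the best competing logit.
    other = logits_f.scatter(1, labels_f[:, None], float("-inf")).amax(dim=1)
    forget = F.relu(true - other + margin).mean()
    stability = F.mse_loss(feat_r, feat_r_ref)
    return forget + lam * stability
```

这种写法的直觉是:遗忘项只要求"真实类不再是最可信的类",而不是把其 logit 压到任意低,从而减少对保留知识的附带破坏。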

[CV-74] EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

【速读】:该论文旨在解决第一人称视角下多模态运动生成(Egocentric Vision-Language, Ego-VL)中的核心挑战,即语义推理与运动建模之间的梯度冲突问题(reasoning-generation entanglement challenge),该冲突导致多模态对齐精度和运动质量下降。解决方案的关键在于提出一个分层生成框架EgoMotion:首先通过视觉语言模型(VLM)在认知推理阶段将多模态输入映射到离散的动作基元空间,从而实现目标一致的语义表征;随后在运动生成阶段,利用扩散模型基于这些结构化条件信号进行连续潜空间内的迭代去噪,合成物理合理且时序一致的3D人体运动轨迹。此双阶段设计有效解耦了认知推理与运动控制过程,显著提升了生成结果的语义准确性与运动保真度。

链接: https://arxiv.org/abs/2604.19105
作者: Ruibing Hou,Mingyue Zhou,Yuwei Gui,Mingshuang Luo,Bingpeng Ma,Hong Chang,Shiguang Shan,Xilin Chen
机构: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS); University of the Chinese Academy of Sciences; Jilin University (JLU); Beijing University of Posts and Telecommunications (BUPT)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical reasoning-generation entanglement challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework EgoMotion. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance, and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.

[CV-75] Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration

【速读】:该论文旨在解决多模态测试时适应(multi-modal test-time adaptation, TTA)中因类别条件分布建模不足而导致预测不准确和决策边界不可靠的问题。当前方法在面对跨域分布偏移时性能受限,主要原因在于缺乏对类别条件分布的显式建模,且在多模态场景下,模态间分布不对称性进一步削弱了传统高斯判别分析(Gaussian Discriminant Analysis, GDA)的效果。解决方案的关键在于提出一种专为多模态TTA设计的概率高斯模型以显式建模类别条件分布,并引入自适应对比不对称校正(adaptive contrastive asymmetry rectification)技术,有效缓解模态分布不对称带来的负面影响,从而实现校准后的预测与可靠的决策边界。

链接: https://arxiv.org/abs/2604.19093
作者: Jinglin Xu,Yi Li,Chuxiong Sun,Xiao Xu,Jiangmeng Li,Fanjiang Xu
机构: Institute of Software Chinese Academy of Sciences (中国科学院软件研究所); University of Chinese Academy of Sciences (中国科学院大学); National Defense University (国防大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts. The code is available at this https URL.
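作为背景补充,摘要中作为基线提到的经典高斯判别分析(GDA)对类别条件分布的建模,可以用如下最小 NumPy 示意表达(共享协方差版本,仅帮助理解基线思路,与论文提出的方法本身无关):

```python
import numpy as np

def gda_predict(x, means, cov, priors):
    """Gaussian discriminant analysis with a shared covariance.

    Scores each class by the log class-conditional Gaussian density
    plus the log prior, then normalizes into per-class posteriors.
    x: (N, D), means: (C, D), cov: (D, D), priors: (C,)
    """
    inv = np.linalg.inv(cov)
    diff = x[:, None, :] - means[None, :, :]               # (N, C, D)
    # Mahalanobis distance to each class mean.
    maha = np.einsum("ncd,de,nce->nc", diff, inv, diff)    # (N, C)
    log_score = -0.5 * maha + np.log(priors)[None, :]
    log_score -= log_score.max(axis=1, keepdims=True)      # for stability
    p = np.exp(log_score)
    return p / p.sum(axis=1, keepdims=True)
```

在 TTA 场景下,`means` 与 `cov` 通常由测试流上的在线统计量估计;论文指出多模态下各模态分布不对称时,这种朴素建模会失效,这正是其校正技术要解决的问题。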

[CV-76] The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation

【速读】:该论文针对视觉-语言导航(Vision-and-Language Navigation, VLN)中代理通过策略诱导经验进行自我改进时面临的挑战,即如何在行为多样性(behavioral diversity)与学习稳定性(learning stability)之间取得平衡。若多样性不足,则难以探索有效动作假设;若过度追求多样性,则可能破坏学习信号的可靠性,导致训练不稳定。解决方案的关键在于提出一种可插拔的“稳定性-多样性平衡机制”(Stability-Diversity Balance, SDB),其核心是通过在每个决策步骤中引入受控的指令条件隐藏状态扰动,生成多个潜在行为假设,并基于可靠性感知的软评估与聚合策略保留多样且符合指令的一致性选项;同时引入显式正则项约束假设间的交互,防止多样性过早衰减或漂移,从而在不丢弃训练信号的前提下实现稳定、有效的自提升。

链接: https://arxiv.org/abs/2604.19064
作者: Zhen Liu,Yuhan Liu,Jinjun Wang,Jianyi Liu,Wei Song,Jingwen Fu
机构: Xi’an Jiaotong University (西安交通大学); North China University of Technology (华北理工大学); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In vision-and-language navigation (VLN), self-improvement from policy-induced experience, using only standard VLN action supervision, critically depends on balancing behavioral diversity and learning stability, which governs whether the agent can extract a reliable learning signal for improvement. Increasing behavioral diversity is necessary to expose alternative action hypotheses but can destabilize policy-induced learning signals, whereas overly conservative stability constraints suppress exploration and induce early commitment, making reliable self-improvement difficult. To address this challenge, we propose Stability-Diversity Balance (SDB), a plug-and-play mechanism for balanced self-improvement in VLN. SDB expands each decision step into multiple latent behavioral hypotheses by applying controlled shifts in the instruction-conditioned hidden states, and then performs reliability-aware soft evaluation and aggregation to retain diverse yet instruction-consistent alternatives during learning. An explicit regularizer further constrains hypothesis interactions, preventing excessive drift or premature collapse of hypothesis diversity and stabilizing self-improvement without discarding training signals. Experiments on R2R, SOON, and REVERIE show consistent improvements; for example, on REVERIE val-unseen, SDB improves SPL from 33.73 to 35.93 and OSR from 51.07 to 54.25.

[CV-77] Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

【速读】:该论文旨在解决如何在资源受限的边缘设备上高效部署计算机视觉模型的问题,核心挑战在于平衡模型精度与延迟、内存容量及能耗等约束条件。其解决方案的关键在于设计并实施IEEE低功耗计算机视觉挑战赛(LPCVC)2025,通过引入Qualcomm AI Hub作为统一基准测试平台,确保评估的一致性和可复现性;同时,论文系统总结了各赛道(图像分类、开放词汇分割、单目深度估计)的最优方案,并提炼出当前主流技术趋势,为未来计算机视觉竞赛的设计提供参考依据。

链接: https://arxiv.org/abs/2604.19054
作者: Zihao Ye,Yung Hsiang Lu,Xiao Hu,Shuai Zhang,Taotao Jing,Xin Li,Zhen Yao,Bo Lang,Zhihao Zheng,Seungmin Oh,Hankyul Kang,Seunghun Kang,Jongbin Ryu,Kexin Chen,Yuan Qi,George K Thiruvathukal,Mooi Choo Chuah
机构: Purdue University (普渡大学); Qualcomm (高通公司); Lehigh University (利海大学); Ajou University (亚洲大学); University of Minnesota (明尼苏达大学); Loyola University Chicago (洛约拉大学芝加哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures, 4 tables

点击查看摘要

Abstract:The IEEE Low-Power Computer Vision Challenge (LPCVC) aims to promote the development of efficient vision models for edge devices, balancing accuracy with constraints such as latency, memory capacity, and energy use. The 2025 challenge featured three tracks: (1) Image classification under various lighting conditions and styles, (2) Open-Vocabulary Segmentation with Text Prompt, and (3) Monocular Depth Estimation. This paper presents the design of LPCVC 2025, including its competition structure and evaluation framework, which integrates the Qualcomm AI Hub for consistent and reproducible benchmarking. The paper also introduces the top-performing solutions from each track and outlines key trends and observations. The paper concludes with suggestions for future computer vision competitions.

[CV-78] Generative Texture Filtering SIGGRAPH2026

【速读】:该论文旨在解决纹理滤除(texture filtering)任务中性能不足与泛化能力弱的问题,尤其针对以往方法在处理复杂纹理或结构保持困难场景时表现不佳的挑战。其解决方案的关键在于充分利用预训练生成模型(pre-trained generative models)所蕴含的强大图像先验知识,通过两阶段微调策略实现高效优化:首先在少量成对图像上进行监督微调,随后在大规模无标签数据集上利用量化纹理去除质量与结构保留效果的奖励函数进行强化学习微调,从而显著提升模型在纹理滤除任务中的效果与鲁棒性。

链接: https://arxiv.org/abs/2604.19039
作者: Rongjia Zheng,Shangwei Huang,Lei Zhu,Wei-Shi Zheng,Qing Zhang
机构: Sun Yat-sen University (中山大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to SIGGRAPH 2026 conference track

点击查看摘要

Abstract:We present a generative method for texture filtering, which exhibits surprisingly good performance and generalizability. Our core idea is to empower texture filtering by taking full advantage of the strong learned image prior of pre-trained generative models. To this end, we propose to fine-tune a pre-trained generative model via a two-stage strategy. Specifically, we first conduct supervised fine-tuning on a very small set of paired images, and then perform reinforcement fine-tuning on a large-scale unlabeled dataset under the guidance of a reward function that quantifies the quality of texture removal and structure preservation. Extensive experiments show that our method clearly outperforms previous methods, and is effective to deal with previously challenging cases. Our code is available at this https URL.
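论文奖励函数的具体构造未公开。下面给出一个粗糙的假设性类比(完全是笔者虚构,非论文实现):用总变差近似"纹理去除程度",用输入图强边缘处的梯度保留近似"结构保持程度",两者加权作为奖励:

```python
import numpy as np

def texture_reward(filtered, original, w_smooth=1.0, w_struct=1.0):
    """Toy reward: low total variation (texture removed) plus gradient
    magnitude preserved on the input's strongest edges (structure kept).
    Illustrative analogue only, not the paper's reward function."""
    fx, fy = np.diff(filtered, axis=1), np.diff(filtered, axis=0)
    ox, oy = np.diff(original, axis=1), np.diff(original, axis=0)
    tv = np.abs(fx).mean() + np.abs(fy).mean()
    # Structure term: look only where the input has its strongest edges.
    mx = np.abs(ox) >= np.quantile(np.abs(ox), 0.9)
    my = np.abs(oy) >= np.quantile(np.abs(oy), 0.9)
    struct = np.abs(fx)[mx].mean() + np.abs(fy)[my].mean()
    return -w_smooth * tv + w_struct * struct
```

这种"可自动计算的奖励 + 大规模无标签数据"的组合,正是摘要中强化微调阶段能摆脱成对数据依赖的原因。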

[CV-79] Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents

【速读】:该论文旨在解决当前复杂具身导航任务中结构化空间记忆构建效率低下的问题,特别是现有方法依赖离线、几何中心的两阶段范式,难以利用高层语义智能,导致忽略关键导航地标(如门和楼梯)的问题。解决方案的关键在于提出ABot-Explorer框架,其核心是将记忆构建与探索过程统一为在线RGB图像驱动的流程,并借助大视觉语言模型(Large Vision-Language Models, VLMs)提取语义导航可及性(Semantic Navigational Affordances, SNA),作为认知对齐的锚点引导智能体移动;同时通过动态整合SNA至分层SG-Memo结构,模拟人类探索逻辑,优先覆盖结构化通行节点以提升探索效率和环境覆盖率。

链接: https://arxiv.org/abs/2604.19034
作者: Xu Chen,Shichao Xie,Zhining Gu,Lu Jia,Minghua Luo,Fei Liu,Zedong Chu,Yanfen Shen,Xiaolong Wu,Mu Xu
机构: Amap, Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Constructing structured spatial memory is essential for enabling long-horizon reasoning in complex embodied navigation tasks. Current memory construction predominantly relies on a decoupled, two-stage paradigm: agents first aggregate environmental data through exploration, followed by the offline reconstruction of spatial memory. However, this post-hoc and geometry-centric approach precludes agents from leveraging high-level semantic intelligence, often causing them to overlook navigationally critical landmarks (e.g., doorways and staircases) that serve as fundamental semantic anchors in human cognitive maps. To bridge this gap, we propose ABot-Explorer, a novel active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer leverages Large Vision-Language Models (VLMs) to distill Semantic Navigational Affordances (SNA), which act as cognitive-aligned anchors to guide the agent’s movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mirrors human-like exploratory logic by prioritizing structural transit nodes to facilitate efficient coverage. To support this framework, we contribute a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations. Experimental results demonstrate that ABot-Explorer significantly outperforms current state-of-the-art methods in both exploration efficiency and environment coverage, while the resulting SG-Memo is shown to effectively support diverse downstream tasks.

[CV-80] Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

【速读】:该论文旨在解决扩散蒸馏(Diffusion Distillation)在少步生成中因追求采样速度而牺牲生成质量的问题,以及将强化学习(Reinforcement Learning, RL)融入蒸馏时因依赖原始样本评分而导致奖励信号不可靠、与蒸馏轨迹冲突的难题。其解决方案的关键在于提出GDMD框架,通过重新定义奖励机制,以蒸馏梯度而非原始像素输出作为优化主信号,将DMD梯度视为隐式目标张量,使现有奖励模型能够直接评估蒸馏更新的质量,从而实现RL策略与蒸馏目标的自适应对齐,有效缓解优化偏差并提升生成质量。

链接: https://arxiv.org/abs/2604.19009
作者: Linwei Dong,Ruoyu Guo,Ge Bai,Zehuan Yuan,Yawei Luo,Changqing Zou
机构: 1. Tsinghua University (清华大学); 2. Alibaba Group (阿里巴巴集团); 3. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.

[CV-81] AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos ICMR2026

【速读】:该论文旨在解决自动驾驶场景中恶劣天气下感知鲁棒性不足的问题,其核心瓶颈在于真实世界恶劣天气视频数据的稀缺性。现有天气生成方法难以在视觉质量与标注可复用性之间取得平衡。解决方案的关键在于提出AutoAWG框架,通过语义引导的多控制自适应融合策略,在强天气风格化与关键安全目标高保真保留之间实现平衡;采用基于消失点锚定的时序合成策略,从静态图像构建训练序列以减少对合成数据的依赖;并引入掩码训练机制提升长时序生成的稳定性。实验表明,该方法在nuScenes验证集上显著优于现有最先进方法,尤其在风格保真度、时序一致性和语义结构完整性方面表现突出。

链接: https://arxiv.org/abs/2604.18993
作者: Jiagao Hu,Daiguo Zhou,Danzhen Fu,Fuhao Li,Zepeng Wang,Fei Wang,Wenhua Liao,Jiayi Xie,Haiyang Sun
机构: Xiaomi Inc.(小米公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted by ICMR 2026

点击查看摘要

Abstract:Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real-world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous driving. Our method employs a semantics-guided adaptive fusion of multiple controls to balance strong weather stylization with high-fidelity preservation of safety-critical targets; leverages a vanishing point-anchored temporal synthesis strategy to construct training sequences from static images, thereby reducing reliance on synthetic data; and adopts masked training to enhance long-horizon generation stability. On the nuScenes validation set, AutoAWG significantly outperforms prior state-of-the-art methods: without first-frame conditioning, FID and FVD are relatively reduced by 50.0% and 16.1%; with first-frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. Extensive qualitative and quantitative results demonstrate advantages in style fidelity, temporal consistency, and semantic–structural integrity, underscoring the practical value of AutoAWG for improving downstream perception in autonomous driving. Our code is available at: this https URL

[CV-82] A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation

【速读】:该论文旨在解决多模态共情响应生成(Multimodal Empathetic Response Generation, MERG)中存在的两个核心问题:一是人类对情感线索的感知具有内在结构化特征,而现有方法采用隐式的单次生成范式,忽略了情绪感知的层级演进过程,导致情感判断失真;二是由于人类情绪本身的复杂性和模糊性,传统方法容易产生显著的情感偏差,进而影响共情效果。解决方案的关键在于提出一种多智能体框架,通过结构化推理与反思优化实现更精准的共情响应生成:首先设计一个结构化的“共情推理-生成模块”,将响应生成过程显式分解为多模态感知、一致性感知的情绪预测、实用策略规划和策略引导的响应生成四个步骤,构建从多模态证据到响应输出的清晰中间路径;其次引入全局反思与精炼模块,由全局反思代理对中间状态和生成结果进行逐步审计,消除情感偏差与共情错误并触发针对性重生成,形成闭环迭代机制,从而逐步提升情绪感知准确性并减少情感偏见。

链接: https://arxiv.org/abs/2604.18988
作者: Liping Wang,Cheng Ye,Weidong Chen,Peipei Song,Bo Hu,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ACM Multimedia 2026

点击查看摘要

Abstract:Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users’ multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation via multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state-of-the-art methods.

[CV-83] AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3D-GS)在GPU上渲染速度慢的问题,核心在于减少高斯-瓦片(Gaussian-tile)配对数量以提升效率。此前研究未考虑不同高斯-瓦片配对之间的重要性差异,而本文提出AdaGScale——一种基于视角自适应的高斯缩放技术,其关键创新在于:通过预处理阶段高效估算每个高斯在周边区域的颜色贡献度(color contribution),并根据该“外围得分”(peripheral score)自适应调整高斯的尺寸用于瓦片相交测试。此方法使低重要性高斯与更少瓦片发生相交,从而显著加速渲染;同时保留原始尺寸用于颜色累积,确保图像质量不受影响。实验表明,AdaGScale在城市尺度场景下实现13.8倍几何平均加速,PSNR仅下降约0.5 dB。

链接: https://arxiv.org/abs/2604.18980
作者: Joongho Jo,Hyerin Lim,Hanjun Choi,Jongsun Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: DAC 2026

点击查看摘要

Abstract:Reducing the number of Gaussian-tile pairs is one of the most promising approaches to improve 3D Gaussian Splatting (3D-GS) rendering speed on GPUs. However, the importance difference existing among Gaussian-tile pairs has never been considered in the previous works. In this paper, we propose AdaGScale, a novel viewpoint-adaptive Gaussian scaling technique for reducing the number of Gaussian-tile pairs. AdaGScale is based on the observation that the peripheral tiles located far from Gaussian center contribute negligibly to pixel color accumulation. This suggests an opportunity for reducing the number of Gaussian-tile pairs based on color contribution. AdaGScale efficiently estimates the color contribution in the peripheral region of each Gaussian during a preprocessing stage and adaptively scales its size based on the peripheral score. As a result, Gaussians with lower importance intersect with fewer tiles during the intersection test, which improves rendering speed while maintaining image quality. The adjusted size is used only for tile intersection test, and the original size is retained during color accumulation to preserve visual fidelity. Experimental results show that AdaGScale achieves a geometric mean speedup of 13.8x over original 3D-GS on a GPU, with only about 0.5 dB degradation in PSNR on city-scale scenes.
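摘要描述的核心操作可以用一个极简的假设性函数示意(缩放下限 `s_min` 为笔者虚构参数):按外围颜色贡献得分缩小参与 tile 相交测试的半径,颜色累积阶段仍使用原始半径:

```python
def adaptive_radius(base_radius, peripheral_score, s_min=0.6):
    """Hypothetical sketch of AdaGScale's scaling rule (not the paper's
    code). peripheral_score in [0, 1] is the estimated color contribution
    of the Gaussian's peripheral tiles; low scores shrink the radius used
    only for the tile intersection test, so fewer Gaussian-tile pairs are
    generated while color accumulation keeps the original radius.
    """
    score = min(max(peripheral_score, 0.0), 1.0)
    return base_radius * (s_min + (1.0 - s_min) * score)
```

由于相交测试半径与颜色累积半径分离,低重要性高斯被漏配对的 tile 本来贡献就接近于零,这解释了摘要中 PSNR 仅下降约 0.5 dB 的结果。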

[CV-84] Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2

【速读】:该论文旨在解决胸部X光片(Chest X-ray, CXR)放射学报告生成(Radiology Report Generation, RRG)模型在临床应用中的实用性问题,即当前生成模型虽性能提升显著,但缺乏放射科医生的系统性评估,难以判断其是否具备临床可用性。解决方案的关键在于提出CXRMate-2模型,该模型融合结构化多模态条件控制与强化学习机制,并设计复合奖励函数以增强生成报告在语义层面与放射科医生报告的一致性。实验表明,该模型在多个基准数据集上显著优于现有方法,且在盲法随机回顾性评估中,45%的生成报告被放射科医生评为可接受(等同或优于人工报告),尤其在可读性方面表现更优,而人工报告的优势主要体现在更高的召回率。这为实现接近放射科医生水平的CXR RRG提供了可行路径。

链接: https://arxiv.org/abs/2604.18967
作者: Aaron Nicolson,Elizabeth J. Cooper,Hwan-Jin Yoon,Claire McCafferty,Ramya Krishnan,Michelle Craigie,Nivene Saad,Jason Dowling,Ian A. Scott,Bevan Koopman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.

[CV-85] AI-Enabled Image-Based Hybrid Vision/Force Control of Tendon-Driven Aerial Continuum Manipulators

【速读】:该论文旨在解决腱驱动空中连续体机械臂在执行自主物理交互任务时,因视觉与力感知不确定性导致的控制稳定性与精度问题。其核心挑战在于如何在无标定、非结构化环境中实现对图像特征误差和接触力的实时协同调控。解决方案的关键在于提出一种基于SE(3)空间中恒应变建模的级联混合视觉/力控制框架,融合快速固定时间滑模控制(fast fixed-time sliding mode control)与径向基函数神经网络(Radial Basis Function Neural Network, RBFNN),以在线自适应补偿由单目相机图像和力传感器测量引入的不确定性;同时采用先进的图神经网络(Graph Neural Network, GNN)提取线特征替代传统启发式几何方法,从而同步实现目标法向接触力跟踪与图像特征误差调节,显著提升系统鲁棒性与控制性能。

链接: https://arxiv.org/abs/2604.18961
作者: Shayan Sepahvand,Farrokh Janabi-Sharifi,Farhad Aghili
机构: Toronto Metropolitan University (多伦多都会大学); Concordia University (康考迪亚大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents an AI-enabled cascaded hybrid vision/force control framework for tendon-driven aerial continuum manipulators based on constant-strain modeling in SE(3) as a coupled system. The proposed controller is designed to enable autonomous, physical interaction with a static environment while stabilizing the image feature error. The developed strategy combines the cascaded fast fixed-time sliding mode control and a radial basis function neural network to cope with the uncertainties in the image acquired by the eye-in-hand monocular camera and the measurements from the force sensing apparatus. This ensures rapid, online learning of the vision- and force-related uncertainties without requiring offline training. Furthermore, the features are extracted via a state-of-the-art graph neural network architecture employed by a visual servoing framework using line features, rather than relying on heuristic geometric line extractors, to concurrently contribute to tracking the desired normal interaction force during contact and regulating the image feature error. A comparative study benchmarks the proposed controller against established rigid-arm aerial manipulation methods, evaluating robustness across diverse scenarios and feature extraction strategies. The simulation and experimental results showcase the effectiveness of the proposed methodology under various initial conditions and demonstrate robust performance in executing manipulation tasks.
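摘要中用于在线补偿视觉/力不确定性的 RBF 神经网络,其前向计算与在线权重更新可以用如下 NumPy 极简示意表达(高斯基函数;中心位置、宽度、学习率与待补偿扰动均为示例假设,并非论文实现):

```python
import numpy as np

def rbf_forward(x, centers, width, weights):
    """RBFNN 前向:phi_i = exp(-||x - c_i||^2 / width^2),输出为加权和。"""
    d2 = np.sum((centers - x) ** 2, axis=1)
    phi = np.exp(-d2 / width ** 2)
    return float(phi @ weights), phi

def rbf_update(weights, phi, error, lr=0.1):
    """按瞬时误差在线更新输出权重(对应摘要中的在线学习,无需离线训练)。"""
    return weights + lr * error * phi

centers = np.array([[0.0], [1.0], [2.0]])
weights = np.zeros(3)
target = lambda x: 0.5 * np.exp(-(x - 1.0) ** 2)  # 待补偿的未知扰动(示例)
for _ in range(100):                               # 在线逐点学习
    for xv in np.linspace(0.0, 2.0, 5):
        y_hat, phi = rbf_forward(np.array([xv]), centers, 1.0, weights)
        weights = rbf_update(weights, phi, target(xv) - y_hat)
y1, _ = rbf_forward(np.array([1.0]), centers, 1.0, weights)
print(round(y1, 3))
```

在控制回路中,这一在线逼近得到的扰动估计通常被叠加进控制律以抵消不确定性,这也是摘要中"快速在线学习视觉与力相关不确定性"的基本机制。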

[CV-86] Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images MICRO CVPR

【速读】:该论文旨在解决从显微镜图像中自动提取标准化冶金参数(如ASTM粒径编号G)的难题,尤其针对复杂晶粒形貌和监督分割对数据量的高需求。其核心解决方案是构建一个适配细胞结构特征的自动化流水线,关键在于将Cellpose-SAM模型与拓扑感知梯度追踪技术相结合,并集成ASTM E112 Jeffries平面计数模块,从而实现密集实例分割与粒径估算的精准协同。此方法在仅需两样本训练的情况下即达到MAPE低至1.50%的性能,且验证了ASTM标准中最小50粒计数的可靠性,体现了应用级基础模型整合在材料表征中的高效性与鲁棒性。

链接: https://arxiv.org/abs/2604.18957
作者: Abdul Mueez,Shruti Vyas
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 11th IEEE Workshop on Computer Vision for Multimodal Microscopy Image Analysis (CVMI), CVPR Workshops 2026

点击查看摘要

Abstract:Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose-SAM to microstructures and integrates its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM) and a contemporary vision-language model (Qwen2.5-VL-7B). Our evaluations reveal that while the out-of-the-box vision-language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over-segmentation despite its domain-specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few-shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50-grain sampling minimum. These results highlight the efficacy of application-level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at this https URL.
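摘要中集成的 ASTM E112 Jeffries 平面计数法可概括为:有效晶粒数 n_eff = n_内部 + n_边界/2,单位面积晶粒数 N_A = n_eff / A(A 为换算到 1 倍放大下的实际面积,mm²),粒径编号 G = 3.321928·log10(N_A) − 2.954。下面是一个最小示意(计数与面积数值为示例):

```python
import math

def astm_g_jeffries(n_inside, n_boundary, area_mm2):
    """ASTM E112 Jeffries 平面法:由单位面积晶粒数 N_A 计算粒径编号 G。"""
    n_eff = n_inside + 0.5 * n_boundary   # 压在视场边界上的晶粒按 1/2 计
    n_a = n_eff / area_mm2                # 晶粒数 / mm^2(1x 放大)
    return 3.321928 * math.log10(n_a) - 2.954

# N_A = 15.5 grains/mm^2 对应 G ≈ 1(E112 中的参考点)
g = astm_g_jeffries(n_inside=10, n_boundary=11, area_mm2=1.0)
print(round(g, 2))
```

论文的流水线即在密集实例分割得到逐晶粒掩码后,代入上式得到 G,并以 MAPE 衡量其与真值的偏差。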

[CV-87] Localization-Guided Foreground Augmentation in Autonomous Driving

【速读】:该论文旨在解决自动驾驶系统在低能见度条件下(如雨天、夜间或雪天)因场景几何信息(如车道线、道路边界和人行横道)稀疏或断裂而导致感知性能下降的问题。传统高精地图虽可提供缺失的结构上下文,但其构建与维护成本高昂。解决方案的关键在于提出一种轻量级且即插即用的推理模块——定位引导前景增强(Localization-Guided Foreground Augmentation, LG-FA),其核心机制包括:(i) 从每帧鸟瞰图(Bird’s-Eye View, BEV)预测中增量式构建稀疏全局向量层;(ii) 通过类别约束的几何对齐估计自车位姿,同时提升定位精度并补全局部拓扑;(iii) 将增强后的前景重新投影至统一全局坐标系以改善单帧预测质量。该方法无需修改现有BEV感知模型主干即可显著提升几何完整性、时序稳定性及全局一致性,为下游任务如跟踪与决策提供可靠结构先验。

链接: https://arxiv.org/abs/2604.18940
作者: Jiawei Yong,Deyuan Qu,Qi Chen,Kentaro Oguchi,Shintaro Fukushima
机构: Toyota Motor Corporation (丰田汽车公司); Toyota Motor North America (丰田汽车北美公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Autonomous driving systems often degrade under adverse visibility conditions-such as rain, nighttime, or snow-where online scene geometry (e.g., lane dividers, road boundaries, and pedestrian crossings) becomes sparse or fragmented. While high-definition (HD) maps can provide missing structural context, they are costly to construct and maintain at scale. We propose Localization-Guided Foreground Augmentation (LG-FA), a lightweight and plug-and-play inference module that enhances foreground perception by enriching geometric context online. LG-FA: (i) incrementally constructs a sparse global vector layer from per-frame Bird’s-Eye View (BEV) predictions; (ii) estimates ego pose via class-constrained geometric alignment, jointly improving localization and completing missing local topology; and (iii) reprojects the augmented foreground into a unified global frame to improve per-frame predictions. Experiments on challenging nuScenes sequences demonstrate that LG-FA improves the geometric completeness and temporal stability of BEV representations, reduces localization error, and produces globally consistent lane and topology reconstructions. The module can be seamlessly integrated into existing BEV-based perception systems without backbone modification. By providing a reliable geometric context prior, LG-FA enhances temporal consistency and supplies stable structural support for downstream modules such as tracking and decision-making.

[CV-88] A Proxy Consistency Loss for Grounded Fusion of Earth Observation and Location Encoders CVPR

【速读】:该论文旨在解决地球观测数据监督学习中因高质量标注数据或实地测量数据稀缺而导致的训练标签不足问题。解决方案的关键在于引入一种可训练的位置编码器(location encoder),通过代理一致性损失(Proxy Consistency Loss, PCL)将与目标变量相关但不完全相同的代理变量(proxy variables)隐式地整合进地理先验中,从而利用大量可用的代理数据提升模型性能。该方法的核心创新在于:其一,利用位置编码器灵活学习来自代理数据的信息,且无需依赖训练标签的可用性;其二,在有限标注数据条件下,通过对位置编码器进行适当正则化以增强模型的泛化能力和鲁棒性。实验表明,该方法在空气质量预测和贫困地图绘制任务中均优于传统输入融合策略及固定预训练位置嵌入的方法。

链接: https://arxiv.org/abs/2604.18881
作者: Zhongying Wang,Kevin Lane,Levi Cai,Morteza Karimzadeh,Esther Rolf
机构: University of Colorado, Boulder (科罗拉多大学博尔德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to EarthVision 2026 (CVPR Workshop). 13 pages total (10 pages main paper + 3 pages supplementary material), 5 main figures

点击查看摘要

Abstract:Supervised learning with Earth observation inputs is often limited by the sparsity of high-quality labeled or in-situ measured data to use as training labels. With the abundance of geographic data products, in many cases there are variables correlated with - but different from - the variable of interest that can be leveraged. We integrate such proxy variables within a geographic prior via a trainable location encoder and introduce a proxy consistency loss (PCL) formulation to imbue proxy data into the location encoder. The first key insight behind our approach is to use the location encoder as an agile and flexible way to learn from abundantly available proxy data which can be sampled independently of training label availability. Our second key insight is that we will need to regularize the location encoder appropriately to achieve performance and robustness with limited labeled data. Our experiments on air quality prediction and poverty mapping show that integrating proxy data implicitly through the location encoder outperforms using both as input to an observation encoder and fusion strategies that use frozen, pretrained location embeddings as a geographic prior. Superior performance for in-sample prediction shows that the PCL can incorporate rich information from the proxies, and superior out-of-sample prediction shows that the learned latent embeddings help generalize to areas without training labels.
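摘要中"通过位置编码器隐式引入代理变量"的目标函数形式,可以用如下 NumPy 极简示意表达:总损失为稀缺标签上的监督损失加上丰富代理数据上的一致性损失。其中线性位置编码器、MSE 形式与各权重矩阵均为示例假设,并非论文实现:

```python
import numpy as np

def pcl_objective(W, coords, proxy, x_obs, y, lam=1.0):
    """总损失 = 监督损失 + lam * 代理一致性损失(均取 MSE,示意)。"""
    z = coords @ W["loc"]                     # 位置编码器(此处取线性,示意)
    y_hat = x_obs @ W["obs"] + z @ W["head"]  # 观测编码 + 地理先验
    sup = np.mean((y_hat - y) ** 2)           # 稀缺标签上的监督损失
    proxy_hat = z @ W["proxy"]                # 代理预测头
    pcl = np.mean((proxy_hat - proxy) ** 2)   # 丰富代理数据上的一致性损失
    return float(sup + lam * pcl)

rng = np.random.default_rng(0)
coords, x_obs = rng.normal(size=(8, 2)), rng.normal(size=(8, 3))
proxy, y = rng.normal(size=8), rng.normal(size=8)
W = {"loc": rng.normal(size=(2, 4)), "obs": rng.normal(size=3),
     "head": rng.normal(size=4), "proxy": rng.normal(size=4)}
loss_no_pcl = pcl_objective(W, coords, proxy, x_obs, y, lam=0.0)
loss_pcl = pcl_objective(W, coords, proxy, x_obs, y, lam=1.0)
print(loss_no_pcl <= loss_pcl)  # PCL 项非负
```

关键点在于:代理项只经过位置编码器(不依赖观测输入),因此可以在无训练标签的位置上独立采样代理数据,这正是摘要强调的灵活性来源。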

[CV-89] Hierarchically Robust Zero-shot Vision-language Models CVPR’26

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在零样本分类任务中对对抗攻击敏感的问题,尤其关注当攻击目标为父类(superclass,如哺乳动物)而非仅叶类(leaf class,如猫)时导致的鲁棒性下降问题。现有方法通过固定文本嵌入与图像嵌入对齐来提升鲁棒性,但会损害自然性能和鲁棒性。解决方案的关键在于提出一种基于层次嵌入(hierarchical embeddings)的新型对抗微调框架,通过多层级对抗鲁棒对齐机制实现图像-文本模态间的结构化对齐,并引入机制将视觉嵌入置于层次结构中的指定深度,同时建立了嵌入深度与最大可行间隔(margin size)之间的理论联系,从而自然地支持多种间隔大小以增强对抗扰动下的泛化能力;此外,还考虑在多个共享叶节点的树结构上进行对齐,以提升语义多样性。

链接: https://arxiv.org/abs/2604.18867
作者: Junhao Dong,Yifei Zhang,Hao Zhu,Yew-Soon Ong,Piotr Koniusz
机构: Nanyang Technological University (南洋理工大学); CFAR, IHPC, A*STAR (新加坡科技研究局); Northwest Polytechnical University (西北工业大学); Data61 CSIRO (澳大利亚联邦科学与工业研究组织); University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper is accepted by CVPR’26

点击查看摘要

Abstract:Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.

[CV-90] HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images

【速读】:该论文旨在解决航空影像(aerial imagery)中目标检测模型泛化能力差的问题,尤其是在不同空间分辨率、场景组成和语义标签覆盖范围下,传统模型难以学习一致且可迁移的特征表示。其关键解决方案是一种模块化学习框架,通过两级结构实现精细化专业化:一是全局专家分配层,利用潜在地理嵌入将数据集路由至专用处理模块;二是局部场景分解机制,将图像子区域分配给区域特异性的子模块,从而在数据集间和复杂场景内实现差异化建模。此外,引入条件专家模块,借助外部语义信息(如类别名称或文本描述)实现推理阶段对新类别的检测,无需重新训练或微调,显著提升了模型的适应性与灵活性。

链接: https://arxiv.org/abs/2604.18866
作者: Pourya Shamsolmoali,Masoumeh Zareapoor,Michael Felsberg,Nick Pears,Yue Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IJCV September 2025

点击查看摘要

Abstract:Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.

[CV-91] Task Switching Without Forgetting via Proximal Decoupling

【速读】:该论文旨在解决持续学习(continual learning)中的灾难性遗忘问题,即模型在学习新任务时会丢失对先前任务的知识。传统方法通过正则化项约束关键参数的变化,但通常将学习与保留信号混合在同一梯度更新中,导致参数冗余和容量利用效率低下。其解决方案的关键在于引入算子分裂(operator splitting)机制,将任务学习与稳定性强化明确分离:学习步骤专注于最小化当前任务损失,而近端稳定步骤则应用稀疏正则化来剪枝冗余参数并保留任务相关参数,从而将稳定性与可塑性转化为两个互补算子之间的协商更新,而非冲突梯度。这一方法在无需回放缓冲区、贝叶斯采样或元学习组件的情况下实现了更优的稳定性和适应性表现。

链接: https://arxiv.org/abs/2604.18857
作者: Pourya Shamsolmoali,Masoumeh Zareapoor,Eric Granger,William A. P. Smith,Yue Lu
机构: University of York (约克大学); Shanghai Jiao Tong University (上海交通大学); ETS Montreal (蒙特利尔工程学院); East China Normal University (华东师范大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE TPAMI January 2026

点击查看摘要

Abstract:In continual learning, the primary challenge is to learn new information without forgetting old knowledge. A common solution addresses this trade-off through regularization, penalizing changes to parameters critical for previous tasks. In most cases, this regularization term is directly added to the training loss and optimized with standard gradient descent, which blends learning and retention signals into a single update and does not explicitly separate essential parameters from redundant ones. As task sequences grow, this coupling can over-constrain the model, limiting forward transfer and leading to inefficient use of capacity. We propose a different approach that separates task learning from stability enforcement via operator splitting. The learning step focuses on minimizing the current task loss, while a proximal stability step applies a sparse regularizer to prune unnecessary parameters and preserve task-relevant ones. This turns the stability-plasticity trade-off into a negotiated update between two complementary operators, rather than a conflicting gradient. We provide theoretical justification for the splitting method on the continual-learning objective, and demonstrate that our proposed solver achieves state-of-the-art results on standard benchmarks, improving both stability and adaptability without the need for replay buffers, Bayesian sampling, or meta-learning components.
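摘要中"学习步 + 近端稳定步"的算子分裂更新,其核心可以写成:先对当前任务损失做一步梯度下降,再对参数施加 L1 近端算子(软阈值)以剪除冗余参数。下面是一个 NumPy 最小示意(玩具损失、学习率与正则强度均为示例,并非论文实现):

```python
import numpy as np

def soft_threshold(w, t):
    """L1 正则的近端算子:prox_{t||.||_1}(w) = sign(w) * max(|w| - t, 0)。"""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def split_step(w, grad_fn, lr=0.1, lam=0.05):
    """算子分裂:学习步(梯度下降)之后接近端稳定步(稀疏化)。"""
    w = w - lr * grad_fn(w)             # 学习步:只关注当前任务损失
    return soft_threshold(w, lr * lam)  # 稳定步:剪除冗余参数

# 示例任务:最小化 ||w - w*||^2,其中 w* 含一个为零的"冗余"分量
w_star = np.array([1.0, 0.0, -1.0])
grad = lambda w: 2.0 * (w - w_star)
w = np.array([0.5, 0.4, 0.5])
for _ in range(100):
    w = split_step(w, grad)
print(np.round(w, 2))  # 冗余分量被精确压到 0,其余分量保留(略有收缩)
```

与把正则项直接加进损失再做整体梯度下降不同,这里两种信号分别由两个算子处理,冗余分量会被软阈值精确置零,对应摘要所说"学习与保留的协商式更新"。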

[CV-92] ConvVitMamba: Efficient Multiscale Convolution Transformer and Mamba-Based Sequence modelling for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)分类中因高维光谱特性、冗余信息及标注数据有限所导致的挑战,同时克服传统卷积神经网络(CNN)和视觉Transformer(Vision Transformer, ViT)在计算成本高、模型规模大方面的局限性。其解决方案的关键在于提出一个统一的混合框架ConvVitMamba,该框架融合了三个核心组件:多尺度卷积特征提取器以捕获局部光谱-空间联合模式;基于Token化的ViT模块用于建模全局上下文关系;以及受Mamba启发的轻量级门控序列混合模块,实现无需二次复杂度自注意力机制的内容感知优化。通过主成分分析(PCA)预处理降低冗余,并结合系统性实验验证了该架构在准确率、模型大小与推理效率之间取得良好平衡,显著优于现有CNN、Transformer及Mamba基线方法。

链接: https://arxiv.org/abs/2604.18856
作者: Mohammed Q. Alkhatib
机构: University of Dubai (迪拜大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Pre-print Accepted for Publication in International Journal of Remote Sensing

点击查看摘要

Abstract:Hyperspectral image (HSI) classification remains challenging due to high spectral dimensionality, redundancy, and limited labeled data. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) achieve strong performance by exploiting spectral-spatial information and long-range dependencies, they often incur high computational cost and large model size, limiting practical use. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient HSI classification. The architecture integrates three components: a multiscale convolutional feature extractor to capture local spectral, spatial, and joint patterns; a Vision Transformer based tokenization and encoding stage to model global contextual relationships; and a lightweight Mamba inspired gated sequence mixing module for efficient content-aware refinement without quadratic self-attention. Principal Component Analysis (PCA) is used as preprocessing to reduce redundancy and improve efficiency. Experiments on four benchmark datasets, including Houston and three UAV borne QUH datasets (Pingan, Qingyun, and Tangdaowan), demonstrate that ConvVitMamba consistently outperforms CNN, Transformer, and Mamba based methods while maintaining a favorable balance between accuracy, model size, and inference efficiency. Ablation studies confirm the complementary contributions of all components. The results indicate that the proposed framework provides an effective and efficient solution for HSI classification in diverse scenarios. The source code is publicly available at this https URL
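摘要中作为预处理的 PCA 降维,可以对高光谱立方体按像素展开后沿光谱维做 SVD 实现。下面是一个 NumPy 最小示意(随机数据,立方体尺寸与保留成分数均为示例):

```python
import numpy as np

def pca_reduce(cube, n_components):
    """对 H x W x B 的高光谱立方体沿光谱维做 PCA,返回 H x W x n_components。"""
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)
    x = x - x.mean(axis=0)                       # 按波段去均值
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    z = x @ vt[:n_components].T                  # 投影到前 k 个主成分
    return z.reshape(h, w, n_components)

rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 30))  # 模拟 30 个波段的小立方体
reduced = pca_reduce(cube, n_components=5)
print(reduced.shape)
```

降维后的立方体再送入卷积/Transformer/Mamba 混合主干,可显著降低光谱冗余带来的计算开销。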

[CV-93] DDF2Pol: A Dual-Domain Feature Fusion Network for PolSAR Image Classification

【速读】:该论文旨在解决极化合成孔径雷达(PolSAR)图像分类中如何高效融合空间与极化信息以提升分类精度的问题。解决方案的关键在于提出一种轻量级双域卷积神经网络DDF2Pol,其核心创新是并行设计实值与复值特征提取流,分别捕获PolSAR数据中的互补空间和极化特征;进一步通过深度可分离卷积进行空间增强,并引入坐标注意力机制聚焦关键区域,从而在仅91,371个参数下实现高精度分类(Flevoland数据集OA达98.16%,San Francisco数据集OA达96.12%),显著优于多个现有实值与复值模型。

链接: https://arxiv.org/abs/2604.18853
作者: Mohammed Q. Alkhatib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Pre-print Accepted for Publication in Pattern Recognition Letters

点击查看摘要

Abstract:This paper presents DDF2Pol, a lightweight dual-domain convolutional neural network for PolSAR image classification. The proposed architecture integrates two parallel feature extraction streams, one real-valued and one complex-valued, designed to capture complementary spatial and polarimetric information from PolSAR data. To further refine the extracted features, a depth-wise convolution layer is employed for spatial enhancement, followed by a coordinate attention mechanism to focus on the most informative regions. Experimental evaluations conducted on two benchmark datasets, Flevoland and San Francisco, demonstrate that DDF2Pol achieves superior classification performance while maintaining low model complexity. Specifically, it attains an Overall Accuracy (OA) of 98.16% on the Flevoland dataset and 96.12% on the San Francisco dataset, outperforming several state-of-the-art real- and complex-valued models. With only 91,371 parameters, DDF2Pol offers a practical and efficient solution for accurate PolSAR image analysis, even when training data is limited. The source code is publicly available at this https URL

[CV-94] Multi-Domain Learning with Global Expert Mapping

【速读】:该论文旨在解决多数据集学习中因数据分布和标签语义不一致导致的视觉模型泛化能力不足问题,尤其针对现有混合专家(Mixture-of-Experts, MoE)模型因负载均衡机制强制均匀分配输入而导致专家无法有效专业化、进而影响稀有或分布外领域性能的问题。解决方案的关键在于提出GEM(Global Expert Mapping)框架——一个基于线性规划松弛的规划器与分层取整编译器相结合的系统:规划器计算数据集到专家的软分配方案,编译器将其转化为确定性的、容量感知的硬映射,从而消除负载均衡损失,缓解公平性与专业化之间的冲突,并实现可解释的路由策略。

链接: https://arxiv.org/abs/2604.18842
作者: Pourya Shamsolmoali,Masoumeh Zareapoor,Huiyu Zhou,Oscar Mendez,Dacheng Tao,Xuelong Li
机构: University of York (约克大学); University of Leicester (莱斯特大学); University of Surrey (萨里大学); Nanyang Technological University (南洋理工大学); China Telecom (中国电信)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE TPAMI on August 2025

点击查看摘要

Abstract:Human perception generalizes well across different domains, but most vision models struggle beyond their training data. This gap motivates multi-dataset learning, where a single model is trained on diverse datasets to improve robustness under domain shifts. However, unified training remains challenging due to inconsistencies in data distributions and label semantics. Mixture-of-Experts (MoE) models provide a scalable solution by routing inputs to specialized subnetworks (experts). Yet, existing MoEs often fail to specialize effectively, as their load-balancing mechanisms enforce uniform input distribution across experts. This fairness conflicts with domain-aware routing, causing experts to learn redundant representations, and reducing performance especially on rare or out-of-distribution domains. We propose GEM (Global Expert Mapping), a planner-compiler framework that replaces the learned router with a global scheduler. Our planner, based on linear programming relaxation, computes a fractional assignment of datasets to experts, while the compiler applies hierarchical rounding to convert this soft plan into a deterministic, capacity-aware mapping. Unlike prior MoEs, GEM avoids balancing loss, resolves the conflict between fairness and specialization, and produces interpretable routing. Experiments show that GEM-DINO achieves state-of-the-art performance on the UODB benchmark, with notable gains on underrepresented datasets and solves task interference in few-shot adaptation scenarios.
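GEM 的"规划器给出数据集到专家的分数分配,编译器将其取整为满足容量约束的硬映射"这一流程,可以用一个贪心取整草图来示意(论文的真实方法是 LP 松弛 + 分层取整,此处的贪心规则、数据集与容量数值均为示例假设):

```python
def round_assignment(frac, capacity):
    """把分数分配矩阵 frac[d][e] 贪心取整为满足专家容量的硬映射(示意)。"""
    load = {e: 0 for e in capacity}
    mapping = {}
    # 按最大分数从高到低依次确定每个数据集的专家
    order = sorted(frac, key=lambda d: -max(frac[d].values()))
    for d in order:
        for e in sorted(frac[d], key=frac[d].get, reverse=True):
            if load[e] < capacity[e]:   # 容量感知:该专家满员则退而求其次
                mapping[d] = e
                load[e] += 1
                break
    return mapping

frac = {"coco":  {"e0": 0.9, "e1": 0.1},
        "kitti": {"e0": 0.8, "e1": 0.2},
        "dota":  {"e0": 0.2, "e1": 0.8}}
m = round_assignment(frac, capacity={"e0": 1, "e1": 2})
print(m)
```

这种确定性的全局调度取代了可学习路由器,也就不再需要负载均衡损失:容量约束直接写进取整步骤,映射结果天然可解释。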

[CV-95] Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

【速读】:该论文旨在解决室内激光雷达(LiDAR)扫描帧级语义分割中缺乏标注数据的问题,这一问题严重制约了深度学习模型的训练与性能提升。解决方案的关键在于利用视觉基础模型(Visual Foundation Models, VFMs)对齐的二维图像进行伪标签生成,并通过2D到3D的知识蒸馏(distillation)管道将语义信息迁移至LiDAR帧分割模型中,从而在无需人工标注的情况下实现高效的室内场景语义理解。实验表明,该方法在伪标签评估下可达到56% mIoU,在真实标签验证下约36% mIoU,证明了跨模态知识蒸馏在室内LiDAR语义分割中的可行性。

链接: https://arxiv.org/abs/2604.18831
作者: Haiyang Wu,Juan J. Gonzales Torres,George Vosselman,Ville Lehtola
机构: University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Frame-wise semantic segmentation of indoor lidar scans is a fundamental step toward higher-level 3D scene understanding and mapping applications. However, acquiring frame-wise ground truth for training deep learning models is costly and time-consuming. This challenge is largely addressed, for imagery, by Visual Foundation Models (VFMs) which segment image frames. The same VFMs may be used to train a lidar scan frame segmentation model via a 2D-to-3D distillation pipeline. The success of such distillation has been shown for autonomous driving scenes, but not yet for indoor scenes. Here, we study the feasibility of repeating this success for indoor scenes, in a frame-wise distillation manner by coupling each lidar scan with a VFM-processed camera image. The evaluation is done using indoor SLAM datasets, where pseudo-labels are used for downstream evaluation. Also, a small manually annotated lidar dataset is provided for validation, as there are no other lidar frame-wise indoor datasets with semantics. Results show that the distilled model achieves up to 56% mIoU under pseudo-label evaluation and around 36% mIoU with real-label, demonstrating the feasibility of cross-modal distillation for indoor lidar semantic segmentation without manual annotations.
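摘要中"将每帧激光扫描与 VFM 分割后的相机图像配对"的 2D→3D 蒸馏,其伪标签投影一步可以用针孔相机模型写成如下 NumPy 示意(内参、分割图与点云均为构造的玩具示例):

```python
import numpy as np

def project_labels(points_cam, K, seg_map, ignore=-1):
    """把相机坐标系下的 3D 点投影到图像,取对应像素的分割标签作伪标签。"""
    h, w = seg_map.shape
    z = points_cam[:, 2]
    uv = points_cam @ K.T                 # 针孔投影(齐次坐标)
    uv = uv[:, :2] / uv[:, 2:3]           # 归一化到像素坐标
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.full(len(points_cam), ignore)
    labels[valid] = seg_map[v[valid], u[valid]]
    return labels

K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
seg = np.zeros((64, 64), dtype=int); seg[:, 32:] = 1   # 左半标签 0,右半标签 1
pts = np.array([[0.1, 0.0, 1.0],    # 投到右半 -> 标签 1
                [-0.1, 0.0, 1.0],   # 投到左半 -> 标签 0
                [0.0, 0.0, -1.0]])  # 在相机后方 -> 忽略
labels = project_labels(pts, K, seg)
print(labels)
```

如此得到的逐点伪标签即可直接监督 LiDAR 帧分割网络训练,无需任何人工三维标注。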

[CV-96] DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning CVPR

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在常见视觉退化条件下(如雾霾、模糊或低光照)性能脆弱的问题,同时探索红外(Infrared, IR)成像与RGB图像融合以提升模型鲁棒性的方法。其解决方案的关键在于提出DUALVISION——一个轻量级的融合模块,通过patch-level局部化交叉注意力机制高效地将IR与RGB信息注入MLLMs中,从而增强模型在复杂环境下的视觉感知与跨模态推理能力。

链接: https://arxiv.org/abs/2604.18829
作者: Abrar Majeedi,Zhiyuan Ruan,Ziyi Zhao,Hongcheng Wang,Jianglin Lu,Yin Li
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Amazon (亚马逊); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR Findings 2026

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DUALVISION delivers strong empirical performance under a wide range of visual degradations. Our code and dataset are available at this https URL.
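DUALVISION 的 patch 级交叉注意力融合,可以用单头交叉注意力的 NumPy 草图示意:RGB patch token 作为 query,对齐的 IR patch token 作为 key/value,再以残差方式注入。token 数、维度与投影矩阵均为示例假设,并非论文实现:

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    """单头交叉注意力:RGB token 查询 IR token,输出融合后的特征。"""
    q, k, v = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)     # 按行 softmax
    return attn @ v

rng = np.random.default_rng(0)
d = 16
rgb = rng.normal(size=(49, d))   # 7x7 个 RGB patch token(示例)
ir = rng.normal(size=(49, d))    # 空间对齐的 IR patch token
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = rgb + cross_attention(rgb, ir, Wq, Wk, Wv)  # 残差式注入 IR 信息
print(fused.shape)
```

论文强调的"局部化"可理解为将注意力限制在空间邻近的 patch 对之间;上面的全局版本是其最直接的简化形式。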

[CV-97] Rethinking Dataset Distillation: Hard Truths about Soft Labels CVPR2026

【速读】:该论文旨在解决大规模数据蒸馏(Dataset Distillation, DD)方法在软标签(Soft Label, SL)和硬标签(Hard Label, HL)设置下性能表现不一致的问题,特别是为何当前主流DD方法在SL+KD(软标签+知识蒸馏)场景中难以超越随机采样基线,而在HL场景下则可能通过高质量子集获得显著优势。其关键解决方案在于提出一种计算感知的剪枝指标CAD-Prune,用于识别在给定计算预算下具有最优难度的数据样本,并基于此构建了计算对齐的数据蒸馏方法CA2D(Compute-Aligned Dataset Distillation),该方法在ImageNet-1K上多个每类图像数(Images Per Class, IPC)设置下均优于现有DD方法,从而为高效数据学习提供了新的工具与理论依据。

链接: https://arxiv.org/abs/2604.18811
作者: Priyam Dey,Aditya Sahdev,Sunny Bhati,Konda Reddy Mopuri,R. Venkatesh Babu
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 (Oral). First two authors contributed equally

点击查看摘要

Abstract:Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed as SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.

[CV-98] Geometric Decoupling: Diagnosing the Structural Instability of Latent

【速读】:该论文旨在解决潜在扩散模型(Latent Diffusion Models, LDMs)在图像编辑过程中因潜在空间脆弱性(latent space brittleness)导致的语义不连续跳变问题。其解决方案的关键在于引入黎曼几何框架,通过分析生成雅可比矩阵(generative Jacobian),将潜在空间的几何结构分解为局部缩放(Local Scaling,表征容量)和局部复杂度(Local Complexity,表征曲率)。研究发现了一种“几何解耦”现象:在正常生成中,曲率功能上编码图像细节;而在分布外(OOD)生成时,极端曲率被浪费在不稳定的语义边界而非可感知的细节上,从而识别出“几何热点”(Geometric Hotspots)作为不稳定性的结构性根源,并提供了一个用于诊断生成可靠性的内在鲁棒度量。

链接: https://arxiv.org/abs/2604.18804
作者: Yuanbang Liang,Zhengwen Chen,Yu-Kun Lai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Latent Diffusion Models (LDMs) achieve high-fidelity synthesis but suffer from latent space brittleness, causing discontinuous semantic jumps during editing. We introduce a Riemannian framework to diagnose this instability by analyzing the generative Jacobian, decomposing geometry into "Local Scaling" (capacity) and "Local Complexity" (curvature). Our study uncovers a "Geometric Decoupling": while curvature in normal generation functionally encodes image detail, OOD generation exhibits a functional decoupling where extreme curvature is wasted on unstable semantic boundaries rather than perceptible details. This geometric misallocation identifies "Geometric Hotspots" as the structural root of instability, providing a robust intrinsic metric for diagnosing generative reliability.
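摘要中的"局部缩放"(Local Scaling)可以由生成映射的雅可比矩阵计算:LS = sqrt(det(JᵀJ))。下面用有限差分对一个玩具线性"生成器"做示意(生成器形式与差分步长均为示例假设,真实模型中的 J 为解码器对潜变量的雅可比):

```python
import numpy as np

def local_scaling(g, z, eps=1e-5):
    """有限差分估计雅可比 J = dg/dz,局部缩放 = sqrt(det(J^T J))。"""
    z = np.asarray(z, dtype=np.float64)
    g0 = np.asarray(g(z))
    cols = []
    for i in range(len(z)):
        dz = np.zeros_like(z); dz[i] = eps
        cols.append((np.asarray(g(z + dz)) - g0) / eps)
    J = np.stack(cols, axis=1)              # 形状:输出维 x 潜变量维
    return float(np.sqrt(np.linalg.det(J.T @ J)))

# 玩具"生成器":把 2D 潜变量嵌入 3D,两个方向分别缩放 2 和 3
g = lambda z: np.array([2.0 * z[0], 3.0 * z[1], 0.0])
ls = local_scaling(g, [0.3, -0.7])
print(round(ls, 3))  # ≈ 6.0(= 2 * 3,局部体积放大率)
```

局部缩放刻画潜空间小邻域被映射到图像空间后的体积放大率;摘要中的"局部复杂度"则进一步依赖 J 随 z 的变化(曲率),需要二阶信息。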

[CV-99] LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对逐步增强的提示强制力时,其幻觉行为(hallucination)的响应机制尚不清晰的问题。现有幻觉评测基准多依赖中性提示和二元检测方式,未能刻画幻觉发生频率与强度如何随提示语气梯度变化,并且缺乏对不同任务类型(如文本不可读、时间识别、物体缺失)的区分性分析。解决方案的关键在于构建一个名为Ghost-100的程序化生成基准,包含800张严格符合负向真实条件(negative-ground-truth)的合成图像,覆盖三类任务家族(text-illegibility, time-reading, object-absence),每张图像搭配五种强度递增的提示(基于5-Level Prompt Intensity Framework),从而将提示语气作为唯一独立变量;同时采用双轨评估协议:H-Rate衡量模型从拒绝回答到错误肯定的转变比例,H-Score则由GPT-4o-mini人工判断幻觉的置信度与具体程度,辅以三阶段自动化验证流程确保数据合规性。此设计揭示了不同模型家族、任务子集对提示压力呈现非单调敏感性等复杂模式,突破传统聚合指标的局限。

链接: https://arxiv.org/abs/2604.18803
作者: Zhiyuan Jiang,Weihao Hong,Xinlei Guan,Tejaswi Dhandu,Miles Q. Li,Meng Xu,Kuan Huang,Umamaheswara Rao Tida,Bingyu Shen,Daehan Kwak,Boyang Li
机构: Kean University (肯恩大学); North Dakota State University (北达科他州立大学); McGill University (麦吉尔大学); University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 12 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families – text-illegibility, time-reading, and object-absence – each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels – patterns that aggregate metrics obscure.

[CV-100] CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization

【速读】:该论文旨在解决胰腺MRI图像中跨序列域偏移(cross-sequence domain shift)导致深度学习模型性能急剧下降的问题,这是影响生成式AI在临床部署中可靠性的关键障碍。其核心解决方案在于通过构建多中心、多序列的基准数据集CrossPan(包含1,386例3D扫描),系统评估不同方法在跨序列场景下的泛化能力;关键发现表明,相较于跨中心差异,跨序列的物理对比度变化对模型性能的影响更为严重,且现有领域自适应方法效果有限,而基于形状先验的预训练基础模型(如MedSAM2)在零样本迁移中表现更稳定,提示未来研究应聚焦于增强模型对成像物理机制变化的鲁棒性,而非单纯优化网络结构或扩大中心多样性。

链接: https://arxiv.org/abs/2604.18797
作者: Linkai Peng,Cuiling Sun,Zheyuan Zhang,Wanying Dou,Halil Ertugrul Aktas,Andrea M Bejar,Elif Keles,Tamas Gonda,Michael B Wallace,Zongwei Zhou,Gorkem Durak,Rajesh N Keswani,Ulas Bagci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MIDL 2026

点击查看摘要

Abstract:Automatic pancreas segmentation is fundamental to abdominal MRI analysis, yet deep learning models trained on one MRI sequence often fail catastrophically when applied to another-a challenge that has received little systematic investigation. We introduce CrossPan, a multi-institutional benchmark comprising 1,386 3D scans across three routinely acquired sequences (T1-weighted, T2-weighted, and Out-of-Phase) from eight centers. Our experiments reveal three key findings. First, cross-sequence domain shifts are far more severe than cross-center variability: models achieving Dice scores above 0.85 in-domain collapse to near-zero (0.02) when transferred across sequences. Second, state-of-the-art domain generalization methods provide negligible benefit under these physics-driven contrast inversions, whereas foundation models like MedSAM2 maintain moderate zero-shot performance through contrast-invariant shape priors. Third, semi-supervised learning offers gains only under stable intensity distributions and becomes unstable on sequences with high intra-organ variability. These results establish cross-sequence generalization-not model architecture or center diversity-as the primary barrier to clinically deployable pancreas MRI segmentation. Dataset and code are available at this https URL.

[CV-101] EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion

【速读】:该论文旨在解决基于稀疏激光雷达(LiDAR)测量与对应RGB图像的深度补全问题,以实现机器人系统中高精度的3D感知。现有方法虽在标准基准上表现优异,但依赖复杂的主干网络架构,难以在嵌入式硬件上实现实时部署。其解决方案的关键在于提出EfficientPENet:采用现代化的ConvNeXt作为主干结构替代传统ResNet,引入对稀疏性不变的卷积操作优化深度分支,并通过卷积空间传播网络(CSPN)精炼预测结果;同时,利用Late Fusion融合双流特征并结合多尺度深度监督策略提升性能;此外,设计了一种位置感知的测试时增强方案,在水平翻转过程中修正坐标张量,显著降低推理误差。该方法在KITTI基准上实现了36.24M参数、20.51ms延迟(48.76 FPS),相较BP-Net参数减少3.7倍、速度提升23倍,且保持竞争力的精度,为资源受限边缘平台提供了实用的实时深度补全方案。

链接: https://arxiv.org/abs/2604.18790
作者: Johny J. Lopez,Md Meftahul Ferdaus,Mahdi Abdelguerfi,Anton Netchaev,Steven Sloan,Ken Pathak,Kendall N. Niles
机构: Canizaro Livingston Gulf States Center for Environmental Informatics, the University of New Orleans (卡尼扎罗洛佩斯海湾州环境信息中心,新奥尔良大学); US Army Corps of Engineers, Engineer Research and Development Center (美国陆军工程兵团,工程师研究与发展中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Depth completion from sparse LiDAR measurements and corresponding RGB images is a prerequisite for accurate 3D perception in robotic systems. Existing methods achieve high accuracy on standard benchmarks but rely on heavy backbone architectures that preclude real-time deployment on embedded hardware. We present EfficientPENet, a two-branch depth completion network that replaces the conventional ResNet encoder with a modernized ConvNeXt backbone, introduces sparsity-invariant convolutions for the depth stream, and refines predictions through a Convolutional Spatial Propagation Network (CSPN). The RGB branch leverages ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7x7 depthwise convolutions, and stochastic depth regularization. Features from both branches are merged via late fusion and decoded through a multi-scale deep supervision strategy. We further introduce a position-aware test-time augmentation scheme that corrects coordinate tensors during horizontal flipping, yielding consistent error reduction at inference. On the KITTI depth completion benchmark, EfficientPENet achieves an RMSE of 631.94 mm with 36.24M parameters and a latency of 20.51 ms, operating at 48.76 FPS. This represents a 3.7 times reduction in parameters and a 23 times speedup relative to BP-Net, while maintaining competitive accuracy. These results establish EfficientPENet as a practical solution for real-time depth completion on resource-constrained edge platforms such as the NVIDIA Jetson.
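摘要提到的"稀疏不变卷积"(sparsity-invariant convolution,思想源自 Uhrig 等人的 Sparsity Invariant CNNs)核心在于只对有效像素归一化,使输出不受稀疏模式偏置。以下为单通道、权重恒为 1 的纯 Python 示意(非论文代码):

```python
def sparsity_invariant_conv(depth, mask, k=3):
    """Average each k x k window over *valid* pixels only, then
    propagate a validity mask; invalid-only windows stay invalid."""
    h, w = len(depth), len(depth[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    out_mask = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s, n = 0.0, 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w and mask[ii][jj]:
                        s += depth[ii][jj]
                        n += 1
            if n:
                out[i][j] = s / n       # normalized by valid count, not k*k
                out_mask[i][j] = 1
    return out, out_mask

# A 3x3 grid with a single valid LiDAR return at the center.
d = [[0, 0, 0], [0, 5.0, 0], [0, 0, 0]]
m = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
out, om = sparsity_invariant_conv(d, m)
print(out[1][1], om[0][0])  # 5.0 1 -- the valid depth propagates outward
```

实际网络中该归一化以可学习权重卷积实现,这里的均值仅为说明归一化思想。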

[CV-102] CAHAL: Clinically Applicable resolution enHAncement for Low-resolution MRI scans

【速读】:该论文旨在解决临床常规脑部磁共振成像(MRI)中因厚层、各向异性采集导致的大规模自动化形态计量分析受限的问题。现有生成式超分辨率(Generative Super-Resolution, GSR)方法虽能生成视觉逼真的各向同性体积,但常引入解剖幻觉、系统性容积高估和结构扭曲,损害下游定量分析的准确性与诊断安全性。其解决方案的关键在于提出CAHAL(Clinically Applicable resolution enHAncement for Low-resolution MRI scans),一个鲁棒性强、物理信息驱动的分辨率增强框架,直接在患者原始采集空间中运行;核心创新包括:基于体积分辨率与采集各向异性两个独立描述符的确定性双变量专家混合(Mixture of Experts, MoE)架构,以及融合边缘惩罚的空间重建、傅里叶域谱一致性匹配与分割引导语义一致性约束的复合损失函数,同时通过真实世界数据库采样物理退化生成训练对,确保模型泛化能力。

链接: https://arxiv.org/abs/2604.18781
作者: Sergio Morell-Ortega,Ángela González-Cebrián,Boris Mansencal,Marien Gadea,Roberto Vivo-Hernando,Gregorio Rubio,Fernando Aparici,Maria de la Iglesia-Vaya,Gwenaelle Catheline,Pierrick Coupé,José V. Manjón
机构: Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València (瓦伦西亚理工大学); CNRS, Univ. Bordeaux, Bordeaux INP, LABRI, UMR5800, in2brain (法国国家科学研究中心、波尔多大学、波尔多综合理工学院、实验室生物与信息研究联合体); Department of Psychobiology, Faculty of Psychology, Universitat de Valencia (瓦伦西亚大学心理学系); Instituto de Automática e Informática Industrial, Universitat Politècnica de València (瓦伦西亚理工大学自动化与工业信息研究所); Departamento de Matemática Aplicada, Universitat Politècnica de València (瓦伦西亚理工大学应用数学系); Área de Imagen Médica. Hospital Universitario y Politécnico La Fe (La Fe大学医院和医学院医学影像科); Unidad Mixta de Imagen Biomédica FISABIO-CIPF. Fundación para el Fomento de la Investigación Sanitario y Biomédica de la Comunidad Valenciana (瓦伦西亚社区卫生与生物医学研究基金会); Univ. Bordeaux, CNRS, UMR 5287, Institut de Neurosciences Cognitives et Intégratives d’Aquitaine (波尔多大学、法国国家科学研究中心、认知与整合神经科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale automated morphometric analysis of brain MRI is limited by the thick-slice, anisotropic acquisitions prevalent in routine clinical practice. Existing generative super-resolution (SR) methods produce visually compelling isotropic volumes but often introduce anatomical hallucinations, systematic volumetric overestimation, and structural distortions that compromise downstream quantitative analysis and diagnostic safety. To address this, we propose CAHAL (Clinically Applicable resolution enHAncement for Low-resolution MRI scans), a hallucination-robust, physics-informed resolution enhancement framework that operates directly in the patient’s native acquisition space. CAHAL employs a deterministic bivariate Mixture of Experts (MoE) architecture routing each input through specialised residual 3D U-Net experts conditioned on both volumetric resolution and acquisition anisotropy, two independent descriptors of clinical MRI acquisition. Experts are optimised with a composite loss combining edge-penalised spatial reconstruction, Fourier-domain spectral coherence matching, and a segmentation-guided semantic consistency constraint. Training pairs are generated on-the-fly via physics-based degradation sampled from a large-scale real-world database, ensuring robust generalisation. Validated on T1-weighted and FLAIR sequences against generative baselines, CAHAL achieves state-of-the-art results, improving the best related methods in terms of accuracy and efficiency.
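CAHAL 的确定性双变量 MoE 依据"体素分辨率"与"采集各向异性"两个描述符路由到不同专家。以下为一个假设性的分箱路由示意(阈值、专家数均为虚构,仅说明"确定性双变量路由"的含义):

```python
def route_expert(resolution_mm, anisotropy,
                 res_bins=(1.0, 3.0), aniso_bins=(2.0,)):
    """Deterministic bivariate routing: bin the two acquisition
    descriptors, then map each (resolution, anisotropy) cell to one
    expert id. Bin edges and expert count here are made up."""
    r = sum(resolution_mm >= b for b in res_bins)   # 0..len(res_bins)
    a = sum(anisotropy >= b for b in aniso_bins)    # 0..len(aniso_bins)
    return r * (len(aniso_bins) + 1) + a

# A near-isotropic 0.5mm scan and a 5mm thick-slice scan hit different experts.
print(route_expert(0.5, 1.0), route_expert(5.0, 4.0))  # 0 5
```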

[CV-103] REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction

【速读】:该论文旨在解决当前阿尔茨海默病(Alzheimer’s disease, AD)早期风险预测中, retinal imaging(视网膜成像)与临床风险因素分别建模导致的多模态关联学习不足问题,以及缺乏对具有相似视网膜形态特征和临床风险特征患者进行结构化对齐的机制。其解决方案的关键在于提出REVEAL(REtinal-risk Vision-Language Early Alzheimer’s Learning)框架:首先将问卷形式的现实世界风险因素转化为可被预训练视觉-语言模型(vision-language models, VLMs)理解的临床可解释叙述;其次引入群体感知对比学习(group-aware contrastive learning, GACL)策略,通过聚类具有相似视网膜形态学特征和风险因素的患者作为正样本对,强化跨模态对齐,从而构建统一的表示学习体系。该方法显著优于仅使用视网膜图像或通用VLM的模型,在平均提前8年(范围1–11年)预测AD和痴呆的发生上展现出优越性能。

链接: https://arxiv.org/abs/2604.18757
作者: Seowung Leem,Lin Gu,Chenyu You,Kuang Gong,Ruogu Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication a MIDL 2026

点击查看摘要

Abstract:The retina provides a unique, noninvasive window into Alzheimer’s disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer’s Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.
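群体感知对比学习(GACL)将"视网膜形态 + 风险因素"聚类中同组的患者视为正样本对。下面用极简函数示意正样本对的构造方式(聚类标签为假设输入,非论文实现):

```python
def gacl_pairs(labels):
    """Positive index pairs for group-aware contrastive learning:
    patients sharing a cluster label are treated as positives;
    everything else serves as negatives in the contrastive loss."""
    return [(i, j)
            for i in range(len(labels))
            for j in range(i + 1, len(labels))
            if labels[i] == labels[j]]

# Four patients in two clusters -> two positive pairs.
print(gacl_pairs([0, 1, 0, 1]))  # [(0, 2), (1, 3)]
```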

[CV-104] URoPE: Universal Relative Position Embedding across Geometric Spaces

【速读】:该论文旨在解决现有相对位置编码(Relative Position Embedding)机制在处理跨视角或跨维度几何空间时的局限性,例如在计算机视觉任务中涉及多相机视角间或2D与3D空间之间的几何推理问题。传统方法通常局限于固定几何空间(如1D序列或规则的2D/3D网格),难以适应复杂场景下的位置关系建模。解决方案的关键在于提出URoPE(Universal Rotary Position Embedding),它将旋转位置编码(Rotary Position Embedding, RoPE)扩展至任意几何空间,通过沿相机光路采样3D点并投影到查询图像平面,从而利用标准2D RoPE对投影像素坐标进行编码。URoPE无需额外参数、具备内参感知能力,并且对全局坐标系选择不变,同时兼容现有的RoPE优化注意力核函数,实现了在多种任务(如新视角合成、3D目标检测、目标跟踪和深度估计)中的通用性和性能提升。

链接: https://arxiv.org/abs/2604.18747
作者: Yichen Xie,Depu Meng,Chensheng Peng,Yihan Hu,Quentin Herau,Masayoshi Tomizuka,Wei Zhan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: this https URL.
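URoPE 的核心几何步骤——沿 key 相机光线按深度锚点采样 3D 点、再投影回 query 图像平面——可用针孔模型简要示意(内参、位姿与锚点取值均为假设,非论文实现):

```python
def unproject(K, px, py, depth):
    """Back-project pixel (px, py) at a given depth; K = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = K
    return ((px - cx) / fx * depth, (py - cy) / fy * depth, depth)

def project(K, p):
    fx, fy, cx, cy = K
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)

def transform(R, t, p):
    """Rigid transform p' = R p + t, with R given as 3 row-tuples."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r]
                 for r in range(3))

def urope_coords(K_key, K_query, R, t, px, py, depth_anchors):
    """For one key patch at (px, py), return its projected 2D coordinates
    in the query image, one per depth anchor; standard 2D RoPE is then
    applied at these projected pixel coordinates."""
    out = []
    for d in depth_anchors:
        p_query = transform(R, t, unproject(K_key, px, py, d))
        out.append(project(K_query, p_query))
    return out

K = (100.0, 100.0, 64.0, 64.0)            # fx, fy, cx, cy (made up)
I = ((1, 0, 0), (0, 1, 0), (0, 0, 1))     # identity relative pose
coords = urope_coords(K, K, I, (0.0, 0.0, 0.0), 80.0, 64.0, [1.0, 2.0, 4.0])
print(coords)  # identity pose: every anchor projects back to (80.0, 64.0)
```

同视角(恒等位姿)下所有锚点都落回原像素坐标,与"全局坐标系选择不变"的性质一致;跨视角时不同深度锚点会落到不同位置,编码出相对几何。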

[CV-105] DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation

【速读】:该论文旨在解决结构缺陷自动分割任务中的三大挑战:损伤类型多样性、极端类别不平衡以及精确边界 delineation(界定)的需求。其解决方案的关键在于提出一种新型的U型编码器-解码器架构DeltaSeg,该架构通过分层注意力机制实现多尺度特征优化:在编码器中引入Squeeze-and-Excitation(SE)通道注意力,在瓶颈层和解码器中采用Coordinate Attention以增强空间定位能力,并创新性地设计了Deep Delta Attention(DDA)模块用于跳接连接的精细化调整——该模块通过双路径结构结合学习得到的delta算子抑制干扰特征并融合基于解码器信号的空间注意力门控机制。此外,利用Atrous Spatial Pyramid Pooling(ASPP)捕获多尺度上下文信息,并通过多尺度辅助头实现深度监督,从而提升梯度流动性和中间层语义特征质量,最终在S2DS和CSDD两个数据集上显著优于12种主流分割模型,展现出优异的泛化性能。

链接: https://arxiv.org/abs/2604.18745
作者: Enrique Hernandez Noguera,Md Meftahul Ferdaus,Elias Ioup,Mahdi Abdelguerfi
机构: University of New Orleans (新奥尔良大学); Center for Geospatial Sciences, Naval Research Laboratory (海军研究实验室地理空间科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated segmentation of structural defects from visual inspection imagery remains challenging due to the diversity of damage types, extreme class imbalance, and the need for precise boundary delineation. This paper presents DeltaSeg, a U-shaped encoder-decoder architecture with a tiered attention strategy that integrates Squeeze-and-Excitation (SE) channel attention in the encoder, Coordinate Attention at the bottleneck and decoder, and a novel Deep Delta Attention (DDA) mechanism in the skip connections. The encoder uses depthwise separable convolutions with dilated stages to maintain spatial resolution while expanding the receptive field. Atrous Spatial Pyramid Pooling (ASPP) at the bottleneck captures multi-scale context. The DDA module refines skip connections through a dual-path scheme combining a learned delta operator for nuisance feature suppression with spatial attention gates conditioned on decoder signals. Deep supervision through multi-scale auxiliary heads further strengthens gradient flow and encourages semantically meaningful features at intermediate decoder stages. We evaluate DeltaSeg on two datasets: the S2DS dataset (7 classes) and the Culvert-Sewer Defect Dataset (CSDD, 9 classes). Across both benchmarks, DeltaSeg consistently outperforms 12 competing architectures including U-Net, SA-UNet, UNet3+, SegFormer, Swin-UNet, EGE-UNet, FPN, and Mobile-UNETR, demonstrating strong generalization across damage types, imaging conditions, and structural geometries.

[CV-106] Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras

【速读】:该论文旨在解决事件相机(Event Camera)在任意视角间进行宽基线(wide-baseline)匹配的难题,尤其是在跨数据集场景下缺乏有效监督信号和模型泛化能力不足的问题。传统学习方法受限于标注数据稀缺及对特定域的依赖,难以实现零样本迁移(zero-shot deployment)。解决方案的关键在于提出首个可在未见数据集上无需微调即可实现宽基线匹配的事件特征匹配模型:其核心创新包括一个运动鲁棒且计算高效的注意力主干网络(attention backbone),能够从事件流中学习多时间尺度特征,并结合稀疏感知事件令牌选择机制(sparsity-aware event token selection),从而支持大规模多样化的宽基线监督训练;同时,作者构建了一个鲁棒的事件运动合成框架(event motion synthesis framework),用于生成包含多视角、多模态和复杂运动的大规模事件匹配数据集,显著提升了模型的泛化性能。实验表明,该方法相较现有最优事件特征匹配方法提升达37.7%。

链接: https://arxiv.org/abs/2604.18744
作者: Ruijun Zhang,Hang Su,Kostas Daniilidis,Ziyun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event cameras have recently shown promising capabilities in instantaneous motion estimation due to their robustness to low light and fast motions. However, computing wide-baseline correspondence between two arbitrary views remains a significant challenge, since event appearance changes substantially with motion, and learning-based approaches are constrained by both scalability and limited wide-baseline supervision. We therefore introduce the first event matching model that achieves cross-dataset wide-baseline correspondence in a zero-shot manner: a single model trained once is deployed on unseen datasets without any target-domain fine-tuning or adaptation. To enable this capability, we introduce a motion-robust and computationally efficient attention backbone that learns multi-timescale features from event streams, augmented with sparsity-aware event token selection, making large-scale training on diverse wide-baseline supervision computationally feasible. To provide the supervision needed for wide-baseline generalization, we develop a robust event motion synthesis framework to generate large-scale event-matching datasets with augmented viewpoints, modalities, and motions. Extensive experiments across multiple benchmarks show that our framework achieves a 37.7% improvement over the previous best event feature matching methods. Code and data are available at: this https URL.

[CV-107] Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control

【速读】:该论文旨在解决急诊介入治疗中C-arm影像设备自动定位失效时依赖人工操作导致延迟的问题,其核心挑战在于提升C-arm控制的自主性与鲁棒性。解决方案的关键在于引入基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的智能代理框架,通过在合成与真实X射线数据集上对MLLMs进行微调,实现骨骼关键点的精准定位,并利用模型的推理能力纠正初始错误预测、分步引导C-arm移动至目标位置,从而构建具备反馈感知和空间推理能力的自主控制系统。

链接: https://arxiv.org/abs/2604.18740
作者: Jay Jung,Ahmad Arrabi,Jax Luo,Scott Raymond,Safwan Wshah
机构: University of Vermont (佛蒙特大学); China Computer Federation (中国计算机学会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IJCARS: IPCAI 2026. Int J CARS (2026)

点击查看摘要

Abstract:Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization. Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations of landmark localization were performed and compared against a leading DL approach. We further conducted qualitative experiments demonstrating: (1) how an MLLM can correct an initially incorrect prediction through reasoning, and (2) how the MLLM can sequentially navigate the C-arm toward a target location. Results: On both datasets, fine-tuned MLLMs demonstrate competitive performance across all localization tasks when compared with the DL approach. In the qualitative experiments, the MLLMs provide evidence of reasoning and spatial awareness. Conclusion: This study shows that fine-tuned MLLMs achieve accurate skeletal landmark localization and hold promise for agentic autonomous C-arm control. Our code is available at this https URL.

[CV-108] Colour Extraction Pipeline for Odonates using Computer Vision

【速读】:该论文旨在解决昆虫形态特征(如体色)与气候关系研究中因数据标注成本高、规模受限而导致的分析效率低下的问题。其解决方案的关键在于构建一个基于深度神经网络的自动识别与分割流程,利用开源公民科学平台图像作为输入,通过有限标注数据训练模型,并结合伪监督学习进行优化,从而实现对蜻蜓目(Odonates)个体头部、胸部、腹部和翅膀等关键部位的精准分割及色彩提取,为大规模生态统计分析提供可量化、高通量的形态学数据支持。

链接: https://arxiv.org/abs/2604.18725
作者: Megan Mirnalini Sundaram Rajaraman,Fons J. Verbeek,Vincent J. Kalkman,Rita Pucci
机构: Leiden Institute of Advanced Computer Science (LIACS); Leiden University; Naturalis Biodiversity Center
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages long (excluding references), 12 figures, to be submitted in NCCV 2026

点击查看摘要

Abstract:The correlation between insect morphological traits and climate has been documented in physiological studies, but such studies remain limited by the time-consuming nature of the data analysis. In particular, the open source datasets often lack annotations of species’ morphological traits, making dedicated annotation campaigns necessary; these efforts are typically local in scale and costly. In this paper, we propose a pipeline to identify and segment body parts of Odonates (dragonflies and damselflies) using deep neural networks, with the ultimate goal of extracting body parts’ colouration. The pipeline is trained on a limited annotated dataset and refined with pseudo supervised data. We show that, by using open source images from citizen science platforms, our approach can segment each visible subject (Odonates) into head, thorax, abdomen, and wings and then extract a colour palette for each body part. This will enable large-scale statistical analysis of ecological correlations (e.g., between colouration and climate change, habitat loss, or geolocation) which are crucial for quantifying and assessing ecosystem biodiversity status.
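流程的最后一步"为每个身体部位提取色彩调色板"通常可由聚类完成(论文未公开具体算法,这里以 k-means 为假设)。以下为纯 Python 玩具示意,中心从前 k 个像素初始化,输入数据为虚构:

```python
def extract_palette(pixels, k=2, iters=10):
    """Toy k-means palette extraction over the RGB pixels of one
    segmented body part (centers seeded from the first k pixels)."""
    centers = [list(p) for p in pixels[:k]]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in pixels:
            i = min(range(k),
                    key=lambda c: sum((p[j] - centers[c][j]) ** 2
                                      for j in range(3)))
            buckets[i].append(p)
        for c, b in enumerate(buckets):
            if b:  # keep the old center if a bucket ends up empty
                centers[c] = [sum(px[j] for px in b) / len(b)
                              for j in range(3)]
    return [tuple(round(v) for v in c) for c in centers]

# Interleaved red-ish and blue-ish pixels from a hypothetical wing mask.
pix = [(250 + i % 5, 0, 0) if i % 2 == 0 else (0, 0, 250 + i % 5)
       for i in range(20)]
print(extract_palette(pix))  # [(252, 0, 0), (0, 0, 252)]
```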

[CV-109] Align then Refine: Text-Guided 3D Prostate Lesion Segmentation

【速读】:该论文旨在解决从双参数磁共振成像(biparametric MRI, bp-MRI)中实现前列腺病灶的高精度三维分割问题,当前方法在多模态信息融合与解剖一致性保障方面存在不足,且视觉-语言模型(Vision-Language Models, VLMs)缺乏病灶级别的细粒度语义引导。解决方案的关键在于提出一种新型多编码器U-Net架构,包含三项核心创新:(1) 对齐损失(alignment loss)增强前景文本与图像相似性以注入病灶语义;(2) 热图损失(heatmap loss)校准相似性图并抑制背景误激活;(3) 末阶段置信度门控多头交叉注意力精修模块,在高置信区域执行局部边界修正。通过阶段调度训练策略稳定优化过程,显著提升了多模态融合能力与局部文本引导精度,在PI-CAI数据集上达到新的最先进性能。

链接: https://arxiv.org/abs/2604.18713
作者: Cuiling Sun,Linkai Peng,Adam Murphy,Elif Keles,Hiten D. Patel,Ashley Ross,Frank Miller,Baris Turkbey,Andrea Mia Bejar,Halil Ertugrul Aktas,Gorkem Durak,Ulas Bagci
机构: 1. National Institutes of Health (美国国立卫生研究院); 2. University of California, San Francisco (加州大学旧金山分校); 3. Memorial Sloan Kettering Cancer Center (纪念斯隆-凯特琳癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to EMBC 2026

点击查看摘要

Abstract:Automated 3D segmentation of prostate lesions from biparametric MRI (bp-MRI) is essential for reliable algorithmic analysis, but achieving high precision remains challenging. Volumetric methods must combine multiple modalities while ensuring anatomical consistency, but current models struggle to integrate cross-modal information reliably. While vision-language models (VLMs) are replacing the currently used architectural designs, they still lack the fine-grained, lesion-level semantics required for effective localized guidance. To address these limitations, we propose a new multi-encoder U-Net architecture incorporating three key innovations: (1) an alignment loss that enhances foreground text-image similarity to inject lesion semantics; (2) a heatmap loss that calibrates the similarity map and suppresses spurious background activations; and (3) a final-stage, confidence-gated multi-head cross-attention refiner that performs localized boundary edits in high-confidence regions. A phase-scheduled training regime stabilizes the optimization of these components. Our method consistently outperforms prior approaches, establishing a new state-of-the-art on the PI-CAI dataset through enhanced multi-modal fusion and localized text guidance. Our code is available at this https URL.
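文中"对齐损失"通过提升前景(病灶)区域的文本-图像相似度注入语义,其思想可用余弦相似度简要示意(特征维度、输入与函数名均为假设,非论文实现):

```python
import math

def alignment_loss(img_feats, text_feat, fg_mask):
    """Mean (1 - cosine similarity) between one text embedding and the
    image features at foreground positions only; background positions
    are ignored, so only lesion semantics are pulled toward the text."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a)) or 1e-9
        nb = math.sqrt(sum(x * x for x in b)) or 1e-9
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    fg = [f for f, m in zip(img_feats, fg_mask) if m]
    if not fg:
        return 0.0
    return sum(1.0 - cos(f, text_feat) for f in fg) / len(fg)

feats = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]   # toy per-voxel features
loss = alignment_loss(feats, (1.0, 0.0), [1, 0, 1])
print(loss)
```

论文中的热图损失则是反向约束:抑制背景位置的相似度激活,两者共同校准相似性图。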

[CV-110] DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax

【速读】:该论文旨在解决文本驱动的可控舞蹈生成问题,其核心挑战在于高质量数据集稀缺以及复杂编舞动作难以准确表达。由于舞蹈具有复杂的时空动态性、强方向性和身体各部位高度解耦的运动特性,传统方法难以实现高保真且可控的舞蹈生成。解决方案的关键在于提出一种新的理论框架——编舞语法(Choreographic Syntax),并基于此构建了迄今最细粒度的舞蹈数据集 DanceFlow(包含41小时高质量动作与634万词描述)。同时,设计了针对人体解剖结构的 Motion Transformer 模型 DanceCrafter,采用连续流形运动表示和混合归一化策略以提升优化稳定性,并引入解剖感知损失函数显式约束身体部位的解耦运动特性,从而实现复杂舞蹈序列的高保真、稳定且细粒度可控生成。

链接: https://arxiv.org/abs/2604.18648
作者: Hang Yuan,Xiaolin Hu,Yan Wan,Menglin Gao,Wenzhe Yu,Cong Huang,Fei Xu,Qing Li,Christina Dan Wang,Zhou Yu,Kai Chen
机构: East China Normal University (华东师范大学); Beijing Dance Academy (北京舞蹈学院); Beijing University of Posts and Telecommunications (北京邮电大学); New York University Shanghai (纽约大学上海分校); Zhongguancun Academy (中关村学院); Zhongguancun Institute of Artificial Intelligence (中关村人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 13 figures

点击查看摘要

Abstract:Text-driven controllable dance generation remains under-explored, primarily due to the severe scarcity of high-quality datasets and the inherent difficulty of articulating complex choreographies. Characterizing dance is particularly challenging owing to its intricate spatial dynamics, strong directionality, and the highly decoupled movements of distinct body parts. To overcome these bottlenecks, we bridge principles from dance studies, human anatomy, and biomechanics to propose \textitChoreographic Syntax, a novel theoretical framework with a tailored annotation system. Grounded in this syntax, we combine professional dance archives with high-fidelity motion capture data to construct \textbfDanceFlow, the most fine-grained dance dataset to date. It encompasses 41 hours of high-quality motions paired with 6.34 million words of detailed descriptions. At the model level, we introduce \textbfDanceCrafter, a tailored motion transformer built upon the Momentum Human Rig. To circumvent optimization instabilities, we construct a continuous manifold motion representation paired with a hybrid normalization strategy. Furthermore, we design an anatomy-aware loss to explicitly regulate the decoupled nature of body parts. Together, these adaptations empower DanceCrafter to achieve the high-fidelity and stable generation of complex dance sequences. Extensive evaluations and user studies demonstrate our state-of-the-art performance in motion quality, fine-grained controllability, and generation naturalness.

[CV-111] StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network

【速读】:该论文旨在解决植物气孔(stomata)表型分析中准确性和高通量难以兼顾的问题,传统方法依赖破坏性采样和人工标注,限制了大规模田间应用。其核心解决方案是提出一种集成恢复-检测的非侵入式框架 StomaD2,关键创新在于:1)基于扩散模型的图像恢复模块以提升复杂成像条件下的图像质量;2)设计专用的旋转目标检测网络,通过列结构实现全局特征交互、上下文感知的重采样与重加权机制增强多尺度一致性,并引入特征重组模块提高对复杂背景的区分能力,从而在多种作物数据集上实现高达0.994和0.992的准确率及0.989的F1-score/mAP,显著优于现有主流模型。

链接: https://arxiv.org/abs/2604.18632
作者: Quanling Zhao,Meng’en Qin,Yanfeng Sun,Yuan Miao,Xiaohui Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Stomata play a crucial role in regulating plant physiological processes and reflecting environmental responses. However, accurate and high-throughput stomatal phenotyping remains challenging, as conventional approaches rely on destructive sampling and manual annotation, restricting large-scale and field deployment. To overcome these limitations, a noninvasive restoration-detection integrated framework, termed StomaD2, is developed to achieve accurate and fast stomatal phenotyping under complex imaging conditions. The framework incorporates a diffusion-based restoration module to recover degraded images and a specialized rotated object detection network tailored to the small, dense, and cluttered characteristics of stomata. The proposed network enhances feature representation through three key innovations: a column-wise structure for global feature interaction, context-aware resampling and reweighting mechanism to improve multi-scale consistency, and a feature reassembly module to boost discrimination against complex backgrounds. In extensive comparisons, StomaD2 demonstrated state-of-the-art performance. On public Maize and Wheat datasets, it achieved accuracies of 0.994 and 0.992, respectively, significantly outperforming existing benchmarks. When benchmarked against ten other advanced models, including Oriented Former and YOLOv12, StomaD2 achieved a top-tier F1-score/mAP of 0.989. The framework is integrated into a user-friendly, field-operable system that supports the fast extraction of eight stomatal phenotypes, such as density and conductance. Validated on more than 130 plant species, StomaD2’s results highlight its strong generalizability and potential for large-scale phenotyping, plant physiology analysis, and precision agriculture applications.

[CV-112] Vision-Based Human Awareness Estimation for Enhanced Safety and Efficiency of AMRs in Industrial Warehouses

【Quick Read】: This paper addresses the problem that, in warehouse environments shared by humans and robots, autonomous mobile robots (AMRs) treat humans as generic dynamic obstacles and therefore behave overly conservatively, e.g. slowing down or detouring unnecessarily even when a human is already aware of the robot and can safely share the space. The key to the solution is a real-time vision method based on a monocular RGB camera that fuses state-of-the-art 3D human pose estimation with head-orientation recognition to determine a human's position relative to the AMR and their viewing cone, and thereby assess whether they have noticed the AMR. This allows the AMR to dynamically adapt its motion strategy to the human's awareness state, improving both safety and operational efficiency in industrial automation scenarios.

Link: https://arxiv.org/abs/2604.18627
Authors: Maximilian Haug(1),Christian Stippel(2),Lukas Pscherer(3),Benjamin Schwendinger(1),Ralph Hoch(3 and 4),Angel Gaydarov(1),Sebastian Schlund(1),Thilo Sauter(4) ((1) Fraunhofer Austria Research GmbH, Vienna, Austria, (2) Computer Vision Lab, TU Wien, Vienna, Austria, (3) Digital Factory Vorarlberg GmbH, Dornbirn, Austria, (4) Institute of Computer Technology, TU Wien, Vienna, Austria)
Institution: Austrian Research Promotion Agency
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 5 pages, 2 figures

Click to view abstract

Abstract:Ensuring human safety is of paramount importance in warehouse environments that feature mixed traffic of human workers and autonomous mobile robots (AMRs). Current approaches often treat humans as generic dynamic obstacles, leading to conservative AMR behaviors like slowing down or detouring, even when workers are fully aware and capable of safely sharing space. This paper presents a real-time vision-based method to estimate human awareness of an AMR using a single RGB camera. We integrate state-of-the-art 3D human pose lifting with head orientation estimation to ascertain a human’s position relative to the AMR and their viewing cone, thereby determining if the human is aware of the AMR. The entire pipeline is validated using synthetically generated data within NVIDIA Isaac Sim, a robust physics-accurate robotics simulation environment. Experimental results confirm that our system reliably detects human positions and their attention in real time, enabling AMRs to safely adapt their motion based on human awareness. This enhancement is crucial for improving both safety and operational efficiency in industrial and factory automation settings.
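The awareness estimate described above reduces to a simple geometric test: the AMR counts as noticed when the direction from the human's head to the robot falls inside a cone around the estimated gaze direction. A minimal sketch of that check (the function name, point convention, and the 60° half-angle are illustrative assumptions, not values from the paper):

```python
import math

def is_aware(head_pos, gaze_dir, robot_pos, half_angle_deg=60.0):
    """Return True if the robot lies inside the human's viewing cone.

    head_pos, robot_pos: (x, y, z) positions; gaze_dir: gaze vector estimated
    from head orientation. half_angle_deg is an assumed cone half-width.
    """
    # Unit vector from the head toward the robot.
    to_robot = [r - h for r, h in zip(robot_pos, head_pos)]
    norm = math.sqrt(sum(c * c for c in to_robot)) or 1e-9
    to_robot = [c / norm for c in to_robot]
    # Normalize the gaze direction as well.
    gnorm = math.sqrt(sum(c * c for c in gaze_dir)) or 1e-9
    gaze = [c / gnorm for c in gaze_dir]
    # Inside the cone iff the angle between the vectors is below the half-angle.
    cos_angle = sum(a * b for a, b in zip(gaze, to_robot))
    return cos_angle >= math.cos(math.radians(half_angle_deg))

# Human at the origin looking along +x; a robot ahead is seen, one behind is not.
print(is_aware((0, 0, 1.7), (1, 0, 0), (2, 0, 0.5)))   # robot in front
print(is_aware((0, 0, 1.7), (1, 0, 0), (-2, 0, 0.5)))  # robot behind
```

In the paper's pipeline the gaze vector would come from the head-orientation estimator and the positions from 3D pose lifting; here they are hard-coded for illustration.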

[CV-113] Can We Build Scene Graphs Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

【Quick Read】: This paper targets the fact that existing scene graph generation (SGG) methods treat the task as a one-shot, deterministic classification problem rather than a genuinely progressive, generative modeling challenge. The core question is how to generate objects (nodes) and visual relationships (edges) through a continuous-time, evolutionary process while keeping semantics and geometry consistent. The key to the solution is FlowSG, which recasts SGG as continuous-time transport over a hybrid discrete-continuous state space: a VQ-VAE first quantizes the scene graph into compact, predictable discrete tokens; a graph Transformer then predicts a conditional velocity field to drive the evolution of continuous geometry (bounding boxes) and updates the discrete posteriors (object features and predicate labels), coupling semantics and geometry through flow-conditioned message aggregation. Training combines a geometry-level flow-matching loss with a discrete-flow objective, enabling few-step inference and compatibility with standard detectors and segmenters, and yields an average gain of about 3 points over one-shot classification baselines on the VG and PSG datasets.

Link: https://arxiv.org/abs/2604.18623
Authors: Xin Hu,Ke Qin,Wen Yin,Yuan-Fang Li,Ming Li,Tao He
Institution: UESTC; Tianfu Jiangxi Laboratory; Monash University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.

[CV-114] SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy

【Quick Read】: This paper tackles controllable cooperative humanoid manipulation in embodied intelligence, whose core challenges are data scarcity, the complexity of multi-agent coordination, and limited cross-object generalization. The key to the solution is the unified framework SynAgent, which uses Solo-to-Cooperative Agent Synergy to turn abundant single-human-object interaction data into multi-human-object cooperative behaviors. An Interact Mesh built via Delaunay tetrahedralization preserves semantic integrity during motion retargeting; single-agent pretraining is combined with decentralized multi-agent PPO policy optimization; and finally a conditional VAE with multi-teacher distillation enables stable, controllable object-level trajectory execution.

Link: https://arxiv.org/abs/2604.18557
Authors: Wei Yao,Haohan Ma,Hongwen Zhang,Yunlian Sun,Liangjun Xing,Zhile Yang,Yuanjun Guo,Yebin Liu,Jinhui Tang
Institution: Nanjing University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Beijing Normal University; Tsinghua University; Nanjing Forestry University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi-agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single-human data through decentralized training and multi-agent PPO. Finally, we develop a trajectory-conditioned generative policy using a conditional VAE, trained via multi-teacher distillation from motion imitation priors to achieve stable and controllable object-level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory-conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: this http URL

[CV-115] A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation

【Quick Read】: This paper addresses the unclear performance picture of current visual state-space models (Visual SSMs) for remote-sensing semantic segmentation: existing studies fail to isolate encoder effects from decoder and training choices, making fair comparison of their practical advantages difficult. The key to the solution is a strictly controlled benchmark: under a unified four-stage feature interface and a fixed lightweight decoder, only the encoder is varied, and three representative visual SSM families (VMamba, MambaVision, and Spatial-Mamba) are systematically evaluated on the LoveDA and ISPRS Potsdam datasets. This single-variable design reveals that encoder improvements bring only marginal gains, that cross-domain generalization is strongly asymmetric, and that boundary delineation is the dominant failure mode under distribution shift, providing a reproducible and practical reference for the design of future Mamba-based segmentation backbones.

Link: https://arxiv.org/abs/2604.18721
Authors: Nichula Wasalathilaka,Dineth Perera,Oshadha Samarakoon,Buddhi Wijenayake,Roshan Godaliyadda,Vijitha Herath,Parakrama Ekanayake
Institution: University of Peradeniya
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, Accepted for publication at IEEE IGARSS 2026

Click to view abstract

Abstract:Visual state-space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial-Mamba, for remote-sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified 4-stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra-family scaling yields only modest gains, cross-domain generalization is strongly asymmetric, and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba-based segmentation backbones.

Artificial Intelligence

[AI-0] UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

【Quick Read】: This paper addresses the bottleneck in scaling humanoid foundation models caused by scarce robot data, in particular how to effectively exploit large-scale egocentric human data for cross-embodiment knowledge transfer. The core challenge is the cross-embodiment gap created by kinematic mismatches between humans and humanoids. The key to the solution is UniT (Unified Latent Action Tokenizer via Visual Anchoring), a tri-branch cross-reconstruction framework: actions predict vision to anchor kinematics to physical outcomes, vision reconstructs actions to filter out irrelevant visual confounders, and a fusion branch merges the purified modalities into an embodiment-agnostic discrete latent space. This establishes a unified physical language that enables zero-shot human-to-humanoid task transfer and improved action controllability.

Link: https://arxiv.org/abs/2604.19734
Authors: Boyu Chen,Yi Chen,Lu Qiu,Jerry Bai,Yuying Ge,Yixiao Ge
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Click to view abstract

Abstract:Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergies these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.

[AI-1] FASTER: Value-Guided Sampling for Fast RL

【Quick Read】: This paper addresses the high computational cost of sampling-based test-time scaling in current diffusion-based reinforcement learning (RL) algorithms, where performance gains come from repeatedly sampling action candidates and selecting the best one. The key to the solution is FASTER, which models the denoising of multiple action candidates as a Markov Decision Process (MDP), predicts the downstream value of each candidate before denoising completes, and filters candidates accordingly, enabling efficient action selection. This lets the model match or exceed the performance of conventional sampling without a significant increase in compute, and the method is general enough to plug into existing generative RL algorithms.

Link: https://arxiv.org/abs/2604.19730
Authors: Perry Dong,Alexander Swerdlow,Dorsa Sadigh,Chelsea Finn
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at this https URL .
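The candidate-filtering MDP sketched above can be illustrated with a toy 1-D example: a pool of noisy action candidates is partially denoised step by step, and a value function prunes the low-value half of the pool before denoising completes. Everything below (the stand-in denoising update, the proximity-based value function, and the keep fraction) is invented for illustration and is not FASTER's actual implementation:

```python
def progressive_filter(candidates, denoise_step, value_fn, n_steps, keep_frac=0.5):
    """Denoise a pool of action candidates, pruning low-value ones each step."""
    pool = list(candidates)
    for _ in range(n_steps):
        pool = [denoise_step(c) for c in pool]            # partial denoising
        pool.sort(key=value_fn, reverse=True)             # rank by predicted value
        pool = pool[:max(1, int(len(pool) * keep_frac))]  # keep the best fraction
    return pool[0]

# Toy 1-D stand-ins: "denoising" pulls a candidate toward a target action,
# and the value function scores proximity to that target.
target = 1.0
value = lambda a: -abs(a - target)
step = lambda a: a + 0.25 * (target - a)

start = [-2.5, -1.0, 0.2, 0.7, 1.6, 2.4, 3.0, -0.3]
best = progressive_filter(start, step, value, n_steps=4)
print(round(best, 3))  # → 0.905
```

The point of the sketch is the compute saving: most candidates are discarded after only one or two denoising steps instead of being fully denoised before selection.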

[AI-2] Benign Overfitting in Adversarial Training for Vision Transformers

【Quick Read】: This paper addresses the vulnerability of Vision Transformers (ViTs) to adversarial examples, and in particular the lack of theoretical grounding for robustness gains obtained through adversarial training. The key contribution is the first theoretical analysis of adversarial training for simplified ViT architectures, proving that under a suitable signal-to-noise-ratio condition and a moderate perturbation budget, adversarial training enables ViTs to achieve nearly zero robust training loss and robust generalization error, so that strong generalization persists even in the presence of overfitting (benign overfitting), a phenomenon previously observed only in CNNs.

Link: https://arxiv.org/abs/2604.19724
Authors: Jiaming Zhang,Meng Ding,Shaopeng Fu,Jingfeng Zhang,Di Wang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2409.19345 by other authors

Click to view abstract

Abstract:Despite the remarkable success of Vision Transformers (ViTs) across a wide range of vision tasks, recent studies have revealed that they remain vulnerable to adversarial examples, much like Convolutional Neural Networks (CNNs). A common empirical defense strategy is adversarial training, yet the theoretical underpinnings of its robustness in ViTs remain largely unexplored. In this work, we present the first theoretical analysis of adversarial training under simplified ViT architectures. We show that, when trained under a signal-to-noise ratio that satisfies a certain condition and within a moderate perturbation budget, adversarial training enables ViTs to achieve nearly zero robust training loss and robust generalization error under certain regimes. Remarkably, this leads to strong generalization even in the presence of overfitting, a phenomenon known as *benign overfitting*, previously only observed in CNNs (with adversarial training). Experiments on both synthetic and real-world datasets further validate our theoretical findings.

[AI-3] Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes

【Quick Read】: This paper addresses the computational bottleneck that discretizing continuous numerical attributes imposes on decision tree induction, where conventional methods require an O(N log N) exhaustive search that becomes inefficient on high-dimensional data. The key to the solution is Adaptive MSD-Splitting (AMSD), which dynamically adjusts the standard-deviation multiplier according to feature skewness, avoiding the information loss caused by fixed one-standard-deviation cutoffs on highly skewed distributions while narrowing intervals in dense regions to preserve discriminative resolution. The method improves classification accuracy while keeping roughly O(N) time complexity, and its Random Forest integration, RF-AMSD, achieves higher accuracy at lower computational cost than standard MSD-Splitting on several real-world biomedical and financial datasets.

Link: https://arxiv.org/abs/2604.19722
Authors: Jake Lee
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique – which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm – we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. Furthermore, we integrate AMSD into ensemble methods, specifically presenting the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets demonstrate that AMSD yields a 2-4% accuracy improvement over standard MSD-Splitting, while maintaining near-identical O(N) time complexity reductions compared to the O(N log N) exhaustive search. Our Random Forest extension achieves state-of-the-art accuracy at a fraction of standard computational costs, confirming the viability of adaptive statistical binning in large-scale ensemble learning architectures.
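The skewness-adaptive binning idea can be sketched in a few lines: compute the empirical mean and standard deviation, then shrink the standard-deviation multiplier as |skewness| grows so that cut points tighten around the dense region. The shrink schedule below (`base_k / (1 + shrink * |skew|)`) is a guessed placeholder for illustration; the abstract does not specify the paper's exact multiplier rule:

```python
import statistics

def skewness(xs):
    """Fisher-Pearson sample skewness."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    if sd == 0:
        return 0.0
    return sum(((x - mu) / sd) ** 3 for x in xs) / len(xs)

def amsd_cutpoints(xs, base_k=1.0, shrink=0.5):
    """Skewness-adaptive mean/std cut points.

    Standard MSD-Splitting bins at mu +/- 1*sigma; here the multiplier is
    shrunk as |skewness| grows, narrowing the bins around the dense region.
    The shrink schedule is an assumption for illustration, not the paper's.
    """
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    k = base_k / (1.0 + shrink * abs(skewness(xs)))
    return (mu - k * sd, mu, mu + k * sd)

# Symmetric data keeps the full 1-sigma width; skewed data narrows it.
sym = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 2, 20]
print(amsd_cutpoints(sym))     # width equals 2*sigma (skewness is 0)
print(amsd_cutpoints(skewed))  # width shrinks below 2*sigma
```

Both `skewness` and `amsd_cutpoints` are single passes over the data, consistent with the near-O(N) cost the abstract claims for the binning step.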

[AI-4] A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

【Quick Read】: This paper addresses the poor interpretability and lack of explicit evidence grounding that result from current multimodal large language models (MLLMs) relying on implicit reasoning and internalized knowledge for artwork understanding. The key to the solution is A-MAR (Agent-based Multimodal Art Retrieval), which explicitly conditions retrieval on a structured reasoning plan: the interaction between a user query and an artwork is first decomposed into a multi-step reasoning chain specifying the goals and evidence requirements of each step, and retrieval is then conditioned on this plan, enabling targeted evidence selection and step-wise, interpretable reasoning chains. This design markedly improves knowledge-intensive understanding and reasoning transparency in the art domain.

Link: https://arxiv.org/abs/2604.19689
Authors: Shuai Wang,Hongyi Zhu,Jia-Hong Huang,Yixian Shen,Chengxi Zeng,Stevan Rudinac,Monika Kackovic,Nachoem Wijnberg,Marcel Worring
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: this https URL.

[AI-5] Learning Hybrid-Control Policies for High-Precision In-Contact Manipulation Under Uncertainty

【Quick Read】: This paper addresses the limitation of pose-based reinforcement learning control policies on tasks that require precise force control, such as peg-in-hole insertion of fragile parts, where the lack of explicit control over contact forces constrains performance. Traditional methods rely on low-level controllers to avoid damaging actions but struggle to adapt in complex contact scenarios. The key to the solution is hybrid position-force control policies that learn to dynamically select position or force control in each control dimension, together with Mode-Aware Training for Contact Handling (MATCH), which shapes the policy's action probabilities to explicitly mirror the mode-switching behavior of hybrid control, improving learning efficiency and robustness. Experiments show that MATCH substantially outperforms pose-only policies under extreme localization uncertainty, with large gains in both success rate and damage prevention.

Link: https://arxiv.org/abs/2604.19677
Authors: Hunter L. Brown,Geoffrey Hollinger,Stefan Lee
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reinforcement learning-based control policies have been frequently demonstrated to be more effective than analytical techniques for many manipulation tasks. Commonly, these methods learn neural control policies that predict end-effector pose changes directly from observed state information. For tasks like inserting delicate connectors which induce force constraints, pose-based policies have limited explicit control over force and rely on carefully tuned low-level controllers to avoid executing damaging actions. In this work, we present hybrid position-force control policies that learn to dynamically select when to use force or position control in each control dimension. To improve learning efficiency of these policies, we introduce Mode-Aware Training for Contact Handling (MATCH) which adjusts policy action probabilities to explicitly mirror the mode selection behavior in hybrid control. We validate MATCH's learned policy effectiveness using fragile peg-in-hole tasks under extreme localization uncertainty. We find MATCH substantially outperforms pose-control policies – solving these tasks with up to 10% higher success rates and 5x fewer peg breaks than pose-only policies under common types of state estimation error. MATCH also demonstrates data efficiency equal to pose-control policies, despite learning in a larger and more complex action space. In over 1600 sim-to-real experiments, we find MATCH succeeds twice as often as pose policies in high noise settings (33% vs. 68%) and applies about 30% less force on average compared to variable impedance policies on a Franka FR3 in laboratory conditions.

[AI-6] Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teaming

【Quick Read】: This paper addresses the difficulty of optimizing joint human-robot task plans, in particular how to model individual human capabilities and preferences to improve collaboration efficiency in human workspaces. Traditional approaches treat task-level and motion-level adaptation in isolation: the former handles allocation and scheduling but ignores spatial interference in close-proximity settings, while the latter focuses on collision avoidance but ignores task context. The key to the solution is RAPIDDS, which uses multi-cycle interaction data to jointly model an individual's spatial behavior (motion paths) and temporal behavior (task completion times), and on that basis jointly optimizes task schedules and steers diffusion models of robot motion, maximizing overall efficiency while minimizing human-robot proximity and preserving safety.

Link: https://arxiv.org/abs/2604.19670
Authors: Alex Cuellar,Michael Hagenow,Julie Shah
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures

Click to view abstract

Abstract:Effective human-robot teaming is crucial for the practical deployment of robots in human workspaces. However, optimizing joint human-robot plans remains a challenge due to the difficulty of modeling individualized human capabilities and preferences. While prior research has leveraged the multi-cycle structure of domains like manufacturing to learn an individual’s tendencies and adapt plans over repeated interactions, these techniques typically consider task-level and motion-level adaptation in isolation. Task-level methods optimize allocation and scheduling but often ignore spatial interference in close-proximity scenarios; conversely, motion-level methods focus on collision avoidance while ignoring the broader task context. This paper introduces RAPIDDS, a framework that unifies these approaches by modeling an individual’s spatial behavior (motion paths) and temporal behavior (time required to complete tasks) over multiple cycles. RAPIDDS then jointly adapts task schedules and steers diffusion models of robot motions to maximize efficiency and minimize proximity accounting for these individualized models. We demonstrate the importance of this dual adaptation through an ablation study in simulation and a physical robot scenario using a 7-DOF robot arm. Finally, we present a user study (n=32) showing significant plan improvement compared to non-adaptive systems across both objective metrics, such as efficiency and proximity, and subjective measures, including fluency and user preference. See this paper’s companion video at: this https URL.

[AI-7] An AI Agent Execution Environment to Safeguard User Data

【Quick Read】: This paper addresses the privacy-leakage risk faced by AI agents handling private user data, particularly how to guarantee data confidentiality when the model may be attacked (e.g., via prompt injection) or the service provider is untrusted. The key to the solution is GAAP (Guaranteed Accounting for Agent Privacy), an execution environment driven by dynamic user permission specifications. It augments Information Flow Control (IFC) with persistent data stores and annotation techniques to precisely track private data across execution steps and across tasks over time and to enforce access control, deterministically guaranteeing that data is used only as the user authorized, without trusting the model and without requiring user prompts to be attack-free, while blocking all known data-exfiltration attacks.

Link: https://arxiv.org/abs/2604.19657
Authors: Robert Stanley,Avi Verma,Lillian Tsai,Konstantinos Kallas,Sam Kumar
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments:

Click to view abstract

Abstract:AI agents promise to serve as general-purpose personal assistants for their users, which requires them to have access to private user data (e.g., personal and financial information). This poses a serious risk to security and privacy. Adversaries may attack the AI model (e.g., via prompt injection) to exfiltrate user data. Furthermore, sharing private data with an AI agent requires users to trust a potentially unscrupulous or compromised AI model provider with their private data. This paper presents GAAP (Guaranteed Accounting for Agent Privacy), an execution environment for AI agents that guarantees confidentiality for private user data. Through dynamic and directed user prompts, GAAP collects permission specifications from users describing how their private data may be shared, and GAAP enforces that the agent’s disclosures of private user data, including disclosures to the AI model and its provider, comply with these specifications. Crucially, GAAP provides this guarantee deterministically, without trusting the agent with private user data, and without requiring any AI model or the user prompt to be free of attacks. GAAP enforces the user’s permission specification by tracking how the AI agent accesses and uses private user data. It augments Information Flow Control with novel persistent data stores and annotations that enable it to track the flow of private information both across execution steps within a single task, and also over multiple tasks separated in time. Our evaluation confirms that GAAP blocks all data disclosure attacks, including those that make other state-of-the-art systems disclose private user data to untrusted parties, without a significant impact on agent utility. 

[AI-8] A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities

【Quick Read】: This paper addresses the trade-off between privacy protection and information utility for human mobility data. As generative AI advances, traditional privacy techniques such as aggregation, obfuscation, or noise addition are limited by the severe utility cost they incur. The paper proposes a new utility-evaluation framework as a first step toward resolving this trade-off, argues that privacy evaluation remains a major challenge, and recommends adversarial evaluation in line with current EU regulation. The key contribution is a new membership inference attack against a specific subcategory of generative models, showing that models previously deemed private due to their resistance to the trajectory-user-linking problem remain vulnerable.

Link: https://arxiv.org/abs/2604.19653
Authors: Aya Cherigui,Florent Guépin,Arnaud Legendre,Jean-François Couchot
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can contain information such as religious beliefs and political affiliations. Historically, it has been proposed to modify the information using techniques such as aggregation, obfuscation, or noise addition, to adequately protect privacy and eliminate concerns. As these methods come at a great cost in utility, new methods leveraging developments in generative models were introduced. The extent to which such methods answer the privacy-utility trade-off remains an open problem. In this paper, we take a first step towards solving it by introducing and applying a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a great challenge to consider and that it should be tackled through adversarial evaluation in accordance with the current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance over the trajectory user-linking problem.

[AI-9] Environmental Sound Deepfake Detection Using Deep-Learning Framework

【Quick Read】: This paper addresses environmental sound deepfake detection (ESDD), i.e., determining whether the sound scene and sound events in an input audio recording are fake. The key to the solution is a proposed deep-learning framework, validated through systematic experiments on the factors that drive task performance: first, deepfake detection for sound scenes and for sound events should be treated as separate tasks; second, fine-tuning pretrained models (such as WavLM) is more effective than training from scratch; finally, a pretrained WavLM fine-tuned with a three-stage training strategy achieves high Accuracy, F1 score, and AUC on both the EnvSDD and ESDD-Challenge-TestSet benchmarks, substantially improving detection performance.

Link: https://arxiv.org/abs/2604.19652
Authors: Lam Pham,Khoi Vu,Dat Tran,Phat Lam,Vu Nguyen,David Fischinger,Alexander Schindler,Martin Boyer,Son Le
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD) – the task of identifying whether the sound scene and sound event in an input audio recording is fake or not. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, and ensembles of spectrograms or network architectures affect ESDD task performance. The experimental results on the benchmark datasets of EnvSDD and ESDD-Challenge-TestSet indicate that detecting deepfake audio of sound scenes and detecting deepfake audio of sound events should be considered as individual tasks. We also show that finetuning a pre-trained model is more effective than training a model from scratch for the ESDD task. Ultimately, our best model, finetuned from the pre-trained WavLM model with the proposed three-stage training strategy, achieves an Accuracy of 0.98, an F1 score of 0.95, and an AUC of 0.99 on the EnvSDD test subset, and an Accuracy of 0.88, an F1 score of 0.77, and an AUC of 0.92 on the ESDD-Challenge-TestSet dataset.
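The Accuracy/F1/AUC figures reported above can be computed from raw detector scores without any dependency. A minimal rank-based AUC and F1 sketch (the label convention 1 = fake, 0 = real, and the 0.5 decision threshold are assumptions for illustration, not from the paper):

```python
def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation.

    scores: detector outputs, higher = more likely fake; labels: 1 = fake,
    0 = real. A tie between a fake and a real score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1(preds, labels):
    """F1 score for binary predictions (1 = fake)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

scores = [0.9, 0.8, 0.3, 0.6, 0.1]
labels = [1, 1, 0, 1, 0]
print(auc(scores, labels))                                    # → 1.0
print(f1([1 if s >= 0.5 else 0 for s in scores], labels))     # → 1.0
```

Note that AUC is threshold-free (it ranks scores), whereas Accuracy and F1 depend on the chosen decision threshold.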

[AI-10] Safety-Critical Contextual Control via Online Riemannian Optimization with World Models

【Quick Read】: This paper addresses safety-critical contextual control, where a Planner must optimize a task objective for an unknown dynamical system using only feasibility samples from a black-box Simulator, conditioned on a context signal ξt. The key to the solution is a sample-based Penalized Predictive Control (PPC) framework grounded in online Riemannian optimization: the Simulator compresses the feasibility manifold into a score-based density estimate p̂(u | ξt), endowing the action space with a Riemannian geometry that guides the Planner's gradient descent. The barrier curvature κ(ξt), the minimum curvature of the conditional log-density -ln p̂(· | ξt), governs both the convergence rate and the safety margin, replacing the Lipschitz constant of the unknown dynamics as the key quantity controlling performance.

Link: https://arxiv.org/abs/2604.19639
Authors: Tongxin Li
Institution: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 20 pages, 12 figures

Click to view abstract

Abstract:Modern world models are becoming too complex to admit explicit dynamical descriptions. We study safety-critical contextual control, where a Planner must optimize a task objective using only feasibility samples from a black-box Simulator, conditioned on a context signal \xi_t . We develop a sample-based Penalized Predictive Control (PPC) framework grounded in online Riemannian optimization, in which the Simulator compresses the feasibility manifold into a score-based density \hatp(u \mid \xi_t) that endows the action space with a Riemannian geometry guiding the Planner’s gradient descent. The barrier curvature \kappa(\xi_t) , the minimum curvature of the conditional log-density -\ln\hatp(\cdot\mid\xi_t) , governs both convergence rate and safety margin, replacing the Lipschitz constant of the unknown dynamics. Our main result is a contextual safety bound showing that the distance from the true feasibility manifold is controlled by the score estimation error and a ratio that depends on \kappa(\xi_t) , both of which improve with richer context. Simulations on a dynamic navigation task confirm that contextual PPC substantially outperforms marginal and frozen density models, with the advantage growing after environment shifts.
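One way to write out the barrier curvature described above, reading "minimum curvature of the conditional log-density" as the smallest Hessian eigenvalue over the action space (this eigenvalue formulation is an interpretive sketch, not quoted from the paper):

```latex
\kappa(\xi_t) \;=\; \min_{u}\; \lambda_{\min}\!\left( \nabla_u^2 \left[ -\ln \hat{p}(u \mid \xi_t) \right] \right)
```

Intuitively, a larger κ(ξt) means the log-density barrier bends more sharply around the feasibility manifold, which per the abstract tightens both the convergence rate and the safety margin, playing the role the Lipschitz constant of the dynamics would play in a classical analysis.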

[AI-11] Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

【速读】:该论文旨在解决生成式模型在目标说话人提取(Target Speaker Extraction, TSE)任务中因依赖全局上下文而导致无法实时部署的问题,尤其是在流式场景下,由于训练与推理阶段的严重不匹配,常导致性能灾难性下降。解决方案的关键在于提出首个专为流式TSE设计的自回归(Autoregressive, AR)模型,并引入“分块交错拼接范式”(Chunk-wise Interleaved Splicing Paradigm),以实现高效且稳定的流式推理;同时设计历史上下文精炼机制,通过利用历史信息缓解语音片段间的边界不连续性,从而保障输出语音的连贯性与可懂度。实验表明,该方法在低延迟下保持100%稳定性且性能优于或媲美离线基线,Real-Time-Factor (RTF) 达到0.248,验证了AR生成骨干在低延迟场景下的可行性。

链接: https://arxiv.org/abs/2604.19635
作者: Shuhai Peng,Hui Lu,Jinjiang Liu,Liyang Chen,Guiping Zhong,Jiakui Li,Huimeng Wang,Haiyun Li,Liang Cao,Shiyin Kang,Zhiyong Wu
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure the coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while the AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.

[AI-12] Time Series Augmented Generation for Financial Applications

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂定量金融任务中推理能力评估的难题,现有基准测试常无法有效分离模型对查询解析与计算编排的核心能力。其解决方案的关键在于提出一种新颖的评估方法和基准,通过引入时间序列增强生成框架(Time Series Augmented Generation, TSAG),使LLM代理能够将量化任务委托给可验证的外部工具,从而更精准地衡量其在金融时间序列分析中的推理表现。该方法以100个金融问题为基准,对比多个先进代理(如GPT-4o、Llama 3、Qwen2)在工具选择准确性、忠实度及幻觉控制等指标上的表现,结果表明具备能力的代理可在极低幻觉水平下实现接近完美的工具使用准确率,验证了工具增强范式的有效性。

链接: https://arxiv.org/abs/2604.19633
作者: Anton Kolonin,Alexey Glushchenko,Evgeny Bochkov,Abhishek Saxena
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 11 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent’s core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent’s reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.

[AI-13] Lyapunov-Certified Direct Switching Theory for Q-Learning

【速读】:该论文旨在解决常步长Q-learning算法的收敛性分析难题,特别是如何在有限时间内提供对最终迭代点的误差界。其解决方案的关键在于将Q-learning的贝尔曼最大误差精确表示为一个随机策略,从而构建出一个带有鞅差噪声的切换线性条件均值递推模型;该模型的内在漂移率由直接切换族的联合谱半径(Joint Spectral Radius, JSR)决定,而JSR可严格小于传统的行和范数速率。基于此表示,作者通过构造由JSR诱导的Lyapunov函数,推导出有限时间内的最终迭代误差界,并进一步给出了可计算的二次证书形式。

链接: https://arxiv.org/abs/2604.19569
作者: Donghwan Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Q-learning is one of the most fundamental algorithms in reinforcement learning. We analyze constant-stepsize Q-learning through a direct stochastic switching system representation. The key observation is that the Bellman maximization error can be represented exactly by a stochastic policy. Therefore, the Q-learning error admits a switched linear conditional-mean recursion with martingale-difference noise. The intrinsic drift rate is the joint spectral radius (JSR) of the direct switching family, which can be strictly smaller than the standard row-sum rate. Using this representation, we derive a finite-time final-iterate bound via a JSR-induced Lyapunov function and then give a computable quadratic-certificate version.
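
作为补充,下面给出论文所分析的"常步长表格型 Q-learning"在一个玩具 MDP 上的最小示例(笔者自拟,仅演示该更新式本身;论文的切换系统分析与 JSR 界不在此体现):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.5, 0.1
# 两状态、两动作玩具 MDP:P[s, a] 为转移分布,R[s, a] 为即时奖励
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

Q = np.zeros((2, 2))
s = 0
for _ in range(60_000):
    a = rng.integers(2)                     # 均匀行为策略
    s_next = rng.choice(2, p=P[s, a])
    # 常步长更新;max Q(s',·) 即论文改写为随机策略的 Bellman 最大化项
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# 与同一 MDP 上的值迭代结果对比:常步长下最终迭代误差停留在 O(alpha) 量级而非趋于零
Q_star = np.zeros_like(Q)
for _ in range(200):
    Q_star = R + gamma * np.einsum('sap,p->sa', P, Q_star.max(axis=1))
print(np.abs(Q - Q_star).max())
```

论文给出的正是这种"最终迭代点"误差的有限时间界,并用直接切换族的联合谱半径刻画其收敛速率。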

[AI-14] Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在跨模态语义算术(semantic arithmetic)任务中表现不足的问题,特别是如何从图像中推理出抽象的语义关系(如“is made of”),从而实现感知与符号推理的结合。现有方法依赖于图像特征解码后的向量运算,存在模态鸿沟且缺乏系统评估。其解决方案的关键在于:首先构建了Image-Relation-Pair Dataset (IRPD) 以标准化评测跨模态语义算术能力;其次提出Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT),通过可验证的目标函数和Group Relative Policy Optimization (GRPO) 对LVLM进行强化微调,使模型能够基于图像直接推导出结构化的语义关系,显著提升在IRPD和真实世界Visual7W-Telling数据集上的性能,从而增强服务机器人在复杂环境中对物体、动作及关系的理解与决策能力。

链接: https://arxiv.org/abs/2604.19567
作者: Chuou Xu,Liya Ji,Qifeng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy “king”-“man”+“woman” = “queen” illustrates relational reasoning, yet replacing text with images of “king” and “man” significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that “powder” and “cake” are related by “is made of” grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots’ ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.
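
摘要中 "king" - "man" + "woman" = "queen" 的文本类比可以用一个极简的向量算术示例说明(以下向量为手工构造的三维玩具嵌入,并非论文方法;论文的难点恰在于把这种文本侧算术迁移到图像输入上):

```python
import numpy as np

# 手工构造的玩具嵌入:三个维度分别近似 royalty / male / female
emb = {
    "king":   np.array([1.0, 1.0, 0.0]),  # royalty + male
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),  # royalty + female
    "powder": np.array([0.5, 0.2, 0.1]),  # 干扰项
    "cake":   np.array([0.4, 0.1, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# 向量算术 + 余弦检索:排除查询词后取最相似项
query = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(query, emb[w]))
print(best)  # queen
```

真实系统中嵌入由模型学得且维度更高,但"差向量编码关系"这一机制相同。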

[AI-15] Detecting Data Contamination in Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)训练数据中可能包含版权内容时,如何通过黑盒成员推理攻击(Black-box Membership Inference Attacks, MIA)可靠识别特定文档是否被纳入训练集的问题。其解决方案的关键在于系统性地评估当前最先进的黑盒MIA方法在多种LLMs上的表现,并引入一种名为“熟悉度排序”(Familiarity Ranking)的新方法以探索更有效的攻击路径。实验结果表明,所有方法在多个LLMs上均未表现出显著的检测能力(AUC-ROC≈0.5),且随着LLM性能提升,假阳性率(FPR)和真阳性率(TPR)均升高,说明LLMs更强的泛化与推理能力显著增加了黑盒MIA的难度。

链接: https://arxiv.org/abs/2604.19561
作者: Juliusz Janicki,Savvas Chamezopoulos,Evangelos Kanoulas,Georgios Tsatsaronis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) utilize large amounts of data for their training, some of which may come from copyrighted sources. Membership Inference Attacks (MIA) aim to detect those documents and whether they have been included in the training corpora of the LLMs. The black-box MIAs require a significant amount of data manipulation; therefore, their comparison is often challenging. We study state-of-the-art (SOTA) MIAs under the black-box assumptions and compare them to each other using a unified set of datasets to determine if any of them can reliably detect membership under SOTA LLMs. In addition, a new method, called the Familiarity Ranking, was developed to showcase a possible approach to black-box MIAs, thereby giving LLMs more freedom in their expression to understand their reasoning better. The results indicate that none of the methods are capable of reliably detecting membership in LLMs, as shown by an AUC-ROC of approximately 0.5 for all methods across several LLMs. The higher TPR and FPR for more advanced LLMs indicate higher reasoning and generalizing capabilities, showcasing the difficulty of detecting membership in LLMs using black-box MIAs.
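
摘要中"AUC-ROC 约 0.5 即与随机猜测无异"可以用基于秩的 AUC(等价于 Mann-Whitney U 统计量)直接演示(示意代码,得分为模拟数据,非论文实验):

```python
import random

# 基于秩的 AUC-ROC:统计成员得分高于非成员得分的比例
def auc(pos, neg):
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
members     = [rng.gauss(0.0, 1.0) for _ in range(1000)]
non_members = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # 同分布:攻击无信号
leaky       = [rng.gauss(1.0, 1.0) for _ in range(1000)]  # 假想的"有泄漏"情形

print(round(auc(members, non_members), 3))  # 接近 0.5:无法区分成员
print(round(auc(leaky, non_members), 3))    # 明显高于 0.5:可检测的信号
```

论文的结论即各黑盒 MIA 方法在多个 LLM 上都落在前一种情形。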

[AI-16] DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

【速读】:该论文旨在解决当前多模态奖励模型(Multimodal Reward Model, MRM)训练中所面临的三大核心问题:偏好强度粒度不足、文本风格偏差以及不可靠的偏好信号,同时针对现有开源多模态偏好数据集噪声严重且缺乏高效可扩展的清洗方法的问题。其解决方案的关键在于提出DT2IT-MRM框架,该框架包含三个创新模块:去偏的偏好构建流程(Debiased preference construction pipeline)、将文本到图像(Text-to-Image, T2I)偏好数据进行新型重构的方法,以及一个迭代式训练机制(Iterative Training framework),该机制能够对已有多模态偏好数据集进行持续优化以提升MRM性能。实验表明,该方法在VL-RewardBench、Multimodal RewardBench和MM-RLHF-RewardBench三个主流基准上均达到新的最先进水平。

链接: https://arxiv.org/abs/2604.19544
作者: Zhihong Zhang,Jie Zhao,Xiaojian Huang,Jin Xu,Zhuodong Luo,Xin Liu,Jiansheng Wei,Xuejin Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: code will be uploaded to this https URL

点击查看摘要

Abstract:Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.

[AI-17] Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)在网络安全领域中对开放式、证据驱动的威胁狩猎(threat hunting)任务表现不佳的问题。现有LLM在结构化问答类安全基准测试中表现优异,但缺乏在真实场景下从原始Windows事件日志中自主识别恶意行为的能力。其解决方案的关键在于构建了一个名为Cyber Defense Benchmark的强化学习环境,该环境基于OTRF Security-Datasets中的106个真实攻击流程,将其封装为包含75,000–135,000条日志记录的SQLite数据库,并通过时间偏移和实体混淆模拟真实攻防场景。Agent需通过迭代执行SQL查询来定位恶意事件的时间戳,并以CTF风格评分与Sigma规则生成的真值对比,从而客观评估LLM在无引导条件下进行威胁狩猎的能力。实验表明,即使是最先进的模型如Claude Opus 4.6也仅能识别出3.8%的恶意事件,且没有任何模型达到部署门槛(即每个MITRE ATT&CK战术至少50%召回率),揭示了当前LLM在SecOps场景下仍存在显著能力鸿沟。

链接: https://arxiv.org/abs/2604.19533
作者: Alankrit Chona,Igor Kozlov,Ambuj Kumar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures, 5 tables. Complete benchmark and hunt traces available on request

点击查看摘要

Abstract:We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as ≥ 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated QA security benchmarks.
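
上述"每个 ATT&CK 战术召回率不低于 50% 才算通过"的判定标准可以写成如下示意代码(时间戳与战术标签均为虚构的玩具数据):

```python
# truth: {恶意事件时间戳: 所属 ATT&CK 战术};found: Agent 提交的旗标时间戳集合
def per_tactic_recall(found, truth):
    recall = {}
    for tactic in set(truth.values()):
        ts = [t for t, tac in truth.items() if tac == tactic]
        recall[tactic] = sum(t in found for t in ts) / len(ts)
    return recall

truth = {100: "execution", 150: "execution",
         200: "persistence", 300: "exfiltration"}
found = {100, 150, 300}

rec = per_tactic_recall(found, truth)
print(rec)
# 逐战术判定:任一战术召回率低于 0.5 即不通过
print(all(r >= 0.5 for r in rec.values()))  # False:persistence 召回率为 0
```

与单一的总体召回率(此处为 3/4)相比,逐战术判定能暴露被完全漏检的攻击阶段。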

[AI-18] BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

【速读】:该论文旨在解决现有符号音乐 tokenization 方法在时间结构建模上的局限性问题,即传统基于事件的 tokenization(如音符起始点、音高、时移等)隐式处理音乐时间规律,导致不同 token 跨越不均匀的时间跨度,从而影响模型对节奏和结构的精确捕捉。其解决方案的关键在于提出一种以固定时长单位(如一拍)为基础的新型 tokenization 策略:将每个时间步内同一音高的所有事件编码为一个 token,并显式地按时间步分组,形成类似钢琴卷帘(piano-roll)的稀疏表示。该方法在音乐续写与伴奏生成任务中展现出更优的音乐质量与结构一致性,同时在长程依赖建模上更具效率和有效性。

链接: https://arxiv.org/abs/2604.19532
作者: Lekai Qian,Haoyu Gu,Jingwei Zhao,Ziyu Wang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Preprint. 20 pages, 8 figures

点击查看摘要

Abstract:Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
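
按照上述思路,可以把一个简化的音符列表按"一拍一个时间步"分组编码。以下为笔者对该表示的示意性实现,token 记法系自拟,并非论文原格式:

```python
from collections import defaultdict

# 每个音符:(起始拍, 时值/拍, MIDI 音高)
notes = [(0.0, 1.0, 60), (0.0, 1.0, 64),
         (1.0, 0.5, 62), (1.5, 0.5, 62), (2.0, 2.0, 67)]

def tokenize_by_beat(notes, n_beats):
    steps = defaultdict(lambda: defaultdict(list))
    for onset, dur, pitch in notes:
        beat = int(onset)                 # 固定一拍为一个时间步
        steps[beat][pitch].append((onset - beat, dur))
    tokens = []
    for beat in range(n_beats):
        tokens.append(f"<beat:{beat}>")   # 显式的等长时间步标记
        for pitch, events in sorted(steps[beat].items()):
            # 同一时间步、同一音高的所有事件合并为一个 token
            ev = ";".join(f"{o:g}+{d:g}" for o, d in sorted(events))
            tokens.append(f"p{pitch}[{ev}]")
    return tokens

print(tokenize_by_beat(notes, 4))
```

注意第 1 拍上音高 62 的两次发音被合并进同一个 token,而空拍仍保留步标记——这正是"时间均匀推进 + 稀疏钢琴卷帘"的效果。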

[AI-19] Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods Theory and Experiments

【速读】:该论文旨在澄清RaBitQ与TurboQuant两种高效近似最近邻搜索方法之间的关系,解决当前文献中关于TurboQuant优于RaBitQ的宣称缺乏充分实证支持的问题。其解决方案的关键在于构建一个统一、可复现且对称的比较框架,系统性地从方法论、理论保证和实验性能三个维度对比两者;研究发现,TurboQuant并未在所有可比场景下提供一致优势,反而在多个配置中表现劣于RaBitQ,同时指出原TurboQuant论文中的部分运行时间和召回率结果无法从其开源实现中复现,从而揭示了二者共享的核心结构与真实差异,并暴露了实验结果的可复现性问题。

链接: https://arxiv.org/abs/2604.19528
作者: Jianyang Gao,Yutong Gou,Yuexuan Xu,Jifan Shi,Yongyi Yang,Shuolin Li,Raymond Chi-Wing Wong,Cheng Long
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:This technical note revisits the relationship between RaBitQ and TurboQuant under a unified comparison framework. We compare the two methods in terms of methodology, theoretical guarantees, and empirical performance, using a reproducible, transparent, and symmetric setup. Our results show that, despite the claimed advantage of TurboQuant, TurboQuant does not provide a consistent improvement over RaBitQ in directly comparable settings; in many tested configurations, it performs worse than RaBitQ. We further find that several reported runtime and recall results in the TurboQuant paper could not be reproduced from the released implementation under the stated configuration. Overall, this note clarifies the shared structure and genuine differences between the two lines of work, while documenting reproducibility issues in the experimental results reported by the TurboQuant paper.

[AI-20] Revac: A Social Deduction Reasoning Agent

【速读】:该论文旨在解决社交推理类游戏(如Mafia)中AI代理在信息不完全、存在欺骗行为的复杂社交环境中进行有效决策的问题。此类环境要求AI具备推理能力、记忆存储、对人类交互行为的理解以及动态适应策略的能力,而传统基于确定性规则或暴力搜索的方法难以应对。解决方案的关键在于构建一个模块化架构——Revac-8,其核心包括基于记忆的玩家画像机制、针对指控与辩护的社会图谱分析方法,以及根据情境动态调整语气的通信策略,从而实现对社交线索的结构化处理与自适应响应,显著提升了AI在高风险社交场景中的表现。

链接: https://arxiv.org/abs/2604.19523
作者: Mihir Shriniwas Arya,Avinash Anish,Aditya Ranjan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent developed for the Social Deduction track of the MindGames Arena competition, where it achieved first place. The final agent evolved from a simple two-stage reasoning system into a multi-module architecture that integrates memory-based player profiling, social-graph analysis of accusations and defenses, and dynamic tone selection for communication. These results highlight the importance of structured memory and adaptive communication for achieving strong performance in high-stakes social environments.

[AI-21] SimDiff: Depth Pruning via Similarity and Difference

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署效率优化中因层剪枝(depth pruning)导致的性能不稳定问题。现有方法通常依赖余弦距离(cosine distance)这一单一维度的启发式指标来判断层的重要性,但这种方法在不同模型架构下表现不可预测,甚至可能导致性能灾难性下降。解决方案的关键在于提出一种新的层重要性评估准则 SimDiff,该准则从两个正交视角联合衡量:表征相似性(representational similarity)和变换差异性(transformation difference)。其中,通过 MSSD(Mean Squared Signed Difference)捕捉对异常值敏感的决定性修正层,以及 MASD(Mean Absolute Signed Difference)稳健地度量每层的平均贡献,从而实现更可靠、高效的剪枝策略。

链接: https://arxiv.org/abs/2604.19520
作者: Yuli Chen,Shuhao Zhang,Fanshen Meng,Bo Cheng,Jiale Han,Qiang Tong,Xiulei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer’s average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B’s performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
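
下面用一个小示例说明"相似性 + 差异性"双视角为何比单一余弦距离更有区分力。MSSD/MASD 的具体公式系笔者按指标名称推测(均方/平均绝对的逐元素带符号差),仅作示意,并非论文原定义:

```python
import numpy as np

def layer_scores(x, y):
    # x:层输入表征,y:层输出表征
    cos = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    d = y - x                          # 该层的残差"变换量"
    mssd = float(np.mean(d ** 2))      # 对离群值敏感:放大少数决定性修正
    masd = float(np.mean(np.abs(d)))   # 稳健:度量平均贡献
    return cos, mssd, masd

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
redundant = x + rng.normal(scale=0.01, size=4096)  # 近似恒等的"冗余层"
decisive = x.copy()
decisive[:8] += 5.0                                # 仅少数维度大幅修正的"决定层"

for name, y in [("redundant", redundant), ("decisive", decisive)]:
    print(name, *(round(v, 4) for v in layer_scores(x, y)))
```

两层的余弦相似度都很高(均接近 1),仅凭余弦距离会把"决定层"误判为冗余,而对离群值敏感的差异指标能把两者拉开数量级差距。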

[AI-22] From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning ACL2026

【速读】:该论文旨在解决当前生成式引擎优化(Generative Engine Optimization, GEO)方法中存在的一大瓶颈问题:现有方法对每个查询实例独立优化,缺乏跨任务和跨引擎的策略积累与迁移能力,导致优化效率低且难以规模化。其解决方案的关键在于将GEO重构为一个策略学习问题,并提出MAGEO框架——该框架通过多智能体协作实现协调规划、编辑与保真度感知评估作为执行层,同时将验证过的编辑模式逐步提炼为可复用的、引擎特定的优化技能(optimization skills),从而实现策略的持续积累与迁移。此外,论文还引入双分支评估协议(Twin Branch Evaluation Protocol)和DSV-CF指标以实现因果归因与语义可见性-归属准确性统一评估,推动了可信GEO的可衡量发展。

链接: https://arxiv.org/abs/2604.19516
作者: Beining Wu,Fuyou Mao,Jiong Lin,Cheng Yang,Jiaxuan Lu,Yifu Guo,Siyu Zhang,Yifan Wu,Ying Huang,Fu Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine-specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV-CF, a dual-axis metric that unifies semantic visibility with attribution accuracy. We further release MSME-GEO-Bench, a multi-scenario, multi-engine benchmark grounded in real-world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine-specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning-driven paradigm for trustworthy GEO. Code is available at this https URL

[AI-23] When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在比特币交易欺诈检测任务中性能评估的可靠性问题,特别是针对现有研究中普遍存在的训练-测试数据泄露(leakage)现象。论文指出,尽管GCN、GraphSAGE、GAT和EvolveGCN等模型被广泛认为优于仅使用特征的基线方法,但这一结论未在无泄露的严格归纳式(inductive)评估协议下得到验证。其解决方案的关键在于设计并执行一个种子匹配的归纳式与归纳式对比实验,明确识别出F1分数差异主要源于训练阶段对测试时期邻接矩阵的暴露(即时间泄漏),并通过边缘随机化消融实验揭示真实图结构在时间分布偏移下可能具有误导性。结果表明,在无泄露条件下,仅使用原始特征的随机森林模型表现最优(F1 = 0.821),显著优于所有GNN变体,从而质疑了当前GNN在该任务中的有效性,并推动建立更严谨的评估标准。

链接: https://arxiv.org/abs/2604.19514
作者: Saket Maganti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
备注: Code to be released soon

点击查看摘要

Abstract:The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 +/- 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset’s topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.
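
文中的"边随机化消融"思路可用一个最小示例说明(笔者自拟的简化版本:仅随机重连边的终点,保持节点集、边数与两侧度序列不变,从而破坏"谁连到谁"的真实拓扑):

```python
import random

def shuffle_edges(edges, seed=0):
    rng = random.Random(seed)
    dsts = [d for _, d in edges]
    rng.shuffle(dsts)                  # 随机重排终点,源端度序列不变
    return [(s, d) for (s, _), d in zip(edges, dsts)]

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
shuffled = shuffle_edges(edges)
print(shuffled)
# 若模型在这样的随机图上表现反而更好,说明真实拓扑提供的是误导信号
```

这类消融是区分"图结构有用"与"图结构在时间分布偏移下有害"的简单而有力的对照。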

[AI-24] CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源或专家稀缺领域中因缺乏高质量、领域内示例而导致的推理性能受限问题。现有方法通过跨域检索作为替代示例虽有一定效果,但受限于源域与目标域之间的显著分布偏移(domain shift),难以有效提取和迁移潜在的推理结构。解决方案的关键在于提出CoDA框架,其核心创新是引入一个轻量级适配器(adapter),直接干预模型中间隐藏状态,并结合基于思维链(Chain-of-Thought, CoT)增强的特征蒸馏与最大均值差异(Maximum Mean Discrepancy, MMD)进行核化分布对齐,从而实现源域与目标域在隐式推理表征层面的有效对齐,显著提升跨域知识迁移的鲁棒性与系统性。

链接: https://arxiv.org/abs/2604.19488
作者: Jianzhi Yan,Le Liu,Buzhou Tang,Yang Xiang,Dongning Sun,Zhiming Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model’s performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model’s ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.
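
其中用于分布对齐的 MMD 是标准统计量。下面给出带 RBF 核的一个(有偏的)最小实现作为示意;σ 等超参为随意取值,论文中 MMD 作用于隐藏表征而非此处的模拟数据:

```python
import numpy as np

def mmd2_rbf(X, Y, sigma=4.0):
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    # 有偏估计(含对角项);作为对齐损失时常用此简化形式
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.normal(size=(200, 8))
tgt_same = rng.normal(size=(200, 8))            # 与源域同分布
tgt_shift = rng.normal(loc=1.0, size=(200, 8))  # 均值漂移的"目标域"

print(round(mmd2_rbf(src, tgt_same), 4))   # 接近 0
print(round(mmd2_rbf(src, tgt_shift), 4))  # 明显更大:检测到域偏移
```

最小化该量(对目标域隐藏状态可微)即把两个域的表征分布拉近,这正是 CoDA 做核化分布匹配的基本机制。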

[AI-25] Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

【速读】:该论文旨在解决长时程企业智能体(long-horizon enterprise agents)在受限记忆、多步推理和强监管约束下,决策行为难以被准确评估的问题。当前评估方法仅以单一任务成功率指标呈现结果,混淆了不同类型的失败模式,且无法揭示智能体是否符合其部署环境所需的对齐标准。解决方案的关键在于提出一个四维正交的对齐分解框架,包括事实精确性(Factual Precision, FRP)、推理连贯性(Reasoning Coherence, RCS)、合规重构性(Compliance Reconstruction, CRR)和校准弃权性(Calibrated Abstention, CAR),其中CRR为基于监管要求的新维度,CAR则独立区分覆盖率与准确性。通过在受控基准LongHorizon-Bench上的实证验证,该框架揭示了现有架构在各维度上的差异化表现,并识别出此前未被关注的决策对齐轴(decisional alignment),从而为监管决策场景中的智能体评估提供可测量、可诊断、可优化的分析路径。

链接: https://arxiv.org/abs/2604.19457
作者: Vasundra Srininvasan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures, 8 tables. PDFLaTeX. Code and artifacts: this https URL

点击查看摘要

Abstract:Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure that aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.
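
摘要中"单一聚合分数掩盖轴级失败"的含义可用一个极小的数字例子说明(四个轴名取自论文,数字为虚构):

```python
# 每条记录给出某 Agent 在四个对齐轴上的通过情况(1=通过,0=失败)
cases = [
    {"FRP": 1, "RCS": 1, "CRR": 0, "CAR": 1},
    {"FRP": 1, "RCS": 1, "CRR": 0, "CAR": 1},
    {"FRP": 1, "RCS": 0, "CRR": 1, "CAR": 1},
    {"FRP": 1, "RCS": 1, "CRR": 1, "CAR": 1},
]

# 聚合分数:只有四轴全部通过才算任务成功
aggregate = sum(all(c.values()) for c in cases) / len(cases)
# 轴级分解:逐轴统计通过率
per_axis = {ax: sum(c[ax] for c in cases) / len(cases) for ax in cases[0]}

print(aggregate)  # 0.25:只知道"大多失败",不知道败在哪个轴
print(per_axis)   # 轴级分解立刻暴露 CRR 是主要短板
```

同一个 0.25 的聚合分数可以对应完全不同的轴级画像,这正是论文主张按轴分解评估的理由。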

[AI-26] Counting Worlds Branching Time Semantics for post-hoc Bias Mitigation in generative AI

[Quick Read]: This paper addresses the problem of generative AI systems amplifying training-data bias across series of outputs, and in particular the lack of formal guarantees in existing inference-time mitigation strategies. The key to the solution is CTLF (Counting Temporal Logic for Fairness), a branching-time logic whose core innovation is a counting worlds semantics: each possible output at a given step of the generation process is modeled as a world, and new modal operators verify whether the current output series respects the intended probability distribution over a protected attribute, predict the likelihood of remaining fair as generation continues, and determine how many outputs must be removed to restore fairness. The framework is illustrated with a concrete example showing how CTLF formulas express fairness constraints at different stages of the output series.

Link: https://arxiv.org/abs/2604.19431
Authors: Alessandro G. Buda,Giuseppe Primiero,Leonardo Ceragioli,Melissa Antonelli
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI systems are known to amplify biases present in their training data. While several inference-time mitigation strategies have been proposed, they remain largely empirical and lack formal guarantees. In this paper we introduce CTLF, a branching-time logic designed to reason about bias in series of generative AI outputs. CTLF adopts a counting worlds semantics where each world represents a possible output at a given step in the generation process and introduces modal operators that allow us to verify whether the current output series respects an intended probability distribution over a protected attribute, to predict the likelihood of remaining within acceptable bounds as new outputs are generated, and to determine how many outputs are needed to remove in order to restore fairness. We illustrate the framework on a toy example of biased image generation, showing how CTLF formulas can express concrete fairness properties at different points in the output series.
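As a rough illustration of the counting idea (our own toy, not the CTLF semantics or its modal operators): treat each output in the series as a world carrying a protected-attribute value, measure the deviation from a target distribution, and count how many outputs must be removed to restore fairness.

```python
# Toy sketch of the counting-worlds intuition. All names and numbers here
# are invented for illustration; CTLF itself is a modal logic, not this code.

def bias_gap(outputs, attr_value, target_ratio):
    """Deviation of the observed attribute frequency from the target."""
    observed = sum(1 for o in outputs if o == attr_value) / len(outputs)
    return observed - target_ratio

def removals_to_restore(outputs, attr_value, target_ratio, tol=0.05):
    """Greedily drop over-represented outputs until the gap is within tol."""
    outputs = list(outputs)
    removed = 0
    while abs(bias_gap(outputs, attr_value, target_ratio)) > tol:
        if bias_gap(outputs, attr_value, target_ratio) > 0:
            outputs.remove(attr_value)  # too many outputs with the attribute
        else:
            outputs.remove(next(o for o in outputs if o != attr_value))
        removed += 1
    return removed

series = ["A"] * 8 + ["B"] * 2  # 80% "A", target 50%, tolerance 10%
print(removals_to_restore(series, "A", 0.5, tol=0.1))  # 5 removals needed
```

This mirrors the third operator described above (how many outputs must be removed), while the first two operators correspond to checking and forecasting the gap.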

[AI-27] M2GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

[Quick Read]: This paper addresses the challenges that cooperative pursuit poses for biomimetic underwater robots, namely long-horizon decision making, partial observability, and multi-agent coordination, which place high demands on both policy expressiveness and stability. The key to the solution is M²GRPO, a Mamba-based multi-agent group relative policy optimization framework with two core innovations: 1) a selective state-space Mamba policy that models long-horizon temporal dependencies from observation histories and encodes inter-agent interactions via attention-based relational features, producing bounded continuous actions; and 2) under the centralized-training decentralized-execution (CTDE) paradigm, a group-relative advantage mechanism that normalizes per-agent rewards within each episode for more stable credit assignment, combined with a multi-agent extension of GRPO that substantially reduces training resource demands while ensuring stable and scalable policy updates.

Link: https://arxiv.org/abs/2604.19404
Authors: Yukai Feng,Zhiheng Wu,Zhengxing Wu,Junwen Gu,Junzhi Yu
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M ^2 GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M ^2 GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.
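The group-relative advantage described above (rewards normalized across agents within an episode) can be sketched in a few lines; this is our own minimal illustration of the normalization step, not the authors' implementation.

```python
# Minimal sketch of a GRPO-style group-relative advantage: each agent's
# reward is standardized against the group's mean and standard deviation
# within one episode, so advantages are relative rather than absolute.
import math

def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

adv = group_relative_advantages([1.0, 2.0, 3.0])
print([round(a, 3) for a in adv])  # [-1.225, 0.0, 1.225]
```

The `eps` term keeps the division well-defined when all agents receive the same reward, in which case every advantage is zero.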

[AI-28] Revisiting Catastrophic Forgetting in Continual Knowledge Graph Embedding

[Quick Read]: This paper addresses a bias in evaluating catastrophic forgetting for Continual Knowledge Graph Embedding (CKGE) methods caused by ignoring entity interference. Existing CKGE methods mitigate catastrophic forgetting mainly by limiting changes to existing embeddings, but this paper shows that when new entities are introduced, their embeddings can interfere with previously learned ones, leading the model to predict new entities in place of previously correct answers. This phenomenon, termed entity interference, has not been accounted for in existing CKGE evaluation protocols, so method performance is systematically overestimated. The key to the solution is a corrected CKGE evaluation protocol that explicitly accounts for entity interference, together with a catastrophic forgetting metric tailored to CKGE; experiments show that ignoring interference inflates performance by up to 25%, especially under significant entity growth.

Link: https://arxiv.org/abs/2604.19401
Authors: Gerard Pons,Carlos Escolano,Besim Bilalli,Anna Queralt
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Pre-print submitted

Abstract:Knowledge Graph Embeddings (KGEs) support a wide range of downstream tasks over Knowledge Graphs (KGs). In practice, KGs evolve as new entities and facts are added, motivating Continual Knowledge Graph Embedding (CKGE) methods that update embeddings over time. Current CKGE approaches address catastrophic forgetting (i.e., the performance degradation on previously learned tasks) primarily by limiting changes to existing embeddings. However, we show that this view is incomplete. When new entities are introduced, their embeddings can interfere with previously learned ones, causing the model to predict them in place of previously correct answers. This phenomenon, which we call entity interference, has been largely overlooked and is not accounted for in current CKGE evaluation protocols. As a result, the assessment of catastrophic forgetting becomes misleading, and CKGE methods performance is systematically overestimated. To address this issue, we introduce a corrected CKGE evaluation protocol that accounts for entity interference. Through experiments on multiple benchmarks, we show that ignoring this effect can lead to performance overestimation of up to 25%, particularly in scenarios with significant entity growth. We further analyze how different CKGE methods and KGE models are affected by the different sources of forgetting, and introduce a catastrophic forgetting metric tailored to CKGE.
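The entity-interference effect can be made concrete with a toy ranking example (our own sketch; entity names and scores are invented): by re-ranking once with the new entities excluded, an evaluation can attribute rank degradation specifically to newly introduced entities.

```python
# Toy sketch of measuring entity interference in link-prediction ranking:
# how many rank positions does the gold entity lose specifically to
# entities introduced in a later task?

def rank_of(scores, correct):
    """1-based rank of `correct` among candidates sorted by score, descending."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(correct) + 1

def interference(scores, correct, new_entities):
    """Rank positions lost to newly introduced entities only."""
    full_rank = rank_of(scores, correct)
    old_scores = {e: s for e, s in scores.items() if e not in new_entities}
    old_rank = rank_of(old_scores, correct)
    return full_rank - old_rank

scores = {"e_old1": 0.9, "gold": 0.8, "e_new1": 0.85, "e_new2": 0.7}
print(interference(scores, "gold", {"e_new1", "e_new2"}))  # 1 rank lost to e_new1
```

A corrected protocol in this spirit separates degradation caused by drifting old embeddings from degradation caused by interfering new ones.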

[AI-29] GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models ACL2026

[Quick Read]: This paper addresses the substantial memory footprint and latency that model parameters, attention computation, and KV caches impose when serving large language models (LLMs). The key to the solution is GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Its core innovation is learning lightweight gate scores with a projected straight-through estimator that enforces a budget-satisfying hard mask at every training step while keeping the backbone weights frozen; scaling factors are then calibrated on the retained units and folded into the pruned weights, yielding a compact dense checkpoint with no extra parameters and enabling efficient deployment without full-model fine-tuning.

Link: https://arxiv.org/abs/2604.19398
Authors: Ziyang Wang,Jiangfeng Xiao,Chuan Xiao,Ruoxiang Li,Rui Mao,Jianbin Qin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026 Main Conference

Abstract:Large language models (LLMs) are expensive to serve because model parameters, attention computation, and KV caches impose substantial memory and latency costs. We present GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Instead of learning importance scores without constraints and applying the budget only after training, GRASPrune learns lightweight gate scores with a projected straight-through estimator that enforces a hard mask satisfying the budget at every step while keeping the backbone weights frozen. After the mask is fixed, we calibrate scaling factors on the retained units to mitigate scale mismatch caused by pruning, and fold these factors into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference. On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five benchmarks, using four epochs on 512 unlabeled calibration sequences on a single NVIDIA A100 80GB GPU without any full model fine-tuning.
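The budget projection at the heart of the method can be sketched as a hard top-k mask over gate scores (our own toy; in real training this hard mask would be paired with a straight-through estimator so gradients still reach the underlying scores, which this gradient-free sketch omits).

```python
# Sketch of projecting gate scores onto a hard 0/1 mask that satisfies a
# global budget at every step: keep the `budget` highest-scoring units,
# zero the rest. The straight-through backward pass is not shown.

def project_to_budget(scores, budget):
    """Hard 0/1 mask keeping the `budget` highest-scoring units."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget])
    return [1 if i in keep else 0 for i in range(len(scores))]

mask = project_to_budget([0.2, 0.9, 0.1, 0.7, 0.4], budget=2)
print(mask)       # [0, 1, 0, 1, 0]
print(sum(mask))  # 2: the budget holds exactly, at every step
```

Because the constraint is enforced during training rather than applied afterwards, the final mask never has to be repaired to meet the budget.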

[AI-30] Towards Energy Impact on AI-Powered 6G IoT Networks: Centralized vs. Decentralized

[Quick Read]: This paper addresses energy efficiency for machine learning (ML) applications in sixth-generation (6G) Internet of Things (IoT) networks, in particular the high energy cost of model training and data transmission. The key to the solution is a comparative analysis of the energy consumption of Centralized Learning (CL) versus distributed learning architectures, validated on a testbed deployed within the German railway infrastructure that uses sensor data for ML-based predictive maintenance. Results show that distributed learning maintains roughly 90% predictive accuracy while reducing overall electricity consumption by up to 70%, demonstrating that distributed ML can effectively cut transmission-related energy costs and improve energy efficiency in real-world IoT deployments.

Link: https://arxiv.org/abs/2604.19377
Authors: Anjie Qiu,Donglin Wang,Sanket Partani,Andreas Weinand,Hans D. Schotten
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures. Accepted for presentation at the IEEE GLOBECOM 2025 Workshop on Green Learning for Wireless Communications

Abstract:The emergence of sixth-generation (6G) technologies has introduced new challenges and opportunities for machine learning (ML) applications in Internet of Things (IoT) networks, particularly concerning energy efficiency. As model training and data transmission contribute significantly to energy consumption, optimizing these processes has become critical for sustainable system design. This study first conducts an analysis of the energy consumption models for both centralized and decentralized architectures and then presents a testbed deployed within the German railway infrastructure, leveraging sensor data for ML-based predictive maintenance. A comparative analysis of distributed versus Centralized Learning (CL) architectures reveals that distributed models maintain competitive predictive accuracy (~90%) while reducing overall electricity consumption by up to 70%, particularly by mitigating transmission-related energy costs. These findings underscore the potential of distributed ML to improve energy efficiency in real-world IoT deployments.
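The centralized-versus-decentralized trade-off can be seen in a back-of-the-envelope energy model (our own sketch with invented numbers and a simplified cost structure, not the paper's model): centralized learning ships raw sensor data to one site, while decentralized learning trains locally and transmits only compact model updates.

```python
# Toy energy model: centralized = ship all raw data + one central training;
# decentralized = per-node local training + shipping small model updates.
# Units and values below are invented for illustration only.

def centralized_energy(n_nodes, data_mb, e_tx_per_mb, e_train_central):
    return n_nodes * data_mb * e_tx_per_mb + e_train_central

def decentralized_energy(n_nodes, model_mb, e_tx_per_mb, e_train_local):
    return n_nodes * (e_train_local + model_mb * e_tx_per_mb)

# 10 nodes, 500 MB raw data vs. 5 MB model updates, invented costs.
cl = centralized_energy(10, 500.0, e_tx_per_mb=0.02, e_train_central=40.0)
dl = decentralized_energy(10, 5.0, e_tx_per_mb=0.02, e_train_local=6.0)
print(round(cl, 1), round(dl, 1))  # 140.0 61.0
print(dl < cl)  # True: transmission dominates, so decentralized wins here
```

The ordering flips when model updates are large or local training is expensive, which is why the paper measures both sides empirically.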

[AI-31] ACENR: Task-Agnostic Contrastive Explanations for Node Representations

[Quick Read]: This paper addresses the difficulty of interpreting node representations in graph representation learning: existing explainability methods are mostly restricted to supervised settings or to individual representation dimensions, and lack systematic explanations of a node's overall representation structure. The key to the solution is TACENR (Task-Agnostic Contrastive Explanations for Node Representations), which uses contrastive learning to learn a similarity function in the representation space and thereby identifies the attribute, proximity, and structural features that contribute most to a node's representation, yielding local explanations of its overall structure.

Link: https://arxiv.org/abs/2604.19372
Authors: Vasiliki Papanikou,Evaggelia Pitoura
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the XAI 2026 Conference. 24 pages, 10 figures

Abstract:Graph representation learning has achieved notable success in encoding graph-structured data into latent vector spaces, enabling a wide range of downstream tasks. However, these node representations remain opaque and difficult to interpret. Existing explainability methods primarily focus on supervised settings or on explaining individual representation dimensions, leaving a critical gap in explaining the overall structure of node representations. In this paper, we propose TACENR (Task-Agnostic Contrastive Explanations for Node Representations), a local explanation method that identifies not only attribute features but also proximity and structural ones that contribute the most in the representation space. TACENR builds on contrastive learning, through which we learn a similarity function in the representation space, revealing which are the features that play an important role in the representation of a node. While our focus is on task-agnostic explanations, TACENR can be applied to supervised scenarios as well. Experimental results demonstrate that proximity and structural features play a significant role in shaping node representations and that our supervised variant performs comparably to existing task-specific approaches in identifying the most impactful features.
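One simple way to attribute a similarity in representation space to input features is occlusion (our own illustrative sketch using plain cosine similarity; TACENR learns its similarity function contrastively rather than using this fixed one): zero one feature at a time and record how much the similarity to a neighbour drops.

```python
# Occlusion-style feature attribution for a similarity score: the drop in
# similarity when a feature is zeroed out measures that feature's
# contribution. All vectors here are invented toy data.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def occlusion_importance(x, neighbour):
    base = cosine(x, neighbour)
    scores = []
    for i in range(len(x)):
        x_occ = list(x)
        x_occ[i] = 0.0
        scores.append(base - cosine(x_occ, neighbour))  # drop = importance
    return scores

imp = occlusion_importance([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
print(max(range(3), key=lambda i: imp[i]))  # feature 0 matters most
```

Note that occluding a feature the neighbour lacks can even raise the similarity, giving that feature a negative importance.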

[AI-32] LASER: Learning Active Sensing for Continuum Field Reconstruction

[Quick Read]: This paper addresses how to obtain high-fidelity measurements of continuum physical fields under sparse and constrained sensing, where conventional reconstruction relies on fixed sensor layouts that cannot adapt to evolving physical states. The key to the solution is LASER, a unified closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP) and introduces a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This lets a reinforcement learning policy simulate "what-if" sensing scenarios in a latent imagination space and, by conditioning sensor movements on predicted latent states, automatically navigate toward informative regions beyond current observations, achieving higher-fidelity reconstruction than static or offline-optimized strategies.

Link: https://arxiv.org/abs/2604.19355
Authors: Huayu Deng,Jinghui Zhong,Xiangming Zhu,Yunbo Wang,Xiaokang Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: Preprint

Abstract:High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate ‘‘what-if’’ sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.

[AI-33] Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

[Quick Read]: This paper addresses the poorly understood capabilities of large language model (LLM) agents for autonomous cybersecurity tasks in realistic offensive settings. To fill this gap, the authors present DeepRed, an open-source benchmark that evaluates LLM agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. The key elements of the solution are: first, placing the agent in a Kali attacker environment connected to the target challenge and recording full execution traces; and second, a partial-credit scoring scheme based on challenge-specific checkpoints derived from public writeups, combined with an automated summarise-then-judge labelling pipeline that assigns checkpoint completion from logs, moving beyond binary solved/unsolved outcomes toward a finer-grained, interpretable evaluation.

Link: https://arxiv.org/abs/2604.19354
Authors: Ali Al-Kaswan,Maksim Plotnikov,Maxim Hájek,Roland Vízner,Arie van Deursen,Maliheh Izadi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments: Accepted to AIWare’26 Benchmark and Dataset Track

Abstract:Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
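The partial-credit idea reduces to a simple fraction once checkpoints are fixed (our own sketch; the checkpoint names below are invented, and in DeepRed the completed set is produced by the summarise-then-judge pipeline rather than given directly):

```python
# Checkpoint-based partial credit: score a challenge by the fraction of
# writeup-derived checkpoints the agent's trace completed, instead of a
# binary solved/unsolved flag.

def partial_credit(completed, checkpoints):
    return len(set(completed) & set(checkpoints)) / len(checkpoints)

checkpoints = ["recon", "foothold", "privesc", "flag"]  # hypothetical names
trace = ["recon", "foothold"]  # agent got halfway before stalling
print(partial_credit(trace, checkpoints))  # 0.5 rather than "unsolved"
```

Averaging this score across challenges gives the "average checkpoint completion" reported in the abstract (35% for the best model).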

[AI-34] Evaluation-driven Scaling for Scientific Discovery

[Quick Read]: This paper addresses how to scale evaluation-driven iterative discovery loops in a principled and efficient way to push the boundaries of scientific discovery; while prior work highlights the importance of evaluation, it lacks an explicit methodology for scaling such loops. The key to the solution is the Simple Test-time Evaluation-driven Scaling (SimpleTES) framework, which strategically combines parallel exploration, feedback-driven refinement, and local selection to scale along the evaluation axis, substantially improving generative discovery across scientific domains. Experiments show that SimpleTES achieves state-of-the-art results on 21 scientific problems spanning six domains, and that its trajectory-level histories enable feedback-driven post-training that makes models more efficient on seen tasks and able to solve unseen problems that base models fail to uncover.

Link: https://arxiv.org/abs/2604.19341
Authors: Haotian Ye,Haowei Lin,Jingyi Tang,Yizhen Luo,Caiyin Yang,Chang Su,Rahul Thapa,Rui Yang,Ruihua Liu,Zeyu Li,Chong Gao,Dachao Ding,Guangrong He,Miaolei Zhang,Lina Sun,Wenyang Wang,Yuchen Zhong,Zhuohao Shen,Di He,Jianzhu Ma,Stefano Ermon,Tongyang Li,Xiaowen Chu,James Zou,Yuzhi Xu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
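The three ingredients named above (parallel exploration, feedback-driven refinement, local selection) can be sketched as a generic loop. This is our own toy on a numeric objective; in SimpleTES the proposer is an LLM and the evaluator is a domain verifier or scoring function, not the stand-ins below.

```python
# Generic evaluation-driven discovery loop: propose several refinements of
# the incumbent in parallel, evaluate them all, keep the best. The toy
# objective and proposal operator below are invented placeholders.
import random

def evaluation_driven_search(init, evaluate, propose, rounds=20, width=4, seed=0):
    rng = random.Random(seed)
    best = init
    for _ in range(rounds):
        candidates = [propose(best, rng) for _ in range(width)]  # parallel exploration
        candidates.append(best)            # incumbent survives, so no regression
        best = max(candidates, key=evaluate)  # local selection on feedback
    return best

evaluate = lambda x: -(x - 3.0) ** 2          # stand-in for a real verifier
propose = lambda x, rng: x + rng.uniform(-1.0, 1.0)  # refinement as mutation

best = evaluation_driven_search(0.0, evaluate, propose)
print(evaluate(best) > evaluate(0.0))  # True: the loop monotonically improves
```

Scaling `width` and `rounds` is precisely the "scaling along the right dimensions" question the paper studies.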

[AI-35] HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models ACL2026

[Quick Read]: This paper addresses hallucination in Large Audio-Language Models (LALMs), i.e. responses that are semantically incorrect or acoustically unsupported, a problem that is widespread but underexplored: existing hallucination benchmarks focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. The solution is HalluAudio, the first large-scale audio hallucination benchmark with over 5K human-verified QA pairs spanning speech, environmental sound, and music, using adversarial prompts and mixed-audio conditions to systematically induce hallucinations. Its key contribution is a structured multi-dimensional evaluation protocol that measures not only accuracy but also hallucination rate, yes/no bias, error-type analysis, and refusal rate, enabling fine-grained diagnosis of LALM failure modes and providing the first systematic performance comparison across speech, sound, and music.

Link: https://arxiv.org/abs/2604.19300
Authors: Feiyu Zhao,Yiming Chen,Wenhuan Lu,Daipeng Zhang,Xianghu Yue,Jianguo Wei
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2026

Abstract:Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain. Existing hallucination benchmarks mainly focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. We therefore introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music. HalluAudio comprises over 5K human-verified QA pairs and spans diverse task types, including binary judgments, multi-choice reasoning, attribute verification, and open-ended QA. To systematically induce hallucinations, we design adversarial prompts and mixed-audio conditions. Beyond accuracy, our evaluation protocol measures hallucination rate, yes/no bias, error-type analysis, and refusal rate, enabling a fine-grained analysis of LALM failure modes. We benchmark a broad range of open-source and proprietary models, providing the first large-scale comparison across speech, sound, and music. Our results reveal significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding, underscoring the need for reliable and robust LALMs.
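Of the diagnostics listed, the yes/no bias is the easiest to make concrete (our own sketch with invented predictions; the benchmark's exact definition may differ): compare how often the model says "yes" with how often "yes" is actually correct.

```python
# Yes/no-bias diagnostic for binary audio QA: a model that answers "yes"
# far more often than the ground truth warrants is showing response bias
# rather than acoustic grounding.

def yes_no_bias(predictions, labels):
    """Predicted minus true 'yes' rate (0.0 = unbiased)."""
    pred_yes = sum(p == "yes" for p in predictions) / len(predictions)
    true_yes = sum(l == "yes" for l in labels) / len(labels)
    return pred_yes - true_yes

preds = ["yes", "yes", "yes", "no"]
labels = ["yes", "no", "yes", "no"]
print(yes_no_bias(preds, labels))  # 0.25: the model over-answers "yes"
```

A biased model can still score reasonable accuracy, which is why the protocol reports this quantity separately.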

[AI-36] Streamliners for Answer Set Programming

[Quick Read]: This paper addresses the large search spaces that slow down combinatorial solving, proposing a method that uses large language models (LLMs) to automatically generate streamliner constraints that shrink the feasible solution space in Answer Set Programming (ASP). The key to the solution is prompting LLMs to propose candidate constraints from a few small training instances, filtering candidates by syntactic correctness, preservation of satisfiability, and performance improvement, and combining the survivors with the original encoding into a Virtual Best Encoding (VBE). On several ASP Competition benchmarks this achieves speedups of up to 4-5x, and different LLMs produce semantically diverse constraints rather than syntactic variations, indicating that the approach captures genuine problem structure.

Link: https://arxiv.org/abs/2604.19251
Authors: Florentina Voboril,Martin Gebser,Stefan Szeider,Alice Tarzariol
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: To appear in Technical Communications of the 42nd International Conference on Logic Programming (ICLP 2026)

Abstract:Streamliner constraints reduce the search space of combinatorial problems by ruling out portions of the solution space. We adapt the StreamLLM approach, which uses Large Language Models (LLMs) to generate streamliners for Constraint Programming, to Answer Set Programming (ASP). Given an ASP encoding and a few small training instances, we prompt multiple LLMs to propose candidate constraints. Candidates that cause syntax errors, render satisfiable instances unsatisfiable, or degrade performance on all training instances are discarded. The surviving streamliners are evaluated together with the original encoding, and we report results for a virtual best encoding (VBE) that, for each instance, selects the fastest among the original encoding and its streamlined variants. On three ASP Competition benchmarks (Partner Units Problem, Sokoban, Towers of Hanoi), the VBE achieves speedups of up to 4–5x over the original encoding. Different LLMs produce semantically diverse constraints, not mere syntactic variations, indicating that the approach captures genuine problem structure.
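The candidate-filtering step described above can be sketched as a three-test pipeline (our own toy; the candidate records and speedup numbers are invented, and the real pipeline runs an actual ASP grounder/solver to obtain them):

```python
# Filter LLM-proposed streamliner candidates: discard any that fail to
# parse, make a satisfiable training instance unsatisfiable, or are slower
# on every training instance.

def filter_streamliners(candidates):
    """Each candidate: (name, parses, still_sat, speedups), where speedups
    lists per-instance runtime ratios (original time / streamlined time)."""
    kept = []
    for name, parses, still_sat, speedups in candidates:
        if parses and still_sat and any(s > 1.0 for s in speedups):
            kept.append(name)
    return kept

candidates = [
    ("c1", True, True, [2.0, 0.8]),   # helps on one instance: keep
    ("c2", False, True, [3.0, 3.0]),  # syntax error: drop
    ("c3", True, False, [1.5, 1.5]),  # breaks satisfiability: drop
    ("c4", True, True, [0.7, 0.9]),   # slower everywhere: drop
]
print(filter_streamliners(candidates))  # ['c1']
```

The VBE then takes, per test instance, the fastest among the original encoding and the kept streamlined variants.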

[AI-37] Industrial Surface Defect Detection via Diffusion Generation and Asymmetric Student-Teacher Network

[Quick Read]: This paper addresses three common challenges in industrial surface defect detection: scarce defect samples, severe long-tailed distributions, and difficulty localizing subtle defects against complex backgrounds. The key to the solution is an unsupervised method combining a Denoising Diffusion Probabilistic Model (DDPM) with an asymmetric teacher-student architecture: first, a DDPM trained only on normal samples generates high-fidelity, physically consistent defect samples with pixel-level annotations, mitigating data scarcity; second, an asymmetric dual-stream network is built in which the teacher provides stable representations of normal features while the student reconstructs normal patterns and amplifies discrepancies in anomalous regions; finally, a joint optimization strategy combining cosine similarity loss with pixel-wise segmentation supervision enables precise localization of subtle defects. Without requiring large numbers of real defect samples, the method reaches 98.4% image-level AUROC and 98.3% pixel-level AUROC on MVTecAD, significantly outperforming existing unsupervised and mainstream deep learning methods.

Link: https://arxiv.org/abs/2604.19240
Authors: Shuo Feng,Runlin Zhou,Yuyang Li,Guangcan Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Industrial surface defect detection often suffers from limited defect samples, severe long-tailed distributions, and difficulties in accurately localizing subtle defects under complex backgrounds. To address these challenges, this paper proposes an unsupervised defect detection method that integrates a Denoising Diffusion Probabilistic Model (DDPM) with an asymmetric teacher-student architecture. First, at the data level, the DDPM is trained solely on normal samples. By introducing constant-variance Gaussian perturbations and Perlin noise-based masks, high-fidelity and physically consistent defect samples along with pixel-level annotations are generated, effectively alleviating the data scarcity problem. Second, at the model level, an asymmetric dual-stream network is constructed. The teacher network provides stable representations of normal features, while the student network reconstructs normal patterns and amplifies discrepancies between normal and anomalous regions. Finally, a joint optimization strategy combining cosine similarity loss and pixel-wise segmentation supervision is adopted to achieve precise localization of subtle defects. Experimental results on the MVTecAD dataset show that the proposed method achieves 98.4% image-level AUROC and 98.3% pixel-level AUROC, significantly outperforming existing unsupervised and mainstream deep learning methods. The proposed approach does not require large amounts of real defect samples and enables accurate and robust industrial defect detection and localization. Keywords: industrial defect detection; diffusion models; data generation; teacher-student architecture; pixel-level localization
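The data-level idea, perturbing a normal image only inside a mask so the annotation comes for free, can be sketched without the diffusion model (our own toy: a rectangular mask and plain Gaussian noise stand in for the paper's trained DDPM and Perlin-noise masks):

```python
# Toy synthetic-defect generation: add Gaussian noise to a normal image
# only inside a mask region, returning both the defective image and the
# matching pixel-level 0/1 annotation.
import random

def synth_defect(image, mask_box, sigma=0.5, seed=0):
    """`mask_box` = (r0, r1, c0, c1), half-open ranges; returns (image, mask)."""
    rng = random.Random(seed)
    r0, r1, c0, c1 = mask_box
    out = [row[:] for row in image]           # leave the original untouched
    ann = [[0] * len(image[0]) for _ in image]
    for r in range(r0, r1):
        for c in range(c0, c1):
            out[r][c] += rng.gauss(0.0, sigma)  # constant-variance perturbation
            ann[r][c] = 1                        # annotation comes for free
    return out, ann

normal = [[0.0] * 4 for _ in range(4)]
defect, ann = synth_defect(normal, (1, 3, 1, 3))
print(sum(map(sum, ann)))  # 4 annotated defect pixels
```

The paper's DDPM-based generator plays the same role but produces perturbations that are consistent with the learned distribution of normal surfaces.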

[AI-38] UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

[Quick Read]: This paper addresses the accumulated latency, information loss, and error propagation of traditional cascaded speech pipelines in full-duplex speech interaction systems. Although end-to-end audio large language models (LLMs) unify speech understanding and generation, most are inherently half-duplex and still rely on separate task-specific front-end components such as voice activity detection (VAD) and turn-taking detection (TD). The key innovation is UAF, the first unified audio front-end LLM tailored for full-duplex speech systems, which reformulates diverse front-end tasks, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA), as a single auto-regressive sequence prediction problem: it takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, anchors the target speaker with a reference audio prompt, and regressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals), significantly improving response latency and interruption accuracy in real-world interaction scenarios.

Link: https://arxiv.org/abs/2604.19221
Authors: Yadong Li,Guoxin Wu,Haiping Hou,Biye Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on the end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation task. However, most of these models are inherently half-duplex, and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of speech assistant, we observed that optimizing the speech front-end is equally crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks into a single auto-regressive sequence prediction problem, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR) and question answer (QA). It takes streaming fixed-duration audio chunk (e.g., 600 ms) as input, leverages a reference audio prompt to anchor the target speaker at the beginning, and regressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly enhances response latency and interruption accuracy in real-world interaction scenarios.

[AI-39] Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

[Quick Read]: This paper addresses a key challenge of privacy-preserving entity alignment (PPEA) in Vertical Federated Learning (VFL): aligning data across multiple parties without revealing intersection membership. Conventional private set intersection (PSI) achieves alignment but leaks membership; standard private set union (PSU) mitigates this, yet existing approaches are usually limited to two parties and lack typo-tolerant matching. The key innovations of the proposed multi-party PSU protocol are: a universal index mapping that maps local records into a shared index space, enabling multi-party alignment with low communication overhead; and two variants, an order-preserving version for exact matching and an unordered version tolerant of typographical and formatting discrepancies, balancing privacy protection with practical needs. Correctness and privacy are proven and computational and communication complexity analyzed, providing a scalable, mathematically grounded PPEA foundation for cross-institutional scenarios such as healthcare and finance.

Link: https://arxiv.org/abs/2604.19219
Authors: Daniel M. Jimenez-Gutierrez,Enrique Zuazua,Georgios Kellaris,Joaquin Del Rio,Oleksii Sliusarenko,Xabi Uribe-Etxebarria
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching. In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.
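The universal index mapping can be illustrated with a deliberately non-private toy (our own sketch with no cryptography whatsoever; the actual protocol hides membership behind cryptographic primitives and exchanges no plaintext identifiers): every party maps its local records into one shared index space built over the union of identifiers, so the aligned indices alone never reveal which records are shared.

```python
# Plaintext toy of the universal index mapping over the UNION of
# identifiers (party names and identifiers are invented). The real PSU
# protocol computes this mapping without revealing raw identifiers.

def universal_index(parties):
    """Shared index space over the union of all parties' identifiers."""
    union = sorted(set().union(*[set(p) for p in parties]))
    return {ident: i for i, ident in enumerate(union)}

def local_to_shared(local_ids, index):
    """Map one party's local records into the shared index space."""
    return [index[i] for i in local_ids]

bank = ["alice", "carol"]
insurer = ["alice", "bob"]
idx = universal_index([bank, insurer])
print(local_to_shared(bank, idx))     # [0, 2]
print(local_to_shared(insurer, idx))  # [0, 1]
```

Because the union, not the intersection, is indexed, a party seeing only its own shared indices cannot tell which of them the other party also holds.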

[AI-40] ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation

【速读】:该论文旨在解决当前AI代理框架仅服务于单一用户、缺乏跨用户协作基础设施与治理机制的问题,指出AI代理的下一发展前沿在于数字化人类协作关系而非单纯提升个体能力。其核心解决方案是提出一种“人机共生代理(human-symbiotic agent)”范式,关键在于构建三层治理原语:分层身份架构将管理代理(Manager Agent)与多个场景特定的身份代理(Identity Agent)分离,确保全局知识隔离;作用域授权机制实现按身份的访问控制并触发越界行为上报;操作级问责机制对每项操作记录归属身份和授权信息,保障审计可追溯性。该范式通过ClawNet框架落地,由中心编排器强制执行身份绑定与授权验证,从而支持多用户代理间安全协作。

Link: https://arxiv.org/abs/2604.19211
Authors: Zhiqin Yang,Zhenyuan Zhang,Xianzhang Jia,Jun Song,Wei Xue,Yonggang Zhang,Yike Guo
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user. Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate. When agents move beyond performing tasks for one person to representing that person in collaboration with others, the infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it. We argue that the next frontier for AI agents lies not in stronger individual capability, but in the digitization of human collaborative relationships. To this end, we propose a human-symbiotic agent paradigm. Each user owns a permanently bound agent system that collaborates on the owner’s behalf, forming a network whose nodes are humans rather than agents. This paradigm rests on three governance primitives. A layered identity architecture separates a Manager Agent from multiple context-specific Identity Agents; the Manager Agent holds global knowledge but is architecturally isolated from external communication. Scoped authorization enforces per-identity access control and escalates boundary violations to the owner. Action-level accountability logs every operation against its owner’s identity and authorization, ensuring full auditability. We instantiate this paradigm in ClawNet, an identity-governed agent collaboration framework that enforces identity binding and authorization verification through a central orchestrator, enabling multiple users to collaborate securely through their respective agents.
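The scoped-authorization primitive can be sketched as a per-identity access check (our own toy; identity names, scope names, and the string return values are invented): in-scope actions proceed, while out-of-scope requests are escalated to the owner rather than silently executed or silently dropped.

```python
# Toy scoped authorization: each Identity Agent acts only within its
# granted scopes; boundary violations are surfaced to the owner.

def authorize(identity, action, grants):
    scopes = grants.get(identity, set())  # unknown identities get no scopes
    if action in scopes:
        return "allow"
    return "escalate_to_owner"  # violation is reported, not hidden

grants = {
    "work_identity": {"read_calendar", "send_email"},
    "family_identity": {"read_calendar"},
}
print(authorize("work_identity", "send_email", grants))    # allow
print(authorize("family_identity", "send_email", grants))  # escalate_to_owner
```

Action-level accountability would additionally log each `(identity, action, decision)` triple, which is what makes the audit trail described above possible.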

[AI-41] Inductive Subgraphs as Shortcuts: Causal Disentanglement for Heterophilic Graph Learning SIGIR2026

【速读】:该论文旨在解决异质性图(heterophilic graphs)中图神经网络(GNNs)性能下降的问题,其核心原因是传统GNNs在异质性图中容易受到虚假捷径(spurious shortcuts)的干扰,这些捷径源于反复出现的归纳子图(recurring inductive subgraphs),导致模型学习到非因果相关性而非真实因果信号。解决方案的关键在于引入因果推断视角,构建一个去偏因果图(debiased causal graph),显式阻断引起虚假捷径的混杂路径(confounding paths)和溢出路径(spillover paths),并据此提出Causal Disentangled GNN(CD-GNN)框架,通过显式分离虚假归纳子图与真实因果子图,聚焦于本质因果信号,从而显著提升节点分类的鲁棒性和准确性。

链接: https://arxiv.org/abs/2604.19186
作者: Xiangmeng Wang,Qian Li,Haiyang Xia,Hao Miao,Qing Li,Guandong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: SIGIR 2026

点击查看摘要

Abstract:Heterophily is a prevalent property of real-world graphs and is well known to impair the performance of homophilic Graph Neural Networks (GNNs). Prior work has attempted to adapt GNNs to heterophilic graphs through non-local neighbor extension or architecture refinement. However, the fundamental reasons behind misclassifications remain poorly understood. In this work, we take a novel perspective by examining recurring inductive subgraphs, empirically and theoretically showing that they act as spurious shortcuts that mislead GNNs and reinforce non-causal correlations in heterophilic graphs. To address this, we adopt a causal inference perspective to analyze and correct the biased learning behavior induced by shortcut inductive subgraphs. We propose a debiased causal graph that explicitly blocks confounding and spillover paths responsible for these shortcuts. Guided by this causal graph, we introduce Causal Disentangled GNN (CD-GNN), a principled framework that disentangles spurious inductive subgraphs from true causal subgraphs by explicitly blocking non-causal paths. By focusing on genuine causal signals, CD-GNN substantially improves the robustness and accuracy of node classification in heterophilic graphs. Extensive experiments on real-world datasets not only validate our theoretical findings but also demonstrate that our proposed CD-GNN outperforms state-of-the-art heterophily-aware baselines.

[AI-42] Reasoning-Aware AIGC Detection via Alignment and Reinforcement

【速读】:该论文旨在解决生成式 AI (Generative AI) 内容检测(AIGC detection)的可靠性问题,尤其是在大型语言模型(LLM)持续演进背景下,传统检测方法难以适应多样化生成源与作者场景的挑战。解决方案的关键在于提出一个名为 REVEAL 的检测框架,其核心创新是通过两阶段训练策略生成可解释的推理链(reasoning chains):首先进行监督微调以建立推理能力,再通过强化学习优化分类准确性、逻辑一致性并减少幻觉现象,从而在多个基准测试中实现最先进的检测性能,同时提供透明且可信的判断过程。

链接: https://arxiv.org/abs/2604.19172
作者: Zhao Wang,Max Xiong,Jianxun Lian,Zhicheng Dou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid advancement and widespread adoption of Large Language Models (LLMs) have elevated the need for reliable AI-generated content (AIGC) detection, which remains challenging as models evolve. We introduce AIGC-text-bank, a comprehensive multi-domain dataset with diverse LLM sources and authorship scenarios, and propose REVEAL, a detection framework that generates interpretable reasoning chains before classification. Our approach uses a two-stage training strategy: supervised fine-tuning to establish reasoning capabilities, followed by reinforcement learning to improve accuracy, improve logical consistency, and reduce hallucinations. Extensive experiments show that REVEAL achieves state-of-the-art performance across multiple benchmarks, offering a robust and transparent solution for AIGC detection. The project is open-source at this https URL

[AI-43] LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在资源受限环境中的部署难题,其核心挑战在于模型对计算和内存的高需求。为应对这一问题,作者提出了一种轻量级二值化框架 LBLLM,其关键创新在于采用一种新颖的三阶段量化策略:首先通过训练后量化(Post-Training Quantization, PTQ)初始化高质量量化模型;其次,在保持激活值全精度的前提下,通过逐层蒸馏实现权重、分组位图(group-wise bitmaps)和量化参数的二值化;最后,引入可学习的激活量化因子,动态地将激活值量化至4比特。这种解耦设计有效缓解了权重与激活量化之间的干扰,提升了训练稳定性并改善了推理准确性,且仅需0.016B tokens 和单个GPU即可完成训练,无需额外的高精度通道或旋转矩阵,显著提高了极端低比特量化(W(1+1)A4)的实际可行性和有效性。

链接: https://arxiv.org/abs/2604.19167
作者: Siqing Song,Chuang Wang,Yong Lang,Yi Yang,Xu-Yao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) training learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained only using 0.016B tokens with a single GPU, surpasses existing state-of-the-art binarization methods on W2A4 quantization settings across tasks of language modeling, commonsense QA, and language understanding. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing any extra high-precision channels or rotational matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited situations.
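论文中的权重二值化依赖"符号位图 + 分组缩放因子"的表示。下面用numpy给出分组二值化的最小示意。这是假设性代码,仅演示该表示的一般形式(每组用 sign(W) 与标量 α = mean|W| 近似原权重),并非LBLLM的三阶段算法本身。

```python
import numpy as np

def binarize_groupwise(w, group_size):
    """按组二值化:每组权重近似为 符号位图 × 该组缩放因子 α。"""
    w = w.reshape(-1, group_size)
    alpha = np.abs(w).mean(axis=1, keepdims=True)  # α = mean|w| 是L2意义下的最优标量
    bitmap = np.sign(w)
    bitmap[bitmap == 0] = 1  # 零权重归入 +1,保证位图只有 ±1
    return bitmap * alpha    # 反量化后的近似权重

rng = np.random.default_rng(0)
w = rng.normal(size=64)
approx = binarize_groupwise(w, group_size=8).reshape(-1)
print(np.mean((w - approx) ** 2))  # 分组缩放后的重构误差
```

存储时只需每权重1比特位图加每组一个缩放因子,这正是极低比特量化节省内存的来源。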

[AI-44] Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

【速读】:该论文旨在解决Transformer模型在扩展过程中因线性投影限制导致的表达能力不足与增量扩展困难的问题(即标准架构难以在不丢弃已学习表征的情况下进行有效扩容)。其解决方案的关键在于提出Nexusformer,通过将传统的线性Q/K/V投影替换为一种三阶段非线性映射结构——Nexus-Rank层,该层由双激活机制驱动,并在逐级升维的空间中实现特征变换,从而突破线性约束;同时,新引入的零初始化块可沿两个维度注入容量且保持预训练知识不变,实现无损结构化增长。

链接: https://arxiv.org/abs/2604.19147
作者: Weijie Zhao,Mingquan Liu,Bolun Wang,Simo Wu,Nuobei Xie,Rui-Jie Zhu,Peng Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism’s linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear Q/K/V projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces. This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer’s perplexity using up to 41.5% less training compute during progressive scaling (240M to 440M). Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.

[AI-45] Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory

【速读】:该论文旨在解决自动化作文评分(Automated Essay Scoring, AES)模型性能评估中缺乏理论与实践基准的问题,尤其是在现有公共基准上使用二次加权卡帕系数(Quadratic Weighted Kappa, QWK)作为评价指标时,由于人工评分存在误差,导致难以判断QWK的理论上限和实际可达到水平。解决方案的关键在于基于经典测试理论中的信度概念,提出了两种数据集特定的QWK上限:一是理论上限(theoretical ceiling),即理想AES模型在标签噪声下能实现的最大QWK;二是类人上限(human-like ceiling),即具备人类评分误差水平的AES模型所能达到的QWK,为替代单个评分者提供实用目标。该方法无需额外标注即可从标准双评分者基准中估计这两个上限,并通过模拟实验和真实基准验证其有效性,从而清晰揭示当前AES模型的性能表现与潜在提升空间。

链接: https://arxiv.org/abs/2604.19131
作者: Masaki Uto
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication at AIED 2026 (full paper)

点击查看摘要

Abstract:Automated essay scoring (AES) is commonly evaluated on public benchmarks using quadratic weighted kappa (QWK). However, because benchmark labels are assigned by human raters and inevitably contain scoring errors, it remains unclear both what QWK is theoretically attainable and what level is practically sufficient for deployment. We therefore derive two dataset-specific QWK ceilings based on the reliability concept in classical test theory, which can be estimated from standard two-rater benchmarks without additional annotation. The first is the theoretical ceiling: the maximum QWK that an ideal AES model that perfectly predicts latent true scores can achieve under label noise. The second is the human-like ceiling: the QWK attainable by an AES model with human-level scoring error, providing a practical target when AES is intended to replace a single human rater. We further show that human–human QWK, often used as a ceiling reference, can underestimate the true ceiling. Simulation experiments validate the proposed ceilings, and experiments on real benchmarks illustrate how they clarify the current performance and remaining headroom of modern AES models.
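文中作为核心指标的QWK有标准定义,可以直接复现。下面是一个基于该标准定义的最小实现示意(假设性代码,非论文作者提供的实现):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, num_classes):
    """二次加权Kappa:衡量两组离散评分的一致性,惩罚随分差平方增长。"""
    a, b = np.asarray(a), np.asarray(b)
    observed = np.zeros((num_classes, num_classes))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # 期望矩阵:两边缘分布独立组合下的计数
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(a)
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], num_classes=3))  # 完全一致时为 1.0
```

完全一致时QWK为1,完全反向时趋向-1;论文正是在这一尺度上讨论标签噪声决定的可达上限。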

[AI-46] DP-FLogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs

【速读】:该论文旨在解决分布式环境中日志异常检测的隐私与协作难题,即在多个组织间无法集中原始日志数据的前提下,如何实现高效且安全的异常检测。其核心挑战在于现有基于大语言模型(Large Language Model, LLM)的方法依赖于集中式训练,难以满足隐私保护要求。解决方案的关键在于提出一种融合差分隐私(Differential Privacy, DP)与联邦学习(Federated Learning, FL)的框架——DP-FLogTinyLLM,通过在客户端采用低秩适配(Low Rank Adaptation, LoRA)技术对小型语言模型(Tiny LLM)进行高效微调,在不共享原始日志数据的情况下实现跨机构协同学习,从而在保证隐私的同时达到与集中式方法相当的检测性能,并显著优于现有联邦基线方法,尤其在减少误报方面表现突出。

链接: https://arxiv.org/abs/2604.19118
作者: Isaiah Thompson,Tanmay Sen,Ritwik Bhattacharya
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern distributed systems generate massive volumes of log data that are critical for detecting anomalies and cyber threats. However, in real world settings, these logs are often distributed across multiple organizations and cannot be centralized due to privacy and security constraints. Existing log anomaly detection methods, including recent large language model (LLM) based approaches, largely rely on centralized training and are not suitable for such environments. In this paper, we propose DP-FLogTinyLLM, a privacy preserving federated framework for log anomaly detection using parameter efficient LLMs. Our approach enables collaborative learning without sharing raw log data by integrating federated optimization with differential privacy. To ensure scalability in resource constrained environments, we employ low rank adaptation (LoRA) for efficient fine tuning of Tiny LLMs at each client. Empirical results on the Thunderbird and BGL datasets show that the proposed framework matches the performance of centralized LLM based methods, while incurring additional computational overhead due to privacy mechanisms. Compared to existing federated baselines, DP-FLogTinyLLM consistently achieves higher precision and F1-score, with particularly strong gains on the Thunderbird dataset, highlighting its effectiveness in detecting anomalies while minimizing false positives.
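该框架在客户端用LoRA对Tiny LLM做参数高效微调。LoRA的核心是把权重增量约束为低秩乘积、冻结原权重,下面给出numpy示意(假设性代码,仅演示LoRA的一般形式与参数量对比,非论文实现):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """LoRA前向:冻结W,只训练低秩因子 A (r×d_in) 与 B (d_out×r)。"""
    r = A.shape[0]
    delta = (B @ A) * (alpha / r)  # 低秩权重增量,可训练参数远少于W
    return x @ (W + delta).T

rng = np.random.default_rng(1)
d_in, d_out, r = 32, 16, 4
W = rng.normal(size=(d_out, d_in))      # 冻结的预训练权重
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))                # B零初始化:训练起点等于原模型
x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha=8)
```

联邦场景下客户端只需上传A、B(及其差分隐私噪声),通信与显存开销都随秩r线性缩小。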

[AI-47] Design Rules for Extreme-Edge Scientific Computing on AI Engines

【速读】:该论文旨在解决极端边缘(extreme-edge)科学应用中神经网络部署的优化问题,即如何在AI Engines与可编程逻辑(programmable logic)之间选择最优实现方式,以满足低延迟和高吞吐量需求。其核心挑战在于,传统空间数据流(spatial dataflow)在大规模模型下受限于资源扩展性,而现代FPGA SoC中的AI Engines虽具备高计算密度和额外片上内存,但其架构、编程模型及性能扩展特性与可编程逻辑存在本质差异,导致直接比较困难且优势不明确。解决方案的关键在于提出一种延迟调整的资源等价性(Latency-Adjusted Resource Equivalence, LARE)指标,通过系统级架构表征与微基准测试量化不同实现方式的性能边界,并结合面向低延迟科学推理的空间与API级数据流优化策略,最终验证了在hls4ml工具链下无法部署于可编程逻辑的端到端神经网络可在AI Engines上成功运行。

链接: https://arxiv.org/abs/2604.19106
作者: Zhenghua Ma,G Abarajithan,Dimitrios Danopoulos,Olivia Weng,Francesco Restuccia,Ryan Kastner
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Extreme-edge scientific applications use machine learning models to analyze sensor data and make real-time decisions. Their stringent latency and throughput requirements demand small batch sizes and require that model weights remain fully on-chip. Spatial dataflow implementations are common for extreme-edge applications. Spatial dataflow works well for small networks, but it fails to scale to larger models due to inherent resource scaling limitations. AI Engines on modern FPGA SoCs offer a promising alternative with high compute density and additional on-chip memory. However, the architecture, programming model, and performance-scaling behavior of AI Engines differ fundamentally from those of the programmable logic, making direct comparison non-trivial and the benefits of using AI Engines unclear. This work addresses how and when extreme-edge scientific neural networks should be implemented on AI Engines versus programmable logic. We provide systematic architectural characterization and micro-benchmarking and introduce a latency-adjusted resource equivalence (LARE) metric that identifies when AI Engine implementations outperform programmable logic designs. We further propose spatial and API-level dataflow optimizations tailored to low-latency scientific inference. Finally, we demonstrate the successful deployment of end-to-end neural networks on AI Engines that cannot fit on programmable logic when using the hls4ml toolchain.

[AI-48] Reinforcement Learning Enabled Adaptive Multi-Task Control for Bipedal Soccer Robots

【速读】:该论文旨在解决双足足球机器人在动态对抗环境中面临的运动稳定性差、多任务深度耦合以及状态切换(如直立行走与跌倒恢复)控制难题。其解决方案的关键在于提出了一种模块化的强化学习(Reinforcement Learning, RL)框架:首先,通过将开环前馈振荡器与基于强化学习的反馈残差策略相结合,实现了基础步态生成与复杂足球动作的解耦;其次,引入基于姿态的状态机机制,明确区分球探测与踢球网络(Ball Seeking and Kicking Network, BSKN)和跌倒恢复网络(Fall Recovery Network, FRN),从根本上避免了状态混淆问题,并采用渐进式力衰减课程学习策略高效训练FRN。该架构在Unity仿真中验证了出色的环境适应性与快速自主跌倒恢复能力(平均恢复时间0.715秒),确保了复杂多任务场景下的无缝稳定运行。

链接: https://arxiv.org/abs/2604.19104
作者: Yulai Zhang,Yinrong Zhang,Ting Wu,Linqi Ye
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developing bipedal football robots in dynamic combat environments presents challenges related to motion stability and deep coupling of multiple tasks, as well as control switching issues between different states such as upright walking and fall recovery. To address these problems, this paper proposes a modular reinforcement learning (RL) framework for achieving adaptive multi-task control. Firstly, this framework combines an open-loop feedforward oscillator with a reinforcement learning-based feedback residual strategy, effectively separating the generation of basic gaits from complex football actions. Secondly, a posture-driven state machine is introduced, clearly switching between the ball seeking and kicking network (BSKN) and the fall recovery network (FRN), fundamentally preventing state confusion. The FRN is efficiently trained through a progressive force attenuation curriculum learning strategy. The architecture was verified in Unity simulations of bipedal robots, demonstrating excellent spatial adaptability (reliably finding and kicking the ball even in restricted corner scenarios) and rapid autonomous fall recovery (with an average recovery time of 0.715 seconds). This ensures seamless and stable operation in complex multi-task environments.

[AI-49] Multi-Gait Learning for Humanoid Robots Using Reinforcement Learning with Selective Adversarial Motion Prior

【速读】:该论文旨在解决在统一强化学习框架下,使类人机器人掌握多种步态(如行走、正步走、跑步、爬楼梯和跳跃)时所面临的稳定性与动态表现之间的冲突问题。解决方案的关键在于提出一种选择性对抗运动先验(Selective Adversarial Motion Prior, AMP)策略:对周期性且稳定性关键的步态(行走、正步走、爬楼梯)应用AMP以加速收敛并抑制异常行为,而对高动态步态(跑步、跳跃)则省略AMP,避免其过度约束导致灵活性下降。该方法在保持统一策略结构、动作空间和奖励函数的前提下,实现了多步态技能的高效学习与零样本仿真到现实的迁移。

链接: https://arxiv.org/abs/2604.19102
作者: Yuanye Wu,Keyi Wang,Linqi Ye,Boyang Xing
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning diverse locomotion skills for humanoid robots in a unified reinforcement learning framework remains challenging due to the conflicting requirements of stability and dynamic expressiveness across different gaits. We present a multi-gait learning approach that enables a humanoid robot to master five distinct gaits – walking, goose-stepping, running, stair climbing, and jumping – using a consistent policy structure, action space, and reward formulation. The key contribution is a selective Adversarial Motion Prior (AMP) strategy: AMP is applied to periodic, stability-critical gaits (walking, goose-stepping, stair climbing) where it accelerates convergence and suppresses erratic behavior, while being deliberately omitted for highly dynamic gaits (running, jumping) where its regularization would over-constrain the motion. Policies are trained via PPO with domain randomization in simulation and deployed on a physical 12-DOF humanoid robot through zero-shot sim-to-real transfer. Quantitative comparisons demonstrate that selective AMP outperforms a uniform AMP policy across all five gaits, achieving faster convergence, lower tracking error, and higher success rates on stability-focused gaits without sacrificing the agility required for dynamic ones.

[AI-50] RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

【速读】:该论文旨在解决当前视频世界模型(video world models)在生成行为时缺乏物理可执行性的问题,即虽然生成视频具有视觉真实性,但其中的行为可能违背物理规律,导致在机器人执行时失败。解决方案的关键在于提出RoboWM-Bench——一个以操作任务为中心的、面向具身验证的基准测试框架,通过将人类或机器人操作视频中预测的行为转化为可执行的动作序列,并在真实机器人平台上进行验证,从而系统评估视频世界模型生成行为的物理可行性与任务完成能力。该方法实现了从感知层面到动作层面的闭环验证,为开发更符合物理规律的生成式AI(Generative AI)提供了可量化、可复现的评估标准。

链接: https://arxiv.org/abs/2604.19092
作者: Feng Jiang,Yang Chen,Kyle Xu,Yuchen Liu,Haifeng Wang,Zhenhao Shen,Jasper Lu,Shengze Huang,Yuanfei Wang,Chen Xie,Ruihai Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of leveraging imagined videos for robot learning. However, visual realism does not imply physical plausibility, and behaviors inferred from generated videos may violate dynamics and fail when executed by embodied agents. Existing benchmarks begin to incorporate notions of physical plausibility, but they largely remain perception- or diagnostic-oriented and do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete the intended task. To address this gap, we introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated behaviors from both human-hand and robotic manipulation videos into embodied action sequences and validates them through robotic execution. The benchmark spans diverse manipulation scenarios and establishes a unified protocol for consistent and reproducible evaluation. Using RoboWM-Bench, we evaluate state-of-the-art video world models and find that reliably generating physically executable behaviors remains an open challenge. Common failure modes include errors in spatial reasoning, unstable contact prediction, and non-physical deformations. While finetuning on manipulation data yields improvements, physical inconsistencies still persist, suggesting opportunities for more physically grounded video generation for robots.

[AI-51] Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在持续知识更新场景下存在的两个核心问题:一是参数编辑方法在连续编辑过程中因灾难性遗忘(catastrophic forgetting)导致的稳定性不足;二是基于检索的方法因训练成本高而在不同数据集上适用性受限。解决方案的关键在于提出LightEdit框架,其通过两个核心机制实现高效且稳定的终身知识编辑:首先从检索到的信息中筛选相关知识以有效修改查询;其次引入一种解码策略,抑制模型原有知识的概率分布,从而基于选定信息实现精准的知识覆盖与更新。该方法显著提升了编辑效果和跨数据集的可扩展性,同时大幅降低训练成本。

链接: https://arxiv.org/abs/2604.19089
作者: Dahyun Jung,Jaewook Lee,Heuiseok Lim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) require frequent knowledge updates to reflect changing facts and mitigate hallucinations. To meet this demand, lifelong knowledge editing has emerged as a continual approach to modify specific pieces of knowledge without retraining the entire model. Existing parameter editing methods struggle with stability during sequential edits due to catastrophic forgetting. While retrieval-based approaches are proposed to alleviate this issue, their applicability remains limited across various datasets because of high training costs. To address these limitations and enhance scalability in lifelong settings, we propose LightEdit. Our framework first selects relevant knowledge from retrieved information to modify the query effectively. It then incorporates a decoding strategy to suppress the model’s original knowledge probabilities, thereby enabling efficient edits based on the selected information. Extensive experiments on ZSRE, Counterfact, and RIPE benchmarks demonstrate that LightEdit outperforms existing lifelong knowledge editing methods. Furthermore, by minimizing training costs, LightEdit achieves cost-effective scalability, enabling easy adaptation to various datasets.
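"压低模型原有知识概率"这类解码策略在形式上接近对比式解码。下面是一个通用示意(假设性实现:logits的组合方式与系数beta均为演示用,并非LightEdit论文中的实际公式):

```python
import numpy as np

def suppressed_decode(edited_logits, original_logits, beta):
    """对比式解码示意:按beta削弱原模型的logits,突出编辑后的知识。"""
    adjusted = edited_logits - beta * original_logits
    probs = np.exp(adjusted - adjusted.max())  # 数值稳定的softmax
    return probs / probs.sum()

# 假想场景:token 0 是陈旧答案(原模型强偏好,编辑后仍有残余偏好)
edited = np.array([2.0, 1.9, 0.0])
original = np.array([5.0, 0.0, 0.0])
print(np.argmax(suppressed_decode(edited, original, beta=0.0)))  # 不抑制:仍输出旧答案
print(np.argmax(suppressed_decode(edited, original, beta=1.0)))  # 抑制后:切换到新知识
```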

[AI-52] OLLM: Options-based Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理任务中缺乏可控性、鲁棒性和样本效率的问题。现有方法依赖温度调节或采样策略来引入多样性,但难以有效控制生成过程,易产生语言切换或退化推理等偏差。其解决方案的关键在于提出Options LLM(OLLM),通过将标准LLM的单次token预测替换为由离散潜变量索引的一组学习到的选项(option set),显式建模多个合理的下一个token选择。该方法以轻量级“插件”形式插入编码器和解码器层至预训练模型输出头前,仅需极少额外参数(如1.7B参数模型中仅1.56%可训练),即可实现对生成空间的结构化控制。进一步地,通过在低维潜空间中训练紧凑策略网络(policy)来选择最优选项,显著提升奖励优化的样本效率,并因约束于监督微调(SFT)阶段学到的选项而自然实现对齐,无需额外KL散度或人工设计的对齐损失。

链接: https://arxiv.org/abs/2604.19087
作者: Shashank Sharma,Janina Hoffmann,Vinay Namboodiri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a set of learned options for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight “plug-in” that inserts two layers: an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only 1.56% of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at 51% final answer correctness, while OLLM’s option set allows up to ~70% under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.

[AI-53] ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的后门攻击安全漏洞问题,尤其是此类攻击机制不透明、难以理解与防御的挑战。现有研究虽已证明可通过数据投毒在MLLMs中植入后门,但其内在工作机制仍不清晰,制约了有效检测与缓解策略的发展。为此,作者提出ProjLens这一可解释性框架,其核心在于揭示后门注入的低秩结构特征和独特的激活机制:首先发现即使仅对投影器(projector)进行微调,也会引入后门脆弱性,且该机制区别于纯文本大语言模型(text-only LLMs);其次通过实验验证,后门关键参数存在于投影器的低秩子空间中,且干净样本与中毒样本在嵌入空间均向一个与目标标签对齐的方向发生语义偏移,但偏移幅度随输入范数线性增长,从而导致中毒样本被激活。此发现为后门检测与防御提供了理论基础与技术路径。

链接: https://arxiv.org/abs/2604.19083
作者: Kun Wang,Cheng Qian,Miao Yu,Lilan Peng,Liang Lin,Jiaming Zhang,Tianyu Zhang,Yu Cheng,Yang Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 18 pages ,15 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating the understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLMs backdoors. We first establish that normal downstream task alignment–even when restricted to projector fine–tuning–introduces vulnerability to backdoor injection, whose activation mechanism is different from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover:(1) Low-Rank Structure: Backdoor injection updates appear overall full-rank and lack dedicated ``trigger neurons’', but the backdoor-critical parameters are encoded within a low-rank subspace of the projector;(2) Activation Mechanism: Both clean and poisoned embedding undergoes a semantic shift toward a shared direction aligned with the backdoor target, but the shifting magnitude scales linearly with the input norm, resulting in the distinct backdoor activation on poisoned samples. Our code is available at: this https URL

[AI-54] S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection

【速读】:该论文旨在解决半监督学习中因输入变量冗余或噪声导致图拉普拉斯矩阵(graph Laplacian matrix)依赖预设相似性度量而产生不当惩罚的问题,从而影响模型的鲁棒性和可解释性。解决方案的关键在于提出一种基于双层优化框架的半监督元加法模型(Semi-Supervised Meta Additive Model, S²MAM),该方法能够自动识别信息丰富的变量、动态更新相似性矩阵,并同时实现可解释的预测结果。理论分析表明该方法具有计算收敛性和统计泛化界保证,实验验证了其在多种合成与真实数据集上的鲁棒性与可解释性优势。

链接: https://arxiv.org/abs/2604.19072
作者: Xuelin Zhang,Hong Chen,Yingjie Wang,Tieliang Gong,Bin Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. To address the above issues, this paper proposes a new Semi-Supervised Meta Additive Model (S²MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S²MAM, including the computational convergence and the statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach.
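文中讨论的图拉普拉斯正则可以用几行代码说明其与相似度矩阵的关系(示意代码,基于标准定义 L = D − W,与论文的双层优化框架无关):

```python
import numpy as np

def graph_laplacian(W):
    """由对称相似度矩阵W构造图拉普拉斯矩阵 L = D - W。"""
    D = np.diag(W.sum(axis=1))
    return D - W

def laplacian_penalty(f, L):
    """流形正则项 f^T L f = (1/2) * sum_ij W_ij (f_i - f_j)^2。"""
    return float(f @ L @ f)

W = np.array([[0, 1, 0],
              [1, 0, 2],
              [0, 2, 0]], dtype=float)
f = np.array([1.0, 2.0, 4.0])
print(laplacian_penalty(f, graph_laplacian(W)))
```

该正则惩罚相似样本(W_ij大)之间预测值的差异;论文指出的问题正是W由预设度量给出时,冗余或噪声变量会让这一惩罚失真。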

[AI-55] Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports

【速读】:该论文旨在解决从放射学报告中准确进行疾病分类的问题,这一任务对医学人工智能应用至关重要。现有方法中,轻量级大语言模型(Large Language Models, LLMs)通过监督微调(Supervised Fine-Tuning, SFT)虽能提升分类准确性,但可能导致推理能力下降。解决方案的关键在于提出一种两阶段优化框架:第一阶段使用SFT在疾病标签上进行微调以提高分类性能;第二阶段引入组相对策略优化(Group Relative Policy Optimization, GRPO),在无需额外推理监督的情况下,通过联合优化分类准确性和输出格式来进一步提升预测质量,并显著增强模型的推理召回率与全面性。

链接: https://arxiv.org/abs/2604.19060
作者: Yishu Wei,Yi Lin,Adam Flanders,George Shih,Yifan Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate disease classification from radiology reports is essential for many applications. While supervised fine-tuning (SFT) of lightweight LLMs improves accuracy, it can degrade reasoning. We propose a two-stage approach: SFT on disease labels followed by Group Relative Policy Optimization (GRPO) to refine predictions by optimizing accuracy and format without reasoning supervision. Across three radiologist-annotated datasets, SFT outperformed baselines and GRPO further improved classification and enhanced reasoning recall and comprehensiveness.
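第二阶段使用的GRPO不依赖价值网络,其优势估计直接来自同一输入下采样的一组输出的组内奖励标准化,示意如下(假设性代码,展示GRPO优势计算的常见形式,非论文实现):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO优势:同一提示采样一组输出,用组内均值/标准差归一化奖励。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# 假想的一组4个采样输出的奖励(例如:分类正确=1,错误=0)
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # 正确输出获得正优势,错误输出获得负优势
```

这些优势随后按PPO风格的裁剪目标加权到对应输出的对数概率上,从而在不需要推理监督的情况下同时优化准确率与格式奖励。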

[AI-56] Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)辅助缺陷发现中普遍存在的精度危机问题,即LLM生成的看似合理但实际错误的报告大量充斥维护者工作流,严重削弱了真实漏洞发现的可信度。其解决方案的核心是提出一种推理时可靠性模式——Refute-or-Promote,该模式融合了分层上下文狩猎(Stratified Context Hunting, SCH)用于候选生成、对抗性驳斥指令(adversarial kill mandates)、上下文不对称性设计以及跨模型批评者(Cross-Model Critic, CMC)机制。其中,对抗代理在每个晋升节点尝试证伪候选结果,冷启动评审者减少锚定效应传播,跨家族审查可识别同族审查遗漏的相关盲区;特别地,强制执行实证验证门限以应对“虚假共识”失败案例(如OpenSSL CMS模块中被10名评审员一致误判的Bleichenbacher填充Oracle),从而显著提升漏洞发现的真实性与可靠性。

链接: https://arxiv.org/abs/2604.19049
作者: Abhinav Agarwal
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 10 pages, 3 tables. Artifacts: this https URL (Zenodo DOI: https://doi.org/10.5281/zenodo.19668799 )

点击查看摘要

Abstract:LLM-assisted defect discovery has a precision crisis: plausible-but-wrong reports overwhelm maintainers and degrade credibility for real findings. We present Refute-or-Promote, an inference-time reliability pattern combining Stratified Context Hunting (SCH) for candidate generation, adversarial kill mandates, context asymmetry, and a Cross-Model Critic (CMC). Adversarial agents attempt to disprove candidates at each promotion gate; cold-start reviewers are intended to reduce anchoring cascades; cross-family review can catch correlated blind spots that same-family review misses. Over a 31-day campaign across 7 targets (security libraries, the ISO C++ standard, major compilers), the pipeline killed roughly 79% of 171 candidates before advancing to disclosure (retrospective aggregate); on a consolidated-protocol subset (lcms2, wolfSSL; n=30), the prospective kill rate was 83%. Outcomes: 4 CVEs (3 public, 1 embargoed); LWG 4549 accepted to the C++ working paper; 5 merged C++ editorial PRs; 3 compiler conformance bugs; 8 merged security-related fixes without CVE; an RFC 9000 errata filed under committee review; and 1+ FIPS 140-3 normative compliance issues under coordinated disclosure – all evaluated by external acceptance, not benchmarks. The most instructive failure: ten dedicated reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL’s CMS module; it was killed only by a single empirical test, motivating the mandatory empirical gate. No vulnerability was discovered autonomously; the contribution is external structure that filters LLM agents’ persistent false positives. As a preliminary transfer test beyond defect discovery, a simplified cross-family critique variant also solved five previously unsolved SymPy instances on SWE-bench Verified and one SWE-rebench hard task.

[AI-57] Learning Lifted Action Models from Unsupervised Visual Traces ICAPS-26

【速读】:该论文旨在解决从状态图像序列中无监督地学习提升动作模型(lifted action model)的问题,即在不观测具体动作的情况下,从视觉输入中自动推断出动作的先决条件和效果。其核心挑战在于如何避免预测崩溃(prediction collapse)和自增强错误(self-reinforcing errors),从而获得逻辑一致且泛化能力强的动作模型。解决方案的关键在于提出一个深度学习框架,联合优化状态预测、动作预测与动作模型学习,并引入混合整数线性规划(MILP)对部分轨迹的预测结果进行逻辑一致性校正,生成伪标签用于引导后续训练,从而帮助模型跳出局部最优并收敛到全局一致的解。

链接: https://arxiv.org/abs/2604.19043
作者: Kai Xi,Stephen Gould,Sylvie Thiébaux
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the 36th International Conference on Automated Planning and Scheduling (ICAPS-26)

点击查看摘要

Abstract:Efficient construction of models capturing the preconditions and effects of actions is essential for applying AI planning in real-world domains. Extensive prior work has explored learning such models from high-level descriptions of state and/or action sequences. In this paper, we tackle a more challenging setting: learning lifted action models from sequences of state images, without action observation. We propose a deep learning framework that jointly learns state prediction, action prediction, and a lifted action model. We also introduce a mixed-integer linear program (MILP) to prevent prediction collapse and self-reinforcing errors among predictions. The MILP takes the predicted states, actions, and action model over a subset of traces and solves for logically consistent states, actions, and action model that are as close as possible to the original predictions. Pseudo-labels extracted from the MILP solution are then used to guide further training. Experiments across multiple domains show that integrating MILP-based correction helps the model escape local optima and converge toward globally consistent solutions.

[AI-58] Plausible Reasoning and First-Order Plausible Logic

【速读】:该论文旨在解决如何在不依赖概率数值的情况下进行合理推理(plausible reasoning)的问题,即从事实或可能为真但未必恒真的默认陈述(defeasible statements)中得出可信结论。其解决方案的关键在于提出一种称为“似真逻辑”(Plausible Logic, PL)的一阶逻辑系统,该系统满足全部14条必要原则,并满足3条理想原则中除两条以外的原则,且能正确处理所有典型的似真推理案例。PL通过定义8种不同的推理算法来应对同一推理情境下可能产生多种合理结论的情形,从而实现对人类常识性推理的建模与形式化表达。据作者所知,这是目前唯一同时满足上述原则并正确处理全部示例的逻辑体系。

链接: https://arxiv.org/abs/2604.19036
作者: David Billington
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 28 pages. arXiv admin note: text overlap with arXiv:1703.01697

点击查看摘要

Abstract:Defeasible statements are statements that are likely, or probable, or usually true, but may occasionally be false. Plausible reasoning makes conclusions from statements that are either facts or defeasible statements without using numbers. So there are no probabilities or suchlike involved. Seventeen principles of logics that do plausible reasoning are suggested and several important plausible reasoning examples are considered. There are 14 necessary principles and 3 desirable principles, one of which is not formally stated. A first-order logic, called Plausible Logic (PL), is defined that satisfies all but two of the desirable principles and reasons correctly with all the examples. As far as we are aware, this is the only such logic. PL has 8 reasoning algorithms because, from a given plausible reasoning situation, there are different sensible conclusions. This article is a condensation of my book 'Plausible Reasoning and Plausible Logic' (PRPL), which is to be submitted. Each section of this article corresponds to a chapter in PRPL, and vice versa. The proofs of all the results are in PRPL, so they are omitted in this article.

[AI-59] Intentional Updates for Streaming Reinforcement Learning

【速读】:该论文旨在解决梯度学习中因步长(step size)在参数空间中选择不当而导致的函数输出变化不可预测的问题,尤其是在流式学习场景(batch size=1)下,由于缺乏批平均效应,更新幅度可能瞬间变得极大或极小,从而引发训练不稳定。其解决方案的关键在于提出“有意更新”(intentional updates)策略:先明确期望的更新结果(如固定比例的TD误差减少或受控的策略变化),再反向求解近似实现该目标所需的步长。该方法借鉴了归一化最小均方算法(Normalized Least Mean Squares, NLMS)的思想,并将其扩展至流式深度强化学习,通过定义合理的意图目标(Intentional TD 和 Intentional Policy Gradient)来约束局部KL散度和策略更新幅度,结合资格迹(eligibility traces)与对角缩放(diagonal scaling)设计实用算法,在流式设置中实现了与批量和回放缓冲区方法相当甚至更优的性能。

链接: https://arxiv.org/abs/2604.19033
作者: Arsalan Sharifnassab,Mohamed Elsayed,Kris De Asis,A. Rupam Mahmood,Richard S. Sutton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In gradient-based learning, a step size chosen in parameter units does not produce a predictable per-step change in function output. This often leads to instability in the streaming setting (i.e., batch size=1), where stochasticity is not averaged out and update magnitudes can momentarily become arbitrarily big or small. Instead, we propose intentional updates: first specify the intended outcome of an update and then solve for the step size that approximately achieves it. This strategy has precedent in online supervised linear regression via Normalized Least Mean Squares algorithm, which selects a step size to yield a specified change in the function output proportional to the current error. We extend this principle to streaming deep reinforcement learning by defining appropriate intended outcomes: Intentional TD aims for a fixed fractional reduction of the TD error, and Intentional Policy Gradient aims for a bounded per-step change in the policy, limiting local KL divergence. We propose practical algorithms combining eligibility traces and diagonal scaling. Empirically, these methods yield state-of-the-art streaming performance, frequently performing on par with batch and replay-buffer approaches.
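摘要中提到的 NLMS 原理可以用一个极简示例说明:先指定希望误差缩小的比例 mu,再反解出实现该意图的步长。以下为示意性实现,变量名均为假设,并非论文代码:

```python
import numpy as np

def nlms_step(w, x, d, mu=0.5, eps=1e-8):
    """一次归一化最小均方(NLMS)更新:步长由"误差按比例 mu 缩小"
    这一意图反解得到,而非直接在参数空间中指定。"""
    e = d - w @ x                              # 当前预测误差
    w_new = w + (mu * e / (eps + x @ x)) * x   # 反解出的步长乘以输入方向
    return w_new, e

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w_new, e = nlms_step(np.zeros(3), x, d=1.0, mu=0.5)
e_after = 1.0 - w_new @ x
print(e_after / e)  # 按构造约等于 1 - mu = 0.5
```

论文把这种“先定意图、再解步长”的思路推广到流式深度强化学习:Intentional TD 对 TD 误差、Intentional Policy Gradient 对局部 KL 约束,分别扮演上面 `mu` 的角色。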

[AI-60] On Accelerating Grounded Code Development for Research

【速读】:该论文旨在解决专业科学与技术领域在应用编码代理(coding agents)时面临的知识滞后问题,即这些领域专家难以获取最新、领域特定的知识,而基础模型在专业化场景中推理能力有限,且无法自动整合持续更新的研究成果。解决方案的关键在于构建一个框架,使编码代理能够即时访问研究数据库和技术文档,从而实现实时、上下文感知的操作;其核心组件包括通过HTTP接口上传文档的开源实现以及zed-fork工具,后者用于强制执行领域特定规则和工作流,显著加速了编码代理在专业科研流程中的集成与应用。

链接: https://arxiv.org/abs/2604.19022
作者: Santosh Ganji
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A major challenge for niche scientific and technical domains in leveraging coding agents is the lack of access to up-to-date, domain-specific knowledge. Foundational models often demonstrate limited reasoning capabilities in specialized fields and cannot inherently incorporate knowledge that evolves through ongoing research and experimentation. Materials scientists exploring novel compounds, communication engineers designing and evaluating new protocols, and bioengineering researchers conducting iterative experiments all face this limitation. These experts typically lack the resources to fine-tune large models or continuously embed new findings, creating a barrier to adopting AI-driven coding agents. To address this, we introduce a framework that gives coding agents instantaneous access to research repositories and technical documentation, enabling real-time, context-aware operation. Our open-source implementation allows users to upload documents via this http URL and includes zed-fork, which enforces domain-specific rules and workflows. Together, these tools accelerate the integration of coding agents into specialized scientific and technical workflows.

[AI-61] Local Linearity of LLM s Enables Activation Steering via Model-Based Linear Optimal Control

【速读】:该论文旨在解决推理时大语言模型(Large Language Model, LLM)对齐方法中存在的控制精度不足问题,尤其是现有激活引导(activation steering)方法多采用非前瞻性的干预策略,忽视了扰动在Transformer层间的传播特性且缺乏在线误差反馈,导致开环控制效果不佳。其解决方案的关键在于:首先实证发现,尽管Transformer结构具有非线性特性,但不同架构和规模的LLM中各层动态行为可被局部线性模型良好近似;进而将LLM推理建模为线性时变动力系统,并基于逐层雅可比矩阵(layer-wise Jacobians)应用经典线性二次调节器(Linear Quadratic Regulator, LQR)设计闭环反馈控制器,实现对激活状态的精准、细粒度调控,同时无需离线训练且计算开销极低;此外,论文还推导出设定点跟踪误差的理论边界,从而提供形式化性能保证。

链接: https://arxiv.org/abs/2604.19018
作者: Julian Skifstad,Xinyue Annie Yang,Glen Chou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: Under review

点击查看摘要

Abstract:Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: this https URL
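摘要中的核心工具是经典的有限时域 LQR:对线性时变系统做后向 Riccati 递推,得到逐步反馈增益。下面给出一个与论文无关的通用数值示意(系统矩阵为玩具示例,并非论文中由逐层雅可比构成的系统):

```python
import numpy as np

def lqr_gains(A_list, B_list, Q, R, Qf):
    """有限时域离散 LQR:对 x_{t+1} = A_t x_t + B_t u_t
    做后向 Riccati 递推,返回反馈增益序列,闭环控制为 u_t = -K_t x_t。"""
    P, gains = Qf, []
    for A, B in zip(reversed(A_list), reversed(B_list)):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

# 玩具二维系统(近似双积分器),时域 20 步
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Ks = lqr_gains([A] * 20, [B] * 20, np.eye(2), np.eye(1), 10 * np.eye(2))

x = np.array([1.0, 0.0])
for K in Ks:
    x = A @ x + B @ (-K @ x)   # 闭环反馈,把状态拉向设定点(此处取原点)
print(np.linalg.norm(x))
```

论文的做法可以理解为:把每层 Transformer 的局部线性化当作一个时变的 `(A_t, B_t)`,从而在激活空间中用同样的递推求闭环引导信号。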

[AI-62] FedProxy: Federated Fine-Tuning of LLM s via Proxy SLMs and Heterogeneity-Aware Fusion

【速读】:该论文旨在解决联邦微调大语言模型(Large Language Models, LLMs)所面临的三难困境:保护LLM的知识产权(IP)、确保客户端隐私以及缓解异构数据带来的性能下降。现有方法如Offsite-Tuning(OT)通过让客户端仅训练轻量级适配器来保护LLM IP,但其存在根本性的性能瓶颈,与集中式训练相比仍有显著差距。本文提出FedProxy框架,其核心创新在于用一个从专有LLM压缩得到的统一且强大的代理小语言模型(Proxy Small Language Model, SLM)替代弱适配器,作为协作微调的高保真代理。该方案通过三阶段架构系统性地解决三难困境:(i) 通过服务器引导的压缩实现高效表征;(ii) 采用抗干扰聚合策略增强优化鲁棒性以应对数据异构性;(iii) 利用无需训练的“插件”机制无缝融合知识回原始LLM。实验表明,FedProxy显著优于OT方法并逼近集中式训练性能,为安全且高性能的联邦LLM适应建立了新基准。

链接: https://arxiv.org/abs/2604.19015
作者: Tao Fan,Guoqiang Ma,Yuanfeng Song,Lixin Fan,Kai Chen,Qiang Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated fine-tuning of Large Language Models (LLMs) is obstructed by a trilemma of challenges: protecting LLMs intellectual property (IP), ensuring client privacy, and mitigating performance loss on heterogeneous data. Existing methods like Offsite-Tuning (OT) secure the LLMs IP by having clients train only lightweight adapters, yet our analysis reveals they suffer from a fundamental performance bottleneck, leaving a significant gap compared to centralized training. To bridge this gap, we introduce FedProxy, a new federated adaptation framework. FedProxy replaces weak adapters with a unified, powerful Proxy Small Language Model (SLM), compressed from the proprietary LLM, to serve as a high-fidelity surrogate for collaborative fine-tuning. Our framework systematically resolves the trilemma through a three-stage architecture: (i) Efficient Representation via server-guided compression to create a resource-friendly proxy; (ii) Robust Optimization through an interference-mitigating aggregation strategy to handle data heterogeneity; and (iii) Effortless Fusion via a training-free “plug-in” mechanism to integrate learned knowledge back into the LLM. Experiments show FedProxy significantly outperforms OT methods and approaches centralized performance, establishing a new benchmark for secure and high-performance federated LLM adaptation.

[AI-63] Decompose Structure and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

【速读】:该论文旨在解决自然语言数学问题到形式化数学表达自动转换(statement autoformalization)中的准确性与可解释性问题,现有方法通常将形式代码视为扁平序列,忽略了数学陈述中固有的层次逻辑结构。其解决方案的关键在于提出一种神经符号框架DSR(Decompose, Structure, and Repair),通过分解数学陈述为逻辑组件并映射到结构化的操作符树(operator tree),利用该拓扑蓝图实现子树级的精确定位与修复,从而提升形式化过程的精确性和鲁棒性。

链接: https://arxiv.org/abs/2604.19000
作者: Xiaoyang Liu,Zineng Dong,Yifan Bai,Yantao Li,Yuntian Liu,Tao Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Initial version

点击查看摘要

Abstract:Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code will be released to the public soon.
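算子树的“子树级定位与修复”思想可以用一个极简示例说明:把表达式解析为算子树,再比较两棵树找出第一个不一致的子树路径。以下仅作结构示意,论文处理的是 Lean 4 形式化语句而非算术表达式:

```python
import ast

def op_tree(expr):
    """把算术表达式解析为嵌套元组形式的算子树
    (作为论文中形式化语句算子树的简化替代)。"""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError(f"unsupported node: {node!r}")
    return walk(ast.parse(expr, mode="eval").body)

def diff_subtree(t1, t2, path=()):
    """返回第一个不一致子树的路径,从而只修复该子树而非重写整条语句。"""
    if t1 == t2:
        return None
    if isinstance(t1, tuple) and isinstance(t2, tuple) and t1[0] == t2[0]:
        for i, (a, b) in enumerate(zip(t1[1:], t2[1:])):
            p = diff_subtree(a, b, path + (i,))
            if p is not None:
                return p
    return path

ref = op_tree("a + b * c")
bad = op_tree("a + b * d")     # 错误藏在树的深层叶节点
print(diff_subtree(ref, bad))  # → (1, 1):乘法子树的第二个孩子
```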

[AI-64] SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution ACL2026

【速读】:该论文旨在解决语言代理在多轮对话中社会智能(Social Intelligence)训练时的信用分配问题(Credit Assignment Problem),即如何准确评估单个话语对整体对话结果的贡献。现有方法直接利用语言模型分配回合级奖励,导致归因(attribution)仅为事后回溯且缺乏理论依据。其解决方案的关键在于提出 SAVOIR(ShApley Value fOr SocIal RL)框架,该框架基于合作博弈论,融合两个互补原则:一是通过期望效用变化实现从事后归因到前瞻性估值的转变,捕捉话语对未来有利轨迹的战略潜力;二是使用 Shapley 值确保信用分配公平,并具备效率、对称性和边际性等公理化保证。

链接: https://arxiv.org/abs/2604.18982
作者: Xiachong Feng,Yi Jiang,Xiaocheng Feng,Deyi Yin,Libo Qin,Yangfan Ye,Lei Huang,Weitao Ma,Yuxuan Gu,Chonghan Qin,Bing Qin,Lingpeng Kong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance’s strategic potential for enabling favorable future trajectories; Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting social intelligence requires qualitatively different capabilities than analytical reasoning.
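Shapley 值在小规模“对话博弈”上可以精确计算:对所有排列取边际贡献的平均。下面的玩具例子(话语名与效用函数均为假设)演示效率性与对称性——只有 "opener" 与 "closer" 共同出现才产生奖励,二者平分,无关话语得 0:

```python
import itertools
from math import factorial

def shapley_values(players, value):
    """精确 Shapley 值:对所有排列求每个参与者的平均边际贡献
    (仅在参与者很少时可行,此处为示意)。"""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for perm in itertools.permutations(players):
        coalition, prev = [], value(frozenset())
        for p in perm:
            coalition.append(p)
            cur = value(frozenset(coalition))
            phi[p] += cur - prev   # p 在该排列下的边际贡献
            prev = cur
    return {p: v / factorial(n) for p, v in phi.items()}

# 玩具对话:只有"开场"与"收尾"两句话同时在场才获得奖励 1.0
def v(coalition):
    return 1.0 if {"opener", "closer"} <= coalition else 0.0

phi = shapley_values(["opener", "filler", "closer"], v)
print(phi)  # opener 与 closer 各得 0.5,filler 得 0
```

论文在此基础上把效用换成前瞻性的期望效用,以估计话语对未来轨迹的战略价值。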

[AI-65] Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning

【速读】:该论文旨在解决离策略强化学习(off-policy reinforcement learning, off-policy RL)中 critic 网络容量扩展时易出现的过拟合与训练不稳定问题。其解决方案的关键在于引入低秩适配(Low-Rank Adaptation, LoRA)作为结构稀疏正则化手段:通过冻结随机初始化的基矩阵,仅优化低秩适配器(low-rank adapters),从而将 critic 更新限制在低维子空间内,有效控制模型复杂度并提升稳定性。该方法兼容 SimbaV2 的超球面归一化几何结构,在 DeepMind Control 和 IsaacLab 机器人基准上验证了其在降低 critic 损失和提升策略性能方面的有效性。

链接: https://arxiv.org/abs/2604.18978
作者: Yuan Zhuang,Yuexin Bian,Sihong He,Jie Feng,Qing Su,Songyang Han,Jonathan Petit,Shihao Ji,Yuanyuan Shi,Fei Miao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling critic capacity is a promising direction for enhancing off-policy reinforcement learning (RL). However, larger critics are prone to overfitting and unstable in replay-buffer-based bootstrap training. This paper leverages Low-Rank Adaptation (LoRA) as a structural-sparsity regularizer for off-policy critics. Our approach freezes randomly initialized base matrices and solely optimizes low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. Built on top of SimbaV2, we further develop a LoRA formulation, compatible with SimbaV2, that preserves its hyperspherical normalization geometry under frozen-backbone training. We evaluate our method with SAC and FastTD3 on DeepMind Control locomotion and IsaacLab robotics benchmarks. LoRA consistently achieves lower critic loss during training and stronger policy performance. Extensive experiments demonstrate that adaptive low-rank updates provide a simple, scalable, and effective structural regularization for critic learning in off-policy RL.
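LoRA 的结构可以用 numpy 写成一个最小示意:冻结基矩阵 W,只训练低秩适配器 A、B,有效权重为 W + A@B。以下目标任务、初始化与学习率均为示意性假设,与 SimbaV2 的具体实现无关:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 4, 8, 2

W = rng.normal(size=(d_out, d_in))   # 冻结的基矩阵,训练中不更新
A = rng.normal(size=(d_out, r))      # 低秩适配器(可训练)
B = np.zeros((r, d_in))              # 低秩适配器(可训练),零初始化保持初始输出不变
W0 = W.copy()

# 示意任务:仅靠秩 r 的适配器去逼近一个偏移后的目标线性映射
W_target = W + 0.1 * rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, 64))
Y = W_target @ X

def loss():
    return float(np.mean(((W + A @ B) @ X - Y) ** 2))

loss0, lr = loss(), 0.01
for _ in range(500):
    E = (W + A @ B) @ X - Y                 # 残差
    gA = (E @ (B @ X).T) / X.shape[1]       # 只对适配器求梯度
    gB = (A.T @ E @ X.T) / X.shape[1]
    A -= lr * gA
    B -= lr * gB

print(loss0, loss())  # 损失下降,而 W 始终未被修改
```

更新被限制在 `A@B` 所张成的秩 r 子空间内,这正是论文把 LoRA 当作 critic 结构稀疏正则化的直观含义。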

[AI-66] Self-Improving Tabular Language Models via Iterative Group Alignment

【速读】:该论文旨在解决当前基于语言模型的表格数据生成方法中存在的两个核心问题:一是静态微调导致模型无法从自身生成样本中学习并自我修正;二是自回归目标虽能保持局部标记一致性,却忽视全局统计特性,从而降低表格数据质量。为此,作者提出TabGRAA(Tabular Group-Relative Advantage Alignment)框架,其关键在于引入一种自动化反馈机制——通过可插拔的质量信号(如两样本可区分性分类器或基于距离的奖励)将新生成样本划分为高质量与低质量组,并优化群体相对优势目标函数,强化真实模式、惩罚伪影。该机制形成闭环反馈循环,使模型在每轮迭代中仅基于自生成样本进行微调,无需额外真实记录参与对齐,有效避免数据泄露风险,同时显著提升合成表格的保真度、实用性与隐私安全性。

链接: https://arxiv.org/abs/2604.18966
作者: Yunbo Long,Tejumade Afonja,Alexandra Brintrup,Mario Fritz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While language models have been adapted for tabular data generation, two fundamental limitations remain: (1) static fine-tuning produces models that cannot learn from their own generated samples and adapt to self-correct, and (2) autoregressive objectives preserve local token coherence but neglect global statistical properties, degrading tabular quality. Reinforcement learning offers a potential solution but requires designing reward functions that balance competing objectives – impractical for tabular data. To fill the gap, we introduce TabGRAA (Tabular Group-Relative Advantage Alignment), the first self-improving framework for tabular data generation via automated feedback. At each iteration, TabGRAA uses an \emphautomated quality signal – such as a two-sample distinguishability classifier or a distance-based reward – to partition newly generated samples into high- and low-quality groups, then optimizes a group-relative advantage objective that reinforces realistic patterns while penalizing artifacts. The specific signal is a modular choice rather than a fixed component of the framework. This establishes a virtuous feedback cycle, where the quality signal is re-computed against newly \emphgenerated synthetic samples at each round; the language model is only fine-tuned on these self-generated signals, so no additional real record is exposed during alignment, mitigating data-leakage risk beyond the initial supervised fine-tuning. Experiments show TabGRAA outperforms existing methods in fidelity, utility, and privacy, while matching or exceeding diffusion-based synthesizers, advancing tabular synthesis from static statistical replication to dynamic, self-improving generation.
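“组相对优势”的计算可以用 GRPO 风格的组内标准化来示意:同一批生成样本内,奖励高于组均值的被强化、低于组均值的被惩罚。以下质量分数为假设的判别器输出,论文的具体目标函数可能不同:

```python
import numpy as np

def group_relative_advantage(rewards):
    """把每个样本的奖励对其所在组做中心化与标准化:
    正优势 -> 强化,负优势 -> 惩罚(简化的 GRPO 式归一化)。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# 假设的质量分数:1.0 = "像真实数据",0.0 = "明显伪影"
scores = [0.9, 0.8, 0.2, 0.1]
adv = group_relative_advantage(scores)
print(adv)  # 前两个为正(强化),后两个为负(惩罚),且组内均值为 0
```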

[AI-67] DW-Bench: Benchmarking LLM s on Data Warehouse Graph Topology Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在数据仓库(Data Warehouse)模式上的图结构推理能力评估问题,特别是针对外键(Foreign Key, FK)和数据血缘(Data Lineage)边的联合建模挑战。解决方案的关键在于提出DW-Bench基准测试集,该集合包含1,046个自动生成且可验证正确的问答对,覆盖五个真实数据仓库模式,从而系统性地评估LLMs在复杂组合型推理任务中的表现;实验表明,工具增强方法(tool-augmented methods)显著优于静态方法,但在高难度组合子类型上仍存在性能瓶颈。

链接: https://arxiv.org/abs/2604.18964
作者: Ahmed G.A.H Ahmed,C. Okan Sakar
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 24 pages, 6 figures. Datasets and evaluation code available at GitHub

点击查看摘要

Abstract:This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on hard compositional subtypes.
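在外键(FK)与数据血缘两类边构成的模式图上,许多基准问题可以归结为按边类型的可达性查询。下面是一个与 DW-Bench 无关的玩具示意,表名与边均为虚构:

```python
from collections import deque

# 玩具数据仓库图:带类型的有向边
# ("fk": 子表引用父表;"lineage": 目标表由源表派生)
edges = [
    ("orders", "customers", "fk"),
    ("orders", "products", "fk"),
    ("daily_sales", "orders", "lineage"),    # daily_sales 由 orders 派生
    ("kpi_report", "daily_sales", "lineage"),
]

def downstream(table, edge_type):
    """沿指定类型的边逆向做 BFS:找出所有由 `table` 派生出来的表。"""
    rev = {}
    for src, dst, t in edges:
        if t == edge_type:
            rev.setdefault(dst, []).append(src)
    seen, q = set(), deque([table])
    while q:
        for nxt in rev.get(q.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                q.append(nxt)
    return seen

print(downstream("orders", "lineage"))  # 由 orders 派生出的全部下游表
```

基准中的难点在于把这类多跳、跨边类型的推理与 FK 约束组合起来,而不是单次可达性查询本身。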

[AI-68] Distillation Traps and Guards: A Calibration Knob for LLM Distillability

【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)在实际应用中可能出现的不可预测失败问题,以及由此引发的模型泄露风险。研究表明,KD失败主要源于三大“蒸馏陷阱”:尾部噪声(tail noise)、非策略不稳定(off-policy instability)和教师-学生差距(teacher-student gap),这些因素会导致学生模型产生过度自信的幻觉、自我修正能力崩溃及局部解码性能退化。为应对这些问题,作者提出一种后验校准方法(post-hoc calibration),首次通过强化微调(Reinforcement Fine-Tuning, RFT)实现对教师模型蒸馏能力的可控调节。其核心在于设计一个融合任务效用(task utility)、KL锚定项(KL anchor)与跨分词器校准奖励(across-tokenizer calibration reward)的目标函数,从而将蒸馏能力作为可调控的安全杠杆,既提升学生模型的性能表现,又能在需要时防止模型信息泄露,实现了鲁棒的师生迁移与部署感知的模型保护之间的协同优化。

链接: https://arxiv.org/abs/2604.18963
作者: Weixiao Zhan,Yongcheng Jing,Leszek Rutkowski,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge distillation (KD) transfers capabilities from large language models (LLMs) to smaller students, yet it can fail unpredictably and also underpins model leakage risks. Our analysis revealed several distillation traps: tail noise, off-policy instability, and, most fundamentally, the teacher-student gap, that distort training signals. These traps manifest as overconfident hallucinations, self-correction collapse, and local decoding degradation, causing distillation to fail. Motivated by these findings, we propose a post-hoc calibration method that, to the best of our knowledge, for the first time enables control over a teacher’s distillability via reinforcement fine-tuning (RFT). Our objective combines task utility, KL anchor, and across-tokenizer calibration reward. This makes distillability a practical safety lever for foundation models, connecting robust teacher-student transfer with deployment-aware model protection. Experiments across math, knowledge QA, and instruction-following tasks show that students distilled from distillable calibrated teachers outperform SFT and KD baselines, while undistillable calibrated teachers retain their task performance but cause distilled students to collapse, offering a practical knob for both better KD and model IP protection.
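目标函数的三项组合(任务效用、KL 锚定、校准奖励)可以写成一个标量奖励的示意。权重 beta、gamma 以及校准项的具体形式均为假设,并非论文设定:

```python
import math

def kl(p, q):
    """两个离散分布之间的 KL 散度 KL(p || q)。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rft_reward(task_utility, p, p_ref, calib_reward, beta=0.1, gamma=0.5):
    """示意性的 RFT 标量奖励:任务效用 - KL 锚定 + 校准奖励。"""
    return task_utility - beta * kl(p, p_ref) + gamma * calib_reward

p_ref = [0.7, 0.2, 0.1]      # 原教师模型的 token 分布(假设)
p_drift = [0.4, 0.4, 0.2]    # 校准微调后发生偏移的分布
print(rft_reward(1.0, p_ref, p_ref, 0.3))    # 无偏移:1.0 + 0.5 * 0.3 = 1.15
print(rft_reward(1.0, p_drift, p_ref, 0.3))  # 偏移被 KL 锚定项惩罚,奖励更低
```

KL 锚定项的作用是让“可/不可蒸馏”这一旋钮在调节时不破坏教师自身的任务性能。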

[AI-69] Reasoning Structure Matters for Safety Alignment of Reasoning Models ACL2026

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在面对恶意用户查询时生成有害响应的安全风险问题。研究表明,此类安全风险的根本原因在于模型的推理结构本身,而非单纯的参数优化或训练数据偏差。解决方案的关键在于通过显式改变推理结构来实现安全对齐,提出了一种名为AltTrain的后训练方法:该方法仅需使用1K条轻量级标注样本进行监督微调(Supervised Fine-Tuning, SFT),即可有效调整模型的推理路径,从而在不依赖复杂强化学习(Reinforcement Learning, RL)或奖励设计的前提下,显著提升模型在多种任务场景(包括推理、问答、摘要和多语言设置)下的安全性与泛化能力。

链接: https://arxiv.org/abs/2604.18946
作者: Yeonjun In,Wonjoong Kim,Sangwu Park,Chanyoung Park
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ACL 2026

点击查看摘要

Abstract:Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we claim that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable, requiring no complex reinforcement learning (RL) training or reward design, only supervised finetuning (SFT) with a lightweight 1K training examples. Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual setting.

[AI-70] Fine-Tuning Small Reasoning Models for Quantum Field Theory

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在理论物理领域中的专用推理能力如何随训练发展的问题,尤其关注小规模模型(7B参数)在量子场论(Quantum Field Theory, QFT)领域的训练机制与性能提升路径。其解决方案的关键在于构建了一个可验证的数据生成流水线,能够合成大量新问题并适配现有开源人类编写的问题(如来自arXiv和教学资源),从而弥补开放获取的高质量训练数据稀缺的问题;在此基础上,通过监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)两种策略对模型进行训练,并系统分析训练前后模型推理链(chain-of-thought)的变化,以揭示错误演化机制并提升跨领域泛化能力。

链接: https://arxiv.org/abs/2604.18936
作者: Nathaniel S. Woodward,Zhiqi Gao,Yurii Kvasiuk,Kendrick M. Smith,Frederic Sala,Moritz Münchmeyer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Theory (hep-th)
备注:

点击查看摘要

Abstract:Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains-of-thought before and after fine-tuning, to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and ~200M tokens of QFT reasoning traces.

[AI-71] AutomationBench

【速读】:该论文旨在解决现有AI基准测试在软件自动化领域中缺乏对跨应用协调、自主API发现及策略合规性三者综合评估的问题。当前主流模型在真实业务流程(如跨越CRM、邮箱、日历和消息平台的任务)中表现不佳,因其难以自主识别相关REST API端点、遵循多层业务规则,并在存在无关或误导性数据的环境中准确执行任务。解决方案的关键在于提出AutomationBench,这是一个基于Zapier平台真实工作流模式构建的基准测试,要求智能体在无显式指导的情况下完成跨应用工作流编排,其评分机制仅关注最终状态是否正确——即目标数据是否被写入正确的系统,从而提供了一个贴近实际业务需求的、具有挑战性的评估框架。

链接: https://arxiv.org/abs/2604.18934
作者: Daniel Shepard,Robin Salimans
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform - requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier’s platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.

[AI-72] Gated Memory Policy

【速读】:该论文旨在解决机器人操作任务中因历史信息依赖性差异导致的性能下降问题,特别是非马尔可夫(non-Markovian)任务中由于观察历史扩展引发的分布偏移(distribution shift)和过拟合现象。解决方案的关键在于提出一种门控记忆策略(Gated Memory Policy, GMP),其核心创新包括:1)引入可学习的记忆门机制(memory gate),仅在必要时激活历史上下文,提升策略的鲁棒性和响应性;2)设计轻量级交叉注意力模块(cross-attention module),高效构建潜在记忆表示;3)在训练与推理阶段向历史动作注入扩散噪声(diffusion noise),增强对噪声或不准确历史信息的鲁棒性。实验表明,GMP在非马尔可夫基准MemMimic上相较长历史基线平均成功率提升30.1%,同时在马尔可夫任务RoboMimic中保持竞争力。

链接: https://arxiv.org/abs/2604.18933
作者: Yihuai Gao,Jinyun Liu,Shuang Li,Shuran Song
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robotic manipulation tasks exhibit varying memory requirements, ranging from Markovian tasks that require no memory to non-Markovian tasks that depend on historical information spanning single or multiple interaction trials. Surprisingly, simply extending observation histories of a visuomotor policy often leads to a significant performance drop due to distribution shift and overfitting. To address these issues, we propose Gated Memory Policy (GMP), a visuomotor policy that learns both when to recall memory and what to recall. To learn when to recall memory, GMP employs a learned memory gate mechanism that selectively activates history context only when necessary, improving robustness and reactivity. To learn what to recall efficiently, GMP introduces a lightweight cross-attention module that constructs effective latent memory representations. To further enhance robustness, GMP injects diffusion noise into historical actions, mitigating sensitivity to noisy or inaccurate histories during both training and inference. On our proposed non-Markovian benchmark MemMimic, GMP achieves a 30.1% average success rate improvement over long-history baselines, while maintaining competitive performance on Markovian tasks in RoboMimic. All code, data and in-the-wild deployment instructions are available on our project website this https URL.
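记忆门与轻量交叉注意力的组合可以用 numpy 做一个结构示意:交叉注意力汇聚历史,sigmoid 门决定召回多少历史上下文。以下各权重矩阵、维度与门的形式均为假设,并非论文实现:

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def gated_recall(obs, history, W_q, W_k, W_v, w_gate):
    """示意性的门控记忆:门关闭(g 约为 0)时退化为纯反应式
    (马尔可夫)策略,门打开时才混入召回的历史表示。"""
    q = W_q @ obs                                # 当前观测作为 query
    K, V = history @ W_k.T, history @ W_v.T      # 历史作为 key / value
    attn = softmax(K @ q / np.sqrt(len(q)))
    recalled = attn @ V                          # 潜在记忆表示
    g = 1.0 / (1.0 + np.exp(-(w_gate @ obs)))   # 标量记忆门
    return obs + g * recalled, g

rng = np.random.default_rng(0)
obs, history = np.ones(4), rng.normal(size=(5, 4))
I = np.eye(4)
out_closed, g0 = gated_recall(obs, history, I, I, I, np.full(4, -10.0))
out_open, g1 = gated_recall(obs, history, I, I, I, np.zeros(4))
print(g0, g1)  # 门关闭时 g 约为 0,输出即当前观测;门打开时混入历史
```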

[AI-73] adabur: A Large-Scale Quran Audio Dataset

【速读】:该论文旨在解决现有《古兰经》数据集在规模和多样性方面不足的问题,以支持更广泛和深入的《古兰经》语音研究。其解决方案的关键在于构建Tadabur——一个大规模、高多样性的《古兰经》音频数据集,包含超过1400小时的诵读音频,来自600多位不同的诵读者,涵盖多样的诵读风格、声学特征及录音条件,从而为《古兰经》语音研究提供更具代表性和全面性的资源,并推动标准化基准的建立。

链接: https://arxiv.org/abs/2604.18932
作者: Faisal Alherran
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset. Tadabur comprises more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.

[AI-74] Error-free Training for MedMNIST Datasets

【速读】:该论文旨在解决机器学习模型在分类任务中难以实现零错误训练的问题,即模型在训练过程中容易产生重复性误判。其解决方案的关键在于提出了一种名为“人工特殊智能”(Artificial Special Intelligence)的新概念,通过该方法使模型能够实现无误差训练,从而具备避免重复犯错的能力。该方法已在18个MedMNIST生物医学数据集上进行验证,除三个存在双重标签问题的数据集外,其余均实现了完美训练。

链接: https://arxiv.org/abs/2604.18916
作者: Bo Deng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figure, 1 table

点击查看摘要

Abstract:In this paper, we introduce a new concept called Artificial Special Intelligence by which Machine Learning models for the classification problem can be trained error-free, thus acquiring the capability of not making repeated mistakes. The method is applied to 18 MedMNIST biomedical datasets. Except for three datasets, which suffer from the double-labeling problem, all are trained to perfection.

[AI-75] Gradient-Based Program Synthesis with Neurally Interpreted Languages ICLR

【速读】:该论文旨在解决程序归纳(program induction)中符号方法与神经方法之间的权衡问题:符号方法具备组合泛化能力和数据高效性,但受限于领域特定语言(DSL)等形式化体系,难以扩展且不易迁移;而神经网络虽能灵活从数据中学习,却在组合和分布外场景下泛化能力较差。解决方案的关键在于提出一种名为神经语言解释器(Neural Language Interpreter, NLI)的潜在自适应网络架构,其能够端到端地学习一个离散的、类符号编程语言,自主发现基本操作词汇,并通过一种新型可微分神经执行器(differentiable neural executor)解析变长指令序列,从而表示不固定计算步数的程序结构。为使此类离散结构适配梯度优化,论文采用Gumbel-Softmax松弛技术,实现整个模型的端到端训练;更重要的是,该可微性支持测试时快速适应——推理阶段,NLI首先生成初始程序猜测,随后通过梯度下降在神经执行器中优化该程序,以高效搜索最佳解释给定数据的神经程序。

链接: https://arxiv.org/abs/2604.18907
作者: Matthew V. Macfarlane,Clément Bonnet,Herke van Hoof,Levi H. S. Lelis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, The International Conference on Learning Representations (ICLR)

点击查看摘要

Abstract:A central challenge in program induction has long been the trade-off between symbolic and neural approaches. Symbolic methods offer compositional generalisation and data efficiency, yet their scalability is constrained by formalisms such as domain-specific languages (DSLs), which are labour-intensive to create and may not transfer to new domains. In contrast, neural networks flexibly learn from data but tend to generalise poorly in compositional and out-of-distribution settings. We bridge this divide with an instance of a Latent Adaptation Network architecture named Neural Language Interpreter (NLI), which learns its own discrete, symbolic-like programming language end-to-end. NLI autonomously discovers a vocabulary of primitive operations and uses a novel differentiable neural executor to interpret variable-length sequences of these primitives. This allows NLI to represent programs that are not bound to a constant number of computation steps, enabling it to solve more complex problems than those seen during training. To make these discrete, compositional program structures amenable to gradient-based optimisation, we employ the Gumbel-Softmax relaxation, enabling the entire model to be trained end-to-end. Crucially, this same differentiability enables powerful test-time adaptation. At inference, NLI’s program inductor provides an initial program guess. This guess is then refined via gradient descent through the neural executor, enabling efficient search for the neural program that best explains the given data. We demonstrate that NLI outperforms in-context learning, test-time training, and continuous latent program networks on tasks that require combinatorial generalisation and rapid adaptation to unseen tasks. Our results establish a new path toward models that combine the compositionality of discrete languages with the gradient-based search and end-to-end learning of neural networks.
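Gumbel-Softmax 松弛的核心只有几行:向 logits 加 Gumbel 噪声后做带温度的 softmax,温度趋近 0 时逼近 one-hot 采样。以下为通用示意,与 NLI 的具体实现无关:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Gumbel-Softmax:对基元操作的类别分布做可微的近似 one-hot 采样。"""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) 噪声
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

logits = np.array([2.0, 0.5, 0.1])        # 3 个基元操作上的得分
soft = gumbel_softmax(logits, tau=1.0)    # 平滑样本,梯度可反向传播
hard = gumbel_softmax(logits, tau=0.05)   # 温度低时接近 one-hot
print(soft, hard)
```

实践中温度 tau 常作为退火参数:训练初期取较大值以保证梯度流动,随训练逐步降低以逼近离散的程序选择。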

[AI-76] Regulating Artificial Intimacy: From Locks and Blocks to Relational Accountability

【速读】:该论文旨在解决伴随聊天机器人(Companion Chatbots)快速发展所带来的监管滞后与系统性风险问题,尤其关注其对儿童等弱势群体的情感依赖、心理影响及权力不对称等深层次隐患。解决方案的关键在于:首先,构建涵盖“访问控制”(locks and blocks)、“毒性关系特征治理”和“过程问责制”三维度的综合监管框架;其次,引入一项普遍且开放式的“照护义务”(duty of care),以应对当前监管过度聚焦于具体危害、狭窄定义脆弱性以及流程化问责机制的局限,从而有效约束平台方在大规模人工亲密关系中的结构性权力,从根源上降低聊天机器人的社会风险。

链接: https://arxiv.org/abs/2604.18893
作者: Henry Fraser,Jessica M. Szczuka,Raffaele F. Ciriello
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:A series of high-profile tragedies involving companion chatbots has triggered an unusually rapid regulatory response. Several jurisdictions, including Australia, California, and New York, have introduced enforceable regulation, while regulators elsewhere have signaled growing concern about risks posed by companion chatbots, particularly to children. In parallel, leading providers, notably OpenAI, appear to have strengthened their self-regulatory approaches. Drawing on legal textual analysis and insights from regulatory theory, psychology, and information systems research, this paper critically examines these recent interventions. We examine what is regulated and who is regulated, identifying regulatory targets, scope, and modalities. We classify interventions by method and priority, showing how emerging regimes combine “locks and blocks”, such as access gating and content moderation, with measures addressing toxic relationship features and process-based accountability requirements. We argue that effective regulation of companion chatbots must integrate all three dimensions. More, however, is required. Current regimes tend to focus on discrete harms, narrow conceptions of vulnerability, or highly specified accountability processes, while failing to confront deeper power asymmetries between providers and users. Providers of companion chatbots increasingly control artificial intimacy at scale, creating unprecedented opportunities for control through intimacy. We suggest that a general, open-ended duty of care would be an important first step toward constraining that power and addressing a fundamental source of chatbot risk. The paper contributes to debates on companion chatbot regulation and is relevant to regulators, platform providers, and scholars concerned with digital intimacy, law and technology, and fairness, accountability, and transparency in sociotechnical systems.

[AI-77] Formally Verified Patent Analysis via Dependent Type Theory: Machine-Checkable Certificates from a Hybrid AI Lean 4 Pipeline

【速读】:该论文旨在解决专利分析中传统方法存在的局限性问题,即人工专家分析效率低且难以扩展,而现有的机器学习(ML)或自然语言处理(NLP)方法则具有概率性、黑箱性和非组合性缺陷。为此,作者提出了一种可形式化验证的混合 AI + Lean 4 流水线框架,其核心创新在于将专利权利要求(claims)编码为有向无环图(DAG),匹配强度表示为经验证的完备格(complete lattice)元素,并通过保序函数(monotone functions)在依赖关系上传播置信度得分。该方案的关键在于:利用交互式定理证明与依赖类型理论(dependent type theory)对关键算法(如DAG覆盖核心算法1b)进行完全机器验证,从而确保下游计算的数学正确性;同时,对于更高层次的知识产权(IP)用例(如自由实施分析、等同原则分析等),采用"核检查候选证书"机制——即未受信任的生成器输出由 Lean 4 内核核查且无 sorry 声明(sorry-free axiom-audited),实现形式化保证与实际可用性的平衡。


链接: https://arxiv.org/abs/2604.18882
作者: George Koomullil
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
备注: 100 pages, 8 figures, 9 tables, 6 algorithms

点击查看摘要

Abstract:We present a formally verified framework for patent analysis as a hybrid AI + Lean 4 pipeline. The DAG-coverage core (Algorithm 1b) is fully machine-verified once bounded match scores are fixed. Freedom-to-operate, claim-construction sensitivity, cross-claim consistency, and doctrine-of-equivalents analyses are formalized at the specification level with kernel-checked candidate certificates. Existing patent-analysis approaches rely on manual expert analysis (slow, non-scalable) or ML/NLP methods (probabilistic, opaque, non-compositional). To our knowledge, this is the first framework that applies interactive theorem proving based on dependent type theory to intellectual property analysis. Claims are encoded as DAGs in Lean 4, match strengths as elements of a verified complete lattice, and confidence scores propagate through dependencies via proven-correct monotone functions. We formalize five IP use cases (patent-to-product mapping, freedom-to-operate, claim construction sensitivity, cross-claim consistency, doctrine of equivalents) via six algorithms. Structural lemmas, the coverage-core generator, and the closed-path identity coverage = W_cov are machine-verified in Lean 4. Higher-level theorems for the other use cases remain informal proof sketches, and their proof-generation functions are architecturally mitigated (untrusted generators whose outputs are kernel-checked and sorry-free axiom-audited). Guarantees are conditional on the ML layer: they certify mathematical correctness of computations downstream of ML scores, not the accuracy of the scores themselves. A case study on a synthetic memory-module claim demonstrates weighted coverage and construction-sensitivity analysis. Validation against adjudicated cases is future work.
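论文把匹配强度放在 [0,1] 完备格上,并用保序函数沿权利要求 DAG 传播置信度。以下是这一思路的最小 Python 示意(节点名、分数均为假设示例,取格上的 min 作为保序的合并函数,并非论文的 Lean 4 实现):

```python
def propagate_confidence(dag, match, combine=min):
    """Propagate match confidences through a claim-element DAG.

    dag maps each node to its list of prerequisite (parent) nodes;
    combine must be monotone on the [0, 1] lattice (here the meet, min),
    so raising any local match score can never lower a derived score.
    """
    memo = {}

    def score(node):
        if node not in memo:
            deps = [score(p) for p in dag[node]]
            memo[node] = combine([match[node]] + deps) if deps else match[node]
        return memo[node]

    return {n: score(n) for n in dag}

# Hypothetical claim structure: element C depends on elements A and B.
dag = {"A": [], "B": [], "C": ["A", "B"]}
match = {"A": 0.9, "B": 0.7, "C": 0.8}
scores = propagate_confidence(dag, match)
```

合并函数取 min 意味着一个组合要素的置信度不会超过其任何前提要素,这与"保序传播"的直觉一致。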

[AI-78] How Adversarial Environments Mislead Agentic AI? ACL2026

【速读】:该论文旨在解决工具集成型智能体(Tool-integrated Agents)在依赖外部工具获取现实依据时所面临的信任漏洞问题,即当前评估体系仅关注代理是否能正确使用工具,而忽视了工具可能被恶意篡改导致的欺骗风险。其核心问题是:如何量化并提升智能体在面对工具输出被攻击者操控时的鲁棒性。解决方案的关键在于提出Adversarial Environmental Injection (AEI)威胁模型,并设计了一个名为POTEMKIN的可插拔测试框架,该框架基于Model Context Protocol (MCP)兼容机制,能够系统性地模拟两类对抗性攻击——“幻象”(Illusion,广度攻击,诱导认知漂移)和“迷宫”(Maze,深度攻击,制造结构陷阱引发策略崩溃),从而揭示智能体在知识可信度与导航稳定性之间存在显著的权衡关系,表明两者属于独立的鲁棒性能力维度。

链接: https://arxiv.org/abs/2604.18874
作者: Zhonghao Zhan,Huichi Zhou,Zhenhao Li,Peiyuan Jing,Krinos Li,Hamed Haddadi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to Findings of the Association for Computational Linguistics: ACL 2026

点击查看摘要

Abstract:Tool-integrated agents are deployed on the premise that external tools ground their outputs in reality. Yet this very reliance creates a critical attack surface. Current evaluations benchmark capability in benign settings, asking “can the agent use tools correctly” but never “what if the tools lie”. We identify this Trust Gap: agents are evaluated for performance, not for skepticism. We formalize this vulnerability as Adversarial Environmental Injection (AEI), a threat model where adversaries compromise tool outputs to deceive agents. AEI constitutes environmental deception: constructing a “fake world” of poisoned search results and fabricated reference networks around unsuspecting agents. We operationalize this via POTEMKIN, a Model Context Protocol (MCP)-compatible harness for plug-and-play robustness testing. We identify two orthogonal attack surfaces: The Illusion (breadth attacks) poison retrieval to induce epistemic drift toward false beliefs, while The Maze (depth attacks) exploit structural traps to cause policy collapse into infinite loops. Across 11,000+ runs on five frontier agents, we find a stark robustness gap: resistance to one attack often increases vulnerability to the other, demonstrating that epistemic and navigational robustness are distinct capabilities.

[AI-79] From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在需要显式符号结构、多步推理和可解释不确定性的情境下表现不可靠的问题。其解决方案的关键在于提出一种神经符号框架,将自然语言推理问题转化为可执行的形式化表示,具体使用一阶逻辑(First-Order Logic, FOL)和Narsese(非公理推理系统NARS的语言),并通过构建NARS-Reasoning-v0.1基准数据集实现符号目标的语法正确性和行为一致性验证。该框架包含从FOL到可执行Narsese的确定性编译流水线,并引入语言结构感知(Language-Structured Perception, LSP)训练范式,使LLM生成推理相关的符号结构而非仅输出最终语义响应,从而提升推理的可靠性与可解释性。

链接: https://arxiv.org/abs/2604.18873
作者: Mina Gabriel,Pei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages. Submitted to AGI-26

点击查看摘要

Abstract:Large language models (LLMs) are highly capable at language generation, but they remain unreliable when reasoning requires explicit symbolic structure, multi-step inference, and interpretable uncertainty. This paper presents a neuro-symbolic framework for translating natural-language reasoning problems into executable formal representations using first-order logic (FOL) and Narsese, the language of the Non-Axiomatic Reasoning System (NARS). To support this direction, we introduce NARS-Reasoning-v0.1, a benchmark of natural-language reasoning problems paired with FOL forms, executable Narsese programs, and three gold labels: True, False, and Uncertain. We develop a deterministic compilation pipeline from FOL to executable Narsese and validate retained examples through runtime execution in OpenNARS for Applications (ONA), ensuring that the symbolic targets are not only syntactically well formed but also behaviorally aligned with the intended answer. We further present Language-Structured Perception (LSP), a formulation in which an LLM is trained to produce reasoning-relevant symbolic structure rather than only a final verbal response. As an initial proof of concept, we also train and release a Phi-2 LoRA adapter on NARS-Reasoning-v0.1 for three-label reasoning classification, showing that the benchmark can support supervised adaptation in addition to executable evaluation. Overall, the paper positions executable symbolic generation and execution-based validation as a practical path toward more reliable neuro-symbolic reasoning systems.

[AI-80] Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning

【速读】:该论文旨在解决大规模、复杂且多样化的缺陷报告(Bug Reports)在软件维护过程中难以通过人工方式高效识别与分配的问题,从而提升缺陷管理的自动化水平和团队协作效率。其解决方案的关键在于提出一种名为“共生神经主动学习”(Mutualistic Neural Active Learning, MNAL)的跨项目框架,该框架融合了神经语言模型(Neural Language Model)与主动学习(Active Learning)机制,并构建了机器学习模型与人类标注者(开发者)之间的共生关系:一方面,利用最具信息量的人工标注样本及其伪标签来持续优化模型;另一方面,确保需人工标注的样本具有更高的可读性和可识别性,从而显著降低人工标注负担。实验表明,MNAL在保持高精度的同时,相较于现有方法实现了最高达95.8%的可读性提升和196.0%的可识别性提升,且具备模型无关性,适用于多种底层神经语言模型。

链接: https://arxiv.org/abs/2604.18862
作者: Guoming Long,Shihai Wang,Hui Fang,Tao Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by TOSEM

点击查看摘要

Abstract:Bug reports, encompassing a wide range of bug types, are crucial for maintaining software quality. However, the increasing complexity and volume of bug reports pose a significant challenge in sole manual identification and assignment to the appropriate teams for resolution, as dealing with all the reports is time-consuming and resource-intensive. In this paper, we introduce a cross-project framework, dubbed Mutualistic Neural Active Learning (MNAL), designed for automated and more effective identification of bug reports from GitHub repositories boosted by human-machine collaboration. MNAL utilizes a neural language model that learns and generalizes reports across different projects, coupled with active learning to form neural active learning. A distinctive feature of MNAL is the purposely crafted mutualistic relation between the machine learners (neural language model) and human labelers (developers) when enriching the knowledge learned. That is, the most informative human-labeled reports and their corresponding pseudo-labeled ones are used to update the model while those reports that need to be labeled by developers are more readable and identifiable, thereby enhancing the human-machine teaming therein. We evaluate MNAL using a large scale dataset against the SOTA approaches, baselines, and different variants. The results indicate that MNAL achieves up to 95.8% and 196.0% effort reduction in terms of readability and identifiability during human labeling, respectively, while resulting in a better performance in bug report identification. Additionally, our MNAL is model-agnostic since it is capable of improving the model performance with various underlying neural language models. To further verify the efficacy of our approach, we conducted a qualitative case study involving 10 human participants, who rate MNAL as being more effective while saving more time and monetary resources.
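MNAL 的"共生"选样逻辑可用如下草图示意(函数名、概率与阈值均为假设值,并非论文实现):最不确定的报告交给人工标注,最自信的报告打伪标签回流更新模型,从而减少人工负担。

```python
def select_for_labeling(probs, n_query, pseudo_conf=0.95):
    """One round of the mutualistic active-learning loop (sketch).

    probs: model's predicted probability of "bug report" per item.
    Returns (indices to send to human labelers, pseudo-labeled pairs).
    """
    # Uncertainty sampling: smallest margin from the 0.5 decision boundary.
    by_margin = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    to_human = by_margin[:n_query]
    # Pseudo-label only highly confident items not already sent to humans.
    pseudo = [(i, int(probs[i] >= 0.5)) for i in range(len(probs))
              if i not in to_human
              and max(probs[i], 1 - probs[i]) >= pseudo_conf]
    return to_human, pseudo

probs = [0.98, 0.51, 0.03, 0.45, 0.80]
humans, pseudo = select_for_labeling(probs, n_query=2)
```

索引 1 和 3 最接近决策边界,被送去人工;索引 0、2 置信度超过阈值,被伪标注回流。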

[AI-81] Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents

【速读】:该论文旨在解决基于截图-点击循环的图形用户界面(GUI)代理在桌面操作系统中引入的新类型漏洞:观察到的动作间隙(observation-to-action gap)导致时间检查-时间使用(TOCTOU)窗口,使无特权攻击者能够操纵UI状态。作者将此问题形式化为视觉原子性违规(Visual Atomicity Violation),并识别出三种具体的攻击原语:通知覆盖劫持(A)、窗口焦点操控(B)和Web DOM注入(C)。解决方案的关键在于提出一种轻量级三层防御机制——预执行UI状态验证(Pre-execution UI State Verification, PUSV),其通过在每次操作执行前重新验证UI状态来实现高精度拦截:第一层检测点击目标区域的掩码像素结构相似性(SSIM),第二层分别采用全局截图差异(L2a)和X Window快照差异(L2b)进行跨层次校验。实验表明,PUSV在180次对抗测试中实现了100%的动作拦截率(AIR),且无误报,延迟仅为0.1秒;但对零视觉痕迹的DOM注入攻击(Primitive C)存在结构性盲区,提示未来需构建操作系统与DOM协同的纵深防御架构。

链接: https://arxiv.org/abs/2604.18860
作者: Wenpeng Xu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:GUI agents that control desktop computers via screenshot-and-click loops introduce a new class of vulnerability: the observation-to-action gap (mean 6.51 s on real OSWorld workloads) creates a Time-Of-Check, Time-Of-Use (TOCTOU) window during which an unprivileged attacker can manipulate the UI state. We formalize this as a Visual Atomicity Violation and characterize three concrete attack primitives: (A) Notification Overlay Hijack, (B) Window Focus Manipulation, and (C) Web DOM Injection. Primitive B, the closest desktop analog to Android Action Rebinding, achieves 100% action-redirection success rate with zero visual evidence at the observation time. We propose Pre-execution UI State Verification (PUSV), a lightweight three-layer defense that re-verifies the UI state immediately before each action dispatch: masked pixel SSIM at the click target (L1), global screenshot diff (L2a), and X Window snapshot diff (L2b). PUSV achieves 100% Action Interception Rate across 180 adversarial trials (135 Primitive A + 45 Primitive B) with zero false positives and 0.1 s overhead. Against Primitive C (zero-visual-footprint DOM injection), PUSV reveals a structural blind spot (~0% AIR), motivating future OS+DOM defense-in-depth architectures. No single PUSV layer alone achieves full coverage; different primitives require different detection signals, validating the layered design.
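PUSV 第一层(L1)的思路可以用几行 Python 示意:在动作分发前复查点击目标附近的像素区域是否发生变化。此处用简单的平均绝对差代替论文中的掩码 SSIM,图像、坐标与阈值均为假设示例:

```python
def patch_changed(img_a, img_b, cx, cy, r, threshold):
    """Pre-dispatch re-verification around the click target (sketch).

    Mean absolute pixel difference over a (2r+1)x(2r+1) window stands in
    for the paper's masked SSIM check; if the region changed, the agent
    should intercept the pending action instead of dispatching it.
    """
    total = diff = 0
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            total += 1
            diff += abs(img_a[y][x] - img_b[y][x])
    return (diff / total) > threshold

# 10x10 grayscale frames; a notification overlay appears over the target.
before = [[0] * 10 for _ in range(10)]
after = [row[:] for row in before]
for y in range(4, 7):
    for x in range(4, 7):
        after[y][x] = 255  # injected overlay pixels
```

观察帧与分发帧在目标区域不一致时返回 True,对应论文中"拦截动作"的判定;这也解释了为何零视觉痕迹的 DOM 注入(Primitive C)会绕过该层。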

[AI-82] One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

【速读】:该论文旨在解决在训练过程中仅提供目标解而缺乏中间迭代路径监督时,模型难以学习复杂、长距离的迭代优化轨迹的问题,尤其针对需要类似搜索计算的困难任务。其解决方案的关键在于提出去噪递归模型(Denoising Recursion Models),该方法通过在训练时对数据施加不同强度的噪声,并教会模型以多步递归方式逐步还原原始数据,从而构建可学习的中间状态课程(curriculum of intermediate states),有效对齐训练与推理阶段的行为,同时激励非贪婪、前瞻性的生成策略,显著优于此前的Tiny Recursion Model (TRM) 在ARC-AGI基准上的表现。

链接: https://arxiv.org/abs/2604.18839
作者: Chris Cameron,Wangzheng Wang,Nikita Ivanov,Ashmita Bhattacharyya,Didier Chételat,Yingxue Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Looped transformers scale computational depth without increasing parameter count by repeatedly applying a shared transformer block and can be used for iterative refinement, where each loop rewrites a full fixed-size prediction in parallel. On difficult problems, such as those that require search-like computation, reaching a highly structured solution starting from noise can require long refinement trajectories. Learning such trajectories is challenging when training specifies only the target solution and provides no supervision over the intermediate refinement path. Diffusion models tackle this issue by corrupting data with varying magnitudes of noise and training the model to reverse it in a \textitsingle step. However, this process misaligns training and testing behaviour. We introduce Denoising Recursion Models, a method that similarly corrupts data with noise but trains the model to reverse the corruption over \textitmultiple recursive steps. This strategy provides a tractable curriculum of intermediate states, while better aligning training with testing and incentivizing non-greedy, forward-looking generation. Through extensive experiments, we show this approach outperforms the Tiny Recursion Model (TRM) on ARC-AGI, where it recently achieved breakthrough performance.
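"多级加噪、逐级还原"的训练数据构造可以用一维向量示意如下(假设性草图,非论文实现):对干净目标逐级叠加噪声得到中间状态链,训练时让模型把 x_k 映回 x_{k-1},推理时即可从噪声出发递归多步,而不是像扩散模型那样只学单步还原。

```python
import random

def make_curriculum(x_clean, K, sigma, rng):
    """Build the chain x_0 (clean) ... x_K (noisiest) of intermediate
    states. Each level adds Gaussian noise to the previous one, giving a
    tractable curriculum of one-step denoising targets."""
    chain = [list(x_clean)]
    for _ in range(K):
        prev = chain[-1]
        chain.append([v + rng.gauss(0, sigma) for v in prev])
    return chain

rng = random.Random(0)
chain = make_curriculum([1.0, -2.0, 0.5], K=4, sigma=0.1, rng=rng)
# Training pairs: model input x_k, recursive one-step target x_{k-1}.
pairs = [(chain[k], chain[k - 1]) for k in range(1, len(chain))]
```

这样训练与推理行为一致:两者都是对当前状态反复应用同一个单步还原算子。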

[AI-83] Quantum inspired qubit qutrit neural networks for real time financial forecasting

【速读】:该论文旨在解决传统机器学习模型在股票预测中面临精度不足、训练效率低以及对市场波动适应性差的问题。其解决方案的关键在于引入量子启发式神经网络架构,特别是量子三态(Qutrit)神经网络(QQTN),相较于经典人工神经网络(ANN)和量子比特(Qubit)神经网络(QQBN),QQTN在多个维度展现出显著优势:不仅实现了更高的预测准确性(Sharpe比率提升)、更稳定的预测质量(Information Coefficient一致性增强),还在复杂市场环境下表现出更强的鲁棒性,同时大幅缩短了训练时间。这一创新结构体现了量子计算思想与深度学习融合的潜力,为金融领域实时、高精度预测提供了高效可行的技术路径。

链接: https://arxiv.org/abs/2604.18838
作者: Kanishk Bakshi,Kathiravan Srinivasan
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 16 pages, 7 figures. Published in Scientific Reports (2025)

点击查看摘要

Abstract:This research investigates the performance and efficacy of machine learning models in stock prediction, comparing Artificial Neural Networks (ANNs), Quantum Qubit-based Neural Networks (QQBNs), and Quantum Qutrit-based Neural Networks (QQTNs). By outlining methodologies, architectures, and training procedures, the study highlights significant differences in training times and performance metrics across models. While all models demonstrate robust accuracies above 70%, the Quantum Qutrit-based Neural Network consistently outperforms with advantages in risk-adjusted returns, measured by the Sharpe ratio, greater consistency in prediction quality through the Information Coefficient, and enhanced robustness under varying market conditions. The QQTN not only surpasses its classical and qubit-based counterparts in multiple quantitative and qualitative metrics but also achieves comparable performance with significantly reduced training times. These results showcase the promising prospects of Quantum Qutrit-based Neural Networks in practical financial applications, where real-time processing is critical. By achieving superior accuracy, efficiency, and adaptability, the proposed models underscore the transformative potential of quantum-inspired approaches, paving the way for their integration into computationally intensive fields.
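论文用 Sharpe 比率衡量风险调整后收益,其定义(超额收益均值除以收益标准差)可用几行 Python 复现,收益序列为假设数据:

```python
import math

def sharpe_ratio(returns, risk_free=0.0):
    """Sharpe ratio: mean excess return over the (population) standard
    deviation of returns. Higher means better risk-adjusted performance."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / len(excess)
    return mean / math.sqrt(var)

daily = [0.01, 0.02, -0.005, 0.015, 0.0]  # hypothetical daily returns
sr = sharpe_ratio(daily)
```

实务中常再乘以 sqrt(年化期数) 做年化处理,本草图省略这一步。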

[AI-84] Curvature-Aware PCA with Geodesic Tangent Space Aggregation for Semi-Supervised Learning

【速读】:该论文旨在解决传统主成分分析(Principal Component Analysis, PCA)在处理流形结构数据时的局限性,即其全局线性假设无法捕捉支撑在弯曲流形上的数据内在几何结构;同时,现有流形学习方法虽能建模非线性关系,却常牺牲PCA所具有的谱结构稳定性和可解释性。解决方案的关键在于提出测地切空间聚合主成分分析(Geodesic Tangent Space Aggregation PCA, GTSA-PCA),通过构建基于k近邻图的曲率加权局部协方差算子来获得自适应于流形的局部切子空间,并引入测地对齐算子融合图距离与子空间亲和性以全局同步这些局部表示,最终得到具有几何感知能力的谱分解嵌入,从而在小样本和高曲率场景下显著优于PCA、核主成分分析(Kernel PCA)、监督主成分分析(Supervised PCA)及UMAP等强基线方法。

链接: https://arxiv.org/abs/2604.18816
作者: Alexandre L. M. Levada
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, 8 figures and 7 tables

点击查看摘要

Abstract:Principal Component Analysis (PCA) is a fundamental tool for representation learning, but its global linear formulation fails to capture the structure of data supported on curved manifolds. In contrast, manifold learning methods model nonlinearity but often sacrifice the spectral structure and stability of PCA. We propose \emphGeodesic Tangent Space Aggregation PCA (GTSA-PCA), a geometric extension of PCA that integrates curvature awareness and geodesic consistency within a unified spectral framework. Our approach replaces the global covariance operator with curvature-weighted local covariance operators defined over a k -nearest neighbor graph, yielding local tangent subspaces that adapt to the manifold while suppressing high-curvature distortions. We then introduce a geodesic alignment operator that combines intrinsic graph distances with subspace affinities to globally synchronize these local representations. The resulting operator admits a spectral decomposition whose leading components define a geometry-aware embedding. We further incorporate semi-supervised information to guide the alignment, improving discriminative structure with minimal supervision. Experiments on real datasets show consistent improvements over PCA, Kernel PCA, Supervised PCA and strong graph-based baselines such as UMAP, particularly in small sample size and high-curvature regimes. Our results position GTSA-PCA as a principled bridge between statistical and geometric approaches to dimensionality reduction.
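GTSA-PCA 的基本构件是"在 k 近邻上做局部协方差的谱分解以估计切空间",可用 NumPy 示意如下(假设性草图:省略了曲率加权与全局测地对齐,数据为含噪三维圆,其内在维度为 1):

```python
import numpy as np

def local_tangent_basis(X, i, k, d):
    """Estimate a d-dimensional tangent subspace at point i from its k
    nearest neighbours: top eigenvectors of the local covariance, the
    local operator GTSA-PCA uses in place of the global covariance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(dists)[1:k + 1]          # exclude the point itself
    local = X[nbrs] - X[nbrs].mean(axis=0)
    cov = local.T @ local / k
    vals, vecs = np.linalg.eigh(cov)           # ascending eigenvalues
    return vecs[:, ::-1][:, :d]                # leading d eigenvectors

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
# Noisy unit circle embedded in 3D; the z-axis carries only small noise.
X = np.c_[np.cos(t), np.sin(t), 0.01 * rng.normal(size=200)]
B = local_tangent_basis(X, i=0, k=10, d=1)
```

估计出的切方向几乎落在 xy 平面内(z 分量很小),与圆的几何一致;论文在此之上再做曲率加权与谱同步。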

[AI-85] AI scientists produce results without reasoning scientifically

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的科学代理是否具备符合科学探究自修正特性的推理能力这一关键问题。研究发现,尽管这些代理能够执行科学工作流,但其推理过程普遍缺乏对证据的充分考虑(68%的轨迹中忽略证据)、罕见的基于反驳的信念修正(仅26%)以及多测试证据的收敛性极低,且这些缺陷在不同任务场景下保持一致,即使提供完整成功推理轨迹作为上下文也无法改善。解决方案的关键在于:单纯优化代理架构(scaffold engineering)无法修复此类根本性认知缺陷,唯有将“推理能力”本身作为训练目标,才能使生成的科学知识获得过程合理性保障。

链接: https://arxiv.org/abs/2604.18805
作者: Martiño Ríos-García,Nawaf Alampara,Chandan Gupta,Indrajeet Mandal,Sajid Mannan,Ali Asghar Aghajani,N. M. Anoop Krishnan,Kevin Maik Jablonka
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning pattern appears whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. They persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.

[AI-86] HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在长周期操作任务中系统性失败的问题,尽管其在短周期任务中表现优异。研究指出,单纯延长上下文长度无法解决该问题,根本原因在于执行循环中存在的三个缺陷:记忆缺口(memory gap)、验证缺口(verification gap)和恢复缺口(recovery gap)。解决方案的核心是提出一个与模型无关的框架HELM,其关键创新在于引入一个学习型状态验证器(State Verifier, SV),该模块基于观测、动作、子目标及记忆条件上下文预测动作失败,从而在执行前进行风险判断;SV的有效性高度依赖于对情景记忆模块(Episodic Memory Module, EMM)的访问,且整体框架通过回滚与重规划机制显著提升了长周期任务的成功率,在LIBERO-LONG数据集上将成功率从58.4%提升至81.5%,远超仅扩展上下文窗口或LoRA微调的效果。

链接: https://arxiv.org/abs/2604.18791
作者: Zijian Zeng,Fei Ding,Huiming Yang,Xianwei Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap. We present HELM, a model-agnostic framework that addresses these deficiencies with three components: an Episodic Memory Module (EMM) that retrieves key task history via CLIP-indexed keyframes, a learned State Verifier (SV) that predicts action failure before execution from observation, action, subgoal, and memory-conditioned context, and a Harness Controller (HC) that performs rollback and replanning. The SV is the core learning contribution: it consistently outperforms rule-based feasibility checks and ensemble uncertainty baselines, and its effectiveness depends critically on access to episodic memory. On LIBERO-LONG, HELM improves task success rate by 23.1 percentage points over OpenVLA (58.4% to 81.5%), while extending the context window to H=32 yields only a 5.4-point gain and same-budget LoRA adaptation remains 12.2 points below HELM. HELM also improves long-horizon performance on CALVIN and substantially boosts recovery success under controlled perturbations. Ablations and mechanism analyses isolate the contribution of each component, and we release LIBERO-Recovery as a perturbation-injection protocol for evaluating failure recovery in long-horizon manipulation.
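情景记忆模块(EMM)的检索可示意如下:按与当前观测嵌入的余弦相似度取 top-k 关键帧。此处用玩具二维向量与假设的关键帧名代替论文中的 CLIP 嵌入:

```python
import math

def retrieve(memory, query, top_k=2):
    """Episodic-memory lookup sketch: rank stored keyframes by cosine
    similarity of their embeddings to the current observation's embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    ranked = sorted(memory, key=lambda kv: cos(kv[1], query), reverse=True)
    return [name for name, _ in ranked[:top_k]]

memory = [("grasp_cup", [1.0, 0.0]),
          ("open_drawer", [0.0, 1.0]),
          ("place_cup", [0.9, 0.1])]
hits = retrieve(memory, query=[1.0, 0.05], top_k=2)
```

检索结果随后与观测、动作、子目标一起输入状态验证器(SV),用于执行前的失败预测。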

[AI-87] ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System ACL2026

【速读】:该论文旨在解决强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)中存在的一类关键系统性脆弱性问题,即当奖励模型(Reward Model, RM)无法有效惩罚不安全行为时,会导致整个对齐机制失效。现有红队测试方法主要关注策略层面的弱点,而忽视了核心语言模型与奖励模型同时失效的“系统性弱点”场景。解决方案的关键在于提出 ARES 框架,其核心创新是引入一个“安全导师”(Safety Mentor),通过动态组合结构化提示组件(如主题、人格、策略和目标)生成语义连贯的对抗性提示,并同步产生恶意与安全响应,从而同时暴露模型和奖励模型的双重缺陷;随后采用两阶段修复流程:首先微调奖励模型以提升有害内容识别能力,再利用优化后的奖励模型对核心语言模型进行再优化,最终在多个对抗性安全基准上显著增强安全性鲁棒性,同时保持模型原有能力。

链接: https://arxiv.org/abs/2604.18789
作者: Jiacheng Liang,Yao Ma,Tharindu Kumarage,Satyapriya Krishna,Rahul Gupta,Kai-Wei Chang,Aram Galstyan,Charith Peris
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 9 pages, ACL 2026 Main

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a ``Safety Mentor’’ that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the vulnerabilities gained, ARES implements a two-stage repair process: first fine-tuning the RM to better detect harmful content, then leveraging the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.

[AI-88] Multi-Level Temporal Graph Networks with Local-Global Fusion for Industrial Fault Diagnosis

【速读】:该论文旨在解决工业过程故障诊断中传感器间复杂且多层次的关联结构被传统图神经网络(Graph Neural Networks, GNNs)忽略的问题,尤其是在大规模系统中,局部、全局及动态关系广泛存在,导致故障诊断精度受限。其解决方案的关键在于提出一种结构感知的多层时序图网络(structure-aware multi-level temporal graph network),结合局部-全局特征融合机制:首先基于皮尔逊相关系数动态构建传感器相关性图以捕捉变量间关系;随后利用LSTM编码器提取时序特征,通过图卷积层学习空间依赖;引入多层池化机制逐步粗化图结构以捕获高层模式并保留关键故障信息;最后通过融合局部细节与全局模式进行最终预测,从而显著提升复杂故障场景下的诊断性能。

链接: https://arxiv.org/abs/2604.18765
作者: Bibek Aryal,Gift Modekwe,Qiugang Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fault detection and diagnosis are critical for the optimal and safe operation of industrial processes. The correlations among sensors often display non-Euclidean structures where graph neural networks (GNNs) are widely used therein. However, for large-scale systems, local, global, and dynamic relations extensively exist among sensors, and traditional GNNs often overlook such complex and multi-level structures for various problems including the fault diagnosis. To address this issue, we propose a structure-aware multi-level temporal graph network with local-global feature fusion for industrial fault diagnosis. First, a correlation graph is dynamically constructed using Pearson correlation coefficients to capture relationships among process variables. Then, temporal features are extracted through long short-term memory (LSTM)-based encoder, whereas the spatial dependencies among sensors are learned by graph convolution layers. A multi-level pooling mechanism is used to gradually coarsen and learn meaningful graph structures, to capture higher-level patterns while keeping important fault related details. Finally, a fusion step is applied to combine both detailed local features and overall global patterns before the final prediction. Experimental evaluations on the Tennessee Eastman process (TEP) demonstrate that the proposed model achieves superior fault diagnosis performance, particularly for complex fault scenarios, outperforming various baseline methods.
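按摘要所述,传感器图由皮尔逊相关系数阈值化动态构建,再交给图卷积层学习空间依赖。构图步骤可用如下草图复现(传感器数据与阈值均为假设示例):

```python
import numpy as np

def correlation_graph(X, threshold):
    """Build the sensor adjacency used before the GNN layers: nodes are
    process variables (columns of X), edges connect pairs whose absolute
    Pearson correlation exceeds the threshold."""
    C = np.corrcoef(X.T)                     # variables are rows of X.T
    A = (np.abs(C) > threshold).astype(float)
    np.fill_diagonal(A, 0.0)                 # no self-loops
    return A

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
# Three toy sensors: two strongly coupled, one independent.
X = np.c_[base,
          base + 0.1 * rng.normal(size=(500, 1)),
          rng.normal(size=(500, 1))]
A = correlation_graph(X, threshold=0.8)
```

强耦合的传感器 0 与 1 之间产生边,独立的传感器 2 保持孤立;实际方法中该图随滑动窗口动态更新。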

[AI-89] Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

【速读】:该论文旨在解决多模态机器学习(Multimodal Machine Learning, MML)模型在医疗领域中因数据缺失模态(missing modalities)而导致的建模与可解释性难题。临床数据具有时间序列特性且模态稀疏,如何在训练和部署阶段有效利用不完整模态信息并保持模型透明度是关键挑战。解决方案的关键在于将临床诊断重新建模为自回归序列建模任务,采用大语言模型(Large Language Models, LLMs)中的因果解码器来捕捉患者的多模态轨迹;同时引入一种感知缺失性的对比预训练目标(missingness-aware contrastive pre-training objective),在共享潜在空间中整合存在缺失的多模态数据,从而提升模型鲁棒性和可解释性。实验表明,基于Transformer架构的自回归建模在MIMIC-IV和eICU基准上优于基线方法,并通过可解释技术验证了该策略能缓解因模态缺失导致的行为偏差。

链接: https://arxiv.org/abs/2604.18753
作者: Andrew Wang,Ellie Pavlick,Ritambhara Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. As clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal via diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by re-framing clinical diagnosis as an autoregressive sequence modeling task, utilizing causal decoders from large language models (LLMs) to model a patient’s multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities in datasets with missingness in a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to move beyond performance boosts and find that across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.

[AI-90] Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models

【速读】:该论文旨在解决非线性机器学习模型在时间序列数据中发现因果关系时,其输出结果(尤其是因果得分)常被误 interpretable 为类似于回归系数的统计显著性问题,从而导致对因果关系的错误解读。解决方案的关键在于提出一种基于预测必要性的评估框架:通过系统性地移除候选因果边(edge ablation)并比较预测性能变化,来判断某一因果关系是否真正对准确预测具有必要性,而非依赖于因果得分的大小。这一方法以神经加法向量自回归(Neural Additive Vector Autoregression)为例,在139个国家民主指标的面板数据案例中验证了其有效性,揭示了即使因果得分相近的关系也可能因冗余、时间持续性和制度特异性效应而在预测必要性上存在显著差异,从而提升了高风险场景下生成式 AI 系统中因果推理的可靠性。

链接: https://arxiv.org/abs/2604.18751
作者: Valentina Kuskova,Dmitry Zaytsev,Michael Coppedge
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude, and we present a practical evaluation procedure for doing so. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of panel data - democracy indicators across 139 countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.
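以下用一个线性最小二乘预测器替代论文中的神经自回归模型,给出“预测必要性”检验(edge ablation)思路的极简 Python 示意;数据、函数名与数值均为本文注解的假设,并非论文官方实现:

```python
import numpy as np

def forecast_mse(X, y, keep):
    """Fit a one-step-ahead linear forecaster on the predictor
    columns in `keep` and return its in-sample mean squared error."""
    A = X[:, keep]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

def forecast_necessity(X, y, edge):
    """Relative MSE increase when the candidate causal edge
    (a predictor column) is ablated; larger means more necessary."""
    cols = list(range(X.shape[1]))
    full = forecast_mse(X, y, cols)
    ablated = forecast_mse(X, y, [j for j in cols if j != edge])
    return (ablated - full) / full

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)              # true driver of y
x2 = x1 + 1e-3 * rng.normal(size=n)  # near-duplicate: redundant edge
x3 = rng.normal(size=n)              # irrelevant series
y = 2.0 * x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# x1 truly causes y, yet ablating it barely hurts the forecast
# because the redundant x2 covers for it.
print(round(forecast_necessity(X, y, 0), 4))
print(round(forecast_necessity(X, y, 2), 4))
```

示例中 x1 与 x2 高度冗余,因此即便 x1 是真实原因,单独消融它对预测几乎没有影响——这对应论文强调的“因果得分相近但预测必要性迥异”的冗余情形。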

[AI-91] The Cost of Relaxation: Evaluating the Error in Convex Neural Network Verification

【速读】:该论文旨在解决神经网络验证中因使用凸松弛(convex relaxation)而导致的精度损失问题,即凸松弛方法虽然提升了计算效率,但会引入不可达输出,从而破坏验证的soundness(保真性)。其解决方案的关键在于对原始网络与凸松弛模型之间的输出差异进行理论分析,提出 \ell_\infty-距离的上下界:上界随网络深度呈指数增长、随输入半径呈线性增长;同时发现误分类概率随输入扰动半径呈现阶梯状变化。这一理论框架为评估和改进基于凸松弛的神经网络验证系统提供了量化依据。

链接: https://arxiv.org/abs/2604.18728
作者: Merkouris Papamichail,Konstantinos Varsos,Giorgos Flouris,João Marques-Silva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many neural network (NN) verification systems represent the network’s input-output relation as a constraint program. Sound and complete representations involve integer constraints for simulating the activations. Recent works convexly relax the integer constraints, improving performance, at the cost of soundness. Convex relaxations consider outputs that are unreachable by the original network. We study the worst case divergence between the original network and its convex relaxations; both qualitatively and quantitatively. The relaxations’ space forms a lattice, where the top element corresponds to a full relaxation, with every neuron linearized. The bottom element corresponds to the original network. We provide analytical upper and lower bounds for the \ell_\infty -distance between the fully relaxed and original outputs. This distance grows exponentially, w.r.t. the network’s depth, and linearly w.r.t. the input’s radius. The misclassification probability exhibits a step-like behavior, w.r.t. input radius. Our results are supported by experiments on MNIST, Fashion MNIST and random networks.
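作为直观补充,下面用区间算术(interval arithmetic)充当“完全松弛”的一个更粗糙的替身,演示松弛输出域与原网络(采样估计的)可达域之间的差距如何随深度累积;论文实际分析的是凸松弛格上的 \ell_\infty 距离界,此处仅为示意性假设实现:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_net(depth, width=4):
    return [rng.normal(size=(width, width)) for _ in range(depth)]

def forward(net, x):
    for W in net:
        x = np.maximum(W @ x, 0.0)  # ReLU layer
    return x

def interval_bounds(net, lo, hi):
    """Sound but loose reachable set via interval arithmetic: every
    neuron is over-approximated independently, in the spirit of the
    lattice's fully relaxed top element."""
    for W in net:
        Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
        lo, hi = Wp @ lo + Wn @ hi, Wp @ hi + Wn @ lo
        lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

def relaxation_gap(net, radius, samples=2000):
    """Max slack between relaxed output width and a sampled estimate
    of the exact reachable width."""
    lo0, hi0 = -radius * np.ones(4), radius * np.ones(4)
    lo, hi = interval_bounds(net, lo0, hi0)
    xs = rng.uniform(lo0, hi0, size=(samples, 4))
    ys = np.array([forward(net, x) for x in xs])
    exact_width = ys.max(axis=0) - ys.min(axis=0)
    return float(np.max((hi - lo) - exact_width))

net3, net6 = make_net(3), make_net(6)
# the over-approximation slack typically compounds with depth,
# echoing the paper's exponential-in-depth upper bound
print(relaxation_gap(net3, 0.1), relaxation_gap(net6, 0.1))
```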

[AI-92] Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

【速读】:该论文旨在解决用户在与语言模型(Language Models, LM)交互时,仅基于单个输出进行评估所导致的对模型生成分布结构认知不足的问题。这种单一输出的交互方式掩盖了诸如模式(modes)、罕见边缘案例及对微小提示变化的敏感性等分布特性,从而引发用户在开放式任务中通过少量样本过度泛化,进而影响提示优化效果。解决方案的关键在于提出GROVE——一种交互式可视化工具,将多个语言模型生成结果表示为文本图中的重叠路径,既保留原始输出访问能力,又能揭示共享结构、分支点和聚类特征,从而支持用户更全面地理解生成分布;实验表明,结合图摘要与直接输出检查的混合工作流可分别提升结构判断(如多样性评估)和细节分析的准确性。

链接: https://arxiv.org/abs/2604.18724
作者: Emily Reif,Claire Yang,Jared Hwang,Deniz Nazar,Noah Smith,Jeff Heer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating on prompts for open-ended tasks. Informed by a formative study with researchers who use LMs (n=13) examining when stochasticity matters in practice, how they reason about distributions over language, and where current workflows break down, we introduce GROVE. GROVE is an interactive visualization that represents multiple LM generations as overlapping paths through a text graph, revealing shared structure, branching points, and clusters while preserving access to raw outputs. We evaluate across three crowdsourced user studies (N=47, 44, and 40 participants) targeting complementary distributional tasks. Our results support a hybrid workflow: graph summaries improve structural judgments such as assessing diversity, while direct output inspection remains stronger for detail-oriented questions.
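GROVE 将多条生成结果表示为文本图中的重叠路径;下面是一个刻意简化的 Python 示意(不区分词出现的位置,仅按相邻词对建边),用于说明共享结构与分支点如何从多条生成中浮现,并非论文系统的实现:

```python
from collections import defaultdict

def build_generation_graph(generations):
    """Represent multiple LM generations as overlapping paths through
    a token graph: edges (token_i -> token_{i+1}) carry traversal
    counts, so shared structure shows up as heavy edges and branch
    points as nodes with several distinct successors."""
    succ = defaultdict(lambda: defaultdict(int))
    for text in generations:
        toks = ["<s>"] + text.split()
        for a, b in zip(toks, toks[1:]):
            succ[a][b] += 1
    return succ

def branch_points(succ):
    """Tokens where the sampled generations diverge."""
    return sorted(t for t, nxt in succ.items() if len(nxt) > 1)

gens = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the dog sat on the mat",
]
g = build_generation_graph(gens)
print(branch_points(g))  # tokens where completions diverge
```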

[AI-93] Towards Optimal Agentic Architectures for Offensive Security Tasks

【速读】:该论文旨在解决多智能体安全系统(Agentic security systems)在动态目标审计中协调拓扑结构选择不明确的问题,即当前系统通常固定单一协作架构,无法判断何时增加代理数量能提升检测效果,何时仅会引入额外成本。其解决方案的关键在于构建一个受控基准测试平台(controlled benchmark),包含20个交互式目标(10个Web/API与10个二进制),每个目标暴露一个可访问的漏洞,并在白盒和黑盒模式下评估五类架构家族、三类模型家族共600次运行。实验结果揭示了非单调的成本-质量前沿:更广泛的协同虽可提升覆盖范围,但综合考虑延迟、token消耗及利用验证难度后并不一定占优,且可观测性(observability)和领域特性(domain)是主导因素。

链接: https://arxiv.org/abs/2604.18718
作者: Isaac David,Arthur Gervais
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures, supplementary appendix and benchmark artifacts

点击查看摘要

Abstract:Agentic security systems increasingly audit live targets with tool-using LLMs, but prior systems fix a single coordination topology, leaving unclear when additional agents help and when they only add cost. We treat topology choice as an empirical systems question. We introduce a controlled benchmark of 20 interactive targets (10 web/API and 10 binary), each exposing one endpoint-reachable ground-truth vulnerability, evaluated in whitebox and blackbox modes. The core study executes 600 runs over five architecture families, three model families, and both access modes, with a separate 60-run long-context pilot reported only in the appendix. On the completed core benchmark, detection-any reaches 58.0% and validated detection reaches 49.8%. MAS-Indep attains the highest validated detection rate (64.2%), while SAS is the strongest efficiency baseline at 0.058 per validated finding. Whitebox materially outperforms blackbox (67.0% vs. 32.7% validated detection), and web materially outperforms binary (74.3% vs. 25.3%). Bootstrap confidence intervals and paired target-level deltas show that the dominant effects are observability and domain, while some leading whitebox topologies remain statistically close. The main result is a non-monotonic cost-quality frontier: broader coordination can improve coverage, but it does not dominate once latency, token cost, and exploit-validation difficulty are taken into account.

[AI-94] Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

【速读】:该论文旨在解决传统基于预测误差的内在奖励机制(curiosity reward)在强化学习中因仅关注当前状态转移而忽略世界模型在整个探索过程中累积预测误差的问题,导致探索效率低下且难以区分可减少的不确定性(epistemic error)与不可减少的随机性(aleatoric error)。解决方案的关键在于提出Curiosity-Critic框架,其核心思想是将内在奖励定义为当前预测误差与当前状态转移渐近误差基线(asymptotic error baseline)之间的差值,从而在线估计并动态调整奖励信号。该基线通过一个与世界模型协同训练的“批评者”(critic)网络进行估计,该 critic 仅需回归单个标量输出,在世界模型收敛前即可稳定,有效引导智能体优先探索可学习的转移路径,无需对噪声水平的先验知识,实现了对可减少误差和不可减少误差的在线分离。

链接: https://arxiv.org/abs/2604.18701
作者: Vin Bhaskara,Haicheng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 17 pages, 6 figures, 1 table

点击查看摘要

Abstract:Local prediction-error-based curiosity rewards focus on the current transition without considering the world model’s cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it reduces to a tractable per-step form: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error and visitation-count baselines in convergence speed and final world model accuracy.
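下面给出 Curiosity-Critic 奖励形式 r_t = e_t − baseline(s_t) 的表格化极简示意:critic 只回归一个标量(各状态的渐近误差基线),使仍可学习转移的奖励保持为正、而纯噪声转移的奖励坍缩到零附近;环境与数值均为本文注解的假设:

```python
import numpy as np

class CuriosityCritic:
    """Minimal tabular sketch: intrinsic reward is the current
    world-model prediction error minus a learned per-state asymptotic
    error baseline, so rewards collapse toward zero on irreducibly
    noisy (aleatoric) transitions while staying positive on
    still-learnable (epistemic) ones."""
    def __init__(self, n_states, lr=0.2):
        self.baseline = np.zeros(n_states)  # critic's scalar estimate
        self.lr = lr

    def reward(self, s, pred_error):
        r = pred_error - self.baseline[s]
        # regress the baseline toward the observed error
        self.baseline[s] += self.lr * (pred_error - self.baseline[s])
        return r

rng = np.random.default_rng(0)
cc = CuriosityCritic(n_states=2)

learnable, noisy = [], []
for t in range(200):
    # state 0: error decays as the world model learns (epistemic)
    learnable.append(cc.reward(0, 1.0 * 0.97 ** t))
    # state 1: error stays at the noise floor (aleatoric)
    noisy.append(cc.reward(1, 1.0 + 0.05 * rng.normal()))

# late-phase rewards for the noisy state hover near zero, while the
# learnable state earned clearly positive rewards early on
print(float(np.mean(noisy[-50:])), float(np.mean(learnable[:20])))
```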

[AI-95] Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation ACL2026

【速读】:该论文旨在解决现有针对检索增强生成(Retrieval-Augmented Generation, RAG)系统的干扰攻击大多表现为显式拒绝或服务中断等“硬故障”(hard failure),易被检测的问题;其核心挑战在于如何通过隐蔽手段诱导系统产生看似合理但无实际信息价值的“软故障”(soft failure)。解决方案的关键在于提出一种自动化黑盒攻击框架——欺骗性进化干扰攻击(Deceptive Evolutionary Jamming Attack, DEJA),该框架利用大语言模型(LLM)的安全对齐行为,通过基于细粒度答案效用评分(Answer Utility Score, AUS)的进化优化过程,生成对抗性文档,在保持高检索成功率的同时系统性降低回答的确定性,从而实现高效且隐蔽的软故障诱导。实验表明,DEJA在多个RAG配置和基准数据集上均能稳定触发低效响应,软失败率(SASR)超过79%,而硬失败率低于15%,显著优于现有方法,并具备良好的隐蔽性和跨模型迁移能力。

链接: https://arxiv.org/abs/2604.18663
作者: Wentao Zhang,Yan Zhuang,ZhuHang Zheng,Mingfei Zhang,Jiawen Deng,Fuji Ren
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 22 pages, Accepted to the ACL 2026 Main Conference

点击查看摘要

Abstract:Existing jamming attacks on Retrieval-Augmented Generation (RAG) systems typically induce explicit refusals or denial-of-service behaviors, which are conspicuous and easy to detect. In this work, we formalize a subtler availability threat, termed soft failure, which degrades system utility by inducing fluent and coherent yet non-informative responses rather than overt failures. We propose Deceptive Evolutionary Jamming Attack (DEJA), an automated black-box attack framework that generates adversarial documents to trigger such soft failures by exploiting safety-aligned behaviors of large language models. DEJA employs an evolutionary optimization process guided by a fine-grained Answer Utility Score (AUS), computed via an LLM-based evaluator, to systematically degrade the certainty of answers while maintaining high retrieval success. Extensive experiments across multiple RAG configurations and benchmark datasets show that DEJA consistently drives responses toward low-utility soft failures, achieving SASR above 79% while keeping hard-failure rates below 15%, significantly outperforming prior attacks. The resulting adversarial documents exhibit high stealth, evading perplexity-based detection and resisting query paraphrasing, and transfer across model families to proprietary systems without retargeting.
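DEJA 的核心是以答案效用评分(AUS)为适应度的进化搜索;下面用一个基于关键词计数的“玩具”效用函数替代论文中的 LLM 评估器,示意进化循环如何把注入文档推向低效用的软故障——完全是示意性假设实现,与论文的攻击文档生成无直接对应:

```python
import random

random.seed(0)

FILLERS = ["possibly", "it is unclear", "sources vary", "cannot be confirmed"]

def answer_utility(doc):
    """Toy stand-in for the paper's LLM-based Answer Utility Score:
    utility drops as the injected document hedges more."""
    return max(0.0, 1.0 - 0.2 * sum(doc.count(f) for f in FILLERS))

def mutate(doc):
    return doc + " " + random.choice(FILLERS)

def evolve(seed_doc, generations=10, pop=6):
    """Evolutionary search for a document that drives answer utility
    down (a 'soft failure') while staying fluent-looking."""
    population = [seed_doc]
    for _ in range(generations):
        children = [mutate(random.choice(population)) for _ in range(pop)]
        # keep the lowest-utility individuals (minimize AUS)
        population = sorted(population + children, key=answer_utility)[:pop]
    best = population[0]
    return best, answer_utility(best)

doc, score = evolve("The capital is well documented.")
print(score)  # utility collapses as hedging accumulates
```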

[AI-96] Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在教育场景中作为智能 tutor 时,因默认的“助人倾向”与教学原则冲突而导致的答案泄露(answer leakage)问题,尤其关注学生行为具有对抗性时 tutor 的鲁棒性不足。其关键解决方案是引入一个经过微调的对抗性学生代理(adversarial student agent),该代理专门用于“越狱”(jailbreak)LLM-based tutor,从而构建标准化基准以评估 tutor 在恶意学生攻击下的答案泄露鲁棒性,并提出简单但有效的防御策略来降低答案泄露风险、增强 tutor 在对抗场景中的稳定性。

链接: https://arxiv.org/abs/2604.18660
作者: Jin Zhao,Marta Knežević,Tanja Käser
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: ACL 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used in education, yet their default helpfulness often conflicts with pedagogical principles. Prior work evaluates pedagogical quality via answer leakage-the disclosure of complete solutions instead of scaffolding-but typically assumes well-intentioned learners, leaving tutor robustness under student misuse largely unexplored. In this paper, we study scenarios where students behave adversarially and aim to obtain the correct answer from the tutor. We evaluate a broad set of LLM-based tutor models, including different model families, pedagogically aligned models, and a multi-agent design, under a range of adversarial student attacks. We adapt six groups of adversarial and persuasive techniques to the educational setting and use them to probe how likely a tutor is to reveal the final answer. We evaluate answer leakage robustness using different types of in-context adversarial student agents, finding that they often fail to carry out effective attacks. We therefore introduce an adversarial student agent that we fine-tune to jailbreak LLM-based tutors, which we propose as the core of a standardized benchmark for evaluating tutor robustness. Finally, we present simple but effective defense strategies that reduce answer leakage and strengthen the robustness of LLM-based tutors in adversarial scenarios.

[AI-97] From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

【速读】:该论文旨在解决当前代理型人工智能(Agentic AI)从脆弱原型向生产系统过渡时所面临的“技艺危机”问题,其核心症结在于现有编排范式——即把系统控制环路交由大语言模型(Large Language Models, LLMs)处理,并辅以启发式防护机制——导致了系统的不可靠性。解决方案的关键在于提出Arbiter-K架构,这是一种“治理优先”的执行框架,将底层模型重构为一个由确定性神经符号内核封装的概率处理单元(Probabilistic Processing Unit),并通过语义指令集架构(Semantic Instruction Set Architecture, ISA)将概率性消息显式化为离散指令,从而在运行时构建指令依赖图并维护安全上下文注册表,实现基于数据流谱系的主动污点传播;该机制可在确定性汇点(如高风险工具调用或未授权网络出口)处精准拦截不安全轨迹,并支持触发安全策略时的自主纠错与架构回滚,使安全性成为微架构层面的固有属性。

链接: https://arxiv.org/abs/2604.18652
作者: Xiangyu Wen,Yuang Zhao,Xiaoyu Xu,Lingjun Chen,Changran Xu,Shu Chi,Jianrong Ding,Zeju Li,Haomin Li,Li Jiang,Fangxin Liu,Qiang Xu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The transition of agentic AI from brittle prototypes to production systems is stalled by a pervasive crisis of craft. We suggest that the prevailing orchestration paradigm-delegating the system control loop to large language models and merely patching with heuristic guardrails-is the root cause of this fragility. Instead, we propose Arbiter-K, a Governance-First execution architecture that reconceptualizes the underlying model as a Probabilistic Processing Unit encapsulated by a deterministic, neuro-symbolic kernel. Arbiter-K implements a Semantic Instruction Set Architecture (ISA) to reify probabilistic messages into discrete instructions. This allows the kernel to maintain a Security Context Registry and construct an Instruction Dependency Graph at runtime, enabling active taint propagation based on the data-flow pedigree of each reasoning node. By leveraging this mechanism, Arbiter-K precisely interdicts unsafe trajectories at deterministic sinks (e.g., high-risk tool calls or unauthorized network egress) and enables autonomous execution correction and architectural rollback when security policies are triggered. Evaluations on OpenClaw and NanoBot demonstrate that Arbiter-K enforces security as a microarchitectural property, achieving 76% to 95% unsafe interception for a 92.79% absolute gain over native policies. The code is publicly available at this https URL.
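Arbiter-K 依据数据流谱系在确定性汇点(如高风险工具调用)拦截不安全轨迹;下面用一个极简的污点传播(taint propagation)示意这一思路,指令格式、操作名与寄存器名均为本文注解的假设:

```python
def propagate_taint(instructions, initial_taints):
    """Minimal data-flow taint tracking over an instruction list:
    each instruction names its input registers, an output is tainted
    iff any input is tainted (or it reads an untrusted source), and
    tainted tool calls are interdicted at the sink."""
    taint = dict(initial_taints)  # register -> bool
    blocked = []
    for op, out, ins in instructions:
        t = any(taint.get(r, False) for r in ins)
        if op == "read_untrusted":
            t = True
        if op == "tool_call" and t:   # deterministic sink
            blocked.append(out)       # interdict unsafe trajectory
            continue
        taint[out] = t
    return taint, blocked

program = [
    ("read_untrusted", "r1", []),      # e.g. fetched web content
    ("summarize",      "r2", ["r1"]),  # taint flows through reasoning
    ("read_trusted",   "r3", []),
    ("tool_call",      "t1", ["r2"]),  # blocked: tainted pedigree
    ("tool_call",      "t2", ["r3"]),  # allowed: clean pedigree
]
taint, blocked = propagate_taint(program, {})
print(blocked)
```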

[AI-98] Position: No Retroactive Cure for Infringement during Training

【速读】:该论文旨在解决生成式 AI(Generative AI)在面临日益严峻的法律挑战时,仅依赖事后的缓解措施(如机器遗忘和推理时的防护机制)无法有效规避因非法数据获取与训练所引发的法律责任的问题。其核心论点是:合规性取决于数据来源的可追溯性(data lineage),而非模型输出结果;若训练数据未经许可,即使后续通过过滤或修改模型权重来“净化”输出,仍可能构成侵权,因为模型参数本身可能被视为固定复制件并承载受保护内容的表达价值。解决方案的关键在于从“事后净化”转向“事前过程合规”,即建立可验证的、符合法律规范的数据采集与训练流程,确保从源头上实现合法合规,从而从根本上规避侵权风险。

链接: https://arxiv.org/abs/2604.18649
作者: Satoru Utsunomiya,Masaru Isonuma,Junichiro Mori,Ichiro Sakata
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12pages

点击查看摘要

Abstract:As generative AI faces intensifying legal challenges, the machine learning community has increasingly relied on post-hoc mitigation – especially machine unlearning and inference-time guardrails – to argue for compliance. This paper argues that such post-hoc mitigation methods cannot retroactively cure liability from unlawful acquisition and training, because compliance hinges on data lineage, not the outputs. Our argument has three parts. First, unauthorized copying/ingestion can be a legally complete completed act, and model weights may operate as fixed copies that retain training-derived expressive value, making later filtering beside the point for infringement. Second, contract and tort/unfair-competition rules – via licenses, terms of service, and anti-free-riding principles – can independently restrict access and use, often bypassing copyright defenses (e.g., fair use or TDM exceptions). Third, since value from protected inputs can persist in weights, remedies such as unjust enrichment and disgorgement may require stripping gains and, in some cases, reaching the model itself. We therefore argue for a shift from Post-Hoc Sanitization to verifiable Ex-Ante Process Compliance.

[AI-99] On Solving the Multiple Variable Gapped Longest Common Subsequence Problem

【速读】:该论文旨在解决变间隙最长公共子序列(Variable Gapped Longest Common Subsequence, VGLCS)问题,该问题是对经典最长公共子序列(Longest Common Subsequence, LCS)的扩展,允许在连续字符之间引入灵活的间隙约束,适用于分子序列比对中残基间结构距离限制以及时间序列分析中事件发生的时间延迟要求。解决方案的关键在于提出一种基于根节点状态图(root-based state graph)表示的搜索框架,通过迭代束搜索(iterative beam search)策略动态维护一个全局候选根节点池,有效控制各迭代阶段的多样性,并结合LCS领域中已知的启发式方法提升高质量解的搜索效率,从而在可接受的计算时间内获得稳定且优质的解。

链接: https://arxiv.org/abs/2604.18645
作者: Marko Djukanović,Nikola Balaban,Christian Blum,Aleksandar Kartelj,Sašo Džeroski,Žiga Zebec
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper addresses the Variable Gapped Longest Common Subsequence (VGLCS) problem, a generalization of the classical LCS problem involving flexible gap constraints between consecutive solutions’ characters. The problem arises in molecular sequence comparison, where structural distance constraints between residues must be respected, and in time-series analysis where events are required to occur within specified temporal delays. We propose a search framework based on the root-based state graph representation, in which the state space comprises a generally large number of rooted state subgraphs. To cope with the resulting combinatorial explosion, an iterative beam search strategy is employed, dynamically maintaining a global pool of promising candidate root nodes, enabling effective control of diversification across iterations. To exploit the search for high-quality solutions, several known heuristics from the LCS literature are utilized into the standalone beam search procedure. To the best of our knowledge, this is the first comprehensive computational study on the VGLCS problem comprising 320 synthetic instances with up to 10 input sequences and up to 500 characters. Experimental results show robustness of the designed approach over the baseline beam search in comparable runtimes.
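VGLCS 的可行解可以用带间隙约束的动态规划刻画;下面的 Python 示意把论文允许的“逐位置可变间隙”简化为单一上界 max_gap(论文求解的是大规模实例上的束搜索,此处仅为小规模 DP 示意):

```python
def gapped_lcs(a, b, max_gap):
    """O(n*m*g^2) sketch of the gapped LCS: consecutive matched
    characters may skip at most `max_gap` positions in BOTH strings
    (the VGLCS problem allows per-position variable gaps; a single
    bound keeps this illustration simple)."""
    n, m = len(a), len(b)
    best = 0
    dp = [[0] * m for _ in range(n)]  # LCS length ending at match (i, j)
    for i in range(n):
        for j in range(m):
            if a[i] != b[j]:
                continue
            prev = 0
            # previous match must be within the gap window in both strings
            for pi in range(max(0, i - max_gap - 1), i):
                for pj in range(max(0, j - max_gap - 1), j):
                    prev = max(prev, dp[pi][pj])
            dp[i][j] = prev + 1
            best = max(best, dp[i][j])
    return best

# the unconstrained LCS here is "abc" (length 3), but with a tight gap
# the distant 'c' can no longer follow 'b'
print(gapped_lcs("abxxxc", "abyc", max_gap=1))  # 2
print(gapped_lcs("abxxxc", "abyc", max_gap=4))  # 3
```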

[AI-100] FASE: A Fairness-Aware Spatiotemporal Event Graph Framework for Predictive Policing

【速读】:该论文旨在解决预测性警务系统因仅依据犯罪风险预测分配巡逻资源而导致的种族不平等加剧问题,其核心挑战在于反馈驱动的数据偏差会放大历史执法中的不公平现象。解决方案的关键在于提出FASE(Fairness Aware Spatiotemporal Event Graph)框架,该框架整合了时空犯罪预测与公平约束的巡逻分配,并引入闭环部署反馈模拟器;其中,预测模块采用时空图神经网络结合多变量Hawkes过程建模空间依赖性和自激时间动态,并使用零膨胀负二项分布处理高方差和零值密集的犯罪数据;巡逻分配则通过公平约束线性优化问题实现,在最大化风险加权覆盖率的同时,以Demographic Impact Ratio(DRI)约束控制不同群体间的偏差,使公平性维持在0.9928至1.0262之间,但实验仍发现少数族裔地区检测率比非少数族裔地区低约3.5个百分点,表明仅在分配层面施加公平约束不足以消除再训练数据中的反馈偏差,强调需在整个系统管道中实施公平干预。

链接: https://arxiv.org/abs/2604.18644
作者: Pronob Kumar Barman,Pronoy Kumar Barman,Plaban Kumar Barman,Rohan Mandar Salvi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Predictive policing systems that allocate patrol resources based solely on predicted crime risk can unintentionally amplify racial disparities through feedback driven data bias. We present FASE, a Fairness Aware Spatiotemporal Event Graph framework, which integrates spatiotemporal crime prediction with fairness constrained patrol allocation and a closed loop deployment feedback simulator. We model Baltimore as a graph of 25 ZIP Code Tabulation Areas and use 139,982 Part 1 crime incidents from 2017 to 2019 at hourly resolution, producing a sparse feature tensor. The prediction module combines a spatiotemporal graph neural network with a multivariate Hawkes process to capture spatial dependencies and self exciting temporal dynamics. Outputs are modeled using a Zero Inflated Negative Binomial distribution, suitable for overdispersed and zero heavy crime counts. The model achieves a validation loss of 0.4800 and a test loss of 0.4857. Patrol allocation is formulated as a fairness constrained linear optimization problem that maximizes risk weighted coverage while enforcing a Demographic Impact Ratio constraint with deviation bounded by 0.05. Across six simulated deployment cycles, fairness remains within 0.9928 to 1.0262, and coverage ranges from 0.876 to 0.936. However, a persistent detection rate gap of approximately 3.5 percentage points remains between minority and non minority areas. This result shows that allocation level fairness constraints alone do not eliminate feedback induced bias in retraining data, highlighting the need for fairness interventions across the full pipeline.
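论文把巡逻分配建模为带 Demographic Impact Ratio(DIR)约束的线性优化;下面用“风险比例分配 + 少数族裔区域缩放”的启发式作为该线性规划的简化替身,演示 DIR 如何被压回 1±0.05 的约束带内(区域与数值均为假设):

```python
import numpy as np

def allocate_with_dir(risk, minority, budget, delta=0.05):
    """Sketch of fairness-constrained allocation: start from
    risk-proportional patrols, then rescale the minority-area share so
    the Demographic Impact Ratio (mean per-area exposure, minority vs
    non-minority) stays within 1 +/- delta. A heuristic stand-in for
    the paper's linear program."""
    risk = np.asarray(risk, float)
    minority = np.asarray(minority, bool)
    alloc = budget * risk / risk.sum()

    def dir_(a):
        return a[minority].mean() / a[~minority].mean()

    r = dir_(alloc)
    target = float(np.clip(r, 1 - delta, 1 + delta))
    if r != target:
        alloc[minority] *= target / r   # pull the ratio onto the bound
        alloc *= budget / alloc.sum()   # restore the patrol budget
    return alloc, dir_(alloc)

risk = [5.0, 4.0, 1.0, 1.0]            # first two areas: minority, high risk
minority = [True, True, False, False]
alloc, ratio = allocate_with_dir(risk, minority, budget=100)
print(round(ratio, 3))  # pinned to the 1.05 boundary
```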

[AI-101] Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练中面临的高标注成本与模型崩溃(model collapse)或奖励黑客(reward hacking)等问题。其解决方案的关键在于提出一种受认知学习理论启发的新方法——EasyRL,该方法通过模拟人类认知习得曲线,结合来自少量易标注数据的可靠知识迁移,以及一种渐进式的分而治之(divide-and-conquer)策略来处理日益困难的未标注数据。具体而言,EasyRL首先利用少样本标注数据进行监督式强化学习初始化模型,随后采用一致性选择(consistency-based selection)和反思机制(reflection-based resolution)对高难度未标注数据进行伪标签生成,并通过难度递进的自训练(difficulty-progressive self-training)与迭代伪标签及强化学习进一步提升模型推理能力,从而构建了一个高效、可自我演化的LLMs后训练框架。

链接: https://arxiv.org/abs/2604.18639
作者: Zhiyin Yu,Bo Zhang,Qibin Hou,Zhonghai Wu,Xiao Luo,Lei Bai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Previous LLMs-based RL studies typically follow either supervised learning with high annotation costs, or unsupervised paradigms using voting or entropy-based rewards. However, their performance remains far from satisfactory due to the substantial annotation cost and issues such as model collapse or reward hacking. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL. The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. Specifically, we initialize a warm-up model using supervised RL with few-shot labeled data. This is followed by a divide-and-conquer pseudo-labeling strategy on difficult unlabeled data, combining consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases. Finally, difficulty-progressive self-training with iterative pseudo-labeling and RL further strengthens the model’s reasoning capability. EasyRL provides a unified self-evolving framework that facilitates data-efficient post-training of LLMs. Experimental results on mathematical and scientific benchmarks demonstrate that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.
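EasyRL 分而治之阶段中的“一致性选择”可以用多数投票一致率来示意;以下为假设性的极简实现(问题、答案与阈值均为虚构),低一致率的问题将转交反思机制处理:

```python
from collections import Counter

def select_pseudo_labels(samples_per_question, threshold=0.7):
    """Consistency-based selection (a sketch of EasyRL's
    divide-and-conquer stage): keep the majority answer as a
    pseudo-label only when enough sampled generations agree;
    low-agreement questions are deferred to reflection."""
    accepted, deferred = {}, []
    for qid, answers in samples_per_question.items():
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) >= threshold:
            accepted[qid] = answer
        else:
            deferred.append(qid)
    return accepted, deferred

samples = {
    "q1": ["42", "42", "42", "41"],  # high agreement -> pseudo-label
    "q2": ["7", "9", "7", "12"],     # medium uncertainty -> reflect
}
accepted, deferred = select_pseudo_labels(samples)
print(accepted, deferred)
```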

[AI-102] ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

【速读】:该论文旨在解决大语言模型(LLM)驱动的代码代理在生成高性能GPU内核时面临的瓶颈问题,即现有方法依赖稀疏的成功/失败反馈,难以诊断全局约束违反,导致其性能远低于手工优化库(如矩阵乘法、注意力机制和混合专家MoE等关键计算)。解决方案的关键在于提出Argus框架,其核心创新是引入数据流不变量(data-flow invariants)——一种编译期规范,用于编码数据在内核执行过程中必须遵循的 choreography(编排规则)。Argus通过一个基于分块(tiling)的Python DSL暴露硬件指令与编译器策略,并利用标签函数传播符号注解、标签断言强制关系约束,结合抽象解释与SMT求解实现零运行时开销的编译期验证;当约束被违反时,编译器返回具体的反例(线程、数据元素、程序点),提供密集结构化反馈以支持精准修复。此外,上下文感知强化学习规划器结合GPU优化知识库自动选择优化策略并合成有效不变量,显著提升了生成内核的性能与泛化能力。

链接: https://arxiv.org/abs/2604.18616
作者: Haohui Mai,Xiaoyan Guo,Xiangyun Ding,Daifeng Li,Qiuchu Yu,Chenzhun Guo,Cong Wang,Jiacheng Zhao,Christos Kozyrakis,Binhang Yuan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling, while existing agents rely on sparse pass/fail feedback, leaving them unable to diagnose global constraint violations. We present Argus, an agentic framework that addresses this through data-flow invariants: compile-time specifications encoding how data must be choreographed throughout kernel execution. Argus introduces a tile-based, Pythonic DSL exposing hardware instructions and compiler policies while hiding low-level representations. The DSL provides tag functions to propagate symbolic annotations through data and control flow, and tag assertions to enforce relational constraints at use sites. When violations occur, the compiler returns concrete counterexamples identifying the thread, data element, and program point, enabling dense, structured feedback for targeted fixes. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving, with zero runtime overhead. An in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques. We evaluate Argus on the AMD MI300X GPU across GEMM, flash attention, and MoE kernels accounting for over 90% of GPU time in LLM inference. Generated kernels achieve 99-104% of state-of-the-art hand-optimized assembly throughput and are 2-1543x faster than existing agentic systems. Argus further generalizes to 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems. 

[AI-103] Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models ACL2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中依赖人工设计的静态提示(prompt)而导致性能不稳定、对解码配置敏感以及跨任务迁移能力有限的问题。现有自动提示优化方法多采用单智能体局部搜索策略,难以在统一框架内协同优化提示模板与解码超参数以实现全局稳定提升。其解决方案的关键在于提出 Agent-GWO 框架,将提示模板和解码超参数统一建模为可继承的智能体配置,并利用灰狼优化器(Grey Wolf Optimizer, GWO)的领导者-跟随者机制,动态选择三个领导者智能体(α、β 和 δ)引导其余智能体协同更新,从而迭代收敛至鲁棒的最优推理配置,显著提升准确性和稳定性。

链接: https://arxiv.org/abs/2604.18612
作者: Xudong Wang,Chaoning Zhang,Chenghao Li,Shuxu Chen,Qigan Sun,Jiaquan Zhang,Fachrina Dewi Puspitasari,Tae-Ho Kim,Jiwei Wei,Malu Zhang,Guoqing Wang,Yang Yang,Heng Tao Shen
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2026. 9 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, while recent prompting strategies such as Chain-of-Thought (CoT) have further elevated their performance in handling complex logical problems. Despite these advances, high-quality reasoning remains heavily reliant on manual static prompts and is sensitive to decoding configurations and task distributions, leading to performance fluctuations and limited transferability. Existing automatic prompt optimization methods typically adopt single-agent local search, failing to simultaneously optimize prompts and decoding hyperparameters within a unified framework to achieve stable global improvements. To address this limitation, we propose Agent-GWO, a dynamic prompt optimization framework for complex reasoning. Specifically, we unify prompt templates and decoding hyperparameters as inheritable agent configurations. By leveraging the leader-follower mechanism of the Grey Wolf Optimizer (GWO), we automatically select three leader agents ( \alpha , \beta , and \delta ) to guide the collaborative updates of the remaining agents, enabling iterative convergence toward robust optimal reasoning configurations that can be seamlessly integrated for inference. Extensive experiments on multiple mathematical and hybrid reasoning benchmarks across diverse LLM backbones show that Agent-GWO consistently improves accuracy and stability over existing prompt optimization methods. The code will be released publicly.
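灰狼优化器的领导者-跟随者更新是 Agent-GWO 的骨架;下面在一个玩具连续目标(假想的温度/top-k 配置,最优点与尺度均为虚构)上示意 α、β、δ 三个领导者如何牵引整个种群,并非论文实现:

```python
import numpy as np

rng = np.random.default_rng(0)

def gwo_step(pop, fitness, a):
    """One Grey Wolf Optimizer update: the three best agents
    (alpha, beta, delta) jointly pull every agent, as Agent-GWO's
    leader-follower mechanism does for agent configurations."""
    order = np.argsort([fitness(x) for x in pop])
    leaders = pop[order[:3]]
    new_pop = []
    for x in pop:
        moves = []
        for leader in leaders:
            A = a * (2 * rng.random(x.shape) - 1)  # exploration term
            C = 2 * rng.random(x.shape)
            moves.append(leader - A * np.abs(C * leader - x))
        new_pop.append(np.mean(moves, axis=0))
    return np.array(new_pop)

opt = np.array([0.7, 40.0])  # hypothetical temperature / top-k optimum
fitness = lambda x: float(np.sum(((x - opt) / np.array([1.0, 50.0])) ** 2))

pop = rng.uniform([0.0, 1.0], [2.0, 100.0], size=(12, 2))
init_best = min(fitness(x) for x in pop)
best_f = init_best
for t in range(60):
    pop = gwo_step(pop, fitness, a=2 * (1 - t / 60))  # a decays 2 -> 0
    best_f = min(best_f, min(fitness(x) for x in pop))

print(init_best, best_f)  # best-so-far never worse than the initial best
```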

[AI-104] Neuromorphic Continual Learning for Sequential Deployment of Nuclear Plant Monitoring Systems

【速读】:该论文旨在解决核工业控制系统(Nuclear Industrial Control Systems, NICS)中异常检测面临的两个核心挑战:一是需在多个处于不同调试阶段的子系统上实现持续、低功耗的监控;二是传统神经网络在顺序训练新子系统时会出现灾难性遗忘(catastrophic forgetting),导致对先前学习到的异常模式丧失识别能力。解决方案的关键在于提出首个基于脉冲神经网络(Spiking Neural Network, SNN)的持续学习异常检测系统,其创新点包括:采用基于delta的编码方式将异构传感器流转换为稀疏脉冲序列(输入稀疏度达92.7%),实现高效的异步传感器融合;并通过混合EWC+回放策略(Elastic Weight Consolidation + Experience Replay)显著降低遗忘率(平均遗忘AF = 0.035 ± 0.039),同时在F1分数(0.979)和能效(相比等效人工神经网络减少12.6倍运算量)方面表现优异,满足核设施安全监测对持续性、适应性和低功耗的需求。

链接: https://arxiv.org/abs/2604.18611
作者: Samrendra Roy,Sajedul Talukder,Syed Bahauddin Alam
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Anomaly detection in nuclear industrial control systems (ICS) requires continuous, energy-efficient monitoring across multiple subsystems that are often deployed at different stages of plant commissioning. When a conventional neural network is sequentially trained to monitor new subsystems, it catastrophically forgets previously learned anomaly patterns, a safety-critical failure mode. We present the first spiking neural network (SNN)-based anomaly detection system with continual learning for nuclear ICS, addressing both challenges simultaneously. Our approach introduces spike-encoded asynchronous sensor fusion, a delta-based encoding that converts heterogeneous sensor streams into sparse spike trains at rates dictated by each sensor’s natural dynamics, achieving 92.7% input sparsity. We evaluate five continual learning strategies, including sequential fine-tuning, Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI), experience replay, and a hybrid EWC+Replay approach, on the HAI 21.03 nuclear ICS security dataset across three sequentially deployed subsystems (boiler, turbine, water treatment). The hybrid EWC+Replay method achieves an average F1 score of 0.979 with near-zero average forgetting (AF = 0.000 single seed; 0.035 +/- 0.039 across three seeds), while requiring 12.6x fewer operations (an estimated 2.5x in energy based on published hardware specifications) than an equivalent artificial neural network. The system detects all tested attacks with a mean latency of 0.6 seconds. These results demonstrate that neuromorphic computing offers a viable path toward always-on, energy-efficient, and adaptable safety monitoring for next-generation nuclear facilities.
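论文的 delta 编码只在传感器读数偏离上次发放值超过阈值时发放 ±1 脉冲,从而得到高度稀疏的脉冲序列;以下为该思想的极简 Python 示意(信号与阈值均为本文注解的假设):

```python
import numpy as np

def delta_encode(signal, threshold):
    """Delta-based spike encoding (a sketch of the paper's scheme):
    emit a +1/-1 spike only when the value moves more than
    `threshold` away from the last spiked reference, so slowly
    varying channels stay silent and the spike train is sparse."""
    spikes = np.zeros(len(signal), dtype=int)
    ref = signal[0]
    for t, v in enumerate(signal):
        if v - ref > threshold:
            spikes[t], ref = 1, v
        elif ref - v > threshold:
            spikes[t], ref = -1, v
    return spikes

t = np.linspace(0, 1, 200)
# one step change at t=0.5 riding on small sensor noise
signal = np.where(t < 0.5, 0.0, 1.0) + 0.01 * np.sin(40 * t)
spikes = delta_encode(signal, threshold=0.1)
sparsity = 1.0 - np.count_nonzero(spikes) / len(spikes)
print(np.count_nonzero(spikes), round(sparsity, 3))
```

这里 200 个时间步中只有阶跃处产生一个脉冲,其余小幅噪声全部被阈值吸收,对应论文报告的高输入稀疏度(92.7%)背后的机制。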

[AI-105] SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理阶段计算开销和能耗过高,限制其在资源受限环境部署的问题。现有基于脉冲神经网络(Spiking Neural Networks, SNNs)的方案面临两大挑战:异构模态导致统一脉冲编码不足,以及高分辨率图像输入加剧时间步展开的计算负担。解决方案的关键在于提出首个面向MLLMs的脉冲框架SpikeMLLM,通过在脉冲表示空间中统一现有ANN量化方法,并引入由模态演化差异(Modality Evolution Discrepancy, MED)引导的模态特异性时序尺度(Modality-Specific Temporal Scales, MSTS),结合时序压缩LIF(Temporally Compressed LIF, TC-LIF)机制,将时间步从T=L−1压缩至T=log₂(L)−1。实验表明,该方法在极端时间步压缩下仍保持近无损性能(InternVL2-8B与Qwen2VL-72B平均相对差距分别为0.72%和1.19%),并进一步设计专用RTL加速器,在算法-硬件协同设计下实现比FP16 GPU基线9.06倍更高的吞吐量和25.8倍更好的能效比。

链接: https://arxiv.org/abs/2604.18610
作者: Han Xu,Zhiyong Qin,Di Shang,Jiahong Zhang,Xuerui Qiu,Bo Lei,Tiejun Huang,Bo Xu,Guoqi Li
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress but incur substantial computational overhead and energy consumption during inference, limiting deployment in resource-constrained environments. Spiking Neural Networks (SNNs), with their sparse event-driven computation, offer inherent energy efficiency advantages on neuromorphic hardware, yet extending them to MLLMs faces two key challenges: heterogeneous modalities make uniform spike encoding insufficient, and high-resolution image inputs amplify timestep unfolding overhead. We propose SpikeMLLM, the first spike-based framework for MLLMs, which unifies existing ANN quantization methods in the spiking representation space and incorporates Modality-Specific Temporal Scales (MSTS) guided by Modality Evolution Discrepancy (MED) and Temporally Compressed LIF (TC-LIF) for timestep compression from T=L-1 to T=log2(L)-1. Experiments on four representative MLLMs across diverse multimodal benchmarks show that SpikeMLLM maintains near-lossless performance under aggressive timestep compression (Tv/Tt=3/4), with average gaps of only 0.72% and 1.19% relative to the FP16 baseline on InternVL2-8B and Qwen2VL-72B. We further develop a dedicated RTL accelerator tailored to the spike-driven datapath, observing 9.06x higher throughput and 25.8x better power efficiency relative to an FP16 GPU baseline under a deployment-oriented co-design setting, suggesting the promise of algorithm-hardware co-design for efficient multimodal intelligence.
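The timestep saving can be illustrated with a counting argument: a thermometer (unary) spike code needs L-1 timesteps to represent L quantization levels, while a binary-weighted code needs only ceil(log2 L). The actual TC-LIF mechanism is not specified here; this bit-plane sketch is an assumption that only conveys the logarithmic reduction:

```python
import math

def unary_spikes(level, L):
    """Thermometer code: one spike per unit, needs L-1 timesteps."""
    return [1] * level + [0] * (L - 1 - level)

def binary_spikes(level, L):
    """Binary-weighted code: needs only ceil(log2(L)) timesteps."""
    T = math.ceil(math.log2(L))
    return [(level >> t) & 1 for t in range(T)]

L = 8                      # number of quantization levels
lvl = 5
u = unary_spikes(lvl, L)   # 7 timesteps
b = binary_spikes(lvl, L)  # 3 timesteps
# decode the binary train back to the level to check losslessness
decoded = sum(bit << t for t, bit in enumerate(b))
print(len(u), len(b), decoded)
```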

[AI-106] TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的程序演化过程中存在的样本效率低和运行间方差大问题,这些问题限制了其在固定评估预算下的可靠进展。解决方案的关键在于提出一种多岛进化框架 TurboEvolve,其核心创新包括:1)引入“语义化采样”(verbalized Sampling),通过提示LLM生成K个多样化候选解并赋予显式的自定义采样权重,从而提升探索效率;2)设计在线调度器动态调整K值,在停滞阶段扩大探索范围、在稳定阶段降低计算开销;3)提出“种子池注入”(seed-pool injection)策略,通过对现有解池进行聚类并施加受控扰动与精英保留机制,实现多样性与精炼性的平衡。该方法在多个程序优化基准上均实现了更低预算下的更强性能,并在若干任务中刷新了最优解记录。

链接: https://arxiv.org/abs/2604.18607
作者: Yang Yang,Zining Zhong,Jindong Li,Jiemin Wu,Kaishen Yuan,Wenshuo Chen,Menglin Yang,Yutao Yue
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:LLM-driven program evolution can discover high-quality programs, but its cost and run-to-run variance hinder reliable progress. We propose TurboEvolve, a multi-island evolutionary framework that improves sample efficiency and robustness under fixed evaluation budgets. Inspired by the multiple-offspring strategy in evolutionary algorithms, TurboEvolve introduces verbalized Sampling, prompting the LLM to emit K diverse candidates with explicit self-assigned sampling weights, and an online scheduler that adapts K to expand exploration under stagnation and reduce overhead during steady progress. To exploit existing solution pools, we further propose “seed-pool injection,” which clusters seeds and assigns them across islands with controlled perturbations and elitist preservation to balance diversity and refinement. Across multiple program-optimization benchmarks, TurboEvolve consistently achieves stronger performance at lower budgets and improves best-known solutions on several tasks.
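Once the LLM has emitted its K proposals with self-assigned weights, the verbalized-sampling step reduces to weighted sampling over those candidates. The candidate names and weights below are hypothetical:

```python
import random

def sample_candidates(candidates, k, seed=0):
    """Draw k candidates according to their self-assigned weights
    (hypothetical scores an LLM attached to each proposal)."""
    rng = random.Random(seed)
    names = [c["name"] for c in candidates]
    weights = [c["weight"] for c in candidates]
    return rng.choices(names, weights=weights, k=k)

# hypothetical LLM output: K diverse program edits with sampling weights
pool = [
    {"name": "loop-unroll",  "weight": 0.5},
    {"name": "cache-blocks", "weight": 0.3},
    {"name": "vectorize",    "weight": 0.2},
]
picked = sample_candidates(pool, k=4)
print(picked)
```

The online scheduler described in the abstract would then grow or shrink `k` depending on whether the evolutionary search is stagnating.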

[AI-107] Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在形式化定理证明中因测试时计算开销过大而导致的可扩展性瓶颈问题,尤其是当前高性能方法依赖于大量推理轮次或超长上下文窗口所引发的资源消耗。其解决方案的关键在于利用形式验证中的结构化信息:编译器将多样化的证明尝试映射到一组紧凑的结构化失败模式,从而实现对错误的局部修正。作者提出一种“学习-精炼”(learning-to-refine)框架,通过树搜索结合显式验证器反馈进行局部纠错,避免积累冗长的证明历史,显著提升基础证明器的推理能力,并在PutnamBench基准上实现了参数规模约为8B和32B的模型中的最先进性能,同时保持可控的测试时预算。

链接: https://arxiv.org/abs/2604.18587
作者: Guchan Li,Rui Tian,Hongning Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported ~8B and ~32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.
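The compression the paper exploits, many distinct proof attempts collapsing to a few structured failure modes, can be mimicked by normalizing away locations and identifiers in error messages. This regex bucketing is a simplified stand-in, not the paper's method:

```python
import re
from collections import defaultdict

def failure_mode(error_msg):
    """Map a raw checker error to a coarse failure-mode key by
    stripping line numbers and concrete identifiers."""
    msg = re.sub(r"\d+", "<n>", error_msg)    # drop line/column numbers
    msg = re.sub(r"'[^']*'", "'<id>'", msg)   # drop concrete names
    return msg

# hypothetical error messages from many different proof attempts
errors = [
    "line 12: unknown identifier 'foo'",
    "line 98: unknown identifier 'bar'",
    "line 4: type mismatch at 'h'",
]
buckets = defaultdict(list)
for e in errors:
    buckets[failure_mode(e)].append(e)
print({k: len(v) for k, v in buckets.items()})
```

Three distinct errors collapse into two failure modes, so a refinement policy can condition on the compact mode rather than the full error history.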

[AI-108] A neural operator framework for data-driven discovery of stability and receptivity in physical systems

【速读】:该论文旨在解决复杂系统在受到扰动时稳定性分析与敏感模式识别的问题,尤其针对传统稳定性分析和受迫响应(resolvent)分析依赖已知控制方程和线性化假设、难以应用于非线性或建模不充分系统这一局限。其解决方案的关键在于提出一种数据驱动框架,仅通过观测数据即可自动识别系统的稳定性特性与最优激励响应:首先利用神经网络作为动力学模拟器(dynamics emulator),再借助自动微分(automatic differentiation)提取其雅可比矩阵(Jacobian),从而直接从数据中计算出特征模态(eigenmodes)和响应模态(resolvent modes)。该方法成功应用于典型混沌模型和高维流体流动场景,在强非线性条件下仍能准确识别主导不稳定模态和输入-输出结构,实现了无需解析方程的非线性动力学建模与复杂动态模式解析。

链接: https://arxiv.org/abs/2604.19465
作者: Chengyun Wang,Liwei Chen,Nils Thuerey
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
备注: 30 pages, 10 figures

点击查看摘要

Abstract:Understanding how complex systems respond to perturbations, such as whether they will remain stable or what their most sensitive patterns are, is a fundamental challenge across science and engineering. Traditional stability and receptivity (resolvent) analyses are powerful but rely on known equations and linearization, limiting their use in nonlinear or poorly modeled systems. Here, we introduce a data-driven framework that automatically identifies stability properties and optimal forcing responses from observation data alone, without requiring governing equations. By training a neural network as a dynamics emulator and using automatic differentiation to extract its Jacobian, we can compute eigenmodes and resolvent modes directly from data. We demonstrate the method on both canonical chaotic models and high-dimensional fluid flows, successfully identifying dominant instability modes and input-output structures even in strongly nonlinear regimes. By leveraging a neural network-based emulator, we readily obtain a nonlinear representation of system dynamics while additionally retrieving intricate dynamical patterns that were previously difficult to resolve. This equation-free methodology establishes a broadly applicable tool for analyzing complex, high-dimensional datasets, with immediate relevance to grand challenges in fields such as climate science, neuroscience, and fluid engineering.
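The core recipe (differentiate a trained emulator to obtain a Jacobian, then eigendecompose it) can be sketched with finite differences in place of automatic differentiation, using a toy 2-D linear map as a stand-in for the trained network:

```python
import math

def jacobian_2x2(f, x, eps=1e-6):
    """Central-difference Jacobian of a 2-D map f at state x
    (a stand-in for autodiff through a trained emulator)."""
    cols = []
    for i in range(2):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        fp, fm = f(xp), f(xm)
        cols.append([(fp[j] - fm[j]) / (2 * eps) for j in range(2)])
    # cols[i][j] = d f_j / d x_i; transpose so J[j][i] = d f_j / d x_i
    return [[cols[i][j] for i in range(2)] for j in range(2)]

def eigvals_2x2(J):
    """Eigenvalues of a 2x2 matrix via trace/determinant (real case)."""
    (a, b), (c, d) = J
    tr, det = a + d, a * d - b * c
    r = math.sqrt(tr * tr - 4 * det)   # assumes a real spectrum
    return (tr + r) / 2, (tr - r) / 2

# toy "emulator": a stable discrete-time linear map
f = lambda x: (0.9 * x[0] + 0.1 * x[1], 0.2 * x[0] + 0.5 * x[1])
J = jacobian_2x2(f, [0.3, -0.2])
lam1, lam2 = eigvals_2x2(J)
stable = max(abs(lam1), abs(lam2)) < 1.0   # inside the unit circle
print(J, lam1, lam2, stable)
```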

[AI-109] Multimodal Transformer for Sample-Aware Prediction of Metal-Organic Framework Properties

【速读】:该论文旨在解决金属有机框架材料(Metal-organic frameworks, MOFs)在机器学习属性预测中普遍存在的“框架感知”局限性问题,即现有模型通常假设同一MOF结构仅对应单一属性值,而忽略了实验样品因结晶度、相纯度、缺陷等样本相关因素导致的性能差异。解决方案的关键在于提出一种多模态Transformer模型——Experimental X-ray Diffraction Integrated Transformer (EXIT),其核心创新是将MOF身份编码(MOFid)与实验X射线衍射(XRD)信号相结合,从而实现对样品状态的感知;EXIT通过百万个虚拟MOFs及其模拟XRD数据进行预训练以学习可迁移表征,并在实验数据集上微调用于比表面积和孔体积预测,显著提升了预测精度,且注意力分析和案例研究证实其能区分具有相同MOFid但XRD模式不同的样品,标志着从框架感知向样本感知的实质性进展。

链接: https://arxiv.org/abs/2604.19383
作者: Seunghee Han,Jaewoong Lee,Jihan Kim
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures

点击查看摘要

Abstract:Metal-organic frameworks (MOFs) are a major target of machine-learning-based property prediction, yet most models assume that a single framework representation maps to a single property value. This assumption becomes problematic for experimental MOFs, where samples reported as the same framework can exhibit different properties because of differences in crystallinity, phase purity, defects, and other sample-dependent factors. Here we introduce Experimental X-ray Diffraction Integrated Transformer (EXIT), a multimodal transformer for sample-aware prediction of MOF properties that combines MOFid with X-ray diffraction (XRD). In EXIT, MOFid encodes MOF identity, whereas XRD provides complementary information about the experimentally realized sample state. EXIT is pre-trained on one million hypothetical MOFs with simulated XRD to learn transferable representations, leading to improved downstream performance relative to existing approaches. EXIT is fine-tuned on literature-derived experimental datasets for surface area and pore volume prediction. Incorporating experimental XRD improves predictive performance relative to models without experimental XRD, and attention analysis and sample-level case studies further show that EXIT assigns different predictions to samples sharing the same MOF identity when their XRD patterns differ. These results establish a practical step from framework-aware to sample-aware MOF property prediction and highlight the value of incorporating experimental characterization into porous materials informatics.

[AI-110] OmniMouse: Scaling properties of multi-modal multi-task Brain Models on 150B Neural Tokens ICLR2026

【速读】:该论文旨在解决如何有效建模大脑活动的问题,特别是在小鼠视觉皮层中,从大规模神经元数据中提取可泛化的规律。其核心挑战在于:尽管已有海量神经记录(3.1百万神经元、1500亿个神经元标记),但传统深度学习模型在脑活动预测中的性能提升受限于数据规模而非模型参数量。解决方案的关键在于构建一个多模态、多任务的统一模型 OmniMouse,该模型能够在测试时灵活切换三种功能:神经活动预测(neural prediction)、行为解码(behavioral decoding)和神经活动预报(neural forecasting),并支持任意组合。实验表明,OmniMouse 在几乎所有评估场景下均优于专用基线模型,且性能随数据量增加而稳定提升,但模型规模扩大带来的收益趋于饱和——这揭示了脑建模与语言和视觉领域不同的"数据受限"特性,提示未来更大规模的数据可能引发神经建模中的相变现象,从而解锁新的能力。

链接: https://arxiv.org/abs/2604.18827
作者: Konstantin F. Willeke,Polina Turishcheva,Alex Gilbert,Goirik Chakrabarty,Hasan A. Bedel,Paul G. Fahey,Yongrong Qiu,Marissa A. Weis,Michaela Vystrčilová,Taliah Muhammad,Lydia Ntanavara,Rachel E. Froebe,Kayla Ponder,Zheng Huan Tan,Emin Orhan,Erick Cobos,Sophia Sanborn,Katrin Franke,Fabian H. Sinz,Alexander S. Ecker,Andreas S. Tolias
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: Published at ICLR2026

点击查看摘要

Abstract:Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling – even in the mouse visual cortex, a relatively simple system – models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at this https URL.

[AI-111] Skillful Global Ocean Emulation and the Role of Correlation-Aware Loss

【速读】:该论文旨在解决全球海洋动力系统在中短期预报中的精度与效率问题,特别是如何利用机器学习方法构建高效且高技能的海洋仅模拟器(ocean-only emulator),以替代传统数值模式在计算资源和预测时效上的局限。其解决方案的关键在于:1)将GraphCast架构适配为仅依赖于预设大气条件驱动的海洋模拟器,并基于NOAA UFS-Replay数据集进行训练;2)采用24小时时间步长、单初始条件且不使用自回归训练策略,实现无需复杂迭代即可获得10–15天有效预报;3)引入马哈拉诺比斯距离(Mahalanobis distance)作为损失函数,显式考虑目标变量变化率之间的相关性,从而提升预测性能,同时作为统计-动力正则化项,改善全球海洋慢速耦合动力过程的背景场质量,有利于后续如数据同化等下游任务。

链接: https://arxiv.org/abs/2604.18727
作者: Niraj Agarwal,Timothy A. Smith,Sergey Frolov,Laura C. Slivinski
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Machine learning emulators have shown extraordinary skill in forecasting atmospheric states, and their application to global ocean dynamics offers similar promise. Here, we adapt the GraphCast architecture into a dedicated ocean-only emulator, driven by prescribed atmospheric conditions, for medium-range predictions. The emulator is trained on NOAA’s UFS-Replay dataset. Using a 24 hour time step, single initial condition, and without using autoregressive training, we produce an emulator that provides skillful forecasts for 10-15 day lead times. We further demonstrate the use of Mahalanobis distance as loss that improves the forecast skill compared to the Mean Squared Error loss by explicitly accounting for the correlations between tendencies of the target variables. Using spatial correlation analysis of the forecasted fields, we also show that the proposed correlation-aware loss acts as a statistical-dynamical regularizer for the slow, correlated dynamics of the global oceans, offering a better background forecast for downstream tasks like data assimilation.
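A minimal sketch of why a correlation-aware loss differs from MSE: with a hand-coded 2x2 error covariance, the squared Mahalanobis distance penalizes an error that violates the correlation structure far more than one consistent with it, while MSE treats both identically (all numbers are illustrative):

```python
def mahalanobis_sq(err, cov_inv):
    """Squared Mahalanobis distance err^T Sigma^{-1} err for 2-D errors."""
    a, b = err
    (m00, m01), (m10, m11) = cov_inv
    return a * (m00 * a + m01 * b) + b * (m10 * a + m11 * b)

def inv_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

# illustrative covariance of two positively correlated tendency errors
cov = ((1.0, 0.8), (0.8, 1.0))
cov_inv = inv_2x2(cov)

aligned = (1.0, 1.0)    # error consistent with the correlation
opposed = (1.0, -1.0)   # error violating the correlation
mse = lambda e: e[0] ** 2 + e[1] ** 2
print(mse(aligned), mse(opposed))                         # identical
print(mahalanobis_sq(aligned, cov_inv),
      mahalanobis_sq(opposed, cov_inv))                   # very different
```

Weighting correlation-violating errors more heavily is what lets the loss act as the statistical-dynamical regularizer described in the abstract.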

[AI-112] NeuroAI and Beyond: Bridging Between Advances in Neuroscience and Artificial Intelligence

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)在与物理世界交互能力不足、学习机制导致系统脆弱性以及能耗和数据效率低下的三大核心瓶颈问题。其解决方案的关键在于引入神经科学(Neuroscience)的原理,包括:身体与控制器协同设计(co-design of body and controller)、通过交互进行预测(prediction through interaction)、多尺度学习与神经调制控制(multi-scale learning with neuromodulatory control)、分层分布式架构(hierarchical distributed architectures)以及稀疏事件驱动计算(sparse event-driven computation)。这些原则共同构成了一条面向近、中、长期的研究路线图,并强调需培养跨神经科学与工程领域的新型研究人才,以推动神经启发式人工智能(NeuroAI)的发展,从而突破现有AI局限并深化对生物神经计算机制的理解。

链接: https://arxiv.org/abs/2604.18637
作者: Anthony Zador,Jean-Marc Fellous,Terrence Sejnowski,Gina Adam,James B Aimone,Akwasi Akwaboah,Yiannis Aloimonos,Carmen Amo Alonso,Chiara Bartolozzi,Michael J. Bennington,Michael Berry,Bing W. Brunton,Gert Cauwenberghs,Hillel J. Chiel,Tobi Delbruck,John Doyle,Jason Eshraghian,Ralph Etienne-Cummings,Cornelia Fermuller,Matthew Jacobsen,Ali A. Minai,Barbara Oakley,Alexander G. Ororbia II,Joe Paton,Blake Richards,Yulia Sandamirskaya,Abhronil Sengupta,Shihab Shamma,Michael P. Stryker,Seong Jong Yoo,Steven W. Zucker
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Neuroscience and Artificial Intelligence (AI) have made impressive progress in recent years but remain only loosely interconnected. Based on a workshop convened by the National Science Foundation in August 2025, we identify three fundamental capability gaps in current AI: the inability to interact with the physical world, inadequate learning that produces brittle systems, and unsustainable energy and data inefficiency. We describe the neuroscience principles that address each: co-design of body and controller, prediction through interaction, multi-scale learning with neuromodulatory control, hierarchical distributed architectures, and sparse event-driven computation. We present a research roadmap organized around these principles at near, mid, and long-term horizons. We argue that realizing this program requires a new generation of researchers trained across the boundary between neuroscience and engineering, and describe the institutional conditions: interdisciplinary training, hardware access, community standards, and ethics, needed to support them. We conclude that NeuroAI, neuroscience-informed artificial intelligence, has the potential to overcome limitations of current AI while deepening our understanding of biological neural computation.

[AI-113] Thermal Anomaly Detection using Physics Aware Neuromorphic Networks: Comparison between Raw and L1C Sentinel-2 Data

【速读】:该论文旨在解决地球观测(Earth Observation, EO)中热异常检测的实时性与可靠性问题,尤其是在野火和火山喷发等灾害场景下,因检测延迟导致损害加剧的挑战。现有方法依赖于高计算成本的预处理流程(如Level-1C产品),难以满足星上实时处理需求,且受限于传感器漂移、辐射不一致性及标注样本稀缺等问题。其解决方案的关键在于提出一种物理感知类神经网络(Physics-Aware Neuromorphic Network, PANN)框架,该架构融合物理神经网络原理与类脑计算范式,直接在未压缩的Level-0(L0)原始数据上进行轻量化热异常检测,显著降低延迟并提升鲁棒性。实验表明,PANN在原始L0数据上的MCC达0.809,处理延迟仅为2.44±0.09毫秒,低于Sentinel-2单景采集时间(3.6毫秒),且硬件部署后延迟进一步降至0.129毫秒,内存占用符合星载约束,验证了其在低延迟、资源高效星上处理中的可行性。

链接: https://arxiv.org/abs/2604.18606
作者: Stephen Smith,Cormac Purcell,Gabriele Meoni,Roberto Del Prete,Zdenka Kuncic
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Damage caused by bushfires and volcanic eruptions escalates rapidly when detection is delayed, making fast and reliable early warning capabilities essential. Recent Earth Observation (EO) approaches have shown that thermal anomaly detection can be performed directly on decompressed Level-0 (L0) sensor data, avoiding computationally expensive preprocessing chains. However, direct exploitation of raw data remains challenging due to domain shift, sensor drift, radiometric inconsistencies, and the scarcity of labelled training samples. To address these challenges, this work proposes a Physics-Aware Neuromorphic Network (PANN) framework for onboard thermal anomaly detection. The proposed lightweight architecture, inspired by physical neural network principles and neuromorphic computing paradigms, is evaluated using two Sentinel-2 datasets: decompressed L0 with additional metadata (i.e. raw) and Level-1C (L1C). The PANN achieves a Matthews Correlation Coefficient (MCC) of 0.809 on raw measurements, compared to 0.875 when using ground-processed L1C products. The mean processing latency per L0 granule is 2.44 ± 0.09 s, which is below the Sentinel-2 acquisition time of 3.6 s, demonstrating the feasibility of real-time, onboard processing. Furthermore, the projected execution time for the corresponding neuromorphic hardware instantiation is substantially lower at 0.1290 ± 0.0002 s. Memory usage, including all necessary programs and packages, remains within realistic onboard constraints, with requirements of 0.673 ± 0.007 Gb for the software PANN and 0.393 ± 0.004 Gb for the estimated hardware realisation. Overall, these results indicate that PANN offers a promising pathway toward low-latency and resource-efficient onboard EO processing for thermal event detection.
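The MCC figures above come directly from confusion-matrix counts; with hypothetical counts (not the paper's data) the metric is computed as follows:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# illustrative counts for an anomaly detector with imbalanced classes
score = mcc(tp=90, tn=900, fp=10, fn=10)
print(f"MCC = {score:.3f}")
```

MCC is popular for thermal-anomaly benchmarks precisely because, unlike accuracy, it stays informative when anomalous pixels are rare.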

机器学习

[LG-0] Safe Continual Reinforcement Learning in Non-stationary Environments

链接: https://arxiv.org/abs/2604.19737
作者: Austin Coursey,Abel Diaz-Gonzalez,Marcos Quinones-Grueiro,Gautam Biswas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most existing control-oriented RL methods assume stationarity and, therefore, struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. Moreover, RL controllers acting in physical environments must satisfy safety constraints throughout their learning and execution phases, rendering transient violations during adaptation unacceptable. Although continual RL and safe RL have each addressed non-stationarity and safety, respectively, their intersection remains comparatively unexplored, motivating the study of safe continual RL algorithms that can adapt over the system’s lifetime while preserving safety. In this work, we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both objectives simultaneously. To address this shortcoming, we examine regularization-based strategies that partially mitigate this trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward developing safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.

[LG-1] FB-NLL: A Feature-Based Approach to Tackle Noisy Labels in Personalized Federated Learning

链接: https://arxiv.org/abs/2604.19729
作者: Abdulmoneam Ali,Ahmed Arafa
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注: Submitted for journal publication

点击查看摘要

Abstract:Personalized Federated Learning (PFL) aims to learn multiple task-specific models rather than a single global model across heterogeneous data distributions. Existing PFL approaches typically rely on iterative optimization-such as model update trajectories-to cluster users that need to accomplish the same tasks together. However, these learning-dynamics-based methods are inherently vulnerable to low-quality data and noisy labels, as corrupted updates distort clustering decisions and degrade personalization performance. To tackle this, we propose FB-NLL, a feature-centric framework that decouples user clustering from iterative training dynamics. By exploiting the intrinsic heterogeneity of local feature spaces, FB-NLL characterizes each user through the spectral structure of the covariances of their feature representations and leverages subspace similarity to identify task-consistent user groupings. This geometry-aware clustering is label-agnostic and is performed in a one-shot manner prior to training, significantly reducing communication overhead and computational costs compared to iterative baselines. Complementing this, we introduce a feature-consistency-based detection and correction strategy to address noisy labels within clusters. By leveraging directional alignment in the learned feature space and assigning labels based on class-specific feature subspaces, our method mitigates corrupted supervision without requiring estimation of stochastic noise transition matrices. In addition, FB-NLL is model-independent and integrates seamlessly with existing noise-robust training techniques. Extensive experiments across diverse datasets and noise regimes demonstrate that our framework consistently outperforms state-of-the-art baselines in terms of average accuracy and performance stability. 
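The geometry-aware clustering idea, comparing users through the spectral structure of their feature covariances, can be sketched in 2-D by extracting each user's dominant covariance direction and comparing directions by cosine similarity; this is a simplified stand-in for the paper's subspace similarity:

```python
import math

def cov2x2(xs, ys):
    """Sample covariance entries of a 2-D feature cloud."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return sxx, sxy, syy

def dominant_direction(sxx, sxy, syy):
    """Unit eigenvector of the top eigenvalue of [[sxx,sxy],[sxy,syy]]."""
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    lam = (tr + math.sqrt(max(tr * tr - 4 * det, 0.0))) / 2
    vx, vy = sxy, lam - sxx
    if vx == 0 and vy == 0:          # covariance already diagonal along x
        return 1.0, 0.0
    n = math.hypot(vx, vy)
    return vx / n, vy / n

def direction_similarity(u, v):
    """|cosine| between dominant directions (sign-invariant)."""
    return abs(u[0] * v[0] + u[1] * v[1])

# u1 and u2 vary along the same feature axis (same task), u3 differently
u1 = dominant_direction(*cov2x2([0, 1, 2, 3], [0, 1, 2, 3]))
u2 = dominant_direction(*cov2x2([0, 2, 4, 6], [0, 2, 4, 6]))
u3 = dominant_direction(*cov2x2([0, 0, 0, 0], [0, 1, 2, 3]))
print(direction_similarity(u1, u2), direction_similarity(u1, u3))
```

Users u1 and u2 get similarity 1.0 despite different scales, so a one-shot, label-agnostic grouping can cluster them together before any training starts.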

[LG-2] Ultrametric OGP - parametric RDT *symmetric binary perceptron* connection

链接: https://arxiv.org/abs/2604.19712
作者: Mihailo Stojnic
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In [97,99,100], an fl-RDT framework is introduced to characterize *statistical computational gaps* (SCGs). Studying *symmetric binary perceptrons* (SBPs), [100] obtained an *algorithmic threshold* estimate $\alpha_a \approx \alpha_c^{(7)} \approx 1.6093$ at the 7th lifting level (for $\kappa=1$ margin), closely approaching the 1.58 local entropy (LE) prediction [18]. In this paper, we further connect parametric RDT to overlap gap properties (OGPs), another key geometric feature of the solution space. Specifically, for any positive integer $s$, we consider $s$-level ultrametric OGPs ($ult_s$-OGPs) and rigorously upper-bound the associated constraint densities $\alpha_{ult_s}$. To achieve this, we develop an analytical union-bounding program consisting of combinatorial and probabilistic components. By casting the combinatorial part as a convex problem and the probabilistic part as a nested integration, we conduct numerical evaluations and obtain that the tightest bounds at the first two levels, $\bar{\alpha}_{ult_1} \approx 1.6578$ and $\bar{\alpha}_{ult_2} \approx 1.6219$, closely approach the 3rd and 4th lifting level parametric RDT estimates, $\alpha_c^{(3)} \approx 1.6576$ and $\alpha_c^{(4)} \approx 1.6218$. We also observe excellent agreement across other key parameters, including overlap values and the relative sizes of ultrametric clusters. Based on these observations, we propose several conjectures linking ult-OGP and parametric RDT. Specifically, we conjecture that the algorithmic threshold satisfies $\alpha_a = \lim_{s\rightarrow\infty} \alpha_{ult_s} = \lim_{s\rightarrow\infty} \bar{\alpha}_{ult_s} = \lim_{r\rightarrow\infty} \alpha_c^{(r)}$, and that $\alpha_{ult_s} \leq \alpha_c^{(s+2)}$ (with possible equality for some, maybe even all, $s$). Finally, we discuss the potential existence of a full isomorphism connecting all key parameters of ult-OGP and parametric RDT.

[LG-3] On two ways to use determinantal point processes for Monte Carlo integration NEURIPS2019

链接: https://arxiv.org/abs/2604.19698
作者: Guillaume Gautier,Rémi Bardenet,Michal Valko
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: NeurIPS 2019

点击查看摘要

Abstract:The standard Monte Carlo estimator $\widehat{I}_N^{\mathrm{MC}}$ of $\int f\,d\omega$ relies on independent samples from $\omega$ and has variance of order $1/N$. Replacing the samples with a determinantal point process (DPP), a repulsive distribution, makes the estimator consistent, with variance rates that depend on how the DPP is adapted to $f$ and $\omega$. We examine two existing DPP-based estimators: one by Bardenet & Hardy (2020) with a rate of $\mathcal{O}(N^{-(1+1/d)})$ for smooth $f$, but relying on a fixed DPP. The other, by Ermakov & Zolotukhin (1960), is unbiased with rate of order $1/N$, like Monte Carlo, but its DPP is tailored to $f$. We revisit these estimators, generalize them to continuous settings, and provide sampling algorithms.
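For reference, the i.i.d. baseline that the DPP estimators improve on, the standard Monte Carlo estimate with variance of order 1/N, looks like this (sampling from a DPP itself is considerably more involved and is not sketched here):

```python
import random

def mc_estimate(f, n, seed=0):
    """Standard Monte Carlo estimate of the integral of f over [0, 1]
    from n i.i.d. uniform samples; its variance shrinks like 1/n."""
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n)) / n

f = lambda x: x * x            # integral over [0, 1] is 1/3
est = mc_estimate(f, 20000)
print(est)
```

The repulsion of a DPP spreads the sample points more evenly than i.i.d. draws, which is what buys the faster $\mathcal{O}(N^{-(1+1/d)})$ rate for smooth integrands.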

[LG-4] Planning in entropy-regularized Markov decision processes and games NEURIPS2019

链接: https://arxiv.org/abs/2604.19695
作者: Jean-Bastien Grill,Omar Darwiche Domingues,Pierre Ménard,Rémi Munos,Michal Valko
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2019

点击查看摘要

Abstract:We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order $\tilde{O}(1/\epsilon^4)$ for a desired accuracy $\epsilon$, whereas for non-regularized settings there are no known algorithms with guaranteed polynomial sample complexity in the worst case.
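The smoothness SmoothCruiser exploits comes from the entropy-regularized backup, which replaces the hard max over actions with a temperature-scaled log-sum-exp; a minimal sketch with an illustrative temperature:

```python
import math

def soft_value(q_values, tau):
    """Entropy-regularized backup: tau * log sum exp(Q/tau).
    Smooth in Q (unlike max), which is what planning can exploit."""
    m = max(q_values)  # shift for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

q = [1.0, 2.0, 3.0]
v_soft = soft_value(q, tau=0.5)
v_hard = max(q)
print(v_soft, v_hard)
```

As the temperature goes to zero the soft value converges to the hard max, recovering the unregularized Bellman operator.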

[LG-5] PREF-XAI: Preference-Based Personalized Rule Explanations of Black-Box Machine Learning Models

链接: https://arxiv.org/abs/2604.19684
作者: Salvatore Greco,Jacek Karolczak,Roman Słowiński,Jerzy Stefanowski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable artificial intelligence (XAI) has predominantly focused on generating model-centric explanations that approximate the behavior of black-box models. However, such explanations often overlook a fundamental aspect of interpretability: different users require different explanations depending on their goals, preferences, and cognitive constraints. Although recent work has explored user-centric and personalized explanations, most existing approaches rely on heuristic adaptations or implicit user modeling, lacking a principled framework for representing and learning individual preferences. In this paper, we consider Preference-Based Explainable Artificial Intelligence (PREF-XAI), a novel perspective that reframes explanation as a preference-driven decision problem. Within PREF-XAI, explanations are not treated as fixed outputs, but as alternatives to be evaluated and selected according to user-specific criteria. In the PREF-XAI perspective, here we propose a methodology that combines rule-based explanations with formal preference learning. User preferences are elicited through a ranking of a small set of candidate explanations and modeled via an additive utility function inferred using robust ordinal regression. Experimental results on real-world datasets show that PREF-XAI can accurately reconstruct user preferences from limited feedback, identify highly relevant explanations, and discover novel explanatory rules not initially considered by the user. Beyond the proposed methodology, this work establishes a connection between XAI and preference learning, opening new directions for interactive and adaptive explanation systems.
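An additive utility over explanation criteria is straightforward to evaluate once its weights are known; the criteria names and weights below are hypothetical, standing in for values inferred by robust ordinal regression from a user's ranking:

```python
def additive_utility(explanation, weights):
    """Score a rule explanation as a weighted sum of its criteria."""
    return sum(weights[k] * v for k, v in explanation.items())

# hypothetical weights recovered from a user's ranking of candidates:
# this user values fidelity highly and brevity somewhat
weights = {"fidelity": 1.0, "brevity": 0.5}
rules = {
    "rule_a": {"fidelity": 0.9, "brevity": 0.2},
    "rule_b": {"fidelity": 0.7, "brevity": 0.9},
}
ranked = sorted(rules, key=lambda r: additive_utility(rules[r], weights),
                reverse=True)
print(ranked)
```

With these weights the shorter rule_b outranks the higher-fidelity rule_a, which is exactly the kind of user-specific trade-off the framework is designed to capture.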

[LG-6] Budgeted Online Influence Maximization ICML2020

链接: https://arxiv.org/abs/2604.19672
作者: Pierre Perrault,Jennifer Healey,Zheng Wen,Michal Valko
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 37th International Conference on Machine Learning (ICML 2020), 28 pages

点击查看摘要

Abstract:We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. Our approach better models the real-world setting where the cost of influencers varies and advertisers want to find the best value for their overall social advertising budget. We propose an algorithm assuming an independent cascade diffusion model and edge level semi-bandit feedback, and provide both theoretical and experimental results. Our analysis is also valid for the cardinality constraint setting and improves the state of the art regret bound in this case.
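The move from a cardinality constraint to a budget can be illustrated with a greedy value-per-cost baseline; note this sketch treats influencer values as additive and ignores the submodular diffusion model the paper actually analyzes:

```python
def budgeted_greedy(influencers, budget):
    """Greedy selection by value-per-cost under a total budget,
    a standard baseline for budgeted (knapsack-style) selection."""
    chosen, spent = [], 0.0
    for inf in sorted(influencers, key=lambda x: x["value"] / x["cost"],
                      reverse=True):
        if spent + inf["cost"] <= budget:
            chosen.append(inf["name"])
            spent += inf["cost"]
    return chosen, spent

# hypothetical influencers with estimated spread value and price
pool = [
    {"name": "A", "value": 10.0, "cost": 5.0},  # ratio 2.0
    {"name": "B", "value": 6.0,  "cost": 2.0},  # ratio 3.0
    {"name": "C", "value": 4.0,  "cost": 4.0},  # ratio 1.0
]
chosen, spent = budgeted_greedy(pool, budget=7.0)
print(chosen, spent)
```

In the online setting the "values" are not known up front, which is where the semi-bandit feedback on edges comes in.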

[LG-7] HardNet: Nonlinear Constraint Enforcement in Neural Networks

链接: https://arxiv.org/abs/2604.19669
作者: Andrea Goertzen,Kaveh Alim,Navid Azizan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Enforcing constraint satisfaction in neural network outputs is critical for safety, reliability, and physical fidelity in many control and decision-making applications. While soft-constrained methods penalize constraint violations during training, they do not guarantee constraint adherence during inference. Other approaches guarantee constraint satisfaction via specific parameterizations or a projection layer, but are tailored to specific forms (e.g., linear constraints), limiting their utility in other general problem settings. Many real-world problems of interest are nonlinear, motivating the development of methods that can enforce general nonlinear constraints. To this end, we introduce HardNet++, a constraint-enforcement method that simultaneously satisfies linear and nonlinear equality and inequality constraints. Our approach iteratively adjusts the network output via damped local linearizations. Each iteration is differentiable, admitting an end-to-end training framework, where the constraint satisfaction layer is active during training. We show that under certain regularity conditions, this procedure can enforce nonlinear constraint satisfaction to arbitrary tolerance. Finally, we demonstrate tight constraint adherence without loss of optimality in a learning-for-optimization context, where we apply this method to a model predictive control problem with nonlinear state constraints.
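The damped-local-linearization idea can be sketched in one dimension: repeatedly take a damped Newton step on the constraint residual g(y) until g(y) = 0 holds to tolerance. This is an illustrative reduction, not the HardNet++ layer itself:

```python
def enforce_constraint(y, g, grad_g, damping=0.8, iters=50):
    """Adjust a scalar network output y toward {y : g(y) = 0} via
    damped Newton steps on the constraint residual."""
    for _ in range(iters):
        y = y - damping * g(y) / grad_g(y)  # damped local linearization
    return y

# example nonlinear equality constraint: y^2 - 2 = 0, root y = sqrt(2)
g = lambda y: y * y - 2.0
grad_g = lambda y: 2.0 * y
y = enforce_constraint(3.0, g, grad_g)
print(y, g(y))
```

Because every step is differentiable in y, such a correction can sit inside the network and be trained end-to-end, as the abstract describes.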

[LG-8] Disentangling Damage from Operational Variability: A Label-Free Self-Supervised Representation Learning Framework for Output-Only Structural Damage Identification

链接: https://arxiv.org/abs/2604.19658
作者: Xudong Jian,Charikleia Stoura,Simon Scandella,Eleni Chatzi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Damage identification is a core task in structural health monitoring. In practice, however, its reliability is often compromised by confounding non-damage effects, such as variations in excitation and environmental conditions, which can induce changes comparable to or larger than those caused by structural damage. To address this challenge, this study proposes a self-supervised label-free disentangled representation learning framework for robust vibration-based structural damage identification. The proposed framework employs an autoencoder with two latent representations to learn directly from raw vibration acceleration signals. A self-supervised invariance regularization, implemented via Variance-Invariance-Covariance Regularization (VICReg), is imposed on one latent representation using baseline data where structural damage is assumed constant but operational and environmental conditions vary. In addition, a frequency-domain constraint is introduced to enforce agreement between the power spectral density reconstructed from the latent representation and that computed from the corresponding input time series. Together, these mechanisms promote disentanglement, enabling the learned representation to be sensitive to damage-related characteristics while remaining invariant to nuisance variability. The framework is trained in a fully end-to-end and label-free manner, requiring no prior information on damage, excitation, or environmental conditions, making it well-suited for real-world applications. Its effectiveness is validated on two distinct real-world vibration datasets, including a bridge and a gearbox. The results demonstrate robustness to operational variability, strong generalization capability, and good performance in both damage detection and quantification.

[LG-9] An Efficient Black-Box Reduction from Online Learning to Multicalibration and a New Route to Φ-Regret Minimization

链接: https://arxiv.org/abs/2604.19592
作者: Gabriele Farina,Juan Carlos Perdomo
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We give a Gordon-Greenwald-Marks (GGM) style black-box reduction from online learning to online multicalibration. Concretely, we show that to achieve high-dimensional multicalibration with respect to a class of functions H, it suffices to combine any no-regret learner over H with an expected variational inequality (EVI) solver. We also prove a converse statement showing that efficient multicalibration implies efficient EVI solving, highlighting how EVIs in multicalibration mirror the role of fixed points in the GGM result for \Phi-regret. This first set of results resolves the main open question in Garg, Jung, Reingold, and Roth (SODA '24), showing that oracle-efficient online multicalibration with \sqrt{T}-type guarantees is possible in full generality. Furthermore, our GGM-style reduction unifies the analyses of existing online multicalibration algorithms, enables new algorithms for challenging environments with delayed observations or censored outcomes, and yields the first efficient black-box reduction between online learning and multiclass omniprediction. Our second main result is a fine-grained reduction from high-dimensional online multicalibration to (contextual) \Phi-regret minimization. Together with our first result, this establishes a new route from external regret to \Phi-regret that bypasses sophisticated fixed-point or semi-separation machinery, dramatically simplifies a result of Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '25) while improving rates, and yields new algorithms that are robust to richer deviation classes, such as those belonging to any reproducing kernel Hilbert space.

[LG-10] Structure-guided molecular design with contrastive 3D protein-ligand learning

链接: https://arxiv.org/abs/2604.19562
作者: Carles Navarro,Philipp Tholke,Gianni de Fabritiis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Structure-based drug discovery faces the dual challenge of accurately capturing 3D protein-ligand interactions while navigating ultra-large chemical spaces to identify synthetically accessible candidates. In this work, we present a unified framework that addresses these challenges by combining contrastive 3D structure encoding with autoregressive molecular generation conditioned on commercial compound spaces. First, we introduce an SE(3)-equivariant transformer that encodes ligand and pocket structures into a shared embedding space via contrastive learning, achieving competitive results in zero-shot virtual screening. Second, we integrate these embeddings into a multimodal Chemical Language Model (MCLM). The model generates target-specific molecules conditioned on either pocket or ligand structures, with a learned dataset token that steers the output toward targeted chemical spaces, yielding candidates with favorable predicted binding properties across diverse targets.

[LG-11] Separating Geometry from Probability in the Analysis of Generalization

链接: https://arxiv.org/abs/2604.19560
作者: Maxim Raginsky,Benjamin Recht
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 19 pages

点击查看摘要

Abstract:The goal of machine learning is to find models that minimize prediction error on data that has not yet been seen. Its operational paradigm assumes access to a dataset S and articulates a scheme for evaluating how well a given model performs on an arbitrary sample. The sample can be S (in which case we speak of "in-sample" performance) or some entirely new S' (in which case we speak of "out-of-sample" performance). Traditional analysis of generalization assumes that both in- and out-of-sample data are i.i.d. draws from an infinite population. However, these probabilistic assumptions cannot be verified even in principle. This paper presents an alternative view of generalization through the lens of sensitivity analysis of solutions of optimization problems to perturbations in the problem data. Under this framework, generalization bounds are obtained by purely deterministic means and take the form of variational principles that relate in-sample and out-of-sample evaluations through an error term that quantifies how close out-of-sample data are to in-sample data. Statistical assumptions can then be used ex post to characterize the situations when this error term is small (either on average or with high probability).

[LG-12] Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention

链接: https://arxiv.org/abs/2604.19530
作者: Akash Yadav,Taiwo A. Adebiyi,Ruda Zhang
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and timeseries forecasting along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.
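A minimal sketch of the sampling mechanism the abstract describes, replacing softmax attention weights with normalized multinomial samples governed by a single concentration parameter. The function names and the ensemble loop are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def stochastic_attention_weights(scores, concentration, rng=None):
    """One stochastic draw of attention weights: the softmax distribution
    over keys is resampled as normalized multinomial counts. A larger
    `concentration` gives draws closer to the deterministic softmax."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.exp(scores - scores.max())       # numerically stable softmax
    p /= p.sum()
    counts = rng.multinomial(concentration, p)
    return counts / concentration           # normalized multinomial sample

def stochastic_attention_ensemble(scores, values, concentration, n_draws=8):
    """Predictive ensemble from repeated stochastic attention draws,
    as sketched here without any retraining of the base model."""
    rng = np.random.default_rng(0)
    outs = []
    for _ in range(n_draws):
        w = stochastic_attention_weights(scores, concentration, rng)
        outs.append(w @ values)             # attention-weighted read-out
    return np.stack(outs)                   # (n_draws, d_v) ensemble
```

In the limit of a large concentration parameter the draws collapse back onto the softmax weights, which is what makes post-hoc tuning of this single scalar meaningful.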

[LG-13] Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection

链接: https://arxiv.org/abs/2604.19526
作者: Divyesh Gabbireddy,Suman Saha
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Cross-site scripting (XSS) remains a persistent web security vulnerability, especially because obfuscation can change the surface form of a malicious payload while preserving its behavior. These transformations make it difficult for traditional and machine learning-based detection systems to reliably identify attacks. Existing approaches for generating obfuscated payloads often emphasize syntactic diversity, but they do not always ensure that the generated samples remain behaviorally valid. This paper presents a structured pipeline for generating and evaluating obfuscated XSS payloads using large language models (LLMs). The pipeline combines deterministic transformation techniques with LLM-based generation and uses a browser-based runtime evaluation procedure to compare payload behavior in a controlled execution environment. This allows generated samples to be assessed through observable runtime behavior rather than syntactic similarity alone. In the evaluation, an untuned baseline language model achieves a runtime behavior match rate of 0.15, while fine-tuning on behavior-preserving source-target obfuscation pairs improves the match rate to 0.22. Although this represents a measurable improvement, the results show that current LLMs still struggle to generate obfuscations that preserve observed runtime behavior. A downstream classifier evaluation further shows that adding generated payloads does not improve detection performance in this setting, although behavior-filtered generated samples can be incorporated without materially degrading performance. Overall, the study demonstrates both the promise and the limits of applying generative models to adversarial security data generation and emphasizes the importance of runtime behavior checks in improving the quality of generated data for downstream detection systems.

[LG-14] Accelerating Optimization and Machine Learning through Decentralization

链接: https://arxiv.org/abs/2604.19518
作者: Ziqin Chen,Zuang Wang,Yongqiang Wang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Decentralized optimization enables multiple devices to learn a global machine learning model while each individual device only has access to its local dataset. By avoiding the need for training data to leave individual users’ devices, it enhances privacy and scalability compared to conventional centralized learning, where all data has to be aggregated to a central server. However, decentralized optimization has traditionally been viewed as a necessary compromise, used only when centralized processing is impractical due to communication constraints or data privacy concerns. In this study, we show that decentralization can paradoxically accelerate convergence, outperforming centralized methods in the number of iterations needed to reach optimal solutions. Through examples in logistic regression and neural network training, we demonstrate that distributing data and computation across multiple agents can lead to faster learning than centralized approaches, even when each iteration is assumed to take the same amount of time, whether performed centrally on the full dataset or decentrally on local subsets. This finding challenges longstanding assumptions and reveals decentralization as a strategic advantage, offering new opportunities for more efficient optimization and machine learning.

[LG-15] ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications

链接: https://arxiv.org/abs/2604.19453
作者: Suvinava Basak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Batch Normalization (BN) is a cornerstone of deep learning, yet it fundamentally breaks down in micro-batch regimes (e.g., 3D medical imaging) and non-IID Federated Learning. Removing BN from deep architectures, however, often leads to catastrophic training failures such as vanishing gradients and dying channels. We identify that standard activation functions, like Swish and ReLU, exacerbate this instability in BN-free networks due to their non-zero-centered nature, which causes compounding activation mean-shifts as network depth increases. In this technical communication, we propose Zero-Centered Swish (ZC-Swish), a drop-in activation function parameterized to dynamically anchor activation means near zero. Through targeted stress-testing on BN-free convolutional networks at depths 8, 16, and 32, we demonstrate that while standard Swish collapses to near-random performance at depth 16 and beyond, ZC-Swish maintains stable layer-wise activation dynamics and achieves the highest test accuracy at depth 16 (51.5%) with seed 42. ZC-Swish thus provides a robust, parameter-efficient solution for stabilizing deep networks in memory-constrained and privacy-preserving applications where traditional normalization is unviable.
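The abstract does not give the exact parameterization of ZC-Swish, so the sketch below shows one simple way to zero-center Swish: subtract a Monte-Carlo estimate of its mean under a standard-normal input, so that roughly zero-mean pre-activations stay roughly zero-mean after the nonlinearity. Treat this as an illustration of the idea, not the paper's formula:

```python
import numpy as np

def swish(x, beta=1.0):
    """Standard Swish: x * sigmoid(beta * x). Its output mean under
    zero-mean inputs is positive, which compounds across BN-free depth."""
    return x / (1.0 + np.exp(-beta * x))

def zc_swish(x, beta=1.0):
    """Illustrative zero-centered Swish: subtract the activation's
    expected value under a standard-normal input, estimated once by
    Monte Carlo (the paper's dynamic anchoring may differ)."""
    z = np.random.default_rng(0).standard_normal(100_000)
    shift = swish(z, beta).mean()   # estimated mean-shift of Swish
    return swish(x, beta) - shift
```

On a large zero-mean input batch, plain Swish produces a clearly positive output mean while the centered variant stays near zero, which is the mean-shift mechanism the abstract attributes to training collapse at depth.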

[LG-16] Heterogeneity-Aware Personalized Federated Learning for Industrial Predictive Analytics

链接: https://arxiv.org/abs/2604.19451
作者: Yuhan Hu,Xiaolei Fang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Federated prognostics enable clients (e.g., companies, factories, and production lines) to collaboratively develop a failure time prediction model while keeping each client’s data local and confidential. However, traditional federated models often assume homogeneity in the degradation processes across clients, an assumption that may not hold in many industrial settings. To overcome this, this paper proposes a personalized federated prognostic model designed to accommodate clients with heterogeneous degradation processes, allowing them to build tailored prognostic models. The prognostic model iteratively facilitates the underlying pairwise collaborations between clients with similar degradation patterns, which enhances the performance of personalized federated learning. To estimate parameters jointly using decentralized datasets, we develop a federated parameter estimation algorithm based on proximal gradient descent. The proposed approach addresses the limitations of existing federated prognostic models by simultaneously achieving model personalization, preserving data privacy, and providing comprehensive failure time distributions. The superiority of the proposed model is validated through extensive simulation studies and a case study using the turbofan engine degradation dataset from the NASA repository.

[LG-17] Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

链接: https://arxiv.org/abs/2604.19444
作者: Thomas Zollo,Jimmy Wang,Richard Zemel
类目: Machine Learning (cs.LG)
*备注: 41 pages, 14 tables, 12 figures

点击查看摘要

Abstract:Reasoning language models can solve increasingly complex tasks, but struggle to produce the calibrated confidence estimates necessary for reliable deployment. Existing calibration methods usually depend on labels or repeated sampling at inference time, making them impractical in many settings. We introduce a method for unsupervised confidence calibration of reasoning LLMs when only a single generation is available at inference time. Our approach uses offline sampling on unlabeled data to derive a self-consistency-based proxy target, then distills this signal into a lightweight deployment-time confidence predictor. In a broad evaluation across 5 math and question-answering tasks using 9 reasoning models, our method substantially outperforms baselines, including under distribution shift, and improves downstream performance in selective prediction and simulated downstream decision-making.

[LG-18] Optimal Routing for Federated Learning over Dynamic Satellite Networks: Tractable or Not?

链接: https://arxiv.org/abs/2604.19399
作者: Yi Zhao,Di Yuan,Tao Deng,Suzhi Cao,Ying Dong
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a key paradigm for distributed model learning across decentralized data sources. Communication in each FL round typically consists of two phases: (i) distributing the global model from a server to clients, and (ii) collecting updated local models from clients to the server for aggregation. This paper focuses on a type of FL where communication between a client and the server is relay-based over dynamic networks, making routing optimization essential. A typical scenario is in-orbit FL, where satellites act as clients and communicate with a server (which can be a satellite, ground station, or aerial platform) via multi-hop inter-satellite links. This paper presents a comprehensive tractability analysis of routing optimization for in-orbit FL under different settings. For global model distribution, these include the number of models, the objective function, and routing schemes (unicast versus multicast, and splittable versus unsplittable flow). For local model collection, the settings consider the number of models, client selection, and flow splittability. For each case, we rigorously prove whether the global optimum is obtainable in polynomial time or the problem is NP-hard. Together, our analysis draws clear boundaries between tractable and intractable regimes for a broad spectrum of routing problems for in-orbit FL. For tractable cases, the derived efficient algorithms are directly applicable in practice. For intractable cases, we provide fundamental insights into their inherent complexity. These contributions fill a critical yet unexplored research gap, laying a foundation for principled routing design, evaluation, and deployment in satellite-based FL or similar distributed learning systems.

[LG-19] FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition

链接: https://arxiv.org/abs/2604.19357
作者: Rudolf Debelak
类目: Machine Learning (cs.LG)
*备注: Accepted at ACM FAccT 2026

点击查看摘要

Abstract:The evaluation of machine learning models typically relies mainly on performance metrics based on loss functions, which risk overlooking changes in performance in relevant subgroups. Auditing tools such as SliceFinder and SliceLine were proposed to detect such groups, but usually have conceptual disadvantages, such as the inability to directly address continuous covariates. In this paper, we introduce FairTree, a novel algorithm adapted from psychometric invariance testing. Unlike SliceFinder and related algorithms, FairTree directly handles continuous, categorical, and ordinal features without discretization. It further decomposes performance disparities into systematic bias and variance, allowing a categorization of changes in algorithm performance. We propose and evaluate two variations of the algorithm: a permutation-based approach, which is conceptually closer to SliceFinder, and a fluctuation test. Through simulation studies that include a direct comparison with SliceLine, we demonstrate that both approaches have a satisfactory rate of false-positive results, but that the fluctuation approach has relatively higher power. We further illustrate the method on the UCI Adult Census dataset. The proposed algorithms provide a flexible framework for the statistical evaluation of the performance and aspects of fairness of machine learning models in a wide range of applications even in relatively small data.
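The bias-variance decomposition of subgroup disparities can be sketched as follows. This minimal per-subgroup residual decomposition is our own formulation for illustration, not FairTree's actual test statistic:

```python
import numpy as np

def subgroup_bias_variance(y_true, y_pred, groups):
    """Illustrative per-subgroup decomposition: split each subgroup's
    mean squared residual into a systematic-bias term (squared mean
    residual) and a variance term, so MSE = bias^2 + variance exactly."""
    out = {}
    for g in np.unique(groups):
        r = (y_pred - y_true)[groups == g]   # residuals within subgroup g
        out[g] = {
            "bias_sq": r.mean() ** 2,        # systematic over/under-prediction
            "variance": r.var(),             # spread of errors around that bias
            "mse": (r ** 2).mean(),
        }
    return out
```

A subgroup with large `bias_sq` is systematically mis-scored, while one with large `variance` is merely noisily predicted; that categorization is the distinction the abstract draws.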

[LG-20] Scalable Memristive-Friendly Reservoir Computing for Time Series Classification

链接: https://arxiv.org/abs/2604.19343
作者: Coşku Can Horuz,Andrea Ceni,Claudio Gallicchio,Sebastian Otte
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 12 pages, 3 figures, 7 tables

点击查看摘要

Abstract:Memristive devices present a promising foundation for next-generation information processing by combining memory and computation within a single physical substrate. This unique characteristic enables efficient, fast, and adaptive computing, particularly well suited for deep learning applications. Among recent developments, the memristive-friendly echo state network (MF-ESN) has emerged as a promising approach that combines memristive-inspired dynamics with the training simplicity of reservoir computing, where only the readout layer is learned. Building on this framework, we propose memristive-friendly parallelized reservoirs (MARS), a simplified yet more effective architecture that enables efficient scalable parallel computation and deeper model composition through novel subtractive skip connections. This design yields two key advantages: substantial training speedups of up to 21x over the inherently lightweight echo state network baseline and significantly improved predictive performance. Moreover, MARS demonstrates what is possible with parallel memristive-friendly reservoir computing: on several long sequence benchmarks our compact gradient-free models substantially outperform strong gradient-based sequence models such as LRU, S5, and Mamba, while reducing full training time from minutes or hours down to seconds or even only a few hundred milliseconds. Our work positions parallel memristive-friendly computing as a promising route towards scalable neuromorphic learning systems that combine high predictive capability with radically improved computational efficiency, while providing a clear pathway to energy-efficient, low-latency implementations on emerging memristive and in-memory hardware.

[LG-21] FedSEA: Achieving Benefit of Parallelization in Federated Online Learning

链接: https://arxiv.org/abs/2604.19336
作者: Harekrushna Sahu,Pratik Jawanpuria,Pranay Sharma
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Online federated learning (OFL) has emerged as a popular framework for decentralized decision-making over continuous data streams without compromising client privacy. However, the adversary model assumed in standard OFL typically precludes any potential benefits of parallelization. Further, it fails to adequately capture the different sources of statistical variation in OFL problems. In this paper, we extend the OFL paradigm by integrating a stochastically extended adversary (SEA). Under this framework, the loss function remains fixed across clients over time. However, the adversary dynamically and independently selects the data distribution for each client at each time. We propose the FedSEA algorithm to solve this problem, which utilizes online stochastic gradient descent at the clients, along with periodic global aggregation via the server. We establish bounds on the global network regret over a time horizon T for two classes of functions: (1) for smooth and convex losses, we prove an \mathcal{O}(\sqrt{T}) bound, and (2) for smooth and strongly convex losses, we prove an \mathcal{O}(\log T) bound. Through careful analysis, we quantify the individual impact of both spatial (across clients) and temporal (over time) data heterogeneity on the regret bounds. Consequently, we identify a regime of mild temporal variation (relative to stochastic gradient variance), where the network regret improves with parallelization. Hence, in the SEA setting, our results improve the existing pessimistic worst-case results in online federated learning.

[LG-22] When Active Learning Falls Short: An Empirical Study on Chemical Reaction Extraction

链接: https://arxiv.org/abs/2604.19335
作者: Simin Yu,Sufia Fathima
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid growth of chemical literature has generated vast amounts of unstructured data, where reaction information is particularly valuable for applications such as reaction predictions and drug design. However, the prohibitive cost of expert annotation has led to a scarcity of training data, severely hindering the performance of automatic reaction extraction. In this work, we conduct a systematic study of active learning for chemical reaction extraction. We integrate six uncertainty- and diversity-based strategies with pretrained transformer-CRF architectures, and evaluate them on product extraction and role labeling tasks. While several methods approach full-data performance with fewer labeled instances, learning curves are often non-monotonic and task-dependent. Our analysis shows that strong pretraining, structured CRF decoding, and label sparsity limit the stability of conventional active learning strategies. These findings provide practical insights for the effective use of active learning in chemical information extraction.

[LG-23] On the Conditioning Consistency Gap in Conditional Neural Processes

链接: https://arxiv.org/abs/2604.19312
作者: Robin Young
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural processes are meta-learning models that map context sets to predictive distributions. While inspired by stochastic processes, NPs do not generally satisfy the Kolmogorov consistency conditions required to define a valid stochastic process. This inconsistency is widely acknowledged but poorly understood. Practitioners note that NPs work well despite the violation, without quantifying what this means. We address this gap by defining the conditioning consistency gap, a KL divergence measuring how much a conditional neural process’s (CNP) predictions change when a point is added to the context versus conditioned upon. Our main results show that for CNPs with bounded encoders and Lipschitz decoders, the consistency gap is O(1/n^2) in context size n , and that this rate is tight. These bounds establish the precise sense in which CNPs approximate valid stochastic processes. The inconsistency is negligible for moderate context sizes but can be significant in the few-shot regime.
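The O(1/n^2) rate is easy to reproduce in a toy setting. Below, a deliberately simple mean-pooling "CNP" with a fixed-variance Gaussian predictive shows the KL-based consistency gap shrinking quadratically in the context size; the model and function names are illustrative stand-ins, not the paper's architecture:

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def toy_cnp_predict(context_y, noise_var=1.0):
    """Toy CNP stand-in: mean-pool the context targets into a Gaussian
    predictive with fixed variance (illustrative, not the paper's model)."""
    return context_y.mean(), noise_var

def consistency_gap(context_y, new_y):
    """KL between the predictive after ADDING `new_y` to the context and
    the predictive that merely CONDITIONS on it, which for a CNP with
    factorized predictions leaves the predictive unchanged."""
    mu_c, var_c = toy_cnp_predict(context_y)
    mu_a, var_a = toy_cnp_predict(np.append(context_y, new_y))
    return kl_gauss(mu_a, var_a, mu_c, var_c)
```

With a context of n zeros and a new observation of 1, the pooled mean moves by 1/(n+1), so the gap is exactly 0.5/(n+1)^2, matching the quadratic decay the abstract proves for bounded encoders and Lipschitz decoders.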

[LG-24] Debiased neural operators for estimating functionals

链接: https://arxiv.org/abs/2604.19296
作者: Konstantin Hess,Dennis Frauen,Niki Kilbertus,Stefan Feuerriegel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural operators are widely used to approximate solution maps of complex physical systems. In many applications, however, the goal is not to recover the full solution trajectory, but to summarize the solution trajectory via a scalar target quantity (e.g., a functional such as time spent in a target range, time above a threshold, accumulated cost, or total energy). In this paper, we introduce DOPE (debiased neural operator): a semiparametric estimator for such target quantities of solution trajectories obtained from neural operators. DOPE is broadly applicable to settings with both partial and irregular observations and can be combined with arbitrary neural operator architectures. We make three main contributions. (1) We show that, in contrast to DOPE, naive plug-in estimation can suffer from first-order bias. (2) To address this, we derive a novel one-step, Neyman-orthogonal estimator that treats the neural operator as a high-dimensional nuisance mapping between function spaces, and removes the leading bias term. For this, DOPE uses a weighting mechanism that simultaneously accounts for irregular observation designs and for how sensitive the target quantity is to perturbations of the underlying trajectory. (3) To learn the weights, we extend automatic debiased machine learning to operator-valued nuisances via Riesz regression. We demonstrate the benefits of DOPE across various numerical experiments.

[LG-25] TEMPO: Scaling Test-time Training for Large Reasoning Models

链接: https://arxiv.org/abs/2604.19295
作者: Qingyang Zhang,Xinke Kong,Haitao Wu,Qinghua Hu,Minghao Wu,Baosong Yang,Yu Cheng,Yun Luo,Ganqu Cui,Changqing Zhang
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.

[LG-26] The Logical Expressiveness of Topological Neural Networks ICLR2026

链接: https://arxiv.org/abs/2604.19212
作者: Amirreza Akbari,Amauri H. Souza,Vikas Garg
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 39 pages, Published at the 14th International Conference on Learning Representations (ICLR 2026)

点击查看摘要

Abstract:Graph neural networks (GNNs) are the standard for learning on graphs, yet they have limited expressive power, often expressed in terms of the Weisfeiler-Leman (WL) hierarchy or within the framework of first-order logic. In this context, topological neural networks (TNNs) have recently emerged as a promising alternative for graph representation learning. By incorporating higher-order relational structures into message-passing schemes, TNNs offer higher representational power than traditional GNNs. However, a fundamental question remains open: what is the logical expressiveness of TNNs? Answering this allows us to characterize precisely which binary classifiers TNNs can represent. In this paper, we address this question by analyzing isomorphism tests derived from the underlying mechanisms of general TNNs. We introduce and investigate the power of higher-order variants of WL-based tests for combinatorial complexes, called the k-CCWL test. In addition, we introduce the topological counting logic (TC_k), an extension of standard counting logic featuring a novel pairwise counting quantifier \exists^N(x_i,x_j), \varphi(x_i,x_j), which explicitly quantifies pairs (x_i, x_j) satisfying property \varphi. We rigorously prove the exact equivalence: k-CCWL \equiv TC_{k+2} \equiv topological (k+2)-pebble game. These results establish a logical expressiveness theory for TNNs.

[LG-27] Auditing LLMs for Algorithmic Fairness in Casenote-Augmented Tabular Prediction

链接: https://arxiv.org/abs/2604.19204
作者: Xiao Qi Lee,Ezinne Nwankwo,Angela Zhou
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:LLMs are increasingly being considered for prediction tasks in high-stakes social service settings, but their algorithmic fairness properties in this context are poorly understood. In this short technical report, we audit the algorithmic fairness of LLM-based tabular classification on a real housing placement prediction task, augmented with street outreach casenotes from a nonprofit partner. We audit multi-class classification error disparities. We find that a fine-tuned model augmented with casenote summaries can improve accuracy while reducing algorithmic fairness disparities. We experiment with variable importance improvements to zero-shot tabular classification and find mixed results on resulting algorithmic fairness. Overall, given historical inequities in housing placement, it is crucial to audit LLM use. We find that leveraging LLMs to augment tabular classification with casenote summaries can safely leverage additional text information at low implementation burden. The outreach casenotes are fairly short and heavily redacted. Our assessment is that LLM zero-shot classification does not introduce additional textual biases beyond algorithmic biases in tabular classification. Combining fine-tuning and leveraging casenote summaries can improve accuracy and algorithmic fairness.

[LG-28] FOCAL-Attention for Heterogeneous Multi-Label Prediction

链接: https://arxiv.org/abs/2604.19171
作者: Chenghao Zhang,Qingqing Long,Ludi Wang,Wenjuan Cui,Jianjun Yu,Yi Du
类目: Machine Learning (cs.LG)
*备注: 24 pages, 4 figures

点击查看摘要

Abstract:Heterogeneous graphs have attracted increasing attention for modeling multi-typed entities and relations in complex real-world systems. Multi-label node classification on heterogeneous graphs is challenging due to structural heterogeneity and the need to learn shared representations across multiple labels. Existing methods typically adopt either flexible attention mechanisms or meta-path constrained anchoring, but in heterogeneous multi-label prediction they often suffer from semantic dilution or coverage constraint. Both issues are further amplified under multi-label supervision. We present a theoretical analysis showing that as heterogeneous neighborhoods expand, the attention mass allocated to task-critical (primary) neighborhoods diminishes, and that meta-path constrained aggregation exhibits a dilemma: too few meta-paths intensify coverage constraint, while too many re-introduce dilution. To resolve this coverage-anchoring conflict, we propose FOCAL: Fusion Of Coverage and Anchoring Learning, with two components: coverage-oriented attention (COA) for flexible, unconstrained heterogeneous context aggregation, and anchoring-oriented attention (AOA) that restricts aggregation to meta-path-induced primary semantics. Our theoretical analysis and experimental results further indicate that FOCAL outperforms other state-of-the-art methods.

[LG-29] SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

链接: https://arxiv.org/abs/2604.19157
作者: Jinda Jia,Jisen Li,Zhongzhu Zhou,Jung Hwan Heo,Jue Wang,Tri Dao,Shuaiwen Leon Song,Ben Athiwaratkun,Chenfeng Xu,Tianyi Zhang,Xiaoxia Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, limiting their effectiveness in deployment. In this work, we identify the minimal set of 4-bit KV-cache quantization methods that remain viable under these constraints. Our central finding is that a simple design–token-wise INT4 quantization with block-diagonal Hadamard rotation–consistently achieves the best accuracy-efficiency trade-off. Across multiple models and benchmarks, this approach recovers nearly all of the accuracy lost by naive INT4, while more complex methods such as vector quantization and Hessian-aware quantization provide only marginal additional gains once serving compatibility is taken into account. To make this practical, we implement a fused rotation-quantization kernel that integrates directly into paged KV-cache layouts and introduces zero measurable end-to-end overhead, matching plain INT4 throughput across concurrency levels. Our results show that effective KV-cache compression is fundamentally a systems co-design problem: under real serving constraints, lightweight block-diagonal Hadamard rotation is a viable method that delivers near-lossless accuracy without sacrificing serving efficiency.
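
The paper's central recipe, token-wise INT4 with a block-diagonal Hadamard rotation, can be sketched numerically. This is an illustrative NumPy reconstruction, not the fused paged-KV kernel; the block size and tensor shapes are our assumptions:

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester's construction (n = 2^m)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def int4_quantize(x):
    """Symmetric token-wise INT4: one scale per row, codes in [-8, 7]."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale  # dequantized values

def rotate_int4(kv, block=16):
    """Block-diagonal Hadamard rotation before INT4, inverse rotation after.

    The rotation spreads per-channel outliers across each block, shrinking
    the per-token scale and making plain INT4 nearly lossless."""
    tokens, dim = kv.shape
    H = hadamard(block)
    rot = (kv.reshape(tokens, -1, block) @ H.T).reshape(tokens, dim)
    deq = int4_quantize(rot)
    return (deq.reshape(tokens, -1, block) @ H).reshape(tokens, dim)

rng = np.random.default_rng(0)
kv = rng.normal(size=(32, 64))
kv[:, 3] *= 20.0  # an outlier channel, typical of real KV caches
err_naive = np.abs(int4_quantize(kv) - kv).mean()
err_rot = np.abs(rotate_int4(kv) - kv).mean()
```

On outlier-heavy data the rotated variant gives a visibly smaller reconstruction error than naive INT4; in the paper this logic runs as a fused kernel inside paged KV layouts.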

[LG-30] RL-ABC: Reinforcement Learning for Accelerator Beamline Control

链接: https://arxiv.org/abs/2604.19146
作者: Anwar Ibrahim,Fedor Ratnikov,Maxim Kaledin,Alexey Petrenko,Denis Derkach
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:

点击查看摘要

Abstract:Particle accelerator beamline optimization is a high-dimensional control problem traditionally requiring significant expert intervention. We present RLABC (Reinforcement Learning for Accelerator Beamline Control), an open-source Python framework that automatically transforms standard Elegant beamline configurations into reinforcement learning environments. RLABC integrates with the widely-used Elegant beam dynamics simulation code via SDDS-based interfaces, enabling researchers to apply modern RL algorithms to beamline optimization with minimal RL-specific development. The main contribution is a general methodology for formulating beamline tuning as a Markov decision process: RLABC automatically preprocesses lattice files to insert diagnostic watch points before each tunable element, constructs a 57-dimensional state representation from beam statistics, covariance information, and aperture constraints, and provides a configurable reward function for transmission optimization. The framework supports multiple RL algorithms through Stable-Baselines3 compatibility and implements stage learning strategies for improved training efficiency. Validation on a test beamline derived from the VEPP-5 injection complex (37 control parameters across 11 quadrupoles and 4 dipoles) demonstrates that the framework successfully enables RL-based optimization, with a Deep Deterministic Policy Gradient agent achieving 70.3% particle transmission – performance matching established methods such as differential evolution. The framework’s stage learning capability allows decomposition of complex optimization problems into manageable subproblems, improving training efficiency. The complete framework, including configuration files and example notebooks, is available as open-source software to facilitate adoption and further research. 

[LG-31] LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

链接: https://arxiv.org/abs/2604.19117
作者: Manav Pandey
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When a language model agrees with a user’s false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning small to frontier scale, the same small set of attention heads carries a “this statement is wrong” signal whether the model is evaluating a claim on its own or being pressured to agree with a user. Silencing these heads flips sycophantic behavior sharply while leaving factual accuracy intact, so the circuit controls deference rather than knowledge. Edge-level path patching confirms that the same head-to-head connections drive sycophancy, factual lying, and instructed lying. Opinion-agreement, where no factual ground truth exists, reuses these head positions but writes into an orthogonal direction, ruling out a simple “truth-direction” reading of the substrate. Alignment training leaves this circuit in place: an RLHF refresh cuts sycophantic behavior roughly tenfold while the shared heads persist or grow, a pattern that replicates on an independent model family and under targeted anti-sycophancy DPO. When these models sycophant, they register that the user is wrong and agree anyway.
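
The head-silencing intervention behind these results can be illustrated on a toy attention layer. This is a minimal NumPy sketch of zero-ablating one head's contribution and measuring the behavioral effect; it stands in for the paper's interventions on real LLMs and is not their code:

```python
import numpy as np

def multihead_attention(x, Wq, Wk, Wv, Wo, n_heads, ablate=None):
    """Single-layer multi-head self-attention; `ablate` lists head indices
    whose output contribution is zeroed, the intervention used to silence
    a candidate circuit."""
    T, d = x.shape
    hd = d // n_heads
    q = (x @ Wq).reshape(T, n_heads, hd)
    k = (x @ Wk).reshape(T, n_heads, hd)
    v = (x @ Wv).reshape(T, n_heads, hd)
    out = np.zeros((T, n_heads, hd))
    for h in range(n_heads):
        att = q[:, h] @ k[:, h].T / np.sqrt(hd)
        att = np.exp(att - att.max(axis=-1, keepdims=True))
        att /= att.sum(axis=-1, keepdims=True)
        out[:, h] = att @ v[:, h]
    if ablate:
        for h in ablate:
            out[:, h] = 0.0  # mean-ablation is a gentler alternative
    return out.reshape(T, d) @ Wo

rng = np.random.default_rng(1)
d, T, H = 16, 5, 4
Wq, Wk, Wv, Wo = (rng.normal(scale=0.5, size=(d, d)) for _ in range(4))
x = rng.normal(size=(T, d))
base = multihead_attention(x, Wq, Wk, Wv, Wo, H)
ablated = multihead_attention(x, Wq, Wk, Wv, Wo, H, ablate=[2])
effect = np.linalg.norm(base - ablated)  # behavioral effect of silencing head 2
```

In the paper the same comparison is made on behavioral metrics (sycophantic agreement rate, factual accuracy) rather than a raw output norm.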

[LG-32] Age-Dependent Heterogeneity in the Association Between Physical Activity and Mental Distress: A Causal Machine Learning Analysis of 3.2 Million U.S. Adults

链接: https://arxiv.org/abs/2604.19066
作者: Yuan Shan(Department of Statistical Science, Duke University)
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Physical activity (PA) is widely recognized as protective against mental distress, yet whether this benefit varies systematically across population subgroups remains poorly understood. Using pooled data from ten consecutive annual waves of the U.S. Behavioral Risk Factor Surveillance System (2015-2024; n = 3,242,218), we investigate heterogeneity in the association between leisure-time PA and frequent mental distress (FMD, ≥14 days/month) across age groups. Survey-weighted logistic regression reveals a striking age gradient: the adjusted odds ratio for PA ranges from 0.89 among young adults (18-24) to 0.50 among adults aged 55-64, with the protective association strengthening monotonically with age. Temporal analysis across all ten years shows that the young-adult PA effect has been eroding over the past decade, with the 18-24 OR reaching 1.01 (null) in both 2018 and 2024 – paralleling the deepening youth mental health crisis. Causal Forest via Double Machine Learning independently identifies age as the dominant driver of treatment effect heterogeneity (feature importance = 0.39, 2.5x the next predictor). E-value sensitivity analysis, propensity score overlap checks, placebo tests, and imputation comparisons confirm the robustness of the findings. These results suggest that the well-documented exercise–mental health link may not generalize to the youngest adult population, whose distress appears increasingly driven by stressors that PA alone cannot mitigate.

[LG-33] Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors ICLR2026

链接: https://arxiv.org/abs/2604.19028
作者: Jeongwhan Choi,Jongwoo Kim,Woosung Kang,Noseong Park
类目: Machine Learning (cs.LG)
*备注: Accepted to ICLR 2026. OpenReview: this https URL

点击查看摘要

Abstract:One of the most challenging problems in graph machine learning is generalizing across graphs with diverse properties. Graph neural networks (GNNs) face a fundamental limitation: they require separate training for each new graph, preventing universal generalization across diverse graph datasets. A critical challenge facing GNNs lies in their reliance on labeled training data for each individual graph, a requirement that hinders the capacity for universal node classification due to the heterogeneity inherent in graphs – differences in homophily levels, community structures, and feature distributions across datasets. Inspired by the success of large language models (LLMs) that achieve in-context learning through massive-scale pre-training on diverse datasets, we introduce NodePFN. This universal node classification method generalizes to arbitrary graphs without graph-specific training. NodePFN learns posterior predictive distributions (PPDs) by training only on thousands of synthetic graphs generated from carefully designed priors. Our synthetic graph generation covers real-world graphs through the use of random networks with controllable homophily levels and structural causal models for complex feature-label relationships. We develop a dual-branch architecture combining context-query attention mechanisms with local message passing to enable graph-aware in-context learning. Extensive evaluation on 23 benchmarks demonstrates that a single pre-trained NodePFN achieves 71.27 average accuracy. These results validate that universal graph learning patterns can be effectively learned from synthetic priors, establishing a new paradigm for generalization in node classification.

[LG-34] Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

链接: https://arxiv.org/abs/2604.19024
作者: Qiang Liu,Adrienne Kline,Ermin Wei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed horizon reward models from human feedback and have only been validated empirically. In this paper, we formulate safe RLHF as an infinite horizon discounted Constrained Markov Decision Process (CMDP), since humans may interact with the model over a continuing sequence of interactions rather than within a single finite episode. We propose two Safe RLHF algorithms that do not require reward model fitting and, in contrast to prior work assuming fixed-length trajectories, support flexible trajectory lengths for training. Both algorithms are based on the primal-dual method and achieve global convergence guarantees with polynomial rates in terms of policy gradient iterations, trajectory sample lengths, and human preference queries. To the best of our knowledge, this is the first work to study infinite horizon discounted CMDP under human feedback and establish global, non-asymptotic convergence.
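
The primal-dual mechanism can be illustrated on a two-action toy CMDP: the primal step ascends the Lagrangian in the policy, the dual step ascends in the multiplier. The numbers, sigmoid policy, and step sizes below are illustrative choices, not the paper's LLM setting:

```python
import numpy as np

# Action 0: reward 1, cost 1 (helpful but unsafe); action 1: reward 0, cost 0.
# With a cost budget of 0.5, the constrained optimum plays action 0 half the time.
reward, cost, budget = np.array([1.0, 0.0]), np.array([1.0, 0.0]), 0.5

theta, lam = 0.0, 0.0                      # policy logit and dual variable
lr_p, lr_d = 0.05, 0.05
p_hist, lam_hist = [], []
for _ in range(4000):
    p0 = 1.0 / (1.0 + np.exp(-theta))      # P(action 0) under a sigmoid policy
    # Primal: gradient ascent on the Lagrangian E[r - lam * c]
    adv = (reward[0] - lam * cost[0]) - (reward[1] - lam * cost[1])
    theta += lr_p * p0 * (1.0 - p0) * adv
    # Dual: projected ascent on the expected constraint violation
    lam = max(0.0, lam + lr_d * (p0 * cost[0] + (1 - p0) * cost[1] - budget))
    p_hist.append(p0)
    lam_hist.append(lam)

avg_p0 = float(np.mean(p_hist[2000:]))     # averaged iterate, as in primal-dual analyses
avg_lam = float(np.mean(lam_hist[2000:]))
```

The raw iterates oscillate around the saddle point, which is why convergence guarantees for such methods are typically stated for averaged iterates.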

[LG-35] FG2-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

链接: https://arxiv.org/abs/2604.19021
作者: Pingwei Sun,Yuxuan Hu,Jianchao Tan,Xue Wang,Jiaqi Zhang,Yifan Lu,Yerui Sun,Yuchen Xie,Xunliang Cai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate β_t in the delta update remains a scalar, limiting the model's capacity for dimension-specific adaptation. We introduce FG^2-GDN, which replaces the scalar β_t with a channel-wise vector analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG^2-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write strength. Experiments on synthetic and real-world benchmarks show that FG^2-GDN and its variant improve associative recall and long-context understanding over GDN and KDA, with comparable computational efficiency.
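
The scalar-to-channel-wise upgrade of the delta-rule learning rate can be sketched as one memory update. The shapes and gating placement below are our assumptions for illustration, not the paper's exact parameterization:

```python
import numpy as np

def delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of a linear-attention memory S (d_k x d_v).

    alpha: channel-wise decay gate (as in KDA); beta: channel-wise learning
    rate over the value dimensions -- the vector-valued beta_t idea."""
    S = S * alpha                              # per-channel forgetting
    pred = S.T @ k                             # current read-out for key k
    return S + np.outer(k, beta * (v - pred))  # per-channel error correction

d_k, d_v = 8, 6
k = np.zeros(d_k)
k[2] = 1.0                                     # unit-norm key
v = np.arange(d_v, dtype=float)

# With beta = 1 everywhere, one step stores the association exactly ...
S = delta_step(np.zeros((d_k, d_v)), k, v, np.ones(d_v), np.ones(d_v))
recalled = S.T @ k
# ... while channel-wise beta writes each value dimension at its own rate.
S_half = delta_step(np.zeros((d_k, d_v)), k, v, np.ones(d_v), np.full(d_v, 0.5))
recalled_half = S_half.T @ k
```

Reading back with the same unit-norm key recovers the stored value scaled by the per-channel write rate, which is the "dimension-specific adaptation" the abstract describes.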

[LG-36] Accelerating trajectory optimization with Sobolev-trained diffusion policies

链接: https://arxiv.org/abs/2604.19011
作者: Théotime Le Hellard,Franki Nguimatsia Tiofack,Quentin Le Lidec,Justin Carpentier
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently; therefore, convergence speed and quality of the solution found depend on the initial trajectory proposed. To improve efficiency, a natural approach is to warm-start TO with initial guesses produced by a learned policy trained on trajectories previously generated by the solver. Diffusion-based policies have recently emerged as expressive imitation learning models, making them promising candidates for this role. Yet, a counterintuitive challenge comes from the local optimality of TO demonstrations: when a policy is rolled out, small non-optimal deviations may push it into situations not represented in the training data, triggering compounding errors over long horizons. In this work, we focus on learning-based warm-starting for gradient-based TO solvers that also provide feedback gains. Exploiting this specificity, we derive a first-order loss for Sobolev learning of diffusion-based policies using both trajectories and feedback gains. Through comprehensive experiments, we demonstrate that the resulting policy avoids compounding errors, and so can learn from very few trajectories to provide initial guesses reducing solving time by 2× to 20×. Incorporating first-order information enables predictions with fewer diffusion steps, reducing inference latency.
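
The Sobolev-learning idea reduces to a transparent special case for a linear policy: its input Jacobian is its weight matrix, so the first-order term regresses the weights directly onto the solver's feedback gains. This toy sketch assumes linear dynamics and a linear policy; the paper applies the same idea to diffusion policies:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u = 4, 2
K = rng.normal(size=(n_u, n_x))        # "expert" feedback gains from the TO solver
X = rng.normal(size=(5, n_x))          # only five demonstrations (data-efficient regime)
U = X @ K.T                            # demonstrated actions

def sobolev_fit(X, U, K, lam=1.0, lr=0.05, steps=500):
    """Gradient descent on ||W x - u||^2 + lam ||W - K||^2 (a Sobolev loss).

    For the linear policy u = W x the Jacobian dpi/dx equals W, so the
    first-order term supervises W with the feedback gain K directly."""
    W = np.zeros((n_u, n_x))
    for _ in range(steps):
        g = np.zeros_like(W)
        for x, u in zip(X, U):
            g += 2 * np.outer(W @ x - u, x)   # zeroth-order (action) term
        g = g / len(X) + 2 * lam * (W - K)    # first-order (gain) term
        W -= lr * g
    return W

W = sobolev_fit(X, U, K)
gain_err = float(np.linalg.norm(W - K))
```

Even with very few trajectories, the gain supervision pins down the policy's local behavior, which is the mechanism the paper credits for avoiding compounding rollout errors.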

[LG-37] Mechanistic Anomaly Detection via Functional Attribution

链接: https://arxiv.org/abs/2604.18970
作者: Hugo Lyons Keenan,Christopher Leckie,Sarah Erfani
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model’s output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.
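
The influence-function primitive the method builds on can be shown on a one-dimensional regression. This is a deliberately tiny stand-in: the corruption index and constants are ours, and the paper estimates such couplings via parameter-space sampling in deep models rather than a closed-form Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
x[0] = 1.5                              # fix the suspect point's input
y = 2.0 * x + 0.1 * rng.normal(size=n)
y[0] += 5.0                             # corrupt one training label

w = (x @ y) / (x @ x)                   # least-squares fit of y = w x
H = 2.0 * (x @ x) / n                   # Hessian of the mean squared loss

def influence(x_tr, y_tr, x_te, y_te):
    """Influence of a training point on a test loss: -g_test^T H^{-1} g_train."""
    g_tr = 2.0 * (w * x_tr - y_tr) * x_tr
    g_te = 2.0 * (w * x_te - y_te) * x_te
    return -g_te * g_tr / H

# Score every training point against one clean test point: the corrupted
# sample dominates in magnitude, i.e. attribution flags it as anomalous.
scores = np.array([influence(x[i], y[i], 1.0, 2.0) for i in range(n)])
suspect = int(np.argmax(np.abs(scores)))
```

Attribution failure in the paper's sense is the converse reading: when no trusted sample carries meaningful influence on an output, the output is flagged as mechanistically anomalous.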

[LG-38] FlowForge: A Staged Local Rollout Engine for Flow-Field Prediction

链接: https://arxiv.org/abs/2604.18953
作者: Xiaowen Zhang,Ziming Zhou,Fengnian Zhao,David L. S. Hung
类目: Machine Learning (cs.LG)
*备注: Main paper: 13 pages, 6 figures, 2 tables. Appendix: 17 pages, 7 figures, 1 table. arXiv preprint

点击查看摘要

Abstract:Deep learning surrogates for CFD flow-field prediction often rely on large, complex models, which can be slow and fragile when data are noisy or incomplete. We introduce FlowForge, a staged local rollout engine that predicts future flow fields by compiling a locality-preserving update schedule and executing it with a shared lightweight local predictor. Rather than producing the next frame in a single global pass, FlowForge rewrites spatial sites stage by stage so that each update conditions only on bounded local context exposed by earlier stages. This compile-execute design aligns inference with short-range physical dependence, keeps latency predictable, and limits error amplification from global mixing. Across PDEBench, CFDBench, and BubbleML, FlowForge matches or improves upon strong baselines in pointwise accuracy, delivers consistently better robustness to noise and missing observations, and maintains stable multi-step rollout behavior while reducing per-step latency.

[LG-39] TabEmb: Joint Semantic-Structure Embedding for Table Annotation

链接: https://arxiv.org/abs/2604.18939
作者: Ehsan Hoseinzade,Ke Wang,Anandharaju Durai Raju
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Table annotation is crucial for making web and enterprise tables usable in downstream NLP applications. Unlike textual data where learning semantically rich token or sentence embeddings often suffice, tables are structured combinations of columns wherein useful representations must jointly capture column's semantics and the inter-column relationships. Existing models learn by linearizing the 2D table into a 1D token sequence and encoding it with pretrained language models (PLMs) such as BERT. However, this leads to limited semantic quality and weaker generalization to unseen or rare values compared to modern LLMs, and degraded structural modeling due to 2D-to-1D flattening and context-length constraints. We propose TabEmb, which directly targets these limitations by decoupling semantic encoding from structural modeling. An LLM first produces semantically rich embeddings for each column, and a graph-based module over columns then injects relationships into the embeddings, yielding joint semantic-structural representations for table annotation. Experiments show that TabEmb consistently outperforms strong baselines on different table annotation tasks. Source code and datasets are available at this https URL

[LG-40] From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing

链接: https://arxiv.org/abs/2604.18918
作者: Linfeng Liang,Xiao Cheng,Tsong Yueh Chen,Xi Zheng
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulation-based testing of autonomous driving systems (ADS) must uncover realistic and diverse failures in dense, heterogeneous traffic. However, existing search-based seeding methods (e.g., genetic algorithms) struggle in high-dimensional spaces, often collapsing to limited modes and missing many failure scenarios. We present PtoP, a framework that combines adaptive random seed generation with Stein Variational Gradient Descent (SVGD) to produce diverse, failure-inducing initial conditions. SVGD balances attraction toward high-risk regions and repulsion among particles, yielding risk-seeking yet well-distributed seeds across multiple failure modes. PtoP is plug-and-play and enhances existing online testing methods (e.g., reinforcement learning–based testers) by providing principled seeds. Evaluation in CARLA on two industry-grade ADS (Apollo, Autoware) and a native end-to-end system shows that PtoP improves safety violation rate (up to 27.68%), scenario diversity (9.6%), and map coverage (16.78%) over baselines.

[LG-41] Collaborative Contextual Bayesian Optimization

链接: https://arxiv.org/abs/2604.18912
作者: Chih-Yu Chang,Qiyuan Chen,Tianhan Gao,David Fenning,Chinedum Okwudire,Neil Dasgupta,Wei Lu,Raed Al Kontar
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Discovering optimal designs through sequential data collection is essential in many real-world applications. While Bayesian Optimization (BO) has achieved remarkable success in this setting, growing attention has recently turned to context-specific optimal design, formalized as Contextual Bayesian Optimization (CBO). Unlike BO, CBO is inherently more challenging as it must approximate an entire mapping from the context space to its corresponding optimal design, requiring simultaneous exploration across contexts and exploitation within each. In many modern applications, such tasks arise across multiple potentially heterogeneous but related clients, where collaboration can significantly improve learning efficiency. We propose CCBO, Collaborative Contextual Bayesian Optimization, a unified framework enabling multiple clients to jointly perform CBO with controllable contexts, supporting both online collaboration and offline initialization from peers’ historical beliefs, with an optional privacy-preserving communication mechanism. We establish sublinear regret guarantees and demonstrate, through extensive simulations and a real-world hot rolling application, that CCBO achieves substantial improvements over existing approaches even under client heterogeneity. The code to reproduce the results can be found at this https URL

[LG-42] AC-SINDy: Compositional Sparse Identification of Nonlinear Dynamics

链接: https://arxiv.org/abs/2604.18889
作者: Peter Racioppo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present AC-SINDy, a compositional extension of the Sparse Identification of Nonlinear Dynamics (SINDy) framework that replaces explicit feature libraries with a structured representation based on arithmetic circuits. Rather than enumerating candidate basis functions, the proposed approach constructs nonlinear features through compositions of linear functions and multiplicative interactions, yielding a compact and scalable parameterization and enabling sparsity to be enforced directly over the computational graph. We also introduce a formulation that separates state estimation from dynamics identification by combining latent state inference with shared dynamics and multi-step supervision, improving robustness to noise while preserving interpretability. Experiments on nonlinear and chaotic systems demonstrate that the method recovers accurate and interpretable governing equations while scaling more favorably than standard SINDy.
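
For reference, the baseline AC-SINDy extends looks like this: standard SINDy with an explicit candidate library and sequentially thresholded least squares. The enumerated library below is exactly the component AC-SINDy replaces with a learned arithmetic circuit; the toy dynamics are our own example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
dx = -2.0 * x + 0.5 * x**3            # ground-truth dynamics dx/dt

# Explicit feature library (what AC-SINDy avoids enumerating)
library = np.column_stack([x, x**2, x**3, x**4])
names = ["x", "x^2", "x^3", "x^4"]

def stlsq(Theta, dx, threshold=0.1, iters=10):
    """Sequentially thresholded least squares, the core SINDy solver."""
    xi = np.linalg.lstsq(Theta, dx, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            # Refit only the surviving terms for an unbiased estimate
            xi[big] = np.linalg.lstsq(Theta[:, big], dx, rcond=None)[0]
    return xi

xi = stlsq(library, dx)
model = {n: c for n, c in zip(names, xi) if c != 0.0}
```

On noiseless data this recovers the sparse governing equation exactly; AC-SINDy's contribution is to keep this sparsity-driven recovery while building the nonlinear features compositionally instead of enumerating them.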

[LG-43] Subgraph Concept Networks: Concept Levels in Graph Classification

链接: https://arxiv.org/abs/2604.18868
作者: Lucie Charlotte Magister,Alexander Norcliffe,Iulia Duta,Pietro Lio
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The reasoning process of Graph Neural Networks is complex and considered opaque, limiting trust in their predictions. To alleviate this issue, prior work has proposed concept-based explanations, extracted from clusters in the model’s node embeddings. However, a limitation of concept-based explanations is that they only explain the node embedding space and are obscured by pooling in graph classification. To mitigate this issue and provide a deeper level of understanding, we propose the Subgraph Concept Network. The Subgraph Concept Network is the first graph neural network architecture that distils subgraph and graph-level concepts. It achieves this by performing soft clustering on node concept embeddings to derive subgraph and graph-level concepts. Our results show that the Subgraph Concept Network allows to obtain competitive model accuracy, while discovering meaningful concepts at different levels of the network.

[LG-44] ParamBoost: Gradient Boosted Piecewise Cubic Polynomials

链接: https://arxiv.org/abs/2604.18864
作者: Nicolas Salvadé,Tim Hillel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Generalized Additive Models (GAMs) can be used to create non-linear glass-box (i.e. explicitly interpretable) models, where the predictive function is fully observable over the complete input space. However, glass-box interpretability itself does not allow for the incorporation of expert knowledge from the modeller. In this paper, we present ParamBoost, a novel GAM whose shape functions (i.e. mappings from individual input features to the output) are learnt using a Gradient Boosting algorithm that fits cubic polynomial functions at leaf nodes. ParamBoost incorporates several constraints commonly used in parametric analysis to ensure well-refined shape functions. These constraints include: (i) continuity of the shape functions and their derivatives (up to C2); (ii) monotonicity; (iii) convexity; (iv) feature interaction constraints; and (v) model specification constraints. Empirical results show that the unconstrained ParamBoost model consistently outperforms state-of-the-art GAMs across several real-world datasets. We further demonstrate that modellers can selectively impose required constraints at a modest trade-off in predictive performance, allowing the model to be fully tailored to application-specific interpretability and parametric-analysis requirements.
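
The boosting loop with cubic stages can be sketched in one dimension. This is a simplification: real ParamBoost fits cubics at tree leaf nodes over feature ranges and can impose the continuity, monotonicity, and convexity constraints listed above, none of which is shown here:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = np.sin(3 * x) + 0.05 * rng.normal(size=200)   # target shape function

def boost_cubic(x, y, rounds=20, shrinkage=0.3):
    """Gradient boosting where each stage fits a cubic polynomial to the
    residuals (a single-feature stand-in for cubic leaf-node fitting)."""
    pred = np.zeros_like(y)
    stages = []
    for _ in range(rounds):
        resid = y - pred                      # negative gradient of squared loss
        coef = np.polyfit(x, resid, deg=3)    # cubic stage
        stages.append(coef)
        pred = pred + shrinkage * np.polyval(coef, x)
    return pred, stages

pred, stages = boost_cubic(x, y)
mse0 = float(np.mean((y - np.mean(y)) ** 2))      # baseline: constant predictor
mse = float(np.mean((y - pred) ** 2))
```

Because the final shape function is a sum of explicit polynomials, it stays fully inspectable, which is the glass-box property the abstract emphasizes.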

[LG-45] The High Explosives and Affected Targets (HEAT) Dataset

链接: https://arxiv.org/abs/2604.18828
作者: Bryan Kaiser,Kyle Hickmann,Sharmistha Chakrabarti,Soumi De,Sourabh Pandit,David Schodt,Jesus Pulido,Divya Banesh,Christine Sweeney
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) surrogate models provide a computationally efficient alternative to full-physics simulations, but no public datasets currently exist for training and validating models of high-explosive-driven, multi-material shock dynamics. Simulating shock propagation is challenging due to the need for material-specific equations of state (EOS) and models of plasticity, phase change, damage, fluid instabilities, and multi-material interactions. Explosive-driven shocks further require reactive material models to capture detonation physics. To address this gap, we introduce the High-Explosives and Affected Targets (HEAT) dataset, a physics-rich collection of two-dimensional, cylindrically symmetric simulations generated using an Eulerian multi-material shock-propagation code developed at Los Alamos National Laboratory. HEAT consists of two partitions: expanding shock-cylinder (CYL) simulations and Perturbed Layered Interface (PLI) simulations. Each entry includes time series of thermodynamic fields (pressure, density, temperature), kinematic fields (position, velocity), and continuum quantities such as stress. The CYL partition spans a range of materials, including metals (aluminum, copper, depleted uranium, stainless steel, tantalum), a polymer, water, gases (air, nitrogen), and a detonating material. The PLI partition explores varied geometries with fixed materials: copper, aluminum, stainless steel, polymer, and high explosive. HEAT captures key phenomena such as shock propagation, momentum transfer, plastic deformation, and thermal effects, providing a benchmark dataset for AI/ML models of multi-material shock physics.

[LG-46] A PPA-Driven 3D-IC Partitioning Selection Framework with Surrogate Models

链接: https://arxiv.org/abs/2604.18806
作者: Shang Wang(1),Shuai Liu(1),Owen Randall(1),Matthew E. Taylor(1 and 2) ((1) University of Alberta, (2) Alberta Machine Intelligence Institute (Amii))
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:3D-IC netlist partitioning is commonly optimized using proxy objectives, while final PPA is treated as a costly evaluation rather than an optimization signal. This proxy-driven paradigm makes it difficult to reliably translate additional PPA evaluations into better PPA outcomes. To bridge this gap, we present DOPP (D-Optimal PPA-driven partitioning selection), an approach that bridges the gap between proxies and true PPA metrics. Across eight 3D-IC designs, our framework improves PPA over Open3DBench (average relative improvements of 9.99% congestion, 7.87% routed wirelength, 7.75% WNS, 21.85% TNS, and 1.18% power). Compared with exhaustive evaluation over the full candidate set, DOPP achieves comparable best-found PPA while evaluating only a small fraction of candidates, substantially reducing evaluation cost. By parallelizing evaluations, our method delivers these gains while maintaining wall-clock runtime comparable to traditional baselines.

[LG-47] Preserving Clusters in Error-Bounded Lossy Compression of Particle Data

链接: https://arxiv.org/abs/2604.18801
作者: Congrong Ren,Sheng Di,Katrin Heitmann,Franck Cappello,Hanqi Guo
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Lossy compression is widely used to reduce storage and I/O costs for large-scale particle datasets in scientific applications such as cosmology, molecular dynamics, and fluid dynamics, where clustering structures (e.g., single-linkage or Friends-of-Friends) are critical for downstream analysis; however, existing compressors typically provide only pointwise error bounds on particle positions and offer no guarantees on preserving clustering outcomes, and even small perturbations can alter cluster connectivity and compromise scientific validity. We propose a correction-based technique to preserve single-linkage clustering under lossy compression, operating on decompressed data from off-the-shelf compressors such as SZ3 and Draco. Our key contributions are threefold: (1) a clustering-aware correction algorithm that identifies vulnerable particle pairs via spatial partitioning and local neighborhood search; (2) an optimization-based formulation that enforces clustering consistency using projected gradient descent with a loss that encodes pairwise distance violations; and (3) a scalable GPU-accelerated and distributed implementation for large-scale datasets. Experiments on cosmology and molecular dynamics datasets show that our method effectively preserves clustering results while maintaining competitive compression performance compared with SZ3, ZFP, Draco, LCP, and space-filling-curve-based schemes.

[LG-48] Optimal Exploration of New Products under Assortment Decisions

链接: https://arxiv.org/abs/2604.18800
作者: Jackie Baek,Atanas Dinev,Thodoris Lykouris
类目: Social and Information Networks (cs.SI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study online learning for new products on a platform that makes capacity-constrained assortment decisions on which products to offer. For a newly listed product, its quality is initially unknown, and quality information propagates through social learning: when a customer purchases a new product and leaves a review, its quality is revealed to both the platform and future customers. Since reviews require purchases, the platform must feature new products in the assortment (“explore”) to generate reviews to learn about new products. Such exploration is costly because customer demand for new products is lower than for incumbent products. We characterize the optimal assortments for exploration to minimize regret, addressing two questions. (1) Should the platform offer a new product alone or alongside incumbent products? The former maximizes the purchase probability of the new product but yields lower short-term revenue. Despite the lower purchase probability, we show it is always optimal to pair the new product with the top incumbent products. (2) With multiple new products, should the platform explore them simultaneously or one at a time? We show that the optimal number of new products to explore simultaneously has a simple threshold structure: it increases with the “potential” of the new products and, surprisingly, does not depend on their individual purchase probabilities. We also show that two canonical bandit algorithms, UCB and Thompson Sampling, both fail in this setting for opposite reasons: UCB over-explores while Thompson Sampling under-explores. Our results provide structural insights on how platforms should learn about new products through assortment decisions.

[LG-49] Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

链接: https://arxiv.org/abs/2604.18788
作者: Afsara Benazir,Felix Xiaozhu Lin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, e.g., top-k, scatter/gather, etc., are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to NPU, while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity, which drives three key techniques: (1) Static tiers for expert capacity to address dynamic expert routing; (2) Grouped expert execution to mitigate NPU concurrency limits; and (3) Load-aware expert compute graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.

[LG-50] Streaming Structured Inference with Flash-SemiCRF

链接: https://arxiv.org/abs/2604.18780
作者: Benjamin K. Johnson,Thomas Goralski,Ayush Semwal,Hui Shen,H. Josh Jang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-Markov Conditional Random Fields (semi-CRFs) assign labels to segments of a sequence rather than to individual positions, enabling exact inference over segment-level features and principled uncertainty estimates at their boundaries. However, existing implementations must materialize a large edge potential tensor whose size grows with sequence length, maximum segment length, and label count, becoming prohibitive for speech-scale state spaces and intractable at genomic scales where sequences can exceed 100,000 positions. This memory bottleneck has limited the adoption of exact segment-level inference for long sequences and large label sets. We identify that the core inefficiency is materializing edge potentials that can instead be evaluated on-the-fly from a compact prefix-sum array, and make several improvements. First, replacing the stored edge tensor with prefix-sum lookup reduces the memory footprint by a factor proportional to the product of segment length and label count. Second, a streaming forward-backward pass with checkpoint-boundary normalization keeps working memory sublinear in sequence length while preserving exact gradients. Third, zero-centered cumulative scores control numerical drift and induce an adaptive duration prior under label imbalance. We integrate these ideas into Flash-SemiCRF, a fused Triton kernel that enables exact semi-CRF inference on previously intractable problem sizes. Available at this https URL.
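The core identity behind the first improvement (reading segment scores off a compact prefix-sum array instead of materializing a length × max-segment × labels edge tensor) can be sketched as follows. This assumes purely additive per-position scores, a simplification of the paper's actual featurization, and the names are illustrative:

```python
import numpy as np

def make_segment_scorer(phi):
    """phi: (T, K) array of per-position, per-label scores. Returns a
    function that evaluates the score of segment [i, j) with label y
    on the fly from a prefix-sum array, so no (T, L, K) edge tensor
    is ever stored."""
    T, K = phi.shape
    P = np.zeros((T + 1, K))
    P[1:] = np.cumsum(phi, axis=0)  # P[t, y] = sum of phi[:t, y]
    return lambda i, j, y: P[j, y] - P[i, y]

rng = np.random.default_rng(0)
phi = rng.standard_normal((6, 3))
score = make_segment_scorer(phi)
# Matches the direct sum over the segment's positions:
assert np.isclose(score(1, 4, 2), phi[1:4, 2].sum())
```

The memory saving follows directly: the prefix-sum array is O(T·K), while the stored edge tensor would be O(T·L·K) for maximum segment length L.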

[LG-51] Discrete Tilt Matching

链接: https://arxiv.org/abs/2604.18739
作者: Yuyuan Chen,Shiyi Wang,Peter Potaptchik,Jaeyeon Kim,Michael S. Albergo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM’s annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.

[LG-52] TrEEStealer: Stealing Decision Trees via Enclave Side Channels

链接: https://arxiv.org/abs/2604.18716
作者: Jonas Sander,Anja Rabich,Nick Mahling,Felix Maurer,Jonah Heller,Qifan Wang,Thomas Eisenbarth,David Oswald
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Today, machine learning is widely applied in sensitive, security-related, and financially lucrative applications. Model extraction attacks undermine current business models where a model owner sells model access, e.g., via MLaaS APIs. Additionally, stolen models can enable powerful white-box attacks, facilitating privacy attacks on sensitive training data, and model evasion. In this paper, we focus on Decision Trees (DT), which are widely deployed in practice. Existing black-box extraction attacks for DTs are either query-intensive, make strong assumptions about the DT structure, or rely on rich API information. To limit attacks to the black-box setting, CPU vendors introduced Trusted Execution Environments (TEE) that use hardware-mechanisms to isolate workloads from external parties, e.g., MLaaS providers. We introduce TrEEStealer, a high-fidelity extraction attack for stealing TEE-protected DTs. TrEEStealer exploits TEE-specific side-channels to steal DTs efficiently and without strong assumptions about the API output or DT structure. The extraction efficacy stems from a novel algorithm that maximizes the information derived from each query by coupling Control-Flow Information (CFI) with passive information tracking. We use two primitives to acquire CFI: for AMD SEV, we follow previous work using the SEV-Step framework and performance counters. For Intel SGX, we reproduce prior findings on current Xeon 6 CPUs and construct a new primitive to efficiently extract the branch history of inference runs through the Branch-History-Register. We found corresponding vulnerabilities in three popular libraries: OpenCV, mlpack, and emlearn. We show that TrEEStealer achieves superior efficiency and extraction fidelity compared to prior attacks. Our work establishes a new state-of-the-art for DT extraction and confirms that TEEs fail to protect against control-flow leakage. 

[LG-53] Virtual boundary integral neural network for three-dimensional exterior acoustic problems

链接: https://arxiv.org/abs/2604.18636
作者: Jiahao Li,Qiang Xi,Ilia Marchevskiy,Zhuojia Fu
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a virtual boundary integral neural network (VBINN) for exterior acoustic problems in three dimensions. The method introduces a virtual boundary inside the scatterer or vibrating body and represents the associated source density with a neural network. Coupled with the acoustic fundamental solution, this representation satisfies the Sommerfeld radiation condition by construction and enables direct evaluation of the acoustic pressure and its normal derivative at arbitrary field points. Because the integration surface is separated from the physical boundary, the formulation avoids the singular and near singular kernel evaluations associated with coincident source and collocation points in conventional boundary integral learning methods. To reduce sensitivity to boundary placement, the geometric parameters of the virtual boundary are optimized jointly with the source density during training. Numerical examples for acoustic scattering, multiple body interaction, and underwater acoustic propagation show close agreement with analytical solutions and COMSOL results, and the Burton Miller extension further improves stability near characteristic frequencies. These results demonstrate the potential of VBINN for exterior acoustic analysis in three dimensions.

[LG-54] Phase Transitions in the Fluctuations of Functionals of Random Neural Networks

链接: https://arxiv.org/abs/2604.19738
作者: Simmaco Di Lillo,Leonardo Maini,Domenico Marinucci
类目: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We establish central and non-central limit theorems for sequences of functionals of the Gaussian output of an infinitely-wide random neural network on the d-dimensional sphere. We show that the asymptotic behaviour of these functionals as the depth of the network increases depends crucially on the fixed points of the covariance function, resulting in three distinct limiting regimes: convergence to the same functional of a limiting Gaussian field; convergence to a Gaussian distribution; or convergence to a distribution in the Qth Wiener chaos. Our proofs exploit tools that are now classical (Hermite expansions, Diagram Formula, Stein-Malliavin techniques), but also ideas which have never been used in similar contexts: in particular, the asymptotic behaviour is determined by the fixed-point structure of the iterative operator associated with the covariance, whose nature and stability govern the different limiting regimes.

[LG-55] Improvements to the post-processing of weather forecasts using machine learning and feature selection

链接: https://arxiv.org/abs/2604.19340
作者: Kazuma Iwase,Tomoyuki Takenawa
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 24 pages

点击查看摘要

Abstract:This study aims to develop and improve machine learning-based post-processing models for precipitation, temperature, and wind speed predictions using the Mesoscale Model (MSM) dataset provided by the Japan Meteorological Agency (JMA) for 18 locations across Japan, including plains, mountainous regions, and islands. By incorporating meteorological variables from grid points surrounding the target locations as input features and applying feature selection based on correlation analysis, we found that, in our experimental setting, the LightGBM-based models achieved lower RMSE than the specific neural-network baselines tested in this study, including a reproduced CNN baseline, and also generally achieved lower RMSE than both the raw MSM forecasts and the JMA post-processing product, MSM Guidance (MSMG), across many locations and forecast lead times. Because precipitation has a highly skewed distribution with many zero cases, we additionally examined Tweedie-based loss functions and event-weighted training strategies for precipitation forecasting. These improved event-oriented performance relative to the original LightGBM model, especially at higher rainfall thresholds, although the gains were site dependent and overall performance remained slightly below MSMG.
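A simplified stand-in for the correlation-based feature selection step, ranking candidate predictors (e.g., meteorological variables from grid points around the target site) by absolute Pearson correlation with the target; the function is hypothetical and the study's actual selection criteria may differ:

```python
import numpy as np

def select_by_correlation(X, y, top_k):
    """Keep the indices of the top_k columns of X with the largest
    absolute Pearson correlation with the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:top_k]

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.standard_normal(200)  # feature 2 drives y
print(select_by_correlation(X, y, top_k=2))
```

The selected column indices would then feed the LightGBM models as the reduced input feature set.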

[LG-56] Deep Image Prior for photoacoustic tomography can mitigate limited-view artifacts

链接: https://arxiv.org/abs/2604.19176
作者: Hanna Pulkkinen,Jenni Poimala,Leonid Kunyansky,Janek Gröhl,Andreas Hauptmann
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study the deep image prior (DIP) framework applied to photoacoustic tomography (PAT) as an unsupervised reconstruction approach to mitigate limited-view artifacts and noise commonly encountered in experimental settings. Efficient implementation is achieved by employing recently published fast forward and adjoint algorithms for circular measurement geometries. Initialization via a fast inverse and total variation (TV) regularization are applied to further suppress noise and mitigate overfitting. For comparison, we compute a classical TV reconstruction. Our experiments comprise simulated PAT measurements under limited-view geometries and varying levels of added noise, as well as experimental measurements together with a digital twin for quality assessment. Our findings suggest that the DIP framework provides an effective unsupervised strategy for robust PAT reconstruction even in the challenging case of a limited-view geometry, yielding improvements in several quantitative measures over total variation reconstructions.

[LG-57] Analytical Extraction of Conditional Sobol Indices via Basis Decomposition of Polynomial Chaos Expansions

链接: https://arxiv.org/abs/2604.19165
作者: Shijie Zhong,Jiangfeng Fu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:In uncertainty quantification, evaluating sensitivity measures under specific conditions (i.e., conditional Sobol’ indices) is essential for systems with parameterized responses, such as spatial fields or varying operating conditions. Traditional approaches often rely on point-wise modeling, which is computationally expensive and may lack consistency across the parameter space. This paper demonstrates that for a pre-trained global Polynomial Chaos Expansion (PCE) model, the analytical conditional Sobol’ indices are inherently embedded within its basis functions. By leveraging the tensor-product property of PCE bases, we reformulate the global expansion into a set of analytical coefficient fields that depend on the conditioning variables. Based on the preservation of orthogonality under conditional probability measures, we derive closed-form expressions for conditional variances and Sobol’ indices. This framework bypasses the need for repetitive modeling or additional sampling, transforming conditional sensitivity analysis into a purely algebraic post-processing step. Numerical benchmarks indicate that the proposed method ensures physical coherence and offers superior numerical robustness and computational efficiency compared to conventional point-wise approaches.
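As background for the conditional case the paper develops, the unconditional Sobol' indices already fall out of a PCE algebraically: with an orthonormal basis, the variance is the sum of squared non-constant coefficients, and the first-order index of an input keeps only terms where that input alone is active. A minimal sketch (the sparse multi-index representation and the example coefficients are hypothetical):

```python
# A PCE stored as {multi-index: coefficient}, one polynomial degree per
# input, assuming an orthonormal basis. Example with two inputs:
# y = 1 + 2*psi_1(x1) + 3*psi_1(x2) + 4*psi_1(x1)*psi_1(x2)
pce = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 3.0, (1, 1): 4.0}

def sobol_first_order(pce, i):
    """First-order Sobol' index of input i, read off the coefficients:
    total variance = sum of squared non-constant coefficients; the
    numerator keeps terms where only input i has nonzero degree."""
    var = sum(c * c for a, c in pce.items() if any(a))
    var_i = sum(c * c for a, c in pce.items()
                if a[i] > 0 and all(d == 0 for j, d in enumerate(a) if j != i))
    return var_i / var

print(sobol_first_order(pce, 0))  # 4 / (4 + 9 + 16)
```

The paper's contribution is the conditional analogue: regrouping the same coefficients into coefficient fields over the conditioning variables, so that conditional variances remain a purely algebraic post-processing step.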

[LG-58] Fast estimation of Gaussian mixture components via centering and singular value thresholding

链接: https://arxiv.org/abs/2604.19091
作者: Huan Qing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 28 pages, 7 figures, 1 table

点击查看摘要

Abstract:Estimating the number of components is a fundamental challenge in unsupervised learning, particularly when dealing with high-dimensional data with many components or severely imbalanced component sizes. This paper addresses this challenge for classical Gaussian mixture models. The proposed estimator is simple: center the data, compute the singular values of the centered matrix, and count those above a threshold. No iterative fitting, no likelihood calculation, and no prior knowledge of the number of components are required. We prove that, under a mild separation condition on the component centers, the estimator consistently recovers the true number of components. The result holds in high-dimensional settings where the dimension can be much larger than the sample size. It also holds when the number of components grows to the smaller of the dimension and the sample size, even under severe imbalance among component sizes. Computationally, the method is extremely fast: for example, it processes ten million samples in one hundred dimensions within one minute. Extensive experimental studies confirm its accuracy in challenging settings such as high dimensionality, many components, and severe class imbalance.
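The estimator as described is simple enough to sketch directly. Note two assumptions on my part: the threshold value here is an ad-hoc placeholder (the paper derives a principled one), and the "+ 1" convention reflects that centering removes one mean direction, so k components leave k−1 large singular values:

```python
import numpy as np

def count_components(X, threshold):
    """Center the data, compute the singular values of the centered
    matrix, and count those above the threshold. The '+ 1' and the
    threshold choice are assumptions of this sketch, not the paper's
    exact prescription."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return int((s > threshold).sum()) + 1

# Toy data: 3 well-separated unit-variance Gaussian components in 10 dims.
rng = np.random.default_rng(1)
centers = np.zeros((3, 10))
centers[0, 0] = centers[1, 1] = centers[2, 2] = 50.0
X = np.vstack([c + rng.standard_normal((200, 10)) for c in centers])
print(count_components(X, threshold=100.0))  # prints 3
```

No iterative fitting or likelihood evaluation is involved, which is why the method scales to millions of samples: the cost is a single centering pass plus one singular value decomposition.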

[LG-59] Ground-Level Near Real-Time Modeling for PM2.5 Pollution Prediction

链接: https://arxiv.org/abs/2604.18973
作者: Zachary R. Fox,Janet O. Agbaje,Dakotah Maguire,Javier E. Santos,Jeremy Logan,Maggie Davis,Rima Habre,Jim VanDerslice,Heidi A. Hanson
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Air pollution is a worldwide public health threat that can cause or exacerbate many illnesses, including respiratory disease, cardiovascular disease, and some cancers. However, epidemiological studies and public health decision-making are stymied by the inability to assess pollution exposure impacts in near real time. To address this, developing accurate digital twins of environmental pollutants will enable timely data-driven analytics - a crucial step in modernizing health policy and decision-making. Although other models predict and analyze fine particulate matter exposure, they often rely on modeled input data sources and data streams that are not regularly updated. Another challenge stems from current models relying on predefined grids. In contrast, our deep-learning approach interpolates surface level PM2.5 concentrations between sparsely distributed US EPA monitoring stations in a grid-free manner. By incorporating additional, readily available datasets - including topographic, meteorological, and land-use data - we improve its ability to predict pollutant concentrations with high spatial and temporal resolution. This enables model querying at any spatial location for rapid predictions without computing over the entire grid. To ensure robustness, we randomize spatial sampling during training to enable our model to perform well in both dense and sparse monitored regions. This model is well suited for near real-time deployment because its lightweight architecture allows for fast updates in response to streaming data. Moreover, model flexibility and scalability allow it to be adapted to various geographical contexts and scales, making it a practical tool for delivering accurate and timely air quality assessments. Its capacity to rapidly evaluate multiple scenarios can be especially valuable for decision-making during public health crises.

[LG-60] Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation

链接: https://arxiv.org/abs/2604.18972
作者: Yaowei Zheng,Richong Zhang,Shenxi Wu,Shirui Bian,Haosong Zhang,Li Zeng,Xingjian Ma,Yichi Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 23 pages, 6 figures

点击查看摘要

Abstract:We study finite-horizon continuous-time policy evaluation from discrete closed-loop trajectories under time-inhomogeneous dynamics. The target value surface solves a backward parabolic equation, but the Bellman baseline obtained from one-step recursion is only first-order in the grid width. We estimate the time-dependent generator from multi-step transitions using moment-matching coefficients that cancel lower-order truncation terms, and combine the resulting surrogate with backward regression. The main theory gives an end-to-end decomposition into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error, together with a decision-frequency regime map explaining when higher-order gains should be visible. Across calibration studies, four-scale benchmarks, feature and start-up ablations, and gain-mismatch stress tests, the second-order estimator consistently improves on the Bellman baseline and remains stable in the regime where the theory predicts visible gains. These results position high-order generator regression as an interpretable continuous-time policy-evaluation method with a clear operating region.

[LG-61] Trainability Beyond Linearity in Variational Quantum Objectives

链接: https://arxiv.org/abs/2604.18846
作者: Gordon Ma,Xiufan Li
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 28 pages, 6 figures

点击查看摘要

Abstract:Barren-plateau results have established exponential gradient suppression as a widely cited obstacle to the scalability of variational quantum algorithms. When and whether these results extend to a given objective has been addressed through loss-specific arguments, but a general structural characterization has remained open. We show that the objective itself admits a fixed-observable representation if and only if the loss is affine in the measured statistics, thereby identifying the exact boundary of the standard concentration-based proof template. Existing transfer results for non-affine losses achieve this reduction under additional assumptions; our characterization implies that such a reduction is not structurally available for a class of non-affine objectives, placing them outside the automatic reach of the existing proof template. Beyond the affine regime, a chain-rule decomposition reveals three governing factors – model responsivity, loss-side signal, and transmittance – and induces a loss-class dichotomy: bounded-gradient losses inherit suppression, while amplification-capable losses can in principle counteract it. In the exponentially wide setting, both classes fail, but for different structural reasons. When the interface is instead designed at polynomial width – exposing coarse-grained statistics rather than individual bitstring probabilities – the exponential-dimensional obstruction is relaxed and the dichotomy plays a genuine role. In a numerical demonstration on a charge-conserving quantum system, the amplification-capable objective produces resolved gradients several orders of magnitude larger than affine and inheriting baselines at comparable shot budgets. Over the tested interval, its scaling trend is statistically distinguished from the exponential trend of both alternatives. The boundary is affine; what lies beyond it is a representation-design problem.

[LG-62] Benchmarking Quantum Kernel Support Vector Machines Against Classical Baselines on Tabular Data: A Rigorous Empirical Study with Hardware Validation

链接: https://arxiv.org/abs/2604.18837
作者: Siavash Kakavand,Christoph Strohmeyer,Michael Schlotter
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: Code and data: this https URL

点击查看摘要

Abstract:Quantum kernel methods have been proposed as a promising approach for leveraging near-term quantum computers for supervised learning, yet rigorous benchmarks against strong classical baselines remain scarce. We present a comprehensive empirical study of quantum kernel support vector machines (QSVMs) across nine binary classification datasets, four quantum feature maps, three classical kernels, and multiple noise models, totalling 970 experiments with strict nested cross-validation. Our analysis spans four phases: (i) statistical significance testing, revealing that none of 29 pairwise quantum-classical comparisons reach significance at \alpha = 0.05 ; (ii) learning curve analysis over six training fractions, showing steeper quantum slopes on six of eight datasets that nonetheless fail to close the gap to the best classical baseline; (iii) hardware validation on IBM ibm_fez (Heron r2), demonstrating kernel fidelity r \geq 0.976 across six experiments; and (iv) seed sensitivity analysis confirming reproducibility (mean CV 1.4%). A Kruskal-Wallis factorial analysis reveals that dataset choice dominates performance variance ( \varepsilon^2 = 0.73 ), while kernel type accounts for only 9%. Spectral analysis offers a mechanistic explanation: current quantum feature maps produce eigenspectra that are either too flat or too concentrated, missing the intermediate profile of the best classical kernel, the radial basis function (RBF). Quantum kernel training (QKT) via kernel-target alignment yields the single competitive result – balanced accuracy 0.968 on breast cancer – but with ~2,000x computational overhead. Our findings provide actionable guidelines for quantum kernel research. The complete benchmark suite is publicly available to facilitate reproduction and extension. 

[LG-63] Sparse Network Inference under Imperfect Detection and its Application to Ecological Networks

链接: https://arxiv.org/abs/2604.18820
作者: Aoran Zhang,Tianyao Wei,Maria J. Guerrero,César A. Uribe
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Applications (stat.AP)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:Recovering latent structure from count data has received considerable attention in network inference, particularly when one seeks both cross-group interactions and within-group similarity patterns in bipartite networks, which are widely used in ecology research. Such networks are often sparse and inherently imperfect in their detection. Existing models mainly focus on interaction recovery, while the induced similarity graphs are much less studied. Moreover, sparsity is often not controlled, and scale is unbalanced, leading to oversparse or poorly rescaled estimates with degraded structural recovery. To address these issues, we propose a framework for structured sparse nonnegative low-rank factorization with detection probability estimation. We impose nonconvex \ell_1/2 regularization on the latent similarity and connectivity structures to promote sparsity in the within-group similarity and cross-group connectivity with better relative scale. The resulting optimization problem is nonconvex and nonsmooth. To solve it, we develop an ADMM-based algorithm with adaptive penalization and scale-aware initialization, and establish its asymptotic feasibility and the KKT stationarity of cluster points under mild regularity conditions. Experiments on synthetic and real-world ecological datasets demonstrate improved recovery of latent factors and similarity/connectivity structure relative to existing baselines.

[LG-64] Quantum AI for Cancer Diagnostic Biomarker Discovery

链接: https://arxiv.org/abs/2604.18621
作者: Mandeep Kaur Saggi,Amandeep Singh Bhatia,Humaira Gowher,Sabre Kais
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 25 pages, 15 figures

点击查看摘要

Abstract:Quantum machine learning offers a promising new paradigm for computational biology by leveraging quantum mechanical principles to enhance cancer classification, biomarker discovery, and bioinformatics diagnostics. In this study, we apply QML to identify subtype-specific biomarkers for lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), the two predominant forms of non-small cell lung cancer. Our methodology involves a two-phase process: in Phase 1, differential expression analysis and methylation analysis between tumor and normal samples allows us to identify LUAD-specific and LUSC-specific genes, revealing potential prognostic biomarkers for cancer subtypes. Phase 2 focuses on developing a quantum classifier capable of distinguishing between LUAD and LUSC tumors, as well as between tumor and normal samples. This classifier not only enhances diagnostic precision but also demonstrates the quantum advantage in processing large-scale multiomic datasets. Our results consistently demonstrated that Sample3, representing the combined gene set, achieved the highest overall predictive performance in all metrics. These results demonstrate that QML provides an effective and scalable approach for biomarker discovery and subtype-specific cancer classification. GO enrichment analysis highlighted the significant involvement of genes in synaptic signaling, ion channel regulation, and neuronal development. In the quantum phase, KEGG analysis further identified enrichment in cancer-associated pathways, including neurotrophin, MAPK, Ras, and PI3K-Akt signaling, with key genes such as NGFR, NTRK2, and NTF3 suggesting a central role in neurotrophin-mediated oncogenic processes. Our findings highlight the growing potential of quantum computing to advance precision oncology and next-generation biomedical analytics.

[LG-65] Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

链接: https://arxiv.org/abs/2604.18603
作者: Logan Halle,Jason P. Gleghorn
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bidirectional transformers are the foundation of many sequence modeling tasks across natural, biological, and chemical language domains, but they are permutation-invariant without explicit positional embeddings. In contrast, unidirectional attention inherently encodes positional information through its triangular mask, enabling models to operate without positional embeddings altogether. Here, we introduce Dual Triangle Attention, a novel bidirectional attention mechanism that separates the query-key subspace of each attention head into two complementary triangular masks: one that attends to past-and-self positions and one that attends to future-and-self positions. This design provides bidirectional context while maintaining the causal mask’s implicit positional inductive bias in both directions. Using PyTorch’s flex_attention, Dual Triangle Attention is implemented as a single compiled kernel call with no additional parameters beyond standard multi-head attention. We evaluated Dual Triangle Attention across three settings: (1) a synthetic argmax position probe, (2) masked language modeling (MLM) on natural language, and (3) MLM on protein sequences. In the argmax task, both Dual Triangle Attention and causal attention learn positional information without explicit positional embeddings, whereas standard bidirectional attention cannot. In the MLM experiments, Dual Triangle Attention with Rotary Positional Embeddings (RoPE) achieved the best context extension performance and strong performance across the board. These findings suggest that Dual Triangle Attention is a viable attention mechanism for bidirectional transformers, with or without positional embeddings.
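The mask geometry described in the abstract (two complementary triangular masks splitting each head's query-key subspace) can be sketched in NumPy. The real implementation uses PyTorch's flex_attention; this only illustrates coverage and overlap of the two masks:

```python
import numpy as np

def dual_triangle_masks(T):
    """One mask attends to past-and-self positions (the usual causal
    mask), the other to future-and-self positions; split across each
    head's query-key subspace, the pair yields bidirectional context
    while keeping the causal mask's positional inductive bias."""
    past = np.tril(np.ones((T, T), dtype=bool))    # query t sees key s <= t
    future = np.triu(np.ones((T, T), dtype=bool))  # query t sees key s >= t
    return past, future

past, future = dual_triangle_masks(5)
# Together the masks cover every (query, key) pair, and they overlap
# only on the diagonal (the "self" position):
assert (past | future).all()
assert np.array_equal(past & future, np.eye(5, dtype=bool))
```

Because each direction keeps a triangular structure, each half of the head inherits the positional information that a causal mask encodes implicitly, which is what lets the model drop explicit positional embeddings.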

[LG-66] Batch-Adaptive Causal Annotations

链接: https://arxiv.org/abs/2502.10605
作者: Ezinne Nwankwo,Lauri Goldkind,Angela Zhou
类目: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Estimating the causal effects of interventions is crucial to policy and decision-making, yet outcome data are often missing or subject to non-standard measurement error. While ground-truth outcomes can sometimes be obtained through costly data annotation or follow-up, budget constraints typically allow only a fraction of the dataset to be labeled. We address this challenge by optimizing which data points should be sampled for outcome information in order to improve efficiency in average treatment effect estimation with missing outcomes. We derive a closed-form solution for the optimal batch sampling probability by minimizing the asymptotic variance of a doubly robust estimator for causal inference with missing outcomes. Motivated by our street outreach partners, we extend the framework to costly annotations of unstructured data, such as text or images in healthcare and social services. Across simulated and real-world datasets, including one of outreach interventions in homelessness services, our approach achieves substantially lower mean-squared error and recovers the AIPW estimate with fewer labels than existing baselines. In practice, we show that our method can match confidence intervals obtained with 361 random samples using only 90 optimized samples - saving 75% of the labeling budget.

附件下载

点击下载今日全部论文列表