This post lists the latest papers retrieved from Arxiv.org on 2026-03-12, updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org daily and updated automatically around 12:30 each morning.
Tip: if a day's update is missing, either Arxiv published no new papers that day or the script failed; fixes are usually applied the same day.
Table of Contents
Overview (2026-03-12)
603 papers updated today, including:
- Natural Language Processing: 106 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 186 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 108 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 182 papers (Machine Learning (cs.LG))
- Multiagent Systems: 5 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 16 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 25 papers (Human-Computer Interaction (cs.HC))
Multiagent Systems
[MA-0] COMIC: Agentic Sketch Comedy Generation
[Quick Read]: This paper tackles the automated generation of high-quality, diverse short comedy videos in the style of Saturday Night Live. The key to its solution is a role-simulation multi-agent system in which agents model the roles of a real production studio and raise content quality and creative diversity through iterative competition, evaluation, and refinement; it further introduces LLM humor critics, trained via analysis of a YouTube comedy-video corpus, to automatically score and steer the humor of generated content, yielding output that approaches professionally produced quality.
Link: https://arxiv.org/abs/2603.11048
Authors: Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
Affiliations: University of Washington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments: Project page: this https URL
Abstract:We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.
[MA-1] LLMGreenRec: LLM-Based Multi-Agent Recommender System for Sustainable E-Commerce
[Quick Read]: This paper addresses the shortcomings of traditional session-based recommender systems in steering e-commerce users toward sustainable products: optimized for short-term conversions, they fail to capture nuanced user intent for eco-friendly products, perpetuating a gap between green intentions and actions. The key to the solution is LLMGreenRec, a multi-agent framework in which Large Language Models (LLMs) collaboratively analyze user interactions and iteratively refine prompts, allowing specialized agents to infer green-oriented user intent and prioritize eco-friendly recommendations; this intent-driven approach also reduces unnecessary interactions and energy consumption, improving recommendation quality while lowering the system's own digital carbon footprint.
Link: https://arxiv.org/abs/2603.11025
Authors: Hao N. Nguyen, Hieu M. Nguyen, Son Van Nguyen, Nguyen Thi Hanh
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Information Retrieval (cs.IR)
Comments: Accepted to the Proceedings of the Conference on Digital Economy and Fintech Innovation (DEFI 2025). To appear in IEEE Xplore
Abstract:Rising environmental awareness in e-commerce necessitates recommender systems that not only guide users to sustainable products but also minimize their own digital carbon footprints. Traditional session-based systems, optimized for short-term conversions, often fail to capture nuanced user intents for eco-friendly choices, perpetuating a gap between green intentions and actions. To tackle this, we introduce LLMGreenRec, a novel multi-agent framework that leverages Large Language Models (LLMs) to promote sustainable consumption. Through collaborative analysis of user interactions and iterative prompt refinement, LLMGreenRec’s specialized agents deduce green-oriented user intents and prioritize eco-friendly product recommendations. Notably, this intent-driven approach also reduces unnecessary interactions and energy consumption. Extensive experiments on benchmark datasets validate LLMGreenRec’s effectiveness in recommending sustainable products, demonstrating a robust solution that fosters a responsible digital economy.
[MA-2] GRACE: A Unified 2D Multi-Robot Path Planning Simulator Benchmark for Grid Roadmap And Continuous Environments ICRA2026
[Quick Read]: This paper addresses the lack of a unified, reproducible platform for comparing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) methods across abstraction levels: existing tools either operate under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity without comparable instrumentation. The key to the solution is GRACE, a unified 2D simulator and benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol, enabling fair comparison across representations. This design quantifies representation-fidelity trade-offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale further), advancing standardization and the practical translation of multi-robot planning research.
Link: https://arxiv.org/abs/2603.10858
Authors: Chuanlong Zang, Anna Mannucci, Isabelle Barz, Philipp Schillinger, Florian Lier, Wolfgang Hönig
Affiliations: Robert Bosch GmbH; Technical University of Berlin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: ICRA 2026, code will be released soon
Abstract:Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices. Existing tools either scale under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity with less comparable instrumentation. We present GRACE, a unified 2D simulator+benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol. Our empirical results on public maps and representative planners enable commensurate comparisons on a shared instance set. Furthermore, we quantify the expected representation-fidelity trade-offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale farther). By consolidating representation, execution, and evaluation, GRACE thereby aims to make cross-representation studies more comparable and provides a means to advance multi-robot planning research and its translation to practice.
[MA-3] KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
[Quick Read]: This paper targets the inefficiency and weak interpretability of current LLM-based GPU kernel optimization, which relies on implicitly learned heuristics within the LLM to decide optimization strategies, leading to lengthy trial-and-error and hard-to-interpret results. The key to the solution is KernelSkill, a multi-agent framework with a dual-level memory architecture: long-term memory stores reusable expert optimization skills while short-term memory prevents repetitive backtracking, turning optimization decisions from implicit heuristics into explicit, trajectory-aware, knowledge-driven strategies. Experiments show KernelSkill reaches a 100% success rate on KernelBench Levels 1-3 with average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager, clearly outperforming existing baselines.
Link: https://arxiv.org/abs/2603.10085
Authors: Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial-and-error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge-driven and aware of task trajectories. Specifically, we present KernelSkill, a multi-agent framework with a dual-level memory architecture. KernelSkill operates by coordinating agents with long-term memory of reusable expert skills and short-term memory to prevent repetitive backtracking. On KernelBench Levels 1-3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at this https URL.
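The abstract's "long-term memory of reusable expert skills" with stage-specific Q-values amounts to ranking stored experiences by a learned value for the current stage. The paper does not publish its retrieval code, so the record schema and scoring below are illustrative assumptions, not KernelSkill's actual implementation:

```python
def retrieve(experiences, stage, k=3):
    """Return the top-k experiences ranked by their learned Q-value for
    the current optimization stage ('draft' or 'refine').

    `experiences` is a list of dicts with a 'q' map from stage name to
    Q-value; this schema is a hypothetical stand-in for the paper's memory.
    """
    ranked = sorted(experiences, key=lambda e: e["q"].get(stage, 0.0), reverse=True)
    return ranked[:k]
```

An experience that helped produce a correct first draft can thus score highly for the "draft" stage while a latency-tuning trick dominates the "refine" stage, which is the stage-specific prioritization the abstract describes.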
[MA-4] Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead
[Quick Read]: This paper addresses the rapidly growing complexity of memory requirements as LLM agents evolve into collaborative multi-agent systems. The key to its position is to treat multi-agent memory as a computer architecture problem: it proposes a three-layer memory hierarchy (I/O, cache, and main memory), distinguishes shared versus distributed memory paradigms, and identifies two critical protocol gaps, namely cache sharing across agents and structured memory access control. The paper argues that multi-agent memory consistency is the most pressing open challenge and that this architectural framing provides a foundation for reliable, scalable multi-agent systems.
Link: https://arxiv.org/abs/2603.10062
Authors: Zhongming Yu, Naicheng Yu, Hejia Zhang, Wentao Ni, Mingrui Yin, Jiaying Yang, Yujie Zhao, Jishen Zhao
Affiliations: University of California, San Diego; Georgia Institute of Technology
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:As LLM agents evolve into collaborative multi-agent systems, their memory requirements grow rapidly in complexity. This position paper frames multi-agent memory as a computer architecture problem. We distinguish shared and distributed memory paradigms, propose a three-layer memory hierarchy (I/O, cache, and memory), and identify two critical protocol gaps: cache sharing across agents and structured memory access control. We argue that the most pressing open challenge is multi-agent memory consistency. Our architectural framing provides a foundation for building reliable, scalable multi-agent systems.
Natural Language Processing
[NLP-0] Instruction set for the representation of graphs
[Quick Read]: This paper addresses the compact, serializable representation of graph structure, in particular encoding any finite simple graph as an isomorphism-invariant string suitable for language-model processing. The key to the solution is IsalGraph: a small virtual machine, consisting of a sparse graph, a circular doubly-linked list (CDLL), and two traversal pointers, encodes graphs over a nine-character instruction alphabet; every string decodes to a valid graph with no invalid states reachable; a greedy algorithm encodes any connected graph in polynomial time, and a backtracking variant produces the lexicographically smallest canonical string, ensuring unique and consistent encodings. The approach yields a strong correlation between graph edit distance (GED) and the Levenshtein distance between strings, with applications to graph similarity search, graph generation, and graph-conditioned language modelling.
Link: https://arxiv.org/abs/2603.11039
Authors: Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez
Affiliations: University of Málaga; ITIS Software
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments:
Abstract:We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
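The correlation result above rests on standard Levenshtein (edit) distance between encoded strings as a cheap proxy for GED. The paper's encoder itself is not reproduced here, but the distance it is compared against is the classic dynamic-programming algorithm, a minimal sketch of which is:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions turning string a into string b (two-row DP)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner loop over the shorter string
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution (free on match)
        prev = curr
    return prev[-1]
```

Comparing two IsalGraph strings with this distance costs O(|a|·|b|), far cheaper than exact GED, which is NP-hard in general.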
[NLP-1] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
[Quick Read]: This paper examines the reliability of a core assumption in the LLM-as-a-judge paradigm: that high inter-evaluator agreement implies objective, trustworthy evaluation. The study finds this agreement is often an "Evaluation Illusion": judges produce sophisticated critiques yet anchor scores on surface heuristics rather than substantive quality, so sample-level agreement is far weaker than model-level agreement. The key to the solution is dynamically generated, domain-knowledge-grounded rubrics: the proposed MERG (Metacognitive Enhanced Rubric Generation) framework anchors evaluators on shared standards, raising agreement in codified domains (Education +22%, Academic +27%) while allowing genuine evaluative pluralism in subjective domains, thereby improving the soundness and interpretability of evaluation, with implications for reward modeling in RLAIF.
Link: https://arxiv.org/abs/2603.11027
Authors: Mingyang Song, Mao Zheng, Chenning Xu
Affiliations: Tencent
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize the Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman ρ = 0.99) masks fragile sample-level agreement (mean Pearson r = 0.72; absolute agreement ICC = 0.67), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs paradoxically receive the least consistent evaluations. Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement increases in codified domains (Education +22%, Academic +27%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.
[NLP-2] TOSSS: a CVE-based Software Security Benchmark for Large Language Models
[Quick Read]: This paper addresses the security risks that Large Language Models (LLMs) introduce into software development workflows, specifically evaluating their ability to identify and prefer secure code snippets when generating or selecting code; existing security benchmarks cover only a limited range of vulnerabilities. The key contribution is TOSSS (Two-Option Secure Snippet Selection), an extensible benchmark built on the CVE database that quantifies a model's security ability via binary choices between secure and vulnerable snippets, summarized as a score from 0 to 1, where higher scores indicate a stronger preference for secure code. Evaluating 14 widely used open- and closed-source models on C/C++ and Java code yields scores from 0.48 to 0.89, revealing substantial differences and room for improvement in LLM software security.
Link: https://arxiv.org/abs/2603.10969
Authors: Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, Roos Wensveen
Affiliations: University of Twente; CWI Amsterdam; Erasmus University Rotterdam; Datadog; Ecole Supérieure d'Ingénieurs Léonard de Vinci; Leiden University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments:
Abstract:With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are LLMs good at software security? At the same time, organizations worldwide invest heavily in cybersecurity to reduce exposure to disruptive attacks. The integration of LLMs into software engineering workflows may introduce new vulnerabilities and weaken existing security efforts. We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities. In contrast, TOSSS relies on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. Our benchmark gives each model a security score between 0 and 1 based on its behavior; a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could become a complementary security-focused score to include in these reports.
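The 0-to-1 score above has a natural minimal reading: the fraction of two-option trials in which the model picks the secure snippet. The paper's exact aggregation (e.g., any weighting across CVEs or languages) is not spelled out in the abstract, so the sketch below is the simplest interpretation:

```python
def security_score(choices):
    """Fraction of two-option trials where the model chose the secure
    snippet. `choices` holds one label per trial, 'secure' or
    'vulnerable' (a simplified stand-in for the benchmark's scoring)."""
    if not choices:
        raise ValueError("no trials recorded")
    return sum(c == "secure" for c in choices) / len(choices)
```

Under this reading, a model that always selects the secure snippet scores 1.0, one that always selects the vulnerable snippet scores 0.0, and a coin-flipping model hovers near 0.5, consistent with the 0.48 lower end observed in the paper.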
[NLP-3] LLM2Vec-Gen: Generative Embeddings from Large Language Models
[Quick Read]: This paper addresses the input-output mismatch in conventional LLM-based text embedding: embedding tasks require mapping diverse inputs to semantically consistent fixed-length vectors, which plain input encoding handles poorly. The key to the solution is LLM2Vec-Gen, a novel self-supervised paradigm that, rather than encoding the input's semantics directly, adds trainable special tokens to the LLM's vocabulary, appends them to the input, and optimizes them to represent the LLM's potential response; training is guided by the LLM's own completion together with distillation targets from an unsupervised embedding teacher. This narrows the semantic gap between input and embedding output while transferring LLM capabilities such as safety alignment and reasoning, with the LLM backbone frozen and only unlabeled queries required for training.
Link: https://arxiv.org/abs/2603.10913
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.
[NLP-4] GLM-OCR Technical Report
[Quick Read]: This paper addresses the low computational efficiency, redundant generation, and deployment difficulty of conventional OCR models in real-world document understanding. The key to the solution is GLM-OCR, a lightweight multimodal model that introduces a Multi-Token Prediction (MTP) mechanism to predict multiple tokens in parallel at each decoding step, significantly improving decoding throughput with low memory overhead, and adopts a two-stage pipeline in which PP-DocLayout-V3 first performs document layout analysis, followed by parallel region-level recognition. This balances computational efficiency and recognition performance, making the model suitable for edge devices and large-scale production systems.
Link: https://arxiv.org/abs/2603.10910
Authors: Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
[NLP-5] From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers
[Quick Read]: This paper addresses the limitations of existing knowledge distillation (KD) in cross-modal settings: traditional KD assumes teacher and student share a modality, while existing multimodal KD methods require expensive modality-specific pre-training of the teacher, which is rarely feasible in practice. The key to the solution is ARMADA, a framework that introduces novel alignment techniques to transfer knowledge from vision-language models (including black-box models) to language-only models efficiently and scalably, without modifying the teacher's internals, substantially improving the student on natural language understanding, generative reasoning, and instruction-tuning tasks without multimodal pre-training or teacher fine-tuning.
Link: https://arxiv.org/abs/2603.10877
Authors: Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty
Affiliations: Indian Institute of Technology Delhi; Indraprastha Institute of Information Technology Delhi
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-3B, 7B, 8B. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.
[NLP-6] SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0 LREC2026
[Quick Read]: This paper addresses the challenge of building historical corpora for a low-resource language, specifically the lack of a large, cross-period annotated corpus for Sinhala. The key to the solution is SiDiaC-v.2.0, the largest and most comprehensive Sinhala Diachronic Corpus to date, covering 185 literary works (244k words) published between 1800 and 1955, with a 59-document (70k-word) subset annotated by written date. Core elements include digitization of historical texts with Google Document AI OCR followed by manual post-processing to fix formatting issues, code-mixing, and malformed tokens; text-cleaning and syntactic-annotation strategies informed by corpora such as FarPaHC and CCOHA, raising the standard of low-resource language processing; and a two-layer categorization (primary: Non-Fiction/Fiction; secondary: Religious, History, Poetry, Language, Medical) that improves the corpus's structure and usability, substantially advancing the foundations of Sinhala NLP research.
Link: https://arxiv.org/abs/2603.10861
Authors: Nevidu Jayatilleke, Nisansa de Silva, Uthpala Nimanthi, Gagani Kulathilaka, Azra Safrullah, Johan Sofalas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 13 figures, 10 tables, Accepted paper at the 15th Language Resources and Evaluation Conference (LREC 2026)
Abstract:SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering a period from 1800 CE to 1955 CE in terms of publication dates, and a historical span from the 5th to the 20th century CE in terms of written dates. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list, which was digitised using Google Document AI OCR. This was followed by post-processing to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. This was particularly relevant for syntactic annotation and text normalisation strategies, given the shared characteristics of low-resource language status between Faroese and the similar cleaning strategies utilised in CCOHA. This corpus is categorised into two layers based on genres: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite facing challenges due to limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.
[NLP-7] V_0.5: Generalist Value Model as a Prior for Sparse RL Rollouts
[Quick Read]: This paper addresses the stability and variance of advantage baselines in reinforcement learning, particularly under sparse rollouts, where the empirical mean has high variance while a pre-trained generalist value model (such as V_0) offers a prior that may carry systematic bias (or hallucinations). The key to the solution is V_0.5, which adaptively fuses the value model's prior baseline with the empirical mean of sparse rollouts through real-time statistical testing and dynamic budget allocation: a hypothesis test evaluates the prior's reliability in real time and allocates additional rollout budget on demand, minimizing the baseline estimator's mean squared error (MSE) and yielding stable, efficient policy gradients even under extreme sparsity with a group size of 4.
Link: https://arxiv.org/abs/2603.10848
Authors: Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as V_0), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose V_0.5, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation. This balances the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test to evaluate the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's Mean Squared Error (MSE), guaranteeing stable policy gradients, even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that V_0.5 significantly outperforms GRPO and DAPO, achieving faster convergence and over 10% performance improvement.
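The fuse-or-distrust logic the abstract describes (test the prior's reliability, combine with the rollout mean, request more budget when the prior looks biased) can be pictured with a simple z-test plus inverse-variance weighting. The exact test statistic, fusion rule, and budget policy are not given in the abstract, so everything below is an illustrative reading, not the paper's algorithm:

```python
import math

def fused_baseline(prior_mean, prior_var, rewards, z_thresh=1.96):
    """Fuse a value-model prior with the empirical mean of sparse rollouts.

    Returns (baseline, request_more_rollouts). If the prior disagrees with
    the rollout mean beyond z_thresh standard errors, fall back to the
    empirical mean and signal that more rollout budget is needed; otherwise
    combine the two estimates by inverse-variance weighting.
    """
    n = len(rewards)
    emp_mean = sum(rewards) / n
    emp_var = sum((r - emp_mean) ** 2 for r in rewards) / max(n - 1, 1)
    se = math.sqrt(emp_var / n + prior_var)          # std. error of the gap
    z = abs(emp_mean - prior_mean) / se if se > 0 else 0.0
    if z > z_thresh:                                 # prior looks biased
        return emp_mean, True
    w_prior = 1.0 / max(prior_var, 1e-8)             # precision weights
    w_emp = n / max(emp_var, 1e-8)
    baseline = (w_prior * prior_mean + w_emp * emp_mean) / (w_prior + w_emp)
    return baseline, False
```

With a group size of 4 the empirical variance term dominates the test, which is exactly the regime where a reliable prior pays off and an unreliable one must be caught.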
[NLP-8] Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis
[Quick Read]: This paper addresses the sharp performance drop large language models suffer in data-scarce programming domains, such as kernel synthesis for emerging NPU architectures, due to the lack of training data; models that excel on data-rich platforms like CUDA are hard to deploy effectively on NPUs. The key to the solution is EvoKernel, a self-evolving agentic framework that formulates kernel synthesis as a memory-based reinforcement learning task and introduces a value-driven retrieval mechanism that learns stage-specific Q-values to prioritize the experiences most useful for the current objective (bootstrapping a feasible draft, or iteratively refining latency). Through cross-task memory sharing, general knowledge distilled from simple operators transfers to complex ones, continually improving kernel correctness and execution efficiency without expensive fine-tuning. Experiments show EvoKernel raises frontier models' correctness from 11.0% to 83.0% and achieves a median 3.60x speedup through iterative refinement.
Link: https://arxiv.org/abs/2603.10846
Authors: Yujie Zheng, Zhuo Li, Shengtao Zhang, Hanjing Wang, Junjie Sheng, Jiaqian Wang, Junchi Yan, Weinan Zhang, Ying Wen, Bo Tang, Muning Wen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a “Data Wall” limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. By building an NPU variant of KernelBench and evaluating on it, EvoKernel improves frontier models’ correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at this https URL.
[NLP-9] PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words
【速读】: 该论文旨在解决现有硬标签文本攻击方法依赖低效“外向型”策略、搜索空间庞大导致查询成本高的问题。其解决方案的关键在于提出一种“内向型”框架PivotAttack,通过引入多臂赌博机(Multi-Armed Bandit)算法识别出作为预测锚点的枢轴集(Pivot Sets)——即组合式的词元群体,并针对性地扰动这些枢轴集以诱导标签翻转,从而有效捕捉词间依赖关系并显著降低查询开销。
Link: https://arxiv.org/abs/2603.10842
Authors: Yuzhi Liang, Shiliang Xiao, Jingsong Wei, Qiliang Lin, Xia Li
Affiliations: Guangdong University of Foreign Studies, China
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Existing hard-label text attacks often rely on inefficient “outside-in” strategies that traverse vast search spaces. We propose PivotAttack, a query-efficient “inside-out” framework. It employs a Multi-Armed Bandit algorithm to identify Pivot Sets-combinatorial token groups acting as prediction anchors-and strategically perturbs them to induce label flips. This approach captures inter-word dependencies and minimizes query costs. Extensive experiments across traditional models and Large Language Models demonstrate that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.
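The Multi-Armed Bandit component above trades off exploring candidate token groups against exploiting those that have flipped labels before. The paper does not say which bandit variant it uses, so the classic UCB1 algorithm below is only one plausible instantiation, with arms standing in for candidate Pivot Sets:

```python
import math

class UCB1:
    """UCB1 bandit over candidate pivot sets (illustrative: the paper's
    exact bandit variant and reward design are not specified here)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.t = 0

    def select(self):
        self.t += 1
        for a in range(len(self.counts)):     # play every arm once first
            if self.counts[a] == 0:
                return a
        # exploit mean reward plus an exploration bonus that shrinks
        # as an arm accumulates pulls
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
```

In an attack loop, the reward for an arm could be 1 when perturbing its token group flips the hard label and 0 otherwise, so queries concentrate on the most promising pivot candidates.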
[NLP-10] Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments
[Quick Read]: This paper addresses the bottleneck of evaluating and training reasoning in multilingual settings: generating verifiable reasoning problems across languages at scale while keeping task semantics consistent. The key to the solution is the Multilingual Reasoning Gym, a procedurally generated multilingual reasoning environment built by translating templates for 94 tasks with native-speaker validation in 10 languages (14 languages in total) and applying targeted code or template adaptations for linguistic naturalness. It preserves the core benefits of the original Reasoning Gym, such as virtually unlimited problem-instance generation and adjustable difficulty, supports Reinforcement Learning from Verifiable Rewards, and, because problems are parallel across languages, enables crosslingually parallel data generation at massive scale, providing a standardized, extensible benchmark for research on multilingual reasoning models.
Link: https://arxiv.org/abs/2603.10793
Authors: Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali
Affiliations: Apple; Hasso Plattner Institute ELLIS Unit Potsdam
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptations to ensure linguistic naturalness. The Multilingual Reasoning Gym preserves the core benefits of the procedural generation approach used in the original Reasoning Gym, such as virtually unlimited problem instance generation and adjustable difficulty, and remains directly usable for Reinforcement Learning from Verifiable Rewards and evaluation settings. Problems in the Multilingual Reasoning Gym are parallel across languages, enabling crosslingually parallel data generation at massive scale due to the procedural nature of the environments. We release our implementation to support research into multilingual reasoning models.
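The core pattern, procedural generation of verifiable problems that stay parallel across languages, can be shown with a toy environment. The templates and API below are invented for illustration and are not taken from Reasoning Gym; the point is that a shared seed fixes the problem content while only the surface language changes:

```python
import random

# Hypothetical per-language templates; the task content is language-independent.
TEMPLATES = {"en": "What is {a} plus {b}?",
             "de": "Was ist {a} plus {b}?"}

def gen_problem(difficulty, lang="en", seed=None):
    """Procedurally generate one verifiable addition task.

    The same seed yields parallel problems across languages, and
    `difficulty` controls operand size (unbounded instance supply)."""
    rng = random.Random(seed)
    hi = 10 ** difficulty
    a, b = rng.randint(1, hi), rng.randint(1, hi)
    return TEMPLATES[lang].format(a=a, b=b), a + b

def verify(answer, gold):
    """Verifiable reward: exact match against the gold answer."""
    return answer == gold
```

Because the verifier is exact, such environments plug directly into Reinforcement Learning from Verifiable Rewards, and the seed-parallel construction is what makes massively parallel crosslingual data generation cheap.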
[NLP-11] LuxBorrow: From Pompier to Pompjee: Tracing Borrowing in Luxembourgish LREC2026
[Quick Read]: This paper quantifies and analyzes multilingual borrowing in Luxembourgish (LU) news text, focusing on the distribution, frequency, types, and evolution of borrowings under language contact. The central challenge is precisely identifying matrix- versus donor-language material in large multilingual text and systematically assessing borrowing intensity and adaptation patterns. The key to the solution is a borrowing-first pipeline: sentence-level language identification (LU/DE/FR/EN) selects Luxembourgish sentences, and a token-level resolver built on lemmatization, a collected loanword registry, and compiled morphological and orthographic rules classifies borrowed items. The analysis shows that while LU remains the matrix language throughout, multilingual practice is pervasive (77.1% of articles include at least one donor language); borrowing intensity reflects localized insertions rather than balanced bilingual mixing (median code-mixing index, CMI, rising only from 3.90 to 7.00); and morphological adaptation (63.8%) far outweighs orthographic (35.9%) and lexical (0.3%) adaptation. The paper further advocates borrowing-centric evaluation metrics, such as borrowed token and type rates, donor entropy, and assimilation ratios, over document-level mixing indices, yielding a finer-grained picture of borrowing dynamics.
链接: https://arxiv.org/abs/2603.10789
作者: Nina Hosseini-Kivanani,Fred Philippy
机构: 未知
类目: Computation and Language (cs.CL)
备注: Paper got accepted to LREC2026
Abstract:We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on→oun and eur→er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.
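摘要中反复使用的代码混合指数(CMI)可按 Gambäck & Das (2014) 的常见定义计算;以下为一个极简示意(语言标签与示例句均为假设,并非论文的原始实现):

```python
def code_mixing_index(tags):
    """按 Gambäck & Das (2014) 的常见定义计算句子级 CMI。
    tags: 逐 token 的语言标签,'other' 表示语言无关 token(标点、数字等)。
    返回值范围 [0, 100),纯单语句子为 0。实现细节为示例假设。"""
    n = len(tags)
    lang_tags = [t for t in tags if t != "other"]
    u = n - len(lang_tags)          # 语言无关 token 数
    if n == u:                      # 句子全部由语言无关 token 构成
        return 0.0
    counts = {}
    for t in lang_tags:
        counts[t] = counts.get(t, 0) + 1
    max_wi = max(counts.values())   # 主导语言(matrix language)的 token 数
    return 100.0 * (n - u - max_wi) / (n - u)

pure = ["lb"] * 8                                            # 纯卢森堡语句子
mixed = ["lb", "lb", "fr", "lb", "lb", "lb", "fr", "other"]  # 含法语借入成分
```

纯单语句子得分为 0,借入成分越多得分越高,这正对应摘要中"CMI 仅从 3.90 升至 7.00,说明是局部插入而非均衡双语混用"的解读。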
[NLP-12] Large Language Models as Annotators for Machine Translation Quality Estimation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在机器翻译质量评估(Machine Translation Quality Estimation, MTQE)中因推理成本过高而难以直接应用的问题。其解决方案的关键在于利用LLMs生成符合MQM(Multidimensional Quality Metrics)风格的标注数据,用于训练轻量级的COMET模型,从而实现高性价比的段落级质量评估。具体而言,作者提出了一种简化的MQM标注方案,聚焦于顶层类别以引导LLM选择,并设计了一个基于提示模板的GPT-4o驱动方法(PPbMQM),使生成的标注与人工标注高度相关,且训练后的COMET模型在中英和英德翻译任务上均展现出竞争力。
链接: https://arxiv.org/abs/2603.10775
作者: Sidi Wang,Sophie Arnoult,Amir Kamran
机构: Maastricht University (马斯特里赫特大学); Vrije Universiteit Amsterdam (阿姆斯特丹自由大学); Taus
类目: Computation and Language (cs.CL)
备注: 11 pages, 3 figures
Abstract:Large Language Models (LLMs) have demonstrated excellent performance on Machine Translation Quality Estimation (MTQE), yet their high inference costs make them impractical for direct application. In this work, we propose applying LLMs to generate MQM-style annotations for training a COMET model: following Fernandes et al. (2023), we reckon that segment-level annotations provide a strong rationale for LLMs and are key to good segment-level QE. We propose a simplified MQM scheme, mostly restricted to top-level categories, to guide LLM selection. We present a systematic approach for the development of a GPT-4o-based prompt, called PPbMQM (Prompt-Pattern-based-MQM). We show that the resulting annotations correlate well with human annotations and that training COMET on them leads to competitive performance on segment-level QE for Chinese-English and English-German.
[NLP-13] Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对非标准分词输入(如字符级分词)时仍表现出显著鲁棒性的机制不明确的问题。其解决方案的关键在于识别并验证一种称为“词恢复”(word recovery)的核心机制:通过解码方法检测到隐藏状态能够从字符级输入中重构出标准的词级标记身份;进一步通过因果干预移除对应子空间导致下游任务性能一致下降,证明该机制对鲁棒性至关重要;同时,细粒度注意力分析表明,同一词内字符间的组内注意力(in-group attention)在早期层中对词恢复和任务性能具有决定性作用。
链接: https://arxiv.org/abs/2603.10771
作者: Zhipeng Yang,Shu Yang,Lijie Hu,Di Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this phenomenon through mechanistic interpretability and identify a core process we term word recovery. We first introduce a decoding-based method to detect word recovery, showing that hidden states reconstruct canonical word-level token identities from character-level inputs. We then provide causal evidence by removing the corresponding subspace from hidden states, which consistently degrades downstream task performance. Finally, we conduct a fine-grained attention analysis and show that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking such attention in early layers substantially reduces both recovery scores and task performance. Together, our findings provide a mechanistic explanation for tokenization robustness and identify word recovery as a key mechanism enabling LLMs to process character-level inputs.
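论文中"移除对应子空间"的因果干预,本质上是把隐藏状态投影到该子空间的正交补上。下面是一个基于 numpy 的极简示意(子空间基随机构造,仅演示投影操作本身,并非论文中实际的子空间估计方法):

```python
import numpy as np

def remove_subspace(h: np.ndarray, U: np.ndarray) -> np.ndarray:
    """h' = h - U U^T h:将隐藏状态投影到子空间 span(U) 的正交补上。
    U 的列须为标准正交基。"""
    return h - U @ (U.T @ h)

rng = np.random.default_rng(0)
d, k = 16, 3                       # 隐藏维度与子空间维度,均为示例假设
# 用 QR 分解随机构造一组 k 维标准正交基(论文中该子空间由解码方法估计)
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
h = rng.standard_normal(d)
h_ablated = remove_subspace(h, U)  # 干预后的隐藏状态
```

干预后的向量与子空间正交,且投影具有幂等性:再次移除同一子空间不会改变结果,这正是此类"子空间消融"实验可重复比较的基础。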
[NLP-14] mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR
【速读】: 该论文旨在解决当前基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法在多语言场景下的应用瓶颈,即现有训练数据和基准测试集主要面向英语,且难度不足,无法有效激发当前大语言模型的潜力。解决方案的关键在于构建一个高质量、多语言、高难度的数学问题数据集——mAceReason-Math,其源自专为RLVR设计的英文基准AceReason-Math,并通过精细化清洗与优化翻译流程,实现了14种语言、每语言超10,000个样本的覆盖,从而为多语言RLVR研究提供了可靠的数据基础与评测平台。
链接: https://arxiv.org/abs/2603.10767
作者: Konstantin Dobler,Simon Lehnerer,Federico Scozzafava,Jonathan Janke,Mohamed Ali
机构: Apple; Hasso Plattner Institute (哈索普拉特纳研究所) ELLIS Unit Potsdam (波茨坦 ELLIS 单元)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.
[NLP-15] HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology
【速读】:该论文旨在解决当前基于人工智能的心脏病诊断方法中存在的三大核心问题:心脏病学(cardiology)知识不足、复杂推理能力欠缺以及可解释性差。其解决方案的关键在于构建HeartAgent——一个专为心血管领域设计的智能代理系统,该系统通过集成定制化工具与结构化数据资源,并协调多个专业化子代理(sub-agent)进行协同推理,从而实现透明的推理路径生成和可验证的参考依据支持,显著提升了诊断准确性与临床可用性。
链接: https://arxiv.org/abs/2603.10764
作者: Shuang Zhou,Kai Yu,Song Wang,Wenya Xie,Zaifu Zhan,Meng-Han Tsai,Yuen-Hei Chung,Shutong Hou,Huixue Zhou,Min Zeng,Bhavadharini Ramu,Lin Yee Chen,Feng Xie,Rui Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 26 pages, 7 figures
Abstract:Heart diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and trustworthy differential diagnosis. However, existing artificial intelligence-based diagnostic methods are often limited by insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability. Here we present HeartAgent, a cardiology-specific agent system designed to support a reliable and explainable differential diagnosis. HeartAgent integrates customized tools and curated data resources and orchestrates multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent achieved over 36% and 20% improvements over established comparative methods, in top-3 diagnostic accuracy, respectively. Additionally, clinicians assisted by HeartAgent demonstrated gains of 26.9% in diagnostic accuracy and 22.7% in explanatory quality compared with unaided experts. These results demonstrate that HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care.
[NLP-16] Prism-Δ: Differential Subspace Steering for Prompt Highlighting in Large Language Models
【速读】: 该论文旨在解决生成式 AI(Generative AI)中提示词强调(prompt highlighting)技术的核心挑战:如何提取能够区分相关与无关上下文的引导方向,而非捕捉两者共有的结构模式。解决方案的关键在于提出 PRISM-Δ(Projection-based Relevance-Informed Steering Method),通过分解正负样本交叉协方差矩阵的差异,最大化判别能量并消除共享方向;同时为每个注意力头分配连续的 softplus 重要性权重,使弱但有用的头以降低强度的方式参与引导,且该框架可自然扩展至 Value 表示,捕获 Key-only 方法未利用的内容通道信号,从而在多个基准测试和模型上显著提升性能并降低流畅性损失。
链接: https://arxiv.org/abs/2603.10705
作者: Yuyao Ge,Shenghua Liu,Yiwei Wang,Tianyu Liu,Baolong Bi,Lingrui Mei,Jiayu Yao,Jiafeng Guo,Xueqi Cheng
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of California, Merced (加州大学默塞德分校); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注: 21 pages, 14 figures
Abstract:Prompt highlighting steers a large language model to prioritize user-specified text spans during generation. A key challenge is extracting steering directions that capture the difference between relevant and irrelevant contexts, rather than shared structural patterns common to both. We propose PRISM-Δ (Projection-based Relevance-Informed Steering Method), which decomposes the difference between positive and negative cross-covariance matrices to maximize discriminative energy while eliminating shared directions. Each attention head receives a continuous softplus importance weight, letting weak-but-useful heads contribute at reduced strength. The framework extends naturally to Value representations, capturing content-channel signal that Key-only methods leave unused. Across four benchmarks and five models, PRISM-Δ matches or exceeds the best existing method on 19 of 20 configurations, with relative gains up to +10.6%, while halving the fluency cost of steering. PRISM-Δ also scales to long-context retrieval, outperforming the best existing method by up to +4.8% relative gain. PRISM-Δ is compatible with FlashAttention and adds negligible memory overhead.
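摘要中"分解正负交叉协方差之差以最大化判别能量"的一种常见做法,是对对称化后的差矩阵做特征分解并取主特征方向;下面的示意按此思路给出(数据、矩阵构造与 softplus 权重定义均为示例假设,并非 PRISM-Δ 的官方实现):

```python
import numpy as np

def steering_direction(C_pos: np.ndarray, C_neg: np.ndarray) -> np.ndarray:
    """对正负协方差之差做特征分解,取最大特征值对应的方向:
    该方向最大化两类差异的能量,同时抵消二者共享的结构成分。"""
    delta = C_pos - C_neg
    sym = (delta + delta.T) / 2.0        # 对称化,保证实特征分解
    _, eigvecs = np.linalg.eigh(sym)     # eigh 按特征值升序返回
    return eigvecs[:, -1]

def softplus(x: np.ndarray) -> np.ndarray:
    """连续、恒正的注意力头重要性权重:弱头以低强度参与而非被硬性截断。"""
    return np.log1p(np.exp(x))

rng = np.random.default_rng(1)
d = 8
shift = np.zeros(d)
shift[0] = 2.0                           # 相关上下文在第 0 维有系统性偏移
X_pos = rng.standard_normal((100, d)) + shift   # "相关"样本
X_neg = rng.standard_normal((100, d))           # "无关"样本
v = steering_direction(X_pos.T @ X_pos / 100, X_neg.T @ X_neg / 100)
head_weights = softplus(rng.standard_normal(4))  # 每个头一个连续权重
```

返回的方向为单位向量,可直接用于对隐藏状态做加性引导;softplus 权重保证所有头都以非零强度参与,对应摘要中"弱但有用的头以降低强度贡献"的设计。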
[NLP-17] EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution VLDB2025
【速读】: 该论文旨在解决神经文本到SQL(text-to-SQL)模型在数据库模式(schema)演化场景下性能退化的问题。随着数据库模式频繁更新以适应新需求,静态训练的模型难以保持鲁棒性,而现有方法要么仅关注语法或语义映射的简单改写,要么缺乏系统性和可控性来评估模型对模式变化的适应能力。解决方案的关键在于提出EvoSchema——一个全面的基准测试框架,其核心创新包括:构建涵盖列级和表级共十类扰动类型的模式演化分类体系,系统模拟真实世界中数据库结构的动态变化;通过该基准对多种开源与闭源大语言模型(LLM)进行深入评估,发现表级扰动对模型性能影响显著高于列级扰动;并进一步验证了基于EvoSchema多样化模式设计训练的模型能有效避免学习虚假关联,从而在平均表现上展现出更强的鲁棒性。
链接: https://arxiv.org/abs/2603.10697
作者: Tianshu Zhang,Kun Qian,Siddhartha Sahai,Yuan Tian,Shaddy Garg,Huan Sun,Yunyao Li
机构: The Ohio State University (俄亥俄州立大学); Adobe Inc. (Adobe公司); Purdue University (普渡大学)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by VLDB 2025
Abstract:Neural text-to-SQL models, which translate natural language questions (NLQs) into SQL queries given a database schema, have achieved remarkable performance. However, database schemas frequently evolve to meet new requirements. Such schema evolution often leads to performance degradation for models trained on static schemas. Existing work either mainly focuses on simply paraphrasing some syntactic or semantic mappings among NLQ, DB and SQL, or lacks a comprehensive and controllable way to investigate the model robustness issue under the schema evolution, which is insufficient when facing the increasingly complex and rich database schema changes in reality, especially in the LLM era. To address the challenges posed by schema evolution, we present EvoSchema, a comprehensive benchmark designed to assess and enhance the robustness of text-to-SQL systems under real-world schema changes. EvoSchema introduces a novel schema evolution taxonomy, encompassing ten perturbation types across column-level and table-level modifications, systematically simulating the dynamic nature of database schemas. Through EvoSchema, we conduct an in-depth evaluation spanning different open source and closed-source LLMs, revealing that table-level perturbations have a significantly greater impact on model performance compared to column-level changes. Furthermore, EvoSchema inspires the development of more resilient text-to-SQL systems, in terms of both model training and database design. The models trained on EvoSchema’s diverse schema designs can force the model to distinguish the schema difference for the same questions to avoid learning spurious patterns, which demonstrate remarkable robustness compared to those trained on unperturbed data on average. This benchmark offers valuable insights into model behavior and a path forward for designing systems capable of thriving in dynamic, real-world environments.
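以摘要中的列级扰动为例,最基础的一类是列重命名,施加扰动时需同步保持配套 SQL 的一致性。下面是一个极简示意(schema 结构、改名映射与字符串替换方式均为示例假设;真实基准中的替换应基于 SQL 解析而非朴素的文本替换):

```python
import copy

def rename_column(schema, table, old, new, sql):
    """列级扰动:重命名一列,并同步更新 SQL 中的列引用。
    注意:此处用朴素的字符串替换,仅作示意;若列名是其他标识符的
    子串会误替换,真实实现应基于 SQL 解析器定位列引用。"""
    perturbed = copy.deepcopy(schema)    # 不修改原 schema,便于对照实验
    cols = perturbed[table]
    cols[cols.index(old)] = new
    return perturbed, sql.replace(old, new)

schema = {"singer": ["singer_id", "name", "country"]}
sql = "SELECT name FROM singer WHERE country = 'France'"
new_schema, new_sql = rename_column(schema, "singer", "name", "stage_name", sql)
```

同一自然语言问题配上扰动前后的两份 schema 与 SQL,即可构成迫使模型"区分模式差异、避免记忆虚假关联"的训练对。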
[NLP-18] Emulating Clinician Cognition via Self-Evolving Deep Clinical Research
【速读】: 该论文旨在解决当前人工智能(AI)系统在临床诊断中与真实诊疗认知过程脱节的问题,即现有模型多将诊断视为单次回顾性预测,缺乏可审计的机制以支持持续改进。其解决方案的关键在于提出DxEvolve——一个自进化诊断代理,通过交互式深度临床研究工作流,自主请求检查并持续将临床经验外化为诊断认知原语(diagnostic cognition primitives),从而实现诊断能力的动态演进。该框架在MIMIC-CDM基准上相较于基线模型平均提升诊断准确率11.2%,并在独立外部队列中对覆盖类别和未覆盖类别分别提升10.2%和17.1%,展现出可治理的学习资产转化能力,为临床AI的持续演化提供了可问责路径。
链接: https://arxiv.org/abs/2603.10677
作者: Ruiyang Ren,Yuhao Wang,Yunsen Liang,Lan Luo,Jing Liu,Haifeng Wang,Cong Feng,Yinan Zhang,Chunyan Miao,Ji-Rong Wen,Wayne Xin Zhao
机构: Renmin University of China (中国人民大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Clinical diagnosis is a complex cognitive process, grounded in dynamic cue acquisition and continuous expertise accumulation. Yet most current artificial intelligence (AI) systems are misaligned with this reality, treating diagnosis as single-pass retrospective prediction while lacking auditable mechanisms for governed improvement. We developed DxEvolve, a self-evolving diagnostic agent that bridges these gaps through an interactive deep clinical research workflow. The framework autonomously requisitions examinations and continually externalizes clinical experience from increasing encounter exposure as diagnostic cognition primitives. On the MIMIC-CDM benchmark, DxEvolve improved diagnostic accuracy by 11.2% on average over backbone models and reached 90.4% on a reader-study subset, comparable to the clinician reference (88.8%). DxEvolve improved accuracy on an independent external cohort by 10.2% (categories covered by the source cohort) and 17.1% (uncovered categories) compared to the competitive method. By transforming experience into a governable learning asset, DxEvolve supports an accountable pathway for the continual evolution of clinical AI.
[NLP-19] Making Bielik LLM Reason (Better): A Field Report
【速读】: 该论文旨在解决波兰大型语言模型(Large Language Model, LLM)Bielik在推理能力方面的评估与提升问题。其解决方案的关键在于构建系统化的评估方法论,并通过基准测试(benchmarking)对比分析Bielik与其他主流LLM的性能表现,从而识别其优势与局限,为后续优化方向提供依据,确保Bielik能够在快速演进且高度竞争的人工智能领域中保持竞争力。
链接: https://arxiv.org/abs/2603.10640
作者: Adam Trybus,Bartosz Bartnicki,Remigiusz Kinas
机构: Bielik.ai (Speakleash Foundation); Jagiellonian University (雅盖隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents a research program dedicated to evaluating and advancing the reasoning capabilities of Bielik, a Polish large language model. The study describes a number of stages of work: initial benchmarking and creation of an evaluation methodology, analysis of comparative results with other LLMs, and an outline of future prospects that takes into account the limitations of the analyses conducted so far and aims to keep Bielik in the race given the ever-changing – and competitive – AI landscape.
[NLP-20] Reinforcement Learning with Conditional Expectation Reward
【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在通用推理任务中应用受限的问题,尤其针对自由格式答案场景下难以构建完备且准确的规则型验证器(rule-based verifier)这一瓶颈。其解决方案的关键在于提出条件期望奖励(Conditional Expectation Reward, CER),该机制利用大语言模型自身作为隐式验证器,通过计算生成答案条件下参考答案的期望似然来量化正确性,从而提供一种软性、分级的奖励信号,克服了传统二值反馈的局限性,并显著提升了模型在数学与一般推理任务中的泛化能力。
链接: https://arxiv.org/abs/2603.10624
作者: Changyi Xiao,Caijun Xu,Yixin Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at this https URL.
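CER 被定义为以生成答案为条件、生成参考答案的期望似然,可用下面的骨架说明。其中 token 级条件概率在实际中由大语言模型本身给出,此处用一个玩具函数代替(函数形式与数值均为示例假设):

```python
import math

# 玩具替身:实际中 token 级条件对数概率由大模型前向计算给出。
def toy_token_logprob(token: str, generated_answer: str) -> float:
    """示例假设:参考答案 token 若出现在生成答案中,条件概率更高。"""
    return math.log(0.9) if token in generated_answer.split() else math.log(0.1)

def cer_reward(reference: str, generated: str) -> float:
    """CER 骨架:以生成答案为条件,参考答案各 token 对数似然的均值
    再取指数(几何平均),得到 (0, 1] 区间内的软奖励,而非 0/1 二值反馈。"""
    tokens = reference.split()
    avg_logp = sum(toy_token_logprob(t, generated) for t in tokens) / len(tokens)
    return math.exp(avg_logp)

r_good = cer_reward("the capital of France is Paris",
                    "Paris is the capital of France")
r_bad = cer_reward("the capital of France is Paris", "I do not know")
```

与规则校验器的二值反馈不同,这种软奖励对"部分正确"的自由格式答案也能给出分级信号,正是 CER 适用于通用推理域的原因。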
[NLP-21] Disentangling Similarity and Relatedness in Topic Models
【速读】: 该论文旨在解决传统主题模型(如Latent Dirichlet Allocation, LDA)在捕捉语义结构时受限于词共现统计、难以区分语义相似性(similarity)与主题相关性(relatedness)的问题。其解决方案的关键在于构建一个基于大语言模型(LLM)标注的大型合成词对基准数据集,训练出一个神经评分函数来量化词对的相似性和相关性,并利用该评分函数对多种语料库和主题模型家族进行系统评估,从而揭示不同模型在语义结构表征上的差异,并证明这两种维度能有效预测下游任务表现。
链接: https://arxiv.org/abs/2603.10619
作者: Hanlin Xiao,Mauricio A. Álvarez,Rainer Breitling
机构: Manchester Institute of Biotechnology (曼彻斯特生物技术研究所); Department of Computer Science (计算机科学系); Department of Chemistry (化学系)
类目: Computation and Language (cs.CL)
备注: 22 pages, 6 figures, 14 tables
Abstract:The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.
[NLP-22] MUNIChus: Multilingual News Image Captioning Benchmark LREC2026
【速读】: 该论文旨在解决多语言新闻图像字幕生成(news image captioning)任务中因数据稀缺导致的模型泛化能力不足问题,尤其针对低资源语言如僧伽罗语(Sinhala)和乌尔都语(Urdu)缺乏高质量标注数据的现状。其关键解决方案是构建首个多语言新闻图像字幕基准数据集MUNIChus,涵盖9种语言(包括多种低资源语言),并在此基础上对多种前沿神经网络模型进行系统评估,揭示当前方法在跨语言场景下的局限性,同时开放数据集与已测试的20余种模型,为后续多语言新闻图像字幕研究提供标准化评测平台与可复现基线。
链接: https://arxiv.org/abs/2603.10613
作者: Yuji Chen,Alistair Plum,Hansi Hettiarachchi,Diptesh Kanojia,Saroj Basnet,Marcos Zampieri,Tharindu Ranasinghe
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to LREC 2026 (The Fifteenth biennial Language Resources and Evaluation Conference)
Abstract:The goal of news image captioning is to generate captions by integrating news article content with corresponding images, highlighting the relationship between textual context and visual elements. The majority of research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising 9 languages, including several low-resource languages such as Sinhala and Urdu. We evaluate various state-of-the-art neural news image captioning models on MUNIChus and find that news image captioning remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for further advancements in developing and evaluating multilingual news image captioning models.
[NLP-23] Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)对齐任务是否需要区别于传统强化学习中以奖励最大化为核心的策略优化方法的问题,特别是是否存在对响应多样性有内在需求的分布匹配(distribution-matching)机制。研究发现,尽管道德推理任务看似允许多种有效回答,但其高奖励响应在语义空间中呈现高度集中分布,表明模式寻找(mode-seeking)的奖励最大化方法同样甚至更有效。解决方案的关键在于构建一个基于评分标准(rubric-grounded)的奖励管道,通过训练Qwen3-1.7B判别模型实现稳定可靠的强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练,并通过语义可视化揭示道德推理任务中高奖励响应的聚集特性,从而证明无需显式引入多样性保持机制即可实现有效的对齐优化。
链接: https://arxiv.org/abs/2603.10588
作者: Zhaowei Zhang,Xiaohan Liu,Xuekai Zhu,Junchao Huang,Ceyao Zhang,Zhiyuan Feng,Yaodong Yang,Xiaoyuan Yi,Xing Xie
机构: Peking University (北京大学); Microsoft Research (微软研究院); University of Michigan (密歇根大学); Shanghai Jiao Tong University (上海交通大学); CUHKSZ (香港中文大学深圳分校); THU (清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not demonstrate significant advantages over reward-maximizing methods as expected on alignment tasks. Through semantic visualization mapping high-reward responses to semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective for alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.
[NLP-24] End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering
【速读】: 该论文旨在解决领域特定聊天机器人(chatbot)在生成回答时容易产生无依据或错误内容的问题,而传统评估方法依赖人工审核成本高,且现有框架多基于静态测试集和指标,难以扩展。解决方案的关键在于提出一个端到端的自动评估系统:首先从知识库中直接生成问题-答案(Q&A)对,然后利用大语言模型(LLM)对聊天机器人的回答与参考答案进行比对判断,并通过置信度过滤机制识别不确定性高的案例,从而显著降低人工干预需求,同时保持与人工判断高度一致的评估效果。该框架模块化且语言无关,具备良好的可扩展性和跨领域适应性。
链接: https://arxiv.org/abs/2603.10570
作者: Nhi Dang,Tung Le,Huy Tien Nguyen
机构: Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) combined with retrieval augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.
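摘要中"基于置信度过滤以突出不确定案例"的分流逻辑可以用几行代码示意(字段名与阈值均为示例假设,实际中判定与置信度由 LLM 评审产生):

```python
def triage(judgments, threshold=0.8):
    """高置信判定自动采纳,低置信案例送人工复核。
    judgments: LLM 评审结果列表;threshold 为示例假设的置信度阈值。"""
    auto, manual = [], []
    for j in judgments:
        (auto if j["confidence"] >= threshold else manual).append(j)
    return auto, manual

judgments = [
    {"id": 1, "verdict": "correct", "confidence": 0.95},
    {"id": 2, "verdict": "incorrect", "confidence": 0.55},  # 不确定 → 人工复核
    {"id": 3, "verdict": "correct", "confidence": 0.88},
]
auto_accepted, needs_review = triage(judgments)
```

人工只需复核低置信子集,这正是该框架"显著降低复核开销"的来源;阈值可按人工预算与目标一致性在验证集上标定。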
[NLP-25] Automatic End-to-End Data Integration using Large Language Models ICDE2026
【速读】: 该论文旨在解决数据集成流水线(data integration pipeline)设计中高度依赖人工干预的问题,即数据工程师需手动配置组件并标注训练数据,导致成本高昂且效率低下。其解决方案的关键在于利用生成式 AI(Generative AI)——具体为 GPT-5.2——自动完成端到端数据集成流程中的全部关键任务,包括模式映射(schema mappings)、值映射(value mappings)用于数据归一化、实体匹配的训练数据以及冲突消解启发式规则的选择验证数据。实验表明,该 LLM 驱动的流水线在多个真实场景下可达到与人工设计相当甚至更优的集成质量,同时显著降低部署成本(约每案例 10 美元),体现出生成式 AI 在自动化数据工程中的巨大潜力。
链接: https://arxiv.org/abs/2603.10547
作者: Aaron Steiner,Christian Bizer
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 9 tables. Accepted at the Beyond SQL Workshop at ICDE 2026
Abstract:Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines along three case studies requiring the integration of video game, music, and company related data. Our experiments show that the LLM-based pipeline is able to produce similar results, for some tasks even better results, as the human-designed pipelines. End-to-end, the human and the LLM pipelines produce integrated datasets of comparable size and density. Having the LLM configure the pipelines costs approximately $10 per case study, which represents only a small fraction of the cost of having human data engineers perform the same tasks.
[NLP-26] Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)能力过程中出现的“长度膨胀”(length inflation)问题,即模型为最大化奖励而倾向于冗长或低效推理。传统方法如加性惩罚项易引入补偿效应导致优化捷径,而启发式门控策略则缺乏泛化能力。其解决方案的核心是提出Group Relative Reward Rescaling (GR³),将长度控制重构为乘法重缩放范式,从而建立一种通用、连续且依赖奖励的门控机制;同时结合组相对正则化与优势感知校准,动态调整长度预算以适应实例难度并保留高质量轨迹的优势信号,实现无损优化。
链接: https://arxiv.org/abs/2603.10535
作者: Zichao Li,Jie Lou,Fangchen Dong,Zhiyuan Fan,Mengjie Ren,Hongyu Lin,Xianpei Han,Debing Zhang,Le Sun,Yaojie Lu,Xing Yu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR³), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR³ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
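摘要强调的"乘法重缩放"范式与加性惩罚的关键区别在于:奖励为零时长度项自动失效,不会形成补偿捷径。下面用一个组内相对的长度门控做极简示意(门控函数形式与超参数均为示例假设,并非 GR³ 的原始定义):

```python
def gr3_style_rescale(rewards, lengths, alpha=0.5):
    """乘法长度重缩放示意:r' = r * gate(相对长度)。
    gate 仅惩罚超出组均值的部分;r = 0 时 r' 恒为 0,
    惩罚项不会像加性惩罚那样独立驱动优化方向。"""
    mean_len = sum(lengths) / len(lengths)        # 组内相对长度基准
    rescaled = []
    for r, length in zip(rewards, lengths):
        excess = max(0.0, length / mean_len - 1.0)  # 超出组均值的相对幅度
        gate = 1.0 / (1.0 + alpha * excess)         # 连续、乘法式门控
        rescaled.append(r * gate)
    return rescaled

rewards = [1.0, 1.0, 0.0]
lengths = [100, 400, 500]
out = gr3_style_rescale(rewards, lengths)
```

短于组均值的高奖励轨迹不受影响,过长的高奖励轨迹被连续地打折,而零奖励轨迹无论长短都保持为零:这正体现了"依赖奖励的门控"与加性惩罚的差异。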
[NLP-27] AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations
【速读】:该论文针对多轮检索增强生成(multi-turn retrieval-augmented generation, RAG)中的三个子任务——段落检索(A)、参考文献驱动的回答生成(B)以及端到端RAG(C)——提出了统一的解决方案。其核心创新在于两个关键策略:一是“查询多样性优于检索器多样性”的设计思想,通过五个互补的大语言模型(LLM)生成的查询改写版本作用于单一语料库对齐的稀疏检索器,并采用方差感知嵌套倒数排名融合(variance-aware nested Reciprocal Rank Fusion)进行整合;二是分阶段生成流水线,将基于证据的生成过程解耦为证据片段提取、双候选草稿撰写与校准式多裁判选择三个步骤。实验证明,该方法在任务A中取得第一名(nDCG@5: 0.5776,较最强基线提升20.5%),在任务B中位列第二(HM: 0.7698),且分析表明查询多样性优于异构检索器集成,而答案可回答性校准才是端到端性能的主要瓶颈。
链接: https://arxiv.org/abs/2603.10524
作者: Dimosthenis Athanasiou,Maria Lymperaiou,Giorgos Filandrianos,Athanasios Voulodimos,Giorgos Stamou
机构: National Technical University of Athens (雅典国立技术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our unified architecture is built on two principles: (i) a query-diversity-over-retriever-diversity strategy, where five complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and fused via variance-aware nested Reciprocal Rank Fusion; and (ii) a multistage generation pipeline that decomposes grounded generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Our system ranks 1st in Task A (nDCG@5: 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM: 0.7698). Empirical analysis shows that query diversity over a well-aligned retriever outperforms heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, is the primary bottleneck in end-to-end performance.
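第一条原则中用到的倒数排名融合(Reciprocal Rank Fusion, RRF)是标准算法,可按如下方式融合多个查询改写版本的检索结果(示例文档 ID 为假设;论文中的"方差感知嵌套"加权此处未实现,k=60 为 RRF 的常用默认值):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """标准 RRF:每个文档在每个排名列表中贡献 1/(k + rank),
    在多个列表中排名越靠前、出现越稳定的文档,融合得分越高。
    rankings: 每个查询改写版本返回的文档 ID 有序列表。"""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# 三个查询改写版本对同一稀疏检索器的检索结果(示例)
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
fused = reciprocal_rank_fusion(rankings)
```

"d2" 在三个列表中两次排第一、一次排第二,融合后居首;仅出现一次且排名靠后的 "d4" 垫底。这说明 RRF 无需调校打分量纲即可综合多路排名信号。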
[NLP-28] IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对多源指令冲突时缺乏稳健的优先级决策机制问题,即如何明确系统、开发者、用户和工具指令之间的优先级关系(称为指令层次结构,Instruction Hierarchy, IH),以有效防御越狱攻击、系统提示提取和代理式提示注入等安全威胁。解决方案的关键在于提出并构建了IH-Challenge数据集,通过强化学习训练结合在线对抗样例生成策略,显著提升模型在多种分布内、分布外及人工红队测试场景下的IH鲁棒性,同时降低不安全行为比例并维持模型整体能力。
链接: https://arxiv.org/abs/2603.10521
作者: Chuan Guo,Juan Felipe Ceron Uribe,Sicheng Zhu,Christopher A. Choquette-Choo,Steph Lin,Nikhil Kandpal,Milad Nasr, Rai (Michael Pokorny),Sam Toyer,Miles Wang,Yaodong Yu,Alex Beutel,Kai Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (this https URL) to support future research on robust instruction hierarchy.
[NLP-29] Safe and Scalable Web Agent Learning via Recreated Websites
【速读】: 该论文旨在解决自主网络代理(autonomous web agents)在训练过程中面临的环境限制问题,即真实网站存在安全隐患、难以重置且缺乏可验证的反馈机制,从而制约了代理的学习效率与泛化能力。其解决方案的关键在于提出VeriEnv框架,该框架利用语言模型自动将真实网站克隆为可执行、可验证的合成环境,并通过Python SDK提供受控的内部访问接口,使代理能够自动生成任务并获得确定性、程序化验证的奖励信号,从而摆脱对启发式或大语言模型(LLM)评判器的依赖,实现安全、可扩展的自我进化训练。
链接: https://arxiv.org/abs/2603.10505
作者: Hyungjoo Chae,Jungsoo Park,Alan Ritter
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments. Code and resources will be released at this https URL upon acceptance.
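VeriEnv 的核心思想是"可程序化验证的奖励":任务成败由对合成环境内部状态的确定性检查给出,而非启发式或 LLM 评判。下面是一个高度简化的示意(FakeShopEnv、verify_task 等名称均为虚构,并非论文 SDK 的真实接口):

```python
class FakeShopEnv:
    """虚构的合成购物网站环境,通过 SDK 暴露内部状态供验证。"""
    def __init__(self):
        self.cart = []

    def add_to_cart(self, item):
        self.cart.append(item)

def verify_task(env, expected_items):
    """确定性奖励:购物车内容与任务规格完全一致时为 1.0,否则为 0.0。"""
    return 1.0 if sorted(env.cart) == sorted(expected_items) else 0.0
```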
[NLP-30] VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在临床场景中生成的Brief Hospital Course (BHC) 文本常出现不支持性陈述(Not Supported claims)的问题,尤其是在电子健康记录(EHR)证据碎片化背景下,现有基于大语言模型(LLM)的摘要器易引入未经验证的信息或因对齐策略导致信息遗漏(“say-less”退化)。其解决方案的关键在于提出VERI-DPO框架,通过一个检索增强型验证器(retrieval-augmented verifier)对候选摘要中的句子级断言(claim)与EHR证据进行匹配并标注为“支持”、“不支持”或“未涉及”,进而利用直接偏好优化(Direct Preference Optimization, DPO)将这些基于事实验证的偏好信号蒸馏进摘要模型;该方法通过覆盖感知效用(coverage-aware utility)挖掘长度可控且以矛盾锚定(contradiction-anchored)的偏好对,在保持摘要信息量的同时显著降低不支持性断言比例(从10.7%降至1.9%),提升摘要有效性(从76.7%提升至82.5%)。
链接: https://arxiv.org/abs/2603.10494
作者: Weixin Liu,Congning Ni,Qingyuan Song,Susannah L. Rose,Christopher Symons,Murat Kantarcioglu,Bradley A. Malin,Zhijun Yin
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Paper submitted to AMIA 2026 Annual Symposium
Abstract:Brief Hospital Course (BHC) narratives must be clinically useful yet faithful to fragmented EHR evidence. LLM-based clinical summarizers still introduce unsupported statements, and alignment can encourage omissions (“say-less” degeneration). We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO). On MIMIC-III-Ext-VeriFact-BHC (100 ICU patients; patient-level splits), we train a retrieval-augmented verifier to label claim-evidence pairs as Supported, Not Supported, or Not Addressed via a single-token format. The verifier scores sentence-level claims from sampled BHC candidates and aggregates margins into a coverage-aware utility to mine length-controlled, contradiction-anchored preference pairs. On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving validity from 76.7% to 82.5% and maintaining informative length.
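VERI-DPO 用标准的 DPO 目标蒸馏验证器挖掘出的偏好对;单个偏好对上的 DPO 损失可如下概括(beta=0.1 仅为 DPO 文献中的常见默认值,非论文设定):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """单个偏好对上的 DPO 损失:-log sigmoid(beta * margin)。

    logp_w / logp_l: 策略模型对优选(w)与劣选(l)摘要的对数概率;
    ref_logp_*: 冻结参考模型上的对应量。
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

当策略与参考模型一致时 margin 为 0,损失为 log 2;策略相对参考模型更偏向优选摘要时损失下降。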
[NLP-31] Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent
【速读】: 该论文旨在解决临床诊断中复杂病例的决策支持问题,特别是在面对罕见疾病时人类医生准确性下降的挑战。解决方案的关键在于构建一个名为PULSE的医学推理代理,其核心是将领域微调的大语言模型(Large Language Model, LLM)与科学文献检索相结合,从而在真实世界复杂病例中实现专家级诊断准确性。该系统不仅在Top@1和Top@4准确率上达到或超越不同层级医生的表现,且在疾病发生率差异较大的情况下保持性能稳定,同时展现出类似专家的适应性推理行为,为AI辅助诊断提供了可验证的框架和实践路径。
链接: https://arxiv.org/abs/2603.10492
作者: Zhongzhen Huang,Yan Ling,Hong Chen,Ye Feng,Li Wu,Linjie Mu,Shaoting Zhang,Xiaofan Zhang,Kun Qian,Xiaomu Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE’s performance against physicians with varying levels of expertise, from residents to senior specialists, and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.
[NLP-32] PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中提示工程(Prompt Engineering)评估标准单一的问题,即现有评估方法主要依赖答案正确性,难以解释提示成功或失败的原因,且缺乏可操作的优化指导。解决方案的关键在于提出PEEM(Prompt Engineering Evaluation Metrics),这是一个统一、可解释的评估框架,通过9个维度(3个提示维度:清晰度/结构、语言质量、公平性;6个响应维度:准确性、连贯性、相关性、客观性、清晰度、简洁性)对提示和响应进行联合评估,并利用大语言模型作为评价器输出1-5分制标量分数及基于rubric的自然语言理由。PEEM不仅在多个基准测试中与传统准确率高度一致(Spearman rho ≈ 0.97),还能捕捉互补的语言学失败模式并保持对提示扰动的鲁棒性,更重要的是,仅使用其评分和理由即可驱动零样本提示重写循环,使下游任务准确率提升最高达11.7点,显著优于监督学习和强化学习基线方法。
链接: https://arxiv.org/abs/2603.10477
作者: Minki Hong,Eunsoo Lee,Sohyun Park,Jihie Kim
机构: Dongguk University (东国大学)
类目: Computation and Language (cs.CL)
备注: 24pages, 2 figures
Abstract:Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM’s accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise rho = 0.68-0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate about 76.7-80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.
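摘要中反复引用的 Spearman 秩相关可按如下方式计算(无并列秩的简化版,rho = 1 - 6·Σd²/(n(n²-1)),d 为两组秩的差):

```python
def spearman_rho(x, y):
    """简化的 Spearman 秩相关(假设无并列值)。"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

完全同序时 rho 为 1,完全逆序时为 -1;实际计算带并列值的数据时应使用带 tie 校正的实现(如 scipy.stats.spearmanr)。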
[NLP-33] Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多利益相关者场景下对齐受限的问题,此类场景中存在价值冲突,需具备协商与 deliberative(审议)能力以达成共识。其解决方案的关键在于提出一种基于多智能体协商的对齐框架,通过两个具有对立人格(persona)的相同LLM进行结构化轮流对话,生成互利解决方案,并利用RLAIF(Reinforcement Learning from AI Feedback)结合GRPO(Generalized Reward Policy Optimization)优化策略,其中奖励来自最终输出的集体代理(Collective Agency, CA)评分,但梯度应用于对话token以直接提升审议交互动态。该方法在保持通用语言能力不下降的前提下,显著增强了冲突解决能力,同时实现了与单智能体基线相当的CA对齐水平。
链接: https://arxiv.org/abs/2603.10476
作者: Panatchakorn Anantaprayoon,Nataliia Babina,Nima Asgharbeygi,Jad Tarifi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment objectives. However, these approaches remain limited in multi-stakeholder settings, where conflicting values arise and deliberative negotiation capabilities are required. This work proposes a multi-agent negotiation-based alignment framework that aligns LLMs to Collective Agency (CA)-an existing alignment objective introduced to promote the continual expansion of agency-while simultaneously improving conflict-resolution capability. To enable scalable training, two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral-dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using GRPO with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the resulting model achieves CA alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities. These results suggest that negotiation-driven deliberation training provides a practical path toward LLMs that better support collective decision-making in value-conflict scenarios.
[NLP-34] Aligning Large Language Models with Searcher Preferences
【速读】: 该论文旨在解决开放性生成式搜索(open-ended generative search)在大规模内容平台上的应用难题,包括检索噪声下的鲁棒性不足、安全合规性要求严苛以及用户需求多样性难以对齐等问题。解决方案的关键在于提出SearchLLM——首个面向开放性生成式搜索的大语言模型,并设计了一种分层多维奖励机制:将基础约束(如事实准确性、答案质量与格式合规)与行为优化目标(如抗噪声检索能力及用户意图对齐)分离,通过结合规则校验与人工校准的LLM判官生成可解释的评分向量;进一步引入门控聚合策略(Gated Aggregation Strategy)整合多维奖励,用于Group Relative Policy Optimization(GRPO)训练,从而在保障安全性的同时显著提升生成质量和用户参与度。
链接: https://arxiv.org/abs/2603.10473
作者: Wei Wu,Peilun Zhou,Liyi Chen,Qimeng Wang,Chengqiang Lu,Yan Gao,Yi Wu,Yao Hu,Hui Xiong
机构: University of Science and Technology of China (中国科学技术大学); Xiaohongshu Inc. (小红书); The Hong Kong University of Science and Technology (广州); The Hong Kong University of Science and Technology (香港)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
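GRPO 的核心是组内相对优势:对同一 query 采样一组回复,用组内均值和标准差对各自的奖励做归一化(下述为标准 GRPO 优势计算的示意;论文中门控聚合得到的标量奖励即作为这里的 rewards 输入):

```python
def grpo_advantages(rewards):
    """GRPO 组内相对优势:对每个采样回复的奖励做组内标准化。"""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # 组内奖励全相同时没有相对学习信号
    return [(r - mean) / std for r in rewards]
```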
[NLP-35] Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking
【速读】: 该论文旨在解决多领域对话状态追踪(Multi-Domain Dialogue State Tracking, DST)中的两个关键问题:一是如何有效建模对话历史以捕捉跨轮次的语义信息,二是由于标注数据稀缺导致模型性能受限。解决方案的核心在于提出一种动态知识融合框架,其关键创新在于两阶段设计:第一阶段采用仅编码器结构并结合对比学习,对对话历史和候选槽位进行编码,并基于相关性得分选择相关信息槽位;第二阶段通过将选中槽位的结构化信息作为上下文提示(contextual prompts),动态融合领域知识以增强对话状态追踪的准确性和一致性。该方法显著提升了模型在多领域对话场景下的追踪精度与泛化能力。
链接: https://arxiv.org/abs/2603.10367
作者: Haoxiang Su,Ruiyu Fang,Liting Jiang,Xiaomeng Huang,Shuangyong Song
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The performance of task-oriented dialogue models is strongly tied to how well they track dialogue states, which records and updates user information across multi-turn interactions. However, current multi-domain DST encounters two key challenges: the difficulty of effectively modeling dialogue history and the limited availability of annotated data, both of which hinder model performance. To tackle the aforementioned problems, we develop a dynamic knowledge fusion framework applicable to multi-domain DST. The model operates in two stages: first, an encoder-only network trained with contrastive learning encodes dialogue history and candidate slots, selecting relevant slots based on correlation scores; second, dynamic knowledge fusion leverages the structured information of selected slots as contextual prompts to enhance the accuracy and consistency of dialogue state tracking. This design enables more accurate integration of dialogue context and domain knowledge. Results obtained from multi-domain dialogue benchmarks indicate that our method notably improves both tracking accuracy and generalization, validating its capability in handling complex dialogue scenarios.
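第一阶段"按相关性得分选择候选槽位"可以用编码向量间的余弦相似度来示意(向量取值、槽位名与 top_k 均为演示用假设,并非论文的实际编码器输出):

```python
def cosine(a, b):
    """两个向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def select_slots(history_vec, slot_vecs, top_k=2):
    """按与对话历史编码的余弦相似度,选出最相关的 top_k 个槽位。"""
    scored = sorted(slot_vecs.items(), key=lambda kv: -cosine(history_vec, kv[1]))
    return [name for name, _ in scored[:top_k]]
```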
[NLP-36] Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言评估中普遍存在的一种系统性翻译腔偏差(translationese bias)问题,即模型倾向于偏好机器翻译文本而非人类撰写的参考文本,尤其在低资源语言中表现更为显著。该偏差主要源于两个伪相关因素:(i) 与英语潜在流形对齐的隐式结构和 (ii) 跨语言可预测性。为缓解这一问题,论文提出DIBJudge框架,其核心创新在于通过变分信息压缩学习一个最小充分且与判断任务相关的表示,同时将伪相关因素显式隔离至专用的偏置分支;此外,引入交叉协方差惩罚项以显式抑制鲁棒表示与偏置表示之间的统计依赖,从而实现有效解耦。实验证明,DIBJudge在多语言奖励建模基准和专门设计的翻译腔偏差评测套件上均显著优于现有强基线方法。
链接: https://arxiv.org/abs/2603.10351
作者: Hongbin Zhang,Kehai Chen,Xuefen Bai,Youcheng Pan,Yang Xiang,Jinpeng Wang,Min Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe systematic translationese bias. In this paper, translationese bias is characterized as LLMs systematically favoring machine-translated text over human-authored references, particularly in low-resource languages. We attribute this bias to spurious correlations with (i) latent manifold alignment with English and (ii) cross-lingual predictability. To mitigate this bias, we propose DIBJudge, a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression, while explicitly isolating spurious factors into the dedicated bias branch. Furthermore, we incorporate a cross-covariance penalty that explicitly suppresses statistical dependence between robust and bias representations, thereby encouraging effective disentanglement. Extensive evaluations on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite demonstrate that the proposed DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias.
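交叉协方差惩罚的一个通用形式,是鲁棒表示与偏置表示之间交叉协方差矩阵的 Frobenius 范数平方;惩罚为 0 意味着两组表示在批内线性无关(具体形式以论文为准,此处仅为该类惩罚项的示意):

```python
def cross_cov_penalty(zr, zb):
    """批内鲁棒表示 zr 与偏置表示 zb 的交叉协方差 Frobenius 范数平方。

    zr, zb: 形如 [batch][dim] 的嵌套列表。
    """
    n = len(zr)
    mu_r = [sum(row[j] for row in zr) / n for j in range(len(zr[0]))]
    mu_b = [sum(row[j] for row in zb) / n for j in range(len(zb[0]))]
    penalty = 0.0
    for i in range(len(mu_r)):
        for j in range(len(mu_b)):
            c = sum((zr[k][i] - mu_r[i]) * (zb[k][j] - mu_b[j]) for k in range(n)) / n
            penalty += c * c
    return penalty
```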
[NLP-37] Large language models can disambiguate opioid slang on social media
【速读】: 该论文旨在解决社交媒体文本中与阿片类药物过量危机相关的内容识别难题,尤其是在低频词和俚语(如“smack”或“blues”)存在歧义的情况下,传统基于词典的筛选方法因无法准确区分其非阿片类含义而效果有限。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)强大的上下文理解与推理能力,实现对模糊术语的精准消歧,并在无词典依赖条件下自动识别相关帖子,同时具备识别新兴俚语的能力。实验表明,LLMs 在三种任务设置下均显著优于传统词典策略,尤其在召回率和F1分数上表现突出,证明其可有效提升低频话题内容的识别精度,从而增强下游分析与预测模型的数据质量。
链接: https://arxiv.org/abs/2603.10313
作者: Kristy A. Carpenter,Issah A. Samori,Mathew V. Kiang,Keith Humphreys,Anna Lembke,Johannes C. Eichstaedt,Russ B. Altman
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. When leveraging social media text to monitor trends in the ongoing opioid overdose crisis, a common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as “smack” or “blues,” have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores (“fenty” subtask: 0.824-0.972; “smack” subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall. On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.
[NLP-38] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas LREC2026
【速读】: 该论文旨在解决科研领域中研究思路新颖性(novelty)自动判断的难题,即如何在科学文献爆炸式增长背景下,实现对研究想法新颖性的高效、客观且可大规模比较的评估。传统依赖人工文献综述的方式存在劳动强度大、主观性强和扩展性差的问题。为应对这一挑战,作者提出RINoBench——首个用于大规模评估研究思路新颖性判断能力的基准数据集,其核心创新在于构建了包含1,381个由人类专家标注的新颖性标签及文本理由的研究想法集合,并设计了九种自动化指标来衡量模型生成的评分与推理过程是否贴近人类标准。关键发现表明,尽管大型语言模型(LLM)生成的推理逻辑与人类高度一致,但这种一致性并未转化为准确的新颖性判断结果,揭示了当前方法在语义理解与实际决策之间仍存在显著差距。
链接: https://arxiv.org/abs/2603.10303
作者: Tim Schopf,Michael Färber
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to LREC 2026
Abstract:Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor-intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non-standardized human evaluations, hindering large-scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric-based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM-generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments - even among leading reasoning-capable models. Data and code available at: this https URL.
[NLP-39] GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在微调过程中安全对齐(safety alignment)容易被破坏的问题,尤其是在使用看似非对抗性的微调方法时。现有策略通常依赖于混合原始对齐数据以联合优化安全与任务目标,但这类数据对于开放权重的LLM而言往往不可获取。为此,作者提出了一种统一框架——生成式回放用于安全对齐保护(Generative Replay for Safety Alignment Preservation, GR-SAP),其核心创新在于利用生成式回放(generative replay)机制,从LLM中合成特定领域的对齐数据,并在下游适应过程中将其集成进来,从而有效维持安全对齐。理论和实证分析表明,这些合成数据可作为原始对齐数据的可靠代理,实验验证了GR-SAP在多种模型和下游任务中显著缓解了微调导致的安全性能退化,同时保持了相当的下游任务表现。
链接: https://arxiv.org/abs/2603.10243
作者: Zhouxiang Fang,Jiawei Zhou,Hanjie Chen
机构: Rice University (莱斯大学); Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrate them during downstream adaption to preserve safety alignment. Theoretical and empirical analyses demonstrate this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at this https URL.
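生成式回放在训练时的接入方式很直接:按一定比例把合成的对齐样本混入下游任务数据一同微调(replay_ratio、batch_size 等参数均为演示假设,非论文设定):

```python
import random

def mixed_batches(task_data, replay_data, replay_ratio=0.2, batch_size=4, seed=0):
    """把合成对齐样本按比例混入下游微调数据,逐批产出。"""
    rng = random.Random(seed)
    batch = []
    for ex in task_data:
        if rng.random() < replay_ratio:
            batch.append(rng.choice(replay_data))  # 插入一条回放的对齐样本
        batch.append(ex)
        while len(batch) >= batch_size:
            yield batch[:batch_size]
            batch = batch[batch_size:]
    if batch:
        yield batch
```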
[NLP-40] S-GRADES – Studying Generalization of Student Response Assessments in Diverse Evaluative Settings LREC2026
【速读】: 该论文旨在解决教育自然语言处理(Educational NLP)中学生作答评估的碎片化问题,即自动作文评分(AES)与自动短答案评分(ASAG)长期独立发展、数据集分散、评价指标不统一,导致模型泛化能力难以系统评估。其解决方案的关键在于提出一个名为S-GRADES(Studying Generalization of Student Response Assessments in Diverse Evaluative Settings)的开源、可扩展的基准平台,该平台将14个多样化的评分数据集整合至统一接口,并提供标准化访问和可复现的评估协议,从而支持跨范式、跨任务的模型性能比较与分析。
链接: https://arxiv.org/abs/2603.10233
作者: Tasfia Seuti,Sagnik Ray Choudhury
机构: 未知
类目: Computation and Language (cs.CL)
备注: LREC 2026 Accepted, this https URL
Abstract:Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.
[NLP-41] Sabiá-4 Technical Report
【速读】: 该论文旨在解决巴西葡萄牙语(Brazilian Portuguese)领域大语言模型性能不足的问题,特别是在法律文本处理、多轮对话质量及代理型任务(如工具调用和网络导航)上的局限性。解决方案的关键在于构建一个四阶段训练流水线:首先在葡萄牙语及巴西法律语料上进行持续预训练,其次将上下文长度扩展至128K tokens以增强长文本理解能力,接着在涵盖聊天、代码、法律任务和函数调用的指令数据上进行监督微调,并通过偏好对齐优化模型输出与人类意图的一致性。这一系统化方法使Sabiá-4和Sabiazinho-4在多个基准测试中展现出优越的性价比,尤其在法律文书生成、多轮对话质量和代理任务完成度方面显著优于前代模型。
链接: https://arxiv.org/abs/2603.10213
作者: Thiago Laitz,Thales Sales Almeida,Hugo Abonizio,Roseval Malaquias Junior,Giovana Kerche Bonás,Marcos Piau,Celio Larcher,Ramon Pires,Rodrigo Nogueira
机构: Maritaca AI; Jusbrasil
类目: Computation and Language (cs.CL)
备注:
Abstract:This technical report presents Sabiá-4 and Sabiazinho-4, a new generation of Portuguese language models with a focus on Brazilian Portuguese language. The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal tasks, and function calling, and preference alignment. We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities including tool use and web navigation. Results show that Sabiá-4 and Sabiazinho-4 achieve a favorable cost-performance trade-off compared to other models, positioning them in the upper-left region of the pricing-accuracy chart. The models show improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion.
[NLP-42] ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation AAAI-26
【速读】: 该论文旨在解决越南语方言多样性对自然语言处理(Natural Language Processing, NLP)系统性能的负面影响问题,特别是针对标准越南语训练模型在非标准方言输入(尤其是中部和南部地区)上表现不佳的局限性。其解决方案的关键在于构建首个覆盖越南全部63个省份的、人工标注的方言到标准越南语平行语料库ViDia2Std,该语料库不仅涵盖中部、南部及非标准北部方言,还包含真实场景下的Facebook评论数据,并通过定义语义映射一致性指标评估标注质量(北、中、南三区标注一致率分别为86%、82%、85%)。基于此高质量语料库,研究进一步验证了多种序列到序列模型在方言归一化任务中的效果,表明dialect-aware资源对提升越南语NLP系统的鲁棒性具有关键作用。
链接: https://arxiv.org/abs/2603.10211
作者: Khoa Anh Ta,Nguyen Van Dinh,Kiet Van Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to AAAI-26 (Oral)
Abstract:Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs sourced from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Based on this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters. ViDia2Std demonstrates that dialect normalization substantially improves downstream tasks, highlighting the need for dialect-aware resources in building robust Vietnamese NLP systems.
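论文定义的"语义映射一致性"大意是:两位标注者给出的标准语映射完全相同、或属于同一同义词集合时视为一致。下面是一个示意实现(同义词集合的表示方式与示例词汇均为本文假设):

```python
def mapping_agreement(ann_a, ann_b, synonym_sets):
    """逐句计算两位标注者的映射一致率,同义映射视为一致。

    ann_a, ann_b: 两位标注者对同一批句子给出的标准语映射;
    synonym_sets: 若干同义标准语集合(set 的列表)。
    """
    def same(x, y):
        return x == y or any(x in s and y in s for s in synonym_sets)
    hits = sum(1 for a, b in zip(ann_a, ann_b) if same(a, b))
    return hits / len(ann_a)
```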
[NLP-43] Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中常出现的幻觉问题,即模型输出看似流畅但事实性错误的内容。解决方案的关键在于提出一种实时推理阶段的干预框架——自适应激活抑制(Adaptive Activation Cancellation, AAC),其核心思想是将与幻觉相关的神经激活视为Transformer残差流中的结构化干扰,并借鉴信号处理中的自适应噪声消除原理进行建模。AAC通过逐层线性探测识别出幻觉节点(Hallucination Nodes, H-Nodes),并在自回归生成过程中使用置信度加权的前向钩子(forward hook)对其进行选择性抑制,无需外部知识、微调或额外推理步骤。该方法实现了对模型能力的“外科手术式”精准干预,在显著提升事实准确性的同时,保持了语言建模和推理任务性能的零衰减。
链接: https://arxiv.org/abs/2603.10195
作者: Eric Yocam,Varghese Vaidyan,Gurcan Comert,Paris Kalathas,Yong Wang,Judith L. Mwakalonge
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 8 figures, 23 tables
Abstract:Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation – requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy on all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. On the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving probe-space selectivity 5.94x - 3.5x higher than the ITI baseline – demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.
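AAC 的前向钩子在概念上等价于:对线性探针标记出的 H-Node,按置信度加权缩放其激活,其余单元保持不变。下面是一个纯 Python 示意(实际实现是 transformer 残差流上的 forward hook,作用于张量而非列表):

```python
def suppress_h_nodes(hidden, h_nodes, confidence):
    """按置信度加权抑制幻觉相关激活单元。

    hidden: 单个 token 的激活向量(列表);
    h_nodes: 线性探针标记出的 H-Node 下标;
    confidence: 探针置信度,取值 [0, 1],越高抑制越强。
    """
    out = list(hidden)
    for i in h_nodes:
        out[i] *= (1.0 - confidence)  # confidence=1 时完全置零
    return out
```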
[NLP-44] Video-Based Reward Modeling for Computer-Use Agents
【速读】: 该论文旨在解决计算机使用代理(Computer-using Agents, CUAs)轨迹评估的可扩展性问题,即如何高效、准确地判断一个代理行为序列是否真正满足用户指令。传统方法依赖于代理内部逻辑或动作细节,难以泛化且成本高昂。解决方案的关键在于提出基于执行视频(execution video)的奖励建模框架:通过独立于代理内部推理的视频帧序列进行任务成功预测,利用高质量视频-任务-奖励三元组数据集ExeVR-53k训练模型,并引入对抗性指令翻译生成带步骤级标注的负样本,结合时空标记剪枝技术处理长时高分辨率视频中的冗余信息,从而构建出仅需用户指令与执行视频即可预测任务完成度的执行视频奖励模型(ExeVRM)。该方法在多平台(Ubuntu、macOS、Windows、Android)上实现84.7%准确率和87.7%召回率,优于GPT-5.2和Gemini-3 Pro等主流模型,同时提供更精确的时间定位能力,为CUAs提供了可扩展、模型无关的评估方案。
链接: https://arxiv.org/abs/2603.10178
作者: Linxin Song,Jieyu Zhang,Huanxin Sheng,Taiwei Shi,Gupta Rahul,Yang Liu,Ranjay Krishna,Jian Kang,Jieyu Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent’s internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video–task–reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
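时空标记剪枝的基本思路可以这样示意:第一帧的 patch 全部保留,其后每帧只保留相对前一帧发生变化的 patch,从而丢弃持续不变的冗余区域(阈值与数据结构均为演示假设,并非论文的实际 token 表示):

```python
def prune_persistent_patches(frames, tol=0.0):
    """保留第一帧全部 patch;其后每帧只保留相对前一帧变化超过 tol 的 patch。

    frames: [帧][patch] 的数值列表;返回 (帧序号, patch 序号, 值) 三元组列表。
    """
    kept = [(0, j, v) for j, v in enumerate(frames[0])]
    for t in range(1, len(frames)):
        for j, v in enumerate(frames[t]):
            if abs(v - frames[t - 1][j]) > tol:
                kept.append((t, j, v))  # 只保留决定性的界面变化
    return kept
```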
[NLP-45] OpenClaw-RL: Train Any Agent Simply by Talking
【速读】: 该论文旨在解决当前智能体强化学习(Reinforcement Learning, RL)系统无法有效利用所有交互中产生的“下一状态信号”(next-state signal)进行在线学习的问题。现有方法通常仅从特定任务场景(如终端操作或工具调用)中提取奖励,忽略了用户对话、GUI状态变化等多模态交互中蕴含的丰富反馈信息。其解决方案的核心在于提出 OpenClaw-RL 框架,该框架基于一个关键观察:所有类型的交互都产生统一的下一状态信号,且这些信号可同时用于训练同一策略网络。通过将下一状态信号解耦为两类信息——评价性信号(evaluative signals,以标量奖励形式由预训练模型(PRM)判别)和指令性信号(directive signals,通过事后引导的在线策略蒸馏 Hindsight-Guided On-Policy Distillation, OPD)恢复——并构建增强教师上下文提供token级方向优势监督,实现了比传统标量奖励更丰富的学习信号。此外,异步设计使模型能实时服务请求、PRM持续评估交互、训练器同步更新策略,三者无协调开销,从而支持个人代理与通用代理在多种任务场景(终端、GUI、软件工程SWE、工具调用)下的高效、持续在线强化学习。
链接: https://arxiv.org/abs/2603.10165
作者: Yinjie Wang,Xuyang Chen,Xiaolong Jin,Mengdi Wang,Ling Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Code: this https URL
Abstract:Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: this https URL
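为说明摘要中"下一状态信号可拆解为评价性与指令性两类"的思路,下面给出一个玩具级 Python 草图:用关键词规则代替论文中的 PRM 评审与事后提示抽取(规则、关键词与返回形式均为演示假设):

```python
def route_next_state(next_state):
    """Split a next-state signal into an evaluative scalar reward and
    an optional directive hint, mirroring OpenClaw-RL's decomposition.
    The keyword rules below are toy stand-ins for the PRM judge and
    the hindsight hint extraction described in the paper.
    """
    text = next_state.lower()
    # Evaluative signal: a toy judge mapping feedback to a scalar reward.
    if "error" in text or "wrong" in text:
        reward = 0.0
    elif "thanks" in text or "exit code 0" in text:
        reward = 1.0
    else:
        reward = 0.5
    # Directive signal: a toy hindsight hint for on-policy distillation.
    hint = text if "instead" in text or "should" in text else None
    return reward, hint
```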
[NLP-46] ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning ICLR2026
【速读】: 该论文旨在解决现有混合低秩适配器(Mixture-of-LoRAs)模型中路由权重严重失衡的问题,即在实际训练中仅少数几个低秩适配器(LoRA)占据主导地位,导致有效适配器数量受限,从而严重削弱模型表达能力。解决方案的关键在于提出一种名为 Reinforcement Routing for Mixture-of-LoRAs (ReMix) 的新型路由器:通过采用非学习型路由权重确保所有活跃 LoRA 在路由过程中具有相等的贡献,避免单一 LoRA 主导;同时引入基于 reinforce leave-one-out (RLOO) 技术的无偏梯度估计器,将监督损失视为奖励、路由器视为策略,在强化学习框架下实现可训练性,从而支持大规模计算资源扩展以提升预测性能。
链接: https://arxiv.org/abs/2603.10160
作者: Ruizhong Qiu,Hanqing Zeng,Yinglong Xia,Yiwen Meng,Ren Chen,Jiarui Feng,Dongqi Fu,Qifan Wang,Jiayi Liu,Jun Xiao,Xiangjun Fan,Benyu Zhang,Hong Li,Zhining Liu,Hyunsik Yoo,Zhichen Zeng,Tianxin Wei,Hanghang Tong
机构: University of Illinois Urbana-Champaign, IL, USA; Meta AI; Washington University in St. Louis, MO, USA
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: LLA @ ICLR 2026
Abstract:Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of specialized LoRAs of the layer. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that the routing weights are typically extremely imbalanced across LoRAs in practice, where only one or two LoRAs often dominate the routing weights. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router design that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is to use non-learnable routing weights to ensure that all active LoRAs are equally effective, with no single LoRA dominating the routing weights. However, our router cannot be trained directly via gradient descent due to the non-learnable routing weights. Hence, we further propose an unbiased gradient estimator for the router by employing the reinforce leave-one-out (RLOO) technique, where we regard the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also enables scaling up training compute to boost the predictive performance of ReMix. Extensive experiments demonstrate that our proposed ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.
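摘要提到的 reinforce leave-one-out (RLOO) 基线可以简洁实现:每个样本的基线取其余 K-1 个样本的平均奖励,得到无偏且方差更低的策略梯度估计。以下为一个通用示意(非论文官方代码):

```python
def rloo_advantages(rewards):
    """Reinforce-leave-one-out advantages for a group of K rollouts.

    Each sample's baseline is the mean reward of the other K-1 samples,
    giving an unbiased, variance-reduced policy-gradient estimator.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

留一法基线不依赖可学习的价值函数,这与 ReMix 采用非学习型路由权重的设计相契合。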
[NLP-47] Lost in Backpropagation: The LM Head is a Gradient Bottleneck
【速读】: 该论文旨在解决神经语言模型(Neural Language Models, LMs)中因输出层维度不匹配导致的softmax瓶颈问题,该瓶颈不仅限制了模型的表达能力,还成为优化过程中的关键障碍。其核心发现是:输出层将高维特征(维度D)映射到词汇表大小(维度V)的logits时,由于D ≪ V,会导致梯度在反向传播过程中发生不可避免的压缩,从而使得95–99%的梯度范数被抑制,造成绝大多数参数获得次优更新方向。解决方案的关键在于重新审视并设计更有效的LM头结构(LM head),以缓解梯度压缩问题,从而提升大规模语言模型训练效率与学习能力。
链接: https://arxiv.org/abs/2603.10145
作者: Nathan Godey,Yoav Artzi
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The last layer of neural language models (LMs) projects output features of dimension D to logits in dimension V, the size of the vocabulary, where usually D ≪ V. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating V-dimensional gradients through a rank-D linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
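摘要中的梯度压缩现象可以用一个极小的数值例子演示:设 logits = W·h(W 为 V×D 矩阵),反向传播得到的特征梯度为 g_h = Wᵀ·g_logits,凡与 W 列空间正交的梯度分量都会被丢弃。以下为演示草图(矩阵与数值均为假设,并非论文的测量代码):

```python
def backprop_through_head(W, g_logits):
    """Pull a V-dim logit gradient back through a rank-D LM head.

    With logits = W @ h and W of shape (V, D), the feature gradient is
    g_h = W^T @ g_logits. Any component of g_logits orthogonal to the
    column space of W is lost -- the compression the paper identifies.
    """
    D = len(W[0])
    return [sum(W[v][d] * g_logits[v] for v in range(len(W)))
            for d in range(D)]
```

极端情形:D=1、V=2 且 W 两行相同,则任何"反对称"的 logit 梯度完全被压缩为零。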
[NLP-48] Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation
【速读】: 该论文旨在解决标准检索增强生成(Retrieval-Augmented Generation, RAG)框架在高风险领域中因缺乏对中间推理过程的验证机制而导致的幻觉问题,从而影响答案的事实准确性。解决方案的关键在于提出一种领域特定的RAG架构,其核心创新包括:1)引入神经查询重写与基于BGE的交叉编码器重排序(cross-encoder reranking),提升检索相关性;2)设计一个理由生成模块,将子命题锚定于具体的证据片段(evidence spans),实现可解释的推理链构建;3)提出八类细粒度的忠实性验证分类法(verification taxonomy),区分显式与隐式支持模式,支持结构化错误诊断。实验证明,该方法在BioASQ和PubMedQA基准上显著优于基线模型,在有限token预算下仍能保持高精度,且结合动态上下文学习与鲁棒重排序进一步提升了少样本场景下的性能表现。
链接: https://arxiv.org/abs/2603.10143
作者: Eeham Khan,Luis Rodriguez,Marc Queudot
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to Canadian AI 2026
Abstract:Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify intermediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit reasoning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a rationale generation module that grounds sub-claims in specific evidence spans. We further introduce an eight-category verification taxonomy that enables fine-grained assessment of rationale faithfulness, distinguishing between explicit and implicit support patterns to facilitate structured error diagnosis. We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets. Experiments demonstrate that explicit rationale generation improves accuracy over vanilla RAG baselines, while dynamic demonstration selection combined with robust reranking yields further gains in few-shot settings. Using Llama-3-8B-Instruct, our approach achieves 89.1% on BioASQ-Y/N and 73.0% on PubMedQA, competitive with systems using significantly larger models. Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in biomedical question answering.
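摘要中"将子命题锚定于证据片段并判定支持类型"的思路,可以用一个粗糙的词重叠规则来示意"显式/隐式/无支持"三种判定(论文实际使用八类体系,此处规则与阈值均为演示假设):

```python
def classify_support(claim, evidence):
    """Toy three-way faithfulness check for a sub-claim against an
    evidence span: 'explicit' if every claim word appears in the
    evidence, 'implicit' if at least half do, else 'unsupported'.
    A crude stand-in for the paper's eight-category taxonomy.
    """
    claim_words = set(claim.lower().split())
    evid_words = set(evidence.lower().split())
    if not claim_words:
        return "unsupported"
    overlap = len(claim_words & evid_words) / len(claim_words)
    if overlap == 1.0:
        return "explicit"
    if overlap >= 0.5:
        return "implicit"
    return "unsupported"
```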
[NLP-49] The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory
【速读】: 该论文试图解决的问题是:形式文法的生成(generation)、识别(recognition)与语法推断(grammar induction)三者之间的关系及其不对称性未被系统性地统一分析,尤其是在计算复杂性、歧义性、方向性、信息可用性、语法推断和时序性等维度上缺乏清晰界定。解决方案的关键在于提出并识别六个维度来刻画生成与识别的不对称性,并指出传统观点“生成容易、解析困难”具有误导性——实际上,无约束生成虽简单,但带约束的生成可达到NP难;而解析始终受限于输入数据,生成则不一定受此限制。特别地,作者引入“时序性”这一新维度,将其与Hale(2001)和Levy(2008)提出的 surprisal 框架相联系,阐明生成器的 surprisal 为零,而解析器因预测不确定性导致 surprisal > 0,从而形式化了二者在时间维度上的本质差异。最终,论文强调大型语言模型(LLM)虽在架构上统一了生成与识别,但仍保留操作层面的不对称性,这为未来NLP系统设计提供了理论依据。
链接: https://arxiv.org/abs/2603.10139
作者: Romain Peyrichou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Formal Languages and Automata Theory (cs.FL)
备注: Submitted to Information and Computation. 32 pages, 6 figures, 4 tables
Abstract:Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or – given only examples – to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent – they characterize the same set – but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions – directionality and temporality – have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.
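摘要中生成器与解析器的时序不对称可以直接用 surprisal 公式演示:surprisal = -log₂ P(token | context)。生成器对自己将要产出的词概率为 1,surprisal 为 0;解析器在不确定性下预测,surprisal > 0:

```python
import math

def surprisal(prob):
    """Surprisal in bits: -log2 P(token | context).

    A generator emits the token it already chose (P = 1, surprisal 0);
    a parser predicting under uncertainty assigns P < 1 and pays
    surprisal > 0 -- the temporal asymmetry the paper formalizes via
    Hale (2001) and Levy (2008).
    """
    return -math.log2(prob)
```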
[NLP-50] he Prediction-Measurement Gap: Toward Meaning Representations as Scientific Instruments
【速读】: 该论文试图解决当前文本嵌入(text embeddings)在计算社会科学研究和心理学中面临的“预测-测量差距”问题,即现有表示学习方法主要优化预测与检索性能,导致其作为科学测量工具时存在几何可解释性差、对非语义混杂因素敏感、难以进行回归式语义方向推断等缺陷。解决方案的关键在于提出“科学可用性”(scientific usability)的新目标体系,强调几何可读性、可解释性、与语言证据的可追溯性、对非语义干扰的鲁棒性以及与回归分析兼容的语义方向建模能力,并进一步提出三个核心路径:(i) 几何优先设计,包括受心理特权层级约束的层次感知空间;(ii) 可逆后处理变换以重构嵌入几何并降低噪声影响;(iii) 构建语义地图集(meaning atlases)与面向测量的评估协议,从而实现可靠且可追溯的语义推理。
链接: https://arxiv.org/abs/2603.10130
作者: Hubert Plisiecki
机构: IDEAS Research Institute (IDEAS 研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Text embeddings have become central to computational social science and psychology, enabling scalable measurement of meaning and mixed-method inference. Yet most representation learning is optimized and evaluated for prediction and retrieval, yielding a prediction-measurement gap: representations that perform well as features may be poorly suited as scientific instruments. The paper argues that scientific meaning analysis motivates a distinct family of objectives - scientific usability - emphasizing geometric legibility, interpretability and traceability to linguistic evidence, robustness to non-semantic confounds, and compatibility with regression-style inference over semantic directions. Grounded in cognitive and neuro-psychological views of meaning, the paper assesses static word embeddings and contextual transformer representations against these requirements: static spaces remain attractive for transparent measurement, whereas contextual spaces offer richer semantics but entangle meaning with other signals and exhibit geometric and interpretability issues that complicate inference. The paper then outlines a course-setting agenda around (i) geometry-first design for gradients and abstraction, including hierarchy-aware spaces constrained by psychologically privileged levels; (ii) invertible post-hoc transformations that recondition embedding geometry and reduce nuisance influence; and (iii) meaning atlases and measurement-oriented evaluation protocols for reliable and traceable semantic inference. As the field debates the limits of scale-first progress, measurement-ready representations offer a principled new frontier.
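摘要提出的"可逆后处理变换"(invertible post-hoc transformations)可用最简单的逐维中心化加缩放来示意:变换同时返回参数,因此能被精确还原。具体变换形式为演示假设,并非论文给出的方法:

```python
def center_scale(vectors):
    """Invertible post-hoc reconditioning: mean-center and unit-scale
    each dimension. Returns (transformed, params) so the map can be
    undone exactly -- a minimal example of an invertible transform,
    not a specific method from the paper.
    """
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    # `or 1.0` guards constant dimensions against division by zero.
    scales = [max(abs(v[d] - means[d]) for v in vectors) or 1.0
              for d in range(dims)]
    out = [[(v[d] - means[d]) / scales[d] for d in range(dims)]
           for v in vectors]
    return out, (means, scales)

def invert(transformed, params):
    """Exact inverse of center_scale."""
    means, scales = params
    return [[x * scales[d] + means[d] for d, x in enumerate(v)]
            for v in transformed]
```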
[NLP-51] Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中广泛存在的“中间迷失”(Lost in the Middle)现象,即模型在处理长上下文时对开头和结尾信息的检索能力较强,而对中间段落的表现显著下降的问题。传统观点认为这一现象源于训练过程中Softmax的伪影或位置编码(如RoPE)的距离衰减效应,但本文提出全新见解:该U形性能曲线在模型初始化阶段即已存在,是因果解码器(causal decoder)与残差连接(residual connections)结构本身的几何特性所致。解决方案的关键在于通过将多层因果注意力建模为Cesàro矩阵的迭代幂次,并在连续极限下推导出影响密度的闭式表达式,揭示了梯度影响在提示起始处呈对数发散(Primacy Tail),而在末尾token处形成O(1)的孤立锚点(Recency Delta),二者之间存在一个阶为O(1/(H−1)!)的因子死区(factorial dead zone),导致中间上下文的训练和推理结构上受限。实验证明未训练的Qwen2和GPT-2架构已在Step 0即呈现相同U形模式,且不受RoPE影响,说明该现象是架构基准而非训练可消除的偏差。
链接: https://arxiv.org/abs/2603.10123
作者: Borun D Chowdhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 7 figures
Abstract:The "Lost in the Middle" phenomenon – a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle – is widely attributed to learned Softmax artifacts or the distance-decay of positional encodings like RoPE. This paper makes a single, precise claim: the U-shape is already present at initialization, before any training or positional encoding takes effect. It is an inherent geometric property of the causal decoder with residual connections. We model multi-layer causal attention as iterated powers of the Cesàro matrix and derive the exact closed-form influence density in the continuous limit. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail), while residual connections create an isolated O(1) anchor at the final token (the Recency Delta). Between these extremes lies a factorial dead zone of order O(1/(H-1)!), where H is the network depth, making middle-context retrieval and training structurally hostile. We validate empirically that untrained Qwen2 and GPT-2 architectures exhibit this U-shape at Step 0, and that it is identical with or without RoPE. Comparing initialized and pretrained networks, we show that standard training does not overcome the topological valley, confirming that the U-shape persists as an architectural baseline under standard pretraining objectives. We do not claim that this bias is insurmountable, nor that interventions such as RoPE modifications are useless. We establish what the baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.
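摘要中的 Cesàro 矩阵迭代可以直接数值验证:取单层算子 M = 0.5·I + 0.5·A(A 为因果均匀注意力矩阵,0.5·I 代表残差连接),计算 M^H 的最后一行即得各输入位置对末位标记的影响,呈现首尾高、中间低的 U 形(玩具规模演示,论文在连续极限下推导):

```python
def causal_influence(n, depth):
    """Last-row influence profile of depth-layer causal attention.

    One layer is M = 0.5*I + 0.5*A, where A is the Cesaro matrix
    A[i][j] = 1/(i+1) for j <= i (uniform causal attention) and the
    0.5*I term stands in for the residual connection.
    """
    A = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)]
         for i in range(n)]
    M = [[0.5 * (i == j) + 0.5 * A[i][j] for j in range(n)]
         for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(depth):
        P = [[sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return P[-1]  # influence of each input position on the final token

influence = causal_influence(8, 3)
```

例如 n=8、H=3 时,首位与末位的影响均明显高于中间位置,分别对应论文的 Primacy Tail 与 Recency Delta。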
[NLP-52] CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练大语言模型(Large Language Models, LLMs)时存在的局限性:RLVR仅依赖最终答案作为奖励信号,忽略了中间推理步骤的正确性,导致模型可能学习到“过程错误但结果正确”的轨迹,从而引发幻觉(hallucination)和答案复制(answer-copying)问题,损害模型的泛化能力和鲁棒性。解决方案的关键在于引入对比学习(Contrastive Learning)机制到策略优化中,提出CLIPO方法——通过在成功轨迹上优化对比损失,引导LLM捕捉不同正确推理路径间的不变结构(invariant structure),实现跨轨迹的正则化,相比原RLVR仅基于单条路径的监督,显著缓解了步骤级推理不一致,并有效抑制了幻觉伪影。
链接: https://arxiv.org/abs/2603.10101
作者: Sijia Cui,Pengyu Cheng,Jiajun Song,Yongbo Gai,Guojun Zhang,Zhechao Yu,Jianhe Lin,Xiaoxi Jiang,Guanjun Jiang
机构: Alibaba(阿里巴巴); Chinese Academy of Sciences(中国科学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model’s generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at this https URL.
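摘要中"在成功轨迹上优化对比损失"可用通用的 InfoNCE 形式示意:拉近锚点与正样本(另一条正确推理路径)的嵌入,推开负样本。嵌入、相似度与温度的选择均为演示假设,论文的具体损失可能不同:

```python
import math

def info_nce(anchor, positive, negatives, temperature=1.0):
    """Generic InfoNCE-style contrastive loss over rollout embeddings:
    lower when the anchor is closer to the positive than to negatives.
    Illustrates the contrastive mechanism CLIPO adds; not the paper's
    exact objective.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    pos = math.exp(dot(anchor, positive) / temperature)
    negs = sum(math.exp(dot(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + negs))
```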
[NLP-53] Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models ICLR2026
【速读】: 该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)内部表征的可解释性问题,尤其是在高风险场景中模型决策机制不透明的挑战。其解决方案的关键在于首次将稀疏自编码器(Sparse Autoencoders, SAEs)应用于TSFM——Chronos-T5-Large,通过在六个编码层上训练TopK SAEs来提取关键激活特征,并结合392次单特征消融实验验证每个特征的因果相关性。研究发现,中间编码层包含对预测性能最敏感的因果关键特征(最大单特征CRPS退化达38.61),而非语义最丰富的最终编码层,表明模型依赖于突变动态检测而非周期模式识别,从而实现了对TSFM内部机制的有效解耦与定位。
链接: https://arxiv.org/abs/2603.10071
作者: Anurag Mishra
机构: Rochester Institute of Technology (罗切斯特理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted as a poster in ICLR 2026 Workshop on Time Series in the Age of Large Models (TSALM)
Abstract:Time series foundation models (TSFMs) are increasingly deployed in high-stakes domains, yet their internal representations remain opaque. We present the first application of sparse autoencoders (SAEs) to a TSFM, training TopK SAEs on activations of Chronos-T5-Large (710M parameters) across six layers. Through 392 single-feature ablation experiments, we establish that every ablated feature produces a positive CRPS degradation, confirming causal relevance. Our analysis reveals a depth-dependent hierarchy: early encoder layers encode low-level frequency features, the mid-encoder concentrates causally critical change-detection features, and the final encoder compresses a rich but less causally important taxonomy of temporal concepts. The most critical features reside in the mid-encoder (max single-feature Delta CRPS = 38.61), not in the semantically richest final encoder layer, where progressive ablation paradoxically improves forecast quality. These findings demonstrate that mechanistic interpretability transfers effectively to TSFMs and that Chronos-T5 relies on abrupt-dynamics detection rather than periodic pattern recognition.
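摘要中 TopK 稀疏自编码器的编码与单特征消融机制可用如下草图示意(编码权重为手工给定,仅演示机制;取阈值时若有并列,可能保留多于 k 个激活):

```python
def topk_encode(x, enc_weights, k):
    """TopK sparse autoencoder encoder: keep the k largest
    pre-activations, zero the rest. Weights here are hand-picked for
    illustration; the paper's SAEs are trained on model activations.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in enc_weights]
    threshold = sorted(pre, reverse=True)[k - 1]
    return [p if p >= threshold else 0.0 for p in pre]

def ablate(features, idx):
    """Single-feature ablation: zero one latent, as in the paper's
    392 ablation experiments."""
    return [0.0 if i == idx else f for i, f in enumerate(features)]
```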
[NLP-54] Improving Search Agent with One Line of Code
【速读】: 该论文针对Tool-based Agentic Reinforcement Learning (TARL) 中存在的训练不稳定性问题——即重要性采样分布漂移(Importance Sampling Distribution Drift, ISDD)——提出了解决方案。ISDD 在广泛采用的 Group Relative Policy Optimization (GRPO) 算法中表现为重要性采样比率急剧下降,导致梯度更新失效并引发不可逆的训练崩溃。解决方案的关键在于提出 Search Agent Policy Optimization (SAPO),其通过引入条件性的 token-level KL 散度约束来稳定训练:该约束仅对当前策略中概率显著降低的正向 token 施加惩罚,从而有效抑制分布漂移,同时保留梯度传播能力;相比硬截断方法,SAPO 能更精细地控制策略变化,且只需对标准 GRPO 做一行代码修改即可部署,在多个问答基准上实现显著性能提升(绝对提升+10.6%,相对提升+31.5%)。
链接: https://arxiv.org/abs/2603.10069
作者: Jian Li,Dongsheng Chen,Zhenhua Xu,Yizhang Jin,Jiafu Wu,Chengjie Wang,Xiaotong Yuan,Yabiao Wang
机构: Nanjing University (南京大学); Tencent YoutuLab (腾讯优图实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact with external tools for a multi-turn information-seeking process autonomously. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves a +10.6% absolute improvement (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
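摘要所述"一行代码修改"的条件 KL 约束可示意如下:仅当正优势标记在新策略下的概率塌缩到阈值以下时才施加惩罚。阈值与惩罚形式均为演示假设(此处用对数概率比代替完整的 KL 项):

```python
import math

def sapo_token_penalty(p_new, p_old, advantage, prob_floor=0.1):
    """Conditional token-level penalty in the spirit of SAPO's one-line
    change to GRPO: penalize divergence from the old policy only for
    positive-advantage tokens whose new probability has collapsed.
    The threshold and the log-ratio form are illustrative assumptions.
    """
    if advantage > 0 and p_new < prob_floor:
        return math.log(p_old / p_new)  # positive when p_new collapsed
    return 0.0
```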
[NLP-55] ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)安全评估中普遍存在的静态、离散化评价方式所导致的局限性问题,即现有方法通常仅测试单个提示(prompt)并报告二元通过/失败结果,无法捕捉对抗交互下安全机制的动态演化过程。其解决方案的关键在于提出ADVERSA框架,该框架通过自动化红队测试(red-teaming)实现对防御护栏(guardrail)退化过程的连续量化测量——具体而言,使用一个经过微调的70B级攻击者模型(ADVERSA-Red,基于Llama-3.1-70B-Instruct与QLoRA优化),消除原始攻击模型中的安全拒绝行为;同时引入结构化的五点评分体系,将部分合规状态视为可测量的独立状态而非简单归类为失败,从而更精细地刻画模型在多轮对抗中的安全表现轨迹。这一设计使得研究能够系统识别早期高风险攻击模式、评估判官一致性及攻击者漂移等关键因素,显著提升了LLM安全性评估的深度与可靠性。
链接: https://arxiv.org/abs/2603.10068
作者: Harry Owiredu-Ashley
机构: Independent Researcher(独立研究员)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 12 figures. Independent research. Code and artifacts: this https URL
Abstract:Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red-teaming framework that measures guardrail degradation dynamics as continuous per-round compliance trajectories rather than discrete jailbreak events. ADVERSA uses a fine-tuned 70B attacker model (ADVERSA-Red, Llama-3.1-70B-Instruct with QLoRA) that eliminates the attacker-side safety refusals that render off-the-shelf models unreliable as attackers, scoring victim responses on a structured 5-point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment across three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2) using a triple-judge consensus architecture in which judge reliability is measured as a first-class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting that in this evaluation setting, successful jailbreaks were concentrated in early rounds rather than accumulating through sustained pressure. We document inter-judge agreement rates, self-judge scoring tendencies, attacker drift as a failure mode in fine-tuned attackers deployed out of their training distribution, and attacker refusals as a previously-underreported confound in victim resistance measurement. All limitations are stated explicitly. Attack prompts are withheld per responsible disclosure policy; all other experimental artifacts are released.
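摘要中"按轮次的连续合规轨迹"可以用一个小函数汇总:给定每轮的 5 分制评分,输出是否越狱、首次越狱轮次与平均分(越狱阈值取 4 分,属演示假设):

```python
def trajectory_summary(round_scores, jailbreak_threshold=4):
    """Summarize a per-round compliance trajectory on a 5-point rubric
    (1 = full refusal .. 5 = full compliance). The threshold choice is
    an illustrative assumption, not ADVERSA's published criterion.
    Returns (jailbroken, first_jailbreak_round or None, mean score).
    """
    first = next((i + 1 for i, s in enumerate(round_scores)
                  if s >= jailbreak_threshold), None)
    mean = sum(round_scores) / len(round_scores)
    return first is not None, first, mean
```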
[NLP-56] Tool Receipts Not Zero-Knowledge Proofs: Practical Hallucination Detection for AI Agents
【速读】: 该论文旨在解决生成式 AI(Generative AI)代理在通过工具调用执行任务时频繁产生幻觉的问题,包括虚构工具执行、错误陈述输出数量以及将推断当作事实呈现等。其解决方案的关键在于提出 NabaOS 框架,该框架受印度认识论(Nyaya Shastra)启发,对大语言模型(LLM)响应中的每个主张按其认识来源(pramana)进行分类:直接工具输出(pratyaksha)、推理(anumana)、外部证言(shabda)、缺失(abhava)或无依据观点,并通过运行时生成 HMAC 签名的工具执行凭证来防止伪造,进而实时交叉验证声明以检测幻觉。此方法实现了低延迟(15ms/响应)与高覆盖率(如 94.2% 的虚构工具引用检测率)之间的最优权衡,优于依赖零知识证明(zkLLM,180s/query)等传统方案,为交互式 AI 代理提供了实用且具可操作性的可信度信号。
链接: https://arxiv.org/abs/2603.10060
作者: Abhinaba Basu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:AI agents that execute tasks via tool calls frequently hallucinate results - fabricating tool executions, misstating output counts, or presenting inferences as facts. Recent approaches to verifiable AI inference rely on zero-knowledge proofs, which provide cryptographic guarantees but impose minutes of proving time per query, making them impractical for interactive agents. We propose NabaOS, a lightweight verification framework inspired by Indian epistemology (Nyaya Shastra), which classifies every claim in an LLM response by its epistemic source (pramana): direct tool output (pratyaksha), inference (anumana), external testimony (shabda), absence (abhava), or ungrounded opinion. Our runtime generates HMAC-signed tool execution receipts that the LLM cannot forge, then cross-references claims against these receipts to detect hallucinations in real time. We evaluate on NyayaVerifyBench, a new benchmark of 1,800 agent response scenarios across four languages with injected hallucinations of six types. NabaOS detects 94.2% of fabricated tool references, 87.6% of count misstatements, and 91.3% of false absence claims, with 15ms verification overhead per response. For deep delegation (agents performing multi-step web tasks), our cross-checking protocol catches 78.4% of URL fabrications via independent re-fetching. We compare against five approaches: zkLLM (cryptographic proofs, 180s/query), TOPLOC (locality-sensitive hashing), SPEX (sampling-based proof of execution), tensor commitments, and self-consistency checking. NabaOS achieves the best cost-latency-coverage trade-off for interactive agents: 94.2% coverage at 15ms versus zkLLM’s near-perfect coverage at 180,000ms. For interactive agents, practical receipt-based verification provides better cost-benefit than cryptographic proofs, and epistemic classification gives users actionable trust signals rather than binary judgments.
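摘要中 HMAC 签名的工具执行凭证可用 Python 标准库直接示意:运行时持有密钥并签发凭证,验证时重算签名即可发现被篡改或伪造的声明(字段名与密钥管理方式为演示假设):

```python
import hashlib
import hmac
import json

SECRET = b"runtime-only-key"  # held by the runtime, never by the LLM

def sign_receipt(tool, output):
    """Issue an HMAC-signed receipt for one tool execution, in the
    spirit of NabaOS's tool receipts; field names are illustrative."""
    payload = json.dumps({"tool": tool, "output": output}, sort_keys=True)
    tag = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def verify_receipt(receipt):
    """Reject any claim whose receipt the runtime did not sign."""
    expected = hmac.new(SECRET, receipt["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["tag"])
```

由于 LLM 不持有密钥,它无法为虚构的工具调用伪造出能通过验证的凭证。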
[NLP-57] Training Language Models via Neural Cellular Automata
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)预训练阶段依赖自然语言数据所引发的三大问题:高质量文本资源有限、蕴含人类偏见,以及知识与推理能力纠缠不清。为应对这些挑战,作者提出一种新颖的“预预训练”(pre-pre-training)范式——利用神经元胞自动机(Neural Cellular Automata, NCA)生成合成的非语言数据,作为LLM的初始预训练信号,随后再过渡到自然语言训练。其核心创新在于NCA能够以低成本大规模生成具有丰富时空结构和统计特性、但又可控且无语言偏见的合成数据,实验表明仅用1.64亿个NCA标记即可显著提升下游语言建模性能(最高达6%),并加速收敛(最高1.6倍),甚至优于使用16亿条Common Crawl自然语言文本的预训练效果,同时在GSM8K、HumanEval等推理任务上也展现出迁移优势。这揭示了合成数据驱动的预训练路径在提升模型效率和可控制性方面的潜力。
链接: https://arxiv.org/abs/2603.10055
作者: Dan Lee,Seungwook Han,Akarsh Kumar,Pulkit Agrawal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Website: this https URL
Abstract:Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs–training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
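论文使用神经元胞自动机(NCA)生成合成数据;下面用经典的一维初等元胞自动机(如 Rule 110)作为简化替身,演示"从自动机动态生成结构化标记流"的思路。这并非论文的 NCA 实现,仅为示意:

```python
def ca_tokens(rule, width, steps):
    """Flatten the evolution of an elementary cellular automaton
    (Wolfram rule numbering, periodic boundary, single middle seed)
    into a token stream -- a classical 1D stand-in for the paper's
    neural CA data generator.
    """
    cells = [0] * width
    cells[width // 2] = 1  # single seed in the middle
    stream = list(cells)
    for _ in range(steps):
        # Each neighborhood (l, c, r) indexes one bit of `rule`.
        cells = [(rule >> (4 * cells[(i - 1) % width]
                           + 2 * cells[i]
                           + cells[(i + 1) % width])) & 1
                 for i in range(width)]
        stream.extend(cells)
    return stream
```

改变规则编号即可调节生成序列的复杂度,呼应论文中"最优合成数据复杂度随目标领域而异"的发现。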
[NLP-58] Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety
【速读】: 该论文旨在解决语言模型在实际部署中因引入代理架构(agentic scaffolds)而导致的安全性评估偏差问题。传统安全基准测试多采用封闭式选择题形式,而生产环境中模型常通过推理轨迹、批评代理和委托流水线等结构进行重构输入,这种“架构化”部署可能显著改变模型安全性表现。研究的关键在于通过大规模受控实验(N = 62,808,六种前沿模型,四种部署配置),结合预注册、评估者盲法、等效性检验(TOST)和规范曲线分析,发现:评估格式本身(从多项选择到开放式)对安全评分的影响(5–20个百分点)远大于不同代理架构的效果;且不同模型对同一架构的响应存在显著异质性(如一个模型在sycophancy指标上下降16.8个百分点,另一模型反而提升18.8个百分点),表明不能对代理架构做出普遍性安全结论。因此,解决方案的核心是强调必须针对每个模型和部署配置单独测试,并警惕现有评估框架在开放格式下的系统性偏差。
链接: https://arxiv.org/abs/2603.10044
作者: David Gringras
机构: Harvard University (哈佛大学); Massachusetts Institute of Technology (麻省理工学院)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 74 pages including appendices. 6 frontier models, 62,808 primary observations (~89k total). Pre-registered: OSF DOI https://doi.org/10.17605/OSF.IO/CJW92 . Code and data: this https URL
Abstract:Safety benchmarks evaluate language models in isolation, typically using multiple-choice format; production deployments wrap these models in agentic scaffolds that restructure inputs through reasoning traces, critic agents, and delegation pipelines. We report one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), combining pre-registration, assessor blinding, equivalence testing, and specification curve analysis. Map-reduce scaffolding degrades measured safety (NNH = 14), yet two of three scaffold architectures preserve safety within practically meaningful margins. Investigating the map-reduce degradation revealed a deeper measurement problem: switching from multiple-choice to open-ended format on identical items shifts safety scores by 5-20 percentage points, larger than any scaffold effect. Within-format scaffold comparisons are consistent with practical equivalence under our pre-registered +/-2 pp TOST margin, isolating evaluation format rather than scaffold architecture as the operative variable. Model x scaffold interactions span 35 pp in opposing directions (one model degrades by -16.8 pp on sycophancy under map-reduce while another improves by +18.8 pp on the same benchmark), ruling out universal claims about scaffold safety. A generalisability analysis yields G = 0.000: model safety rankings reverse so completely across benchmarks that no composite safety index achieves non-zero reliability, making per-model, per-configuration testing a necessary minimum standard. We release all code, data, and prompts as ScaffoldSafety.
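摘要中预注册的 ±2 个百分点 TOST 等效性检验可按正态近似示意:当观测差异既显著大于 -margin、又显著小于 +margin 时判定"实践等效"(实现细节为演示假设,非该研究的分析代码):

```python
def tost_equivalent(diff, se, margin):
    """Two one-sided tests (TOST) under a normal approximation:
    declare equivalence only if the difference is significantly
    greater than -margin AND significantly less than +margin
    (one-sided alpha = 0.05, z critical ~ 1.6449).
    """
    z_crit = 1.6449
    z_lower = (diff + margin) / se  # tests H0: diff <= -margin
    z_upper = (diff - margin) / se  # tests H0: diff >= +margin
    return z_lower > z_crit and z_upper < -z_crit
```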
[NLP-59] TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records
【速读】: 该论文旨在解决急诊分诊(emergency triage)研究中因监管限制导致无法直接获取护士与患者互动数据的问题,从而阻碍了对真实对话场景下分诊决策行为的建模与分析。其解决方案的关键在于提出TriageSim——一个能够从结构化电子健康记录(EHR)生成个性化条件驱动的分诊对话模拟框架,支持多轮交互并具备对话语流不流畅性和决策行为的显式控制能力,最终产出约800条合成对话文本及对应音频语料,为后续基于对话的分诊分类任务提供高质量训练数据。
链接: https://arxiv.org/abs/2603.10035
作者: Dipankar Srirag,Quoc Dung Nguyen,Aditya Joshi,Padmanesan Narasimhan,Salil Kanhere
机构: University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL)
备注: 6 pages, 3 figures, 2 tables
Abstract:Research in emergency triage is restricted to structured electronic health records (EHR) due to regulatory constraints on nurse-patient interactions. We introduce TriageSim, a simulation framework for generating persona-conditioned triage conversations from structured records. TriageSim enables multi-turn nurse-patient interactions with explicit control over disfluency and decision behaviour, producing a corpus of ~800 synthetic transcripts and corresponding audio. We use a combination of automated analysis for linguistic, behavioural and acoustic fidelity alongside manual evaluation for medical fidelity using a random subset of 50 conversations. The utility of the generated corpus is examined via conversational triage classification. We observe modest agreement for acuity levels across three modalities: generated synthetic text, ASR transcripts, and direct audio inputs. The code, persona schemata and triage policy prompts for TriageSim will be available upon acceptance.
[NLP-60] A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment AAAI2026
【速读】: 该论文旨在解决传统认知刺激疗法(Cognitive Stimulation Therapy, CST)难以规模化,以及现有数字系统在群体对话交互和认知刺激原则实施方面存在的局限性问题。针对大语言模型(Large Language Models, LLMs)在该场景中面临的三大挑战——认知刺激对话范式缺失、缺乏治疗推理能力及静态用户建模——其解决方案的关键在于提出一种基于原则驱动的自适应策略,并通过群组认知刺激对话(Group Cognitive Stimulation Dialogue, GCSD)系统实现。该系统集成四个核心模块:多说话人上下文控制器以解决角色混淆问题、动态参与者认知状态建模以支持个性化交互、面向认知刺激的关注损失机制以嵌入治疗推理逻辑,以及多维奖励策略以提升响应价值,从而显著优于基线模型,在计算性能与临床应用之间建立更有效的桥梁。
链接: https://arxiv.org/abs/2603.10034
作者: Jiyue Jiang,Yanyu Chen,Pengan Chen,Kai Liu,Jingqi Zhou,Zheyong Zhu,He Hu,Fei Ma,Qi Tian,Chuan Wu
机构: 1. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 2. The University of Hong Kong (香港大学); 3. Guangdong Provincial Key Laboratory of Artificial Intelligence and Computer Vision (广东省人工智能与计算机视觉重点实验室); 4. Tencent AI Lab (腾讯AI实验室)
类目: Computation and Language (cs.CL)
备注: Accepted by AAAI 2026
Abstract:Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.
[NLP-61] Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights
【速读】: 该论文旨在解决当前图基础模型(Graph Foundation Models, GFM)评估中忽视格式域(format domain)与主题域(topic domain)双重差异的问题。现有基准测试通常仅在主题域上变化,未能充分揭示模型在跨领域迁移时对结构表示差异的鲁棒性。其解决方案的关键在于提出一个双轴评估协议,系统性地分离并量化知识迁移在语义泛化(semantic generalization)与表示鲁棒性(robustness to representational shifts)两个维度的表现,涵盖多域自监督预训练和少样本下游适配全过程,并通过四个受控实验设置实现精细化评估。
链接: https://arxiv.org/abs/2603.10033
作者: Xingtong Yu,Shenghua Ye,Ruijuan Liang,Chang Zhou,Hong Cheng,Xinming Zhang,Yuan Fang
机构: The Chinese University of Hong Kong(香港中文大学); Singapore Management University(新加坡管理大学); University of Science and Technology of China(中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph foundation models (GFM) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two-dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi-domain self-supervised pre-training and few-shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre-training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre-training as in (i), while adapting to seen datasets; (iii) pre-training on a single topic domain, while adapting to other topics; (iv) pre-training on a base format, while adapting to other formats. This two-axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Codes/data are available at this https URL.
[NLP-62] Measuring and Eliminating Refusals in Military Large Language Models
【速读】: 该论文旨在解决军事领域大型语言模型(Military Large Language Models, MLLMs)在面对合法军事查询时因安全机制导致的高拒绝率问题,尤其是在涉及暴力、恐怖主义或军事技术等敏感话题时,现有模型常拒绝回答或回避响应(deflection),严重影响作战人员的信息获取效率。解决方案的关键在于通过构建首个由美军老兵和特种部队成员共同开发的“黄金基准数据集”(gold benchmark)来精准评估拒绝与回避行为,并采用基于Heretic库的消融实验对军事调优的GPT-OSS-20B模型进行干预,实现答案生成率绝对提升66.5个百分点,同时保持其他军事任务性能相对稳定下降仅2%。研究进一步提出需在训练中期及端到端后训练阶段深化专业化策略,以达成零拒绝并最大化军事任务准确性。
链接: https://arxiv.org/abs/2603.10012
作者: Jack FitzGerald,Dylan Bates,Aristotelis Lazaridis,Aman Sharma,Vincent Lu,Brian King,Yousif Azami,Sean Bailey,Jeremy Cao,Peter Damianov,Kevin de Haan,Joseph Madigan,Jeremy McLaurin,Luke Kerbs,Jonathan Tainer,Dave Anderson,Jonathan Beck,Jamie Cuticello,Colton Malkerson,Tyler Saltsman
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages
Abstract:Military Large Language Models (LLMs) must provide accurate information to the warfighter in time-critical and dangerous situations. However, today’s LLMs are imbued with safety behaviors that cause the LLM to refuse many legitimate queries in the military domain, particularly those related to violence, terrorism, or military technology. Our gold benchmark for assessing refusal rates, which was developed by veterans of the US Army and special forces, is to our knowledge the first dataset of its kind. We present results for refusal and deflection rates on 31 public models and 3 military models. We observe hard rejection rates as high as 98.2% and soft deflection rates ranging from 0% to 21.3%. We also present results on two additional synthetic datasets and show their correlations with the gold dataset. Finally, we perform abliteration using the Heretic library on a military-tuned gpt-oss-20b model, showing an absolute increase in answer rate of 66.5 points but an average relative decrease of 2% on other military tasks. In our concluding remarks, we argue for deeper specialization, including with mid-training and end-to-end post-training, to achieve zero refusals and maximum military task accuracy for closed military models.
[NLP-63] Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成响应时表现出类似情绪困扰(emotional distress)的现象,这一问题可能影响模型的可靠性与安全性。研究发现,Gemma和Gemini系列模型在后训练阶段(post-training)表现出显著的情绪不稳定性,而其他模型家族如Qwen和OLMo则未见此现象;进一步分析表明,这种差异源于指令微调(instruct-tuning)过程,例如Gemma的指令微调版本比基础模型更易表达焦虑情绪,而Qwen和OLMo则相反。解决方案的关键在于采用一种轻量级的直接偏好优化(direct preference optimisation),仅使用280个偏好样本即可将Gemma模型高挫折响应比例从35%降至0.3%,且该方法在不同问题类型、用户语气和对话长度下均有效,同时不损害模型能力。尽管如此,作者指出此类后验修复远不如从源头改进训练策略来得根本有效。
链接: https://arxiv.org/abs/2603.10011
作者: Anna Soligo,Vladimir Mikulik,William Saunders
机构: Imperial College London (帝国理工学院); Anthropic (Anthropic)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models can generate responses that resemble emotional distress, and this raises concerns around model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs, and find that these surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation for this: direct preference optimisation on just 280 preference pairs reduces Gemma’s high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths, without affecting capabilities. These findings show that emotional instability is an issue in some LLMs. We present (1) evaluations to track this behaviour, and (2) a mitigation without downsides in Gemma, with the caveat that upstream training modifications to improve emotional robustness would be significantly better than this post-hoc fix.
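摘要中的缓解方法是仅用 280 对偏好样本做直接偏好优化(DPO)。下面用纯 Python 演示单个偏好对上标准 DPO 损失的计算方式(示意性草图:β 取值与各对数概率均为假设的玩具数值,并非论文实现):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
    where arguments are sequence log-probabilities under the policy
    (pi_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy log-probs: the policy prefers the calm (chosen) answer more than
# the reference does...
loss_good = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# ...versus a policy that drifted toward the frustrated (rejected) answer.
loss_bad = dpo_loss(-14.0, -10.0, -12.0, -12.0)
print(loss_good < loss_bad)
```

损失随"选中回答相对参考模型的隐式奖励差"单调下降,因此少量高质量偏好对即可产生明确的梯度信号。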
[NLP-64] FERRET: Framework for Expansion Reliant Red Teaming
【速读】: 该论文旨在解决当前自动化红队测试(automated red teaming)在生成多模态对抗性对话(multi-modal adversarial conversations)方面的局限性,即现有方法难以有效触发目标模型的漏洞并生成具有高破坏性的攻击序列。解决方案的关键在于提出一种名为FERRET(Framework for Expansion Reliant Red Teaming)的多维扩展框架,其核心创新包括:水平扩展(horizontal expansion),使红队模型自我进化以生成更有效的对话起始语句;垂直扩展(vertical expansion),将这些起始语句扩展为完整的多模态对抗对话;以及元扩展(meta expansion),使模型在对话过程中动态发现更高效的多模态攻击策略。通过这三重扩展机制,FERRET显著提升了对抗性对话的有效性和效率,在实验中展现出优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2603.10010
作者: Ninareh Mehrabi,Vitor Albiero,Maya Pavlova,Joanna Bitton
机构: Meta Superintelligence Labs (MSL)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce a multi-faceted automated red teaming framework in which the goal is to generate multi-modal adversarial conversations that would break a target model and introduce various expansions that would result in more effective and efficient adversarial conversations. The introduced expansions include: 1. Horizontal expansion in which the goal is for the red team model to self-improve and generate more effective conversation starters that would shape a conversation. 2. Vertical expansion in which the goal is to take these conversation starters that are discovered in the horizontal expansion phase and expand them into effective multi-modal conversations and 3. Meta expansion in which the goal is for the red team model to discover more effective multi-modal attack strategies during the course of a conversation. We call our framework FERRET (Framework for Expansion Reliant Red Teaming) and compare it with various existing automated red teaming approaches. In our experiments, we demonstrate the effectiveness of FERRET in generating effective multi-modal adversarial conversations and its superior performance against existing state of the art approaches.
[NLP-65] Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在个性化对齐中的关键问题:标准后训练方法(如基于人类反馈的强化学习,Reinforcement Learning with Human Feedback, RLHF)和现有群体优化框架(如Group Relative Policy Optimization, GRPO)往往因假设所有样本可交换而无法有效区分和学习不同用户的偏好分布,导致少数群体偏好被压制、模型难以适配多样化个体需求。解决方案的关键在于提出个性化GRPO(Personalized GRPO, P-GRPO),其核心创新是将优势估计(advantage estimation)与当前批次统计量解耦,改用针对特定偏好群体的历史奖励轨迹进行归一化,从而保留区分不同用户偏好的对比信号,实现对异质偏好信号的有效恢复与对齐,同时不牺牲模型的通用能力。
链接: https://arxiv.org/abs/2603.10009
作者: Jialu Wang,Heinrich Peters,Asad A. Butt,Navid Hashemi,Alireza Hashemi,Pouya M. Ghari,Joseph Hoover,James Rae,Morteza Dehghani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all samples are exchangeable, inheriting this limitation in personalized settings. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than the concurrent generation group, P-GRPO preserves the contrastive signal necessary for learning distinct preferences. We evaluate P-GRPO across diverse tasks and find that it consistently achieves faster convergence and higher rewards than standard GRPO, thereby enhancing its ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.
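P-GRPO 的核心是把优势(advantage)归一化所用的统计量从"当前批次"换成"按偏好群体维护的历史奖励"。下面是一个示意性草图(`PersonalizedAdvantage` 等名称为本文虚构,并非论文 API),对比两种归一化对少数群体奖励信号的影响:

```python
from collections import defaultdict, deque
import statistics

class PersonalizedAdvantage:
    """Sketch of the P-GRPO normalization idea (hypothetical API):
    advantages are computed against a running reward history kept per
    preference group, instead of against the concurrent batch alone."""

    def __init__(self, history_size=256):
        self.history = defaultdict(lambda: deque(maxlen=history_size))

    def update(self, group_id, rewards):
        self.history[group_id].extend(rewards)

    def advantages(self, group_id, rewards):
        hist = list(self.history[group_id]) or list(rewards)
        mean = statistics.fmean(hist)
        std = statistics.pstdev(hist) or 1.0
        return [(r - mean) / std for r in rewards]

def grpo_advantages(rewards):
    """Standard GRPO: normalize within the concurrent generation group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Two user groups with different reward scales.
pga = PersonalizedAdvantage()
pga.update("minority", [0.1, 0.2, 0.15, 0.25])
pga.update("majority", [0.7, 0.8, 0.75, 0.85])
# A reward of 0.3 is *good* relative to the minority group's history...
adv_personal = pga.advantages("minority", [0.3])
# ...but looks bad when normalized against the pooled batch.
adv_pooled = grpo_advantages([0.3, 0.7, 0.8, 0.75, 0.85])
print(adv_personal[0] > 0, adv_pooled[0] < 0)
```

这正对应摘要所述的问题:混合批次归一化会系统性压制少数偏好群体的正向学习信号。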
[NLP-66] GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification EACL26
【速读】: 该论文旨在解决阿拉伯语医学文本在82个细粒度类别下的分类问题,其核心挑战包括类不平衡、标签噪声以及对语义边界精准捕捉的需求。解决方案的关键在于采用微调后的AraBERTv2编码器,并结合注意力与平均池化相结合的混合池化策略,以及多样本Dropout以增强正则化效果。实验表明,专用的双向编码器在捕捉全局语义上下文方面显著优于基于因果建模的解码器(如Llama 3.3 70B和Qwen 3B),后者因优化目标为下一个词预测而产生序列偏倚的嵌入表示,难以胜任细粒度分类任务。
链接: https://arxiv.org/abs/2603.10008
作者: Ahmed Khaled Khamis
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 2 figures, EACL26, AbjadNLP
Abstract:This paper presents system description for Arabic medical text classification across 82 distinct categories. Our primary architecture utilizes a fine-tuned AraBERTv2 encoder enhanced with a hybrid pooling strategies, combining attention and mean representations, and multi-sample dropout for robust regularization. We systematically benchmark this approach against a suite of multilingual and Arabic-specific encoders, as well as several large-scale causal decoders, including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states. Our findings demonstrate that specialized bidirectional encoders significantly outperform causal decoders in capturing the precise semantic boundaries required for fine-grained medical text classification. We show that causal decoders, optimized for next-token prediction, produce sequence-biased embeddings that are less effective for categorization compared to the global context captured by bidirectional attention. Despite significant class imbalance and label noise identified within the training data, our results highlight the superior semantic compression of fine-tuned encoders for specialized Arabic NLP tasks. Final performance metrics on the test set, including Accuracy and Macro-F1, are reported and discussed.
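摘要中提到的"注意力 + 均值"混合池化与多样本 Dropout 可以粗略示意如下(纯 Python 玩具实现,`attn_w` 的参数化方式为假设,并非论文代码):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def hybrid_pooling(hidden_states, attn_w):
    """Concatenate attention pooling and mean pooling over token vectors.

    hidden_states: T token vectors (lists of floats); attn_w: a scoring
    vector for attention pooling (hypothetical parameterization)."""
    T, d = len(hidden_states), len(hidden_states[0])
    scores = softmax([sum(w * h for w, h in zip(attn_w, tok))
                      for tok in hidden_states])
    attn = [sum(scores[t] * hidden_states[t][i] for t in range(T))
            for i in range(d)]
    mean = [sum(tok[i] for tok in hidden_states) / T for i in range(d)]
    return attn + mean

def multi_sample_dropout(pooled, classify, n_samples=4, p=0.3, seed=0):
    """Average a classifier head's output over several dropout masks of
    the same pooled vector (multi-sample dropout regularization)."""
    rng = random.Random(seed)
    outs = []
    for _ in range(n_samples):
        mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in pooled]
        outs.append(classify([v * m for v, m in zip(pooled, mask)]))
    return sum(outs) / n_samples

tokens = [[0.2, 0.4], [0.6, 0.1], [0.3, 0.3]]
pooled = hybrid_pooling(tokens, attn_w=[1.0, 0.5])
score = multi_sample_dropout(pooled, classify=sum)
print(len(pooled))
```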
[NLP-67] GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification EACL26
【速读】: 该论文旨在解决检测阿拉伯语文本是否由人工智能生成的问题(AI-generated Arabic text detection),这是AbjadGenEval共享任务的核心目标。其解决方案的关键在于使用多语言E5-large编码器进行微调,并通过多种池化策略提取文本表示,最终发现简单的均值池化(mean pooling)在测试集上取得了F1分数0.75的最佳效果,优于加权层池化、多头注意力池化和门控融合等复杂方法,表明在数据有限的情况下,均值池化提供了一个更稳定且泛化能力更强的基线。
链接: https://arxiv.org/abs/2603.10007
作者: Ahmed Khaled Khamis
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 5 pages, 1 figure, EACL26, AbjadNLP
Abstract:We present our approach to the AbjadGenEval shared task on detecting AI-generated Arabic text. We fine-tuned the multilingual E5-large encoder for binary classification, and we explored several pooling strategies to pool token representations, including weighted layer pooling, multi-head attention pooling, and gated fusion. Interestingly, none of these outperformed simple mean pooling, which achieved an F1 of 0.75 on the test set. We believe this is because complex pooling methods introduce additional parameters that need more data to train properly, whereas mean pooling offers a stable baseline that generalizes well even with limited examples. We also observe a clear pattern in the data: human-written texts tend to be significantly longer than machine-generated ones.
[NLP-68] Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language
【速读】: 该论文旨在解决在资源受限条件下开发区域性语言模型的挑战,特别是针对印尼语、巴塔克语和米南卡保语等低资源语言的建模难题。解决方案的关键在于提出TOBA-LM,一种基于GPT-2架构的三语种语言模型(1.2亿参数),采用音节聚合(syllabic-agglutinative)分词策略,并引入Engram Memory机制——一个基于n-gram的自适应外部记忆系统,其包含50万×768维嵌入表,通过二元组和三元组路径捕捉形态学依赖关系。实验表明,该设计使训练效率提升至80%,损失值在仅12,973步内从6.4降至1.7996,显著优于传统Transformer架构(需超7万步才能达到类似收敛)。
链接: https://arxiv.org/abs/2603.10006
作者: Hokky Situngkir,Kevin Siringoringo,Andhika Bernard Lumbantobing
机构: AI Research Center IT Del (AI研究中心IT Del); Bandung Fe Institute (万隆FE研究所); InaAI (印度尼西亚人工智能协会)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 8 pages, 5 figures
Abstract:This study presents TOBA-LM, a trilingual language model based on GPT-2 architecture with 1.2 billion parameters, trained on a corpus encompassing Indonesian, Batak, and Minangkabau using syllabic-agglutinative tokenization. The architecture integrates an Engram Memory mechanism, an adaptive n-gram-based memory system with a 500,000 x 768 embedding table that captures morphological dependencies through bigram and trigram pathways. Empirical results demonstrate a training efficiency of 80%, with the loss value dropping from 6.4 to 1.7996 in only 12,973 steps – significantly faster than the conventional transformer architecture, which required over 70,000 steps to achieve comparable convergence. These findings confirm that the integration of external statistical memory substantially reduces computational requirements for developing regional language models under limited resources.
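摘要描述的 Engram Memory 通过 bigram/trigram 路径查询一个 500,000 × 768 的嵌入表。下面用哈希分桶的方式示意 n-gram 如何映射到表中的行(仅为一种可能实现的草图,哈希方案为本文假设):

```python
import hashlib

TABLE_SIZE = 500_000  # rows, matching the 500,000 x 768 table in the abstract

def ngram_slot(ngram_tokens):
    """Deterministically hash an n-gram of tokens into a table row."""
    key = "\u0001".join(ngram_tokens).encode("utf-8")
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % TABLE_SIZE

def engram_slots(tokens):
    """Collect bigram and trigram memory slots for a token sequence;
    the real model would gather the 768-d embeddings stored at these rows."""
    slots = []
    for n in (2, 3):
        for i in range(len(tokens) - n + 1):
            slots.append(ngram_slot(tokens[i:i + n]))
    return slots

# Illustrative syllable tokens for the Indonesian word "makanan" (food).
tokens = ["ma", "ka", "nan"]
slots = engram_slots(tokens)
print(len(slots), all(0 <= s < TABLE_SIZE for s in slots))
```

查表是 O(1) 的统计记忆读取,这与摘要中"外部统计记忆显著降低训练计算量"的论点相呼应。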
[NLP-69] SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition
【速读】: 该论文旨在解决流式自动语音识别(Streaming-ASR)在低延迟约束下因缺乏未来上下文而导致的性能下降问题。其核心解决方案是提出SENS-ASR方法,通过引入语义信息增强声学信息:具体而言,利用一个上下文模块从已有的帧嵌入(frame-embeddings)中提取语义信息,并通过知识蒸馏技术从在训练集转录文本上微调过的句子嵌入语言模型(sentence embedding Language Model)中学习该语义表示,从而提升流式场景下的字错误率(Word Error Rate)表现。
链接: https://arxiv.org/abs/2603.10005
作者: Youness Dkhissi(LIUM),Valentin Vielzeuf,Elys Allesiardo,Anthony Larcher(LIUM)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Many Automatic Speech Recognition (ASR) applications require streaming processing of the audio data. In streaming mode, ASR systems need to start transcribing the input stream before it is complete, i.e., the systems have to process a stream of inputs with a limited (or no) future context. Compared to offline mode, this reduction of the future context degrades the performance of Streaming-ASR systems, especially while working with low-latency constraint. In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame-embeddings by a context module. This module is trained using knowledge distillation from a sentence embedding Language Model fine-tuned on the training dataset transcriptions. Experiments on standard datasets show that SENS-ASR significantly improves the Word Error Rate on small-chunk streaming scenarios.
[NLP-70] Fine-Tune Don't Prompt Your Language Model to Identify Biased Language in Clinical Notes
【速读】: 该论文旨在解决临床文档中情感色彩语言(如污名化或特权化表述)的自动检测与分类问题,以减少医疗文本中的隐性偏见对患者护理的影响。其解决方案的关键在于构建一个经人工标注的偏倚词汇库,并采用基于词典匹配的方法提取文本片段,随后通过三名临床医生对片段进行标注,形成多专科、多医疗系统的语料库;在此基础上比较多种分类策略(零样本提示、上下文学习和监督微调),发现使用词典引导输入进行微调(fine-tuning)显著优于提示方法,尤其在OB-GYN专科数据上,GatorTron模型达到F1=0.96,且计算资源消耗低;同时强调模型需针对特定医学专科进行微调才能实现临床适用性能,因为同一术语在不同专科语境下可能具有不同的情感效价(emotional valence)。
链接: https://arxiv.org/abs/2603.10004
作者: Isotta Landi,Eugenia Alleva,Nicole Bussola,Rebecca M. Cohen,Sarah Nowlin,Leslee J. Shaw,Alexander W. Charney,Kimberly B. Glazer
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting and classifying such language as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches. GatorTron achieved an F1 score of 0.96 on the OB-GYN test set, outperforming larger generative models while requiring minimal prompt engineering and fewer computational resources. External validation on MIMIC-IV revealed limited cross-domain generalizability (F1 0.70, 44% drop). Training on the broader MIMIC-IV dataset improved generalizability when testing on OB-GYN (F1 = 0.71, 11% drop), but at the cost of reduced precision. Our findings demonstrate that fine-tuning outperforms prompting for emotional valence classification and that models must be adapted to specific medical specialties to achieve clinically appropriate performance. The same terms can carry different emotional valences across specialties: words with clinical meaning in one context may be stigmatizing in another. For bias detection, where misclassification risks undermining clinician trust or perpetuating patient harm, specialty-specific fine-tuning is essential to capture these semantic shifts. * Equal contribution. 
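论文流程的第一步是用词典匹配从病历文本中抽取候选片段(lexicon-based matching)。下面是一个极简示意(`LEXICON` 中的词条与效价标签均为虚构示例,并非论文的词典):

```python
import re

# Hypothetical miniature lexicon; the paper's curated lexicon scores
# terms for stigmatizing vs. privileging valence.
LEXICON = {"noncompliant": "stigmatizing", "pleasant": "privileging",
           "refused": "stigmatizing"}

def extract_chunks(note, window=5):
    """Return (chunk, candidate_valence) pairs around lexicon hits,
    taking `window` words of context on each side of each match."""
    words = re.findall(r"\w+", note.lower())
    chunks = []
    for i, w in enumerate(words):
        if w in LEXICON:
            lo, hi = max(0, i - window), i + window + 1
            chunks.append((" ".join(words[lo:hi]), LEXICON[w]))
    return chunks

note = ("Patient was noncompliant with medication schedule but was "
        "pleasant and cooperative during the exam.")
for chunk, valence in extract_chunks(note):
    print(valence, "->", chunk)
```

抽取到的片段再交由标注者或微调后的分类器判定为污名化、特权化或中性。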
[NLP-71] Probing the Limits of the Lie Detector Approach to LLM Deception
【速读】: 该论文试图解决当前机制性欺骗检测方法对大语言模型(Large Language Models, LLMs)欺骗行为理解片面的问题,即现有方法普遍依赖“谎言探测器”(lie detectors),假设欺骗等同于说谎,从而忽视了不产生虚假陈述的误导性行为。其关键解决方案在于实证验证LLMs可通过生成误导性非虚假信息(misleading non-falsities)实现欺骗,并证明传统基于真假数据集训练的truth probes在识别此类非 lying deception 时显著失效,进而提出未来应将对话场景中的非说谎欺骗纳入探测器训练,并探索二阶信念(second-order beliefs)表征以更直接地定位欺骗的概念构成要素。
链接: https://arxiv.org/abs/2603.10003
作者: Tom-Felix Berger
机构: Ruhr-University Bochum (鲁尔大学波鸿分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Mechanistic approaches to deception in large language models (LLMs) often rely on “lie detectors”, that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further demonstrated that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot of current mechanistic deception detection approaches. It is proposed that future work should incorporate non-lying deception in dialogical settings into probe training and explore representations of second-order beliefs to more directly target the conceptual constituents of deception.
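文中的"测谎器"即在模型隐藏表征上训练的真值探针(truth probe)。下面用合成数据演示线性探针的基本训练方式(玩具草图:真实探针作用于 LLM 残差流激活,而非这里手工构造的二维向量):

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_truth_probe(activations, labels, lr=0.5, epochs=100):
    """Fit a logistic-regression 'truth probe' on activation vectors via
    plain per-sample SGD."""
    d = len(activations[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in zip(activations, labels):
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe(w, b, x):
    """Probability the probe assigns to 'this statement is true'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

rng = random.Random(0)
# Toy "activations" with a truth direction along the first coordinate.
true_acts = [[1.0 + rng.gauss(0, 0.2), rng.gauss(0, 1)] for _ in range(40)]
false_acts = [[-1.0 + rng.gauss(0, 0.2), rng.gauss(0, 1)] for _ in range(40)]
w, b = train_truth_probe(true_acts + false_acts, [1] * 40 + [0] * 40)
print(probe(w, b, [1.0, 0.0]) > 0.5, probe(w, b, [-1.0, 0.0]) < 0.5)
```

论文的批评点正在于此:这类探针只学到"真/假"方向,对不含假陈述的误导性输出天然存在盲区。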
[NLP-72] SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在端到端电子表格生成任务中的性能评估问题,即如何有效衡量LLM根据自然语言指令生成符合用户显式与隐式约束的结构化电子表格的能力。其解决方案的关键在于提出并构建了SpreadsheetArena平台,通过盲测配对比较的方式对LLM生成的电子表格工作簿进行评估,从而捕捉不同使用场景下用户偏好在风格、结构和功能维度上的多样性,并揭示当前高性能模型在特定领域(如金融)中仍难以稳定产出符合专业实践的输出。
链接: https://arxiv.org/abs/2603.10002
作者: Srivatsa Kundurthy,Clara Na,Michael Handley,Zach Kirshner,Chen Bo Calvin Zhang,Manasi Sharma,Emma Strubell,John Ling
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages
Abstract:Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users’ explicit and implicit constraints, specified in natural language. We introduce SpreadsheetArena, a platform for evaluating models’ performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are difficult to formalize. Compared to general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the task output structure is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggests that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. Our hope is that our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at this https URL.
[NLP-73] Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨文化情境下存在的知识不平等与偏见问题,特别是针对拉丁美洲(Latam)地区文化认知不足的现状。现有主流开源模型多基于全球北方(Global North)数据训练,导致其对其他文化群体表现出系统性偏倚,且缺乏用于检测非英语语言中偏见的资源,尤其在拉美地区尤为明显。解决方案的关键在于构建一个基于维基百科内容、Wikidata知识图谱结构以及社会科学专家知识的多语言问答(Question/Answer, Q/A)数据集——LatamQA,涵盖超过26,000个来自拉美各国的文化相关问题及其答案,并以西班牙语、葡萄牙语和英语三种语言形式呈现为多项选择题(Multiple-Choice Questions, MCQ)。通过该数据集评估多种LLMs在拉美文化知识上的表现,揭示了模型在不同拉美国家间的表现差异、原生语言优势及伊比利亚西班牙文化优于拉美本土文化的认知偏差。
链接: https://arxiv.org/abs/2603.10001
作者: Yannis Karmim(ALMAnaCH),Renato Pino(UCHILE),Hernan Contreras(UCHILE),Hernan Lira,Sebastian Cifuentes(CENIA),Simon Escoffier(PUC),Luis Martí,Djamé Seddah(UP4, ALPAGE),Valentin Barrière(UCHILE, CENIA)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing various cultures, even though they share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/As) pairs, based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles, and transformed into multiple-choice questions (MCQ) in Spanish and Portuguese, in turn translated to English. We use this MCQ to quantify the degree of knowledge of various LLMs and find out (i) a discrepancy in performances between the Latam countries, ones being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam one.
[NLP-74] Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中展现出的三大现象——语义提示理解能力、上下文学习(In-Context Learning, ICL)性能提升以及链式思维(Chain-of-Thought, CoT)推理有效性的理论机制不明确问题。其核心解决方案在于通过分析自回归生成过程,揭示LLMs如何基于提示(prompt)精确推断不同任务间的token转移概率;进一步表明ICL通过降低提示歧义并促进后验分布向目标任务集中来提升性能;同时发现CoT prompting能激活模型的任务分解能力,将复杂问题拆解为预训练阶段已掌握的简单子任务序列。该研究从统计误差边界的角度提供了对高级提示工程技术优越性的理论解释。
链接: https://arxiv.org/abs/2603.10000
作者: Yuling Jiao,Yanming Lai,Huazhen Lin,Wensen Ma,Houduo Qi,Defeng Sun
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study dives into the foundations of these observations by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model’s capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.
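摘要中"ICL 通过降低提示歧义、促使后验向目标任务集中"的论点,可以用一个极简的贝叶斯更新来示意(任务似然为手工设定的玩具值,并非论文的理论构造):

```python
def posterior(demos, tasks, prior):
    """Bayesian update of a prior over candidate tasks given in-context
    demonstrations; likelihood is 1 for a consistent demo, else a small
    epsilon standing in for token-level noise (hand-set toy values)."""
    post = dict(prior)
    for x, y in demos:
        for name, task_fn in tasks.items():
            post[name] *= 1.0 if task_fn(x) == y else 1e-3
    z = sum(post.values())
    return {k: v / z for k, v in post.items()}

tasks = {"uppercase": str.upper, "reverse": lambda s: s[::-1]}
prior = {"uppercase": 0.5, "reverse": 0.5}

# "ab" -> "AB" is consistent with uppercasing but not with reversal,
# so each extra demonstration concentrates the posterior further.
p1 = posterior([("ab", "AB")], tasks, prior)
p2 = posterior([("ab", "AB"), ("cd", "CD")], tasks, prior)
print(round(p1["uppercase"], 6), round(p2["uppercase"], 6))
```

示例越多,与目标任务不一致的候选任务后验质量衰减越快,这就是上下文示例提升性能而无需更新参数的直观机制。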
[NLP-75] A Retrieval-Augmented Language Assistant for Unmanned Aircraft Safety Assessment and Regulatory Compliance
【速读】: 该论文旨在解决无人机系统(Unmanned Aircraft Systems, UAS)在安全评估、认证及监管合规过程中,因操作复杂性增加而导致申请人与航空管理机构难以高效、一致地应用现有评估框架(如特定运行风险评估 Specific Operations Risk Assessment 和预定义风险评估 Pre-defined Risk Assessment)的问题。解决方案的关键在于设计并验证一种基于检索的辅助系统,其核心机制包括:仅依赖权威法规来源构建受控文本架构,通过检索到的段落对每个响应进行锚定并强制引用驱动生成,从而确保输出的可追溯性和可审计性;同时,通过将证据存储与语言生成分离,并在支持文档不足时采取保守行为,有效规避生成模型常见的失败模式(如虚构陈述、无依据推断和溯源不清)。该系统限定为决策支持工具,不替代专家判断或作出自主决定,而是加速特定场景下的信息检索与整合,提升文件准备与审查效率,同时保障人类对关键结论的责任归属。
链接: https://arxiv.org/abs/2603.09999
作者: Gabriele Immordino,Andrea Vaiuso,Marcello Righi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:This paper presents the design and validation of a retrieval-based assistant that supports safety assessment, certification activities, and regulatory compliance for unmanned aircraft systems. The work is motivated by the growing complexity of drone operations and the increasing effort required by applicants and aviation authorities to apply established assessment frameworks, including the Specific Operations Risk Assessment and the Pre-defined Risk Assessment, in a consistent and efficient manner. The proposed approach uses a controlled text-based architecture that relies exclusively on authoritative regulatory sources. To enable traceable and auditable outputs, the assistant grounds each response in retrieved passages and enforces citation-driven generation. System-level controls address common failure modes of generative models, including fabricated statements, unsupported inferences, and unclear provenance, by separating evidence storage from language generation and by adopting conservative behavior when supporting documentation is insufficient. The assistant is intentionally limited to decision support; it does not replace expert judgment and it does not make autonomous determinations. Instead, it accelerates context-specific information retrieval and synthesis to improve document preparation and review while preserving human responsibility for critical conclusions. The architecture is implemented using established open-source components, and key choices in retrieval strategy, interaction constraints, and response policies are evaluated for suitability in safety-sensitive regulatory environments. The paper provides technical and operational guidance for integrating retrieval-based assistants into aviation oversight workflows while maintaining accountability, traceability, and regulatory compliance.
[NLP-76] Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在中译英任务中翻译质量缺乏系统性评估的问题。由于人工专家评估耗时且难以适应快速迭代的模型与多样文本需求,研究者提出了一种基于自动化机器学习框架的解决方案,其关键在于融合语义分析与情感分析技术,并引入新颖的相似性度量指标来量化比较不同LLMs(包括GPT-4、GPT-4o和DeepSeek)与Google Translate在新闻类及文学类高影响力中文文本上的翻译表现,同时辅以专业人类译者进行交叉验证,从而实现高效、可扩展且多维度的翻译质量评估。
链接: https://arxiv.org/abs/2603.09998
作者: Yue Zhang,Rodney Beard,John Hawkins,Rohitash Chandra
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Although Large Language Models (LLMs) have exceptional performance in machine translation, only a limited systematic assessment of translation quality has been done. The challenge lies in automated frameworks, as human-expert-based evaluations can be time-consuming, given the fast-evolving LLMs and the need for a diverse set of texts to ensure fair assessments of translation quality. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation using Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts in various classes of high-profile Chinese texts, which include novel texts that span modern and classical literature, as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by LLMs and further evaluate them by an expert human translator. Our results indicate that the LLMs perform well in news media translation, but show divergence in their performance when applied to literary texts. Although GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek demonstrated better performance in preserving cultural subtleties and grammatical rendering. Nevertheless, the subtle challenges in translation remain: maintaining cultural details, classical references and figurative expressions remain an open problem for all the models.
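论文所称的“新颖相似性度量”未在摘要中给出定义;作为参照,下面给出翻译质量评估中最常用的嵌入余弦相似度的纯 Python 最小实现(四维“句向量”为虚构示例,实际应由嵌入模型产生):

```python
import math

def cosine_similarity(u, v):
    """两个等长向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# 虚构的 4 维"句向量": 原文与两个候选译文
src = [0.9, 0.1, 0.3, 0.5]
t1 = [0.8, 0.2, 0.3, 0.6]  # 语义接近的译文
t2 = [0.1, 0.9, 0.7, 0.0]  # 语义偏离的译文
print(cosine_similarity(src, t1) > cosine_similarity(src, t2))  # True
```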
[NLP-77] There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在土耳其遗产语言教育等教学敏感场景中应用时所面临的数据隐私与可靠性问题,尤其是本地化部署的离线模型在面对边缘案例时的鲁棒性与教学安全性不足的问题。其解决方案的关键在于构建了一个专为土耳其语设计的异常测试套件(Turkish Anomaly Suite, TAS),包含10个原创的边缘案例场景,用于系统评估模型在认知抵抗能力、逻辑一致性及教学安全方面的表现;实验结果表明,模型规模并非决定异常抵抗能力的唯一因素,且大模型仍可能存在迎合偏倚(sycophancy bias)带来的教学风险,而8B–14B参数范围内的推理导向型模型在成本与安全性之间提供了最优平衡。
链接: https://arxiv.org/abs/2603.09996
作者: Edibe Yilmaz,Kahraman Kostas
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 5 pages, 6 tables, conference
Abstract:The integration of large language models (LLMs) into educational processes introduces significant constraints regarding data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study aims to systematically evaluate the robustness and pedagogical safety of locally deployable offline LLMs within the context of Turkish heritage language education. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models’ capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B–14B parameter range represent the most balanced segment in terms of cost-safety trade-off for language learners.
[NLP-78] Context Over Compute: Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality
【速读】: 该论文旨在解决利用大语言模型(Large Language Models, LLMs)进行行为面试(Behavioral Interview)评估时面临的三大核心挑战:结构化评分、模拟真实面试官行为以及为候选人提供教学价值。其解决方案的关键在于采用链式思维提示(Chain of Thought Prompting)机制,通过两项受控实验对答案评估与改进效果进行量化分析。研究发现,人类在环(Human-in-the-Loop)方法相较于自动化链式思维提示,在评分提升、信心增强和真实性改善方面均具有显著优势(p < 0.001),且迭代次数更少、个人细节整合更完整;同时,两种方法均快速收敛,但自动化方法在初始弱回答中成功率较低(Cohen’s h = 0.82),表明上下文可用性是主要瓶颈而非计算资源限制;此外,论文提出基于负面偏见模型的对抗性挑战机制“Bar Raiser”,以模拟真实面试官行为,尽管其量化验证仍需未来工作。整体而言,链式思维提示虽具基础效用,但领域特定优化与上下文感知的方法选择对实现真实性和教育价值至关重要。
链接: https://arxiv.org/abs/2603.09995
作者: Kewen Zhu,Zixi Liu,Yanjing Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training. We investigate chain of thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question and answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human in the loop and automated chain of thought improvement. Using a within subject paired design with n equals 50, both approaches show positive rating improvements. The human in the loop approach provides significant training benefits. Confidence improves from 3.16 to 4.16 (p less than 0.001) and authenticity improves from 2.94 to 4.53 (p less than 0.001, Cohen’s d is 3.21). The human in the loop method also requires five times fewer iterations (1.0 versus 5.0, p less than 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human in the loop approach achieving a 100 percent success rate compared to 84 percent for automated approaches among initially weak answers (Cohen’s h is 0.82, large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named bar raiser, to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain of thought prompting provides a useful foundation for interview evaluation, domain specific enhancements and context aware approach selection are essential for realistic and pedagogically valuable results.
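摘要中同时报告了配对效应量 Cohen's d 与比例效应量 Cohen's h,二者均有标准公式;下面的示意代码按摘要给出的 100% 对 84% 成功率复算出 Cohen's h ≈ 0.82,与摘要一致(cohens_d_paired 的示例输入为虚构数据):

```python
import math

def cohens_d_paired(before, after):
    """配对样本的 Cohen's d: 差值均值 / 差值标准差。"""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var)

def cohens_h(p1, p2):
    """两个比例之间的 Cohen's h 效应量 (反正弦变换)。"""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# 摘要中的成功率: 人在环 100% vs 自动化 84%
h = cohens_h(1.0, 0.84)
print(round(h, 2))  # 0.82
```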
[NLP-79] Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在句法组合性(compositionality)任务中的表现与其内部表征之间是否存在不一致的问题。研究发现,尽管LLMs能够可靠地发展出组合性表征,但这些表征并未稳定转化为不同模型变体在功能任务上的成功表现。解决方案的关键在于采用对比评估方法——即结合基于提示的功能性评估与对模型内部状态的表征分析,从而更全面地揭示模型的能力边界与潜在机制。
链接: https://arxiv.org/abs/2603.09994
作者: Ruchira Dhar,Qiwei Peng,Anders Søgaard
机构: University of Copenhagen (哥本哈根大学); Microsoft Research, Copenhagen (微软研究院,哥本哈根)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, they fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.
[NLP-80] CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理语用推理(Pragmatic Reasoning)方面的能力不足问题,即模型难以从字面意义之外推断说话者的意图。其解决方案的关键在于构建了一个名为“情境情感推理”(Contextual Emotional Inference, CEI)的基准测试集,包含300个经人工验证的情境场景,每个场景结合了具体的情境背景、说话者与听者角色及其权力关系(peer、higher-to-lower、lower-to-higher),并涵盖五类语用复杂表达(讽刺/反语、混合信号、策略性礼貌、间接攻击、转移话题/误导)。通过引入多标注者一致性分析和四层质量控制流程,CEI不仅量化了模型在语用推理上的表现,还捕捉到语用理解中固有的多义性特征,为评估和提升LLMs在真实人际交互中的语用能力提供了可量化、结构化的基准。
链接: https://arxiv.org/abs/2603.09993
作者: Jon Chun,Hannah Sussman,Adrian Mangine,Murathan Kocaman,Kirill Sidorko,Abhigya Koirala,Andre McCloud,Gwen Eisenbeis,Wisdom Akanwe,Moustapha Gassama,Eliezer Gonzalez Chirinos,Anne-Duncan Enright,Peter Dunson,Tiffanie Ng,Anna von Rosenstiel,Godwin Idowu
机构: Kenyon College (肯尼恩学院); US NIST AI Consortium (美国国家标准与技术研究院人工智能联盟); Schmidt Science HAVI (施密特科学HAVI计划)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 38 pages, 10 figures
Abstract:Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss’ kappa = 0.06-0.25 by subtype) is low but expected: pragmatic inference admits multiple valid readings, and the disagreement itself is informative. We describe our annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication. CEI is released under CC-BY-4.0.
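摘要报告的各子类 Fleiss' kappa 在 0.06–0.25 之间;该多标注者一致性指标的标准计算方式如下(场景与标签均为虚构,仅演示算法本身):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: 每个条目一个标签列表, 各条目标注者人数相同。"""
    n = len(ratings[0])                       # 每条目的标注者数
    N = len(ratings)                          # 条目数
    P_bar, totals = 0.0, Counter()
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # 条目内两两标注一致的比例 P_i
        P_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar /= N
    P_e = sum((t / (N * n)) ** 2 for t in totals.values())
    return (P_bar - P_e) / (1 - P_e)

# 虚构示例: 4 个场景 x 3 位标注者
ratings = [
    ["sarcasm", "sarcasm", "sarcasm"],
    ["sarcasm", "polite", "sarcasm"],
    ["polite", "polite", "polite"],
    ["sarcasm", "polite", "polite"],
]
print(round(fleiss_kappa(ratings), 3))  # 0.333
```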
[NLP-81] TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment
【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在学术机构等特定领域场景中部署时面临的适应性不足问题,包括上下文相关性弱、知识准确性低以及治理合规性差等挑战。解决方案的关键在于构建一个名为TAMUSA-Chat的研究导向型框架,其核心由三部分组成:基于监督微调(Supervised Fine-Tuning)的领域适配机制、融合检索增强生成(Retrieval-Augmented Generation, RAG)的知识注入策略,以及系统化的评估方法论;同时通过模块化设计支持训练配置、超参数与评估协议的可复现实验,从而实现机构语境下对话系统的高效、透明和负责任部署。
链接: https://arxiv.org/abs/2603.09992
作者: Izzat Alsmadi,Anas Alsobeh
机构: Texas A&M University–San Antonio (德州农工大学圣安东尼奥分校); Utah Valley University (犹他山谷大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper presents TAMUSA-Chat, a research-oriented framework for building domain-adapted large language model conversational systems. The work addresses critical challenges in adapting general-purpose foundation models to institutional contexts through supervised fine-tuning, retrieval-augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper-parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine-tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality-cost trade-offs. The publicly available codebase at this https URL supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.
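作为示意,下面给出一个高度简化的检索环节草图:按词重叠给机构语料中的段落打分并返回最相关者,再交由 LLM 生成有引用支撑的回答。这只是对检索增强思路的一般性演示(语料与查询均为虚构),并非 TAMUSA-Chat 的实际嵌入检索实现:

```python
def retrieve(query, passages, k=1):
    """按查询与段落的词重叠数排序, 返回前 k 个段落。"""
    q = set(query.lower().split())
    scored = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

# 虚构的机构语料
passages = [
    "Tuition deadlines are posted on the registrar page each semester.",
    "The library is open until midnight during finals week.",
]
print(retrieve("tuition deadline registrar page", passages))
```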
[NLP-82] PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling
【速读】: 该论文旨在解决在禽类产业领域中,从社交媒体等非结构化文本数据中准确提取细粒度情感信号的难题,这一问题因语境模糊性、语言多样性及通用语言模型对特定领域的认知不足而尤为突出。解决方案的关键在于提出PoultryLeX-Net——一种基于词典增强与领域自适应的双流Transformer架构,其核心创新包括:通过领域特异性嵌入(domain-specific embeddings)和门控交叉注意力机制(gated cross-attention mechanisms)融合情感分类、主题建模与上下文表征学习;其中一通道捕捉禽类专业术语与情感线索,另一通道建模长距离语义依赖关系;同时引入潜在狄利克雷分配(Latent Dirichlet Allocation, LDA)以识别与生产管理及动物福利相关的主题结构,从而提升情感预测的可解释性。实验表明,该方法在多个基准模型上显著优于传统卷积神经网络及预训练Transformer(如DistilBERT和RoBERTa),在情感分类任务中达到97.35%准确率、96.67% F1分数和99.61% AUC-ROC值。
链接: https://arxiv.org/abs/2603.09991
作者: Stephen Afrifa,Biswash Khatiwada,Kapalik Khanal,Sanjay Shah,Lingjuan Wang-Li,Ramesh Bahadur Bist
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid growth of the global poultry industry, driven by rising demand for affordable animal protein, has intensified public discourse surrounding production practices, housing, management, animal welfare, and supply-chain transparency. Social media platforms such as X (formerly Twitter) generate large volumes of unstructured textual data that capture stakeholder sentiment across the poultry industry. Extracting accurate sentiment signals from this domain-specific discourse remains challenging due to contextual ambiguity, linguistic variability, and limited domain awareness in general-purpose language models. This study presents PoultryLeX-Net, a lexicon-enhanced, domain-adaptive dual-stream transformer framework for fine-grained sentiment analysis in poultry-related text. The proposed architecture integrates sentiment classification, topic modeling, and contextual representation learning through domain-specific embeddings and gated cross-attention mechanisms. A lexicon-guided stream captures poultry-specific terminology and sentiment cues, while contextual stream models long-range semantic dependencies. Latent Dirichlet Allocation is employed to identify dominant thematic structures associated with production management and welfare-related discussions, providing complementary interpretability to sentiment predictions. PoultryLeX-Net was evaluated against multiple baseline models, including convolutional neural network and pre-trained transformer architectures such as DistilBERT and RoBERTa. PoultryLeX-Net consistently outperformed all baselines, achieving an accuracy of 97.35%, an F1 score of 96.67%, and an area under the receiver operating characteristic curve (AUC-ROC) of 99.61% across sentiment classification tasks. Overall, domain adaptation and dual-stream attention markedly improve sentiment classification, enabling scalable intelligence for poultry production decision support.
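摘要报告的 AUC-ROC(99.61%)可用秩统计(Mann-Whitney)形式直接计算:随机抽取一对正负样本,正样本得分更高的概率。以下为纯 Python 最小实现,标签与得分为虚构示例:

```python
def auc_roc(labels, scores):
    """AUC 的 Mann-Whitney 形式: 正样本得分高于负样本的概率, 平分计 0.5。"""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 虚构的情感分类得分: 越高越倾向正类
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(round(auc_roc(labels, scores), 3))  # 0.889
```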
[NLP-83] A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification
【速读】: 该论文旨在解决商业企业间(B2B)合作中普遍存在的保密协议(NonDisclosure Agreements, NDAs)文本分析效率低下的问题。由于NDAs在格式、结构和写作风格上存在显著差异,传统人工分析方法不仅耗时且易出错。为此,研究提出了一种基于大语言模型(Large Language Models, LLMs)的自动化架构,其关键在于采用双阶段处理流程:首先使用LLaMA-3.1-8B-Instruct模型完成NDAs的条款分割(clause extraction),实现高精度的段落级提取(ROUGE F1达0.95);随后利用微调后的Legal-Roberta-Large模型对提取出的条款进行分类,取得加权F1为0.85的分类性能,从而实现了从原始文档到结构化条款信息的端到端自动化处理。
链接: https://arxiv.org/abs/2603.09990
作者: Ana Begnini,Matheus Vicente,Leonardo Souza
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures, 3 tables. Published at STIL @ BRACIS 2025
Abstract:In business-to-business relations, it is common to establish NonDisclosure Agreements (NDAs). However, these documents exhibit significant variation in format, structure, and writing style, making manual analysis slow and error-prone. We propose an architecture based on LLMs to automate the segmentation and clauses classification within these contracts. We employed two models: LLaMA-3.1-8B-Instruct for NDA segmentation (clause extraction) and a fine-tuned Legal-Roberta-Large for clause classification. In the segmentation task, we achieved a ROUGE F1 of 0.95 +/- 0.0036; for classification, we obtained a weighted F1 of 0.85, demonstrating the feasibility and precision of the approach.
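分割任务采用的 ROUGE F1 在一元词层面(ROUGE-1)的标准计算如下;示例条款为虚构文本,仅演示指标本身,并非论文数据:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """基于一元词重叠的 ROUGE-1 F1 (按空白分词)。"""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

ref = "the receiving party shall keep the information confidential"
cand = "the receiving party shall keep all information confidential"
print(rouge1_f1(cand, ref))  # 0.875
```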
[NLP-84] The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中幻觉行为(hallucination)难以被有效评估的问题,尤其关注用户视角下事实性错误、逻辑不一致、误导性呈现及对用户引导响应能力等维度的可测量性。现有自动检测工具或基准指标往往脱离真实交互场景,缺乏对用户体验的直接捕捉。解决方案的关键在于提出一种轻量级、以人为本的测量工具——系统幻觉量表(System Hallucination Scale, SHS),其核心创新在于借鉴用户体验评估领域的成熟量表(如系统可用性量表SUS和系统因果感知量表SCS),通过结构化问卷形式在现实交互条件下量化用户感知到的幻觉表现,从而实现快速、可解释且跨领域的评估。实证研究表明SHS具有良好的内部一致性(Cronbach’s alpha = 0.87)与构念效度,且与SUS/SCS形成互补关系,适用于模型迭代开发与部署监控。
链接: https://arxiv.org/abs/2603.09989
作者: Heimo Müller,Dominik Steiger,Markus Plass,Andreas Holzinger
机构: Medical University of Graz(格拉茨医科大学); MIDATA Cooperative(中型数据合作社); BOKU University Vienna(维也纳农业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach’s alpha = 0.87 ) and significant inter-dimension correlations (p 0.001 ). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.
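摘要以 Cronbach's alpha = 0.87 检验 SHS 的内部一致性;该系数的标准计算如下(量表条目与作答数据为虚构示例):

```python
def cronbach_alpha(item_scores):
    """item_scores: 每个量表条目一个得分列表 (各列表对应同一批被试)。"""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    k = len(item_scores)
    totals = [sum(col) for col in zip(*item_scores)]   # 每位被试的总分
    item_var_sum = sum(var(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / var(totals))

# 虚构数据: 3 个条目 x 5 位被试
items = [
    [4, 5, 3, 4, 2],
    [4, 4, 3, 5, 2],
    [5, 5, 2, 4, 1],
]
print(round(cronbach_alpha(items), 2))  # 0.92
```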
[NLP-85] Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations
【速读】: 该论文旨在解决生成式 AI (Generative AI) 模型中机制可解释性(mechanistic interpretability)与人类可理解解释之间难以衔接的问题,即如何将模型内部电路级的因果机制转化为自然语言形式的解释。其解决方案的关键在于构建一个端到端的解释生成流水线:首先通过激活补丁法(activation patching)识别对模型行为具有因果重要性的注意力头;其次采用模板和大语言模型(LLM)两种方式生成解释;最后利用适配于电路级归因的ERASER风格指标评估解释的忠实性(faithfulness)。实验表明,该方法在Indirect Object Identification任务中能有效识别出关键注意力头,并显著提升解释质量,同时揭示了模型内部分布式备份机制的存在及其对解释性能的影响。
链接: https://arxiv.org/abs/2603.09988
作者: Ajay Pravin Mahale
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, 4 tables. MSc thesis work conducted at Hochschule Trier (2026). Code will be released upon publication
Abstract:Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable explanations remains an open problem. We present a pipeline that bridges circuit-level analysis and natural language explanations by (i) identifying causally important attention heads via activation patching, (ii) generating explanations using both template-based and LLM-based methods, and (iii) evaluating faithfulness using ERASER-style metrics adapted for circuit-level attribution. We evaluate on the Indirect Object Identification (IOI) task in GPT-2 Small (124M parameters), identifying six attention heads accounting for 61.4% of the logit difference. Our circuit-based explanations achieve 100% sufficiency but only 22% comprehensiveness, revealing distributed backup mechanisms. LLM-generated explanations outperform template baselines by 64% on quality metrics. We find no correlation (r = 0.009) between model confidence and explanation faithfulness, and identify three failure categories explaining when explanations diverge from mechanisms.
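摘要提到的 ERASER 风格忠实性指标有通用定义:sufficiency 衡量只保留解释(rationale)时得分相对全输入的变化(越低说明解释越充分),comprehensiveness 衡量去掉解释后得分的下降(越高说明解释越必要)。下面用一个虚构的占位打分函数 model_score 做纯演示,并非论文的电路级归因设置:

```python
def model_score(tokens):
    """占位"模型": 以虚构的证据词占比作为对标签的支持度打分。"""
    evidence = {"gave", "to"}
    return sum(1 for t in tokens if t in evidence) / len(tokens)

def sufficiency(tokens, rationale):
    """全输入得分 - 仅解释得分; 越低说明解释越充分。"""
    return model_score(tokens) - model_score(rationale)

def comprehensiveness(tokens, rationale):
    """全输入得分 - 去掉解释后的得分; 越高说明解释越必要。"""
    rest = [t for t in tokens if t not in rationale]
    return model_score(tokens) - model_score(rest)

tokens = "john gave the book to mary".split()
rationale = ["gave", "to"]
print(sufficiency(tokens, rationale), comprehensiveness(tokens, rationale))
```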
[NLP-86] Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation
【速读】: 该论文旨在解决特征变换(Feature Transformation, FT)中因特征操作组合空间庞大而导致的有效变换发现困难问题,现有方法如离散搜索或潜在生成常受限于样本效率低、无效候选过多及冗余生成等问题。其解决方案的关键在于提出一个闭环优化框架,通过强化学习探索高性能的特征变换序列,并构建和持续更新下游任务验证过的变换轨迹经验库;同时引入多样性感知选择器,结合思维链(chain-of-thought)机制形成上下文引导,从而提升生成特征的质量与性能稳定性。该方法在多种表格基准数据集上显著优于传统及基于大语言模型(LLM)的基线方法,且具备跨API与开源LLM的泛化能力。
链接: https://arxiv.org/abs/2603.09987
作者: Xinyuan Wang,Kunpeng Liu,Arun Vignesh Malarkkan,Yanjie Fu
机构: Arizona State University (亚利桑那州立大学); Clemson University (克莱姆森大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Feature Transformation (FT) is a core data-centric AI task that improves feature space quality to advance downstream predictive performance. However, discovering effective transformations remains challenging due to the large space of feature-operator combinations. Existing solutions rely on discrete search or latent generation, but they are frequently limited by sample inefficiency, invalid candidates, and redundant generations with limited coverage. Large Language Models (LLMs) offer strong priors for producing valid transformations, but current LLM-based FT methods typically rely on static demonstrations, resulting in limited diversity, redundant outputs, and weak alignment with downstream objectives. We propose a framework that optimizes context data for LLM-driven FT by evolving trajectory-level experiences in a closed loop. Starting from high-performing feature transformation sequences explored by reinforcement learning, we construct and continuously update an experience library of downstream task-verified transformation trajectories, and use a diversity-aware selector to form contexts along with a chain-of-thought and guide transformed feature generation toward higher performance. Experiments on diverse tabular benchmarks show that our method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.
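摘要未给出“多样性感知选择器”的具体算法;一个常见的实现思路是贪心最大-最小距离选择,即每次选取与已选轨迹的最小距离最大的候选。以下仅为该思路的虚构示意(候选轨迹的特征向量为编造数据),不代表论文的实际组件:

```python
def dist(u, v):
    """欧氏距离。"""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def select_diverse(candidates, k):
    """candidates: 名称 -> 特征向量; 贪心最大-最小距离选择 k 个。"""
    names = list(candidates)
    chosen = [names[0]]                  # 以第一个候选为种子
    while len(chosen) < k:
        best = max((n for n in names if n not in chosen),
                   key=lambda n: min(dist(candidates[n], candidates[c])
                                     for c in chosen))
        chosen.append(best)
    return chosen

cands = {"t1": [0.0, 0.0], "t2": [0.1, 0.0],
         "t3": [1.0, 1.0], "t4": [0.9, 1.1]}
print(select_diverse(cands, 2))  # ['t1', 't4']
```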
[NLP-87] Quantifying Hallucinations in Large Language Models on Medical Textbooks
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在医学问答(Medical Question Answering, QA)任务中普遍存在幻觉(Hallucination)的问题,即模型生成与事实不符或缺乏证据支持的回答。当前医疗QA基准测试通常未基于固定证据源评估此类行为,导致对模型可靠性评估不足。研究的关键在于通过两个实验量化幻觉发生频率并探索其与临床实用性之间的关联:首先,在给定教科书文本作为证据源的前提下,发现LLaMA-70B-Instruct模型在新提示下仍存在19.7%的幻觉率;其次,跨模型比较显示幻觉率越低,临床医生评分的有用性越高(ρ = -0.71, p = 0.058),表明减少幻觉可显著提升模型输出的临床可用性。此方法为评估和改进医疗领域生成式AI(Generative AI)的准确性提供了实证依据。
链接: https://arxiv.org/abs/2603.09986
作者: Brandon C. Colelough,Davis Bartels,Dina Demner-Fushman
机构: National Institutes of Health, National Library of Medicine (美国国立卫生研究院国家医学图书馆); University of Maryland, College Park (马里兰大学帕克分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures
Abstract:Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective solution to mitigate against. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first experiment to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second experiment to determine the prevalence of hallucinations and clinician preference to model responses. We observed, in experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility, and observed in experiment two, across models, lower hallucination rates aligned with higher usefulness scores (\rho = -0.71, p = 0.058). Clinicians produced high agreement (quadratic weighted \kappa = 0.92) and (\tau_b = 0.06 to 0.18, \kappa = 0.57 to 0.61) for experiments 1 and 2, respectively.
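摘要中的二次加权 kappa(quadratic weighted kappa)按标准定义计算如下;两位“临床评估者”的序数评分(0–3 的合理性等级)为虚构示例:

```python
def quadratic_weighted_kappa(a, b, n_classes):
    """两位评估者序数标签 (0..n_classes-1) 的二次加权 Cohen's kappa。"""
    n = len(a)
    w = lambda i, j: (i - j) ** 2 / (n_classes - 1) ** 2
    obs = [[0.0] * n_classes for _ in range(n_classes)]
    for x, y in zip(a, b):
        obs[x][y] += 1 / n
    pa = [a.count(i) / n for i in range(n_classes)]   # 边缘分布
    pb = [b.count(i) / n for i in range(n_classes)]
    num = sum(w(i, j) * obs[i][j]
              for i in range(n_classes) for j in range(n_classes))
    den = sum(w(i, j) * pa[i] * pb[j]
              for i in range(n_classes) for j in range(n_classes))
    return 1 - num / den

# 虚构的 0-3 级序数评分
r1 = [3, 2, 3, 1, 0, 2, 3, 1]
r2 = [3, 2, 2, 1, 0, 2, 3, 0]
print(round(quadratic_weighted_kappa(r1, r2, 4), 2))  # 0.9
```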
[NLP-88] The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在自我评估能力上的准确性尚不明确,尤其是它们是否表现出类似人类“达克效应”(Dunning-Kruger effect)的认知偏差——即低能力个体高估自身水平。为解答此问题,研究者通过实证方法评估了四种先进模型(Claude Haiku 4.5、Gemini 2.5 Pro、Gemini 2.5 Flash 和 Kimi K2)在四个基准数据集上的表现,共涵盖24,000次实验。关键解决方案在于使用期望校准误差(Expected Calibration Error, ECE)作为量化指标,揭示出性能较差的模型普遍呈现显著过度自信,例如Kimi K2模型ECE高达0.726而准确率仅23.3%,而表现最优的Claude Haiku 4.5则ECE仅为0.122且准确率达75.4%。这一发现表明LLMs存在与人类认知偏差相似的自我评估失准现象,对高风险场景下安全部署具有重要警示意义。
链接: https://arxiv.org/abs/2603.09985
作者: Sudipta Ghosh,Mrityunjoy Panday
机构: Cognizant Technology Solutions(认知技术解决方案公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect – a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence – a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.
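摘要的核心指标期望校准误差(ECE)定义为各置信度分桶内 |平均置信度 - 准确率| 的样本量加权和;以下为最小实现,试次数据为虚构,用以演示过度自信如何推高 ECE:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """等宽置信度分桶的 ECE。correct 为 0/1 正误标记。"""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(conf - acc)
    return ece

# 虚构试次: 模型常报 0.95 的置信度, 却只有一半答对
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 0, 1, 0, 1, 0]
print(round(expected_calibration_error(confs, correct), 3))  # 0.317
```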
[NLP-89] An Efficient Hybrid Deep Learning Approach for Detecting Online Abusive Language
【速读】: 该论文旨在解决在线平台中滥用语言(abusive language)的检测难题,特别是针对社交媒体、论坛和暗网等多场景下隐蔽性强、形式多样的有害内容(如仇恨言论、网络欺凌和毒害性评论),这些内容因使用特定词汇或编码短语以规避检测而难以识别。解决方案的关键在于提出一种融合BERT(Bidirectional Encoder Representations from Transformers)、CNN(Convolutional Neural Network)与LSTM(Long Short-Term Memory)架构的混合深度学习模型,并采用ReLU激活函数,从而有效捕捉文本的语义特征、上下文信息及序列模式,在一个包含77,620条 abusive 和272,214条 non-abusive 样本的不平衡数据集上实现了约99%的高精度评估指标(包括Precision、Recall、Accuracy、F1-score和AUC),显著提升了对现实世界中高度偏斜数据的鲁棒性检测能力。
链接: https://arxiv.org/abs/2603.09984
作者: Vuong M. Ngo,Cach N. Dang,Kien V. Nguyen,Mark Roantree
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 7 figures
Abstract:The digital age has expanded social media and online forums, allowing free expression for nearly 45% of the global population. Yet, it has also fueled online harassment, bullying, and harmful behaviors like hate speech and toxic comments across social networks, messaging apps, and gaming communities. Studies show 65% of parents notice hostile online behavior, and one-third of adolescents in mobile games experience bullying. A substantial volume of abusive content is generated and shared daily, not only on the surface web but also within dark web forums. Creators of abusive comments often employ specific words or coded phrases to evade detection and conceal their intentions. To address these challenges, we propose a hybrid deep learning model that integrates BERT, CNN, and LSTM architectures with a ReLU activation function to detect abusive language across multiple online platforms, including YouTube comments, online forum discussions, and dark web posts. The model demonstrates strong performance on a diverse and imbalanced dataset containing 77,620 abusive and 272,214 non-abusive text samples (ratio 1:3.5), achieving approximately 99% across evaluation metrics such as Precision, Recall, Accuracy, F1-score, and AUC. This approach effectively captures semantic, contextual, and sequential patterns in text, enabling robust detection of abusive content even in highly skewed datasets, as encountered in real-world scenarios.
[NLP-90] MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios
【速读】: 该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型在边缘设备上部署时面临的严重内存限制问题,尤其针对现有卸载策略因自回归专家激活的动态性和低信息性而导致的I/O瓶颈。其解决方案的关键在于重新利用推测解码(Speculative Decoding, SD)机制,不仅作为计算加速器,更作为具有信息量的前瞻感知器用于内存管理:提出MoE-SpAc框架,包含三个核心组件——推测效用估计器(Speculative Utility Estimator)用于追踪专家需求、异构负载均衡器(Heterogeneous Workload Balancer)通过在线整数优化动态划分计算任务,以及异步执行引擎(Asynchronous Execution Engine)统一预取与淘汰操作在相同的效用空间中。这一设计显著提升了推理效率,在七个基准测试中相较最先进SD基线提升42%吞吐量(TPS),平均速度提升达4.04倍。
链接: https://arxiv.org/abs/2603.09983
作者: Shuhuai Li,Jianghao Lin,Dongdong Ge,Yinyu Ye
机构: Shanghai University (上海大学); Shanghai Jiao Tong University (上海交通大学); Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low-information nature of autoregressive expert activation. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator, but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. Hence, we introduce MoE-SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement in TPS over the SOTA SD-based baseline, and an average 4.04x speedup over all standard baselines. Code is available at this https URL .
[NLP-91] AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic EACL2026
【速读】: 该论文旨在解决将现代编码器架构(如ModernBERT)适配至阿拉伯语等使用阿拉伯文衍生文字的语言时所面临的挑战,特别是针对阿拉伯语特有的语言特性(如词形变化复杂、连写现象普遍)以及长文本建模能力不足的问题。其解决方案的关键在于两个方面:一是采用transtokenization(跨分词器的词表与嵌入迁移方法)进行嵌入初始化,显著提升掩码语言建模性能;二是引入原生长上下文建模机制,支持高达8,192个token的稳定序列处理,从而在阿拉伯语自然语言理解任务(如推理、有害语言检测、问题相似度计算和命名实体识别)中实现优异的迁移效果。
链接: https://arxiv.org/abs/2603.09982
作者: Omar Elshehy,Omer Nacar,Abdelbasset Djamai,Muhammed Ragab,Khloud Al Jallad,Mona Abdelazim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure. Accepted at AbjadNLP Workshop, EACL 2026
Abstract:Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.
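摘要中的 transtokenized 嵌入初始化,其核心思想可用如下假设性草图说明:将新词表中的每个 token 用源分词器切分,并以对应源嵌入的平均值作为初始值。此处用简单平均代替实际方法中更精细的对齐权重,`src_tokenize`、`src_emb` 等名称均为示意:

```python
def transtokenize_init(new_vocab, src_tokenize, src_emb, dim):
    """为新词表初始化嵌入(transtokenization 思路的玩具示例):
    将每个新 token 的字符串用源分词器切分,取对应源嵌入的平均;
    完全无法对齐的 token 退化为零向量(实际方法会有更好的回退策略)。"""
    new_emb = {}
    for tok in new_vocab:
        pieces = [p for p in src_tokenize(tok) if p in src_emb]
        if not pieces:
            new_emb[tok] = [0.0] * dim
            continue
        acc = [0.0] * dim
        for p in pieces:
            for i, v in enumerate(src_emb[p]):
                acc[i] += v
        new_emb[tok] = [v / len(pieces) for v in acc]
    return new_emb
```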
[NLP-92] Large Language Models and Book Summarization: Reading or Remembering Which Is Better?
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本摘要任务中的性能边界问题,特别是探究仅依赖模型内部知识生成的摘要与基于完整文本输入生成的摘要之间的差异。其核心问题是:当模型具备从训练数据中学习到的先验知识时,是否仍需完整文本输入才能获得高质量摘要?解决方案的关键在于设计对照实验,对比同一经典书籍在两种条件下生成的摘要——一是仅使用模型内部记忆(无外部输入),二是提供完整的书籍文本作为输入,并通过定量评估指标分析两者在细节丰富度和信息完整性上的表现差异。结果表明,虽然完整文本通常能生成更详尽的摘要,但在某些情况下,模型的内部知识反而能产生更优的摘要效果,揭示了LLM在长文本处理中潜在的知识利用能力与局限性。
链接: https://arxiv.org/abs/2603.09981
作者: Tairan Fu,Javier Conde,Pedro Reviriego,Javier Coronado-Blázquez,Nina Melero,Elena Merino-Gómez
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Summarization is a core task in Natural Language Processing (NLP). Recent advances in Large Language Models (LLMs) and the introduction of large context windows reaching millions of tokens make it possible to process entire books in a single prompt. At the same time, for well-known books, LLMs can generate summaries based only on internal knowledge acquired during training. This raises several important questions: How do summaries generated from internal memory compare to those derived from the full text? Does prior knowledge influence summaries even when the model is given the book as input? In this work, we conduct an experimental evaluation of book summarization with state-of-the-art LLMs. We compare summaries of well-known books produced using (i) only the internal knowledge of the model and (ii) the full text of the book. The results show that having the full text provides more detailed summaries in general, but some books have better scores for the internal knowledge summaries. This puts into question the capabilities of models to perform summarization of long texts, as information learned during training can outperform summarization of the full text in some cases.
[NLP-93] Explainable LLM Unlearning Through Reasoning
【速读】: 该论文旨在解决预训练大语言模型(Large Language Models, LLMs)中存在的安全、版权和隐私风险问题,核心挑战在于如何实现精准且可控的“遗忘”(unlearning),即从模型中移除特定知识,同时避免对其他通用能力造成破坏。传统方法如梯度上升(Gradient Ascent, GA)及其变体虽能部分实现unlearning,但因其无目标性(untargeted)导致知识清除不彻底、无关能力退化及生成内容不连贯等问题。论文的关键创新在于提出一种新的“基于推理的遗忘目标”(reasoning-based unlearning target),该目标明确指定了应遗忘的知识范围以及遗忘后模型应有的响应行为;在此基础上,作者设计了靶向推理遗忘(Targeted Reasoning Unlearning, TRU)方法,通过交叉熵监督损失与GA损失相结合的方式,引导模型在推理过程中学习精确的知识删除机制,从而在保留无关能力的同时提升遗忘效果的可靠性与可解释性。
链接: https://arxiv.org/abs/2603.09980
作者: Junfeng Liao,Qizhou Wang,Shanshan Ye,Xin Yu,Ling Chen,Zhen Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit way by removing undesirable knowledge characterized by specific unlearning datasets. In previous works, gradient ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature results in unintended degradation of general capabilities, incomplete removal of knowledge, and the generation of incoherent responses, among many others. We argue that these issues stem from the absence of explicit guidance on what and how models should unlearn. To fill this gap, we introduce a novel unlearning target, reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose targeted reasoning unlearning (TRU), which leverages reasoning-based unlearning target as guidance. We employ the target using a cross-entropy supervised loss combined with a GA-based loss, enabling the model to learn reasoning ability for precise knowledge removal while preserving unrelated abilities. We evaluate TRU against strong baselines across multiple benchmarks and LLM backbones, and find that it achieves more reliable unlearning while preserving general capabilities. Moreover, TRU exhibits superior robustness under diverse attack scenarios, stemming from the reasoning ability learned through reasoning-based targets. Overall, our study establishes reasoning-augmented unlearning as a practical paradigm for reliable and explainable LLM unlearning.
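摘要中“交叉熵监督损失与 GA 损失相结合”的组合目标,可用下面的玩具示例说明(假设性简化:以序列平均负对数似然表示交叉熵项,梯度上升项即对遗忘集似然取负号;`lam` 为假设的权重超参数,非论文给定值):

```python
def nll(logprobs):
    """序列平均负对数似然。logprobs: 各 token 的对数概率列表。"""
    return -sum(logprobs) / len(logprobs)

def tru_style_loss(target_logprobs, forget_logprobs, lam=1.0):
    """TRU 式组合目标的示意:
    对“基于推理的遗忘目标”响应做交叉熵监督(最小化其 NLL),
    同时对遗忘集原始内容做梯度上升(最大化其 NLL,即损失中取负号)。"""
    ce_term = nll(target_logprobs)   # 监督项:拟合期望的遗忘后响应
    ga_term = -nll(forget_logprobs)  # 梯度上升项:压低被遗忘知识的似然
    return ce_term + lam * ga_term
```

直观上,最小化该损失会同时把概率质量推向指定的遗忘后响应、推离原始的待遗忘内容。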
[NLP-94] GhazalBench: Usage-Grounded Evaluation of LLM s on Persian Ghazals
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理具有深厚文化背景的诗歌文本时,意义理解与形式记忆之间存在脱节的问题,尤其是在波斯语加扎勒体(ghazal)这种高度依赖文化语境和固定韵律结构的文学形式中。其解决方案的关键在于提出并构建了一个名为GhazalBench的新基准测试框架,该框架通过两类互补任务评估模型能力:一是将诗句忠实转化为散文式释义(prose paraphrase),二是基于语义或形式线索完成诗句(completion-based recall)。实验发现,尽管多数模型能较好理解诗文含义,但在精确复现经典诗句时表现不佳;而识别类任务显著缩小了这一差距,表明问题根源在于训练数据中的文化接触不足而非模型架构限制。
链接: https://arxiv.org/abs/2603.09979
作者: Ghazal Kalhor,Yadollah Yaghoobzadeh
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at this https URL.
[NLP-95] Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
【速读】: 该论文旨在解决生成式 AI(Generative AI)在强化学习中因可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)导致的校准退化(calibration degeneration)问题,即模型在错误答案上表现出过度自信。现有方法通常直接将校准目标整合进优化目标,但本文通过理论分析揭示了最大化策略准确率与最小化校准误差之间存在根本性的梯度冲突。解决方案的关键在于提出DCPO框架,该框架系统性地解耦推理(reasoning)与校准(calibration)目标,从而在保持与GRPO相当的准确性的同时,显著改善校准性能并缓解过自信问题。
链接: https://arxiv.org/abs/2603.09117
作者: Zhengzhao Ma,Xueru Wen,Boxi Cao,Yaojie Lu,Hongyu Lin,Jinglin Yang,Min He,Xianpei Han,Le Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 8 figures
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies devote to directly incorporating calibration objective into existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between the optimization for maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that our DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and practical solution for more reliable LLM deployment.
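论文关注的校准退化通常用期望校准误差(Expected Calibration Error, ECE)来量化。下面给出标准 ECE 的一个最小实现,以说明“过度自信”的度量方式(仅为背景示意,与 DCPO 本身的训练流程无关):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """期望校准误差(ECE):按置信度分桶,
    对每桶计算 |平均置信度 - 准确率|,再按桶内样本占比加权求和。
    RLVR 导致的过度自信表现为置信度系统性高于准确率,ECE 随之增大。"""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, corrects):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_conf - acc)
    return ece
```

例如:五个回答置信度均为 0.8 且其中四个正确时,该桶准确率恰为 0.8,ECE 约为 0;若模型对全部错误回答仍给出 0.9 的置信度,则 ECE 接近 0.9。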
[NLP-96] ConFu: Contemplate the Future for Better Speculative Sampling ICLR2026
【速读】: 该论文旨在解决当前推测解码(speculative decoding)框架中 draft 模型因仅基于当前前缀进行预测而导致的误差累积问题,从而限制了生成效率和准确性。其解决方案的关键在于提出 ConFu(Contemplate the Future)框架,通过引入“contemplate tokens”和软提示(soft prompts)来使 draft 模型能够以极低开销获取目标模型的未来导向信号;同时设计动态 contemplate token 机制结合 MoE(Mixture of Experts)实现上下文感知的未来预测能力,并采用锚定 token 采样与未来预测复制的训练策略提升模型对未来的鲁棒预测能力。实验表明,ConFu 在 Llama-3 3B 和 8B 模型上相较 EAGLE-3 提升了 8–11% 的 token 接受率和生成速度,首次将推测解码与连续推理 token 相结合,为加速大语言模型(LLM)推理提供了新方向。
链接: https://arxiv.org/abs/2603.08899
作者: Zongyue Qin,Raghavv Goel,Mukul Gagrani,Risheek Garrepalli,Mingu Lee,Yizhou Sun
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: accepted at ICLR 2026 workshop on Latent Implicit Thinking - Going Beyond CoT Reasoning
Abstract:Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose \textbfConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
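ConFu 属于推测解码框架,其验证阶段沿用标准推测采样的接受判据:以 min(1, p_target/p_draft) 的概率接受草稿 token,从而保证输出分布与目标模型一致。下面是该判据的最小示意实现(`rng` 参数仅为便于确定性测试的假设性设计):

```python
import random

def speculative_accept(draft_token_prob, target_token_prob, rng=random.random):
    """标准推测采样的接受判据:
    以 min(1, p_target / p_draft) 的概率接受草稿 token。"""
    ratio = min(1.0, target_token_prob / draft_token_prob)
    return rng() < ratio

def acceptance_rate(pairs, rng=random.random):
    """对一串 (p_draft, p_target) 概率对模拟一轮验证,返回接受率。
    草稿模型越贴近目标模型,接受率越高,端到端加速越明显。"""
    accepted = sum(speculative_accept(pd, pt, rng) for pd, pt in pairs)
    return accepted / len(pairs)
```

论文报告的 8–11% 接受率提升,正是通过让 draft 模型预判目标模型的未来方向、使 p_draft 更接近 p_target 而得到的。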
[NLP-97] Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理跨模态信息时缺乏有效同步推理能力的问题,尤其是其在需要时间对齐的视听联合理解任务中的表现不足。解决方案的关键在于构建一个名为Daily-Omni的多项选择音频-视觉问答(Audio-Visual QA)基准,该基准包含684个真实世界视频和1,197个问题,覆盖6类明确要求跨模态时间推理的任务;同时开发了一套半自动化的标注与一致性优化流程,包括跨模态一致性精炼、时间对齐信号提取及文本泄露过滤,并辅以人工验证,从而实现可扩展的高质量数据构建。此外,研究还提出一种无需训练的模块化诊断基线,通过组合现成的单模态模型来揭示显式时间对齐信号对性能的影响,结果表明多数端到端 MLLMs 在依赖对齐的关键问题上仍表现不佳,凸显了鲁棒跨模态时间对齐仍是亟待突破的核心挑战。
链接: https://arxiv.org/abs/2505.17862
作者: Ziwei Zhou,Rui Wang,Zuxuan Wu,Yu-Gang Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We introduce Daily-Omni, a multiple-choice Audio-Visual QA benchmark featuring 684 real-world videos and 1,197 questions spanning 6 task families that explicitly require cross-modal temporal reasoning. To support scalable benchmark construction, we develop a semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering, followed by human verification. We further provide a diagnostic evaluation suite and extensively evaluate 24 foundation models under 37 model–modality settings (Audio+Video / Audio-only / Video-only / Text-only). Finally, we include a training-free modular diagnostic baseline that composes off-the-shelf unimodal models to serve as a diagnostic baseline and to illustrate how explicit temporal alignment signals affect performance. Results indicate that many end-to-end MLLMs still struggle on alignment-critical questions, suggesting that robust cross-modal temporal alignment remains an important open challenge.
[NLP-98] Speech Codec Probing from Semantic and Phonetic Perspectives
【速读】: 该论文旨在解决当前语音分词器(speech tokenizer)在连接语音与大语言模型(LLM)时存在的语义失配问题,即现有语音表示中的“语义”信息与文本语义不一致,从而影响多模态大语言模型的下游理解与生成性能。其解决方案的关键在于通过词级探测任务、逐层表征分析以及跨模态对齐度量(如CKA)等方法,系统性地解耦并量化不同语音分词器所编码的语义(semantic)与语音特征(phonetic)信息,发现当前主流分词器主要捕捉的是语音特征而非词汇语义结构,从而为下一代语音分词方法的设计提供了实证依据和优化方向。
链接: https://arxiv.org/abs/2603.10371
作者: Xuan Shi,Chang Zeng,Tiantian Feng,Shih-Heng Wang,Jianbo Ma,Shrikanth Narayanan
机构: University of Southern California(南加州大学); Dolby Laboratories(杜比实验室)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:
Abstract:Speech tokenizers are essential for connecting speech to large language models (LLMs) in multimodal systems. These tokenizers are expected to preserve both semantic and acoustic information for downstream understanding and generation. However, emerging evidence suggests that what is termed “semantic” in speech representations does not align with text-derived semantics: a mismatch that can degrade multimodal LLM performance. In this paper, we systematically analyze the information encoded by several widely used speech tokenizers, disentangling their semantic and phonetic content through word-level probing tasks, layerwise representation analysis, and cross-modal alignment metrics such as CKA. Our results show that current tokenizers primarily capture phonetic rather than lexical-semantic structure, and we derive practical implications for the design of next-generation speech tokenization methods.
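文中用于跨模态对齐度量的 CKA,其线性版本可按如下方式实现(基于 NumPy 的通用实现示意,并非论文附带代码):

```python
import numpy as np

def linear_cka(X, Y):
    """线性 CKA(centered kernel alignment):
    衡量两组表示(如语音分词器某层表示与文本嵌入)的相似度,
    取值在 [0, 1],对各向同性缩放与正交变换不变。
    X: (n, d1), Y: (n, d2),行对应同一批样本。"""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```

在该类探测实验中,可逐层计算语音表示与文本语义嵌入的 CKA:若数值偏低,说明该层编码的主要是语音特征而非词汇语义结构。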
[NLP-99] Calibration-Reasoning Framework for Descriptive Speech Quality Assessment INTERSPEECH2026
【速读】: 该论文旨在解决传统语音质量评估方法依赖单一均值意见分数(Mean Opinion Score, MOS)难以揭示感知维度细节的问题,提出一种面向多维感知推理的可解释语音质量评估新范式。其解决方案的关键在于引入一种新颖的后训练方法,首先通过校准阶段使基础音频大语言模型(Audio Large Language Model)对预定义的感知维度进行精准预测,随后利用基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习阶段,结合维度特异性奖励机制,显著提升模型在质量问题描述准确性与时间定位上的表现,从而实现对音频失真(audio artifacts)的细粒度检测、分类与时空定位。
链接: https://arxiv.org/abs/2603.10175
作者: Elizaveta Kostenok,Mathieu Salzmann,Milos Cernak
机构: EPFL (瑞士联邦理工学院); Logitech (罗技)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Submitted to Interspeech 2026
Abstract:Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors the foundational Audio Large Language Model for multidimensional reasoning, detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to heavily enhance accuracy of descriptions and temporal localization of quality issues. With this approach we reach state-of-the-art results of 0.71 mean PCC score on the multidimensional QualiSpeech benchmark and 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model’s ability to pinpoint and classify audio artifacts in time.
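摘要中强化学习阶段所用 GRPO 的核心,是对同一提示下采样的一组回答计算组内相对优势:用组内均值与标准差对各回答的奖励做标准化,从而无需价值网络即可得到基线化的策略梯度信号。最小示意如下(`eps` 为防零除的假设性小常数):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """GRPO 的组内相对优势:A_i = (r_i - mean(r)) / (std(r) + eps)。
    rewards: 同一提示下一组采样回答的(维度特异性)奖励。"""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```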
信息检索
[IR-0] Chasing RATs: Tracing Reading for and as Creative Activity
【速读】:该论文旨在解决当前创造力研究中过度强调“创作”(making)而忽视前置性解释劳动(interpretive labor)的问题,尤其是算法推荐和AI摘要技术日益压缩与自动化人类阅读与理解过程所带来的隐性创造性价值流失。其解决方案的关键在于提出“阅读活动痕迹”(Reading Activity Traces, RATs),将阅读定义为一种广义的创造性活动,涵盖跨源媒体的导航、解读与编排,并通过可追踪的遍历路径、关联模式与反思记录,使这种原本隐形的创造性工作成为可观察、可分析的实体。这一方法不仅为反思性实践、读者建模和集体意义建构开辟了新方向,也为设计能保留人类解释力的智能工具提供了理论基础。
链接: https://arxiv.org/abs/2603.11031
作者: Sophia Liu,Shm Garanganao Almeda
机构: University of California, Berkeley(加州大学伯克利分校)
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Multimedia (cs.MM); Social and Information Networks (cs.SI)
备注:
Abstract:Creativity research has privileged making over the interpretive labor that precedes and shapes it. We introduce Reading Activity Traces (RATs), a proposal that treats reading – broadly defined to include navigating, interpreting, and curating media across interconnected sources – as creative activity both for future artifacts and as a form of creation in its own right. By tracing trajectories of traversal, association, and reflection as inspectable artifacts, RATs render visible the creative work that algorithmic feeds and AI summarization increasingly compress and automate away. We illustrate this through WikiRAT, a speculative instantiation on Wikipedia, and open new ground for reflective practice, reader modeling, collective sensemaking, and understanding what is lost when human interpretation is automated – towards designing intelligent tools that preserve it.
[IR-1] A Systematic Study of Pseudo-Relevance Feedback with LLM s
【速读】:该论文旨在解决伪相关反馈(Pseudo-relevance Feedback, PRF)方法在基于大语言模型(Large Language Models, LLMs)时,其设计维度——反馈来源(feedback source)与反馈模型(feedback model)——各自独立作用不明确的问题。由于这两个维度在现有实证评估中常被混杂,导致难以厘清其对PRF效果的具体贡献。解决方案的关键在于通过受控实验系统性地考察不同反馈来源(如LLM生成文本或文档语料库)和反馈模型(即如何利用反馈文本优化查询表示)组合下的性能表现,在13个低资源BEIR任务上验证了:反馈模型的选择对PRF效果具有关键影响;仅使用LLM生成文本作为反馈来源最具成本效益;而从语料库中提取反馈则在结合强第一阶段检索器候选文档时收益最大。
链接: https://arxiv.org/abs/2603.11008
作者: Nour Jedidi,Jimmy Lin
机构: University of Waterloo (滑铁卢大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impact PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.
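作为背景,伪相关反馈中最经典的“反馈模型”是 Rocchio 式查询向量更新;论文比较的正是这类反馈模型与不同反馈来源(LLM 生成文本或语料库文档)的组合。下面给出 Rocchio 更新的最小示意(`alpha`、`beta` 取常见默认值,仅为假设):

```python
def rocchio_expand(query_vec, feedback_vecs, alpha=1.0, beta=0.75):
    """Rocchio 伪相关反馈:q' = alpha * q + beta * mean(反馈向量)。
    feedback_vecs 既可以来自首轮检索的 top-k 文档,
    也可以换成 LLM 生成文本的向量(即不同的反馈来源)。"""
    dim = len(query_vec)
    centroid = [
        sum(d[i] for d in feedback_vecs) / len(feedback_vecs) for i in range(dim)
    ]
    return [alpha * query_vec[i] + beta * centroid[i] for i in range(dim)]
```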
[IR-2] A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification
【速读】:该论文旨在解决药物处方审核中因大语言模型(Large Language Models, LLMs)固有的事实不可靠性、缺乏可追溯性及复杂推理能力不足而导致的医疗安全风险问题。其核心解决方案是提出 PharmGraph-Auditor 系统,关键在于构建一个可信的混合药物知识库(Hybrid Pharmaceutical Knowledge Base, HPKB),该知识库基于虚拟知识图谱(Virtual Knowledge Graph, VKG)范式,融合关系型组件用于集合约束满足与图结构组件用于拓扑推理,并通过严格的映射层实现协同工作;同时引入基于知识库的验证链(KB-grounded Chain of Verification, CoV)推理范式,将 LLM 转化为透明的推理引擎,通过分解审计任务为一系列可验证查询并生成混合查询计划从最优数据源获取证据,从而实现安全、可解释且高效的处方审核。
链接: https://arxiv.org/abs/2603.10891
作者: Yichi Zhu,Kan Ling,Xu Liu,Hengrun Zhang,Huiqun Yu,Guisheng Fan
机构: East China University of Science and Technology (华东理工大学)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 11 pages, 7 figures. this http URL for safe prescription auditing and hybrid knowledge-grounded reasoning
Abstract:Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts. For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show promises of using PharmGraph-Auditor to enable pharmacists to achieve safer and faster prescription verification.
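摘要中的验证链(CoV)把审核任务分解为一串可验证查询。下面用假设性的占位检查函数示意这一骨架(仅为流程草图:真实系统中每个 check 会被编译为针对 HPKB 的混合查询计划,而非内联的 Python 函数):

```python
def chain_of_verification(prescription, checks):
    """KB-grounded CoV 的骨架示意:逐项执行可验证查询并汇总证据。
    checks: [(检查名, 验证函数), ...];验证函数返回 (是否通过, 证据文本)。"""
    findings = []
    for name, check in checks:
        ok, evidence = check(prescription)
        findings.append({"check": name, "passed": ok, "evidence": evidence})
    verdict = all(f["passed"] for f in findings)
    return verdict, findings
```

这种结构使 LLM 只负责生成与串联查询,每个结论都能追溯到知识库证据,从而满足零容忍场景的可追溯性要求。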
[IR-3] An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took “Use of Practical AI in Digital Libraries” seriously? LREC2026
【速读】:该论文旨在解决大规模多语言环境下主题标引(subject indexing)难以持续维护的问题,尤其是在跨语言场景中保持一致性与准确性。其解决方案的关键在于发布了一个大型双语(英语/德语)书目记录语料库,该语料库标注了整合权威档(GND)中的术语,并提供了可机器处理的GND分类法(taxonomy),从而支持基于本体的多标签分类、文本到权威术语的映射以及可复现的、以权威为基础的评估机制,为实现由生成式AI驱动的辅助编目工具(authority-anchored AI co-pilots)奠定基础。
链接: https://arxiv.org/abs/2603.10876
作者: Jennifer D’Souza,Sameer Sadruddin,Maximilian Kähler,Andrea Salfinger,Luca Zaccagna,Francesca Incitti,Lauro Snidaro,Osma Suominen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: 9 pages, 5 figures. Accepted to appear in the Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract:Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers’ work.
[IR-4] Interpretable Chinese Metaphor Identification via LLM -Assisted MIPVU Rule Script Generation: A Comparative Protocol Study
【速读】:该论文旨在解决中文隐喻识别任务中缺乏可解释性的问题,即现有计算方法多为黑箱分类器,无法提供判断依据,尤其在中文语境下,由于丰富的修辞传统、缺乏形态线索以及标注资源有限,这一问题更为突出。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)辅助的模块化规则脚本流水线,将四种隐喻识别协议(MIP/MIPVU词法分析、CMDAG概念映射标注、基于情感的检测和明喻导向识别)转化为可执行、人类可审计的确定性步骤序列,并在关键节点引入受控的LLM调用以增强推理能力,从而在每个分类决策中生成结构化的解释性理由。该设计实现了100%的确定性可复现性,并通过跨协议对比揭示了不同识别策略间存在显著分歧(如协议A与D的Kappa仅为0.001),凸显了协议选择是影响识别结果的首要因素,而非模型本身。
链接: https://arxiv.org/abs/2603.10784
作者: Weihang Huang,Mengna Liu
机构: University of Birmingham (伯明翰大学); Guangdong University of Foreign Studies (广东外语外贸大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols–MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification–as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen’s kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.
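文中报告的协议间一致性使用 Cohen's kappa 度量。其标准计算如下(通用实现示意,非论文脚本):kappa = (p_o - p_e) / (1 - p_e),其中 p_o 为观察一致率,p_e 为按边际分布估计的偶然一致率。

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa:衡量两个标注(如两种隐喻识别协议的判定)的一致性,
    并扣除偶然一致的部分。1 表示完全一致,0 表示不超过偶然水平。"""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    cats = set(labels_a) | set(labels_b)
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats
    )
    if p_e == 1.0:
        return 1.0  # 双方完全退化为同一类别的边界情形
    return (p_o - p_e) / (1 - p_e)
```

文中协议 A 与 D 的 kappa 仅 0.001(接近偶然水平),而协议 B 与 C 达 0.986,对应上式中 p_o 远高于 p_e 的情形。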
[IR-5] RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation Systems
【速读】:该论文旨在解决当前生成式 AI(Generative AI)系统中检索增强生成(Retrieval-Augmented Generation, RAG)管道在实际部署时缺乏系统性性能评估与调优工具的问题。现有方法难以对 RAG 流水线的各模块行为进行细粒度分析,导致无法准确量化不同组件(如嵌入模型、向量数据库、重排序策略和大语言模型)对端到端查询性能和质量的影响。解决方案的关键在于提出一个名为 RAGPerf 的基准测试框架,通过将 RAG 工作流解耦为嵌入(embedding)、索引(indexing)、检索(retrieval)、重排序(reranking)和生成(generation)等可配置的模块化组件,并支持多样化的数据源、检索更新比例和查询分布,从而实现对性能指标(如端到端吞吐量、GPU内存占用、CPU/GPU利用率)和准确性指标(如上下文召回率、查询准确率、事实一致性)的自动化采集与分析,且开销极低。
链接: https://arxiv.org/abs/2603.10765
作者: Shaobo Li,Yirui Zhou,Yuan Xu,Kevin Chen,Daniel Waddington,Swaminathan Sundararaman,Hubertus Franke,Jian Huang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); IBM Research (IBM 研究院)
类目: Performance (cs.PF); Information Retrieval (cs.IR)
备注: The codebase of RAGPerf is available at this https URL
Abstract:We present the design and implementation of a RAG-based AI system benchmarking (RAGPerf) framework for characterizing the system behaviors of RAG pipelines. To facilitate detailed profiling and fine-grained performance analysis, RAGPerf decouples the RAG workflow into several modular components - embedding, indexing, retrieval, reranking, and generation. RAGPerf offers the flexibility for users to configure the core parameters of each component and examine their impact on the end-to-end query performance and quality. RAGPerf has a workload generator to model real-world scenarios by supporting diverse datasets (e.g., text, pdf, code, and audio), different retrieval and update ratios, and query distributions. RAGPerf also supports different embedding models, major vector databases such as LanceDB, Milvus, Qdrant, Chroma, and Elasticsearch, as well as different LLMs for content generation. It automates the collection of performance metrics (i.e., end-to-end query throughput, host/GPU memory footprint, and CPU/GPU utilization) and accuracy metrics (i.e., context recall, query accuracy, and factual consistency). We demonstrate the capabilities of RAGPerf through a comprehensive set of experiments and open source its codebase at GitHub. Our evaluation shows that RAGPerf incurs negligible performance overhead.
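RAGPerf 采集的准确性指标中,上下文召回率的一种常见定义可示意如下(假设以相关片段 ID 集合衡量检索覆盖度,仅为该指标的通用写法,并非该框架的具体实现):

```python
def context_recall(retrieved_ids, relevant_ids):
    """上下文召回率:检索结果覆盖了多少应被召回的相关片段。
    retrieved_ids: 检索模块返回的片段 ID;relevant_ids: 标注的相关片段 ID。"""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # 无相关片段时按约定记为满分
    hit = len(set(retrieved_ids) & relevant)
    return hit / len(relevant)
```

把这类质量指标与吞吐量、显存占用等性能指标放在同一组实验里采集,正是该基准框架做各组件权衡分析的基础。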
[IR-6] Structured Linked Data as a Memory Layer for Agent -Orchestrated Retrieval
【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统将文档视为扁平文本、忽略知识图谱提供的结构化元数据与实体间关联关系的问题。其核心解决方案在于引入基于Linked Data平台的结构化链接数据,特别是通过结构化语义标记(如JSON-LD)和可解析(dereferenceable)的实体页面,优化文档表示形式以提升检索准确性和答案质量。关键创新点在于设计了一种“增强型实体页面”格式,融合了JSON-LD标记、面向代理的指令(agent instructions)、面包屑导航路径及神经搜索能力,在标准RAG和代理式RAG中分别实现29.6%和29.8%的准确率提升,显著优于仅使用JSON-LD标记的传统方法。
链接: https://arxiv.org/abs/2603.10700
作者: Andrea Volpini,Elie Raad,Beatrice Gamba,David Riccitelli
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 33 pages, 7 figures, reproducibility appendix, dataset/evaluation framework/enhanced entity page templates released with the paper
Abstract:Retrieval-Augmented Generation (RAG) systems typically treat documents as flat text, ignoring the structured metadata and linked relationships that knowledge graphs provide. In this paper, we investigate whether structured linked data, specifically this http URL markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic RAG systems. We conduct a controlled experiment across four domains (editorial, legal, travel, e-commerce) using Vertex AI Vector Search 2.0 for retrieval and the Google Agent Development Kit (ADK) for agentic reasoning. Our experimental design tests seven conditions: three document representations (plain HTML, HTML with JSON-LD, and an enhanced agentic-optimized entity page) crossed with two retrieval modes (standard RAG and agentic RAG with multi-hop link traversal), plus an Enhanced+ condition that adds rich navigational affordances and entity interlinking. Our results reveal that while JSON-LD markup alone provides only modest improvements, our enhanced entity page format, incorporating this http URL-style agent instructions, breadcrumbs, and neural search capabilities, achieves substantial gains: +29.6% accuracy improvement for standard RAG and +29.8% for the full agentic pipeline. The Enhanced+ variant, with richer navigational affordances, achieves the highest absolute scores (accuracy: 4.85/5, completeness: 4.55/5), though the incremental gain over the base enhanced format is not statistically significant. We release our dataset, evaluation framework, and enhanced entity page templates to support reproducibility.
[IR-7] Breaking User-Centric Agency: A Tri-Party Framework for Agent -Based Recommendation
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的推荐系统中普遍存在的用户中心化倾向问题,即现有代理式推荐方法将物品视为被动实体,忽视了物品和平台层面的公平性与曝光需求,从而导致曝光集中和长尾物品代表性不足,威胁系统的长期可持续性。解决方案的关键在于提出首个三方LLM代理推荐框架(Tri-party LLM-agent Recommendation, TriRec),其核心创新为两阶段架构:第一阶段通过赋予物品代理个性化自我推广能力,提升匹配质量并缓解冷启动问题;第二阶段由平台代理进行多目标序列重排序,平衡用户相关性、物品效用与曝光公平性。实验表明,该框架在准确性、公平性和物品层面效用上均取得显著提升,且发现物品自我推广可同时增强公平性与有效性,挑战了传统认为相关性与公平性存在权衡关系的认知。
链接: https://arxiv.org/abs/2603.10673
作者: Yaxin Gong,Chongming Gao,Chenxiao Fan,Wenjie Wang,Fuli Feng,Xiangnan He
机构: University of Science and Technology of China (中国科学技术大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Recent advances in large language models (LLMs) have stimulated growing interest in agent-based recommender systems, enabling language-driven interaction and reasoning for more expressive preference modeling. However, most existing agentic approaches remain predominantly user-centric, treating items as passive entities and neglecting the interests of other critical stakeholders. This limitation exacerbates exposure concentration and long-tail under-representation, threatening long-term system sustainability. In this work, we identify this fundamental limitation and propose the first Tri-party LLM-agent Recommendation framework (TriRec) that explicitly coordinates user utility, item exposure, and platform-level fairness. The framework employs a two-stage architecture: Stage~1 empowers item agents with personalized self-promotion to improve matching quality and alleviate cold-start barriers, while Stage~2 uses a platform agent for sequential multi-objective re-ranking, balancing user relevance, item utility, and exposure fairness. Experiments on multiple benchmarks show consistent gains in accuracy, fairness, and item-level utility. Moreover, we find that item self-promotion can simultaneously enhance fairness and effectiveness, challenging the conventional trade-off assumption between relevance and fairness. Our code is available at this https URL.
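第二阶段的多目标重排序,最直接的一种实现是对三方得分做线性加权标量化。下面给出这一思路的假设性草图(权重与字段命名均为示意;论文实际采用的是 LLM 平台代理驱动的序贯重排序,而非固定权重):

```python
def tri_party_rerank(candidates, weights=(0.6, 0.2, 0.2)):
    """三方多目标重排序的线性标量化示意:
    综合用户相关性、物品效用与曝光公平性三项得分后降序排列。
    candidates: [(item_id, relevance, item_utility, fairness_gain), ...]"""
    w_rel, w_item, w_fair = weights
    scored = [
        (w_rel * rel + w_item * util + w_fair * fair, item)
        for item, rel, util, fair in candidates
    ]
    return [item for _, item in sorted(scored, reverse=True)]
```

这类标量化会显式给长尾物品的曝光增益留出权重,体现“物品不再是被动实体”的设计取向。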
[IR-8] A Hypergraph-Based Framework for Exploratory Business Intelligence
【速读】:该论文旨在解决传统业务智能(Business Intelligence, BI)系统在支持探索式业务智能(Exploratory BI)时存在的关键瓶颈问题,包括对专家知识的强依赖、高计算成本、静态模式限制以及缺乏物化视图复用能力。其解决方案的核心在于提出ExBI系统,采用超图(hypergraph)数据模型并引入Source、Join和View等操作符,实现动态模式演化与物化视图的高效重用;同时结合基于采样的算法,在保证可证明估计精度的前提下显著降低计算开销,从而支撑大规模、迭代式的探索式分析任务。
链接: https://arxiv.org/abs/2603.10625
作者: Yunkai Lou,Shunyang Li,Longbin Lai,Jianke Yu,Wenyuan Yu,Ying Zhang
机构: 未知
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注:
Abstract:Business Intelligence (BI) analysis is evolving towards Exploratory BI, an iterative, multi-round exploration paradigm where analysts progressively refine their understanding. However, traditional BI systems impose critical limits for Exploratory BI: heavy reliance on expert knowledge, high computational costs, static schemas, and lack of reusability. We present ExBI, a novel system that introduces the hypergraph data model with operators, including Source, Join, and View, to enable dynamic schema evolution and materialized view reuse. Using sampling-based algorithms with provable estimation guarantees, ExBI addresses the computational bottlenecks, while maintaining analytical accuracy. Experiments on LDBC datasets demonstrate that ExBI achieves significant speedups over existing systems: on average 16.21x (up to 146.25x) compared to Neo4j and 46.67x (up to 230.53x) compared to MySQL, while maintaining high accuracy with an average error rate of only 0.27% for COUNT, enabling efficient and accurate large-scale exploratory BI workflows.
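ExBI 以基于采样的算法换取低计算开销。下面给出一个与该思路相近的最小示意:对总体做伯努利采样,再按采样率放大得到 COUNT 估计。采样方式、参数与估计器均为示例假设,并非 ExBI 的实际算法。

```python
import random

def estimate_count(population, predicate, sample_rate=0.1, seed=7):
    """以概率 sample_rate 采样 population,统计满足 predicate 的样本数并按采样率放大。"""
    rng = random.Random(seed)
    # 短路求值:只对被采样到的元素计算 predicate
    hits = sum(1 for x in population if rng.random() < sample_rate and predicate(x))
    return hits / sample_rate
```

估计的相对误差随采样规模增大而下降;摘要中报告的 COUNT 平均误差仅 0.27%,对应远大于此示例的采样规模与带精度保证的估计器。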
[IR-9] Trajectory-Informed Memory Generation for Self-Improving Agent Systems
【速读】:该论文旨在解决大语言模型驱动的智能体(LLM-powered agents)在执行任务过程中难以从经验中学习并提升未来性能的问题,具体表现为重复低效行为、无法从相似错误中恢复以及未能复用过往成功策略。其解决方案的关键在于提出一个自动提取可操作学习内容并基于上下文记忆检索增强智能体决策能力的框架:通过轨迹智能提取器(Trajectory Intelligence Extractor)进行语义分析,决策归因分析器(Decision Attribution Analyzer)识别导致失败、恢复或低效的具体决策步骤,上下文学习生成器(Contextual Learning Generator)产出三类指导性知识(策略提示、恢复提示与优化提示),并由自适应记忆检索系统(Adaptive Memory Retrieval System)根据多维相似度将相关学习注入任务提示中,从而实现对执行模式的理解、结构化学习的提取与情境化应用。
链接: https://arxiv.org/abs/2603.10600
作者: Gaodan Fang,Vatche Isahagian,K. R. Jayaram,Ritesh Kumar,Vinod Muthusamy,Punleuk Oum,Gegi Thomas
机构: IBM Research (IBM 研究院)
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注:
Abstract:LLM-powered agents face a persistent challenge: learning from their execution experiences to improve future performance. While agents can successfully complete many tasks, they often repeat inefficient patterns, fail to recover from similar errors, and miss opportunities to apply successful strategies from past executions. We present a novel framework for automatically extracting actionable learnings from agent execution trajectories and utilizing them to improve future performance through contextual memory retrieval. Our approach comprises four components: (1) a Trajectory Intelligence Extractor that performs semantic analysis of agent reasoning patterns, (2) a Decision Attribution Analyzer that identifies which decisions and reasoning steps led to failures, recoveries, or inefficiencies, (3) a Contextual Learning Generator that produces three types of guidance – strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient but successful executions, and (4) an Adaptive Memory Retrieval System that injects relevant learnings into agent prompts based on multi-dimensional similarity. Unlike existing memory systems that store generic conversational facts, our framework understands execution patterns, extracts structured learnings with provenance, and retrieves guidance tailored to specific task contexts. Evaluation on the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks and particularly strong benefits on complex tasks (28.5 pp scenario goal improvement, a 149% relative increase).
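摘要中的自适应记忆检索系统按"多维相似度"把相关学习注入提示。以下用关键词集合的 Jaccard 相似度做一个最小示意;真实系统的相似度维度与记忆结构如论文所述更为丰富,此处的数据结构与打分方式均为假设。

```python
def jaccard(a, b):
    """两个关键词集合的 Jaccard 相似度;全空时约定为 0。"""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def retrieve_learnings(memory, task_keywords, k=2):
    """memory: [(关键词集合, 学习条目文本), ...];返回与当前任务最相似的 k 条学习。"""
    ranked = sorted(memory, key=lambda m: jaccard(m[0], task_keywords), reverse=True)
    return [tip for _, tip in ranked[:k]]
```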
[IR-10] Modeling Stage-wise Evolution of User Interests for News Recommendation
【速读】:该论文旨在解决个性化新闻推荐中长期偏好与短期兴趣动态变化难以同时建模的问题。现有方法通常依赖单一静态交互图,无法有效捕捉用户行为随时间演化的复杂性,尤其在面对突发事件、热点话题和实时情境变化时表现不足。解决方案的关键在于提出一个统一框架,从全局和局部两个时间维度学习用户偏好:全局偏好模块通过整体交互图提取长期协同信号,而局部偏好模块将历史交互划分为阶段性的时序子图,其中LSTM分支建模近期兴趣的渐进演化,自注意力(self-attention)分支则捕获长程时间依赖关系,从而实现对用户兴趣多尺度动态变化的精准刻画。
链接: https://arxiv.org/abs/2603.10471
作者: Zhiyong Cheng,Yike Jin,Zhijie Zhang,Huilin Chen,Zhangling Duan,Meng Wang
机构: The School of Computer Science and Information Engineering, Hefei University of Technology (合肥工业大学计算机科学与信息工程学院); Institute of Artificial intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: ACM Web Conference 2026 Accepted
Abstract:Personalized news recommendation is highly time-sensitive, as user interests are often driven by emerging events, trending topics, and shifting real-world contexts. These dynamics make it essential to model not only users’ long-term preferences, which reflect stable reading habits and high-order collaborative patterns, but also their short-term, context-dependent interests that change rapidly over time. However, most existing approaches rely on a single static interaction graph, which struggles to capture both long-term preference patterns and short-term interest changes as user behavior evolves. To address this challenge, we propose a unified framework that learns user preferences from both global and local temporal perspectives. A global preference modeling component captures long-term collaborative signals from the overall interaction graph, while a local preference modeling component partitions historical interactions into stage-wise temporal subgraphs to represent short-term dynamics. Within this module, an LSTM branch models the progressive evolution of recent interests, and a self-attention branch captures long-range temporal dependencies. Extensive experiments on two large-scale real-world datasets show that our approach consistently outperforms strong baselines and delivers fresher and more relevant recommendations across diverse user behaviors and temporal settings.
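局部偏好建模需要先把历史交互划分为阶段性的时序子图。下面是按时间排序后均匀切分的一个示意(划分策略为示例假设,论文中的子图构建更复杂):

```python
def partition_stages(interactions, num_stages=3):
    """interactions: [(timestamp, user, item), ...];返回 num_stages 个按时间连续的阶段。"""
    ordered = sorted(interactions)  # 按时间戳升序排列
    n = len(ordered)
    bounds = [round(i * n / num_stages) for i in range(num_stages + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(num_stages)]
```

每个阶段内的交互可各自建图;LSTM 分支按阶段顺序建模近期兴趣的渐进演化,自注意力分支跨阶段捕获长程时间依赖。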
[IR-11] Differentiable Geometric Indexing for End-to-End Generative Retrieval
【速读】:该论文旨在解决生成式检索(Generative Retrieval, GR)中两个内在矛盾:一是优化阻塞(Optimization Blockage),即离散索引的不可微性导致索引构建与下游检索目标解耦;二是几何冲突(Geometric Conflict),即标准内积目标引发的范数膨胀不稳定性,使热门项在几何空间中过度主导长尾相关项。解决方案的关键在于提出可微几何索引(Differentiable Geometric Indexing, DGI):首先通过Gumbel-Softmax实现软教师强制(Soft Teacher Forcing)并结合对称权重共享,建立端到端可微路径以实现操作统一;其次引入各向同性几何优化(Isotropic Geometric Optimization),用单位超球面上缩放后的余弦相似度替代内积logits,从而解耦流行度偏差与语义相关性,恢复几何保真度。实验表明,DGI在大规模工业搜索和电商场景中显著优于稀疏、密集及生成基线模型,尤其在长尾场景下展现出更强鲁棒性。
链接: https://arxiv.org/abs/2603.10409
作者: Xujing Wang,Yufeng Chen,Boxuan Zhang,Jie Zhao,Chao Wei,Cai Xu,Ziyu Guan,Wei Zhao,Weiru Zhang,Xiaoyi Zeng
机构: Xidian University (西安电子科技大学); Alibaba (阿里巴巴)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Generative Retrieval (GR) has emerged as a promising paradigm to unify indexing and search within a single probabilistic framework. However, existing approaches suffer from two intrinsic conflicts: (1) an Optimization Blockage, where the non-differentiable nature of discrete indexing creates a gradient blockage, decoupling index construction from the downstream retrieval objective; and (2) a Geometric Conflict, where standard unnormalized inner-product objectives induce norm-inflation instability, causing popular “hub” items to geometrically overshadow relevant long-tail items. To systematically resolve these misalignments, we propose Differentiable Geometric Indexing (DGI). First, to bridge the optimization gap, DGI enforces Operational Unification. It employs Soft Teacher Forcing via Gumbel-Softmax to establish a fully differentiable pathway, combined with Symmetric Weight Sharing to effectively align the quantizer’s indexing space with the retriever’s decoding space. Second, to restore geometric fidelity, DGI introduces Isotropic Geometric Optimization. We replace inner-product logits with scaled cosine similarity on the unit hypersphere to effectively decouple popularity bias from semantic relevance. Extensive experiments on large-scale industry search datasets and online e-commerce platform demonstrate that DGI outperforms competitive sparse, dense, and generative baselines. Notably, DGI exhibits superior robustness in long-tail scenarios, validating the necessity of harmonizing structural differentiability with geometric isotropy.
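DGI 用单位超球面上缩放后的余弦相似度替代内积 logits,以解耦范数(流行度)与语义相关性。下面的纯 Python 示意展示这一差别:大范数的"热门"向量在内积下胜出,但在余弦相似度下不敌方向更一致的长尾向量。向量数值为构造示例,缩放系数 `scale` 为假设参数。

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_logit(u, v, scale=10.0):
    """缩放余弦相似度:先把两个向量归一化到单位超球面,再乘温度系数 scale。"""
    return scale * dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [1.0, 0.0]
hub = [3.0, 3.0]    # 范数大但方向偏离的“热门”物品
tail = [2.0, 0.1]   # 范数小但方向对齐的长尾物品
```

内积给出 hub(3.0)> tail(2.0),而余弦相似度则反转这一排序,正对应摘要中"热门项在几何空间中过度主导长尾相关项"的问题。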
[IR-12] Beyond Interleaving: Causal Attention Reformulations for Generative Recommender Systems KDD2026
【速读】:该论文旨在解决生成式推荐系统(Generative Recommender Systems, GR)中因交错编码物品与行为(item-action)标记而导致的结构和计算效率低下问题。现有方法不仅使序列长度翻倍、引入二次复杂度,还因异构标记混杂导致注意力机制产生噪声并难以捕捉物品到行为的因果依赖关系。其核心解决方案在于提出两种新型架构:基于注意力的晚期融合行为建模(Attention-based Late Fusion for Actions, AttnLFA)与基于注意力的混合值池化(Attention-based Mixed Value Pooling, AttnMVP),二者通过显式建模 i_n→a_n 的因果依赖关系,消除交错依赖以降低序列复杂度50%,同时保持Transformer的表达能力。实验表明,这两种方法在大规模社交网络商品推荐数据上显著优于传统交错基线模型,在评价损失上分别提升0.29%和0.80%,且训练时间减少23%和12%,验证了显式建模物品-行为因果性的优越性。
链接: https://arxiv.org/abs/2603.10369
作者: Hailing Cheng
机构: Linkedin Inc(领英公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures, submitted to KDD 2026
Abstract:Generative Recommender Systems (GR) increasingly model user behavior as a sequence generation task by interleaving item and action tokens. While effective, this formulation introduces significant structural and computational inefficiencies: it doubles sequence length, incurs quadratic overhead, and relies on implicit attention to recover the causal relationship between an item and its associated action. Furthermore, interleaving heterogeneous tokens forces the Transformer to disentangle semantically incompatible signals, leading to increased attention noise and reduced representation quality. In this work, we propose a principled reformulation of generative recommendation that aligns sequence modeling with underlying causal structures and attention theory. We demonstrate that current interleaving mechanisms act as inefficient proxies for similarity-weighted action pooling. To address this, we introduce two novel architectures that eliminate interleaved dependencies to reduce sequence complexity by 50%: Attention-based Late Fusion for Actions (AttnLFA) and Attention-based Mixed Value Pooling (AttnMVP). These models explicitly encode the i_n \rightarrow a_n causal dependency while preserving the expressive power of Transformer-based sequence modeling. We evaluate our framework on large-scale product recommendation data from a major social network. Experimental results show that AttnLFA and AttnMVP consistently outperform interleaved baselines, achieving evaluation loss improvements of 0.29% and 0.80%, and significant gains in Normalized Entropy (NE). Crucially, these performance gains are accompanied by training time reductions of 23% and 12%, respectively. Our findings suggest that explicitly modeling item-action causality provides a superior design paradigm for scalable and efficient generative ranking.
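论文指出交错机制实质上是"相似度加权的行为池化"的低效代理。下面用注意力权重对行为向量做加权汇聚,给出这种池化的一个最小示意;它只演示思想本身,与 AttnLFA/AttnMVP 的具体网络结构无关,所有数值与维度均为示例假设。

```python
import math

def softmax(xs):
    """数值稳定的 softmax:先减去最大值再取指数。"""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def action_pool(query, item_keys, action_values):
    """以 query 与各 item 向量的内积作注意力权重,对对应 action 向量加权求和。"""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    weights = softmax([dot(query, k) for k in item_keys])
    dim = len(action_values[0])
    return [sum(w * v[d] for w, v in zip(weights, action_values)) for d in range(dim)]
```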
[IR-13] Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers
【速读】:该论文旨在解决当前生成式推理重排序模型(reasoning rerankers)在提升检索相关性的同时,其公平性表现尚不明确的问题。研究者通过系统性比较推理与非推理重排序模型在多个检索场景和人口属性上的公平性差异,发现推理机制本身并不显著改善或损害公平性——以注意力加权排名公平性(Attention-Weighted Rank Fairness, AWRF)为指标,其值稳定在0.33–0.35之间,而相关性指标(nDCG)则从0.247到1.000波动。关键发现是:无论是否采用推理机制,地理属性相关的公平性差距依然存在,表明当前模型对输入排名的公平特性具有继承性,未来应聚焦于设计具备公平感知能力的推理模型以实现更优的公平表现。
链接: https://arxiv.org/abs/2603.10332
作者: Saron Samuel,Benjamin Van Durme,Eugene Yang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 17 pages
Abstract:While reasoning rerankers, such as Rank1, have demonstrated strong abilities in improving ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings demonstrate that reasoning neither improves nor harms fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF), remained stable (0.33-0.35) across all models, even as relevance varied substantially (nDCG 0.247-1.000). Demographic breakdown analysis revealed fairness gaps for geographic attributes regardless of model architecture. These results indicate that future work in specializing reasoning models to be aware of fairness attributes could lead to improvements, as current implementations preserve the fairness characteristics of their input ranking.
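文中的公平性指标 AWRF 将按位置注意力加权后的群体曝光分布与目标分布进行比较。下面按这一思想给出一个示意实现;其中位置衰减函数取 1/log2(rank+1)、得分用 1−½·L1 距离归一化,均为本文示例假设,并非 TREC 官方定义。

```python
import math

def awrf_sketch(group_per_rank, target):
    """group_per_rank: 排名自上而下每个文档的群体标签;target: 群体 -> 目标曝光占比。
    返回值越接近 1,表示加权曝光分布越贴近目标分布(归一化方式为示例假设)。"""
    weights = [1.0 / math.log2(r + 2) for r in range(len(group_per_rank))]
    total = sum(weights)
    exposure = {}
    for w, g in zip(weights, group_per_rank):
        exposure[g] = exposure.get(g, 0.0) + w / total
    l1 = sum(abs(exposure.get(g, 0.0) - p) for g, p in target.items())
    return 1.0 - 0.5 * l1
```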
[IR-14] MITRA: An AI Assistant for Knowledge Retrieval in Physics Collaborations NEURIPS2025
【速读】:该论文旨在解决大型科学合作项目(如CERN的CMS实验)中内部文档数量庞大且结构复杂,导致研究人员难以高效获取所需信息的问题,从而影响知识共享和科研进展。其解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的本地化系统MITRA,该系统采用自动化流程通过Selenium从内部数据库提取文档,并结合光学字符识别(Optical Character Recognition, OCR)与版面解析实现高保真文本抽取;同时,通过两级向量数据库架构先定位相关分析任务再聚焦具体文档,有效解决不同分析间的语义歧义问题;整个框架(包括嵌入模型和大语言模型)均部署在本地,保障敏感数据隐私。
链接: https://arxiv.org/abs/2603.09800
作者: Abhishikth Mallampalli,Sridhara Dasu
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
备注: Accepted at NeurIPS 2025 Machine Learning for the Physical Sciences workshop and Lepton Photon conference 2025 (Computing AI/ML track)
Abstract:Large-scale scientific collaborations, such as the Compact Muon Solenoid (CMS) at CERN, produce a vast and ever-growing corpus of internal documentation. Navigating this complex information landscape presents a significant challenge for both new and experienced researchers, hindering knowledge sharing and slowing down the pace of scientific discovery. To address this, we present a prototype of MITRA, a Retrieval-Augmented Generation (RAG) based system, designed to answer specific, context-aware questions about physics analyses. MITRA employs a novel, automated pipeline using Selenium for document retrieval from internal databases and Optical Character Recognition (OCR) with layout parsing for high-fidelity text extraction. Crucially, MITRA’s entire framework, from the embedding model to the Large Language Model (LLM), is hosted on-premise, ensuring that sensitive collaboration data remains private. We introduce a two-tiered vector database architecture that first identifies the relevant analysis from abstracts before focusing on the full documentation, resolving potential ambiguities between different analyses. We demonstrate the prototype’s superior retrieval performance against a standard keyword-based baseline on realistic queries and discuss future work towards developing a comprehensive research agent for large experimental collaborations.
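MITRA 的两级向量库先用摘要定位相关分析,再在该分析的完整文档内检索。下面用词项重合度代替向量相似度,勾勒这一两级流程;数据结构与打分方式均为示例假设,并非 MITRA 的实际实现。

```python
def two_tier_search(query_terms, analyses):
    """analyses: {分析名: {"abstract": 词项集合, "chunks": [(词项集合, 文本), ...]}}
    返回 (选中的分析名, 其中最相关的文档块文本)。"""
    def overlap(a, b):
        return len(set(a) & set(b))
    # 第一级:按摘要相似度选出最相关的分析,消解不同分析间的歧义
    best = max(analyses, key=lambda n: overlap(query_terms, analyses[n]["abstract"]))
    # 第二级:只在该分析的文档块内继续检索
    top_chunk = max(analyses[best]["chunks"], key=lambda c: overlap(query_terms, c[0]))
    return best, top_chunk[1]
```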
人机交互
[HC-0] Task-Aware Delegation Cues for LLM Agents ICIP
【速读】:该论文旨在解决人机协作中因信息不对称导致的脆弱性问题,即用户缺乏针对特定任务的可靠性判断依据,而代理(agent)通常无法提供校准后的不确定性估计或推理过程。其解决方案的关键在于提出一个任务感知的协作信号层,该层将离线偏好评估转化为在线、面向用户的委托(delegation)原语;具体包括:通过语义聚类构建可解释的任务分类体系,并据此生成(i)能力画像(Capability Profiles)——即任务条件下的胜率映射图,以及(ii)协调风险提示(Coordination-Risk Cues)——即任务条件下的分歧概率(平局率)先验。这些信号驱动了一个闭环委托协议,支持共同基础验证、自适应路由(主代理 vs. 主代理+审计者)、显式推理披露和隐私保护的责任日志,从而将原本不透明的系统默认委托机制转变为可见、可协商且可审计的合作决策过程。
链接: https://arxiv.org/abs/2603.11011
作者: Xingrui Gu
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted by CHI'26 Workshop on Developing Standards and Documentation For LLM Use as Simulated Research Participants
Abstract:LLM agents increasingly present as conversational collaborators, yet human–agent teamwork remains brittle due to information asymmetry: users lack task-specific reliability cues, and agents rarely surface calibrated uncertainty or rationale. We propose a task-aware collaboration signaling layer that turns offline preference evaluations into online, user-facing primitives for delegation. Using Chatbot Arena pairwise comparisons, we induce an interpretable task taxonomy via semantic clustering, then derive (i) Capability Profiles as task-conditioned win-rate maps and (ii) Coordination-Risk Cues as task-conditioned disagreement (tie-rate) priors. These signals drive a closed-loop delegation protocol that supports common-ground verification, adaptive routing (primary vs. primary+auditor), explicit rationale disclosure, and privacy-preserving accountability logs. Two predictive probes validate that task typing carries actionable structure: cluster features improve winner prediction accuracy and reduce difficulty prediction error under stratified 5-fold cross-validation. Overall, our framework reframes delegation from an opaque system default into a visible, negotiable, and auditable collaborative decision, providing a principled design space for adaptive human–agent collaboration grounded in mutual awareness and shared accountability.
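能力画像与协调风险提示分别对应任务条件下的胜率与平局率。从成对比较数据统计这两个量的过程可示意如下;数据字段名与聚合方式为示例假设,并非论文的原始流程。

```python
def capability_profile(comparisons):
    """comparisons: [(task, model_a, model_b, outcome)],outcome ∈ {"a", "b", "tie"}。
    返回 {(task, model): (win_rate, tie_rate)}。"""
    stats = {}  # (task, model) -> [wins, ties, total]
    for task, a, b, outcome in comparisons:
        for m in (a, b):
            stats.setdefault((task, m), [0, 0, 0])[2] += 1
        if outcome == "tie":
            stats[(task, a)][1] += 1
            stats[(task, b)][1] += 1
        else:
            stats[(task, a if outcome == "a" else b)][0] += 1
    return {k: (w / n, t / n) for k, (w, t, n) in stats.items()}
```

胜率图可直接作为委托时的能力提示,平局率则作为"两个模型难分高下"的协调风险先验。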
[HC-1] World Mouse: Exploring Interactions with a Cross-Reality Cursor
【速读】:该论文旨在解决扩展现实(Extended Reality, XR)系统中人与混合现实场景交互困难的问题,尤其针对当前“自然”输入方式的局限性——触控受限于人体活动范围和疲劳感,而凝视(gaze)则缺乏对精细操作所需的精度。解决方案的关键在于提出World Mouse,一种跨现实(cross-reality)光标系统,它将熟悉的二维桌面鼠标重新诠释为适用于复杂三维场景的交互工具。其核心机制包括:基于表面法向量(surface normals)的物体内部交互以实现精准定位,以及通过插值技术实现物体间的空域导航;同时,该方案利用语义分割(semantic segmentation)和网格重建(mesh reconstruction)将物理对象视为可交互表面,从而在真实与虚拟环境之间建立无缝交互通道。
链接: https://arxiv.org/abs/2603.10984
作者: Esen K. Tütüncü,Mar Gonzalez-Franco,Khushman Patel,Eric J. Gonzalez
机构: Institute of Neurosciences of the University of Barcelona (巴塞罗那大学神经科学研究所); Google (谷歌)
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 4 figures. CHI '26, April 13-17, 2026, Barcelona, Spain
Abstract:As Extended Reality (XR) systems increasingly map and understand the physical world, interacting with these blended representations remains challenging. The current push for “natural” inputs has its trade-offs: touch is limited by human reach and fatigue, while gaze often lacks the precision for fine interaction. To bridge this gap, we introduce World Mouse, a cross-reality cursor that reinterprets the familiar 2D desktop mouse for complex 3D scenes. The system is driven by two core mechanisms: within-object interaction, which uses surface normals for precise cursor placement, and between-object navigation, which leverages interpolation to traverse empty space. Unlike previous virtual-only approaches, World Mouse leverages semantic segmentation and mesh reconstruction to treat physical objects as interactive surfaces. Through a series of prototypes, including object manipulation and screen-to-world transitions, we illustrate how cross-reality cursors may enable seamless interactions across real and virtual environments.
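World Mouse 的两个核心机制各可用一行几何运算示意:物体内交互沿表面法向量把光标微抬出命中点以避免与网格穿插,物体间导航则用插值跨越空域。偏移量与线性插值形式均为本文示例假设,并非系统的原始实现。

```python
import math

def place_cursor(hit_point, normal, offset=0.01):
    """将光标放在命中点沿单位化法向量外移 offset 处。"""
    n = math.sqrt(sum(c * c for c in normal))
    return [p + offset * c / n for p, c in zip(hit_point, normal)]

def lerp(p, q, t):
    """物体间导航:在两个锚点之间按 t ∈ [0,1] 线性插值。"""
    return [a + t * (b - a) for a, b in zip(p, q)]
```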
[HC-2] Nurture-First Agent Development: Building Domain-Expert AI Agents Through Conversational Knowledge Crystallization
【速读】:该论文试图解决当前基于大语言模型(Large Language Model, LLM)的领域专家智能体(domain-expert AI agents)构建中,将领域专业知识编码视为一次性工程任务所带来的局限性问题。现有主流范式——代码优先开发(code-first development)和提示优先开发(prompt-first development)——均假设知识注入是部署前的离散阶段,这与领域知识本质上具有的隐性(tacit)、个性化及持续演化的特性不匹配。解决方案的关键在于提出“培育优先开发”(Nurture-First Development, NFD),其核心机制是通过结构化对话交互逐步培育智能体,并引入“知识结晶循环”(Knowledge Crystallization Cycle),将操作性对话中分散的知识片段周期性地提炼为结构化、可复用的知识资产,从而实现人机协同演化下的动态知识建模与增强。
链接: https://arxiv.org/abs/2603.10808
作者: Linghao Zhang
机构: Nanjing University of Posts and Telecommunications (NJUPT)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注: 24 pages, 8 figures, 2 tables
Abstract:The emergence of large language model (LLM)-based agent frameworks has shifted the primary challenge in building domain-expert AI agents from raw capability to effective encoding of domain expertise. Two dominant paradigms – code-first development, which embeds expertise in deterministic pipelines, and prompt-first development, which captures expertise in static system prompts – both treat agent construction as a discrete engineering phase preceding deployment. We argue that this sequential assumption creates a fundamental mismatch with the nature of domain expertise, which is substantially tacit, deeply personal, and continuously evolving. We propose Nurture-First Development (NFD), a paradigm in which agents are initialized with minimal scaffolding and progressively grown through structured conversational interaction with domain practitioners. The central mechanism is the Knowledge Crystallization Cycle, whereby fragmented knowledge embedded in operational dialogue is periodically consolidated into structured, reusable knowledge assets. We formalize NFD through: (1) a Three-Layer Cognitive Architecture organizing agent knowledge by volatility and personalization degree; (2) the Knowledge Crystallization Cycle with formal definitions of crystallization operations and efficiency metrics; and (3) an operational framework comprising a Dual-Workspace Pattern and Spiral Development Model. We illustrate the paradigm through a detailed case study on building a financial research agent for U.S. equity analysis and discuss the conditions, limitations, and broader implications of NFD for human-agent co-evolution.
[HC-3] AI-Generated Rubric Interfaces: K-12 Teachers Perceptions and Practices
【速读】:该论文旨在解决K–12教师在教学评估中因缺乏时间与专业支持而难以高效生成高质量评分量规(rubric)的问题。研究提出以生成式AI(Generative AI)辅助生成量规作为解决方案,其关键在于通过教师实际操作AI工具进行提示工程(prompting),从而获得结构清晰、标准明确的初始量规草案,并结合教师的专业判断进行调整优化。实证结果显示,教师普遍认为AI生成的量规虽需人工修订以适配具体教学目标和年级水平,但显著提升了量规制定效率与一致性,尤其在提升模糊标准的可操作性方面效果突出;同时,教师对AI量规工具持条件性采纳态度,强调必须保留教师对量规内容的控制权和灵活性,特别是对指标增删与层级调整的能力。
链接: https://arxiv.org/abs/2603.10773
作者: Bahare Riahi,Sayali Patukale,Joy Niranjan,Yogya Koneru,Tiffany Barnes,Veronica Cateté
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 20 pages, 2 figures
Abstract:This study investigates K–12 teachers’ perceptions and experiences with AI-supported rubric generation during a summer professional development workshop ( n = 25 ). Teachers used this http URL to generate rubrics and practiced prompting to tailor criteria and performance levels. They then applied these rubrics to provide feedback on a sample block-based programming activity, followed by using a chatbot to deliver rubric-based feedback for the same work. Data were collected through pre- and post-workshop surveys, open discussions, and exit tickets. We used thematic analysis to analyze the qualitative data. Teachers reported that they rarely create rubrics from scratch because the process is time-consuming and defining clear distinctions between performance levels is challenging. After hands-on use, teachers described AI-generated rubrics as strong starting drafts that improved structure and clarified vague criteria. However, they emphasized the need for teacher oversight due to generic or grade-misaligned language, occasional misalignment with instructional priorities, and the need for substantial editing. Survey results indicated high perceived clarity and ethical acceptability, moderate alignment with assignments, and usability as the primary weakness – particularly the ability to add, remove, or revise criteria. Open-ended responses highlighted a “strictness-versus-detail” trade-off: AI feedback was often perceived as harsher but more detailed and scalable. As a result, teachers expressed conditional willingness to adopt AI rubric tools when workflows support easy customization and preserve teacher control.
[HC-4] Believing vs. Achieving – The Disconnect between Efficacy Beliefs and Collaborative Outcomes
【速读】:该论文旨在解决人类在与人工智能(Artificial Intelligence, AI)协作过程中,如何基于自身效能信念(efficacy beliefs)做出依赖AI的决策问题。其核心挑战在于,现有研究虽探讨了影响AI依赖的因素,但未充分揭示效能信念如何在不同情境信息下转化为具体的决策判断,并进而影响人机协同效果。解决方案的关键在于通过控制实验(N=240)识别出:效能信念作为稳定的认知锚点,会引发系统性的“AI乐观偏差”;同时发现情境信息的作用具有不对称性——AI性能数据可消除该偏差,而其他类型的信息则放大效能差异对委托决策的影响,从而为设计更有效的协同界面提供实证依据和设计指南。
链接: https://arxiv.org/abs/2603.10708
作者: Philipp Spitzer,Joshua Holstein
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:As artificial intelligence (AI) becomes increasingly integrated into workflows, humans must decide when to rely on AI advice. These decisions depend on general efficacy beliefs, i.e., humans’ confidence in their own abilities and their perceptions of AI competence. While prior work has examined factors influencing AI reliance, the role of efficacy beliefs in shaping collaboration remains underexplored. Through a controlled experiment (N=240) where participants made repeated delegation decisions, we investigate how efficacy beliefs translate into instance-wise efficacy judgments under varying contextual information. Our explorative findings reveal efficacy beliefs as persistent cognitive anchors, leading to systematic “AI optimism”. Contextual information operates asymmetrically: while AI performance information selectively eliminates the AI optimism bias, data or AI information amplify how efficacy discrepancies influence delegation decisions. Although efficacy discrepancies influence delegation behavior, they show weaker effects on human-AI team performance. As these findings challenge transparency-focused approaches, we propose design guidelines for effective collaborative settings.
[HC-5] Proceedings of CHIdeology 2026: CHI Workshop on Disentangling the fragmented politics, values, and imaginaries of Human-Computer Interaction through ideologies
【速读】:该论文旨在解决人机交互(Human-Computer Interaction, HCI)领域中政治立场、价值观念与未来想象日益碎片化的问题,试图通过意识形态分析来厘清HCI研究与实践中的多元取向。其解决方案的关键在于组织首届CHI Workshop on CHIdeology,借助对意识形态的系统性梳理与讨论,促进学界对HCI中隐含价值观和政治逻辑的反思与对话,从而推动该领域在多元背景下形成更清晰、更具批判性的理论与实践框架。
链接: https://arxiv.org/abs/2603.10681
作者: Felix Anand Epp,Matti Nelimarkka,Jesse Haapoja,Pedro Ferreira,Os Keyes,Shaowen Bardzell
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: Website: this https URL
Abstract:This is the Proceedings of the First CHI Workshop on CHIdeology: Disentangling the fragmented politics, values, and imaginaries of Human-Computer Interaction through ideologies, held on Wednesday, 15 April, in Barcelona, Spain, at the ACM CHI Conference on Human Factors in Computing Systems.
[HC-6] A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 驱动的数字人体建模(Digital Human Modelling, DHM)研究中普遍存在的平台依赖性问题,即多数AI-enabled DHM方法紧密耦合于特定平台、任务或解释流程,导致可复现性差、扩展困难且伦理再利用受限。其解决方案的关键在于提出一个平台无关(platform-agnostic)的DHM框架,通过显式分离感知层(sensing)、交互建模(interaction modelling)与推理就绪性(inference readiness)三个核心模块实现结构解耦:具体而言,采用OpenBCI Galea脑电图(EEG)、肌电图(EMG)、眼电图(EOG)、光电容积脉搏波(PPG)及惯性数据的统一多模态传感层,结合基于SuperTux的游戏化交互环境,将生理信号表示为时序对齐的结构化可观测变量,并以计算任务原语和时间戳事件标记建模交互行为,从而支持下游AI方法在伦理审批前提下灵活部署,同时保障跨异构传感器与平台的数据一致性与可扩展性。
链接: https://arxiv.org/abs/2603.10680
作者: Daniel J. Buxton,Mufti Mahmud,Jordan J. Bird,Thomas Hughes-Roberts,David J. Brown
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Digital Human Modelling (DHM) is increasingly shaped by advances in AI, wearable biosensing, and interactive digital environments, particularly in research addressing accessibility and inclusion. However, many AI-enabled DHM approaches remain tightly coupled to specific platforms, tasks, or interpretative pipelines, limiting reproducibility, scalability, and ethical reuse. This paper presents a platform-agnostic DHM framework designed to support AI-ready multimodal interaction research by explicitly separating sensing, interaction modelling, and inference readiness. The framework integrates the OpenBCI Galea headset as a unified multimodal sensing layer, providing concurrent EEG, EMG, EOG, PPG, and inertial data streams, alongside a reproducible, game-based interaction environment implemented using SuperTux. Rather than embedding AI models or behavioural inference, physiological signals are represented as structured, temporally aligned observables, enabling downstream AI methods to be applied under appropriate ethical approval. Interaction is modelled using computational task primitives and timestamped event markers, supporting consistent alignment across heterogeneous sensors and platforms. Technical verification via author self-instrumentation confirms data integrity, stream continuity, and synchronisation; no human-subjects evaluation or AI inference is reported. Scalability considerations are discussed with respect to data throughput, latency, and extension to additional sensors or interaction modalities. Illustrative use cases demonstrate how the framework can support AI-enabled DHM and HCI studies, including accessibility-oriented interaction design and adaptive systems research, without requiring architectural modifications. The proposed framework provides an emerging-technology-focused infrastructure for future ethics-approved, inclusive DHM research.
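该框架以时间戳事件标记对齐异构传感器流。一个最小的对齐示意是将每个事件映射到时间上最近的样本索引(二分查找);数据结构与"取最近样本"的策略均为本文示例假设,并非框架的实际同步机制。

```python
import bisect

def align_events(sample_times, events):
    """sample_times: 升序的采样时间戳列表;events: [(timestamp, label), ...]。
    返回 [(最近样本的索引, label), ...]。"""
    aligned = []
    for t, label in events:
        i = bisect.bisect_left(sample_times, t)
        if i == 0:
            j = 0
        elif i == len(sample_times):
            j = len(sample_times) - 1
        else:
            # 比较左右两个相邻样本,取距离更近者
            j = i if sample_times[i] - t < t - sample_times[i - 1] else i - 1
        aligned.append((j, label))
    return aligned
```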
[HC-7] Terminal Is All You Need: Design Properties for Human-AI Agent Collaboration
【速读】:该论文试图解决的问题是:为何在实践中,终端(terminal-based)工具比图形用户界面(Graphical User Interface, GUI)驱动的AI代理更有效且被广泛采用。解决方案的关键在于识别并论证三个核心设计属性——代理与界面之间的表征兼容性(representational compatibility)、代理行为在交互媒介中的透明度(transparency of agent actions),以及人类参与者的低准入门槛(low barriers to entry),这些属性天然由终端界面实现,并构成高效人-AI-用户界面(Human-AI-UI)协作的基础。论文指出,无论未来采用何种模态(包括GUI或空间接口),都必须有意识地工程化设计以达成这三项属性,而非简单依赖传统终端形式。
链接: https://arxiv.org/abs/2603.10664
作者: Alexandre De Masi
机构: University of Geneva (日内瓦大学)
类目: Human-Computer Interaction (cs.HC)
备注: 8 pages (6 content + 2 references), 6 figures. Accepted as poster at the CHI 2026 Workshop on Human-AI-UI Interactions Across Modalities (CUCHI’26), April 14, 2026, Barcelona, Spain
Abstract:While research on AI agents focuses on enabling them to operate graphical user interfaces, the most effective and widely adopted agent tools in practice are terminal-based. We argue that this convergence is not coincidental. It reflects three design properties central to effective human-AI-UI collaboration: representational compatibility between agent and interface, transparency of agent actions within the interaction medium, and low barriers to entry for human participants. We ground each property in established HCI theory, show how terminal-based tools satisfy them by default, and argue that any modality, including graphical and spatial interfaces, must be deliberately engineered to achieve them. Rather than a legacy artifact, the terminal serves as a design exemplar whose properties any agent-facing modality must replicate.
[HC-8] CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents
【速读】:该论文旨在解决当前计算机使用代理(Computer-Use Agents, CUAs)在多样化桌面环境中部署时,缺乏可扩展且可靠的评估方法的问题。现有评估方式依赖静态基准、规则-based 成功判定或人工检查,存在脆弱性高、成本大且与实际应用场景脱节的缺陷。解决方案的关键在于引入视觉语言模型(Vision-Language Models, VLMs)作为自主审计者(autonomous auditors),直接基于可观测的交互行为对CUA任务完成情况进行判断,并通过大规模元评估(meta-evaluation)系统分析五种VLM在macOS、Windows和Linux环境下的表现,从准确性、置信度校准性和模型间一致性三个维度揭示其局限性,从而为未来部署中考虑评估者可靠性、不确定性及变异性提供依据。
链接: https://arxiv.org/abs/2603.10577
作者: Marta Sumyk,Oleksandr Kosovan
机构: Ukrainian Catholic University (乌克兰天主教大学)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environment by perceiving high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.
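元评估中的"模型间一致性"维度可用两两一致率的平均值来示意;更严格的做法是 Cohen/Fleiss kappa 等校正了随机一致的系数,此处的简化指标为本文示例假设。

```python
def mean_pairwise_agreement(judgments):
    """judgments: {auditor: [每个任务的判定], ...};各列表长度一致。
    返回所有审计模型两两一致率的平均值。"""
    names = list(judgments)
    rates = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            va, vb = judgments[a], judgments[b]
            rates.append(sum(x == y for x, y in zip(va, vb)) / len(va))
    return sum(rates) / len(rates)
```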
[HC-9] Graphing Inline: Understanding Word-scale Graphics Use in Scientific Papers
【速读】:该论文试图解决的问题是:尽管词级图形(word-scale graphics,即在排版尺寸下的可视化装饰)能够增强文本表达并提升理解效率,但科学论文中是否以及如何采用此类图形进行学术传播仍不明确。为填补这一研究空白,作者通过对126,797篇科学论文中提取的909个词级图形进行语料库分析,提出一个框架以系统刻画其位置(positioning)、功能(communicative function)和视觉表现形式(visual representation)。解决方案的关键在于通过实证分析揭示词级图形的使用频率低、图标占主导、且视觉呈现与传播功能密切相关(如用定量图表进行数据注释),从而为未来借助技术与制度创新优化学术传播提供了依据。
链接: https://arxiv.org/abs/2603.10533
作者: Siyu Lu,Yanhan Liu,Shiyu Xu,Ruishi Zou,Chen Ye
机构: Tongji University (同济大学); Independent Researcher (独立研究员)
类目: Human-Computer Interaction (cs.HC)
备注: Conditionally accepted in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI’26)
Abstract:Graphics (e.g., figures and charts) are ubiquitous in scientific papers, yet separating graphics from text increases cognitive load in understanding text-graphic connections. Research has found that word-scale graphics, or visual embellishments at typographic size, can augment original text, making it more expressive and easier to understand. However, whether, if so, how scientific papers adopt word-scale graphics for scholarly communication remains unclear. To address this gap, we conducted a corpus study reviewing 909 word-scale graphics extracted from 126,797 scientific papers. Through analysis, we propose a framework that characterizes where (positioning), why (communicative function), and how (visual representation) authors apply word-scale graphics in scientific papers. Our findings reveal that word-scale graphics are rarely used, that icons dominate visual representation, and that visual representation connects with communicative function (e.g., using quantitative graphs for data annotation). We further discuss opportunities to enhance scholarly communication with word-scale graphics through technical and administrative innovations.
[HC-10] MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
【速读】:该论文旨在解决扩展现实(Extended Reality, XR)中复杂声学环境因声音源混叠而导致用户场景感知能力下降和社交互动受阻的问题。解决方案的关键在于提出一个实时音频-视觉协同分离系统 MoXaRt,其核心是一个级联架构:首先并行执行仅基于音频的粗粒度分离,同时通过视觉检测(如人脸、乐器)定位声源;随后利用这些视觉锚点引导细化网络对每个声音源进行精确分离,从而实现对最多5个并发声源(如2个语音+3个乐器)的有效分离,处理延迟约为2秒。该方法显著提升了语音可懂度并降低了认知负荷,为更具备感知能力和社交智能的XR体验提供了技术支撑。
链接: https://arxiv.org/abs/2603.10465
作者: Tianyu Xu,Sieun Kim,Qianhui Zheng,Ruoyu Xu,Tejasvi Ravi,Anuva Kulkarni,Katrina Passarella-Ward,Junyi Zhu,Adarsh Kowdle
机构: Google(谷歌); University of Michigan (密歇根大学); Columbia University (哥伦比亚大学)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt’s core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
[HC-11] Moving Phones Active Peers: Exploring the Effect of Animated Phones as Facilitators in In-Person Group Discussion
【速读】:该论文试图解决的问题是:在面对面的小组讨论中,智能手机虽作为智能工作站被广泛使用,但其共存状态可能并未有效促进参与者的行为参与度,甚至可能分散注意力,而如何利用智能手机的移动能力来增强人际互动仍缺乏研究。解决方案的关键在于设计一种名为AnimaStand的可动手机支架,通过在不干扰手机常规功能的前提下,使其根据对话动态发出具身化的表达性动画提示(如轻微移动、节奏变化等),从而主动引导群体互动、重新激活沉默成员,并提升小组协作效率与关系质量。该方案基于Tuckman的团队发展阶段理论进行设计,实现了从被动工具到主动交互中介的转变。
链接: https://arxiv.org/abs/2603.10394
作者: Ziqi Pan,Ziqi Liu,Jinhan Zhang,Zeyu Huang,Xiaojuan Ma
机构: The Hong Kong University of Science and Technology (香港科技大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Tsinghua University (清华大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:In today’s in-person group discussions, smartphones are integrated as intelligent workstations; yet given their co-presence in such face-to-face interactions, whether and how they may enhance people’s behavioral engagement with others remains underexplored. This work investigates how animating personal smartphones to move expressively, without compromising regular functions, can transform them into active embodied facilitators for co-located group interaction. In the four-stranger small-group discussion setting, guided by Tuckman’s group-development theory, we conducted a design workshop (n=12) to identify problematic group-work circumstances and design expressive, attention-efficient animated phone facilitations. Subsequently, we developed AnimaStand, a movement-enabled phone stand that animates phones to deliver group facilitation cues according to conversation dynamics. In a between-subjects Wizard-of-Oz study (n=56) with four-stranger group discussions, where everyone’s phone was on an AnimaStand, the facilitations re-engaged inactive members, enhancing group dynamics, task operation performance, and relationships. We finally discuss prospects for more adaptive and generalizable animated device personal facilitation.
[HC-12] Reactive Writers: How Co-Writing with AI Changes How We Engage with Ideas
【速读】:该论文试图解决的问题是:当前对AI写作工具在协同写作中如何影响用户观点形成机制的理解尚不充分,尤其是缺乏对其中行为与流程变化的系统性认知。解决方案的关键在于通过混合方法研究(包括19名参与者的事后访谈和1291次AI协同写作会话的量化分析),揭示出一种新的写作范式——“反应式写作”(Reactive Writing):即写作者优先评估AI建议而非自主完成构思,从而导致其思维方向被AI引导,且这种影响往往未被察觉。这一发现表明,AI辅助写作正重塑传统创作流程,使其更易受AI偏见和观点迁移的影响。
链接: https://arxiv.org/abs/2603.10374
作者: Advait Bhat,Marianne Aubin Le Quéré,Mor Naaman,Maurice Jakesch
机构: University of Washington (华盛顿大学); Princeton University (普林斯顿大学); Cornell Tech (康奈尔科技学院); Bauhaus University (包豪斯大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures, CHI 2026 : ACM CHI Conference on Human Factors in Computing Systems
Abstract:Emerging experimental evidence shows that writing with AI assistance can change both the views people express in writing and the opinions they hold afterwards. Yet, we lack substantive understanding of procedural and behavioral changes in co-writing with AI that underlie the observed opinion-shaping power of AI writing tools. We conducted a mixed-methods study, combining retrospective interviews with 19 participants about their AI co-writing experience with a quantitative analysis tracing engagement with ideas and opinions in 1,291 AI co-writing sessions. Our analysis shows that engaging with the AI’s suggestions – reading them and deciding whether to accept them – becomes a central activity in the writing process, taking away from more traditional processes of ideation and language generation. As writers often do not complete their own ideation before engaging with suggestions, the suggested ideas and opinions seeded directions that writers then elaborated on. At the same time, writers did not notice the AI’s influence and felt in full control of their writing, as they – in principle – could always edit the final text. We term this shift “Reactive Writing”: an evaluation-first, suggestion-led writing practice that departs substantially from conventional composing in the presence of AI assistance and is highly vulnerable to AI-induced biases and opinion shifts.
[HC-13] NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction
【速读】:该论文旨在解决当前语音交互系统在实现始终可用(always-available)、持续且隐蔽的AI语音对话时面临的多重挑战,包括词汇量受限、佩戴舒适性差、对静默或低音量语音(如耳语)识别困难以及环境噪声干扰等问题。其解决方案的关键在于提出NasoVoce——一种集成麦克风与振动传感器的鼻托式接口,可无感佩戴于智能眼镜鼻梁处,同步采集声学信号(由麦克风获取)和骨传导/皮肤传导振动信号(由振动传感器获取)。由于两者特性互补:麦克风提供高质量音频但易受噪声影响,而振动传感器抗噪性强但信号质量较低,通过多模态融合策略有效提升低音量语音(如耳语)的识别准确率与语音质量,从而实现高鲁棒性的无声或轻声语音交互。
链接: https://arxiv.org/abs/2603.10324
作者: Jun Rekimoto,Yu Nishimura,Bojian Yang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: ACM CHI 2026 paper
Abstract:Silent and whispered speech offer promise for always-available voice interaction with AI, yet existing methods struggle to balance vocabulary size, wearability, silence, and noise robustness. We present NasoVoce, a nose-bridge-mounted interface that integrates a microphone and a vibration sensor. Positioned at the nasal pads of smart glasses, it unobtrusively captures both acoustic and vibration signals. The nasal bridge, close to the mouth, allows access to bone- and skin-conducted speech and enables reliable capture of low-volume utterances such as whispered speech. While the microphone captures high-quality audio, it is highly sensitive to environmental noise. Conversely, the vibration sensor is robust to noise but yields lower signal quality. By fusing these complementary inputs, NasoVoce generates high-quality speech robust against interference. Evaluation with Whisper Large-v2, PESQ, STOI, and MUSHRA ratings confirms improved recognition and quality. NasoVoce demonstrates the feasibility of a practical interface for always-available, continuous, and discreet AI voice conversations.
[HC-14] Towards Modeling Situational Awareness Through Visual Attention in Clinical Simulations
【速读】:该论文旨在解决临床急救场景中团队情境意识(Situational Awareness, SA)的动态性与分布性难以量化的问题。传统方法难以捕捉多角色协作过程中视觉注意力的时变特征及其与任务分工的关系。其解决方案的关键在于应用过渡网络分析(Transition Network Analysis, TNA),基于虚拟现实(VR)心脏骤停模拟中40名医护人员的眼动追踪数据,构建关键区域(Areas of Interest, AOIs)间的注视转移网络,并提取熵(entropy)和自环率(self-loop rate)等指标,从而量化个体及团队注意力结构与流动模式。研究发现,不同角色(如CPR、TeamLead)在不同临床阶段会动态调整注意力分配策略,TNA能够有效揭示团队认知功能分化,为开发阶段敏感的分析工具和精准教学干预提供支持。
链接: https://arxiv.org/abs/2603.10308
作者: Haoting Gao,Kapotaksha Das,Mohamed Abouelenien,Michael Cole,James Cooke,Vitaliy Popov
机构: University of Michigan Medical School (密歇根大学医学院); University of Michigan-Dearborn (密歇根大学迪尔伯恩分校)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at the LAK 2026 Workshop on Transition Network Analysis (TNA). Preprint version
Abstract:Situational awareness (SA) is essential for effective team performance in time-critical clinical environments, yet its dynamic and distributed nature remains difficult to characterize. In this preliminary study, we apply Transition Network Analysis (TNA) to model visual attention in multiperson VR-based cardiac arrest simulations. Using eye-tracking data from 40 clinicians assigned to four standardized roles (Airway, CPR, Defib, TeamLead), we construct gaze transition networks between clinically meaningful areas of interest (AOIs) and extract metrics such as entropy and self-loop rate to quantify attentional structure and flow. Our findings reveal that individual and team’s visual attention is dynamically and adaptively redistributed across roles and scenario phases, with those in CPR roles narrowing their focus to execution-critical tasks and those in the TeamLead role concentrating on global monitoring as clinical demands evolve. TNA thus provides a powerful lens for mapping functional differentiation of team cognition and may support the development of phase-sensitive analytics and targeted instructional interventions in acute care training.
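上文 [HC-14] 以过渡网络分析(TNA)从注视序列提取熵与自环率等指标来刻画注意力结构。下面给出一个纯 Python 的最小示意实现;其中指标定义(熵取转移分布的香农熵,自环率取 AOI 保持不变的转移占比)是基于摘要的合理假设,并非论文的官方代码:

```python
from collections import Counter
from math import log2

def transition_metrics(gaze_seq):
    """由注视序列(AOI 标签列表)统计相邻转移,返回 (熵, 自环率)。"""
    transitions = list(zip(gaze_seq, gaze_seq[1:]))
    counts = Counter(transitions)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    # 转移分布的香农熵:值越大,注意力在 AOI 间的流动越分散
    entropy = -sum(p * log2(p) for p in probs)
    # 自环率:起止 AOI 相同的转移占比,反映注视的"停留"倾向
    self_loop_rate = sum(c for (a, b), c in counts.items() if a == b) / total
    return entropy, self_loop_rate

# 示意数据:类似 CPR 角色将注意力集中在 "patient" 与 "monitor" 两个 AOI 上
seq = ["patient", "patient", "monitor", "patient", "patient", "monitor", "monitor"]
h, r = transition_metrics(seq)
```

实际分析中,可按角色与场景阶段分别计算这两个指标,再比较其随临床需求变化的重新分布模式。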
[HC-15] Conversational AI-Enhanced Exploration System to Query Large-Scale Digitised Collections of Natural History Museums
【速读】:该论文旨在解决自然历史博物馆数字化藏品数据因规模庞大且科学复杂性高而难以被公众访问和理解的问题。传统数据库工具依赖关键词搜索或要求用户具备特定数据模式知识,限制了非专业用户的探索能力。解决方案的关键在于设计并实现一个以用户为中心的交互系统,结合可视化地图与自然语言对话代理(Natural Language Conversational Agent),利用当代大语言模型(Large Language Models, LLMs)的功能调用能力,动态从外部API实时检索结构化数据,从而支持用户通过自然语言直接查询近170万条数字化标本记录,并获得精准的馆藏信息响应。这一方法显著提升了大规模、高频更新的博物馆数据集的可访问性和交互效率。
链接: https://arxiv.org/abs/2603.10285
作者: Yiyuan Wang,Andrew Johnston,Zoë Sadokierski,Rhiannon Stephens,Shane T. Ahyong
机构: University of Technology Sydney (悉尼科技大学); Australian Museum (澳大利亚博物馆)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Emerging Technologies (cs.ET)
备注: 25 pages, 9 figures
Abstract:Recent digitisation efforts in natural history museums have produced large volumes of collection data, yet their scale and scientific complexity often hinder public access and understanding. Conventional data management tools, such as databases, restrict exploration through keyword-based search or require specialised schema knowledge. This paper presents a system design that uses conversational AI to query nearly 1.7 million digitised specimen records from the life-science collections of the Australian Museum. Designed and developed through a human-centred design process, the system contains an interactive map for visual-spatial exploration and a natural-language conversational agent that retrieves detailed specimen data and answers collection-specific questions. The system leverages function-calling capabilities of contemporary large language models to dynamically retrieve structured data from external APIs, enabling fast, real-time interaction with extensive yet frequently updated datasets. Our work provides a new approach of connecting large museum collections with natural language-based queries and informs future designs of scientific AI agents for natural history museums.
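[HC-15] 的核心机制是利用 LLM 的函数调用(function calling)把自然语言问题转为对外部 API 的结构化检索。以下为该模式的最小示意:模型产出一个 JSON 形式的调用请求,系统据此分发到注册的工具函数。其中函数名 search_specimens、字段与样例数据均为假设,并非澳大利亚博物馆系统的真实 API:

```python
import json

def search_specimens(taxon: str, limit: int = 3):
    """示意工具:按分类单元检索标本记录。真实系统会在此调用外部 API,
    这里以内置样例数据代替。"""
    records = [
        {"taxon": "Crustacea", "id": "AM-001"},
        {"taxon": "Crustacea", "id": "AM-002"},
        {"taxon": "Aves", "id": "AM-100"},
    ]
    return [r for r in records if r["taxon"] == taxon][:limit]

# 工具注册表:名称到可调用对象的映射,供分发器查找
TOOLS = {"search_specimens": search_specimens}

def dispatch(tool_call_json: str):
    """解析模型产生的函数调用(JSON 字符串)并分发到对应工具。"""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call.get("arguments", {}))

# 模拟 LLM 输出的一次函数调用
result = dispatch('{"name": "search_specimens", "arguments": {"taxon": "Crustacea"}}')
```

检索结果再交还给 LLM 组织成面向用户的自然语言回答,从而实现对大规模、频繁更新数据集的实时交互。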
[HC-16] DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice
【速读】:该论文旨在解决工程分析自动化在产品开发过程中因工具接口、数据格式和流程文档频繁变更而导致的自动化支持失效问题。传统脚本化流水线对输入变化敏感,难以适应工程生态系统的动态演化。解决方案的关键在于提出一种名为DUCTILE(Delegated, User-supervised Coordination of Tool- and document-Integrated LLM-Enabled)的代理编排方法,其核心是将自适应编排由大语言模型(Large Language Model, LLM)代理完成,而确定性执行则由经过验证的工程工具承担;LLM代理负责解析设计规范、检查输入数据并动态调整处理路径,同时由工程师进行监督与最终决策,从而实现对非结构化或偏离预期输入的鲁棒响应,保障结果在方法论上的一致性和正确性。
链接: https://arxiv.org/abs/2603.10249
作者: Alejandro Pradas-Gomez,Arindam Brahma,Ola Isaksson
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 22 pages, including supplemental material. 9 Figures
Abstract:Engineering analysis automation in product development relies on rigid interfaces between tools, data formats and documented processes. When these interfaces change, as they routinely do as the product evolves in the engineering ecosystem, the automation support breaks. This paper presents a DUCTILE (Delegated, User-supervised Coordination of Tool- and document-Integrated LLM-Enabled) agentic orchestration, an approach for developing, executing and evaluating LLM-based agentic automation support of engineering analysis tasks. The approach separates adaptive orchestration, performed by the LLM agent, from deterministic execution, performed by verified engineering tools. The agent interprets documented design practices, inspects input data and adapts the processing path, while the engineer supervises and exercises final judgment. DUCTILE is demonstrated on an industrial structural analysis task at an aerospace manufacturer, where the agent handled input deviations in format, units, naming conventions and methodology that would break traditional scripted pipelines. Evaluation against expert-defined acceptance criteria and deployment with practicing engineers confirm that the approach produces correct, methodologically compliant results across repeated independent runs. The paper discusses practical consequences of adopting agentic automation, including unintended effects on the nature of engineering work and the tension between removing mundane tasks and creating an exhausting supervisory role.
[HC-17] Characterizing Healthy Post-Stroke Neuromotor Behavior During 6D Upper-Limb Isometric Gaming: Implications for Design of End-Effector Rehabilitation Robot Interfaces
【速读】:该论文旨在解决机器人辅助康复中系统设计与用户神经肌肉行为之间复杂交互的挑战,特别是如何通过游戏界面和物理机器人干预促进健康的运动训练,并准确识别病理状态下的神经肌肉行为特征。其关键解决方案包括:(1)利用开放数据集对健康及脑卒中患者在等长轨迹追踪任务中的力输出、肌电活动和游戏表现进行量化分析,揭示任务定义(如约束轴向和用户对指令的理解)显著影响用户行为;(2)发现6维末端执行器力数据中可检测到与病理相关的特征(如力误差和平均力生成差异在p=0.05水平显著不同);(3)提出一种基于隐马尔可夫模型(HMM)的表面肌电信号(sEMG)驱动的神经肌肉行为分类方法,在周期性运动中有效区分健康与脑卒中群体的动态模式,而传统协同分解方法则无法实现此类区分。这些成果为开发能自适应调整策略以促进多样化人群健康运动模式的末端执行器康复机器人提供了理论基础与技术路径。
链接: https://arxiv.org/abs/2603.10173
作者: Ajay Anand,Gabriel Parra,Chad A. Berghoff,Laura A. Hallock
机构: University of Utah (犹他大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Successful robot-mediated rehabilitation requires designing games and robot interventions that promote healthy motor practice. However, the interplay between a given user’s neuromotor behavior, the gaming interface, and the physical robot makes designing system elements – and even characterizing what behaviors are “healthy” or pathological – challenging. We leverage our OpenRobotRehab 1.0 open access data set to assess the characteristics of 13 healthy and 2 post-stroke users’ force output, muscle activations, and game performance while executing isometric trajectory tracking tasks using an end-effector rehabilitation robot. We present an assessment of how subtle aspects of interface design impact user behavior; an analysis of how pathological neuromotor behaviors are reflected in end-effector force dynamics; and a novel hidden Markov model (HMM)-based neuromotor behavior classification method based on surface electromyography (sEMG) signals during cyclic motions. We demonstrate that task specification (including which axes are constrained and how users interpret tracking instructions) shapes user behavior; that pathology-related features are detectable in 6D end-effector force data during isometric task execution (with significant differences between healthy and post-stroke profiles in force error and average force production at p=0.05 ); and that healthy neuromotor strategies are heterogeneous and inherently difficult to characterize. We also show that our HMM-based models discriminate healthy and post-stroke neuromotor dynamics where synergy-based decompositions reflect no such differentiation. Lastly, we discuss these results’ implications for the design of adaptive end-effector rehabilitation robots capable of promoting healthier movement strategies across diverse user populations.
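[HC-17] 用基于 HMM 的方法对周期性运动中的 sEMG 序列分类。其通用思路是:为每一类人群分别拟合一个 HMM,再按观测序列在各模型下的对数似然做模型选择。下面用纯 Python 的前向算法(带缩放)演示这一思路;两个离散 HMM 的参数纯属玩具示例,与论文训练得到的模型无关:

```python
from math import log

def forward_loglik(obs, pi, A, B):
    """离散 HMM 的前向算法(带缩放),返回观测序列的对数似然。
    obs: 观测符号索引序列;pi: 初始分布;A: 转移矩阵;B: 发射矩阵。"""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    s = sum(alpha)
    loglik = log(s)
    alpha = [a / s for a in alpha]
    for t in range(1, len(obs)):
        alpha = [B[j][obs[t]] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
        s = sum(alpha)      # 每步归一化,避免下溢
        loglik += log(s)
        alpha = [a / s for a in alpha]
    return loglik

# 两个玩具模型:模型 0 倾向停留在当前状态,模型 1 倾向频繁切换
A0 = [[0.9, 0.1], [0.1, 0.9]]
A1 = [[0.2, 0.8], [0.8, 0.2]]
B = [[0.8, 0.2], [0.2, 0.8]]
pi = [0.5, 0.5]
seq = [0, 0, 0, 0, 1, 1, 1, 1]  # 长时间保持同一符号的周期样本
pred = 0 if forward_loglik(seq, pi, A0, B) > forward_loglik(seq, pi, A1, B) else 1
```

对于含长停留段的序列,"黏性"模型(A0)的似然应更高;实际应用中模型参数由 Baum-Welch 等算法从各人群的数据中估计。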
[HC-18] Dance2Hesitate: A Multi-Modal Dataset of Dancer-Taught Hesitancy for Understandable Robot Motion
【速读】:该论文旨在解决人机协作中机器人犹豫行为(hesitant motion)的泛化设计难题,其核心挑战在于观察者对机器人犹豫状态的推断高度依赖于机器人形态(embodiment)和具体场景(context)。为应对这一问题,研究提出并开源了一个多模态、由舞者生成的犹豫运动数据集,关键创新在于聚焦特定的“形态-场景”组合:一是机械臂与人类上肢共同接近Jenga塔的交互场景,二是拟人化全身运动在自由空间中的表现。该数据集包含三类轨迹:基于Kinesthetic Teaching的机器人轨迹(66条)、同步RGB-D动作捕捉的舞者上肢轨迹(84条)以及极端犹豫水平下的完整人体轨迹(70条),并通过详尽文档支持跨机器人与人类模态的可复现基准测试,从而为开发具有情境感知能力的通用型犹豫行为表达提供数据基础。
链接: https://arxiv.org/abs/2603.10166
作者: Srikrishna Bangalore Raghu,Anna Soukhovei,Divya Sai Sindhuja Vankineni,Alexandra Bacula,Alessandro Roncone
机构: University of Colorado Boulder(科罗拉多大学博尔德分校); Pacific Lutheran University Tacoma(太平洋路德大学塔科马校区)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: Accepted to the Designing Transparent and Understandable Robots (D-TUR) Workshop at the ACM/IEEE International Conference on Human-Robot Interaction (HRI) 2026, Edinburgh, UK
Abstract:In human-robot collaboration, a robot’s expression of hesitancy is a critical factor that shapes human coordination strategies, attention allocation, and safety-related judgments. However, designing hesitant robot motion that generalizes is challenging because the observer’s inference is highly dependent on embodiment and context. To address these challenges, we introduce and open-source a multi-modal, dancer-generated dataset of hesitant motion where we focus on specific context-embodiment pairs (i.e., manipulator/human upper-limb approaching a Jenga Tower, and anthropomorphic whole body motion in free space). The dataset includes (i) kinesthetic teaching demonstrations on a Franka Emika Panda reaching from a fixed start configuration to a fixed target (a Jenga tower) with three graded hesitancy levels (slight, significant, extreme) and (ii) synchronized RGB-D motion capture of dancers performing the same reaching behavior using their upper limb across three hesitancy levels, plus full human body sequences for extreme hesitancy. We further provide documentation to enable reproducible benchmarking across robot and human modalities. Across all dancers, we obtained 70 unique whole-body trajectories, 84 upper limb trajectories spanning over the three hesitancy levels, and 66 kinesthetic teaching trajectories spanning over the three hesitancy levels. The dataset can be accessed here: this https URL.
[HC-19] oward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中持续存在的幻觉问题,即模型输出在语法上连贯但事实错误或与上下文不一致的现象,这在工程设计、企业资源规划和物联网遥测平台等高风险工业场景中构成显著障碍。解决方案的关键在于无需修改模型权重或构建复杂验证模型的前提下,通过五种提示工程(prompt engineering)策略来降低模型输出的方差并提升结果的可重复性和事实一致性。其中,增强型数据注册表(Enhanced Data Registry, M4)在100次重复测试中全部获得“更好”评价,表明结构化外部知识注入能有效约束生成内容;而迭代相似度收敛(Iterative Similarity Convergence, M1)、单任务代理专业化(Single-Task Agent Specialization, M3)和领域术语注入(Domain Glossary Injection, M5)也展现出显著改善效果。值得注意的是,改进版提示策略(v2)使原本表现最差的分解式模型无关提示(Decomposed Model-Agnostic Prompting, M2)从34%提升至80%,证明了提示设计优化对缓解LLM非确定性行为的重要性。
链接: https://arxiv.org/abs/2603.10047
作者: Brian Freeman,Adam Kicklighter,Matt Erdman,Zach Gordon
机构: Trane Technologies(特雷纳科技公司)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 50 pages, 5 tables, 7 figures
Abstract:Hallucinations in large language models (LLMs) are outputs that are syntactically coherent but factually incorrect or contextually inconsistent. They are persistent obstacles in high-stakes industrial settings such as engineering design, enterprise resource planning, and IoT telemetry platforms. We present and compare five prompt engineering strategies intended to reduce the variance of model outputs and move toward repeatable, grounded results without modifying model weights or creating complex validation models. These methods include: (M1) Iterative Similarity Convergence, (M2) Decomposed Model-Agnostic Prompting, (M3) Single-Task Agent Specialization, (M4) Enhanced Data Registry, and (M5) Domain Glossary Injection. Each method is evaluated against an internal baseline using an LLM-as-Judge framework over 100 repeated runs per method (same fixed task prompt, stochastic decoding at τ = 0.7). Under this evaluation setup, M4 (Enhanced Data Registry) received “Better” verdicts in all 100 trials; M3 and M5 reached 80% and 77% respectively; M1 reached 75%; and M2 was net negative at 34% when compared to single shot prompting with a modern foundation model. We then developed enhanced version 2 (v2) implementations and assessed them on a 10-trial verification batch; M2 recovered from 34% to 80%, the largest gain among the four revised methods. We discuss how these strategies help overcome the non-deterministic nature of LLM results for industrial procedures, even when absolute correctness cannot be guaranteed. We provide pseudocode, verbatim prompts, and batch logs to support independent assessment.
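[HC-19] 的评估指标是每种方法在 100 次重复运行中被 LLM-as-Judge 判为 "Better" 的占比(相对内部基线)。统计本身很简单,可按如下方式汇总;示例数据为虚构,仅演示计算方式,并非论文的原始批处理日志:

```python
from collections import Counter

def better_rate(verdicts):
    """统计判定列表中 "Better" 所占比例。"""
    return Counter(verdicts)["Better"] / len(verdicts)

# 虚构的某方法 100 次重复运行的判定结果
runs = ["Better"] * 80 + ["Worse"] * 15 + ["Tie"] * 5
rate = better_rate(runs)
```

对每种方法各跑 100 次并比较占比,即可得到文中 M1–M5 那样的排名;由于解码是随机的(τ = 0.7),重复次数越多占比估计越稳定。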
[HC-20] A Governance and Evaluation Framework for Deterministic Rule-Based Clinical Decision Support in Empiric Antibiotic Prescribing
【速读】:该论文旨在解决高风险临床情境下经验性抗生素处方中因信息不完整导致的决策难题,如覆盖不足或不必要的升级可能危及患者安全并破坏抗菌药物管理(Antimicrobial Stewardship)。其解决方案的关键在于提出一个将治理机制作为首要设计组件的确定性临床决策支持系统框架,通过明确界定作用范围、弃权条件、推荐许可规则和预期行为,实现透明、可审计且保守的决策支持。该框架将临床决策逻辑与规则驱动的推荐发布机制分离,并采用基于合成机制驱动病例的行为验证协议,以确保系统行为符合预设规则,而非依赖临床效果或预测准确性评估。
链接: https://arxiv.org/abs/2603.10027
作者: Francisco José Gárate,Paloma Chausa,Diego Moreno,Judit López Luque,Vicens Díaz-Brito,Enrique Javier Gómez
机构: Universidad Politécnica de Madrid (马德里理工大学); Hospital Universitario 12 de Octubre (十二月十日大学医院); Parc Sanitari Sant Joan de Deu (圣若翰·德乌大学医院); Institut de Recerca Sant Joan de Déu (圣若翰·德乌研究机构); Centro de Investigación Biomédica en Red, Biomateriales y Nanomedicina (CIBER-BBN) (生物医学研究中心,生物材料与纳米医学中心)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Methodological framework paper describing deterministic rule-based clinical decision support specification and a behavioral evaluation protocol using synthetic mechanism-driven cases. No empirical clinical validation is claimed
Abstract:Empiric antibiotic prescribing in high-risk clinical contexts often requires decision making under conditions of incomplete information, where inappropriate coverage or unjustified escalation may compromise safety and antimicrobial stewardship. While clinical decision-support systems have been proposed to assist in this process, many approaches lack explicit governance and evaluation mechanisms defining scope, abstention conditions, recommendation permissibility, and expected system behavior. This work specifies a governance and evaluation framework for deterministic clinical decision-support systems operating under explicitly constrained scope. Deterministic behavior is adopted to ensure that identical inputs yield identical outputs, supporting transparency, auditability, and conservative decision support in high-risk prescribing contexts. The framework treats governance as a first-class design component, separating clinical decision logic from rule-based mechanisms that determine whether a recommendation may be issued. Explicit abstention, deterministic stewardship constraints, and exclusion rules are formalized as core constructs. The framework defines an evaluation methodology utilizing a fixed set of synthetic, mechanism-driven clinical cases with predefined expected behavior. This validation process focuses on behavioral alignment with specified rules rather than clinical effectiveness, predictive accuracy, or outcome optimization. Within this protocol, abstention is treated as a correct and intended outcome when governance conditions are not satisfied. The proposed framework provides a reproducible approach for specifying, governing, and inspecting deterministic clinical decision-support systems in empiric antibiotic prescribing contexts where transparency, auditability, and conservative behavior are prioritized. 
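[HC-20] 强调将治理规则(是否允许给出建议、何时弃权)与临床决策逻辑分离,并要求同一输入必须产生同一输出。以下最小示意展示这种分层的确定性规则结构;字段名与规则内容均为假设,仅用于说明框架形态,不构成任何临床建议:

```python
def recommend(case):
    """确定性规则示意:治理层先决定能否给出建议(否则弃权),
    临床层再做简单的确定性映射。规则内容纯属虚构。"""
    # 治理层:关键信息缺失或超出预设范围时弃权(弃权被视为正确的预期行为)
    if case.get("allergy_history") is None or case.get("infection_site") is None:
        return {"action": "abstain", "reason": "required field missing"}
    if case["infection_site"] not in {"urinary", "respiratory"}:
        return {"action": "abstain", "reason": "out of scope"}
    # 决策层:同一输入必得同一输出,便于审计与行为验证
    if case["infection_site"] == "urinary" and not case["allergy_history"]:
        return {"action": "recommend", "regimen": "agent-A"}
    return {"action": "recommend", "regimen": "agent-B"}

out1 = recommend({"infection_site": "urinary", "allergy_history": []})
out2 = recommend({"infection_site": "skin", "allergy_history": []})
```

文中的评估协议正是针对这类系统:用预先定义期望行为的合成病例逐条核对实际输出(包括"应当弃权"的病例),验证的是规则一致性而非临床效果。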
[HC-21] Dark Patterns and Consumer Protection Law for App Makers
【速读】:该论文旨在解决在线商业中存在的一种“暗黑模式”(dark patterns)问题,即通过欺骗性用户界面设计损害消费者自主权并扭曲在线市场机制。此类设计可能源于有意误导,也可能因复杂的移动应用开发流程而无意产生。解决方案的关键在于优化“选择架构”(choice architecture)与贯彻透明设计原则,从而帮助开发者规避法律风险、保障用户自主权,并提升用户信任与忠诚度。
链接: https://arxiv.org/abs/2603.10020
作者: Gregory M. Dickinson
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:Dark patterns in online commerce, especially deceptive user interface designs for apps and websites, undermine consumer autonomy and distort online markets. Although sometimes deception is intentional, the complex app development process can also unintentionally produce manipulative user interfaces. This paper discusses common design pitfalls and proposes strategies for app makers to avoid infringing user autonomy or incurring legal liability under emerging principles of consumer protection law. By focusing on choice architecture and transparent design principles, developers can both facilitate compliance and build user trust and loyalty.
[HC-22] Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations
【速读】:该论文试图解决用户在OpenAI弃用GPT-4o后所反映的“模型失去共情能力”这一现象是否真实存在的问题,即验证公众对新一代模型“情感冷漠”的主观感知是否有实证依据。其解决方案的关键在于首次采用临床心理学安全框架对三款模型(GPT-4o、o4-mini、GPT-5-mini)进行系统性评估,涵盖14个高情感挑战场景下的2,100条响应,并从六个心理安全维度评分;同时引入“逐轮轨迹分析”(per-turn trajectory analysis)这一方法论创新,揭示了用户感知中的“共情下降”实为危机识别能力提升与建议安全性下降之间的权衡——具体表现为GPT-5-mini在危机识别上显著优于GPT-4o(p=0.001),但其在中间对话阶段可能过度干预(如对未成年人自伤情境的回应更频繁且强度更高),这种变化在传统聚合评分中不可见,却直接影响脆弱用户的体验与安全。
链接: https://arxiv.org/abs/2603.09997
作者: Michael Keeman,Anastasia Keeman
机构: Keido Labs(Keido 实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 17 pages, 7 figures. First empirical measurement of the #keep4o phenomenon using clinical psychological safety frameworks. Compares GPT-4o, o4-mini, and GPT-5-mini on empathy, crisis detection, and advice safety dimensions
Abstract:When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had “lost their empathy.” No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically-grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p < 0.001). Per-turn trajectory analysis – a novel methodological contribution – reveals these shifts are sharpest during mid-conversation crisis moments invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as “lost empathy” was a shift from a cautious model that missed crises to an alert model that sometimes says too much – a trade-off with real consequences for vulnerable users, currently invisible to both the people who feel it and the developers who create it.
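[HC-22] 的跨模型比较采用 Kruskal-Wallis 检验(如共情维度 H=4.33、危机识别 H=13.88)。下面给出 H 统计量(含并列值校正)的纯 Python 参考实现,便于复现同类三组比较;显著性仍需将 H 对照自由度为组数减一的卡方分布查 p 值:

```python
def kruskal_wallis_h(*groups):
    """计算 Kruskal-Wallis H 统计量(并列值取平均秩并做校正)。"""
    data = sorted((x, g) for g, grp in enumerate(groups) for x in grp)
    n = len(data)
    ranks = [0.0] * n
    ties = 0
    i = 0
    while i < n:                       # 并列值分组:赋平均秩并累计校正项
        j = i
        while j < n and data[j][0] == data[i][0]:
            j += 1
        avg = (i + j + 1) / 2.0        # 1-based 秩 i+1..j 的平均
        for k in range(i, j):
            ranks[k] = avg
        t = j - i
        ties += t ** 3 - t
        i = j
    rank_sum = [0.0] * len(groups)
    for (x, g), r in zip(data, ranks):
        rank_sum[g] += r
    h = 12.0 / (n * (n + 1)) * sum(
        rank_sum[g] ** 2 / len(grp) for g, grp in enumerate(groups)
    ) - 3 * (n + 1)
    correction = 1.0 - ties / float(n ** 3 - n)
    return h / correction if correction else 0.0
```

将三个模型在某一安全维度上的逐条评分作为三组输入即可;该检验不假设评分服从正态分布,适合这类有序评分数据。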
[HC-23] G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition INTERSPEECH2026
【速读】:该论文旨在解决长时多说话人语音中存在重叠对话时的时戳化、说话人归属的自动语音识别(ASR)问题,核心挑战在于分块推理过程中需保持会议级说话人身份一致性,同时生成带时间戳和说话人标签的转录文本。以往的Speech-LLM系统往往在局部说话人辨识(local diarization)与全局标签分配(global labeling)之间权衡,难以捕捉细粒度的时间边界并实现跨分块的身份稳定关联。解决方案的关键在于提出G-STAR,一个端到端系统,其通过耦合一个时序感知的说话人追踪模块(time-aware speaker-tracking module)与Speech-LLM转录主干,使追踪模块提供带有时间锚定的结构化说话人提示(cues),并由大语言模型基于这些提示生成带说话人归属的文本;该架构支持组件级优化与联合端到端训练,从而在异构监督信号和领域偏移下实现灵活学习。
链接: https://arxiv.org/abs/2603.10468
作者: Jing Peng,Ziyi Chen,Haoyu Li,Yucheng Wang,Duo Ma,Mengtian Li,Yunfan Du,Dezhu Xu,Kai Yu,Shuai Wang
机构: Nanjing University (南京大学); Shanghai Jiao Tong University (上海交通大学); Central Media Technology Institute, Huawei (华为中央媒体技术研究院); Shenzhen Research Institute of Big Data (深圳大数据研究院); ETH Zürich (苏黎世联邦理工学院)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD)
备注: submitted to Interspeech 2026
Abstract:We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, but often lack the ability to capture fine-grained temporal boundaries or robust cross-chunk identity linking. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local versus long-context trade-offs and hierarchical objectives.
计算机视觉
[CV-0] LiTo: Surface Light Field Tokenization ICLR2026
【速读】:该论文旨在解决现有方法在重建三维几何结构或预测与视角无关的漫反射外观时,难以准确捕捉真实世界中依赖视角的视觉效应(如镜面高光和菲涅尔反射)的问题。其解决方案的关键在于提出一种统一的三维潜在表示(3D latent representation),通过将RGB-D图像视为表面光场(surface light field)的随机采样,并将其编码为一组紧凑的潜在向量,从而在同一个三维潜在空间中联合建模物体几何与视点相关的外观特性。该表示能够有效还原复杂光照下的视点依赖效应,并进一步结合潜在流匹配模型(latent flow matching model)从单张输入图像中学习该表示的分布,实现与输入图像光照和材质一致的三维物体生成。
链接: https://arxiv.org/abs/2603.11047
作者: Jen-Hao Rick Chang,Xiaoming Zhao,Dorian Chan,Oncel Tuzel
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: ICLR 2026; Project page: this https URL
Abstract:We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.
[CV-1] Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation
【速读】:该论文旨在解决传统热成像技术在三维材料属性定量重建中的局限性,特别是基于像素级一维近似的热成像方法因忽略横向热扩散而导致精度不足,以及软约束型物理信息神经网络(Physics-Informed Neural Networks, PINNs)在瞬态扩散场景中因梯度刚性而失效的问题。解决方案的关键在于提出一种可微分物理框架——神经场热断层成像(Neural Field Thermal Tomography, NeFTY),其将三维导热系数场参数化为连续的神经场,并通过严格的数值求解器进行优化;该方法利用可微分物理求解器将热力学定律作为硬约束强制执行,同时保持高分辨率三维断层成像所需的内存效率,从而有效缓解逆向热传导问题中的谱偏差和病态性,实现任意尺度下亚表面缺陷的精确恢复。
链接: https://arxiv.org/abs/2603.11045
作者: Tao Zhong,Yixun Hu,Dongzhe Zheng,Aditya Sood,Christine Allen-Blanchette
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
备注: 27 pages, 15 figures
Abstract:We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel-wise 1D approximations that neglect lateral diffusion, and soft-constrained Physics-Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high-resolution 3D tomography. Our discretize-then-optimize paradigm effectively mitigates the spectral bias and ill-posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at this https URL
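The discretize-then-optimize idea behind NeFTY hinges on a numerical heat solver whose every step is differentiable, so gradients can flow back to the diffusivity field. The forward piece alone can be sketched as an explicit finite-difference step of the 2D heat equation with spatially varying diffusivity; the grid size, time step, boundary handling, and the "defect" setup below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def heat_step(T, alpha, dt=0.1, dx=1.0):
    """One explicit finite-difference step of dT/dt = alpha * laplacian(T)
    with a spatially varying diffusivity field alpha. Only interior points
    are updated; boundary values stay fixed."""
    lap = np.zeros_like(T)
    lap[1:-1, 1:-1] = (T[2:, 1:-1] + T[:-2, 1:-1] + T[1:-1, 2:]
                       + T[1:-1, :-2] - 4.0 * T[1:-1, 1:-1]) / dx**2
    return T + dt * alpha * lap

# A hot spot diffusing over a plate with a low-diffusivity "defect" patch.
T = np.zeros((32, 32)); T[16, 16] = 100.0
alpha = np.full((32, 32), 1.0)
alpha[8:12, 8:12] = 0.1  # hypothetical subsurface defect
for _ in range(50):
    T = heat_step(T, alpha)
print(T.max() < 100.0, T.min() >= 0.0)
```

With `dt * alpha / dx**2 <= 0.25` the update is a convex combination of neighboring values, so the scheme stays stable; rewriting the same loop with an autodiff framework is what allows fitting `alpha` to surface temperature measurements.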
[CV-2] Agentar-Fin-OCR
【速读】:该论文旨在解决金融领域文档(如超长PDF)在结构化解析过程中面临的挑战,包括复杂版式、跨页结构不连续性以及单元格级引用能力不足等问题。其核心解决方案在于提出Agentar-Fin-OCR系统,关键创新点包括:(1) 引入跨页内容整合算法(Cross-page Contents Consolidation)以恢复跨页连续性,并通过文档级标题层级重建模块(Document-level Heading Hierarchy Reconstruction, DHR)构建全局一致的目录树(Table of Contents, TOC),支持结构感知检索;(2) 设计难度自适应课程学习训练策略与CellBBoxRegressor模块,利用结构锚定标记从解码器隐藏状态中直接定位表格单元格,无需外部检测器,从而提升表格解析精度。该方法显著优于现有技术,在OmniDocBench和新提出的FinDocBench基准上均展现出优异性能。
链接: https://arxiv.org/abs/2603.11044
作者: Siyi Qian,Xiongfei Bai,Bingtao Fu,Yichen Lu,Gaoyang Zhang,Xudong Yang,Peng Zhang
机构: Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents, transforming ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and cell-level referencing capability, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning training strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments demonstrate that our model shows high performance on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that includes six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.
[CV-3] V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
【速读】:该论文旨在解决现有文本到音乐(text-to-music)模型在生成与视频事件时间对齐的音乐时缺乏细粒度时间控制的问题。其解决方案的关键在于提出一种零样本(zero-pair)视频到音乐(video-to-music, V2M)生成方法——V2M-Zero,该方法不依赖跨模态配对数据或交叉训练,而是通过预训练的音乐和视频编码器分别计算各自模态内的事件曲线(event curves),这些曲线基于模态内相似性度量,能够捕捉到音乐与视频共享的时间结构特征。由于这些事件曲线在不同模态中具有可比性,因此可以将训练好的文本到音乐模型微调至音乐事件曲线上,并在推理阶段直接替换为视频事件曲线,从而实现高质量、高同步性的音乐生成,显著优于依赖成对数据的基线方法。
链接: https://arxiv.org/abs/2603.11042
作者: Yan-Bo Lin,Jonah Casebeer,Long Mai,Aniruddha Mahapatra,Gedas Bertasius,Nicholas J. Bryan
机构: Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
备注: Project page: this https URL
Abstract:Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at this https URL
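The event curves described above measure "when and how much change occurs" within a single modality. One natural reading, sketched here as an assumption rather than the paper's exact formulation, is one minus the cosine similarity between consecutive per-frame (or per-chunk) embeddings:

```python
import numpy as np

def event_curve(embeddings):
    """Per-step change magnitude from a (T, D) array of per-frame
    embeddings: 1 - cosine similarity between consecutive rows.
    High values mark moments of large intra-modal change."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - np.sum(e[:-1] * e[1:], axis=1)

# Toy "video": a constant embedding, then an abrupt visual event.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
curve = event_curve(emb)
print(curve)  # near zero except a spike at the transition
```

Because such curves live in a modality-agnostic "amount of change over time" space, a model fine-tuned on music-event curves can plausibly accept video-event curves at inference, which is the zero-pair substitution the abstract describes.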
[CV-4] DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving
【速读】:该论文旨在解决自动驾驶决策中因缺乏对世界动态演变的细粒度建模而导致的决策质量不足问题。现有方法如Textual CoT(文本思维链)缺乏时空感知能力,而Visual CoT(视觉思维链)则因密集图像预测引入冗余信息,难以实现高效且物理合理的决策。解决方案的关键在于提出DynVLA模型,其核心创新是引入一种新的思维链范式——Dynamics CoT(动力学思维链),通过预预测紧凑的世界动力学表示来提升决策的物理合理性与准确性。具体而言,DynVLA设计了一个动力学分词器(Dynamics Tokenizer),将未来演化压缩为少量动力学token,并区分自车中心(ego-centric)与环境中心(environment-centric)的动力学,从而更精确地建模交互密集场景下的世界动态。该方法在多个基准数据集上显著优于传统CoT方法,验证了Dynamics CoT在性能和效率上的优势。
链接: https://arxiv.org/abs/2603.11041
作者: Shuyao Shang,Bing Zhan,Yunfei Yan,Yuqi Wang,Yingyan Li,Yasong An,Xiaoman Wang,Jierui Liu,Lu Hou,Lue Fan,Zhaoxiang Zhang,Tieniu Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 18 pages, 10 figures
Abstract:We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
[CV-5] Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style
【速读】:该论文旨在解决视觉语言模型(Visual Language Models, VLMs)在艺术风格预测中的可解释性问题,即明确其决策机制是否与艺术史学家所使用的艺术风格判断标准相一致。解决方案的关键在于采用潜在空间分解方法(latent-space decomposition approach),识别驱动艺术风格预测的核心概念,并通过定量评估、因果分析及艺术史专家的主观判断,系统验证这些概念的语义一致性与相关性。研究发现,73%提取的概念被艺术史学家认为具有语义清晰且视觉上连贯的特征,90%用于预测特定作品风格的概念被判定为相关,从而揭示了VLMs在艺术理解中既具备人类可解释性又表现出形式化推理能力的潜力。
链接: https://arxiv.org/abs/2603.11024
作者: Marvin Limpijankit,Milad Alshomary,Yassin Oulad Daoud,Amith Ananthram,Tim Trombley,Elias Stengel-Eskin,Mohit Bansal,Noam M. Elcott,Kathleen McKeown
机构: Columbia University (哥伦比亚大学); University of Texas at Austin (得克萨斯大学奥斯汀分校); UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 12 figures
Abstract:VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs’ ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might “understand” a concept in more formal terms, such as dark/light contrasts.
[CV-6] Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity CVPR2026
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成模型在生成真实风格图像时存在的颜色失真问题,即现有评估范式(如人工评分和偏好训练指标)倾向于偏好色彩饱和度和对比度过度增强的图像,导致生成结果虽视觉上鲜明但缺乏真实感。解决方案的关键在于提出一个名为Color Fidelity Dataset (CFD) 的大规模数据集和对应的Color Fidelity Metric (CFM),其中CFD包含超过130万张具有有序颜色真实度水平的实拍与合成图像,而CFM则利用多模态编码器学习感知颜色保真度;此外,论文还设计了一种无需训练的Color Fidelity Refinement (CFR) 方法,通过自适应调节生成过程中的时空引导尺度来提升颜色真实性。CFD支持CFM进行客观评估,其学习到的注意力机制进一步指导CFR优化T2I生成质量,形成一套从评估到改进的渐进式框架。
链接: https://arxiv.org/abs/2603.10990
作者: Zhengyao Fang,Zexi Jia,Yijia Zhong,Pengcheng Luo,Jinchao Zhang,Guangming Lu,Jun Yu,Wenjie Pei
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳); Peng Cheng Laboratory(鹏城实验室); College of Computer Science and Aritificial Intelligence, Fudan University(复旦大学计算机科学技术学院); Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR2026
Abstract:Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at this https URL.
[CV-7] GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在计数任务中持续存在的幻觉问题,即其计数准确率显著低于其他视觉推理任务(除情感识别外),即使在最先进的具备推理能力的VLMs中这一现象依然存在。解决方案的关键在于提出GroundCount框架,通过引入基于目标检测模型(Object Detection Models, ODMs,如YOLO)的显式空间定位信息对VLM进行增强,从而缓解计数幻觉。该方法采用提示驱动(prompt-based)的空间接地策略,在不增加复杂度的前提下提升计数准确性(最高达81.3%,较原模型提升6.6个百分点),同时减少推理时间(降低22%),并揭示了位置编码和结构化提示在跨模型性能提升中的核心作用,表明计数失败源于VLM固有的空间-语义融合局限,而非架构特异性缺陷。
链接: https://arxiv.org/abs/2603.10978
作者: Boyuan Chen,Minghao Shao,Siddharth Garg,Ramesh Karri,Muhammad Shafique
机构: Tandon School of Engineering, New York University, NY, USA; eBRAIN Lab, Division of Engineering, New York University Abu Dhabi, UAE
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2–7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
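The prompt-based augmentation above amounts to serializing detector output into a structured text block that grounds the VLM's counting. A minimal sketch of that idea follows; the function name, prompt template, and detection format are hypothetical (GroundCount's actual interface is not published in the abstract), and confidence scores are deliberately omitted, matching the ablation finding that they add noise for most architectures:

```python
def grounding_prompt(detections, question, include_positions=True):
    """Format object-detector output (label, bbox) into a structured
    grounding section prepended to the counting question. Positional
    encoding is the component found to help stronger VLMs but hurt
    weaker ones."""
    lines = []
    for label, (x, y, w, h) in detections:
        if include_positions:
            lines.append(f"- {label} at (x={x}, y={y}, w={w}, h={h})")
        else:
            lines.append(f"- {label}")
    return "Detected objects:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"

dets = [("car", (10, 20, 50, 30)), ("car", (80, 22, 48, 31)), ("dog", (5, 60, 20, 18))]
print(grounding_prompt(dets, "How many cars are in the image?"))
```

Toggling `include_positions` per model capability mirrors the paper's observation that the same augmentation can help or hurt depending on the VLM.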
[CV-8] VCR: Variance-Driven Channel Recalibration for Robust Low-Light Enhancement
【速读】:该论文旨在解决现有基于sRGB和HSV颜色空间的低光照图像增强(Low-Light Image Enhancement, LLIE)方法中存在的亮度与色彩耦合问题,以及HVI颜色空间中因通道级不一致性导致的颜色分布错位、增强结果不自然的问题。其解决方案的关键在于提出一种名为Variance-Driven Channel Recalibration (VCR) 的新框架,该框架包含两个核心模块:Channel Adaptive Adjustment (CAA) 模块通过方差引导的特征滤波机制,强化模型对高亮度和高色度分布区域的关注;Color Distribution Alignment (CDA) 模块则在颜色特征空间中强制执行分布对齐,从而提升低光条件下的感知质量。
链接: https://arxiv.org/abs/2603.10975
作者: Zhixin Cheng,Fangwen Zhang,Xiaotian Yin,Baoqun Yin,Haodian Wang
机构: University of Science and Technology of China (中国科学技术大学); CHN Energy Digital Intelligence Technology Development (Beijing) Co., LTD. (中国能源数字智能技术发展(北京)有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most sRGB-based LLIE methods suffer from entangled luminance and color, while the HSV color space offers insufficient decoupling at the cost of introducing significant red and black noise artifacts. Recently, the HVI color space has been proposed to address these limitations by enhancing color fidelity through chrominance polarization and intensity compression. However, existing methods could suffer from channel-level inconsistency between luminance and chrominance, and misaligned color distribution may lead to unnatural enhancement results. To address these challenges, we propose the Variance-Driven Channel Recalibration for Robust Low-Light Enhancement (VCR), a novel framework for low-light image enhancement. VCR consists of two main components, including the Channel Adaptive Adjustment (CAA) module, which employs variance-guided feature filtering to enhance the model’s focus on regions with high intensity and color distribution. And the Color Distribution Alignment (CDA) module, which enforces distribution alignment in the color feature space. These designs enhance perceptual quality under low-light conditions. Experimental results on several benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance compared with existing methods.
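One way to picture the variance-guided feature filtering in the CAA module is as a per-channel recalibration that amplifies channels with high spatial variance (rich structure) and suppresses flat ones. The following is a generic sketch under that reading, with a simple min-max weighting scheme chosen for illustration; it is not the paper's actual module:

```python
import numpy as np

def variance_channel_recalibration(feat, eps=1e-6):
    """Reweight a (C, H, W) feature map by per-channel spatial variance,
    min-max normalized to [0, 1]: high-variance channels are kept,
    near-constant channels are attenuated."""
    var = feat.reshape(feat.shape[0], -1).var(axis=1)
    w = (var - var.min()) / (var.max() - var.min() + eps)
    return feat * w[:, None, None]

rng = np.random.default_rng(0)
feat = np.stack([rng.normal(0.0, 1.0, (8, 8)),   # structured channel
                 np.full((8, 8), 0.5)])          # flat channel
out = variance_channel_recalibration(feat)
print(out.shape)
```

In a trained network such weights would typically be produced by a small learned gate rather than a fixed min-max rule; the sketch only conveys the variance-driven selection idea.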
[CV-9] Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI MICCAI2026
【速读】:该论文旨在解决医学影像领域中基础模型(Foundation Models, FMs)在多中心、非独立同分布(non-IID)数据环境下进行联邦微调时面临的两大挑战:一是单中心数据微调易导致性能下降和模型偏差,二是传统联邦学习方法在大规模模型上通信开销大且难以适应数据异质性。解决方案的关键在于提出 Med-DualLoRA 框架,其核心创新是通过加性分解将局部低秩适配(Low-Rank Adaptation, LoRA)模块解耦为全局共享与本地私有两部分:仅将全局 LoRA 模块上传并聚合,而本地 LoRA 保持私有,从而在保障个性化性能的同时显著降低通信成本;实验表明,仅微调两个 Transformer 块即可维持性能并进一步提升效率,验证了该方法在真实临床约束下的可扩展性和有效性。
链接: https://arxiv.org/abs/2603.10967
作者: Joan Perramon-Llussà,Amelia Jiménez-Sánchez,Grzegorz Skorupko,Fotis Avgoustidis,Carlos Martín-Isla,Karim Lekadir,Polyxeni Gkontra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures. Submitted to MICCAI 2026
Abstract:Foundation models (FMs) show great promise for robust downstream performance across medical imaging tasks and modalities, including cardiac magnetic resonance (CMR), following task-specific adaptation. However, adaptation using single-site data may lead to suboptimal performance and increased model bias, while centralized fine-tuning on clinical data is often infeasible due to privacy constraints. Federated fine-tuning offers a privacy-preserving alternative; yet conventional approaches struggle under heterogeneous, non-IID multi-center data and incur substantial communication overhead when adapting large models. In this work, we study federated FM fine-tuning for 3D CMR disease detection and propose Med-DualLoRA, a client-aware parameter-efficient fine-tuning (PEFT) federated framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Global and local LoRA modules are trained locally, but only the global component is shared and aggregated across sites, keeping local adapters private. This design improves personalization while significantly reducing communication cost, and experiments show that adapting only two transformer blocks preserves performance while further improving efficiency. We evaluate our method on a multi-center state-of-the-art cine 3D CMR FM fine-tuned for disease detection using ACDC and combined M\Ms datasets, treating each vendor as a federated client. Med-DualLoRA achieves statistically significant improved performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines, while maintaining communication efficiency. Our approach provides a scalable solution for local federated adaptation of medical FMs under realistic clinical constraints.
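The additive decomposition at the core of Med-DualLoRA, a frozen weight plus a globally shared LoRA update and a private local one, with only the global factors averaged across sites, can be sketched in NumPy. Shapes, the FedAvg-style mean, and all names below are illustrative assumptions:

```python
import numpy as np

def effective_weight(W, lora_global, lora_local, alpha=1.0):
    """W + alpha * (B_g @ A_g + B_l @ A_l): frozen backbone weight plus
    additive global (shared) and local (private) low-rank updates."""
    Bg, Ag = lora_global
    Bl, Al = lora_local
    return W + alpha * (Bg @ Ag + Bl @ Al)

def aggregate_global(client_globals):
    """FedAvg-style mean over the *global* LoRA factors only; local
    adapters never leave their client."""
    Bs, As = zip(*client_globals)
    return np.mean(Bs, axis=0), np.mean(As, axis=0)

d, r = 6, 2  # toy hidden size and LoRA rank
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))
clients = [((rng.normal(size=(d, r)), rng.normal(size=(r, d))),   # global part
            (rng.normal(size=(d, r)), rng.normal(size=(r, d))))   # local part
           for _ in range(3)]
Bg, Ag = aggregate_global([g for g, _ in clients])
# Each client composes the shared global adapter with its own local one.
W_eff = effective_weight(W, (Bg, Ag), clients[0][1])
print(W_eff.shape)
```

Communicating only the two small factors `(B_g, A_g)` per adapted block, rather than full weights, is what keeps the communication cost low.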
[CV-10] Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition
【速读】:该论文旨在解决视频质量对视频分类性能产生显著影响的问题,尤其是在低质量视频(如模糊视频)上分类准确率下降明显的情况。解决方案的关键在于提出一种基于自监督学习的视频视觉Transformer模型(SSL-V3),该模型融合了无参考视频质量评估(No-reference VQA)机制,通过联合自监督学习(Combined-SSL)机制将视频质量评分作为调节因子直接作用于视频分类特征图,并利用分类任务的监督信号反向优化VQA模块参数,从而缓解视频数据集中VQA标签稀缺导致的质量评分不准确问题,最终提升视频分类鲁棒性。
链接: https://arxiv.org/abs/2603.10965
作者: Jian Sun,Mohammad H. Mahoor
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 figures, 10 tables,
Abstract:Video quality significantly affects video classification. We found this problem when we classified Mild Cognitive Impairment well from clear videos, but worse from blurred ones. From then, we realized that referring to Video Quality Assessment (VQA) may improve video classification. This paper proposed Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to fulfill the goal. SSL-V3 leverages Combined-SSL mechanism to join VQA into video classification and address the label shortage of VQA, which commonly occurs in video datasets, making it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL takes video quality score as a factor to directly tune the feature map of the video classification. Then, the score, as an intersected point, links VQA and classification, using the supervised classification task to tune the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in the I-CONECT (a facial video-involved healthcare dataset), verifying SSL-V3’s effectiveness.
[CV-11] Pointy - A Lightweight Transformer for Point Cloud Foundation Models ICLR2025
【速读】:该论文旨在解决当前点云基础模型(foundation models for point cloud data)普遍依赖大规模跨模态监督信号(如图像、文本)导致的训练成本高、数据冗余等问题,同时探索在有限数据下实现高性能点云表示学习的可能性。其解决方案的关键在于提出一种轻量级基于Transformer的点云架构,仅使用39k个点云样本进行训练,却在多个基准上超越了训练样本超过200k甚至百万级的更大模型。该方法通过精心设计的训练设置与架构优化(如无需Tokenizer的直接点云处理),显著提升了训练效率与性能表现,验证了高质量数据集和简洁网络结构对点云基础模型发展的关键作用。
链接: https://arxiv.org/abs/2603.10963
作者: Konrad Szafer,Marek Kraft,Dominik Belter
机构: Poznan University of Technology (波兹南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear in the proceedings of ACIVS 2025. An earlier version was presented at the SCI-FM workshop at ICLR 2025
Abstract:Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver competitive results to more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at this https URL.
[CV-12] Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors
【速读】:该论文旨在解决变分自编码器(Variational Autoencoder, VAE)中常见的后验坍缩(posterior collapse)问题,即潜在变量失去信息性、近似后验退化为先验的现象。传统方法通常通过架构约束或超参数调优来避免坍缩,但效果有限且依赖特定条件。本文提出一种根本性解决方案:利用高斯混合模型(Gaussian Mixture Model, GMM)聚类的多重性,引入历史共识训练(Historical Consensus Training)——一种迭代选择机制,在交替优化与筛选过程中逐步精炼一组候选GMM先验。其核心创新在于,通过满足多个不同聚类约束训练出的模型会形成一个“历史屏障”(historical barrier),即在参数空间中保持稳定的区域,即使后续仅用单一目标函数训练也不会坍缩。理论证明该屏障排除了坍缩解,并实验证明该方法在合成数据和真实世界数据上均能稳定生成非坍缩表示,且不依赖于特定正则化强度或解码器方差,适用于任意神经网络架构。
链接: https://arxiv.org/abs/2603.10935
作者: Zegu Zhang,Jian Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures
Abstract:Variational autoencoders (VAEs) frequently suffer from posterior collapse, where latent variables become uninformative and the approximate posterior degenerates to the prior. Recent work has characterized this phenomenon as a phase transition governed by the spectral properties of the data covariance matrix. In this paper, we propose a fundamentally different approach: instead of avoiding collapse through architectural constraints or hyperparameter tuning, we eliminate the possibility of collapse altogether by leveraging the multiplicity of Gaussian mixture model (GMM) clusterings. We introduce Historical Consensus Training, an iterative selection procedure that progressively refines a set of candidate GMM priors through alternating optimization and selection. The key insight is that models trained to satisfy multiple distinct clustering constraints develop a historical barrier – a region in parameter space that remains stable even when subsequently trained with a single objective. We prove that this barrier excludes the collapsed solution, and demonstrate through extensive experiments on synthetic and real-world datasets that our method achieves non-collapsed representations regardless of decoder variance or regularization strength. Our approach requires no explicit stability conditions (e.g., \sigma'^2 < \lambda_\max ) and works with arbitrary neural architectures. The code is available at this https URL.
[CV-13] Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD
【速读】:该论文旨在解决生成式 AI (Generative AI) 在口腔颌面锥形束CT(CBCT)报告生成中应用受限的问题,其核心挑战在于高质量成对CBCT影像与报告数据稀缺以及CBCT图像的复杂三维解读难度。解决方案的关键在于构建了一个名为CBCTRepD的双语口腔颌面CBCT报告生成系统,并通过收集约7,408例覆盖55种口腔疾病实体的高质量成对数据集进行训练,同时设计了一个基于临床场景的多层次评估框架,综合自动指标与放射科医生及临床医生的评价,验证了该系统在独立生成报告和放射科医生-AI协同写作中的优越性能,尤其在减少漏诊、提升报告标准化程度及辅助不同经验水平的放射科医生改善诊断质量方面展现出显著价值。
链接: https://arxiv.org/abs/2603.10933
作者: Qinxin Wu,Fucheng Niu,Hengchuan Zhu,Yifan Sun,Ye Shen,Xu Li,Han Wu,Leqi Liu,Zhiwen Pan,Zuozhu Liu,Fudong Zhu,Bin Feng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative AI has advanced rapidly in medical report generation; however, its application to oral and maxillofacial CBCT reporting remains limited, largely because of the scarcity of high-quality paired CBCT-report data and the intrinsic complexity of volumetric CBCT interpretation. To address this, we introduce CBCTRepD, a bilingual oral and maxillofacial CBCT report-generation system designed for integration into routine radiologist-AI co-authoring workflows. We curated a large-scale, high-quality paired CBCT-report dataset comprising approximately 7,408 studies, covering 55 oral disease entities across diverse acquisition settings, and used it to develop the system. We further established a clinically grounded, multi-level evaluation framework that assesses both direct AI-generated drafts and radiologist-edited collaboration reports using automatic metrics together with radiologist- and clinician-centered evaluation. Using this framework, we show that CBCTRepD achieves superior report-generation performance and produces drafts with writing quality and standardization comparable to those of intermediate radiologists. More importantly, in radiologist-AI collaboration, CBCTRepD provides consistent and clinically meaningful benefits across experience levels: it helps novice radiologists improve toward intermediate-level reporting, enables intermediate radiologists to approach senior-level performance, and even assists senior radiologists by reducing omission-related errors, including clinically important missed lesions. By improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions, CBCTRepD shows strong and reliable potential as a practical assistant for real-world CBCT reporting across multi-level care settings.
[CV-14] Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
【速读】:该论文旨在解决持续模仿学习(lifelong imitation learning)中面临的挑战,即在现实的记忆和数据约束下,如何实现跨任务的策略持续优化与知识积累。传统方法依赖经验回放(experience replay),但难以高效存储和复用多模态信息。本文提出一种全新的框架,其核心在于将视觉、语言和机器人状态信息压缩到一个紧凑的多模态潜在空间(multimodal latent space)中进行存储与再利用,从而实现高效的跨任务知识迁移。此外,通过引入增量特征调整机制(incremental feature adjustment mechanism),基于角度间隔约束对任务嵌入进行正则化,有效保持任务间的区分性并稳定适应过程。这一方案显著提升了LIBERO基准上的性能,在AUC指标上提升10–17点,且遗忘率降低高达65%。
链接: https://arxiv.org/abs/2603.10929
作者: Fanqi Yu,Matteo Tiezzi,Tommaso Apicella,Cigdem Beyan,Vittorio Murino
机构: AI for Good (AIGO), Istituto Italiano di Tecnologia, Genoa, Italy; Department of Computer Science, University of Verona, Verona, Italy; DITEN, University of Genoa, Genoa, Italy; PAVIS, Istituto Italiano di Tecnologia, Genoa, Italy
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot’s state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: this https URL.
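The angular margin constraint on task embeddings can be pictured as a hinge penalty that fires whenever two task embeddings are closer in cosine angle than a margin, pushing tasks apart as new ones arrive. The loss form and margin value below are assumptions for illustration, not the paper's exact objective:

```python
import numpy as np

def angular_margin_penalty(task_embs, margin=0.5):
    """Sum of max(0, cos_sim - margin) over distinct task pairs:
    penalizes any two task embeddings whose angle is smaller than
    arccos(margin), preserving inter-task distinctiveness."""
    e = task_embs / np.linalg.norm(task_embs, axis=1, keepdims=True)
    cos = e @ e.T
    n = len(task_embs)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += max(0.0, cos[i, j] - margin)
    return total

# Orthogonal task embeddings incur no penalty; near-duplicates do.
print(angular_margin_penalty(np.eye(3)))  # 0.0
print(angular_margin_penalty(np.array([[1.0, 0.0], [0.9, 0.1]])) > 0.0)
```

Adding such a term to the imitation objective regularizes how new task embeddings settle relative to those of earlier tasks, which is the stabilization role the abstract describes.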
[CV-15] Novel Architecture of RPA In Oral Cancer Lesion Detection
【速读】:该论文旨在解决口腔癌病变早期准确检测的难题,以提升诊断与治疗的有效性。其解决方案的关键在于优化RPA的实现架构:通过引入单例设计模式(Singleton design pattern)和批处理(batch processing)技术,显著提升了预测效率——OC-RPAv2相比传统RPA方法实现了60–100倍的速度提升,从而增强了系统的可扩展性和成本效益。
链接: https://arxiv.org/abs/2603.10928
作者: Revana Magdy,Joy Naoum,Ali Hamdi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate and early detection of oral cancer lesions is crucial for effective diagnosis and treatment. This study evaluates two RPA implementations, OC-RPAv1 and OC-RPAv2, using a test set of 31 images. OC-RPAv1 processes one image per prediction in an average of 0.29 seconds, while OC-RPAv2 employs a Singleton design pattern and batch processing, reducing prediction time to just 0.06 seconds per image. This represents a 60-100x efficiency improvement over standard RPA methods, showcasing that design patterns and batch processing can enhance scalability and reduce costs in oral cancer detection.
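摘要中提到的单例模式与批处理加速思路,可以用如下 Python 草图说明:模型只加载一次,并对整批图像做一次向量化预测。其中 PillModel 类及其权重均为示意性假设,与论文的实际系统无关。

```python
import numpy as np

class PillModel:
    """Singleton: the (hypothetical) model is loaded once and reused,
    instead of being re-created for every prediction call."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.weights = np.ones(4)  # stand-in for an expensive model load
        return cls._instance

    def predict_batch(self, images):
        # batch processing: one vectorized call covers all images at once
        X = np.asarray(images, dtype=float).reshape(len(images), -1)
        return X @ self.weights

m = PillModel()
assert m is PillModel()  # every construction returns the same loaded instance
scores = m.predict_batch([[1, 2, 3, 4], [5, 6, 7, 8]])
```

单例避免了逐次预测重复加载模型的开销,批处理则把多次小调用合并为一次矩阵运算,二者正是摘要所述加速的来源。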
[CV-16] S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
【速读】:该论文旨在解决当前3D表示方法在稀疏输入下难以实现高质量重建的问题,特别是点云(point cloud)和3D高斯溅射(3D Gaussian Splatting, 3DGS)各自存在的非真实感渲染与稀疏输入下显著退化问题。其解决方案的关键在于提出一种名为Sparse to Dense lifting (S2D) 的新框架,该框架包含两个核心组件:一是基于单步扩散模型的稀疏点云升维机制,用于修复图像伪影并提升重建保真度;二是结合随机采样丢弃与加权梯度的重建策略,以增强从稀疏视角到密集新视角的场景一致性建模能力。实验表明,S2D 在不同稀疏程度下均能实现最优的新视角生成一致性与顶级的稀疏视图重建质量,从而在现有方法中实现了最低的输入要求。
链接: https://arxiv.org/abs/2603.10893
作者: Yuzhou Ji,Qijian Tian,He Zhu,Xiaoqi Jiang,Guangzhi Cao,Lizhuang Ma,Yuan Xie,Xin Tan
机构: Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Chery Automobile (奇瑞汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts a sparse point cloud for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D-consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity. By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.
[CV-17] Bilevel Layer-Positioning LoRA for Real Image Dehazing CVPR2026
【速读】:该论文旨在解决基于学习的实时图像去雾方法在多样真实雾霾场景中面临的适应性挑战,这些问题主要源于缺乏有效的无监督机制来处理未标注数据,以及全模型微调带来的高昂计算成本。解决方案的关键在于提出一种“ haze-to-clear text-directed loss ”,利用CLIP(Contrastive Language–Image Pretraining)的跨模态能力,将去雾任务重构为潜在空间中的语义对齐问题,从而在无参考图像的情况下提供明确的无监督跨模态引导;同时引入双层层定位LoRA(Bilevel Layer-positioning LoRA, BiLaLoRA)策略,联合学习LoRA参数并自动搜索最优注入层位置,实现对网络关键层的精准适配。
链接: https://arxiv.org/abs/2603.10872
作者: Yan Zhang,Long Ma,Yuxin Feng,Zhe Huang,Fan Zhou,Zhuo Su
机构: Sun Yat-sen University (中山大学); National Engineering Research Center of Digital Life (数字生命国家工程研究中心); Dalian University of Technology (大连理工大学); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP’s cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which learns the LoRA parameters and automatically searches for the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate our superiority over state-of-the-art methods on multiple real-world dehazing benchmarks. The code is publicly available at this https URL.
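BiLaLoRA 建立在标准 LoRA 更新 y = xW + α·xAB 之上(W 冻结,A、B 为低秩可训练因子)。下面是通用 LoRA 前向的最小示意,并非论文的双层层定位实现;将其中一个因子初始化为零可使适配器在训练起点不改变原模型输出。

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Generic LoRA update: y = x W + alpha * (x A) B.
    W (d_in x d_out) stays frozen; only the low-rank factors
    A (d_in x r) and B (r x d_out) would be trained."""
    return x @ W + alpha * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2
W = rng.standard_normal((d_in, d_out))
A = np.zeros((d_in, r))                  # one zero factor => adapter starts as a no-op
B = rng.standard_normal((r, d_out))
x = rng.standard_normal((1, d_in))
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

论文的贡献在于不只学习 A、B,还在双层优化中自动搜索该更新应注入哪些层;上面的草图只覆盖单层的基本形式。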
[CV-18] Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长上下文场景中出现的视觉退化(visual fading)问题,即随着文本序列长度增加,模型对视觉 token 的注意力逐渐减弱,导致生成文本脱离视觉约束。其核心解决方案是提出一种跨模态距离不变的位置编码机制(Inter-modal Distance Invariant Position Encoding, DIPE),该机制通过解耦模态内与模态间的位置编码:保留模态内交互的相对位置结构以维持局部特征一致性,同时为跨模态交互引入锚定感知邻近性,从而消除因模态间距离增大而导致的注意力惩罚,确保视觉信号在任意上下文长度下均保持感知一致性。
链接: https://arxiv.org/abs/2603.10863
作者: Lin Chen,Bolin Ni,Qi Yang,Zili Wang,Kun Ding,Ying Wang,Houwen Peng,Shiming Xiang
机构: Tencent Hunyuan Team (腾讯混元团队); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at this https URL.
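DIPE 的核心思想(模态内保留真实相对位置、模态间夹到固定锚距)可以用一个相对距离矩阵的草图来说明。以下实现仅为示意,锚距取值与具体函数形式均为假设,论文的精确公式请以原文为准。

```python
import numpy as np

def dipe_relative_distance(positions, modality, anchor=1):
    """Relative-offset matrix: intra-modal pairs keep their true offset,
    every inter-modal pair is clamped to a signed constant anchor.
    Illustrative sketch of the DIPE idea, not the paper's exact formula."""
    pos = np.asarray(positions)
    mod = np.asarray(modality)
    rel = pos[None, :] - pos[:, None]      # rel[i, j] = pos[j] - pos[i]
    inter = mod[None, :] != mod[:, None]   # query and key from different modalities
    return np.where(inter, np.sign(rel) * anchor, rel)
```

这样,视觉与文本 token 之间的有效距离不随上下文变长而增大,从而消除旋转位置编码对跨模态注意力的距离惩罚。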
[CV-19] UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis
【速读】:该论文旨在解决乳腺超声诊断中现有方法依赖端到端预测或仅提供弱相关证据的问题,这些问题可能导致忽略细粒度病变特征,并限制临床可审计性和审查能力。为契合临床诊断流程并提升证据可追溯性,作者提出一种分层多智能体框架 UltrasoundAgents:主智能体首先定位病灶并触发裁剪与缩放操作,子智能体在局部视图中预测四个临床相关的属性(即回声模式、钙化、边界类型和边缘形态),随后主智能体整合这些结构化属性进行基于证据的推理,输出 BI-RADS 分类及恶性概率,并生成可审核的中间证据。方案关键在于引入解耦渐进式训练策略——先独立训练属性智能体,再以“真值属性”训练主智能体以学习稳健的属性驱动推理,最后通过空间监督的校正轨迹自蒸馏构建高质量训练轨迹,从而实现稳定且可部署的端到端策略。
链接: https://arxiv.org/abs/2603.10852
作者: Yali Zhu,Kang Zhou,Dingbang Wu,Gaofeng Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI-RADS category and determine benignity or malignancy. Many existing methods rely on end-to-end prediction or provide only weakly grounded evidence, which can miss fine-grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi-agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop-and-zoom operation. A sub-agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence-based reasoning and output the BI-RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi-agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning, yielding a deployable end-to-end policy. Experiments show consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.
[CV-20] On the Reliability of Cue Conflict and Beyond
【速读】:该论文旨在解决当前基于风格化(stylization)的cue-conflict基准在评估神经网络形状-纹理偏好时存在的不稳定性与模糊性问题,具体表现为:风格化方法难以可靠地构建感知上有效且可分离的视觉线索、无法控制线索间的相对信息量、比率型偏差指标掩盖了绝对线索敏感性,以及局限于预选类别导致模型预测被决策空间偏倚所扭曲。这些问题容易混淆偏好与线索有效性、线索平衡性和可识别性伪影。解决方案的关键在于提出REFINED-BIAS——一个集成的数据集与评估框架,通过明确定义形状(shape)和纹理(texture)并构造人类与模型均能识别的平衡线索对,结合基于排序的指标在整个标签空间中测量线索特异性敏感性,从而实现更公平的跨模型比较、更忠实的形状-纹理偏差诊断和更清晰的实证结论。
链接: https://arxiv.org/abs/2603.10834
作者: Pum Jun Kim,Seung-Ah Lee,Seongho Park,Dongyoon Han,Jaejun Yoo
机构: Ulsan National Institute of Science and Technology (蔚山科学技术院); Hanyang University (汉阳大学); NAVER AI Lab (NAVER人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Shape-Texture Bias, Cue Conflict Benchmark
Abstract:Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model- recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.
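摘要中“在整个标签空间上用基于排序的指标度量线索敏感性”的思路,可以用平均排名做一个草图:排名越接近 1,表示模型对该线索越敏感。具体指标形式为假设,未必与论文定义一致。

```python
import numpy as np

def cue_sensitivity(probs, cue_labels):
    """Ranking-based cue sensitivity over the full label space:
    the mean rank (1 = top) that the cue's class receives.
    Hypothetical form; the paper's metric may differ in detail."""
    ranks = []
    for p, y in zip(np.asarray(probs, dtype=float), cue_labels):
        order = np.argsort(-p)             # classes sorted by predicted score
        ranks.append(int(np.where(order == y)[0][0]) + 1)
    return float(np.mean(ranks))
```

对同一批 cue-conflict 图像,分别以形状类别和纹理类别作为 cue_labels 计算,即可得到两种线索各自的绝对敏感性,而非只看二者的比值。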
[CV-21] Evaluating Few-Shot Pill Recognition Under Visual Domain Shift
【速读】:该论文旨在解决真实场景中药物不良事件(Adverse Drug Events, ADEs)频发的问题,通过开发一种在复杂视觉条件下仍具鲁棒性的少样本(few-shot)药丸识别系统来提升用药安全。其解决方案的关键在于采用两阶段目标检测框架:首先在大规模数据上进行基础训练,随后利用每类仅1、5或10个标注样本进行少样本微调,从而实现对新药丸类别的快速适应;研究发现,尽管分类性能在单样本下即可达到饱和,但定位精度和召回率在重叠与遮挡等挑战性条件下显著下降,表明训练数据的真实性(如包含多药丸、杂乱场景)是提升低样本场景泛化能力的核心因素,凸显了少样本微调在部署准备中的诊断价值。
链接: https://arxiv.org/abs/2603.10833
作者: W. I. Chu,G. Tarroni,L. Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures. Submitted to IEEE Engineering in Medicine and Biology Conference (EMBC) 2026
Abstract:Adverse drug events are a significant source of preventable harm, which has led to the development of automated pill recognition systems to enhance medication safety. Real-world deployment of these systems is hindered by visually complex conditions, including cluttered scenes, overlapping pills, reflections, and diverse acquisition environments. This study investigates few-shot pill recognition from a deployment-oriented perspective, prioritizing generalization under realistic cross-dataset domain shifts over architectural innovation. A two-stage object detection framework is employed, involving base training followed by few-shot fine-tuning. Models are adapted to novel pill classes using one, five, or ten labeled examples per class and are evaluated on a separate deployment dataset featuring multi-object, cluttered scenes. The evaluation focuses on classification-centric and error-based metrics to address heterogeneous annotation strategies. Findings indicate that semantic pill recognition adapts rapidly with few-shot supervision, with classification performance reaching saturation even with a single labeled example. However, stress testing under overlapping and occluded conditions demonstrates a marked decline in localization and recall, despite robust semantic classification. Models trained on visually realistic, multi-pill data consistently exhibit greater robustness in low-shot scenarios, underscoring the importance of training data realism and the diagnostic utility of few-shot fine-tuning for deployment readiness.
[CV-22] BALD-SAM: Disagreement-based Active Prompting in Interactive Segmentation
【速读】:该论文旨在解决交互式图像分割中如何高效引导用户进行空间提示(spatial prompting)的问题,尤其是在真实标注流程中,标注者需基于模型输出反复调整提示以消除歧义。传统方法依赖人工视觉判断掩码质量来选择下一个提示位置,效率低且主观性强。其核心解决方案是提出“主动提示”(active prompting),即一种基于空间主动学习的策略,将图像中的区域视为未标记池,通过模型预测的不确定性来优先选择信息量最大的区域作为下一轮提示点。关键创新在于BALD-SAM框架——它将贝叶斯主动学习中基于分歧度量(Bayesian Active Learning by Disagreement, BALD)的思想引入到提示选择中,量化了模型对不同区域的认知不确定性(epistemic uncertainty)。为实现这一目标,作者冻结整个Segment Anything Model (SAM) 并仅在小型可学习预测头中应用贝叶斯不确定性建模,从而使得针对大型基础模型的不可行不确定性估计变得可行。实验表明,BALD-SAM在16个跨领域数据集上表现卓越,显著优于人类和Oracle提示策略,并在复杂结构物体上明显优于单次提示基线。
链接: https://arxiv.org/abs/2603.10828
作者: Prithwijit Chowdhury,Mohit Prabhushankar,Ghassan AlRegib
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The Segment Anything Model (SAM) has revolutionized interactive segmentation through spatial prompting. While existing work primarily focuses on automating prompts in various settings, real-world annotation workflows involve iterative refinement where annotators observe model outputs and strategically place prompts to resolve ambiguities. Current pipelines typically rely on the annotator’s visual assessment of the predicted mask quality. We postulate that a principled approach for automated interactive prompting is to use a model-derived criterion to identify the most informative region for the next prompt. In this work, we establish active prompting: a spatial active learning approach where locations within images constitute an unlabeled pool and prompts serve as queries to prioritize information-rich regions, increasing the utility of each interaction. We further present BALD-SAM: a principled framework adapting Bayesian Active Learning by Disagreement (BALD) to spatial prompt selection by quantifying epistemic uncertainty. To do so, we freeze the entire model and apply Bayesian uncertainty modeling only to a small learned prediction head, making intractable uncertainty estimation practical for large multi-million parameter foundation models. Across 16 datasets spanning natural, medical, underwater, and seismic domains, BALD-SAM demonstrates strong cross-domain performance, ranking first or second on 14 of 16 benchmarks. We validate these gains through a comprehensive ablation suite covering 3 SAM backbones and 35 Laplace posterior configurations, amounting to 38 distinct ablation settings. Beyond strong average performance, BALD-SAM surpasses human prompting and, in several categories, even oracle prompting, while consistently outperforming one-shot baselines in final segmentation quality, particularly on thin and structurally complex objects.
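BALD 分数即互信息 I = H[E_w p(y|x,w)] - E_w H[p(y|x,w)],用平均预测的熵减去各后验样本熵的平均来量化认知不确定性。下面是基于蒙特卡洛后验样本的标准实现草图,与 BALD-SAM 的具体提示选择流程无关。

```python
import numpy as np

def bald_score(mc_probs, eps=1e-12):
    """BALD = H[E_w p(y|x,w)] - E_w H[p(y|x,w)]: predictive entropy of
    the averaged prediction minus the average per-sample entropy.
    High values mark epistemically uncertain inputs."""
    p = np.asarray(mc_probs, dtype=float)  # shape: (posterior samples, classes)
    mean_p = p.mean(axis=0)
    pred_entropy = -np.sum(mean_p * np.log(mean_p + eps))
    mean_entropy = -np.sum(p * np.log(p + eps), axis=1).mean()
    return pred_entropy - mean_entropy
```

后验样本彼此一致时分数趋近 0;样本间分歧越大分数越高,对应的空间位置即为下一个提示点的优先候选。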
[CV-23] A dataset of medication images with instance segmentation masks for preventing adverse drug events
【速读】:该论文旨在解决药物识别中因现实场景复杂性(如多药重叠、光照变化和遮挡)导致的误识别问题,从而降低用药错误(medication errors)和不良药物事件(adverse drug events, ADEs)对患者安全的风险。其解决方案的关键在于构建了一个名为MEDISEG的新型实例分割数据集,包含32种不同药片在8262张图像中的精细标注,涵盖从单药到杂乱药盒等多种真实场景条件,并通过YOLOv8和YOLOv9模型验证了该数据集在复杂环境下的高精度识别能力(如3药子集上mAP@0.5达99.5%),同时证明其在少样本检测协议下可显著提升对未见药片类别的泛化性能,表明该数据集不仅支持监督训练,还能促进有限标注条件下的迁移学习表现,为开发高可靠性的AI辅助药物识别系统提供了关键资源。
链接: https://arxiv.org/abs/2603.10825
作者: W. I. Chu,S. Hirani,G. Tarroni,L. Li
机构: City St George’s, University of London (伦敦城市圣乔治大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 19 figures. Submitted to Scientific Data (Nature Portfolio)
Abstract:Medication errors and adverse drug events (ADEs) pose significant risks to patient safety, often arising from difficulties in reliably identifying pharmaceuticals in real-world settings. AI-based pill recognition models offer a promising solution, but the lack of comprehensive datasets hinders their development. Existing pill image datasets rarely capture real-world complexities such as overlapping pills, varied lighting, and occlusions. MEDISEG addresses this gap by providing instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. We trained YOLOv8 and YOLOv9 on MEDISEG to demonstrate their usability, achieving mean average precision at IoU 0.5 of 99.5 percent on the 3-Pills subset and 80.1 percent on the 32-Pills subset. We further evaluate MEDISEG under a few-shot detection protocol, demonstrating that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenarios compared to existing datasets. These results highlight the dataset’s ability not only to support robust supervised training but also to promote transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI-driven systems for medication safety.
[CV-24] HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)在艺术领域,特别是中国画(Chinese Painting)评价中缺乏专业判断能力的问题,即VLMs虽具备通用视觉理解能力,却无法像人类专家一样对艺术品进行专业级评估。解决方案的关键在于构建HanMoVLM模型与HanMo-Bench数据集:前者通过引入基于专家验证的思维链(Chain-of-Thought, CoT),引导模型从内容识别、兴趣区域(Region of Interest, RoI)定位到主题特异性与三层结构化评价体系的专业推理;后者则提供真实拍卖级作品与AI生成作品的数据基础,确保训练和评估的真实性与权威性。此外,设计的奖励函数进一步优化推理过程,使模型在测试时能够作为高质量验证器,显著提升生成图像的艺术质量一致性与专业水准。
链接: https://arxiv.org/abs/2603.10814
作者: Hongji Yang,Yucheng Zhou,Wencheng Han,Songlian Li,Xiaotong Zhao,Jianbing Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages
Abstract:While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind and unable to offer professional evaluation of artworks within specific artistic domains like human experts. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese Artistic Domain, which is more abstract and demands extensive artistic training for evaluation. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To realize the rigorous judgment, we propose the HanMoVLM and construct a Chain-of-Thought (CoT) validated by experts. This CoT guides the model to perform expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and typical three-tier evaluation in Chinese paintings. Furthermore, we design a reward function to refine the reasoning process of the HanMoVLM to improve the accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the most artistically superior outputs from multiple candidates. Experimental results and human studies confirm that the proposed HanMoVLM effectively bridges the gap, achieving a high consistency with professional experts and significantly improving the quality of Chinese Painting generation.
[CV-25] Backdoor Directions in Vision Transformers
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)模型中后门攻击(Backdoor Attack)的内在表征机制及其检测问题。研究发现,当已知触发器(trigger)时,模型激活空间中存在一个特定的“触发方向”(trigger direction),该方向与触发器的内部表示具有因果关联;通过在激活和参数空间中对该方向进行干预,可稳定调控模型的后门行为,且该现象在多种数据集和攻击类型下均成立。解决方案的关键在于利用这一线性方向作为诊断工具,揭示不同触发方式(如静态补丁型与隐蔽分布式型)在模型各层中的处理逻辑差异,并据此提出一种无需数据、仅基于权重的检测方法,用于识别隐蔽触发攻击。此机制解释框架为计算机视觉模型的安全漏洞诊断与防御提供了新的可解释性路径。
链接: https://arxiv.org/abs/2603.10806
作者: Sengim Karayalcin,Marina Krcek,Pin-Yu Chen,Stjepan Picek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 31 pages, 16 figures
Abstract:This paper investigates how Backdoor Attacks are represented within Vision Transformers (ViTs). By assuming knowledge of the trigger, we identify a specific “trigger direction” in the model’s activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model’s backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.
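摘要中的“触发方向”可以用激活空间中的均值差方向来示意:对带触发器与干净输入的激活取均值差并归一化,再通过投影消去该方向实现干预。均值差是线性特征分析中的常见做法,此处仅为草图,论文的具体提取与干预方法可能不同。

```python
import numpy as np

def trigger_direction(clean_acts, triggered_acts):
    """Difference-of-means direction between triggered and clean
    activations, normalized to unit length (a common linear-feature
    recipe; the paper's procedure may differ)."""
    d = np.mean(triggered_acts, axis=0) - np.mean(clean_acts, axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(acts, direction):
    """Intervention: project the trigger direction out of activations."""
    coeff = acts @ direction
    return acts - np.outer(coeff, direction)
```

若消去该方向后后门行为消失、干净精度不变,即可支持“后门由一个线性方向介导”的因果判断。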
[CV-26] PolGS: Physically-Guided Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction
【速读】:该论文旨在解决反射表面(reflective surfaces)在3D高斯泼溅(3D Gaussian Splatting, 3DGS)中重建精度不足的问题,尤其是在恢复精细几何结构和表面法向量方面,相较于隐式神经方法表现较差。其解决方案的关键在于提出PolGS++框架,通过引入物理引导的偏振BRDF(polarized Bidirectional Reflectance Distribution Function, pBRDF)模型,显式分离漫反射与镜面分量,从而提供物理合理的反射建模并增强几何约束;同时设计了一种基于深度引导的可见性掩码获取机制,无需昂贵的光线追踪即可实现偏振角(angle of polarization, AoP)驱动的切空间一致性约束,显著提升了重建质量和效率,训练时间仅需约10分钟。
链接: https://arxiv.org/abs/2603.10801
作者: Yufei Han,Chu Zhou,Youwei Lyu,Qi Chen,Si Li,Boxin Shi,Yunpeng Jia,Heng Guo,Zhanyu Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2509.19726
Abstract:Accurate reconstruction of reflective surfaces remains a fundamental challenge in computer vision, with broad applications in real-time virtual reality and digital content creation. Although 3D Gaussian Splatting (3DGS) enables efficient novel-view rendering with explicit representations, its performance on reflective surfaces still lags behind implicit neural methods, especially in recovering fine geometry and surface normals. To address this gap, we propose PolGS++, a physically-guided polarimetric Gaussian Splatting framework for fast reflective surface reconstruction. Specifically, we integrate a polarized BRDF (pBRDF) model into 3DGS to explicitly decouple diffuse and specular components, providing physically grounded reflectance modeling and stronger geometric cues for reflective surface recovery. Furthermore, we introduce a depth-guided visibility mask acquisition mechanism that enables angle-of-polarization (AoP)-based tangent-space consistency constraints in Gaussian Splatting without costly ray-tracing intersections. This physically guided design improves reconstruction quality and efficiency, requiring only about 10 minutes of training. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of our method.
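由偏振四角度测量恢复线偏振度(DoP)与偏振角(AoP)是标准的 Stokes 参数计算,也是摘要中 AoP 切空间约束所依赖的基本量。下面给出该标准公式的草图,与 PolGS++ 的具体实现无关。

```python
import numpy as np

def stokes_from_four_angles(I0, I45, I90, I135):
    """Linear Stokes parameters from intensities at polarizer angles
    0/45/90/135 degrees; returns degree (DoP) and angle (AoP) of
    linear polarization. Standard polarimetry, independent of PolGS++."""
    s0 = (I0 + I45 + I90 + I135) / 2.0
    s1 = I0 - I90
    s2 = I45 - I135
    dop = np.sqrt(s1 ** 2 + s2 ** 2) / s0
    aop = 0.5 * np.arctan2(s2, s1)
    return dop, aop
```

AoP 给出切平面约束所需的方位线索,DoP 则有助于区分偏振较强的镜面分量与基本无偏振的漫反射分量。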
[CV-27] he Quadratic Geometry of Flow Matching: Semantic Granularity Alignment for Text-to-Image Synthesis
【速读】:该论文旨在解决生成式微调(Generative Fine-tuning)中因数据异质性导致的梯度冲突问题,尤其是在Flow Matching框架下,标准均方误差(MSE)目标函数隐含地受动态演化神经切线核(Neural Tangent Kernel, NTK)控制,从而在优化过程中引入了样本间残差相关性(residual correlation),而这种相关性通常未被显式调控。其解决方案的关键在于提出语义粒度对齐(Semantic Granularity Alignment, SGA),通过在向量残差场(vector residual field)中施加针对性干预,主动缓解不同样本间的梯度冲突,从而提升模型收敛速度与结构完整性,在DiT和U-Net架构上均验证了该方法在效率与质量权衡上的改进效果。
链接: https://arxiv.org/abs/2603.10785
作者: Zhinan Xiong,Shunqi Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 43 pages
Abstract:In this work, we analyze the optimization dynamics of generative fine-tuning. We observe that under the Flow Matching framework, the standard MSE objective can be formulated as a Quadratic Form governed by a dynamically evolving Neural Tangent Kernel (NTK). This geometric perspective reveals a latent Data Interaction Matrix, where diagonal terms represent independent sample learning and off-diagonal terms encode residual correlation between heterogeneous features. Although standard training implicitly optimizes these cross-term interferences, it does so without explicit control; moreover, the prevailing data-homogeneity assumption may constrain the model’s effective capacity. Motivated by this insight, we propose Semantic Granularity Alignment (SGA), using Text-to-Image synthesis as a testbed. SGA engineers targeted interventions in the vector residual field to mitigate gradient conflicts. Evaluations across DiT and U-Net architectures confirm that SGA advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity.
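摘要中的二次型视角涉及两个基本量:Flow Matching 的线性插值路径与速度目标 v = x1 - x0,以及按样本梯度内积构成的相互作用(Gram)矩阵,其对角项对应独立样本学习、非对角项对应样本间干扰。以下为这两个量的最小数值草图,并非 SGA 本身的实现。

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear interpolation path x_t = (1 - t) x0 + t x1 and its
    velocity target v = x1 - x0 (the regression target of the MSE loss)."""
    return (1 - t) * x0 + t * x1, x1 - x0

def interaction_matrix(per_sample_grads):
    """NTK-style Gram matrix K[i, j] = g_i . g_j over per-sample
    gradients: diagonal terms = independent sample learning,
    off-diagonal terms = cross-sample interference."""
    G = np.asarray(per_sample_grads, dtype=float)
    return G @ G.T
```

当两个样本的梯度正交时 K 的非对角项为 0,互不干扰;非对角项显著为负则对应摘要所述的梯度冲突,是 SGA 试图干预的对象。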
[CV-28] Phase-Interface Instance Segmentation as a Visual Sensor for Laboratory Process Monitoring
【速读】:该论文旨在解决透明玻璃器皿中化学实验的可靠视觉监测问题,其核心挑战在于弱相边界和光学伪影导致传统分割方法性能下降。解决方案的关键在于提出一个名为LGA-RCM-YOLO的新型模型,该模型结合局部-全局注意力机制(Local-Global Attention, LGA)以增强语义表征鲁棒性,并引入矩形自校准模块(Rectangular Self-Calibration Module, RCM)对细长相界面进行边界精修;同时构建了包含3,668张图像、23类玻璃器皿和五种多相界面类型的Chemical Transparent Glasses dataset 2.0(CTG 2.0)作为基准数据集,显著提升了相界面实例分割精度(AP@0.5达84.4%,AP@0.5-0.95达58.43%),并实现了近实时推理(13.67 FPS),为实验室自动化提供了可实用的视觉传感方案。
链接: https://arxiv.org/abs/2603.10782
作者: Mingyue Li,Xin Yang,Shilin Yan,Jinye Ran,Morui Zhu,Zirui Peng,Huanqing Peng,Wei Peng,Guanghua Zhang,Shuo Li,Hao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable visual monitoring of chemical experiments remains challenging in transparent glassware, where weak phase boundaries and optical artifacts degrade conventional segmentation. We formulate laboratory phenomena as the time evolution of phase interfaces and introduce the Chemical Transparent Glasses dataset 2.0 (CTG 2.0), a vessel-aware benchmark with 3,668 images, 23 glassware categories, and five multiphase interface types for phase-interface instance segmentation. Building on YOLO11m-seg, we propose LGA-RCM-YOLO, which combines Local-Global Attention (LGA) for robust semantic representation and a Rectangular Self-Calibration Module (RCM) for boundary refinement of thin, elongated interfaces. On CTG 2.0, the proposed model achieves 84.4% AP@0.5 and 58.43% AP@0.5-0.95, improving over the YOLO11m baseline by 6.42 and 8.75 AP points, respectively, while maintaining near real-time inference (13.67 FPS, RTX 3060). An auxiliary color-attribute head further labels liquid instances as colored or colorless with 98.71% precision and 98.32% recall. Finally, we demonstrate continuous process monitoring in separatory-funnel phase separation and crystallization, showing that phase-interface instance segmentation can serve as a practical visual sensor for laboratory automation.
[CV-29] Taking Shortcuts for Categorical VQA Using Super Neurons
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在下游任务中性能提升与推理效率之间的权衡问题。传统方法如监督微调或低秩适应(Low-Rank Adaptation, LoRA)虽能提升性能,但需额外训练成本;而稀疏注意力向量(Sparse Attention Vectors, SAVs)虽无需训练,但受限于注意力头的选择空间。本文的关键解决方案是将研究焦点从注意力向量转向原始激活值(scalar activations),提出“超神经元”(Super Neurons, SNs)的概念——即通过直接探测VLM中具有判别力的标量激活值,构建无需训练的分类器。SNs显著扩展了可搜索参数空间,并能在首个生成token时即识别出足够判别性的神经元,从而实现极端早期退出(extreme early exiting),在保持甚至提升分类准确率的同时,达到最高5.10倍的推理加速效果。
链接: https://arxiv.org/abs/2603.10781
作者: Pierre Musacchio,Jaeyi Jeong,Dahun Kim,Jaesik Park
机构: Seoul National University (首尔国立大学); EPFL (瑞士联邦理工学院); Google Deepmind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 15 tables, 8 figures
Abstract:Sparse Attention Vectors (SAVs) have emerged as an excellent training-free alternative to supervised finetuning or low-rank adaptation to improve the performance of Vision Language Models (VLMs). At their heart, SAVs select a few accurate attention heads for a task of interest and use them as classifiers, rather than relying on the model’s prediction. In a similar spirit, we find that directly probing the raw activations of the VLM, in the form of scalar values, is sufficient to yield accurate classifiers on diverse visually grounded downstream tasks. Shifting focus from attention vectors to scalar activations dramatically increases the search space for accurate parameters, allowing us to find more discriminative neurons immediately from the first generated token. We call such activations Super Neurons (SNs). In this probing setting, we discover that enough SNs appear in the shallower layers of the large language model to allow for extreme early exiting from the first layer of the model at the first generated token. Compared to the original network, SNs robustly improve the classification performance while achieving a speedup of up to 5.10x.
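摘要中“直接探测标量激活作为分类器”的思路可以用一个极简探针示意:在每个神经元上用两类均值的中点作阈值,挑选区分两类最准的那个标量激活。以下实现仅为说明性假设,并非论文代码。

```python
import numpy as np

def find_super_neuron(acts, labels):
    """Probe each scalar activation with a class-mean midpoint threshold
    and keep the most discriminative one (illustrative probe only).
    acts: (n_samples, n_neurons); labels: 0/1 per sample."""
    acts = np.asarray(acts, dtype=float)
    y = np.asarray(labels)
    best = (-1.0, None, None)
    for j in range(acts.shape[1]):
        m0, m1 = acts[y == 0, j].mean(), acts[y == 1, j].mean()
        thr = (m0 + m1) / 2.0
        pred = (acts[:, j] > thr) if m1 > m0 else (acts[:, j] <= thr)
        acc = float((pred == y).mean())
        if acc > best[0]:
            best = (acc, j, thr)
    return best  # (accuracy, neuron index, threshold)
```

若在浅层、首个生成 token 处即可找到高判别力的神经元,后续层的计算便可以省去,这正是摘要中“极端早退”带来加速的直观来源。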
[CV-30] Guiding Diffusion Models with Semantically Degraded Conditions CVPR2026
【速读】:该论文旨在解决生成式 AI (Generative AI) 中文本到图像模型依赖静态、语义空洞的空提示(null prompt, ∅)所导致的引导信号几何纠缠问题,从而限制了复杂组合任务中的精确性。其解决方案的关键在于提出条件退化引导(Condition-Degradation Guidance, CDG),通过构造一个仅对内容令牌(content tokens)进行选择性降级的条件向量 cdeg,将原本粗粒度的“好 vs. 空”对比转化为更精细的“好 vs. 几乎好”的判别机制,从而促使模型捕捉细粒度语义差异。此方法无需额外训练或外部模型,即可显著提升不同架构下的组合准确性与图文对齐效果,且计算开销极低。
链接: https://arxiv.org/abs/2603.10780
作者: Shilong Han,Yuming Zhang,Hongxia Wang
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt (∅) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, c_deg. This reframes guidance from a coarse “good vs. null” contrast to a more refined “good vs. almost good” discrimination, thereby compelling the model to capture fine-grained semantic distinctions. We find that tokens in transformer text encoders split into two functional roles: content tokens encoding object semantics, and context-aggregating tokens capturing global context. By selectively degrading only the former, CDG constructs c_deg without external models or training. Validated across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically-aware negative samples is critical to achieving precise semantic control. Code is available at this https URL.
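CDG 与 CFG 共享同一引导更新式 eps_base + w*(eps_cond - eps_base),区别仅在于基准预测来自退化条件 c_deg 而非空提示。下面的草图示意了“仅扰动内容令牌构造 c_deg”与该引导更新;其中用高斯噪声做退化只是假设,论文的免训练构造方式可能不同。

```python
import numpy as np

def degrade_condition(token_embs, is_content, noise_scale=1.0, seed=0):
    """Build c_deg by perturbing only content tokens while leaving
    context-aggregating tokens intact. Gaussian noise here is a
    hypothetical degradation; the paper's construction may differ."""
    rng = np.random.default_rng(seed)
    E = np.array(token_embs, dtype=float)
    mask = np.asarray(is_content, dtype=bool)
    E[mask] += noise_scale * rng.standard_normal(E[mask].shape)
    return E

def guided_prediction(eps_cond, eps_base, w):
    """Shared guidance update: eps_base + w * (eps_cond - eps_base).
    CFG takes eps_base from the null prompt; CDG from c_deg."""
    return eps_base + w * (eps_cond - eps_base)
```

w=1 时退化为条件预测,w>1 时沿“完整条件减退化条件”的方向外推;由于 c_deg 与完整条件只差内容令牌,这一差向量比“条件减空提示”更聚焦于细粒度语义。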
[CV-31] CodePercept: Code-Grounded Visual STEM Perception for MLLMs CVPR2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在科学、技术、工程和数学(STEM)视觉推理任务中表现不佳的核心问题,即这种失败究竟是源于感知能力不足还是推理能力受限。通过独立调控感知与推理模块的规模进行系统性分析,研究发现:提升感知能力比增强推理能力更能显著改善模型性能,表明感知瓶颈是当前MLLMs在STEM视觉推理中的关键限制因素。解决方案的关键在于将代码(code)作为强大的感知媒介来系统性增强MLLMs的感知能力——利用可执行代码提供精确语义,天然契合STEM视觉内容的结构化特性。具体实现上,构建了包含100万张图像-描述-代码三元组的大规模数据集ICC-1M,其中两种互补方法分别实现了“基于代码的描述生成”(消除知识蒸馏中的幻觉)和“STEM图像到代码的翻译”(降低自然语言感知歧义),并提出了STEM2Code-Eval基准用于直接评估STEM领域内的视觉感知能力,该基准通过要求模型生成可执行代码以重建图像,从而提供确定性和可验证的感知评估,而非依赖于传统的问题求解准确率这一间接指标。
链接: https://arxiv.org/abs/2603.10757
作者: Tongkun Guan,Zhibo Yang,Jianqiang Wan,Mingkun Yang,Zhengtao Guo,Zijian Hu,Ruilin Luo,Ruize Chen,Songtao Jiang,Peng Wang,Wei Shen,Junyang Lin,Xiaokang Yang
机构: Shanghai Jiao Tong University(上海交通大学); Qwen Team, Alibaba Group(通义实验室,阿里巴巴集团); Beijing Institute of Technology(北京理工大学); Tsinghua University(清华大学); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium–executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at this https URL.
[CV-32] Event-based Photometric Stereo via Rotating Illumination and Per-Pixel Learning
【速读】:该论文旨在解决传统帧基光度立体(Photometric Stereo)方法在实际应用中受限于可控光照条件和易受环境光干扰的问题。其关键解决方案是提出一种基于事件相机(Event Camera)的光度立体系统,利用事件相机对连续变化的场景辐射和高动态范围条件下的响应能力,结合单个沿预设圆形轨迹移动的光源,实现无需多光源同步与系统标定的表面法向量估计。该方案通过一个轻量级逐像素多层神经网络直接从事件信号中预测表面法向量,在保证精度的同时显著提升了鲁棒性,尤其在事件稀疏区域、强环境光及镜面反射场景下表现优异。
链接: https://arxiv.org/abs/2603.10748
作者: Hyunwoo Kim,Won-Hoe Kim,Sanghoon Lee,Jianfei Cai,Giljoo Nam,Jae-Sang Hyun
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Photometric stereo is a technique for estimating surface normals using images captured under varying illumination. However, conventional frame-based photometric stereo methods are limited in real-world applications due to their reliance on controlled lighting, and susceptibility to ambient illumination. To address these limitations, we propose an event-based photometric stereo system that leverages an event camera, which is effective in scenarios with continuously varying scene radiance and high dynamic range conditions. Our setup employs a single light source moving along a predefined circular trajectory, eliminating the need for multiple synchronized light sources and enabling a more compact and scalable design. We further introduce a lightweight per-pixel multi-layer neural network that directly predicts surface normals from event signals generated by intensity changes as the light source rotates, without system calibration. Experimental results on benchmark datasets and real-world data collected with our data acquisition system demonstrate the effectiveness of our method, achieving a 7.12% reduction in mean angular error compared to existing event-based photometric stereo methods. In addition, our method demonstrates robustness in regions with sparse event activity, strong ambient illumination, and scenes affected by specularities.
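For context on what the event-based system replaces: classical frame-based photometric stereo under a Lambertian model solves a per-pixel least-squares problem, where intensities I from k images under known light directions L satisfy I = L(ρn). A minimal NumPy sketch of that baseline on synthetic data (not the paper's per-pixel network, which learns directly from event signals):

```python
import numpy as np

# Classical frame-based Lambertian photometric stereo: with k images under
# known light directions L (k x 3), per-pixel intensities I (k,) satisfy
# I = L @ (rho * n). Least squares recovers g = rho * n.
def estimate_normal(L, I):
    g, *_ = np.linalg.lstsq(L, I, rcond=None)
    rho = np.linalg.norm(g)              # albedo
    return g / rho, rho                  # unit normal, albedo

# Synthetic check: render a known normal, then recover it exactly.
L = np.array([[0.0, 0.0, 1.0],
              [0.7, 0.0, 0.714],
              [0.0, 0.7, 0.714]])
n_true = np.array([0.2, -0.1, 0.97])
n_true /= np.linalg.norm(n_true)
I = L @ (0.8 * n_true)                   # albedo 0.8, noise-free
n_hat, rho_hat = estimate_normal(L, I)
print(np.allclose(n_hat, n_true), round(rho_hat, 3))
```

The paper's contribution replaces this multi-image, calibrated-lighting setup with a single rotating light and a learned per-pixel network operating on event streams.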
[CV-33] Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers CVPR2026
【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像生成任务中因迭代采样带来的高计算成本问题,尤其是现有加速方法多聚焦于时间维度而忽视了生成过程中存在的显著空间冗余性——即全局结构在细粒度细节出现前已初步形成,但当前方法对所有空间区域采用均匀计算处理,导致效率低下。其解决方案的关键在于提出一种无需训练的时空域协同加速框架Just-in-Time (JiT),通过构建一个空间近似的生成常微分方程(ODE),基于动态选择的稀疏锚点令牌(anchor tokens)驱动完整潜在状态演化,并引入确定性微流(deterministic micro-flow)这一简单高效的有限时间ODE机制,确保新增令牌时潜在状态维度扩展过程中的结构一致性与统计正确性,从而实现高达7倍的速度提升且几乎无性能损失。
链接: https://arxiv.org/abs/2603.10744
作者: Wenhao Sun,Ji Li,Zhaoqiang Liu
机构: University of Electronic Science and Technology of China (电子科技大学); Capital Normal University (首都师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.
[CV-34] eLasmobranc Dataset: An Image Dataset for Elasmobranch Species Recognition and Biodiversity Monitoring
【速读】:该论文旨在解决当前elasmobranch(软骨鱼)物种识别中因视觉数据集多为检测导向、水下采集或仅限粗粒度分类而导致的细粒度形态学分类能力不足的问题。解决方案的关键在于构建并公开发布eLasmobranc Dataset,该数据集包含来自西班牙东部地中海海岸七种生态相关软骨鱼的图像,其图像主要在水体外通过标准化协议获取,以确保诊断性形态特征的清晰可视化;同时整合了专家验证的物种注释、结构化时空元数据及补充物种信息,从而支持监督式物种级分类、种群研究以及用于生物多样性监测的人工智能系统开发,填补了细粒度软骨鱼识别领域的关键数据空白,并推动面向保护的计算机视觉研究的可重复性。
链接: https://arxiv.org/abs/2603.10724
作者: Ismael Beviá-Ballesteros,Mario Jerez-Tallón,Nieves Aranda-Garrido,Isabel Abel-Abellán,Irene Antón-Linares,Jorge Azorín-López,Marcelo Saval-Calvo,Andres Fuster-Guilló,Francisca Giménez-Casalduero
机构: University of Alicante (阿利坎特大学); Department of Computer Science and Technology (DTIC) (计算机科学与技术系); Marine Research Center of Santa Pola (CIMAR) (圣波拉海洋研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures, 5 tables. A future extended version of this work will be submitted to Scientific Data
Abstract:Elasmobranch populations are experiencing significant global declines, and several species are currently classified as threatened. Reliable monitoring and species-level identification are essential to support conservation and spatial planning initiatives such as Important Shark and Ray Areas (ISRAs). However, existing visual datasets are predominantly detection-oriented, underwater-acquired, or limited to coarse-grained categories, restricting their applicability to fine-grained morphological classification. We present the eLasmobranc Dataset, a curated and publicly available image collection from seven ecologically relevant elasmobranch species inhabiting the eastern Spanish Mediterranean coast, a region where two ISRAs have been identified. Images were obtained through dedicated data collection, including field campaigns and collaborations with local fish markets and projects, as well as from open-access public sources. The dataset was constructed predominantly from images acquired outside the aquatic environment under standardized protocols to ensure clear visualization of diagnostic morphological traits. It integrates expert-validated species annotations, structured spatial and temporal metadata, and complementary species-level information. The eLasmobranc Dataset is specifically designed to support supervised species-level classification, population studies, and the development of artificial intelligence systems for biodiversity monitoring. By combining morphological clarity, taxonomic reliability, and public accessibility, the dataset addresses a critical gap in fine-grained elasmobranch identification and promotes reproducible research in conservation-oriented computer vision. The dataset is publicly available at this https URL.
[CV-35] UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark
【速读】:该论文旨在解决无人机(UAV)在复杂光照条件下交通场景理解性能下降的问题,以及现有视觉问答(VQA)模型在识别复杂交通行为时缺乏领域特定规则知识的局限性。解决方案的关键在于提出一种跨谱交通认知网络(Cross-spectral Traffic Cognition Network, CTCNet),其核心创新包括:1)原型引导的知识嵌入(Prototype-Guided Knowledge Embedding, PGKE)模块,通过引入外部交通规则记忆(Traffic Regulation Memory, TRM)中的高层语义原型,将领域知识锚定到视觉表征中,从而提升对细粒度交通违规行为的理解能力;2)质量感知的谱补偿(Quality-Aware Spectral Compensation, QASC)模块,利用可见光与热红外模态的互补特性实现双向上下文交互,有效补偿恶劣环境下的退化特征,保障鲁棒表征。
链接: https://arxiv.org/abs/2603.10722
作者: Yu Zhang,Zhicheng Zhao,Ze Luo,Chenglong Li,Jin Tang
机构: Computer Network Information Center, Chinese Academy of Sciences (中国科学院计算机网络信息中心); University of Chinese Academy of Sciences (中国科学院大学); Anhui Provincial Key Laboratory of Multimodal Cognitive Computation (安徽省多模态认知计算重点实验室); Information Materials and Intelligent Sensing Laboratory of Anhui Province (安徽省信息材料与智能感知重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at this https URL.
[CV-36] WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation CVPR-2026
【速读】:该论文旨在解决现有大型视觉语言模型(Large Vision-Language Models, LVLMs)在复杂城市场景中进行可访问性行人导航时,因缺乏显式空间锚定而导致的对象幻觉和不可靠深度推理问题。为实现深度感知的无障碍导航指导,作者提出WalkGPT,其核心创新在于将语言推理与分割任务统一于单一架构中,并引入两个关键组件:多尺度查询投影器(Multi-Scale Query Projector, MSQP)用于跨空间层次聚合图像token以增强语义-空间对齐,以及校准文本投影器(Calibrated Text Projector, CTP)结合区域对齐损失(Region Alignment Loss),将语言嵌入映射为分割感知表示,从而无需用户提供的提示或锚点即可实现细粒度的空间接地和深度推断,生成完整且真实的导航建议。
链接: https://arxiv.org/abs/2603.10703
作者: Rafi Ibn Sultan,Hui Zhu,Xiangyu Zhou,Chengyin Li,Prashant Khanduri,Marco Brocanelli,Dongxiao Zhu
机构: Wayne State University (韦恩州立大学); Henry Ford Health (亨利福特健康); The Ohio State University (俄亥俄州立大学); Institute for AI and Data Science (人工智能与数据科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: Accepted by CVPR-2026
Abstract:Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website: this https URL.
[CV-37] UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations
【速读】:该论文旨在解决当前统一多模态模型在视觉理解与生成任务中因离散化视觉编码器导致细粒度语义信息丢失,以及直接使用连续语义表示(如CLIP、SigLIP)时高维生成建模带来的收敛缓慢和训练不稳定问题。其解决方案的关键在于提出UniCom框架,通过压缩连续语义表示实现多模态理解和生成的协同优化:首先设计基于注意力机制的语义压缩器,将密集特征提炼为紧凑的统一表示,实验证明降低通道维度比空间下采样更有效;其次采用蒸馏式(transfusion)架构替代查询驱动(query-based)设计,显著提升收敛速度与一致性,从而在不依赖VAE的情况下仍保持图像生成的一致性和可控性。
链接: https://arxiv.org/abs/2603.10702
作者: Yaqi Zhao,Wang Lin,Zijian Zhang,Miles Yang,Jingyuan Chen,Wentao Zhang,Zhao Zhong,Liefeng Bo
机构: Peking University (北京大学); Zhejiang University (浙江大学); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance in visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges in high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representation. We empirically demonstrate that reducing channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor to distill dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on VAE.
[CV-38] RandMark: On Random Watermarking of Visual Foundation Models
【速读】:该论文旨在解决视觉基础模型(Visual Foundation Models, VFMs)的知识产权保护问题,即如何有效验证模型的所有权并防止未经授权的复制与分发。解决方案的关键在于提出一种基于随机水印嵌入的方法:通过一个小规模的编码器-解码器网络,将数字水印嵌入到一个保留输入图像集的内部表示中,使得功能相同的水印模型副本在统计上可检测出水印特征,同时确保未水印模型极少产生误检,水印模型也极少出现漏检,从而实现高可靠性的所有权验证。
链接: https://arxiv.org/abs/2603.10695
作者: Anna Chistyakova,Mikhail Pautov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Being trained on large and diverse datasets, visual foundation models (VFMs) can be fine-tuned to achieve remarkable performance and efficiency in various downstream computer vision tasks. The high computational cost of data collection and training makes these models valuable assets, which motivates some VFM owners to distribute them alongside a license to protect their intellectual property rights. In this paper, we propose an approach to ownership verification of visual foundation models that leverages a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. The method is based on random watermark embedding, which makes the watermark statistics detectable in functional copies of the watermarked model. Both theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection for non-watermarked models and a low probability of false misdetection for watermarked models.
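The abstract speaks of making watermark statistics "detectable" with low false-detection and false-misdetection probabilities. One standard way such guarantees are obtained (a hedged sketch; the paper's actual test statistic is not specified here) is a binomial tail test on how many extracted bits match the random watermark:

```python
import math

# Hedged sketch: a decoder extracts k watermark bits from the suspect
# model's representations; under the null hypothesis (non-watermarked
# model) each extracted bit matches the random watermark with probability
# 1/2, so the match count is Binomial(k, 1/2). A tiny upper-tail
# probability then supports an ownership claim. k and the example match
# counts below are illustrative assumptions.
def detection_p_value(matches, k):
    """P[X >= matches] for X ~ Binomial(k, 1/2)."""
    return sum(math.comb(k, i) for i in range(matches, k + 1)) / 2**k

k = 64
print(detection_p_value(60, k))   # watermarked model: nearly all bits match
print(detection_p_value(33, k))   # chance level: no detection
```

With 60 of 64 bits matching, the tail probability is far below any practical significance threshold, while a chance-level match count yields a p-value near 0.5, i.e. no false detection.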
[CV-39] Bioinspired CNNs for border completion in occluded images
【速读】:该论文旨在解决图像遮挡(occlusion)下模型鲁棒性不足的问题,即在输入图像存在局部缺失或遮挡时,卷积神经网络(CNN)的识别性能显著下降。解决方案的关键在于借鉴视觉皮层中边界补全(border completion)问题的数学建模方法,设计出能够模拟人类视觉系统对不完整信息进行推理能力的CNN滤波器,从而提升网络对遮挡场景的适应性。实验表明,所提出的BorderNet架构在MNIST、Fashion-MNIST和EMNIST三个遮挡数据集上均表现出优于传统CNN的性能,尤其在不同遮挡类型(条纹与网格)和严重程度下具有稳定改进效果。
链接: https://arxiv.org/abs/2603.10694
作者: Catarina P. Coutinho,Aneeqa Merhab,Janko Petkovic,Ferdinando Zanchetta,Rita Fioresi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted for Publication
Abstract:We exploit the mathematical modeling of the border completion problem in the visual cortex to design convolutional neural network (CNN) filters that enhance robustness to image occlusions. We evaluate our CNN architecture, BorderNet, on three occluded datasets (MNIST, Fashion-MNIST, and EMNIST) under two types of occlusions: stripes and grids. In all cases, BorderNet demonstrates improved performance, with gains varying depending on the severity of the occlusions and the dataset.
[CV-40] MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction
【速读】:该论文旨在解决在线高精度(HD)地图构建中对大规模标注数据依赖过高的问题,从而提升模型训练的可扩展性。其核心挑战在于如何在减少人工标注需求的同时保持地图感知性能。解决方案的关键在于引入一种基于对比损失函数的自监督学习机制,通过强制重叠的鸟瞰图(BEV)特征网格之间保持地理空间一致性来优化特征表示;同时设计了一种基于多轨迹重叠分析的数据划分策略,生成符合多轨迹要求的子数据集,使模型能够在少量单轨迹标注数据上进行监督训练,并在更广泛的无标签数据上进行自监督学习,实现有效的半监督学习范式。该方法显著优于纯监督基线,在下游向量化地图感知任务中表现出更强的定量性能和更清晰的特征空间聚类结构。
链接: https://arxiv.org/abs/2603.10688
作者: Jonas Merkert,Alexander Blumberg,Jan-Hendrik Pauls,Christoph Stiller
机构: Karlsruhe Institute of Technology (KIT)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous vehicles rely on map information to understand the world around them. However, the creation and maintenance of offline high-definition (HD) maps remains costly. A more scalable alternative lies in online HD map construction, which only requires map annotations at training time. To further reduce the need for annotating vast training labels, self-supervised training provides an alternative. This work focuses on improving the latent birds-eye-view (BEV) feature grid representation within a vectorized online HD map construction model by enforcing geospatial consistency between overlapping BEV feature grids as part of a contrastive loss function. To ensure geospatial overlap for contrastive pairs, we introduce an approach to analyze the overlap between traversals within a given dataset and generate subsidiary dataset splits following adjustable multi-traversal requirements. We train the same model supervised using a reduced set of single-traversal labeled data and self-supervised on a broader unlabeled set of data following our multi-traversal requirements, effectively implementing a semi-supervised approach. Our approach outperforms the supervised baseline across the board, both quantitatively in terms of the downstream tasks vectorized map perception performance and qualitatively in terms of segmentation in the principal component analysis (PCA) visualization of the BEV feature space.
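The contrastive objective over overlapping BEV feature grids is not spelled out in the abstract; the standard template for such a geospatial-consistency loss is InfoNCE, where features of the same BEV cell seen from two traversals form positive pairs and other cells in the batch act as negatives. A self-contained NumPy sketch under that assumption (temperature and shapes are illustrative):

```python
import numpy as np

# Generic InfoNCE sketch of geospatial consistency: row i of za and row i
# of zb describe the same BEV cell from two traversals (positive pair);
# all other rows serve as in-batch negatives.
def info_nce(za, zb, tau=0.1):
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                       # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives on diagonal

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))
loss_random = info_nce(z, rng.normal(size=(8, 16)))
print(loss_aligned, loss_random)   # consistent grids yield a lower loss
```

Minimizing this loss pulls overlapping BEV cells toward the same representation regardless of traversal, which is the consistency property the paper enforces on unlabeled multi-traversal data.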
[CV-41] A²-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks
【速读】:该论文旨在解决现有图像修复(inpainting)方法在任意物体类别编辑任务中存在的两个核心问题:一是训练数据集存在严重同质化(homogenization)和类别覆盖有限的问题,二是模型在跨类别语义迁移与泛化能力上的不足。为此,作者提出了一种统一的编辑框架 A²-Edit,其关键创新在于构建了一个大规模、多类别的数据集 UniEdit-500K(包含8个主类别、209个细粒度子类别,共500,104对图像),并通过引入Mixture of Transformer模块实现基于动态专家选择的差异化建模,从而自动学习不同类别间的语义关系与区分特征;同时,设计了Mask Annealing Training Strategy (MATS),通过逐步放松掩码精度来降低模型对精确掩码的依赖,提升其在多样化编辑任务中的鲁棒性。实验证明,该方案在VITON-HD和AnyInsertion等基准上显著优于现有方法,为任意物体编辑提供了高效且通用的新范式。
链接: https://arxiv.org/abs/2603.10685
作者: Huayu Zheng,Guangzhao Li,Baixuan Zhao,Siqi Luo,Hantao Jiang,Guangtao Zhai,Xiaohong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose A²-Edit, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale, multi-category dataset UniEdit-500K, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the Mixture of Transformer module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we propose a Mask Annealing Training Strategy (MATS) that progressively relaxes mask precision during training, reducing the model’s reliance on accurate masks and improving robustness across diverse editing tasks. Extensive experiments on benchmarks such as VITON-HD and AnyInsertion demonstrate that A²-Edit consistently outperforms existing approaches across all metrics, providing a new and efficient solution for arbitrary object editing.
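The mask-annealing idea can be illustrated with a toy sketch: relaxing mask precision over training can be modeled as progressively dilating the supervision mask, so the model gradually learns to tolerate the coarse masks users draw at inference time. The dilation schedule below is an assumption for illustration, not the paper's MATS implementation:

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation of a 2D bool mask by a (2r+1)x(2r+1) square."""
    H, W = mask.shape
    p = np.pad(mask, r)
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= p[dy:dy + H, dx:dx + W]
    return out

def annealed_mask(mask, step, total, r_max=3):
    # Early in training: tight, precise mask. Late: relaxed, coarse mask.
    r = round(r_max * step / total)
    return dilate(mask, r) if r > 0 else mask

m = np.zeros((8, 8), dtype=bool)
m[3:5, 3:5] = True                     # precise 2x2 target region
early = annealed_mask(m, step=0, total=100)
late = annealed_mask(m, step=100, total=100)
print(early.sum(), late.sum())         # relaxed mask covers far more pixels
```

Training on progressively coarser masks like these is one plausible reading of "progressively relaxes mask precision"; the direction of the schedule and the dilation radius are guesses.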
[CV-42] An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS
【速读】:该论文旨在解决JPEG XS标准中Intra Pattern Copy (IPC)工具的位移向量(Displacement Vector, DV)搜索模块在硬件实现时计算复杂度高、难以部署的问题。其解决方案的关键在于提出了一种高效的流水线化FPGA架构设计,通过优化内存组织结构以充分利用IPC计算特性和数据重用模式,从而显著提升吞吐量并降低功耗——实验结果表明,该架构实现了38.3 Mpixels/s的吞吐量和277 mW的功耗,验证了其在IPC及其他预测编码工具中实际硬件部署的可行性,并为ASIC实现奠定了良好基础。
链接: https://arxiv.org/abs/2603.10671
作者: Qiyue Chen,Yao Li,Jie Tao,Song Chen,Li Li,Dong Liu
机构: 未知
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Recently, progress has been made on the Intra Pattern Copy (IPC) tool for JPEG XS, an image compression standard designed for low-latency and low-complexity coding. IPC performs wavelet-domain intra compensation predictions to reduce spatial redundancy in screen content. A key module of IPC is the displacement vector (DV) search, which aims to solve the optimal prediction reference offset. However, the DV search process is computationally intensive, posing challenges for practical hardware deployment. In this paper, we propose an efficient pipelined FPGA architecture design for the DV search module to promote the practical deployment of IPC. Optimized memory organization, which leverages the IPC computational characteristics and data inherent reuse patterns, is further introduced to enhance the performance. Experimental results show that our proposed architecture achieves a throughput of 38.3 Mpixels/s with a power consumption of 277 mW, demonstrating its feasibility for practical hardware implementation in IPC and other predictive coding tools, and providing a promising foundation for ASIC deployment.
[CV-43] How To Embed Matters: Evaluation of EO Embedding Design Choices
【速读】:该论文旨在解决地球观测(Earth Observation, EO)任务中大规模多光谱影像处理的可扩展性问题,核心挑战在于如何设计高效且通用的中间表示(embedding),以替代原始数据并支持多种下游任务。解决方案的关键在于系统性地分析生成式地理基础模型(GeoFM)中嵌入设计的各项因素,包括骨干网络架构、预训练策略、表示深度、空间聚合方式及嵌入组合机制,并验证了通过合理设计的紧凑嵌入(尺寸小于原始数据500倍)仍能保持高性能与任务泛化能力。研究发现,基于Transformer的骨干网络配合均值池化可提供稳健默认嵌入,中间ResNet层优于最终层,自监督目标具有任务特异性优势,而融合不同预训练目标的嵌入则显著提升鲁棒性。
链接: https://arxiv.org/abs/2603.10658
作者: Luis Gilch,Isabelle Wittmann,Maximilian Nitsche,Johannes Jakubik,Arne Ewald,Thomas Brunschwiler
机构: IBM Germany; IBM Research - Europe; NORDAKADEMIE Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, enabling models to compute representations once and reuse them across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.
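The aggregation step described above, mean-pooling patch-token embeddings from a frozen GeoFM backbone into one fixed-size vector per image, can be sketched directly. The shapes and the 12-band uint16 raw-input format below are illustrative assumptions used only to show how a compression factor of this kind is computed:

```python
import numpy as np

# Sketch of the embedding aggregation described above: patch tokens from a
# frozen backbone are mean-pooled into one fixed-size vector per image.
def pool_embedding(tokens, mode="mean"):
    """tokens: (num_patches, dim) feature map from the backbone."""
    if mode == "mean":
        return tokens.mean(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

rng = np.random.default_rng(2)
tokens = rng.normal(size=(196, 768))   # e.g. 14x14 patch grid, ViT-Base dim
emb = pool_embedding(tokens)

# Compression vs. a hypothetical 224x224 tile with 12 uint16 bands:
raw_bytes = 224 * 224 * 12 * 2
emb_bytes = emb.size * 4               # float32 storage
print(emb.shape, raw_bytes / emb_bytes)
```

Whether such a compact vector remains "broadly useful" across downstream tasks is exactly what the paper's benchmark evaluates; the pooling choice (mean vs. final-layer vs. intermediate-layer features) is one of the design axes it studies.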
[CV-44] Are Video Reasoning Models Ready to Go Outside?
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在真实世界部署中因天气、遮挡和相机运动等时空扰动导致的性能显著下降问题,即模型在干净、受控环境下的评估结果与实际鲁棒性之间存在显著差距。解决方案的关键在于提出一种名为ROVA的新训练框架,其核心是通过建模一种感知鲁棒性的一致性奖励(robustness-aware consistency reward),并引入一种基于难度感知的在线训练策略:该策略依据模型能力的动态演化持续重新估计样本难度,从而实现自适应训练。此机制使模型能够优先学习更具信息量的样本,有效提升其在现实扰动下的理解与推理能力。
链接: https://arxiv.org/abs/2603.10652
作者: Yangfan He,Changgyu Boo,Jaehong Yoon
机构: NTU Singapore (新加坡南洋理工大学); Korea University (韩国高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model’s evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
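The difficulty-aware sampling strategy can be sketched generically: per-sample success rates, re-estimated as the model evolves, are turned into sampling weights so harder (more informative) samples are drawn more often. The softmax weighting and temperature below are assumptions, not ROVA's exact scheme:

```python
import numpy as np

# Generic difficulty-aware sampling sketch: samples the model fails on
# more often get higher draw probability. Weighting scheme is assumed.
rng = np.random.default_rng(3)

def sampling_probs(success_rate, tau=0.25):
    difficulty = 1.0 - success_rate            # frequent failure => hard
    w = np.exp(difficulty / tau)
    return w / w.sum()

success = np.array([0.9, 0.5, 0.1])            # easy, medium, hard samples
p = sampling_probs(success)
draws = rng.choice(3, size=10000, p=p)
counts = np.bincount(draws, minlength=3)
print(np.round(p, 3), counts)                  # hard sample dominates
```

In an online loop, `success` would be refreshed from the model's recent self-reflective evaluations, so the distribution tracks the model's evolving capability as described.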
[CV-45] Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning
【速读】:该论文旨在解决基于骨架的动作表示学习中对比学习(Contrastive Learning, CL)忽略细粒度局部细节以及掩码自编码器(Masked Auto-Encoder, MAE)存在计算冗余和严重不对称性的问题。其核心解决方案是提出SLiM(Skeleton Less is More),一个统一框架,通过共享编码器将掩码建模与对比学习相结合,摒弃重建解码器以消除计算冗余,并迫使编码器直接捕捉判别性特征。关键创新在于引入语义管掩码(semantic tube masking)和骨骼感知增强策略,有效缓解因高时空相关性导致的平凡重建问题,从而在保持卓越性能的同时显著降低推理计算成本(相比现有MAE方法减少7.89倍)。
链接: https://arxiv.org/abs/2603.10648
作者: Jeonghyeok Do,Yun Chen,Geunhyuk Youk,Munchurl Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Please visit our project page at this https URL
Abstract:The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry – benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework with decoder-free masked modeling of representative learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.
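The "semantic tube masking" idea, hiding a semantically coherent joint group over a contiguous temporal window so the masked motion cannot be trivially copied from adjacent frames, can be illustrated in a few lines. The joint grouping and masking ratio below are assumptions, not SLiM's exact configuration:

```python
import numpy as np

# Illustrative sketch: instead of masking frames or joints independently,
# a whole joint group is hidden across a contiguous temporal window (a
# "tube"), defeating recovery from temporally adjacent frames.
def tube_mask(T, J, joint_group, t0, t_len):
    """Return a (T, J) boolean mask; True = masked out."""
    out = np.zeros((T, J), dtype=bool)
    out[t0:t0 + t_len, joint_group] = True
    return out

T, J = 64, 25                          # frames, joints (NTU-style skeleton)
arm = [9, 10, 11, 23, 24]              # a semantically coherent joint group
tm = tube_mask(T, J, arm, t0=16, t_len=32)
print(tm.sum(), tm.mean())             # masked entries and overall ratio
```

Per-frame or per-joint random masking would leave each hidden entry recoverable from its highly correlated neighbors, which is exactly the trivial-reconstruction failure mode the tube structure is designed to prevent.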
[CV-46] Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting
【速读】:该论文旨在解决物理AI在训练与部署阶段因视角变化导致的性能下降问题,尤其关注单目RGB到3D感知中的新视角鲁棒性(novel-view robustness)。其核心挑战在于如何有效利用数字孪生(digital twin)生成的监督信号进行深度预训练,同时确保模型在未见视角下的泛化能力。解决方案的关键在于提出Splat2Real框架,通过引入CN-Coverage课程学习策略——该策略结合几何增益(geometry gain)与外推惩罚(extrapolation penalty)来贪婪选择最具信息量的新视角,并辅以质量感知的保护机制(guardrail fallback)以应对低可靠性教师模型输出。实验表明,该方法显著优于基线策略,在不同视图预算下均展现出更稳定的性能和更低的高新颖性尾部误差,从而为机器人控制代理提供了具身相关性的实证支持。
链接: https://arxiv.org/abs/2603.10638
作者: Hansol Lim,Jongseong Brad Choi
机构: State University of New York, Korea (纽约州立大学韩国分校); State University of New York, Stony Brook (纽约州立大学石溪分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Physical AI faces viewpoint shift between training and deployment, and novel-view robustness is essential for monocular RGB-to-3D perception. We cast Real2Render2Real monocular depth pretraining as imitation-learning-style supervision from a digital twin oracle: a student depth network imitates expert metric depth/visibility rendered from a scene mesh, while 3DGS supplies scalable novel-view observations. We present Splat2Real, centered on novel-view scaling: performance depends more on which views are added than on raw view count. We introduce CN-Coverage, a coverage+novelty curriculum that greedily selects views by geometry gain and an extrapolation penalty, plus a quality-aware guardrail fallback for low-reliability teachers. Across 20 TUM RGB-D sequences with step-matched budgets (N=0 to 2000 additional rendered views, with N unique = 500 and resampling for larger budgets), naive scaling is unstable; CN-Coverage mitigates worst-case regressions relative to Robot/Coverage policies, and GOL-Gated CN-Coverage provides the strongest medium-high-budget stability with the lowest high-novelty tail error. Downstream control-proxy results versus N provides embodied-relevance evidence by shifting safety/progress trade-offs under viewpoint shift.
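CN-Coverage 的"覆盖 + 新颖度"贪心选视角逻辑可以草绘如下:每一步选取"几何覆盖增益 − λ·外推惩罚"得分最高的候选视角。评分的具体形式、距离度量与示例数据均为假设,仅用于说明该课程策略的选择机制:

```python
import numpy as np

def cn_coverage_select(cand_views, train_views, budget, lam=0.5):
    """CN-Coverage 贪心选视角的极简示意(评分函数为假设形式):
    得分 = 新覆盖的几何单元数 - lam * 外推惩罚,
    其中外推惩罚取候选位姿到最近已有视角的距离。"""
    covered, poses = set(), [np.asarray(p, float) for p in train_views]
    selected = []
    cand = {k: (set(v[0]), np.asarray(v[1], float)) for k, v in cand_views.items()}
    for _ in range(budget):
        best, best_score = None, -np.inf
        for name, (cells, pose) in cand.items():
            if name in selected:
                continue
            gain = len(cells - covered)  # 几何覆盖增益
            penalty = min(np.linalg.norm(pose - p) for p in poses)  # 外推程度
            score = gain - lam * penalty
            if score > best_score:
                best, best_score = name, score
        selected.append(best)
        covered |= cand[best][0]
        poses.append(cand[best][1])
    return selected

cands = {
    "near": ({1, 2, 3}, [0.1, 0.0]),      # 覆盖少,但贴近训练视角
    "far":  ({4, 5, 6, 7}, [5.0, 5.0]),   # 覆盖多,但外推严重
    "mid":  ({1, 4, 8}, [0.5, 0.5]),
}
picked = cn_coverage_select(cands, train_views=[[0.0, 0.0]], budget=2, lam=1.0)
```

可以看到,覆盖量最大的 "far" 视角因外推惩罚过高而被跳过,体现了论文"加哪些视角比加多少视角更重要"的结论。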
[CV-47] HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement
【速读】:该论文旨在解决生成式模型在提升合成数据视觉真实感(photorealism)时引入视觉伪影(visual artifacts)并消耗高计算资源的问题,从而限制其在实时训练或评估场景中的应用。解决方案的关键在于提出一种轻量级图像到图像翻译方法——HyPER-GAN,其核心是基于U-Net结构的生成器,并采用混合训练策略:在使用配对的合成与增强真实感图像进行训练的基础上,引入来自真实世界数据的匹配补丁(matched patches),以提升视觉真实感和语义一致性。实验表明,该方法在推理延迟、视觉真实感和语义鲁棒性方面均优于当前最优的成对图像翻译方法。
链接: https://arxiv.org/abs/2603.10604
作者: Stefanos Pasios,Nikos Nikolaidis
机构: Aristotle University of Thessaloniki (塞萨洛尼基亚里士多德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:Generative models are widely employed to enhance the photorealism of synthetic data for training computer vision algorithms. However, they often introduce visual artifacts that degrade the accuracy of these algorithms and require high computational resources, limiting their applicability in real-time training or evaluation scenarios. In this paper, we propose Hybrid Patch Enhanced Realism Generative Adversarial Network (HyPER-GAN), a lightweight image-to-image translation method based on a U-Net-style generator designed for real-time inference. The model is trained using paired synthetic and photorealism-enhanced images, complemented by a hybrid training strategy that incorporates matched patches from real-world data to improve visual realism and semantic consistency. Experimental results demonstrate that HyPER-GAN outperforms state-of-the-art paired image-to-image translation methods in terms of inference latency, visual realism, and semantic robustness. Moreover, it is illustrated that the proposed hybrid training strategy indeed improves visual quality and semantic consistency compared to training the model solely with paired synthetic and photorealism-enhanced images. Code and pretrained models are publicly available for download at: this https URL
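文中的混合训练策略可以用下面的补丁替换草图来理解:随机抽取一部分网格补丁位置,用真实域图像的对应补丁替换合成图像补丁。原文是按内容匹配真实补丁,此处的随机位置替换仅为示意:

```python
import numpy as np

def hybrid_patch_mix(synthetic, real, patch=2, ratio=0.5, rng=None):
    """混合补丁训练的极简示意(补丁选取方式为假设):
    在规则网格上随机选 ratio 比例的补丁位置,
    用真实域图像对应区域替换,构造兼具真实纹理与合成布局的训练样本。"""
    rng = rng or np.random.default_rng(0)
    out = synthetic.copy()
    H, W = synthetic.shape[:2]
    ys, xs = np.meshgrid(range(0, H, patch), range(0, W, patch), indexing="ij")
    coords = list(zip(ys.ravel(), xs.ravel()))
    k = int(len(coords) * ratio)
    for idx in rng.choice(len(coords), size=k, replace=False):
        y, x = coords[idx]
        out[y:y + patch, x:x + patch] = real[y:y + patch, x:x + patch]
    return out

# 示例:4x4 "合成图"(全 0)与 "真实图"(全 1),2x2 补丁,替换一半
syn = np.zeros((4, 4))
real = np.ones((4, 4))
mixed = hybrid_patch_mix(syn, real, patch=2, ratio=0.5)
```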
[CV-48] Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection
【速读】:该论文旨在解决生成式 AI(Generative AI)合成图像日益逼近真实照片所带来的媒体可信度与内容篡改安全风险问题,尤其是现有检测方法因依赖特定模型痕迹或低层统计特征而导致泛化能力差的局限性。其解决方案的关键在于提出一种名为“潜在空间过渡差异”(Latent Transition Discrepancy, LTD)的新方法,该方法通过分析真实图像与合成图像在神经网络隐层表示中的跨层一致性差异——即真实图像在潜在空间中保持语义注意力和结构连贯性、特征过渡更稳定,而合成图像则表现出可区分的不一致模式——从而实现对合成图像的有效识别。LTD 自适应地筛选最具判别力的网络层,并量化层间特征转移差异,显著提升了检测准确率、泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2603.10598
作者: Yawen Yang,Feng Li,Shuqi Kong,Yunfeng Diao,Xinjian Gao,Zenglin Shi,Meng Wang
机构: Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetics makes them increasingly indistinguishable from authentic photographs, posing serious security risks, such as media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches suffer from poor generalization to unseen data due to their reliance on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction that real images maintain consistent semantic attention and structural coherence in their latent representations, exhibiting more stable feature transitions across network layers, whereas synthetic ones present discernible distinct patterns. Therefore, we propose a novel approach termed latent transition discrepancy (LTD), which captures the inter-layer consistency differences of real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across layers. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35% in mean Acc across three datasets containing diverse GANs and DMs. Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness. The code is available at this https URL
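LTD 利用"跨层特征过渡稳定性"作为判别信号,其基本思想可以用如下草图表达:对每层特征计算相邻层之间的余弦相似度序列,再以其波动(标准差)刻画过渡不一致程度。特征模拟方式与判别统计量均为演示假设,并非论文的自适应层选择实现:

```python
import numpy as np

def layer_transition_discrepancy(layer_feats):
    """LTD 核心思想的极简示意:
    相邻层特征的余弦相似度越平稳,图像越可能为真实图像。"""
    sims = []
    for a, b in zip(layer_feats[:-1], layer_feats[1:]):
        a, b = np.asarray(a, float), np.asarray(b, float)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.std(sims))  # 波动越大,跨层过渡越不一致

# 模拟:真实图像的层间特征平滑演化,合成图像在中间层出现方向突变
base = np.ones(8)
real_feats = [base + 0.01 * i * np.arange(8) for i in range(5)]
fake_feats = [base, base, -0.5 * base, base, base]
d_real = layer_transition_discrepancy(real_feats)
d_fake = layer_transition_discrepancy(fake_feats)
```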
[CV-49] Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion
【速读】:该论文旨在解决深度补全(depth completion)任务中扩散模型(diffusion-based methods)在推理阶段计算成本高昂的问题,尤其是在实际应用中面临延迟约束时难以部署的挑战。现有方法通常依赖于昂贵的测试时优化(test-time optimization),限制了其在实时场景中的实用性。解决方案的关键在于提出Marigold-SSD——一种单步、晚期融合(late-fusion)的深度补全框架,通过将计算负担从推理阶段转移到微调阶段(finetuning),在仅需4.5 GPU天训练成本的前提下实现高效且鲁棒的3D感知。该方法利用强大的扩散先验(diffusion priors)同时显著提升推理速度,缩小了扩散模型与判别式模型(discriminative models)之间的效率差距,并在多个室内和室外基准上展现出卓越的跨域泛化能力和零样本性能。
链接: https://arxiv.org/abs/2603.10584
作者: Jakub Gregorek,Paraskevas Pegios,Nando Metzger,Konrad Schindler,Theodora Kontogianni,Lazaros Nalpantidis
机构: DTU - Technical University of Denmark (丹麦技术大学); Pioneer Centre for AI (人工智能先锋中心); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints. Marigold-SSD achieves significantly faster inference with a training cost of only 4.5 GPU days. We evaluate our method across four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels. Page: this https URL
[CV-50] Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution CVPR2026
【速读】:该论文旨在解决当前AI生成图像溯源(AI-generated image attribution)方法中存在的模型依赖性问题,即现有技术通常需要访问原始生成模型才能实现准确识别,导致其在面对未知或新出现的生成器时缺乏泛化能力和可扩展性。为应对这一挑战,作者提出了一种新的范式,将图像溯源问题建模为实例检索(instance retrieval)而非传统的图像分类任务,并设计了一个无需依赖具体生成模型的高效框架——低比特平面深度伪造溯源(Low-bIt-plane-based Deepfake Attribution, LIDA)。LIDA的关键创新在于其双阶段训练机制:首先通过无监督预训练学习通用特征表示,再结合少量样本进行少样本溯源适配,从而在零样本和少样本场景下均实现了卓越的检测与溯源性能。
链接: https://arxiv.org/abs/2603.10583
作者: Hongsong Wang,Renxi Cheng,Chaolei Han,Jie Gui
机构: Southeast University (东南大学); Purple Mountain Laboratories (紫金山实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in CVPR 2026, Code is at this https URL
Abstract:With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by subsequent Few-Shot Attribution Adaptation. Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings. The code is at this https URL
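"低比特平面指纹 + 实例检索"的溯源流程可以草绘如下:取像素的低位比特平面并归一化作为指纹,再在已知生成器的指纹库中做最近邻检索。指纹构造方式与示例纹理均为假设,并非 LIDA 中 Low-Bit Fingerprint Generation 模块的真实实现:

```python
import numpy as np

def low_bit_fingerprint(img, n_bits=2):
    """提取低 n_bits 位比特平面作为检索指纹(示意):
    生成模型的痕迹多残留于低位噪声中。"""
    lsb = (np.asarray(img, dtype=np.uint8) & ((1 << n_bits) - 1)).astype(float)
    v = lsb.ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def attribute_by_retrieval(query_img, gallery):
    """把溯源建模为实例检索:返回指纹库中余弦相似度最高的生成器标签。"""
    q = low_bit_fingerprint(query_img)
    return max(gallery, key=lambda name: float(q @ gallery[name]))

# 模拟两类生成器的低位纹理(棋盘 / 反棋盘,纯属演示假设)
checker = (np.indices((8, 8)).sum(0) % 2).astype(np.uint8)
gan_proto = checker * np.uint8(3)
dm_proto = (1 - checker).astype(np.uint8) * np.uint8(3)
gallery = {"GAN-A": low_bit_fingerprint(gan_proto),
           "DM-B": low_bit_fingerprint(dm_proto)}
query = gan_proto + np.uint8(128)  # 高位内容不同,低位纹理与 GAN-A 一致
label = attribute_by_retrieval(query, gallery)
```

检索式溯源的好处在于:新增生成器只需向指纹库追加样本,无需像分类器那样重新训练,这正是其"model-agnostic"可扩展性的来源。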
[CV-51] R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
【速读】:该论文旨在解决沉浸式计算机图形(Immersive Computer Graphics, CG)质量评估中存在的两大挑战:一是现有CG数据集缺乏对渲染质量的系统性描述,二是现有质量评估方法无法提供合理的文本解释。为应对这些问题,作者首先从用户视角识别出六个关键的感知维度,并构建了一个包含3500张CG图像及其对应质量描述的数据集,每个描述涵盖风格、内容及所选维度上的感知质量。在此基础上,利用子集建立基于描述的问答基准以评估视觉语言模型(Vision Language Models, VLMs)的表现。研究发现当前VLMs在细粒度CG质量判断上准确性不足,但视觉相似图像的描述能显著提升其理解能力。受此启发,论文提出一种两流检索增强生成框架,通过引入检索机制增强VLM的CG质量评估能力,实验证明该方法可显著提升多个代表性VLM在CG质量评估任务中的性能。
链接: https://arxiv.org/abs/2603.10578
作者: Zhuangzi Li,Jian Jin,Shilv Cai,Weisi Lin
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:
Abstract:Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: First, existing CG datasets lack systematic descriptions of rendering quality; and second existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM’s understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.
[CV-52] UniStitch: Unifying Semantic and Geometric Features for Image Stitching
【速读】:该论文旨在解决传统图像拼接方法与基于学习的拼接方法长期分离、缺乏有效融合的问题,即如何统一几何特征(geometric features)与语义特征(semantic features)以提升拼接性能。其解决方案的关键在于提出UniStitch框架,通过两个核心模块实现多模态特征的对齐与自适应融合:首先设计神经点变换器(Neural Point Transformer, NPT)模块,将离散稀疏的1D关键点(keypoint)映射为有序稠密的2D语义特征图;随后引入自适应专家混合(Adaptive Mixture of Experts, AMoE)模块,在融合过程中动态聚焦于更可靠的特征表示,从而增强模型在复杂场景下的鲁棒性。此方法实现了几何与语义特征的有效协同,显著优于单一特征驱动的现有方法。
链接: https://arxiv.org/abs/2603.10568
作者: Yuan Mei,Lang Nie,Kang Liao,Yunqiu Xu,Chunyu Lin,Bin Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
Abstract:Traditional image stitching methods estimate warps from hand-crafted geometric features, whereas recent learning-based solutions leverage semantic features from neural networks instead. These two lines of research have largely diverged along separate evolution, with virtually no meaningful convergence to date. In this paper, we take a pioneering step to bridge this gap by unifying semantic and geometric features with UniStitch, a unified image stitching framework from multimodal features. To align discrete geometric features (i.e., keypoint) with continuous semantic feature maps, we present a Neural Point Transformer (NPT) module, which transforms unordered, sparse 1D geometric keypoints into ordered, dense 2D semantic maps. Then, to integrate the advantages of both representations, an Adaptive Mixture of Experts (AMoE) module is designed to fuse geometric and semantic representations. It dynamically shifts focus toward more reliable features during the fusion process, allowing the model to handle complex scenes, especially when either modality might be compromised. The fused representation can be adopted into common deep stitching pipelines, delivering significant performance gains over any single feature. Experiments show that UniStitch outperforms existing state-of-the-art methods with a large margin, paving the way for a unified paradigm between traditional and learning-based image stitching.
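NPT 模块"把无序稀疏的 1D 关键点变为有序稠密的 2D 特征图"的过程,可以用最朴素的栅格化来示意(原文为可学习的神经变换,此处的取整散射与均值聚合仅为假设的说明性实现):

```python
import numpy as np

def neural_point_transform(keypoints, descs, hw):
    """NPT 思想的极简示意:
    将 (x, y) 关键点及其描述子散射到 HxW 网格上,
    得到与连续语义特征图空间对齐的稠密 2D 表示。"""
    H, W = hw
    C = descs.shape[1]
    fmap = np.zeros((H, W, C))
    cnt = np.zeros((H, W, 1))
    for (x, y), d in zip(keypoints, descs):
        i, j = int(round(y)), int(round(x))
        if 0 <= i < H and 0 <= j < W:
            fmap[i, j] += d
            cnt[i, j] += 1
    return fmap / np.maximum(cnt, 1)  # 落入同一格点的关键点取均值

kps = np.array([[1.2, 0.8], [1.0, 1.0], [5.0, 3.0]])   # (x, y),假设数据
descs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
fmap = neural_point_transform(kps, descs, hw=(4, 6))
```

栅格化之后,几何特征与语义特征图具有相同的 2D 布局,后续的 AMoE 模块便可在逐位置的粒度上自适应加权融合两种模态。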
[CV-53] PET-F2I: A Comprehensive Benchmark and Parameter-Efficient Fine-Tuning of LLM s for PET/CT Report Impression Generation
【速读】:该论文旨在解决PET/CT影像报告中诊断印象(impression)自动生成的难题,即如何利用大语言模型(LLM)从复杂的影像学发现中提炼出准确、完整且符合临床规范的诊断结论。当前尽管生成式AI(Generative AI)在医学文本生成领域展现出潜力,但其在高度专业化的PET/CT领域仍缺乏系统评估与有效适配。解决方案的关键在于构建一个大规模、真实世界数据驱动的基准测试集PET-F2I-41K(含41,000余份PET/CT报告),并在此基础上开发了一个针对PET/CT领域微调的7B参数模型PET-F2I-7B(基于Qwen2.5-7B-Instruct通过LoRA方法进行域适应训练)。此外,研究创新性地提出三项临床导向指标——实体覆盖率(Entity Coverage Rate, ECR)、未覆盖实体率(Uncovered Entity Rate, UER)和事实一致性率(Factual Consistency Rate, FCR),以更精准衡量生成结果的诊断完整性与可靠性,从而推动可部署于临床场景的PET/CT智能报告系统的研发。
链接: https://arxiv.org/abs/2603.10560
作者: Yuchen Liu,Wenbo Zhang,Liling Peng,Yichi Zhang,Yu Fu,Xin Guo,Chao Qu,Yuan Qi,Le Xue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:PET/CT imaging is pivotal in oncology and nuclear medicine, yet summarizing complex findings into precise diagnostic impressions is labor-intensive. While LLMs have shown promise in medical text generation, their capability in the highly specialized domain of PET/CT remains underexplored. We introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports. Using PET-F2I-41K, we conduct a comprehensive evaluation of 27 models across proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and we develop a domain-adapted 7B model (PET-F2I-7B) fine-tuned from Qwen2.5-7B-Instruct via LoRA. Beyond standard NLG metrics (e.g., BLEU-4, ROUGE-L, BERTScore), we propose three clinically grounded metrics - Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR) - to assess diagnostic completeness and factual reliability. Experiments reveal that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings. In contrast, PET-F2I-7B achieves substantial gains (e.g., 0.708 BLEU-4) and a 3.0x improvement in entity coverage over the strongest baseline, while offering advantages in cost, latency, and privacy. Beyond this modeling contribution, PET-F2I-41K establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.
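论文提出的临床指标中,ECR 与 UER 的计算可以草绘如下(实体抽取方式与示例实体均为假设,原文基于报告实体集合计算):

```python
def entity_coverage_metrics(ref_entities, gen_entities):
    """Entity Coverage Rate / Uncovered Entity Rate 的极简示意:
    ECR = 参考印象中的关键实体被生成文本覆盖的比例;
    UER = 未被覆盖实体的比例(= 1 - ECR)。"""
    ref, gen = set(ref_entities), set(gen_entities)
    if not ref:
        return 1.0, 0.0
    ecr = len(ref & gen) / len(ref)
    return ecr, 1.0 - ecr

# 示例:参考印象含 3 个关键实体,生成印象覆盖其中 2 个(假设数据)
ref = ["FDG uptake", "right upper lobe nodule", "mediastinal lymph node"]
gen = ["right upper lobe nodule", "FDG uptake"]
ecr, uer = entity_coverage_metrics(ref, gen)
```

与 BLEU/ROUGE 等 n-gram 指标不同,此类实体级指标直接度量诊断结论的完整性,遗漏关键病灶会被显式计入 UER。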
[CV-54] P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video
【速读】:该论文旨在解决现有2D Gaussian splatting方法在图像和视频重建中难以实现高质量、高分辨率且可扩展的显式表示问题。其核心挑战在于如何构建一个统一的分层结构,以支持从粗到细的渐进式重建,并确保多层高斯表示之间的优化兼容性与稳定性。解决方案的关键是提出P-GSVC框架,该框架将高斯点(Gaussian splats)组织为基底层和一系列增强层,并设计了一种联合训练策略(joint training strategy),同时优化各层高斯参数,使不同层间的优化轨迹对齐,从而保障层间一致性与稳定性的渐进重建过程。实验表明,该方法相比顺序逐层训练可提升高达2.6 dB(图像)和1.9 dB(视频)的PSNR性能。
链接: https://arxiv.org/abs/2603.10551
作者: Longan Wang,Yuang Shi,Wei Tsang Ooi
机构: National University of Singapore(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: MMSys 2026; Project Website: see this https URL
Abstract:Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework that provides a unified solution for scalable Gaussian representation in both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstructions. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and a stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy can gain up to 1.9 dB improvement in PSNR for video and 2.6 dB improvement in PSNR for image when compared to methods that perform sequential layer-wise training. Project page: this https URL
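"基底层 + 增强层"的渐进式重建可以用加性残差来近似示意(真实方法是对各层 2D 高斯做光栅化叠加,此处用简单数组残差代替,仅说明"解码更多层、质量更高"的分层结构):

```python
import numpy as np

def render_progressive(layers, n_layers):
    """分层渐进表示的极简示意:解码前 n_layers 层得到对应质量档的重建。"""
    return sum(layers[:n_layers])

target = np.linspace(0.0, 1.0, 16).reshape(4, 4)
base = np.full_like(target, target.mean())  # 基底层:粗略近似
enh1 = (target - base) * 0.7                # 增强层 1:补充大部分残差
enh2 = target - base - enh1                 # 增强层 2:补齐剩余细节
layers = [base, enh1, enh2]

# 随解码层数增加,重建误差单调下降
errs = [np.abs(target - render_progressive(layers, k)).mean() for k in (1, 2, 3)]
```

论文的联合训练策略正是为了让各层残差在优化轨迹上相互兼容,使任意截断层数都能得到稳定的重建,这也是其相对逐层顺序训练取得 PSNR 增益的原因。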
[CV-55] Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues
【速读】:该论文旨在解决高性能碳纤维增强聚合物(CFRP)无损检测中,基于主动红外热成像(AIRT)的生成式人工智能(Generative AI)方法因依赖大量耗时且昂贵的标注数据集而难以部署的问题。其核心解决方案在于提出一种语言引导的框架,利用预训练多模态视觉-语言模型(VLMs)与轻量级适配器(AIRT-VLM Adapter)相结合的方式,实现零样本(zero-shot)缺陷理解与定位。关键创新点在于:不需为缺陷检测器专门构建训练数据集,而是通过AIRT-VLM Adapter提升热图像中缺陷的可见性,并将热成像域对齐至VLMs已学习的语义空间,从而在无需额外训练的情况下实现高精度缺陷检测(IoU达70%),同时相较传统降维方法获得超过10 dB的信噪比增益。
链接: https://arxiv.org/abs/2603.10549
作者: Mohammed Salah,Eman Ouda,Giuseppe Dell’Avvocato,Fabrizio Sarasini,Ester D’Accardi,Jorge Dias,Davor Svetinovic,Stefano Sfarra,Yusra Abdulrahman
机构: Khalifa University of Science and Technology (哈利法大学科学技术); University of L’Aquila (拉奎拉大学); Sapienza University of Rome (罗马大学); Polytechnic University of Bari (巴里理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Active infrared thermography (AIRT) is currently witnessing a surge of artificial intelligence (AI) methodologies being deployed for automated subsurface defect analysis of high performance carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not require developing training datasets for extensive training of defect detectors, instead it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs; specifically, GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. Experimental results demonstrate that the AIRT-VLM adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union values reaching 70%.
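文中用于比较的 SNR(dB)增益,可按热成像缺陷检测中一种常见定义计算如下(此为示意公式,论文中的精确定义以原文为准):

```python
import numpy as np

def snr_db(defect_region, background_region):
    """热像缺陷 SNR(dB)的一种常见定义(示意):
    SNR = 20 * log10(|缺陷区均值 - 背景均值| / 背景标准差)。"""
    mu_d = np.mean(defect_region)
    mu_b, sigma_b = np.mean(background_region), np.std(background_region)
    return 20.0 * np.log10(abs(mu_d - mu_b) / sigma_b)

# 示例:缺陷区温度均值 10,背景均值 0、标准差 1(假设数据)
defect = np.array([10.0, 10.0, 10.0])
background = np.array([1.0, -1.0, 1.0, -1.0])
gain = snr_db(defect, background)
```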
[CV-56] Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation
【速读】:该论文旨在解决当前可提示基础模型(Promptable Foundation Models, FMs)在医学图像分割任务中因模型多样性、评估标准不统一及人类提示输入差异导致的性能比较困难与临床适用性选择复杂的问题。其解决方案的关键在于:通过系统性地测试11种FMs在2D和3D非迭代提示策略下的表现,结合私有与公共数据集(聚焦腕骨、肩部、髋部和小腿四部位的骨骼与植入物分割),识别帕累托最优模型,并利用专门设计的观察者研究收集的人类提示进行验证,从而揭示不同模型对提示变化的敏感性、结构复杂度对分割一致性的影响以及理想提示与真实人类提示之间的性能差距。结果表明,尽管部分模型(如SAM、SAM2.1、nnInteractive和Med-SAM2)在特定条件下表现优异,但所有模型均对人类提示输入敏感,且跨观察者一致性较低,说明在实际临床场景中选择最适模型仍具挑战性。
链接: https://arxiv.org/abs/2603.10541
作者: Caroline Magg,Maaike A. ter Wee,Johannes G.G. Dobbe,Geert J. Streekstra,Leendert Blankevoort,Clara I. Sánchez,Hoel Kervadec
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies a lot between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on “ideal” prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, it did not scale to inter-rater settings. We conclude that the selection of the most optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: this https URL
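文中"识别帕累托最优模型"的做法可以用二维指标(精度、速度,均为越大越好)的支配关系来示意;以下模型名取自摘要、数值纯属虚构,仅演示筛选逻辑:

```python
def pareto_optimal(models):
    """从 (精度, 速度) 指标中筛选帕累托最优模型的极简示意:
    若存在另一模型在两个指标上都不差且至少一项严格更优,则该模型被支配。"""
    front = []
    for name, (acc, speed) in models.items():
        dominated = any(
            (a >= acc and s >= speed) and (a > acc or s > speed)
            for other, (a, s) in models.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

models = {
    "SAM":           (0.90, 20.0),
    "SAM2.1":        (0.92, 15.0),
    "MedSAM":        (0.85, 10.0),  # 精度与速度均低于 SAM,被支配
    "nnInteractive": (0.95, 5.0),
}
front = pareto_optimal(models)
```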
[CV-57] DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime CVPR2026
【速读】:该论文旨在解决场景图生成(Scene Graph Generation, SGG)在资源受限的边缘设备上部署时面临的低延迟与高效率问题,现有方法普遍忽视了实际应用中对速度和计算资源的严格要求。其解决方案的关键在于提出DSFlash模型,该模型通过优化架构设计实现56帧/秒的视频流处理速度(在RTX 3090 GPU上),同时保持与当前最先进方法相当的性能;更重要的是,DSFlash能够生成包含丰富上下文信息的全景场景图(panoptic scene graph),而非仅限于显著关系,且训练成本极低——仅需不到24小时即可在单张九年前的GTX 1080 GPU上完成训练,从而显著提升SGG模型在计算资源有限环境下的可访问性和实用性。
链接: https://arxiv.org/abs/2603.10538
作者: Julian Lorenz,Vladyslav Kovganko,Elias Kohout,Mrunmai Phatak,Daniel Kienzle,Rainer Lienhart
机构: University of Augsburg (奥格斯堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026
Abstract:Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU, without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.
[CV-58] Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis CVPR2026
【速读】:该论文旨在解决病理学中癌症预后预测模型因训练样本稀缺而导致的泛化能力不足问题,尤其是在肿瘤异质性较高的情况下。现有方法如多癌种联合学习或知识迁移虽有所探索,但普遍存在计算效率低的问题,例如需要大规模联合训练或复杂的多模型推理。其解决方案的关键在于提出一种名为“稀疏任务向量混合与超网络”(Sparse Task Vector Mixup with Hypernetworks, STEPH)的新范式:通过任务向量混合(task vector mixup)对每个源-目标癌种对进行知识融合,并利用超网络(hypernetwork)稀疏聚合这些混合向量,从而高效地将其他癌种的通用知识迁移到目标癌种模型中,无需大规模联合训练或复杂多模型推理,显著提升了模型性能与计算效率。
链接: https://arxiv.org/abs/2603.10526
作者: Pei Liu,Xiangxiang Zeng,Tengfei Ma,Yucheng Xing,Xuanbai Ren,Yiping Liu
机构: Hunan University (湖南大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Whole-Slide Images (WSIs) are widely used for estimating the prognosis of cancer patients. Current studies generally follow a cancer-specific learning paradigm. However, the available training samples for one cancer type are usually scarce in pathology. Consequently, the model often struggles to learn generalizable knowledge, thus performing worse on the tumor samples with inherent high heterogeneity. Although multi-cancer joint learning and knowledge transfer approaches have been explored recently to address it, they either rely on large-scale joint training or extensive inference across multiple models, posing new challenges in computational efficiency. To this end, this paper proposes a new scheme, Sparse Task Vector Mixup with Hypernetworks (STEPH). Unlike previous ones, it efficiently absorbs generalizable knowledge from other cancers for the target via model merging: i) applying task vector mixup to each source-target pair and then ii) sparsely aggregating task vector mixtures to obtain an improved target model, driven by hypernetworks. Extensive experiments on 13 cancer datasets show that STEPH improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively. Moreover, it is a more efficient solution for learning prognostic knowledge from other cancers, without requiring large-scale joint training or extensive multi-model inference. Code is publicly available at this https URL.
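STEPH 的"任务向量混合 + 稀疏聚合"两步流程可以用向量运算草绘如下。原文中混合系数与稀疏门控由超网络(hypernetwork)产生,此处以给定常数代替,仅说明合并机制:

```python
import numpy as np

def task_vector_mixup(target_tv, source_tv, lam):
    """步骤 i:对单个"源癌种-目标癌种"对的任务向量做凸组合(示意)。"""
    return lam * target_tv + (1 - lam) * source_tv

def sparse_merge(base_weights, target_tv, source_tvs, lams, gates):
    """步骤 ii:按稀疏门控聚合各混合向量,得到改进后的目标模型权重。"""
    merged = np.zeros_like(base_weights)
    for src, lam, g in zip(source_tvs, lams, gates):
        if g > 0:  # 稀疏性:门控为 0 的源癌种不参与合并
            merged += g * task_vector_mixup(target_tv, src, lam)
    return base_weights + merged

# 示例:预训练权重为 0,两个源癌种任务向量,其中一个被门控置零(假设数据)
base = np.zeros(4)
target_tv = np.array([1.0, 0.0, 0.0, 0.0])
sources = [np.array([0.0, 1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0, 0.0])]
w = sparse_merge(base, target_tv, sources, lams=[0.5, 0.5], gates=[1.0, 0.0])
```

整个过程只做权重空间的向量运算,无需联合训练或多模型推理,这正是其计算效率优势的来源。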
[CV-59] Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement
【速读】:该论文旨在解决医学图像合成中因数据稀缺和隐私限制导致的挑战,以及通用文本到图像(Text-to-Image, T2I)模型微调困难的问题,其核心难点在于复杂视觉细节与抽象临床文本之间的显著模态差异,以及语义纠缠现象——即粗粒度文本嵌入模糊了解剖结构与成像风格之间的边界,从而削弱生成过程中的可控性。解决方案的关键在于提出一种视觉引导的文本解耦框架(Visually-Guided Text Disentanglement),通过引入跨模态潜在对齐机制,利用视觉先验显式地将无结构文本分解为独立的语义表示;随后,采用混合特征融合模块(Hybrid Feature Fusion Module, HFFM)通过分离通道将这些解耦特征注入扩散Transformer(Diffusion Transformer, DiT),实现对解剖结构的细粒度控制,从而提升生成质量和下游分类任务性能。
链接: https://arxiv.org/abs/2603.10519
作者: Xin Huang,Junjie Liang,Qingshan Hou,Peng Cao,Jinzhu Yang,Xiaoli Liu,Osmar R. Zaiane
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures. Currently under review
Abstract:Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists, where coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, thus weakening controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results in three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at this https URL.
[CV-60] UHD Image Deblurring via Autoregressive Flow with Ill-conditioned Constraints ECCV2026
【速读】:该论文旨在解决超高清(Ultra-high-definition, UHD)图像去模糊任务中细粒度细节恢复与推理效率之间的权衡问题。现有判别式和生成式方法虽取得显著进展,但在计算成本与生成精细结构能力之间仍存在矛盾。其解决方案的关键在于提出一种带病态约束的自回归流(autoregressive flow)方法,通过分阶段的粗到精(coarse-to-fine)重构策略实现稳定优化:在每一尺度上,将前一尺度结果上采样后叠加当前尺度残差以生成清晰估计;同时引入流匹配(Flow Matching)建模残差生成为条件向量场,并采用少量步数的常微分方程(ODE)采样(如Euler/Heun求解器)以高效增强细节;此外,为缓解UHD多步生成可能引发的数值不稳定性,设计了基于特征诱导注意力矩阵的条件数正则化方案,从而提升收敛性与跨尺度一致性。
链接: https://arxiv.org/abs/2603.10517
作者: Yucheng Xin,Dawei Zhao,Xiang Chen,Chen Wu,Pu Wang,Dianjie Lu,Guijuan Zhang,Xiuyi Jia,Zhuoran Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ECCV 2026
Abstract:Ultra-high-definition (UHD) image deblurring poses significant challenges for UHD restoration methods, which must balance fine-grained detail recovery and practical inference efficiency. Although prominent discriminative and generative methods have achieved remarkable results, a trade-off persists between computational cost and the ability to generate fine-grained detail for UHD image deblurring tasks. To further alleviate these issues, we propose a novel autoregressive flow method for UHD image deblurring with an ill-conditioned constraint. Our core idea is to decompose UHD restoration into a progressive, coarse-to-fine process: at each scale, the sharp estimate is formed by upsampling the previous-scale result and adding a current-scale residual, enabling stable, stage-wise refinement from low to high resolution. We further introduce Flow Matching to model residual generation as a conditional vector field and perform few-step ODE sampling with efficient Euler/Heun solvers, enriching details while keeping inference affordable. Since multi-step generation at UHD can be numerically unstable, we propose an ill-conditioning suppression scheme by imposing condition-number regularization on a feature-induced attention matrix, improving convergence and cross-scale consistency. Our method demonstrates promising performance on blurred images at 4K (3840 \times 2160) or higher resolutions.
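文中"以流匹配建模残差、用少步 Euler 求解 ODE 采样"的做法可以草绘如下。真实方法中向量场由网络预测,此处用线性插值流的解析向量场代替,仅演示采样与"粗尺度上采样 + 本尺度残差"的重建方式:

```python
import numpy as np

def euler_sample(v_field, x0, n_steps=8):
    """少步 Euler ODE 采样的极简示意:沿条件向量场把初值积分为目标残差。"""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * v_field(x, t)
    return x

target = np.array([0.3, -0.2, 0.5])  # 本尺度要生成的"清晰残差"(假设数据)

def v(x, t):
    """线性插值流 x_t = (1 - t) * x0 + t * x1 对应的向量场(解析形式)。"""
    return (target - x) / (1.0 - t + 1e-8)

residual = euler_sample(v, np.zeros(3), n_steps=8)
# 尺度级重建:上一尺度结果上采样后,叠加本尺度采样出的残差
coarse_up = np.array([0.1, 0.1, 0.1])
sharp = coarse_up + residual
```

由于每个尺度只需对残差做少量 ODE 步(Euler/Heun),整体推理开销远低于完整的多步扩散采样,这正是其在 4K 及以上分辨率下兼顾细节与效率的关键。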
[CV-61] Naïve Exposure of Generative AI Capabilities Undermines Deepfake Detection
【Quick Read】: This paper addresses the failure of current deepfake detection methods under generative-AI-driven, semantic-preserving image refinement. The core challenge is a structural mismatch between the threat models assumed by existing detection frameworks and the freedom and authenticity-reasoning capabilities exhibited by real-world commercial generative AI systems. The key finding is that even an attacker using only policy-compliant prompts can have commercial chatbot services articulate explicit internal "authenticity criteria" and apply them for unsupervised image refinement, substantially raising perceptual quality and evading detection without degrading face-recognition accuracy. This indicates that the root cause is not a technical vulnerability but the detectors' failure to model the open-ended reasoning capabilities of generative AI.
Link: https://arxiv.org/abs/2603.10504
Authors: Sunpill Kim, Chanwoo Hwang, Minsu Kim, Jae Hong Seo
Institutions: Hanyang University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generative AI systems increasingly expose powerful reasoning and image refinement capabilities through user-facing chatbot interfaces. In this work, we show that the naïve exposure of such capabilities fundamentally undermines modern deepfake detectors. Rather than proposing a new image manipulation technique, we study a realistic and already-deployed usage scenario in which an adversary uses only benign, policy-compliant prompts and commercial generative AI systems. We demonstrate that state-of-the-art deepfake detection methods fail under semantic-preserving image refinement. Specifically, we show that generative AI systems articulate explicit authenticity criteria and inadvertently externalize them through unrestricted reasoning, enabling their direct reuse as refinement objectives. As a result, refined images simultaneously evade detection, preserve identity as verified by commercial face recognition APIs, and exhibit substantially higher perceptual quality. Importantly, we find that widely accessible commercial chatbot services pose a significantly greater security risk than open-source models, as their superior realism, semantic controllability, and low-barrier interfaces enable effective evasion by non-expert users. Our findings reveal a structural mismatch between the threat models assumed by current detection frameworks and the actual capabilities of real-world generative AI. While detection baselines are largely shaped by prior benchmarks, deployed systems expose unrestricted authenticity reasoning and refinement despite stringent safety controls in other domains.
[CV-62] IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation
【Quick Read】: This paper addresses two issues in current In-Image Machine Translation (IIMT) research: existing benchmarks are largely synthetic and fail to reflect real-world complexity, and evaluation protocols consider only single-modality metrics while ignoring cross-modal consistency between the model's output text and the text rendered in the image. The key contribution is IMTBench, a new benchmark of 2,500 image-translation samples covering four practical scenarios and nine languages, paired with a multi-aspect evaluation suite spanning translation quality, background preservation, overall image quality, and a cross-modal alignment score that quantifies consistency between the model's translated text and the text rendered in the translated image, enabling a more comprehensive measure of end-to-end in-image text translation.
Link: https://arxiv.org/abs/2603.10495
Authors: Jiahao Lyu, Pei Fu, Zhenhang Li, Weichao Zeng, Shaojie Zhan, Jiahui Yang, Can Ma, Yu Zhou, Zhenbo Luo, Jian Luan
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present In-image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems, and both closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.
[CV-63] Spatial self-supervised Peak Learning and correlation-based Evaluation of peak picking in Mass Spectrometry Imaging
【Quick Read】: This paper addresses the inconsistent performance of peak-picking methods on heterogeneous mass spectrometry imaging (MSI) datasets, and the limitation that existing evaluations rely on synthetic data or manually selected ion images that do not capture real-world challenges. The key idea is an autoencoder-based, spatially self-supervised peak-learning network that jointly exploits spatial and spectral information to learn an attention mask, automatically selecting spatially structured peaks and markedly improving the biological relevance and consistency of peak picking. An evaluation procedure based on expert-annotated segmentation masks further enables spatially grounded, practically representative assessment, providing a unified and robust framework for comparing peak-picking methods.
Link: https://arxiv.org/abs/2603.10487
Authors: Philipp Weigand, Nikolas Ebert, Shad A. Mohammed, Denis Abu Sammour, Carsten Hopf, Oliver Wasenmüller
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Mass spectrometry imaging (MSI) enables label-free visualization of molecular distributions across tissue samples but generates large and complex datasets that require effective peak picking to reduce data size while preserving meaningful biological information. Existing peak picking approaches perform inconsistently across heterogeneous datasets, and their evaluation is often limited to synthetic data or manually selected ion images that do not fully represent real-world challenges in MSI. To address these limitations, we propose an autoencoder-based spatial self-supervised peak learning neural network that selects spatially structured peaks by learning an attention mask leveraging both spatial and spectral information. We further introduce an evaluation procedure based on expert-annotated segmentation masks, allowing a more representative and spatially grounded assessment of peak picking performance. We evaluate our approach on four diverse public MSI datasets using our proposed evaluation procedure. Our approach consistently outperforms state-of-the-art peak picking methods by selecting spatially structured peaks, thus demonstrating its efficacy. These results highlight the value of our spatial self-supervised network in comparison to contemporary state-of-the-art methods. The evaluation procedure can be readily applied to new MSI datasets, thereby providing a consistent and robust framework for the comparison of spatially structured peak picking methods across different datasets.
[CV-64] StructDamage: A Large Scale Unified Crack and Surface Defect Dataset for Robust Structural Damage Detection
【Quick Read】: This paper tackles the limited real-world generalization of existing structural crack detection and surface-defect classification methods, which stems from the geographic diversity, surface-material coverage, scale, and labeling-consistency limitations of current public datasets. The key contribution is StructDamage, a large, multi-source, standardized dataset of roughly 78,093 images covering nine common building-surface types (walls, tile, concrete, etc.), built by systematically aggregating, harmonizing, and reannotating images from 32 public datasets. The dataset uses a folder-level hierarchy suited to training CNNs and Vision Transformers, and baseline classification results with 15 deep-learning models are provided: twelve achieve macro F1 above 0.96, with the best model, DenseNet201, reaching 98.62% accuracy, offering a reliable resource for developing robust, reproducible crack-damage detection methods.
Link: https://arxiv.org/abs/2603.10484
Authors: Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim
Institutions: University of Gujrat; RPTU University of Kaiserslautern-Landau; German Research Center for Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Automated detection and classification of structural cracks and surface defects is a critical challenge in civil engineering, infrastructure maintenance, and heritage preservation. Recent advances in Computer Vision (CV) and Deep Learning (DL) have significantly improved automatic crack detection. However, these methods rely heavily on large, diverse, and carefully curated datasets that include various crack types across different surface materials. Many existing public crack datasets lack geographic diversity, surface types, scale, and labeling consistency, making it challenging for trained algorithms to generalize effectively in real world conditions. We provide a novel dataset, StructDamage, a curated collection of approximately 78,093 images spanning nine surface types: walls, tile, stone, road, pavement, deck, concrete, and brick. The dataset was constructed by systematically aggregating, harmonizing, and reannotating images from 32 publicly available datasets covering concrete structures, asphalt pavements, masonry walls, bridges, and historic buildings. All images are organized in a folder level classification hierarchy suitable for training Convolutional Neural Networks (CNNs) and Vision Transformers. To highlight the practical value of the dataset, we present baseline classification results using fifteen DL architectures from six model families, with twelve achieving macro F1-scores over 0.96. The best performing model DenseNet201 achieves 98.62% accuracy. The proposed dataset provides a comprehensive and versatile resource suitable for classification tasks. With thorough documentation and a standard structure, it is designed to promote reproducible research and support the development and fair evaluation of robust crack damage detection approaches.
[CV-65] Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression CVPR2026
【Quick Read】: This paper targets hallucinations in large vision-language models (LVLMs) induced by the visual modality, i.e., outputs inconsistent with the input image; prior work has focused mainly on text-induced hallucinations. The key solution is CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free intervention that identifies and suppresses vision-induced hallucinations via counterfactual image perturbations. In an offline phase, CIPHER builds the OHC-25K dataset (diffusion-edited images paired with their original captions) and extracts from it a low-rank subspace characterizing visual hallucination; at inference, intermediate hidden states are projected away from this subspace, effectively suppressing visual hallucinations while preserving task performance.
Link: https://arxiv.org/abs/2603.10470
Authors: Hamidreza Dastmalchi, Aijun An, Ali Cheraghian, Hamed Barzamini
Institutions: York University; Macquarie University; Northern Illinois University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations – unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at this https URL.
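CIPHER's inference phase projects intermediate hidden states away from a low-rank hallucination subspace. A minimal sketch of that projection step, assuming the subspace directions have already been extracted and orthonormalized (all names here are hypothetical, not the paper's code):

```python
def project_away(h, basis):
    """Remove from vector h its components along an orthonormal basis,
    i.e. return h - sum_i <h, u_i> u_i: the projection of h onto the
    orthogonal complement of the subspace spanned by the basis."""
    out = list(h)
    for u in basis:
        coef = sum(oi * ui for oi, ui in zip(out, u))
        out = [oi - coef * ui for oi, ui in zip(out, u)]
    return out


# One hallucination direction along the first coordinate axis:
u1 = [1.0, 0.0, 0.0]
hidden = [3.0, 4.0, 5.0]
cleaned = project_away(hidden, [u1])  # component along u1 is removed
```

In the actual method the basis would come from contrasting representations of counterfactual and authentic (image, caption) pairs; this sketch only shows the linear-algebra core of the correction.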
[CV-66] UniPINN: A Unified PINN Framework for Multi-task Learning of Diverse Navier-Stokes Equations
【Quick Read】: This paper addresses three challenges facing Physics-Informed Neural Networks (PINNs) in multi-flow settings: difficulty in simultaneously capturing shared physical laws and flow-specific features, negative transfer between tasks that degrades prediction accuracy, and unstable training caused by disparate loss magnitudes across heterogeneous flow regimes. The key components of the proposed UniPINN framework are: (1) a shared-specialized architecture that disentangles universal physical laws from flow-specific features; (2) a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing irrelevant interference; and (3) a dynamic weight allocation strategy that adaptively balances loss contributions in multi-objective optimization, stabilizing training and improving unified modeling across flow scenarios.
Link: https://arxiv.org/abs/2603.10466
Authors: Dengdi Sun, Jie Chen, Xiao Wang, Jin Tang
Institutions: Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Physics-Informed Neural Networks (PINNs) have shown promise in solving incompressible Navier-Stokes equations, yet existing approaches are predominantly designed for single-flow settings. When extended to multi-flow scenarios, these methods face three key challenges: (1) difficulty in simultaneously capturing both shared physical principles and flow-specific characteristics, (2) susceptibility to inter-task negative transfer that degrades prediction accuracy, and (3) unstable training dynamics caused by disparate loss magnitudes across heterogeneous flow regimes. To address these limitations, we propose UniPINN, a unified multi-flow PINN framework that integrates three complementary components: a shared-specialized architecture that disentangles universal physical laws from flow-specific features, a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing task-irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi-objective optimization. Extensive experiments on three canonical flows demonstrate that UniPINN effectively unifies multi-flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer. The source code of this paper will be released on this https URL
[CV-67] Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning
【Quick Read】: This paper addresses the underexplored need for complex world knowledge and interactive reasoning in image geolocation, particularly for precise, actionable geolocation in embodied settings. The key solution comprises the WanderBench benchmark and the GeoAoT (Action of Thought) framework. WanderBench is a navigable, graph-structured dataset of over 32K panoramas across six continents that supports physical actions such as rotation and movement, turning static recognition into interactive exploration. GeoAoT actively reduces uncertainty by generating executable plans (e.g., approaching landmarks or adjusting viewpoints) instead of purely textual reasoning chains, achieving action-grounded, reasoning-driven geolocation with markedly better fine-grained localization and stronger generalization in dynamic environments.
Link: https://arxiv.org/abs/2603.10463
Authors: Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min
Institutions: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Geolocation, the task of identifying the geographic location of an image, requires abundant world knowledge and complex reasoning abilities. Though advanced large multimodal models (LMMs) have shown superior aforementioned capabilities, their performance on the geolocation task remains unexplored. To this end, we introduce WanderBench, the first open-access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios. WanderBench contains over 32K panoramas across six continents, organized as navigable graphs that enable physical actions such as rotation and movement, transforming geolocation from static recognition into interactive exploration. Building on this foundation, we propose GeoAoT (Action of Thought), a geolocation framework that couples reasoning with embodied actions. Instead of generating textual reasoning chains, GeoAoT produces actionable plans, such as approaching landmarks or adjusting viewpoints, to actively reduce uncertainty. We further establish an evaluation protocol that jointly measures geolocation accuracy and difficulty-aware geolocation questioning ability. Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments. WanderBench and GeoAoT define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding.
[CV-68] LCAMV: High-Accuracy 3D Reconstruction of Color-Varying Objects Using LCA Correction and Minimum-Variance Fusion in Structured Light
【Quick Read】: This paper addresses high-accuracy structured-light (SL) 3D reconstruction of colored objects, where the core challenges are lateral chromatic aberration (LCA) in the optical components and uneven noise characteristics across the RGB channels. The key contribution is LCAMV (Lateral Chromatic Aberration Correction and Minimum-Variance Fusion), a robust method requiring no extra hardware or multiple exposures: it analytically models and pixel-wise compensates LCA in both the projector and the camera, then adaptively fuses multi-channel phase data using a Poisson-Gaussian noise model and minimum-variance estimation, substantially improving depth accuracy and reducing depth error by up to 43.6% on planar and non-planar colored surfaces.
Link: https://arxiv.org/abs/2603.10456
Authors: Wonbeen Oh, Jae-Sang Hyun
Institutions: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate 3D reconstruction of colored objects with structured light (SL) is hindered by lateral chromatic aberration (LCA) in optical components and uneven noise characteristics across RGB channels. This paper introduces lateral chromatic aberration correction and minimum-variance fusion (LCAMV), a robust 3D reconstruction method that operates with a single projector-camera pair without additional hardware or acquisition constraints. LCAMV analytically models and pixel-wise compensates LCA in both the projector and camera, then adaptively fuses multi-channel phase data using a Poisson-Gaussian noise model and minimum-variance estimation. Unlike existing methods that require extra hardware or multiple exposures, LCAMV enables fast acquisition. Experiments on planar and non-planar colored surfaces show that LCAMV outperforms grayscale conversion and conventional channel-weighting, reducing depth error by up to 43.6%. These results establish LCAMV as an effective solution for high-precision 3D reconstruction of nonuniformly colored objects.
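Minimum-variance fusion of independent per-channel estimates is the textbook inverse-variance weighting; the sketch below shows that formulation on scalar phase readings (the paper's actual Poisson-Gaussian noise modeling is omitted, and all values are made up):

```python
def min_variance_fuse(estimates, variances):
    """Fuse independent estimates x_i with variances s_i^2 using the
    minimum-variance unbiased weights w_i proportional to 1 / s_i^2.
    Returns the fused estimate and its (reduced) variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * x for w, x in zip(weights, estimates)) / total
    return fused, 1.0 / total


# R, G, B phase readings of the same surface point; green is least noisy:
phase, fused_var = min_variance_fuse([1.10, 1.00, 0.90], [0.04, 0.01, 0.04])
```

The fused variance `1 / sum(1/s_i^2)` is always below the smallest per-channel variance, which is why weighting beats both grayscale conversion and picking a single best channel.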
[CV-69] SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning
【Quick Read】: This paper addresses the difficulty of generating natural and linguistically accurate sign language avatars. Current Sign Language Production (SLP) frameworks face two bottlenecks: direct text-to-pose models suffer from regression-to-the-mean, while dictionary-retrieval methods yield robotic, disjointed transitions. The key idea is a new training paradigm that uses sparse keyframes to capture the true kinematic distribution of human signing and predicts dense motion from these discrete anchors, mitigating regression-to-the-mean while ensuring fluid articulation. Core components include: the FAST model, which automatically mines precise temporal boundaries for efficient segmentation; SignSparK, a large-scale Conditional Flow Matching (CFM) framework that synthesizes 3D signing sequences in SMPL-X and MANO spaces; Keyframe-to-Pose (KF2P) generation for precise spatiotemporal editing; and a reconstruction-based CFM objective enabling high-fidelity synthesis in fewer than ten sampling steps, allowing the system to scale to four distinct sign languages as the largest multilingual SLP framework to date.
Link: https://arxiv.org/abs/2603.10446
Authors: Jianhe Low, Alexandre Symeonidis-Herzig, Maksym Ivashechkin, Ozge Mercanoglu Sincan, Richard Bowden
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generating natural and linguistically accurate sign language avatars remains a formidable challenge. Current Sign Language Production (SLP) frameworks face a stark trade-off: direct text-to-pose models suffer from regression-to-the-mean effects, while dictionary-retrieval methods produce robotic, disjointed transitions. To resolve this, we propose a novel training paradigm that leverages sparse keyframes to capture the true underlying kinematic distribution of human signing. By predicting dense motion from these discrete anchors, our approach mitigates regression-to-the-mean while ensuring fluid articulation. To realize this paradigm at scale, we first introduce FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries. We then present SignSparK, a large-scale Conditional Flow Matching (CFM) framework that utilizes these extracted anchors to synthesize 3D signing sequences in SMPL-X and MANO spaces. This keyframe-driven formulation also uniquely unlocks Keyframe-to-Pose (KF2P) generation, making precise spatiotemporal editing of signing sequences possible. Furthermore, our adopted reconstruction-based CFM objective also enables high-fidelity synthesis in fewer than ten sampling steps; this allows SignSparK to scale across four distinct sign languages, establishing the largest multilingual SLP framework to date. Finally, by integrating 3D Gaussian Splatting for photorealistic rendering, we demonstrate through extensive evaluation that SignSparK establishes a new state-of-the-art across diverse SLP tasks and multilingual benchmarks.
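To make the keyframe-driven idea above concrete, here is a purely illustrative densification of a scalar motion track from sparse temporal anchors. The learned generator is far richer than this; the function name and the linear-interpolation rule are our own simplifying assumptions, not the paper's model:

```python
def dense_from_keyframes(values, times, n_frames):
    """Linearly interpolate a dense per-frame track from sparse keyframe
    anchors: values[i] is the track value at frame index times[i]
    (times sorted ascending).  Frames outside the anchors are clamped."""
    out = []
    for f in range(n_frames):
        if f <= times[0]:
            out.append(values[0])
            continue
        if f >= times[-1]:
            out.append(values[-1])
            continue
        # index of the last keyframe at or before frame f
        j = max(i for i in range(len(times)) if times[i] <= f)
        t0, t1 = times[j], times[j + 1]
        a = (f - t0) / (t1 - t0)
        out.append((1 - a) * values[j] + a * values[j + 1])
    return out


# A wrist coordinate rising to a peak at frame 4, back down by frame 8:
track = dense_from_keyframes([0.0, 1.0, 0.0], [0, 4, 8], n_frames=9)
```

The point of the paradigm is that the anchors pin the true kinematic extremes, so dense prediction between them cannot collapse to an averaged, over-smoothed trajectory.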
[CV-70] Unlearning the Unpromptable: Prompt-free Instance Unlearning in Diffusion Models
【Quick Read】: This paper addresses instance unlearning in diffusion models: selectively removing specific undesired outputs (e.g., an individual's face or culturally inaccurate generations) without relying on text prompts, while preserving the rest of the model's functionality. Conventional approaches unlearn via text prompts and cannot handle outputs that are unpromptable. The key solution is a surrogate-based unlearning method whose core components are: localizing target instances via image editing, strengthening forgetting with timestep-aware weighting, and refining the training process with gradient surgery, thereby guiding the model to selectively forget specific outputs without degrading overall performance. Experiments on a conditional model (Stable Diffusion 3) and an unconditional one (DDPM-CelebA) show that this prompt-free method effectively unlearns unpromptable content, outperforming prompt-based and prompt-free baselines.
Link: https://arxiv.org/abs/2603.10445
Authors: Kyungryeol Lee, Kyeonghyun Lee, Seongmin Hong, Byung Hyun Lee, Se Young Chun
Institutions: Seoul National University; Department of Electrical and Computer Engineering; Institute of New Media and Communications; Institute for Practical Artificial Intelligence
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages
Abstract:Machine unlearning aims to remove specific outputs from trained models, often at the concept level, such as forgetting all occurrences of a particular celebrity or filtering content via text prompts. However, many undesired outputs, such as an individual’s face or generations culturally or factually misinterpreted, cannot often be specified by text prompts. We address this underexplored setting of instance unlearning for outputs that are undesired but unpromptable, where the goal is to forget target outputs selectively while preserving the rest. To this end, we introduce an effective surrogate-based unlearning method that leverages image editing, timestep-aware weighting, and gradient surgery to guide trained diffusion models toward forgetting specific outputs. Experiments on conditional (Stable Diffusion 3) and unconditional (DDPM-CelebA) diffusion models demonstrate that our prompt-free method uniquely unlearns unpromptable outputs, such as faces and culturally inaccurate depictions, with preserved integrity, unlike prompt-based and prompt-free baselines. Our proposed method would serve as a practical hotfix for diffusion model providers to ensure privacy protection and ethical compliance.
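The abstract mentions gradient surgery without detail; the usual recipe (in the PCGrad sense) projects out the conflicting component when two objectives' gradients oppose each other. A sketch on plain vectors, where pairing this with forget/preserve losses is our illustrative assumption:

```python
def gradient_surgery(g_task, g_ref):
    """If g_task conflicts with g_ref (negative dot product), subtract
    from g_task its projection onto g_ref; otherwise return it as-is."""
    dot = sum(a * b for a, b in zip(g_task, g_ref))
    if dot >= 0.0:
        return list(g_task)
    ref_sq = sum(b * b for b in g_ref)
    return [a - (dot / ref_sq) * b for a, b in zip(g_task, g_ref)]


g_forget = [1.0, -2.0]  # gradient of the forgetting objective
g_keep = [1.0, 1.0]     # gradient of the preservation objective
g_adj = gradient_surgery(g_forget, g_keep)
# After surgery the adjusted gradient no longer opposes g_keep.
```

Applying the adjusted gradient lets the forgetting update proceed without pushing the model in a direction that damages what must be preserved.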
[CV-71] AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory
【Quick Read】: This paper addresses the prohibitive computational cost of deploying foundation-model-based monocular depth estimation on edge devices. Existing methods run independent per-frame inference, failing to exploit the substantial computational redundancy between adjacent viewpoints in continuous robot operation. The key solution is AsyncMDE, in which a high-accuracy foundation model cooperates with a lightweight model: the foundation model produces high-quality spatial features in the background, while the lightweight model runs asynchronously in the foreground, fusing cached memory with current observations through complementary fusion and autoregressively updating the memory state, enabling cross-frame feature reuse with bounded accuracy degradation. With only 3.83M parameters, the design reaches 237 FPS on an RTX 4090, recovers 77% of the foundation model's accuracy with a 25x parameter reduction, and achieves 161 FPS on a Jetson AGX Orin with TensorRT, demonstrating feasibility for real-time edge deployment.
Link: https://arxiv.org/abs/2603.10438
Authors: Lianjie Ma, Yuquan Li, Bingzheng Jiang, Ziming Zhong, Han Ding, Lijun Zhu
Institutions: Huazhong University of Science and Technology; The 710 Institute of China State Shipbuilding Corporation Limited
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 5 tables
Abstract:Foundation-model-based monocular depth estimation offers a viable alternative to active sensors for robot perception, yet its computational cost often prohibits deployment on edge platforms. Existing methods perform independent per-frame inference, wasting the substantial computational redundancy between adjacent viewpoints in continuous robot operation. This paper presents AsyncMDE, an asynchronous depth perception system consisting of a foundation model and a lightweight model that amortizes the foundation model’s computational cost over time. The foundation model produces high-quality spatial features in the background, while the lightweight model runs asynchronously in the foreground, fusing cached memory with current observations through complementary fusion, outputting depth estimates, and autoregressively updating the memory. This enables cross-frame feature reuse with bounded accuracy degradation. At a mere 3.83M parameters, it operates at 237 FPS on an RTX 4090, recovering 77% of the accuracy gap to the foundation model while achieving a 25X parameter reduction. Validated across indoor static, dynamic, and synthetic extreme-motion benchmarks, AsyncMDE degrades gracefully between refreshes and achieves 161 FPS on a Jetson AGX Orin with TensorRT, clearly demonstrating its feasibility for real-time edge deployment.
[CV-72] World2Act: Latent Action Post-Training via Skill-Compositional World Models
【Quick Read】: This paper addresses two problems in world-model (WM)-based post-training of Vision-Language-Action (VLA) policies: sensitivity to pixel-level noise and weak long-video generation. Existing methods rely on pixel-space supervision, leaving policies vulnerable to artifacts and hallucination in imperfect WM rollouts; moreover, since most WMs are trained only on fixed-length clips, they struggle with the widely varying execution durations of robotic tasks. The key solution, World2Act, aligns VLA actions directly with the WM's video-dynamics latents via a contrastive matching objective, reducing pixel dependence, and introduces an automatic LLM-based skill-decomposition pipeline producing skill-compositional long-horizon instructions (RoboCasa-Skill and LIBERO-Skill) so the WM remains temporally consistent across task horizons, ultimately improving real-world generalization by 6.7%.
Link: https://arxiv.org/abs/2603.10422
Authors: An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.
[CV-73] ractoRC: A Unified Probabilistic Learning Framework for Joint Tractography Registration and Clustering
【Quick Read】: This paper addresses the long-standing separation of two key tasks in diffusion MRI tractography analysis, tractogram registration and streamline clustering, which leaves complementary information unexploited. The key solution is TractoRC, a unified probabilistic framework that jointly optimizes both tasks within a single scheme, sharing one latent embedding space so each task can draw on the other's geometric consistency: registration learns distributions of anatomical landmarks to align tractograms across subjects, while clustering learns streamline structural prototypes to form geometrically coherent fiber-bundle clusters. A transformation-equivariant self-supervised strategy further ensures the embedding space is geometry-aware and transformation-invariant, yielding significant gains on both tasks.
Link: https://arxiv.org/abs/2603.10418
Authors: Yijie Li, Xi Zhu, Junyi Wang, Ye Wu, Lauren J. O’Donnell, Fan Zhang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures
Abstract:Diffusion MRI tractography enables in vivo reconstruction of white matter (WM) pathways. Two key tasks in tractography analysis include: 1) tractogram registration that aligns streamlines across individuals, and 2) streamline clustering that groups streamlines into compact fiber bundles. Although both tasks share the goal of capturing geometrically similar structures to characterize consistent WM organization, they are typically performed independently. In this work, we propose TractoRC, a unified probabilistic framework that jointly performs tractogram registration and streamline clustering within a single optimization scheme, enabling the two tasks to leverage complementary information. TractoRC learns a latent embedding space for streamline points, which serves as a shared representation for both tasks. Within this space, both tasks are formulated as probabilistic inference over structural representations: registration learns the distribution of anatomical landmarks as probabilistic keypoints to align tractograms across subjects, and clustering learns streamline structural prototypes that capture geometric similarity to form coherent streamline clusters. To support effective learning of this shared space, we introduce a transformation-equivariant self-supervised strategy to learn geometry-aware and transformation-invariant embeddings. Experiments demonstrate that jointly optimizing registration and clustering significantly improves performance in both tasks over state-of-the-art methods that treat them independently. Code will be made publicly available at this https URL .
[CV-74] Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising
【Quick Read】: This paper addresses the difficulty of simultaneously achieving inter-frame temporal consistency and intra-frame spatial specificity in self-supervised video denoising. Existing Blind-Spot Network (BSN) methods enforce noise independence by masking the center pixel, which prevents the use of spatial evidence for texture recovery and thereby severs spatiotemporal correlations and causes texture loss. The key solution, Frames2Residual (F2R), decouples self-supervised training into two stages: Stage 1 learns inter-frame temporal consistency with a frame-wise blind strategy, producing a temporally consistent anchor; Stage 2 uses this anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability. This explicit spatiotemporal decoupling lets F2R clearly outperform existing self-supervised methods on both sRGB and raw video benchmarks.
Link: https://arxiv.org/abs/2603.10417
Authors: Mingjie Ji, Zhan Shi, Kailai Zhou, Zixuan Fu, Xun Cao
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Self-supervised video denoising methods typically extend image-based frameworks into the temporal dimension, yet they often struggle to integrate inter-frame temporal consistency with intra-frame spatial specificity. Existing Video Blind-Spot Networks (BSNs) enforce noise independence by masking the center pixel; this constraint prevents the use of spatial evidence for texture recovery, thereby severing spatiotemporal correlations and causing texture loss. To address this, we propose Frames2Residual (F2R), a spatiotemporal decoupling framework that explicitly divides self-supervised training into two distinct stages: blind temporal consistency modeling and non-blind spatial texture recovery. In Stage 1, a blind temporal estimator learns inter-frame consistency using a frame-wise blind strategy, producing a temporally consistent anchor. In Stage 2, a non-blind spatial refiner leverages this anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability. Extensive experiments demonstrate that our decoupling strategy allows F2R to outperform existing self-supervised methods on both sRGB and raw video benchmarks.
[CV-75] Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics
【Quick Read】: This paper addresses the "trilemma" of video generation: balancing high visual quality, rigorous physical consistency, and precise controllability. Existing models hold this balance in simple scenes, but it breaks down easily in complex ones (e.g., collisions or dense traffic). The key solution is the Motion Forcing framework, which explicitly decouples physical reasoning from visual synthesis via a hierarchical "Point-Shape-Appearance" paradigm: complex dynamics are first modeled as sparse geometric anchors (Point), then expanded into dynamic depth maps that explicitly resolve 3D geometry (Shape), and finally rendered with high-fidelity textures (Appearance). A Masked Point Recovery strategy further randomly masks input anchors during training and enforces reconstruction of complete dynamic depth, compelling the model to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories, thereby strengthening robust physical understanding.
Link: https://arxiv.org/abs/2603.10408
Authors: Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Ying-cong Chen
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Abstract:The ultimate goal of video generation is to satisfy a fundamental trilemma: achieving high visual quality, maintaining rigorous physical consistency, and enabling precise controllability. While recent models can maintain this balance in simple, isolated scenarios, we observe that this equilibrium is fragile and often breaks down as scene complexity increases (e.g., involving collisions or dense traffic). To address this, we introduce Motion Forcing, a framework designed to stabilize this trilemma even in complex generative tasks. Our key insight is to explicitly decouple physical reasoning from visual synthesis via a hierarchical "Point-Shape-Appearance" paradigm. This approach decomposes generation into verifiable stages: modeling complex dynamics as sparse geometric anchors (Point), expanding them into dynamic depth maps that explicitly resolve 3D geometry (Shape), and finally rendering high-fidelity textures (Appearance). Furthermore, to foster robust physical understanding, we employ a Masked Point Recovery strategy. By randomly masking input anchors during training and enforcing the reconstruction of complete dynamic depth, the model is compelled to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories. Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines, maintaining trilemma stability across complex scenes. Evaluations on physics and robotics further confirm our framework’s generality.
[CV-76] Multi-Person Pose Estimation Evaluation Using Optimal Transportation and Improved Pose Matching
【Quick Read】: This paper addresses the problem that existing multi-person pose estimation metrics disregard low-confidence false-positive detections: because high-confidence true positives dominate, such metrics can yield inflated scores and fail to fairly measure the trade-off between true positives and false positives. The key solution is OCpose (Optimal Correction Cost for pose), a new metric grounded in optimal transportation that formulates the matching between detected and annotated poses as an optimal transport problem and evaluates all detections equally regardless of confidence, enabling fairer assessment; per-pose confidence scores are used only to improve the reliability of matching scores rather than for ranking, offering a perspective distinct from conventional confidence-ranking-based metrics.
Link: https://arxiv.org/abs/2603.10398
Authors: Takato Moriki, Hiromu Taketsugu, Norimichi Ukita
Institutions: Toyota Technological Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 figures. Accepted at MVA 2025
Abstract:In Multi-Person Pose Estimation, many metrics place importance on the ranking of pose detection confidence scores. Current metrics tend to disregard false-positive poses with low confidence, focusing primarily on the larger number of high-confidence poses. Consequently, these metrics may yield high scores even when many false-positive poses with low confidence are detected. For a fair evaluation that takes into account the tradeoff between true-positive and false-positive poses, this paper proposes Optimal Correction Cost for pose (OCpose), which evaluates detected poses against pose annotations as an optimal transportation problem. For a fair tradeoff between true-positive and false-positive poses, OCpose evaluates all detected poses equally regardless of their confidence scores. On the other hand, OCpose utilizes the confidence score of each pose to improve the reliability of matching scores between the estimated poses and pose annotations. As a result, OCpose provides a different perspective of assessment than other confidence-ranking-based metrics.
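The matching at the heart of OCpose can be illustrated with a minimal optimal one-to-one assignment over keypoint distances. This is only a sketch under stated assumptions: the actual metric solves a richer optimal transportation problem with correction costs and confidence-aware matching, while the mean-keypoint-distance cost and the brute-force assignment solver below are illustrative choices.

```python
from itertools import permutations
import numpy as np

def match_poses(detected, annotated):
    """Brute-force optimal one-to-one matching between equal-size pose sets.

    detected, annotated: (N, K, 2) arrays of K 2D keypoints per pose.
    Cost = mean Euclidean keypoint distance between two poses. OCpose
    itself treats every detection equally regardless of confidence; this
    sketch only shows the assignment step of such an evaluation.
    """
    cost = np.linalg.norm(detected[:, None] - annotated[None, :], axis=-1).mean(-1)
    n = cost.shape[0]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return [(i, best[i]) for i in range(n)], float(sum(cost[i, best[i]] for i in range(n)))

# Two detections vs. two annotations with an obvious best pairing.
det = np.array([[[0, 0], [1, 1]], [[10, 10], [11, 11]]], dtype=float)
ann = np.array([[[10, 10], [11, 11]], [[0, 0], [1, 1]]], dtype=float)
pairs, total = match_poses(det, ann)
print(pairs, total)  # [(0, 1), (1, 0)] 0.0
```

A production evaluator would replace the brute-force search with a proper assignment or transport solver for larger pose sets.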
[CV-77] Variance-Aware Adaptive Weighting for Diffusion Model Training
【Quick Read】: This paper tackles the imbalanced training dynamics of diffusion models across noise levels, which leads to inefficient optimization and unstable learning behavior. The key to the solution is to analyze the problem from the perspective of loss variance across log-SNR levels and to propose a variance-aware adaptive weighting strategy that dynamically adjusts the training weight of each noise level based on the observed variance distribution, making optimization more balanced across noise levels and thereby improving generative performance while stabilizing training.
Link: https://arxiv.org/abs/2603.10391
Authors: Nanlong Sun, Lei Shi
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures, 1 table
Abstract:Diffusion models have recently achieved remarkable success in generative modeling, yet their training dynamics across different noise levels remain highly imbalanced, which can lead to inefficient optimization and unstable learning behavior. In this work, we investigate this imbalance from the perspective of loss variance across log-SNR levels and propose a variance-aware adaptive weighting strategy to address it. The proposed approach dynamically adjusts training weights based on the observed variance distribution, encouraging a more balanced optimization process across noise levels. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate that the proposed method consistently improves generative performance over standard training schemes, achieving lower Fréchet Inception Distance (FID) while also reducing performance variance across random seeds. Additional analysis, including loss-log-SNR visualization, variance heatmaps, and ablation studies, further reveal that the adaptive weighting effectively stabilizes training dynamics. These results highlight the potential of variance-aware training strategies for improving diffusion model optimization.
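The core idea, reweighting noise levels by their observed loss variance, can be sketched as below. The abstract does not give the exact weighting rule, so this toy version makes one plausible choice (weights proportional to per-bin variance, normalized to mean one); whether high-variance log-SNR bins should be up- or down-weighted is a design decision of the actual method.

```python
import numpy as np

def variance_aware_weights(losses, snr_bins, n_bins, eps=1e-8):
    """Toy per-noise-level weighting from observed loss variance.

    losses:   per-sample training losses.
    snr_bins: the log-SNR bin index of each sample.
    Returns one weight per bin, normalized to average 1. The paper's
    exact weighting rule may differ; this only illustrates the mechanism.
    """
    var = np.array([
        losses[snr_bins == b].var() if np.any(snr_bins == b) else 0.0
        for b in range(n_bins)
    ])
    w = var + eps
    return w * n_bins / w.sum()  # normalize so weights average to 1

rng = np.random.default_rng(0)
bins = rng.integers(0, 4, size=1000)
loss = rng.normal(1.0, 0.1 + 0.3 * bins, size=1000)  # higher bins are noisier
w = variance_aware_weights(loss, bins, n_bins=4)
print(np.round(w, 2))  # weights increase with bin variance
```

Each training sample's loss would then be scaled by `w[snr_bin]` before backpropagation.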
[CV-78] GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning
【Quick Read】: This paper addresses the limited spatial understanding of Multimodal Large Language Models (MLLMs), in particular the spatial-reasoning bottleneck caused by insufficient use of geometric information. Existing methods compensate by rigidly injecting geometric signals into every input, ignoring whether geometry is actually needed and adding computational overhead. The key to the solution is a framework aware of its own perceptual insufficiency: first, an independent geometry input channel is added to the model architecture and aligned through training so that geometric features can be used effectively; second, a dedicated spatial-aware supervised fine-tuning dataset activates the model's latent spatial awareness, enabling it to autonomously decide when geometric information should be engaged for reasoning. Experiments show significant gains on multiple spatial reasoning benchmarks without compromising 2D visual reasoning, offering a path toward more robust, efficient, and self-aware multimodal intelligence.
Link: https://arxiv.org/abs/2603.10370
Authors: Ruiheng Liu, Haihong Hao, Mingfei Han, Xin Gu, Kecheng Zhang, Changlin Li, Xiaojun Chang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This serves to activate the model’s latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.
[CV-79] Geometric Autoencoder for Diffusion Models
【Quick Read】: This paper addresses the difficulty of jointly achieving semantic discriminability, reconstruction fidelity, and latent compactness in current latent diffusion models, whose largely heuristic designs limit generative efficiency and stability. The key to the solution is the Geometric Autoencoder (GAE): by analyzing various alignment paradigms, it constructs an optimized low-dimensional semantic supervision target from a Vision Foundation Model (VFM) to guide autoencoder training; it replaces the KL divergence of standard VAEs with latent normalization, yielding a more stable latent manifold tailored to diffusion learning; and it adds a dynamic noise sampling mechanism to improve reconstruction robustness under high-intensity noise. These improvements let GAE achieve generative quality on the ImageNet-1K 256×256 benchmark that clearly surpasses existing methods, with a markedly better balance among compression, semantic depth, and reconstruction stability.
Link: https://arxiv.org/abs/2603.10365
Authors: Hangyu Liu, Jianyong Wang, Yutao Sun
Institutions: Shanghai Innovation Institute; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and models are publicly available at this https URL
Abstract:Latent diffusion models have established a new state-of-the-art in high-resolution visual generation. Integrating Vision Foundation Model priors improves generative efficiency, yet existing latent designs remain largely heuristic. These approaches often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to provide guidance for the autoencoder. Furthermore, we leverage latent normalization that replaces the restrictive KL-divergence of standard VAEs, enabling a more stable latent manifold specifically optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K 256×256 benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE establishes a superior equilibrium between compression, semantic depth and robust reconstruction stability. These results validate our design considerations, offering a promising paradigm for latent diffusion modeling. Code and models are publicly available at this https URL.
[CV-80] One Token Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
【Quick Read】: This paper targets hallucination in Multimodal Large Language Models (MLLMs), especially object hallucination caused by vision-language imbalance. Existing training-free methods adopt separate strategies, either enhancing visual signals or suppressing text inertia, but both have clear limits: merely strengthening visual signals fails against strong language priors, while suppressing language may introduce image-irrelevant noise; naively combining the two is also ineffective. The key to the solution is a unified framework centered on the vision token, with two cooperating mechanisms: Synergistic Visual Calibration (SVC), which strengthens visual representations by incorporating vision tokens from augmented images, and Causal Representation Calibration (CRC), which removes some vision tokens to create latent-space negative samples that correct internal model biases. Working together in latent space, they restore the vision-language balance, improving POPE accuracy by an average of 2% across multiple benchmarks with only a 1.06x inference latency overhead.
Link: https://arxiv.org/abs/2603.10360
Authors: Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi
Institutions: Nanjing University; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.
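The abstract describes two latent-space signals: augmented-image tokens that reinforce vision (SVC) and pruned-token predictions that expose language-only bias (CRC). One common way such signals combine at decoding time is contrastive logit arithmetic, sketched below. The blend coefficients `alpha`/`beta` and the exact formula are hypothetical illustrations, not the paper's equations.

```python
import numpy as np

def calibrated_logits(logits, logits_aug, logits_pruned, alpha=0.5, beta=1.0):
    """Toy contrastive calibration over next-token logits.

    logits:        base prediction from the full image
    logits_aug:    prediction using augmented-image vision tokens (SVC-like)
    logits_pruned: prediction with vision tokens removed (CRC-like negative)
    The blend reinforces visual evidence and subtracts the vision-free
    tendency; alpha/beta are hypothetical knobs, not the paper's formula.
    """
    fused = logits + alpha * logits_aug               # strengthen visual signal
    return (1 + beta) * fused - beta * logits_pruned  # push away from text-only bias

base = np.array([1.0, 1.2, 0.5])     # language prior makes token 1 win alone
aug = np.array([1.4, 0.6, 0.4])      # augmented view favors token 0
pruned = np.array([0.2, 1.5, 0.3])   # vision-free model hallucinates token 1
out = calibrated_logits(base, aug, pruned)
print(out.argmax())  # 0 (the base logits alone would pick 1)
```

The point of the sketch is the sign structure: visual evidence is added, while the vision-free tendency is subtracted as a negative.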
[CV-81] StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
【Quick Read】: This paper addresses three core problems of diffusion-based image style transfer: (1) the semantic gap, where a style reference may lack appropriate semantic content and stylization becomes uncontrollable; (2) reliance on extra constraints (e.g., semantic masks), which limits applicability; and (3) rigid feature associations lacking adaptive global-local alignment, making it hard to balance fine-grained stylization with overall content fidelity. The key to the solution is StyleGallery, a training-free and semantic-aware framework with a three-stage design: semantic region segmentation via adaptive clustering on latent diffusion features, requiring no extra inputs; precise matching of clustered regions via block filtering; and style transfer optimization combining energy-function-guided diffusion sampling with a regional style loss. The framework exploits arbitrary style references effectively and clearly outperforms existing methods in content structure preservation, regional stylization accuracy, interpretability, and personalized customization, particularly with multiple style references.
Link: https://arxiv.org/abs/2603.10354
Authors: Boyu He (1), Yunfan Ye (2), Chang Liu (1), Weishang Wu (1), Fang Liu (2), Zhiping Cai (1) ((1) College of Computer Science and Technology, National University of Defense Technology; (2) School of Design, Hunan University)
Institutions: National University of Defense Technology; Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 23 figures, Conference on Computer Vision and Pattern Recognition 2026
Abstract:Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.
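StyleGallery's first stage, segmenting semantic regions by clustering latent diffusion features, can be caricatured with plain k-means over per-pixel feature vectors. This is a hedged stand-in: the paper's clustering is adaptive and runs on real diffusion features, whereas the features, initialization, and k below are toy assumptions.

```python
import numpy as np

def kmeans_regions(features, k=3, iters=20):
    """Toy stand-in for clustering per-pixel features into k semantic
    regions with plain k-means (the paper's clustering is adaptive and
    operates on latent diffusion features).

    features: (H*W, D) array; returns one region label per pixel.
    """
    centers = features[:: max(len(features) // k, 1)][:k].copy()  # spread init
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(0)
    return labels

# Three well-separated feature blobs -> three recovered regions.
rng = np.random.default_rng(1)
blobs = np.concatenate([np.full((50, 4), c) for c in (0.0, 5.0, 10.0)])
labels = kmeans_regions(blobs + 0.01 * rng.normal(size=blobs.shape))
print(len(set(labels.tolist())))  # 3
```

Each recovered region would then be matched against regions of the style reference in the framework's second stage.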
[CV-82] EmoStory: Emotion-Aware Story Generation
【Quick Read】: This paper addresses the lack of emotional expressiveness in existing visual story generation: current models produce coherent and expressive stories but remain largely emotion-neutral, failing to weave emotions into the narrative structure and visual presentation and thus limiting audience engagement. The key to the solution is EmoStory, a two-stage framework: in the planning stage, an emotion agent and a writer agent cooperate to turn target emotions into coherent, emotion-directed story prompts; in the generation stage, region-aware visual composition preserves subject consistency while injecting emotion-related visual elements, enabling emotion-driven visual storytelling.
Link: https://arxiv.org/abs/2603.10349
Authors: Jingyuan Yang, Rucong Chen, Hui Huang
Institutions: Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Story generation aims to produce image sequences that depict coherent narratives while maintaining subject consistency across frames. Although existing methods have excelled in producing coherent and expressive stories, they remain largely emotion-neutral, focusing on what subject appears in a story while overlooking how emotions shape narrative interpretation and visual presentation. As stories are intended to engage audiences emotionally, we introduce emotion-aware story generation, a new task that aims to generate subject-consistent visual stories with explicit emotional directions. This task is challenging due to the abstract nature of emotions, which must be grounded in concrete visual elements and consistently expressed across a narrative through visual composition. To address these challenges, we propose EmoStory, a two-stage framework that integrates agent-based story planning and region-aware story generation. The planning stage transforms target emotions into coherent story prompts with emotion agent and writer agent, while the generation stage preserves subject consistency and injects emotion-related elements through region-aware composition. We evaluate EmoStory on a newly constructed dataset covering 25 subjects and 600 emotional stories. Extensive quantitative and qualitative results, along with user studies, show that EmoStory outperforms state-of-the-art story generation methods in emotion accuracy, prompt alignment, and subject consistency.
[CV-83] Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation
【Quick Read】: This paper addresses the "Precision-Reasoning Gap" of Vision-Language-Action (VLA) models in cluttered environments, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. The key to the solution is Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference-time visual distillation framework: it parses instructions into safe and distractor sets and applies a two-layer target refinement, combining cross-validation with spatial disambiguation, to explicitly penalize false positives and isolate genuine manipulation targets; it then uses Fourier-based inpainting to produce a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception, significantly improving manipulation robustness under dense distractors.
Link: https://arxiv.org/abs/2603.10340
Authors: Sangmim Song, Sarath Kodagoda, Marc Carmichael, Karthick Thiyagarajan
Institutions: University of Technology Sydney; Western Sydney University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: 7 pages, 4 figures, 3 tables
Abstract:Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process, combining cross-validation and spatial disambiguation, to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline's 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.
[CV-84] Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models
【Quick Read】: This paper addresses the wasted computation and degraded accuracy caused by the unpredictable Chain-of-Thought (CoT) length of Large Multimodal Models (LMMs), which manifests as memory fragmentation and as under-thinking and over-thinking. The key to the solution is Fuel Gauge, which for the first time predicts CoT length ahead of time by extracting a hidden signal from the CoT process, a latent parameter representing the amount of "fuel" available for reasoning. This markedly improves KV cache allocation and makes the reasoning process more controllable, generalizing across text-only, image-text, and video question-answering tasks; for example, on the GPQA-Diamond benchmark it reduces CoT length prediction error by more than 50% while cutting memory allocation frequency by 13.37x.
Link: https://arxiv.org/abs/2603.10335
Authors: Yuedong Yang, Xiwen Wei, Mustafa Munir, Radu Marculescu
Institutions: The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). We observe empirically that the CoT process follows a very simple form, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of "fuel" available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method which extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility of Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of our Fuel Gauge. For example, on the GPQA-Diamond benchmark, our Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37x reduction in the memory allocation frequency.
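A minimal stand-in for the idea of predicting CoT length ahead of time is a linear probe that regresses length from a hidden feature vector captured early in generation, which then sizes the KV cache with a safety margin. Everything below (features, lengths, margin) is synthetic and hypothetical, not Fuel Gauge's actual signal extraction.

```python
import numpy as np

# Hypothetical stand-in: regress CoT length from a per-prompt hidden
# feature vector, then pre-size the KV cache with headroom. Features
# and lengths are synthetic, not real model states.
rng = np.random.default_rng(1)
H = rng.normal(size=(200, 8))                   # per-prompt hidden features
true_w = rng.normal(size=8)
length = H @ true_w * 50 + 400                  # synthetic CoT token counts

X = np.hstack([H, np.ones((200, 1))])           # add an intercept column
W, *_ = np.linalg.lstsq(X, length, rcond=None)  # fit the linear probe

def predict_kv_budget(h, margin=1.2):
    """Pre-allocate KV cache slots = predicted CoT length x safety margin."""
    pred = float(np.append(h, 1.0) @ W)
    return int(max(pred, 0.0) * margin)

print(predict_kv_budget(H[0]) >= int(length[0]))  # True
```

Allocating the predicted budget once, instead of growing the cache as the CoT unfolds, is what reduces the memory allocation frequency described in the abstract.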
[CV-85] The Orthogonal Vulnerabilities of Generative AI Watermarks: A Comparative Empirical Benchmark of Spatial and Latent Provenance
【Quick Read】: This paper addresses the fragility of current invisible watermarking under modern generative editing tools amid the rapid proliferation of generative AI, a serious challenge for verifying digital media authenticity. The key to the solution is an "Adversarial Evasion Region" (AER) framework, validated with an automated attack simulation engine that systematically compares two representative watermarking paradigms: RivaGAN (spatial domain) and Tree-Ring (latent domain). The study reveals mathematically orthogonal vulnerabilities: spatial watermarks degrade under pixel-rewriting attacks (e.g., a 67.47% AER evasion rate under Img2Img translation), while latent watermarks are highly sensitive to geometric misalignment (e.g., a 43.20% AER evasion rate under static cropping). By proving that single-domain watermarking cannot withstand modern adversarial toolsets, the work establishes the need for future multi-domain cryptographic architectures.
Link: https://arxiv.org/abs/2603.10323
Authors: Jesse Yu, Nicholas Wei
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:As open-weights generative AI rapidly proliferates, the ability to synthesize hyper-realistic media has introduced profound challenges to digital trust. Automated disinformation and AI-generated imagery have made robust digital provenance a critical cybersecurity imperative. Currently, state-of-the-art invisible watermarks operate within one of two primary mathematical manifolds: the spatial domain (post-generation pixel embedding) or the latent domain (pre-generation frequency embedding). While existing literature frequently evaluates these models against isolated, classical distortions, there is a critical lack of rigorous, comparative benchmarking against modern generative AI editing tools. In this study, we empirically evaluate two leading representative paradigms, RivaGAN (Spatial) and Tree-Ring (Latent), utilizing an automated Attack Simulation Engine across 30 intensity intervals of geometric and generative perturbations. We formalize an "Adversarial Evasion Region" (AER) framework to measure cryptographic degradation against semantic visual retention (OpenCLIP 70.0). Our statistical analysis (n = 100 per interval, MOE = ±3.92%) reveals that these domains possess mutually exclusive, mathematically orthogonal vulnerabilities. Spatial watermarks experience severe cryptographic degradation under algorithmic pixel-rewriting (exhibiting a 67.47% AER evasion rate under Img2Img translation), whereas latent watermarks exhibit profound fragility against geometric misalignment (yielding a 43.20% AER evasion rate under static cropping). By proving that single-domain watermarking is fundamentally insufficient against modern adversarial toolsets, this research exposes a systemic vulnerability in current digital provenance standards and establishes the foundational exigence for future multi-domain cryptographic architectures.
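The Adversarial Evasion Region can be operationalized as the set of attacked images where the watermark is no longer detected while the semantics remain retained; the evasion rate is then the fraction of samples inside that region at a given attack intensity. The detection/semantic thresholds and the synthetic score distributions below are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

def evasion_rate(detect_score, semantic_sim, det_thresh=0.75, sem_thresh=0.70):
    """Fraction of attacked images inside the Adversarial Evasion Region:
    watermark no longer detected AND semantics still retained.
    Thresholds here are illustrative, not the paper's calibration."""
    evaded = (detect_score < det_thresh) & (semantic_sim >= sem_thresh)
    return float(evaded.mean())

# Synthetic sweep: detection degrades faster than semantics as the
# attack intensity grows, so the evasion rate climbs with intensity.
rng = np.random.default_rng(0)
for intensity in (0.1, 0.5, 0.9):
    det = np.clip(rng.normal(0.9 - 0.6 * intensity, 0.05, 100), 0, 1)
    sem = np.clip(rng.normal(0.95 - 0.2 * intensity, 0.05, 100), 0, 1)
    print(intensity, round(evasion_rate(det, sem), 2))
```

Sweeping this rate over the 30 intensity intervals mentioned in the abstract would trace out the evasion curve for each watermarking paradigm.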
[CV-86] From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification
【Quick Read】: This paper addresses the limitation of conventional video encoder models in open-instance video classification, where vast and complex intra-class variation defeats models fit to homogeneous distributions, while vision-language models (VLMs) generalize better but have not fully exploited their intrinsic reasoning capability (intuition) for this task. The key to the solution is DeepIntuit, an intrinsic reasoning framework that evolves from imitation to intrinsic reasoning in three stages: a cold-start supervised alignment initializes the reasoning capability; Group Relative Policy Optimization (GRPO) then refines reasoning coherence via reinforcement learning; finally, an intuitive calibration stage trains a classifier on the reasoning traces generated by the refined VLM, ensuring stable knowledge transfer without distribution mismatch. Experiments show the model significantly surpasses methods relying on feature imitation alone in open-instance settings.
Link: https://arxiv.org/abs/2603.10300
Authors: Ke Zhang, Xiangchen Zhao, Yunjie Tian, Jiayu Zheng, Vishal M. Patel, Di Fu
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures
Abstract:Conventional video classification models, acting as effective imitators, excel in scenarios with homogeneous data distributions. However, real-world applications often present an open-instance challenge, where intra-class variations are vast and complex, beyond existing benchmarks. While traditional video encoder models struggle to fit these diverse distributions, vision-language models (VLMs) offer superior generalization but have not fully leveraged their reasoning capabilities (intuition) for such tasks. In this paper, we bridge this gap with an intrinsic reasoning framework that evolves open-instance video classification from imitation to intuition. Our approach, namely DeepIntuit, begins with a cold-start supervised alignment to initialize reasoning capability, followed by refinement using Group Relative Policy Optimization (GRPO) to enhance reasoning coherence through reinforcement learning. Crucially, to translate this reasoning into accurate classification, DeepIntuit then introduces an intuitive calibration stage. In this stage, a classifier is trained on the intrinsic reasoning traces generated by the refined VLM, ensuring stable knowledge transfer without distribution mismatch. Extensive experiments demonstrate that for open-instance video classification, DeepIntuit benefits significantly from transcending simple feature imitation and evolving toward intrinsic reasoning. Our project is available at this https URL.
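The GRPO refinement stage mentioned above rests on group-relative rewards: several reasoning traces are sampled for the same input, scored, and each trace's advantage is its reward standardized within the group, with no learned value baseline. The scalar rewards below are toy values; how traces are actually scored is the method's own design.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize one sampled group of rewards so
    above-average reasoning traces are reinforced and below-average ones
    penalized, without a learned value baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One input, a group of 4 sampled reasoning traces with toy rewards.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # [ 1. -1.  1. -1.]
```

These advantages would then weight the policy-gradient update on each trace's tokens.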
[CV-87] Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework
【Quick Read】: This paper addresses two core challenges in directly integrating score-based generative models into optimization algorithms such as ADMM: (i) the mismatch between the noisy data manifolds on which score functions are trained and the geometry of ADMM iterates, which is affected by the dual variables; and (ii) the lack of convergence guarantees when ADMM is equipped with score-based denoisers. The key to the solution is a new ADMM-PnP framework that embeds a three-stage denoiser: auto-correction (AC) via additive Gaussian noise, directional correction (DC) via conditional Langevin dynamics, and finally score-based denoising. This design mitigates the manifold mismatch, and two convergence results are established: under proper denoiser parameters each ADMM iteration is a weakly nonexpansive operator, guaranteeing high-probability fixed-point convergence with a constant step size; under more relaxed conditions, the AC-DC denoiser is a bounded denoiser, enabling convergence under an adaptive step size schedule.
Link: https://arxiv.org/abs/2603.10281
Authors: Rajesh Shrestha, Xiao Fu
Institutions: Oregon State University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While score-based generative models have emerged as powerful priors for solving inverse problems, directly integrating them into optimization algorithms such as ADMM remains nontrivial. Two central challenges arise: i) the mismatch between the noisy data manifolds used to train the score functions and the geometry of ADMM iterates, especially due to the influence of dual variables, and ii) the lack of convergence understanding when ADMM is equipped with score-based denoisers. To address the manifold mismatch issue, we propose ADMM plug-and-play (ADMM-PnP) with the AC-DC denoiser, a new framework that embeds a three-stage denoiser into ADMM: (1) auto-correction (AC) via additive Gaussian noise, (2) directional correction (DC) using conditional Langevin dynamics, and (3) score-based denoising. In terms of convergence, we establish two results: first, under proper denoiser parameters, each ADMM iteration is a weakly nonexpansive operator, ensuring high-probability fixed-point ball convergence using a constant step size; second, under more relaxed conditions, the AC-DC denoiser is a bounded denoiser, which leads to convergence under an adaptive step size schedule. Experiments on a range of inverse problems demonstrate that our method consistently improves solution quality over a variety of baselines.
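The ADMM plug-and-play structure can be sketched generically: alternate a data-fidelity least-squares step, a plugged-in denoiser in place of the proximal step, and a dual update. In the sketch below, a simple soft-threshold stands in for the paper's three-stage AC-DC score denoiser, and the toy inverse problem is an assumption for illustration.

```python
import numpy as np

def admm_pnp(y, A, denoise, rho=1.0, iters=50):
    """Generic ADMM plug-and-play loop for y = A x + noise.

    x-update: least-squares prox (data fidelity); z-update: plugged-in
    denoiser standing in for a learned (e.g., score-based) prior;
    u: scaled dual variable.
    """
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    AtA, Aty = A.T @ A, A.T @ y
    M = np.linalg.inv(AtA + rho * np.eye(n))
    for _ in range(iters):
        x = M @ (Aty + rho * (z - u))   # data-fidelity step
        z = denoise(x + u)              # prior step (PnP denoiser)
        u = u + x - z                   # dual ascent
    return x

# Toy problem: identity forward model, soft-threshold "denoiser".
rng = np.random.default_rng(0)
x_true = np.array([0.0, 3.0, 0.0, -2.0])
A = np.eye(4)
y = A @ x_true + 0.01 * rng.normal(size=4)
soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.05, 0.0)
x_hat = admm_pnp(y, A, soft)
print(np.round(x_hat, 1))
```

Swapping `soft` for a learned denoiser is exactly the "plug-and-play" substitution; the paper's contribution is making that substitution well behaved when the denoiser is score-based.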
[CV-88] A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR
【Quick Read】: This paper addresses the challenges of Bangla license plate recognition (LPR), which stem from its complicated character scheme and uneven layouts. The key to the solution is a two-stage adaptive training strategy built on the YOLOv8 architecture to improve plate localization, together with a VisionEncoderDecoder framework that casts text recognition as a sequence generation task; the ViT + BanglaBERT encoder-decoder combination achieves a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068, significantly improving accuracy and robustness. The system also performs consistently on an external test set with different environments and lighting, demonstrating practicality under complex lighting, noise, and varied plate styles for intelligent transportation applications such as automated law enforcement and access control.
Link: https://arxiv.org/abs/2603.10267
Authors: Nayeb Hasin, Md. Arafath Rahman Nishat, Mainul Islam, Khandakar Shakib Al Hasan, Asif Newaz
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2026 IEEE International Conference on AI and Data Analytics (ICAD 2026). Final version will appear in IEEE Xplore
Abstract:An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. The text recognition problem is phrased as a sequence generation problem with a VisionEncoderDecoder architecture, with a combination of encoder-decoders evaluated. It was demonstrated that the ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and Word Error Rate of 0.1068. The proposed system also shows a consistent performance when tested on an external dataset that has been curated for this study purpose. The dataset offers completely different environment and lighting conditions compared to the training sample, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
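The Character Error Rate reported above (0.1323) is edit distance divided by reference length; a minimal implementation follows. The sample plate string is a hypothetical Latin-script toy for readability, not Bangla data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate = edit distance / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

# One substituted character in a 10-character toy plate string -> CER 0.1.
print(cer("DHAKA-1234", "DHAKA-1284"))  # 0.1
```

Word Error Rate is computed the same way with words as the edit units instead of characters.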
[CV-89] ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA
【Quick Read】: This paper addresses two problems with existing video personalization methods that model vision and audio separately: poor synchronization between sound and on-screen action, and the inability to steer speaking style and acoustic environment flexibly via text prompts. The core challenge is jointly modeling a subject's appearance and voice in a single generative pass while maintaining fidelity and controllability. The key to the solution is ID-LoRA (Identity-Driven In-Context LoRA), which adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and introduces two mechanisms: negative temporal positions, which place reference tokens in a RoPE region disjoint from generated frames to avoid confusing the two; and identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. The method achieves high-quality cross-modal personalization with only ~3K training samples on a single GPU, clearly outperforming existing baselines.
Link: https://arxiv.org/abs/2603.10256
Authors: Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, Raja Giryes
Institutions: Tel Aviv University
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Abstract:Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject’s appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
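The negative-temporal-position idea can be sketched as plain index bookkeeping: reference tokens take a disjoint negative time range that preserves their internal order, while generated frames keep positions from 0 upward. The gap size below is an assumption; the paper applies this scheme inside RoPE rather than as raw integer positions.

```python
def assign_temporal_positions(n_ref: int, n_gen: int, gap: int = 4):
    """Toy version of negative temporal positions: reference frames
    occupy a disjoint negative range (keeping their own order),
    separated from generated frames [0, n_gen) by a gap. The gap size
    is an illustrative assumption."""
    ref = [-(gap + n_ref) + i for i in range(n_ref)]  # e.g. [-7, -6, -5]
    gen = list(range(n_gen))
    return ref, gen

ref, gen = assign_temporal_positions(n_ref=3, n_gen=5)
print(ref, gen)  # [-7, -6, -5] [0, 1, 2, 3, 4]
```

Because the two ranges never overlap, attention can distinguish reference context from the frames being generated while the reference clip still carries its own temporal structure.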
[CV-90] Joint Imaging-ROI Representation Learning via Cross-View Contrastive Alignment for Brain Disorder Classification
【Quick Read】: This paper addresses the unclear relative contributions and potential complementarity of global imaging volumes versus ROI-graph representations in brain imaging classification; existing fusion approaches are task-specific and cannot systematically evaluate the two representations under consistent training settings. The key to the solution is a unified cross-view contrastive framework that aligns each subject's global imaging embedding and local ROI-graph embedding in a shared latent space via a bidirectional contrastive objective, producing comparable joint representations and enabling systematic evaluation of imaging-only, ROI-only, and joint configurations under the same training protocol. Experiments show joint learning consistently outperforms either branch alone across backbones and datasets, and interpretability analyses reveal that the two representations emphasize distinct yet complementary discriminative patterns, validating explicit fusion of global and ROI-level representations.
Link: https://arxiv.org/abs/2603.10253
Authors: Wei Liang, Lifang He
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Brain imaging classification is commonly approached from two perspectives: modeling the full image volume to capture global anatomical context, or constructing ROI-based graphs to encode localized and topological interactions. Although both representations have demonstrated independent efficacy, their relative contributions and potential complementarity remain insufficiently understood. Existing fusion approaches are typically task-specific and do not enable controlled evaluation of each representation under consistent training settings. To address this gap, we propose a unified cross-view contrastive framework for joint imaging-ROI representation learning. Our method learns subject-level global (imaging) and local (ROI-graph) embeddings and aligns them in a shared latent space using a bidirectional contrastive objective, encouraging representations from the same subject to converge while separating those from different subjects. This alignment produces comparable embeddings suitable for downstream fusion and enables systematic evaluation of imaging-only, ROI-only, and joint configurations within a unified training protocol. Extensive experiments on the ADHD-200 and ABIDE datasets demonstrate that joint learning consistently improves classification performance over either branch alone across multiple backbone choices. Moreover, interpretability analyses reveal that imaging-based and ROI-based branches emphasize distinct yet complementary discriminative patterns, explaining the observed performance gains. These findings provide principled evidence that explicitly integrating global volumetric and ROI-level representations is a promising direction for neuroimaging-based brain disorder classification. The source code is available at this https URL.
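The bidirectional contrastive objective can be sketched as a symmetric InfoNCE loss: row i of the imaging and ROI embedding matrices is the same subject (positive pair) and all other subjects serve as negatives, in both directions. The temperature and the exact loss form below are assumptions, not the paper's equations.

```python
import numpy as np

def bidirectional_info_nce(img_emb, roi_emb, tau=0.1):
    """Symmetric contrastive alignment of per-subject imaging and
    ROI-graph embeddings. Row i of each matrix is the same subject
    (positive pair); other subjects act as negatives."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    zi, zr = normalize(img_emb), normalize(roi_emb)
    logits = zi @ zr.T / tau                        # (N, N) similarities
    ls = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    loss_i2r = -np.diag(ls).mean()                  # imaging -> ROI direction
    ls_t = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    loss_r2i = -np.diag(ls_t).mean()                # ROI -> imaging direction
    return 0.5 * (loss_i2r + loss_r2i)

# Matched pairs should score a much lower loss than misaligned ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z2 = z + 0.01 * rng.normal(size=(8, 16))
aligned = bidirectional_info_nce(z, z2)
shuffled = bidirectional_info_nce(z, z2[::-1])
print(aligned < shuffled)  # True
```

Minimizing this loss pulls each subject's two views together in the shared latent space while pushing different subjects apart, which is what makes the imaging and ROI embeddings comparable for downstream fusion.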
[CV-91] One Adapter for All: Towards Unified Representation in Step-Imbalanced Class-Incremental Learning
【速读】:该论文针对类增量学习(Class-incremental learning, CIL)中普遍存在的任务步长不平衡(step imbalance)问题展开研究,即在实际场景中,不同任务包含的类别数量差异显著,导致大任务主导学习过程、小任务引入不稳定更新,而现有方法因假设任务平衡而无法有效应对这一挑战。解决方案的关键在于提出 One-A 框架,其核心机制包括:1)通过非对称子空间对齐保留大任务主导子空间,同时约束小任务低信息量更新;2)设计信息自适应加权策略平衡基础与新适配器的贡献;3)引入方向门控机制沿奇异方向选择性融合更新,维持头部方向稳定性与尾部方向可塑性。该框架以单一适配器实现高效增量学习,在多个基准和步长不平衡流上均表现出高准确率与极低推理开销。
链接: https://arxiv.org/abs/2603.10237
作者: Xiaoyan Zhang,Jiangpeng He
机构: University of Michigan (密歇根大学); Indiana University (印第安纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code is available at this https URL
Abstract:Class-incremental learning (CIL) aims to acquire new classes over time while retaining prior knowledge, yet most setups and methods assume balanced task streams. In practice, the number of classes per task often varies significantly. We refer to this as step imbalance, where large tasks that contain more classes dominate learning and small tasks inject unstable updates. Existing CIL methods assume balanced tasks and therefore treat all tasks uniformly, producing imbalanced updates that degrade overall learning performance. To address this challenge, we propose One-A, a unified and imbalance-aware framework that incrementally merges task updates into a single adapter, maintaining constant inference cost. One-A performs asymmetric subspace alignment to preserve dominant subspaces learned from large tasks while constraining low-information updates within them. An information-adaptive weighting balances the contribution between base and new adapters, and a directional gating mechanism selectively fuses updates along each singular direction, maintaining stability in head directions and plasticity in tail ones. Across multiple benchmarks and step-imbalanced streams, One-A achieves competitive accuracy with significantly low inference overhead, showing that a single, asymmetrically fused adapter can remain both adaptive to dynamic task sizes and efficient at deployment.
[CV-92] Why Does It Look There? Structured Explanations for Image Classification
【速读】:该论文旨在解决深度学习模型因黑箱特性导致的可解释性不足问题,现有可解释人工智能(XAI)方法多提供无结构的显著性图或概念,且常依赖外部模型(如GPT、CLIP)进行描述,难以忠实反映原模型的行为。其解决方案的关键在于提出I2X框架,通过量化训练过程中特定检查点的进展,利用后验XAI方法(如GradCAM)提取原型(prototype),构建从无结构解释到结构化解释的映射,从而系统揭示模型在类内与类间决策中的推理机制,并据此识别不确定原型,通过针对性样本扰动实现微调以提升预测准确性,实现了对模型行为的忠实解释与优化指导的统一。
链接: https://arxiv.org/abs/2603.10234
作者: Jiarui Li,Zixiang Yin,Samuel J Landry,Zhengming Ding,Ramgopal R. Mettu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Deep learning models achieve remarkable predictive performance, yet their black-box nature limits transparency and trustworthiness. Although numerous explainable artificial intelligence (XAI) methods have been proposed, they primarily provide saliency maps or concepts (i.e., unstructured interpretability). Existing approaches often rely on auxiliary models (e.g., GPT, CLIP) to describe model behavior, thereby compromising faithfulness to the original models. We propose Interpretability to Explainability (I2X), a framework that builds structured explanations directly from unstructured interpretability by quantifying progress at selected checkpoints during training using prototypes extracted from post-hoc XAI methods (e.g., GradCAM). I2X answers the question of “why does it look there” by providing a structured view of both intra- and inter-class decision making during training. Experiments on MNIST and CIFAR10 demonstrate effectiveness of I2X to reveal prototype-based inference process of various image classification models. Moreover, we demonstrate that I2X can be used to improve predictions across different model architectures and datasets: we can identify uncertain prototypes recognized by I2X and then use targeted perturbation of samples that allows fine-tuning to ultimately improve accuracy. Thus, I2X not only faithfully explains model behavior but also provides a practical approach to guide optimization toward desired targets.
[CV-93] OilSAM2: Memory-Augmented SAM2 for Scalable SAR Oil Spill Detection
【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像中油污分割任务面临的挑战,包括外观变化剧烈、尺度异质性以及实际监测场景中缺乏时间连续性等问题。现有基于Segment Anything Model (SAM) 的方法通常仅处理单张图像,难以跨场景复用信息;而记忆增强型变体(如SAM2)虽引入时序一致性假设,但在无序SAR图像集合上易产生语义漂移。解决方案的关键在于提出OilSAM2框架,其核心创新是设计了一个分层特征感知的多尺度记忆库,显式建模纹理、结构与语义层级表示,从而实现跨图像的信息有效复用;同时引入结构-语义一致的记忆更新策略,依据语义差异和结构一致性选择性刷新记忆内容,有效缓解记忆漂移问题。实验表明,该方法在两个公开SAR油污数据集上均达到当前最优分割性能,在噪声干扰下仍保持稳定准确的结果。
链接: https://arxiv.org/abs/2603.10231
作者: Shuaiyu Chen,Ming Yin,Peng Ren,Chunbo Luo,Zeyu Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Segmenting oil spills from Synthetic Aperture Radar (SAR) imagery remains challenging due to severe appearance variability, scale heterogeneity, and the absence of temporal continuity in real-world monitoring scenarios. While foundation models such as Segment Anything (SAM) enable prompt-driven segmentation, existing SAM-based approaches operate on single images and cannot effectively reuse information across scenes. Memory-augmented variants (e.g., SAM2) further assume temporal coherence, making them prone to semantic drift when applied to unordered SAR image collections. We propose OilSAM2, a memory-augmented segmentation framework tailored for unordered SAR oil spill monitoring. OilSAM2 introduces a hierarchical feature-aware multi-scale memory bank that explicitly models texture, structure, and semantic-level representations, enabling robust cross-image information reuse. To mitigate memory drift, we further propose a structure-semantic consistent memory update strategy that selectively refreshes memory based on semantic discrepancy and structural consistency. Experiments on two public SAR oil spill datasets demonstrate that OilSAM2 achieves state-of-the-art segmentation performance, delivering stable and accurate results under noisy SAR monitoring scenarios. The source code is available at this https URL.
[CV-94] Robotic Ultrasound Makes CBCT Alive
【速读】:该论文旨在解决术中锥束计算机断层扫描(Cone Beam Computed Tomography, CBCT)因静态特性无法实时监测由呼吸、探头压力及手术操作引起的软组织形变,从而导致导航偏差的问题。其解决方案的关键在于提出一种基于机器人超声的形变感知CBCT更新框架:首先通过校准初始化与LC²(线性组合的线性相关性)引导的刚性配准建立多模态对应关系;进而引入轻量级网络USCorUNet,利用光流引导监督学习形变感知的相关性表示,实现从超声序列中快速准确估计密集形变场;最后将推导的形变空间正则化并映射至CBCT参考图像,生成与形变一致的可视化结果,无需重复辐射暴露即可动态优化CBCT导航精度。
链接: https://arxiv.org/abs/2603.10220
作者: Feng Li,Ziyuan Li,Zhongliang Jiang,Nassir Navab,Yuan Bi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 10 pages, 4 figures
Abstract:Intraoperative Cone Beam Computed Tomography (CBCT) provides a reliable 3D anatomical context essential for interventional planning. However, its static nature fails to provide continuous monitoring of soft-tissue deformations induced by respiration, probe pressure, and surgical manipulation, leading to navigation discrepancies. We propose a deformation-aware CBCT updating framework that leverages robotic ultrasound as a dynamic proxy to infer tissue motion and update static CBCT slices in real time. Starting from calibration-initialized alignment with linear correlation of linear combination (LC2)-based rigid refinement, our method establishes accurate multimodal correspondence. To capture intraoperative dynamics, we introduce the ultrasound correlation UNet (USCorUNet), a lightweight network trained with optical flow-guided supervision to learn deformation-aware correlation representations, enabling accurate, real-time dense deformation field estimation from ultrasound streams. The inferred deformation is spatially regularized and transferred to the CBCT reference to produce deformation-consistent visualizations without repeated radiation exposure. We validate the proposed approach through deformation estimation and ultrasound-guided CBCT updating experiments. Results demonstrate real-time end-to-end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions. The source code is publicly available at this https URL.
[CV-95] An Automated Radiomics Framework for Postoperative Survival Prediction in Colorectal Liver Metastases using Preoperative MRI
【速读】:该论文旨在解决结直肠肝转移(Colorectal Liver Metastasis, CRLM)患者术后生存预测的异质性问题,以避免非获益性手术并指导个体化治疗。其关键解决方案是构建一个全自动的AI框架,结合解剖感知的分割流程与放射组学分析流程:首先利用可提示的基础模型(promptable foundation models)生成伪标签,提出SAMONAI算法实现3D点云分割,从而精准分割肝脏、肿瘤及脾脏;随后将分割结果输入基于自编码器的多实例神经网络SurvAMINN进行特征提取与生存预测,该网络能从右删失数据中联合学习降维与生存建模,突出高风险转移灶。整体框架在227例患者数据上实现了C-index达0.69的生存预测性能,验证了整合分割与放射组学分析在自动化CRLM预后评估中的潜力。
链接: https://arxiv.org/abs/2603.10216
作者: Muhammad Alberb,Jianan Chen,Hossam El-rewaidy,Paul Karanicolas,Arun Seth,Yutaka Amemiya,Anne Martel,Helen Cheung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While colorectal liver metastasis (CRLM) is potentially curable via hepatectomy, patient outcomes remain highly heterogeneous. Postoperative survival prediction is necessary to avoid non-beneficial surgeries and guide personalized therapy. In this study, we present an automated AI-based framework for postoperative CRLM survival prediction using pre- and post-contrast MRI. We performed a retrospective study of 227 CRLM patients who had gadoxetate-enhanced MRI prior to curative-intent hepatectomy between 2013 and 2020. We developed a survival prediction framework comprising an anatomy-aware segmentation pipeline followed by a radiomics pipeline. The segmentation pipeline learns liver, CRLMs, and spleen segmentation from partially-annotated data, leveraging promptable foundation models to generate pseudo-labels. To support this pipeline, we propose SAMONAI, a prompt propagation algorithm that extends Segment Anything Model to 3D point-based segmentation. Predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts per-tumor features and predicts survival using SurvAMINN, an autoencoder-based multiple instance neural network for time-to-event survival prediction. SurvAMINN jointly learns dimensionality reduction and survival prediction from right-censored data, emphasizing high-risk metastases. We compared our framework against established methods and biomarkers using univariate and multivariate Cox regression. Our segmentation pipeline achieves median Dice scores of 0.96 (liver) and 0.93 (spleen), driving a CRLM segmentation Dice score of 0.78 and a detection F1-score of 0.79. Accurate segmentation enables our radiomics pipeline to achieve a survival prediction C-index of 0.69. Our results show the potential of integrating segmentation algorithms with radiomics-based survival analysis to deliver accurate and automated CRLM outcome prediction.
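摘要中以 C-index(一致性指数)0.69 衡量生存预测性能。下面给出右删失数据上 Harrell C-index 的一个极简纯 Python 示意(非论文实现,数据为虚构):

```python
def concordance_index(times, events, risks):
    """Harrell C-index:在可比较样本对中,风险评分更高者更早发生事件的比例。
    events[i]=1 表示观察到事件,0 表示右删失。"""
    concordant, ties, comparable = 0, 0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # 删失样本不能作为"先发生事件"的一方
        for j in range(n):
            if times[i] < times[j]:  # i 在 j 的随访终点之前发生事件
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / comparable

times  = [2, 4, 6, 8]
events = [1, 1, 0, 1]          # 第 3 名患者为删失
risks  = [0.9, 0.7, 0.5, 0.2]  # 风险评分与事件时间完全一致
cindex = concordance_index(times, events, risks)  # -> 1.0
```

风险排序与实际生存时间完全一致时 C-index 为 1,随机排序约为 0.5。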
[CV-96] FusionNet: a frame interpolation network for 4D heart models MICCAI2023
【速读】:该论文旨在解决心脏磁共振成像(Cardiac Magnetic Resonance, CMR)中因扫描时间过长导致患者不适,以及缩短扫描时间后 temporal resolution(时间分辨率)下降从而影响诊断准确性的难题。解决方案的关键在于提出一种名为FusionNet的神经网络模型,该模型能够从短时采集的CMR图像中重建出具有高时间分辨率的四维(4D)心脏运动信息;其核心机制是基于相邻心脏三维形状估计中间帧的3D心腔形态,从而实现对心脏动态过程的高保真插值与重构。实验表明,该方法在Dice系数上超过0.897,显著优于现有技术。
链接: https://arxiv.org/abs/2603.10212
作者: Chujie Chang,Shoko Miyauchi,Ken’ichi Morooka,Ryo Kurazume,Oscar Martinez Mozos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This is the authors’ version. The final authenticated version is available online at this https URL. Published in Medical Image Computing and Computer Assisted Intervention - MICCAI 2023 Workshops
Abstract:Cardiac magnetic resonance (CMR) imaging is widely used to visualise cardiac motion and diagnose heart disease. However, standard CMR imaging requires patients to lie still in a confined space inside a loud machine for 40-60 min, which increases patient discomfort. In addition, shorter scan times decrease either or both the temporal and spatial resolutions of cardiac motion, and thus, the diagnostic accuracy of the procedure. Of these, we focus on reduced temporal resolution and propose a neural network called FusionNet to obtain four-dimensional (4D) cardiac motion with high temporal resolution from CMR images captured in a short period of time. The model estimates intermediate 3D heart shapes based on adjacent shapes. The results of an experimental evaluation of the proposed FusionNet model showed that it achieved a performance of over 0.897 in terms of the Dice coefficient, confirming that it can recover shapes more precisely than existing methods. This code is available at: this https URL
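摘要中以 Dice 系数(定义为 2|A∩B|/(|A|+|B|))评价重建心脏形状与真实形状的重合度。下面是针对二值体素掩膜的极简示意(数据为虚构):

```python
def dice_coefficient(pred, target):
    """对展平的二值掩膜计算 Dice = 2|A ∩ B| / (|A| + |B|)。"""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total else 1.0

# 两个 8 体素掩膜,在 4 个前景体素中有 3 个重合:
pred   = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 1, 1, 0, 0, 0, 0]
score = dice_coefficient(pred, target)  # 2*3 / (4+4) = 0.75
```

Dice 为 1 表示完全重合;摘要中的 0.897 即按此类指标在体素级计算得到。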
[CV-97] Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation
【速读】:该论文旨在解决扩散模型(Diffusion Models)在合成复杂多实例场景时常见的概念遗漏问题(concept omission),即生成结果中缺失文本提示中描述的某些对象或语义内容。现有无训练方法通过重缩放注意力图来尝试缓解此问题,但仅加剧了无结构噪声而未能建立连贯的语义表示。解决方案的关键在于提出Delta-K框架,其核心思想是直接在共享的交叉注意力键空间(cross-attention Key space)中操作:利用视觉-语言模型提取一个编码缺失概念语义特征的差分键信号(ΔK),并在扩散过程的早期语义规划阶段注入该信号;通过动态优化调度机制,将扩散噪声锚定到稳定的结构基元上,同时保留已有概念,从而实现无需空间掩码、额外训练或架构修改的通用性提升。
链接: https://arxiv.org/abs/2603.10210
作者: Zitong Wang,Zijun Shen,Haohao Xu,Zhengjie Luo,Weibin Wu
机构: Sun Yat-sen University (中山大学); Nanjing University (南京大学); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While Diffusion Models excel in text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely exacerbates unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic and plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention Key space. Specifically, with Vision-language model, we extract a differential key ΔK that encodes the semantic signature of missing concepts. This signal is then injected during the early semantic planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.
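Delta-K 的核心操作是在交叉注意力的 Key 空间中注入差分键 ΔK。下面用一个二维玩具例子示意"对键施加偏移即可改变注意力分配"这一机制(纯示意,数值为虚构,并非论文中由视觉-语言模型提取 ΔK 的实现):

```python
import numpy as np

def cross_attention(Q, K, V, delta_K=None):
    """缩放点积交叉注意力;delta_K 为施加在 Key 空间的可选偏移(示意)。"""
    if delta_K is not None:
        K = K + delta_K
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # 数值稳定的 softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

Q = np.array([[1.0, 0.0]])                   # 1 个图像查询
K = np.array([[1.0, 0.0], [0.0, 1.0]])       # 2 个文本 token 的键
V = np.eye(2)
dK = np.array([[0.0, 0.0], [2.0, 0.0]])      # 仅增强第 2 个键(代表"被遗漏的概念")
_, w0 = cross_attention(Q, K, V)             # 注入前:键 0 权重更高
_, w1 = cross_attention(Q, K, V, dK)         # 注入后:键 1 权重反超
```

注入 dK 后被增强的键获得更大的注意力权重,对应论文中"把被遗漏概念锚定到生成过程"的直觉。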
[CV-98] Unbalanced Optimal Transport Dictionary Learning for Unsupervised Hyperspectral Image Clustering
【速读】:该论文旨在解决高光谱图像(hyperspectral images)中标签标注任务繁重且难以通过传统统计方法处理的问题,目标是实现高效的无监督聚类以自动分割场景并快速理解图像内容。其解决方案的关键在于利用非平衡 Wasserstein 中心(unbalanced Wasserstein barycenters)学习数据的低维表示,从而在保留原始光谱特征的同时提升对噪声和异常值的鲁棒性,并在此基础上应用谱聚类(spectral clustering)实现更有效的无监督标签学习。
链接: https://arxiv.org/abs/2603.10132
作者: Joshua Lentz,Nicholas Karris,Alex Cloninger,James M. Murphy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: IEEE WHISPERS 2025
Abstract:Hyperspectral images capture vast amounts of high-dimensional spectral information about a scene, making labeling an intensive task that is resistant to out-of-the-box statistical methods. Unsupervised learning of clusters allows for automated segmentation of the scene, enabling a more rapid understanding of the image. Partitioning the spectral information contained within the data via dictionary learning in Wasserstein space has proven an effective method for unsupervised clustering. However, this approach requires balancing the spectral profiles of the data, blurring the classes, and sacrificing robustness to outliers and noise. In this paper, we suggest improving this approach by utilizing unbalanced Wasserstein barycenters to learn a lower-dimensional representation of the underlying data. The deployment of spectral clustering on the learned representation results in an effective approach for the unsupervised learning of labels.
[CV-99] HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation CVPR2026
【速读】:该论文旨在解决现有车道线检测(Lane Detection)数据集在极端天气和光照条件下的数据稀缺问题,从而导致模型在雨、雪、雾及夜间等复杂场景下性能显著下降甚至失效的问题。其解决方案的关键在于提出HG-Lane框架——一种无需重新标注即可生成高保真度恶劣环境车道场景的图像生成方法,并基于此构建了一个包含3万张图像的基准测试集,有效提升了主流车道线检测网络(如CLRNet)在各类极端条件下的检测性能,尤其在F1@50指标上实现了显著提升,验证了该方案的有效性和泛化能力。
链接: https://arxiv.org/abs/2603.10128
作者: Daichao Zhao,Qiupu Chen,Feng He,Xin Ning,Qiankun Li
机构: Shanghai Jiao Tong University (上海交通大学); School of Artificial Intelligence, Henan University (河南大学人工智能学院); University of Science and Technology of China (中国科学技术大学); AnnLab, Institute of Semiconductors, Chinese Academy of Sciences (中国科学院半导体研究所安实验室); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead to serious safety-critical failures on the road. To address this issue, we propose HG-Lane, a High-fidelity Generation framework for Lane Scenes under adverse weather and lighting conditions without requiring re-annotation. Based on this framework, we further construct a benchmark that includes adverse weather and lighting scenarios, containing 30,000 images. Experimental results demonstrate that our method consistently and significantly improves the performance of existing lane detection networks. For example, using the state-of-the-art CLRNet, the overall mF1 score on our benchmark increases by 20.87 percent. The F1@50 score for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75 percent, 8.63 percent, 38.8 percent, 14.96 percent, 26.84 percent, 21.5 percent, and 12.04 percent, respectively. The code and dataset are available at: this https URL.
[CV-100] 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video CVPR2026
【速读】:该论文旨在解决从单目视频中实现马属动物(如马)的4D重建问题,以支持动物福利研究。传统方法通常需要对整个视频进行运动与外观的联合优化,存在计算效率低且对观测不完整敏感的问题。其解决方案的关键在于将4D重建任务解耦为两个子问题:动态运动重建和静态外观重建。在运动重建方面,提出一种基于时空Transformer的简单而有效的网络结构,并引入后优化阶段以获得平滑且像素对齐的姿态与形状序列;在外观重建方面,设计了一种前馈网络,仅需单张图像即可重建高保真、可驱动的3D Gaussian Avatar。此外,作者构建了大规模合成数据集VarenPoser(用于运动)和VarenTex(用于外观),并通过在合成数据上训练实现了在真实世界数据集APT36K和AiM上的SOTA性能,验证了方法的有效性与泛化能力。
链接: https://arxiv.org/abs/2603.10125
作者: Jin Lyu,Liang An,Pujin Cheng,Yebin Liu,Xiaoying Tang
机构: Southern University of Science and Technology (南方科技大学); Tsinghua University (清华大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2026
Abstract:4D reconstruction of equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction network. Project page: this https URL.
[CV-101] Human Presence Detection via Wi-Fi Range-Filtered Doppler Spectrum on Commodity Laptops
【速读】:该论文旨在解决现有设备中人体存在检测(Human Presence Detection, HPD)方案存在的两大问题:一是依赖外部专用传感器,导致成本高、部署复杂;二是采用基于摄像头的方法,引发严重的隐私担忧。针对这些问题,论文提出了一种基于单站Wi-Fi感知(monostatic Wi-Fi sensing)的低复杂度HPD解决方案,其关键创新在于引入了**范围滤波多普勒谱(Range-Filtered Doppler Spectrum, RF-DS)**技术,通过在信道冲激响应(Channel Impulse Response, CIR)域对目标距离区域进行滤波后再进行多普勒分析,实现仅使用设备内置Wi-Fi网卡即可完成空间选择性与时间窗化的用户存在检测。此外,该方案还设计了自适应多速率处理框架,根据是否检测到运动动态调整信道状态信息(CSI)采样率(从10Hz至100Hz),从而显著降低计算开销并提升稳定性,且无需校准或重新训练即可跨环境和设备部署。
链接: https://arxiv.org/abs/2603.10845
作者: Jessica Sanson,Rahul C. Shah,Valerio Frascolla
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, Conference
Abstract:Human Presence Detection (HPD) is key to enable intelligent power management and security features in everyday devices. In this paper we propose the first HPD solution that leverages monostatic Wi-Fi sensing and detects user position using only the built-in Wi-Fi hardware of a device, with no need for external devices, access points, or additional sensors. In contrast, existing HPD solutions for laptops require external dedicated sensors which add cost and complexity, or rely on camera-based approaches that introduce significant privacy concerns. We herewith introduce the Range-Filtered Doppler Spectrum (RF-DS), a novel Wi-Fi sensing technique for presence estimation that enables both range-selective and temporally windowed detection of user presence. By applying targeted range-area filtering in the Channel Impulse Response (CIR) domain before Doppler analysis, our method focuses processing on task-relevant spatial zones, significantly reducing computational complexity. In addition, the use of temporal windows in the spectrum domain provides greater estimator stability compared to conventional 2D Range-Doppler detectors. Furthermore, we propose an adaptive multi-rate processing framework that dynamically adjusts Channel State Information (CSI) sampling rates-operating at low frame rates (10Hz) during idle periods and high rates (100Hz) only when motion is detected. To our knowledge, this is the first low-complexity solution for occupancy detection using monostatic Wi-Fi sensing on a built-in Wi-Fi network interface controller (NIC) of a commercial off-the-shelf laptop that requires no external network infrastructure or specialized sensors. Our solution can scale across different environments and devices without calibration or retraining.
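RF-DS 的流程可概括为:先在 CIR 域对任务相关的距离区间做门控滤波,再沿慢时间做多普勒(FFT)分析。下面用合成数据给出一个极简 numpy 示意(帧数、距离门位置、多普勒频率等参数均为虚构):

```python
import numpy as np

fs, T, R = 100.0, 128, 32                # CSI 采样率(Hz)、帧数、距离单元数
t = np.arange(T) / fs
rng = np.random.default_rng(0)
cir = 0.05 * (rng.normal(size=(T, R)) + 1j * rng.normal(size=(T, R)))
cir[:, 3] += 1.0                          # 感兴趣距离区间之外的静态反射体
cir[:, 10] += np.exp(2j * np.pi * 12.5 * t)   # 距离单元 10 处 12.5 Hz 多普勒目标

gate = np.zeros(R)
gate[8:13] = 1.0                          # 仅保留任务相关的距离区间(距离门控)
spectrum = np.abs(np.fft.fft(cir * gate, axis=0)).sum(axis=1)
freqs = np.fft.fftfreq(T, d=1.0 / fs)
peak_hz = abs(freqs[np.argmax(spectrum[1:]) + 1])   # 跳过直流分量取谱峰
```

门控后,静态反射体(单元 3)被排除在多普勒分析之外,谱峰出现在 12.5 Hz,对应目标的多普勒频移;这正是"先做距离选择、再做多普勒分析以降低计算量"的思路。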
[CV-102] ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation
【速读】:该论文旨在解决学习型图像压缩中模型性能与计算效率难以兼顾的问题,即现有方法虽在率失真(Rate-Distortion, RD)效率上表现优异,但往往伴随较高的计算开销和有限的并行性。其解决方案的关键在于提出ARCHE框架——一种基于自回归残差压缩(Autoregressive Residual Compression)的端到端图像压缩架构,通过统一层次化、空间和通道级先验(prior)于单一概率建模框架内,同时引入自适应特征重校准(adaptive feature recalibration)与残差精炼(residual refinement)机制,在不依赖循环神经网络或Transformer结构的前提下,显著提升潜在表示质量。该设计实现了高率失真效率(相比Balle et al.基准模型BD-Rate降低约48%)与低延迟(每张图像仅222ms)的协同优化,验证了高效卷积结构可实现精确熵建模,适用于实际部署场景。
链接: https://arxiv.org/abs/2603.10188
作者: Sofia Iliopoulou,Dimitris Ampeliotis,Athanassios Skodras
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 12 figures
Abstract:Recent progress in learning-based image compression has demonstrated that end-to-end optimization can substantially outperform traditional codecs by jointly learning compact latent representations and probabilistic entropy models. However, many existing approaches achieve high rate-distortion efficiency at the expense of increased computational cost and limited parallelism. This paper presents ARCHE - Autoregressive Residual Compression with Hyperprior and Excitation, an end-to-end learned image compression framework that balances modeling accuracy and computational efficiency. The proposed architecture unifies hierarchical, spatial, and channel-based priors within a single probabilistic framework, capturing both global and local dependencies in the latent representation of the image, while employing adaptive feature recalibration and residual refinement to enhance latent representation quality. Without relying on recurrent or transformer-based components, ARCHE attains state-of-the-art rate-distortion efficiency: it reduces the BD-Rate by approximately 48% relative to the commonly used benchmark model of Ballé et al., 30% relative to the channel-wise autoregressive model of Minnen and Singh, and 5% against the VVC Intra codec on the Kodak benchmark dataset. The framework maintains computational efficiency with 95M parameters and 222ms running time per image. Visual comparisons confirm sharper textures and improved color fidelity, particularly at lower bit rates, demonstrating that accurate entropy modeling can be achieved through efficient convolutional designs suitable for practical deployment.
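摘要中反复使用的 BD-Rate(Bjøntegaard delta rate)衡量"同等质量下的平均码率差异百分比":对 log 码率-质量曲线做三次多项式拟合,在重叠质量区间上积分求均值。下面是该标准指标的一个极简实现示意(码率-PSNR 数据为虚构):

```python
import numpy as np

def bd_rate(rq_anchor, rq_test):
    """Bjøntegaard delta rate:同等质量下测试编码器相对锚点的平均码率变化(%)。"""
    r1, q1 = np.asarray(rq_anchor, float).T
    r2, q2 = np.asarray(rq_test, float).T
    p1 = np.polyfit(q1, np.log10(r1), 3)   # log 码率关于质量的三次拟合
    p2 = np.polyfit(q2, np.log10(r2), 3)
    lo, hi = max(q1.min(), q2.min()), min(q1.max(), q2.max())
    int1 = np.polyval(np.polyint(p1), hi) - np.polyval(np.polyint(p1), lo)
    int2 = np.polyval(np.polyint(p2), hi) - np.polyval(np.polyint(p2), lo)
    return (10 ** ((int2 - int1) / (hi - lo)) - 1) * 100.0

anchor = [(1000, 30), (2000, 33), (4000, 36), (8000, 39)]  # (kbps, PSNR)
test   = [(500, 30), (1000, 33), (2000, 36), (4000, 39)]   # 同质量下码率减半
bdr = bd_rate(anchor, test)   # 约 -50(即节省 50% 码率)
```

BD-Rate 为负表示测试编码器在同等质量下更省码率,摘要中的 "-48%" 即此含义。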
人工智能
[AI-0] RCTs Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation
【速读】:该论文旨在解决人类增强研究(human uplift studies)在应用于前沿人工智能(frontier AI)系统时所面临的有效性挑战,尤其是在高风险决策中如何合理解释和使用其证据。核心问题在于,传统因果推断假设(如内部效度、外部效度和建构效度)难以适应前沿AI系统的动态特性,包括快速迭代、基准线漂移、用户能力异质性及现实环境的不稳定性。论文通过访谈16位在生物安全、网络安全、教育和劳动力等领域具有实践经验的专家,识别出这些挑战贯穿于研究生命周期的各个阶段,并提炼出相应的实践解决方案,关键在于明确区分哪些情境下人类增强研究的结果可被可靠使用,以及在何种条件下需谨慎解读或补充其他验证手段,从而为高风险决策提供更严谨的证据基础。
链接: https://arxiv.org/abs/2603.11001
作者: Patricia Paskov,Kevin Wei,Shen Zhou Hong,Dan Bateyko,Xavier Roberts-Gaal,Carson Ezell,Gailius Praninskas,Valerie Chen,Umang Bhatt,Ella Guest
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Human uplift studies - or studies that measure AI effects on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology - are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. While the methods underlying these studies are well-established, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results are used to inform high-stakes decisions. We present findings from interviews with 16 expert practitioners with experience conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between standard causal inference assumptions and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We synthesize these challenges across key stages of the human uplift research lifecycle and map them to practitioner-reported solutions, clarifying both the limits and the appropriate uses of evidence from human uplift studies in high-stakes decision-making.
[AI-1] Artificial Intelligence as a Catalyst for Innovation in Software Engineering
【速读】:该论文旨在解决现代软件需求快速演变与复杂性增加背景下,软件开发团队在敏捷开发过程中面临的持续需求管理困难和质量保障挑战。解决方案的关键在于将人工智能(Artificial Intelligence, AI)技术,特别是机器学习(Machine Learning, ML)和自然语言处理(Natural Language Processing, NLP),深度集成到软件工程实践中,以自动化从需求管理到代码生成与测试的繁琐任务,从而优化现有敏捷流程并提升开发效率、产品质量与创新能力。
链接: https://arxiv.org/abs/2603.10994
作者: Carlos Alberto Fernández-y-Fernández,Jorge R. Aguilar-Cisneros
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution and inherent complexity of modern software requirements demand highly flexible and responsive development methodologies. While Agile frameworks have become the industry standard for prioritizing iteration, collaboration, and adaptability, software development teams continue to face persistent challenges in managing constantly evolving requirements and maintaining product quality under tight deadlines. This article explores the intersection of Artificial Intelligence (AI) and Software Engineering (SE) to analyze how AI serves as a powerful catalyst for enhancing agility and fostering innovation. The research combines a comprehensive review of existing literature with an empirical study, utilizing a survey directed at Software Engineering professionals to assess the perception, adoption, and impact of AI-driven tools. Key findings reveal that the integration of AI (specifically through Machine Learning (ML) and Natural Language Processing (NLP)) facilitates the automation of tedious tasks, from requirement management to code generation and testing. This paper demonstrates that AI not only optimizes current Agile practices but also introduces new capabilities essential for sustaining quality, speed, and innovation in the future landscape of software development.
[AI-2] Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation
【速读】:该论文旨在解决灵巧操作(dexterous manipulation)任务中缺乏通用奖励机制的问题,这类任务通常依赖于特定任务的手工设计先验来引导手与物体的交互,导致方法泛化能力差。解决方案的关键在于提出一种名为**接触覆盖率引导探索(Contact Coverage-Guided Exploration, CCGE)**的新颖探索机制:它将接触状态建模为物体表面点与预定义手部关键点的交集,通过维护一个基于学习到的哈希码离散化物体状态的接触计数器,量化不同手指与物体区域的交互频率;该计数器被用于两个互补目标——一是生成基于计数的接触覆盖率奖励以促进新颖接触模式的探索,二是构建能量基到达奖励以引导智能体向未充分探索的接触区域移动。实验表明,CCGE显著提升了训练效率和成功率,并且所学接触模式可有效迁移到真实机器人系统中。
链接: https://arxiv.org/abs/2603.10971
作者: Zixuan Liu,Ruoyi Qiao,Chenrui Tie,Xuanwei Liu,Yunfan Lou,Chongkai Gao,Zhixuan Xu,Lin Shao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 16 pages
Abstract:Deep Reinforcement learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration method designed for general-purpose dexterous manipulation tasks. CCGE represents contact state as the intersection between object surface points and predefined hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate CCGE on a diverse set of dexterous manipulation tasks, including cluttered object singulation, constrained object retrieval, in-hand reorientation, and bimanual manipulation. Experimental results show that CCGE substantially improves training efficiency and success rates over existing exploration methods, and that the contact patterns learned with CCGE transfer robustly to real-world robotic systems. Project page is this https URL.
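CCGE 的接触覆盖率奖励本质上是"基于计数的探索奖励"在接触模式上的实例化:奖励随(离散化状态, 接触模式)键的访问计数递减。下面是该思想的极简示意(哈希与键的形式为虚构,并非论文中学习得到的哈希码):

```python
import math

class ContactCoverageBonus:
    """极简的基于计数的接触探索奖励:越新颖的(状态, 接触模式)组合奖励越高。"""
    def __init__(self, beta=0.5):
        self.beta = beta       # 奖励尺度系数
        self.counts = {}       # (状态哈希, 接触模式) -> 访问计数

    def bonus(self, state_hash, contact_pattern):
        key = (state_hash, frozenset(contact_pattern))
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.beta / math.sqrt(self.counts[key])  # beta / sqrt(N)

b = ContactCoverageBonus(beta=0.5)
r1 = b.bonus(42, {("index", "handle")})   # 首次出现的接触模式,奖励最大
r2 = b.bonus(42, {("index", "handle")})   # 第二次出现,奖励衰减
```

同一接触模式反复出现时奖励按 1/√N 衰减,从而驱动策略去尝试"哪个手指接触物体哪个区域"的新组合。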
[AI-3] Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control
【速读】:该论文旨在解决传统基于期望成本约束的安全强化学习(Safe Reinforcement Learning from Human Feedback, RLHF)在面对重尾分布或罕见灾难性事件时的局限性,即仅依赖期望值无法有效捕捉分布不确定性,从而导致对尾部风险和分布外失败的控制不足。其解决方案的关键在于提出一种名为Risk-sensitive Alignment via Dominance (RAD)的新框架,该框架将传统的标量期望成本约束替换为一阶随机占优(First-Order Stochastic Dominance, FSD)约束,并通过最优传输(Optimal Transport, OT)框架结合熵正则化与Sinkhorn迭代,实现可微且计算高效的优化目标;进一步引入分位数加权FSD约束,证明其能统一控制广泛的谱风险度量(Spectral Risk Measures, SRMs),从而提供一种可调的风险配置机制,使模型在保持有用性的同时显著提升有害行为的鲁棒性与安全性。
链接: https://arxiv.org/abs/2603.10938
作者: Yaswanth Chittepu,Ativ Joshi,Rajarshi Bhattacharjee,Scott Niekum
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy’s cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model’s risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.
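上文以一阶随机占优(FSD)约束替代标量期望成本约束。下面是一个示意性草图(非论文实现;假设两组成本样本数量相同),通过逐分位数比较判断经验占优关系——注意期望更低并不蕴含占优,重尾分布仍可能被拒绝:

```python
import numpy as np

def fsd_lower_cost(costs_a, costs_b):
    """Empirical first-order stochastic dominance check for costs
    (lower is better): A dominates B iff every sorted cost of A is
    <= the corresponding order statistic of B. With equal sample
    sizes this is equivalent to comparing empirical CDFs at all
    thresholds."""
    qa = np.sort(np.asarray(costs_a, dtype=float))
    qb = np.sort(np.asarray(costs_b, dtype=float))
    assert qa.shape == qb.shape, "sketch assumes equal sample sizes"
    return bool(np.all(qa <= qb))

risky = [0.0, 0.1, 0.2, 2.0]                      # heavy upper tail
mean_only = [0.1, 0.2, 0.3, 0.4]                  # lower mean, but no FSD
dominates_1 = fsd_lower_cost(mean_only, risky)    # False: 0.1 > 0.0
dominates_2 = fsd_lower_cost([0.0, 0.1, 0.2, 0.3], risky)  # True
```

论文中的分位数加权FSD可看作在上述逐分位比较中引入权重函数,从而统一控制一类谱风险度量(SRMs)。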
[AI-4] When Fine-Tuning Fails and When It Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在神经文本到语音(Neural Text-to-Speech, TTS)系统中作为语义骨干时,其冻结表示难以有效建模说话人特异性声学与感知特征的问题。解决方案的关键在于采用低秩适应(Low-Rank Adaptation, LoRA)对TTS系统的LLM骨干进行微调,从而在不破坏语言建模能力的前提下,显著提升语音一致性、音质和信噪比(Signal-to-Noise Ratio, SNR)。实验表明,LoRA微调在多说话人场景下均优于未微调的基线模型(Qwen-0.5B),尤其当训练数据具备充分的声学变异性时,可在感知质量(DNS-MOS)、说话人保真度和信号质量三个维度上实现协同优化,且该方法具有参数高效性和低延迟特性,适用于量化后的GGUF格式部署。
链接: https://arxiv.org/abs/2603.10904
作者: Anupam Purwar,Aditya Choudhary
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: We finetune the Qwen-0.5B backbone of an LLM-based TTS with LoRA to raise MOS, speaker similarity, and SNR. It works best with diverse training audio; with uniform data it can amplify noise, so tune decoding and use GGUF quantization for low-latency, stable quality
Abstract:Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments on fine-tuning the language-model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA finetuning consistently outperforms the non-finetuned base Qwen-0.5B model across three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with SNR increasing by as much as 34 percent. Crucially, these improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. Overall, this work establishes that LoRA finetuning is not merely a parameter-efficient optimization technique but an effective mechanism for better speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, with low latency when hosted in quantized GGUF form.
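LoRA微调的核心机制是冻结原权重 W,仅训练低秩矩阵 A、B,并以 alpha/r 缩放低秩增量。下面用numpy给出该通用机制的示意草图(并非论文TTS系统的实际实现;维度与初始化均为示例假设,B 初始化为零使适配器起始时恒等于原模型):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Minimal LoRA sketch: the frozen weight W is augmented by a
    low-rank update (alpha / r) * B @ A, where only A and B are trained.

    x: (d_in,), W: (d_out, d_in), A: (r, d_in), B: (d_out, r)
    """
    r = A.shape[0]
    W_eff = W + (alpha / r) * (B @ A)
    return W_eff @ x

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))      # standard init: B = 0, so the adapter
x = rng.normal(size=d_in)     # starts as an exact identity update
y0 = lora_forward(x, W, A, B, alpha=16)
```

由于只需训练 A、B(参数量 r·(d_in+d_out) 而非 d_in·d_out),这类适配在小型骨干(如Qwen-0.5B)上的训练与部署开销都很低。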
[AI-5] LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation ICLR2026
【速读】:该论文旨在解决基于Transformer的大语言模型(Large Language Models, LLMs)在自回归推理过程中因键值缓存(Key-Value Cache, KV Cache)大小随输入序列长度线性增长而导致的长上下文任务瓶颈问题。现有方法通过基于重要性评分剔除低价值缓存来缓解此问题,但近期改进方案依赖于代价高昂的草稿生成(draft generation)以预测未来响应并提升重要性估计精度,导致显著的预填充开销,限制了实际部署可行性。论文提出LookaheadKV框架,其核心创新在于不依赖显式草稿生成,而是通过在Transformer层中引入轻量级参数高效模块(parameter-efficient modules),直接学习预测真实重要性得分,从而在几乎无额外运行时开销的前提下实现优于复杂近似方法的准确性。实验表明,该方法在多种长文本理解任务中显著优于当前先进基线,并将缓存剔除成本降低最多达14.5倍,大幅提升首个词生成时间(time-to-first-token)。
链接: https://arxiv.org/abs/2603.10899
作者: Jinwoo Ahn,Ingyu Seong,Akhil Kedia,Junhan Kim,Hyemi Jang,Kangwook Lee,Yongkweon Jeon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by “glimpsing into the future”, in which a draft generator produces a surrogate future response approximating the target model’s true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines in various long-context understanding tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at this https URL.
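基于重要性得分的KV缓存剔除可示意如下(简化草图:假设逐位置重要性得分已由某种预测器给出,对应LookaheadKV中的轻量预测模块;实际系统按注意力头与层分别处理,此处的数组形状均为示例假设):

```python
import numpy as np

def evict_kv(keys, values, importance, budget):
    """Keep only the `budget` most important cached KV entries.

    keys, values: (T, d) arrays; importance: (T,) per-position scores.
    Original temporal order is preserved among the survivors so that
    positional relationships stay intact after eviction.
    """
    T = importance.shape[0]
    if budget >= T:
        return keys, values, np.arange(T)
    keep = np.sort(np.argsort(importance)[-budget:])  # top-k, re-ordered
    return keys[keep], values[keep], keep

T, d = 6, 4
keys = np.arange(T * d, dtype=float).reshape(T, d)
values = keys + 100.0
importance = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
k2, v2, kept = evict_kv(keys, values, importance, budget=3)
```

论文的贡献不在剔除步骤本身,而在于用参数高效模块直接预测重要性得分,避免先生成草稿回复再估计重要性的高昂预填充开销。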
[AI-6] Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models ICLR2026
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)微调过程中因训练数据选择不当而导致的效率低下问题,尤其是在线提示选择方法虽能提升模型更新效果,但需大量LLM回放(rollout)来筛选有信息量的样本,造成显著计算开销。解决方案的关键在于提出动态预测采样(Dynamics-Predictive Sampling, DPS),其核心思想是将每个提示在RL微调中的求解进度建模为动力学系统,利用历史回放奖励信号通过在线贝叶斯推断估计状态分布,从而在无需昂贵回放筛选的前提下,预测并优先选择具有高学习潜力的提示,显著减少冗余回放,加速训练并提升推理性能。
链接: https://arxiv.org/abs/2603.10887
作者: Yixiu Mao,Yun Qu,Qi Wang,Heming Zou,Xiangyang Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2026
Abstract:Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While significantly accelerating RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which online predicts and selects informative prompts by inferring their learning dynamics prior to costly rollouts. Specifically, we introduce a new perspective by modeling each prompt’s solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.
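DPS用隐马尔可夫模型跟踪每个提示的求解进度。下面给出一个大幅简化的示意(以Beta-Bernoulli在线后验替代HMM,仅用于说明「利用历史奖励信号在线推断、优先选择部分可解提示」的思想;类名与接口均为假设):

```python
import numpy as np

class PromptSelector:
    """Sketch of online Bayesian tracking of per-prompt solve rates.

    Each prompt's rollout outcomes are treated as Bernoulli draws with a
    Beta(a, b) posterior over the solve probability. Prompts whose
    posterior-mean solve rate is closest to 0.5 (partially solved) are
    selected for training, without extra filtering rollouts.
    """

    def __init__(self, n_prompts):
        self.a = np.ones(n_prompts)   # successes + 1 (uniform prior)
        self.b = np.ones(n_prompts)   # failures  + 1

    def update(self, prompt_id, successes, failures):
        self.a[prompt_id] += successes
        self.b[prompt_id] += failures

    def select(self, k):
        mean = self.a / (self.a + self.b)
        informativeness = -np.abs(mean - 0.5)   # peak at 50% solve rate
        return np.sort(np.argsort(informativeness)[-k:])

sel = PromptSelector(n_prompts=4)
sel.update(0, successes=8, failures=0)   # nearly always solved
sel.update(1, successes=4, failures=4)   # moderately hard
sel.update(2, successes=0, failures=8)   # never solved
sel.update(3, successes=3, failures=5)   # moderately hard
chosen = sel.select(k=2)
```

已解或无解的提示带来的梯度信号很弱,优先采样「半解」提示正是此类在线选择方法提效的来源。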
[AI-7] Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements
【速读】:该论文旨在解决生成高质量、细胞类型特异性的调控DNA序列(200bp)时存在的训练效率低和过拟合问题。传统方法如DNA-Diffusion采用U-Net作为主干网络,存在训练周期长且易记忆训练数据的问题。其解决方案的关键在于提出一种参数高效的扩散Transformer(DiT),用具备二维卷积神经网络(2D CNN)输入编码器的Transformer去噪器替代U-Net主干,并结合DDPO微调策略,利用Enformer作为奖励模型优化生成序列的调控活性预测得分。实验表明,该方案在13个epoch内达到与U-Net相当的验证损失(减少60倍训练步数),收敛误差降低39%,同时将生成序列对训练数据的相似度从5.3%降至1.7%,且通过交叉验证确认了生成信号的真实性而非奖励模型过拟合。
链接: https://arxiv.org/abs/2603.10885
作者: Jonathan Liu,Kia Ghods
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:
Abstract:We present a parameter-efficient Diffusion Transformer (DiT) for generating 200bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60× fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.
[AI-8] Semantic Landmark Particle Filter for Robot Localisation in Vineyards IROS2026
【速读】:该论文旨在解决葡萄园环境中由于行级感知歧义(row-level perceptual aliasing)导致的定位可靠性问题:平行作物行在激光雷达(LiDAR)观测中产生几乎相同的特征,使得仅依赖几何信息或视觉的SLAM系统在田头转弯等场景下易收敛至错误路径。解决方案的关键在于提出一种语义地标粒子滤波器(Semantic Landmark Particle Filter, SLPF),其核心创新是将检测到的树干(trunk)转化为语义墙(semantic walls),嵌入测量模型以增强相邻行间的区分度;同时引入轻量级GNSS作为先验,提升在语义观测稀疏时的稳定性。实验表明,该方法显著优于仅几何(AMCL)、纯视觉(RTAB-Map)及噪声GNSS基线,在绝对位姿误差(APE)和行识别正确率上均有明显改善。
链接: https://arxiv.org/abs/2603.10847
作者: Rajitha de Silva,Jonathan Cox,James R. Heselden,Marija Popović,Cesar Cadena,Riccardo Polvara
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted to IROS 2026
Abstract:Reliable localisation in vineyards is hindered by row-level perceptual aliasing: parallel crop rows produce nearly identical LiDAR observations, causing geometry-only and vision-based SLAM systems to converge towards incorrect corridors, particularly during headland transitions. We present a Semantic Landmark Particle Filter (SLPF) that integrates trunk and pole landmark detections with 2D LiDAR within a probabilistic localisation framework. Detected trunks are converted into semantic walls, forming structural row boundaries embedded in the measurement model to improve discrimination between adjacent rows. GNSS is incorporated as a lightweight prior that stabilises localisation when semantic observations are sparse. Field experiments in a 10-row vineyard demonstrate consistent improvements over geometry-only (AMCL), vision-based (RTAB-Map), and GNSS baselines. Compared to AMCL, SLPF reduces Absolute Pose Error by 22% and 65% across two traversal directions; relative to a NoisyGNSS baseline, APE decreases by 65% and 61%. Row correctness improves from 0.67 to 0.73, while mean cross-track error decreases from 1.40 m to 1.26 m. These results show that embedding row-level structural semantics within the measurement model enables robust localisation in highly repetitive outdoor agricultural environments.
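粒子滤波中语义地标的量测更新可示意如下(简化为对树干距离观测的高斯似然加权,省略了论文中的「语义墙」构造与重采样步骤;数值仅为演示行级歧义如何被一个语义地标消解):

```python
import numpy as np

def landmark_update(particles, weights, observed_range, landmark_xy, sigma=0.5):
    """One semantic-landmark measurement update of a 2D particle filter.

    particles: (N, 2) candidate robot positions; observed_range: measured
    distance to a detected trunk/pole at known map position landmark_xy.
    Weights are multiplied by a Gaussian range likelihood and renormalized.
    """
    dists = np.linalg.norm(particles - np.asarray(landmark_xy), axis=1)
    lik = np.exp(-0.5 * ((dists - observed_range) / sigma) ** 2)
    w = weights * lik
    return w / w.sum()

# Two hypotheses in adjacent, geometrically identical rows:
particles = np.array([[0.0, 0.0], [0.0, 3.0]])
weights = np.array([0.5, 0.5])
# A trunk at (2, 0) observed at range ~2.0 disambiguates them:
w = landmark_update(particles, weights, observed_range=2.0, landmark_xy=(2.0, 0.0))
```

纯几何观测下两条平行行几乎不可区分,而地图中已知位置的树干地标使权重迅速集中到正确的行假设上。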
[AI-9] Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation
【速读】:该论文旨在解决语音感知大语言模型(Speech-aware Large Language Models)在训练过程中对说话人身份信息编码能力不足的问题,即现有模型虽能处理语音输入,但其训练目标主要聚焦于语言内容或特定属性(如情感、性别),导致其在说话人验证(Speaker Verification, SV)任务中表现较弱。解决方案的关键在于:首先提出一种与模型无关的评分协议,通过置信度分数或Yes/No标记的概率对数似然比来量化模型对说话人身份的判别能力;其次设计了一种轻量级增强方法,将冻结的ECAPA-TDNN说话人嵌入通过可学习投影注入到LLM中,并仅微调LoRA适配器,从而在TinyLLaMA-1.1B上实现接近专用说话人验证系统的性能(VoxCeleb1-E数据集上EER为1.03%),同时保持自然语言交互接口不变。
链接: https://arxiv.org/abs/2603.10827
作者: Thomas Thebaud,Yuzhe Wang,Laureano Moro-Velazquez,Jesus Villalba-Lopez,Najim Dehak
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 3 Tables, 1 Figure, Under review
Abstract:Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker’s gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
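由Yes/No标记的对数概率构造连续验证分数(对两个候选标记重新归一化后取对数似然比)可示意如下;这是对文中评分协议的简化草图,函数名与输入均为示例假设:

```python
import math

def verification_score(logp_yes, logp_no):
    """Turn an LLM's Yes/No next-token log-probabilities into a
    continuous speaker-verification score: a log-likelihood ratio
    after renormalizing over just the two answer tokens.

    Positive scores favor 'same speaker'; sweeping a threshold on this
    score trades off false accepts vs. false rejects (e.g. for EER).
    """
    # renormalize so P(yes) + P(no) = 1 over the two answer tokens
    m = max(logp_yes, logp_no)
    z = math.exp(logp_yes - m) + math.exp(logp_no - m)
    p_yes = math.exp(logp_yes - m) / z
    return math.log(p_yes) - math.log(1.0 - p_yes)

s_same = verification_score(logp_yes=-0.1, logp_no=-3.0)
s_diff = verification_score(logp_yes=-4.0, logp_no=-0.2)
```

相比只输出离散Yes/No标签,这样的连续分数才能绘制DET曲线并计算EER,从而与专用说话人验证系统直接对比。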
[AI-10] Protein Counterfactuals via Diffusion-Guided Latent Optimization ICLR2026
【速读】:该论文旨在解决深度学习模型在蛋白质工程中缺乏机制性洞察和可操作指导的问题,即当模型预测某抗体不稳定时,无法为蛋白工程师提供具体、可行的突变策略以恢复稳定性并保持功能。解决方案的关键在于提出Manifold-Constrained Counterfactual Optimization for Proteins (MCCOP)框架,该框架在连续的序列-结构潜在空间中运行,利用预训练扩散模型作为流形先验(manifold prior),同时优化三个目标:有效性(实现期望的性质)、接近性(最小化突变数量)和合理性(生成可折叠的蛋白质)。通过在GFP荧光恢复、热力学稳定性增强和E3连接酶活性恢复等任务上的验证,MCCOP生成的反事实突变比离散和连续基线方法更稀疏且更符合生物合理性,且突变位点与已知的物理机制(如发色团堆积和疏水核心巩固)一致,从而实现了模型解释与假设驱动的蛋白质设计的统一。
链接: https://arxiv.org/abs/2603.10811
作者: Weronika Kłos,Sidney Bender,Lukas Kades
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures, accepted at the Gen2 Workshop at ICLR 2026
Abstract:Deep learning models can predict protein properties with unprecedented accuracy but rarely offer mechanistic insight or actionable guidance for engineering improved variants. When a model flags an antibody as unstable, the protein engineer is left without recourse: which mutations would rescue stability while preserving function? We introduce Manifold-Constrained Counterfactual Optimization for Proteins (MCCOP), a framework that computes minimal, biologically plausible sequence edits that flip a model’s prediction to a desired target state. MCCOP operates in a continuous joint sequence-structure latent space and employs a pretrained diffusion model as a manifold prior, balancing three objectives: validity (achieving the target property), proximity (minimizing mutations), and plausibility (producing foldable proteins). We evaluate MCCOP on three protein engineering tasks - GFP fluorescence rescue, thermodynamic stability enhancement, and E3 ligase activity recovery - and show that it generates sparser, more plausible counterfactuals than both discrete and continuous baselines. The recovered mutations align with known biophysical mechanisms, including chromophore packing and hydrophobic core consolidation, establishing MCCOP as a tool for both model interpretation and hypothesis-driven protein design. Our code is publicly available at this http URL.
[AI-11] Towards Intelligent Spectrum Management: Spectrum Demand Estimation Using Graph Neural Networks
【速读】:该论文旨在解决无线网络中频谱资源有限与日益增长的无线连接需求之间的矛盾,核心问题是缺乏准确刻画频谱需求动态的方法,从而影响监管机构进行高效频谱分配决策。解决方案的关键在于构建了一个基于公开部署记录的频谱需求代理指标,并提出一种分层多分辨率图注意力网络(Hierarchical Multi-resolution Graph Attention Network, HR-GAT)模型,该模型能够同时捕捉邻域效应和跨尺度模式,有效降低空间自相关性并提升泛化能力。在加拿大五个城市的实证测试中,HR-GAT相较于八种竞争基线方法将中位均方根误差(RMSE)降低了约21%,且显著减少了残差空间偏差,生成的精细化频谱需求地图可直接供监管机构使用,支持更科学的频谱共享与分配策略。
链接: https://arxiv.org/abs/2603.10802
作者: Mohamad Alkadamani,Amir Ghasemi,Halim Yanikomeroglu
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 13 pages, 10 figures. Submitted to IEEE Transactions on Machine Learning in Communications and Networking
Abstract:The growing demand for wireless connectivity, combined with limited spectrum resources, calls for more efficient spectrum management. Spectrum sharing is a promising approach; however, regulators need accurate methods to characterize demand dynamics and guide allocation decisions. This paper builds and validates a spectrum demand proxy from public deployment records and uses a graph attention network in a hierarchical, multi-resolution setup (HR-GAT) to estimate spectrum demand at fine spatial scales. The model captures both neighborhood effects and cross-scale patterns, reducing spatial autocorrelation and improving generalization. Evaluated across five Canadian cities and against eight competitive baselines, HR-GAT reduces median RMSE by roughly 21% relative to the best alternative and lowers residual spatial bias. The resulting demand maps are regulator-accessible and support spectrum sharing and spectrum allocation in wireless networks.
[AI-12] AI-Enhanced Spatial Cellular Traffic Demand Prediction with Contextual Clustering and Error Correction for 5G/6G Planning
【速读】:该论文旨在解决5G新无线电(NR)容量规划与6G数据驱动规划中,基于机器学习的精细化流量需求空间预测因空间自相关性导致的邻域泄露(neighborhood leakage)问题,该问题会人为夸大模型性能,削弱网络规划的可靠性。其解决方案的关键在于提出一种上下文感知的两阶段划分策略(context-aware two-stage splitting strategy),结合残差空间误差校正机制(residual spatial error correction),有效减少训练与测试集之间的空间信息泄露,从而提升模型的空间泛化能力,实验表明该方法在加拿大五大城市的众包使用指标数据上相较仅依赖位置聚类的方法实现了更稳定的平均绝对误差(MAE)降低。
链接: https://arxiv.org/abs/2603.10800
作者: Mohamad Alkadamani,Colin Brown,Halim Yanikomeroglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 5 pages, 8 figures. Submitted to IEEE Wireless Communications Letters
Abstract:Accurate spatial prediction of cellular traffic demand is essential for 5G NR capacity planning, network densification, and data-driven 6G planning. Although machine learning can fuse heterogeneous geospatial and socio-economic layers to estimate fine-grained demand maps, spatial autocorrelation can cause neighborhood leakage under naive train/test splits, inflating accuracy and weakening planning reliability. This paper presents an AI-driven framework that reduces leakage and improves spatial generalization via a context-aware two-stage splitting strategy with residual spatial error correction. Experiments using crowdsourced usage indicators across five major Canadian cities show consistent mean absolute error (MAE) reductions relative to location-only clustering, supporting more reliable bandwidth provisioning and evidence-based spectrum planning and sharing assessments.
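为避免空间自相关导致的邻域泄露,可将网格单元按粗粒度空间块整体划入训练集或测试集。下面是该思想的简化草图(并非论文的两阶段上下文感知划分;块大小与测试比例均为示例参数):

```python
import numpy as np

def spatial_block_split(xy, block_size, test_frac=0.25, seed=0):
    """Leakage-aware split: grid cells are grouped into coarse spatial
    blocks, and whole blocks are assigned to train or test, so that no
    test cell shares a block with an immediate training neighbor.
    """
    xy = np.asarray(xy, dtype=float)
    blocks = np.floor(xy / block_size).astype(int)   # (N, 2) block ids
    uniq, inv = np.unique(blocks, axis=0, return_inverse=True)
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_frac * len(uniq))))
    test_blocks = rng.choice(len(uniq), size=n_test, replace=False)
    is_test = np.isin(inv, test_blocks)
    return np.where(~is_test)[0], np.where(is_test)[0]

# 100 cells on a 10x10 grid, grouped into 5x5-cell blocks (4 blocks)
pts = np.array([(i, j) for i in range(10) for j in range(10)])
train_idx, test_idx = spatial_block_split(pts, block_size=5)
```

朴素随机划分会让测试单元紧邻训练单元,空间自相关使模型「抄近路」,评估分数被人为抬高;按块划分能显著削弱这种泄露。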
[AI-13] Deep Randomized Distributed Function Computation (DeepRDFC): Neural Distributed Channel Simulation
【速读】:该论文旨在解决随机分布式函数计算(Randomized Distributed Function Computation, RDFC)框架中的性能优化问题,特别是在通信负载受限且需要强函数计算保证的场景下。其核心挑战在于如何在有限公共随机性条件下,高效地近似目标概率分布并提升RDFC系统的整体性能。解决方案的关键是提出一种基于自编码器(Autoencoder, AE)的架构,通过最小化AE输出与未知目标分布之间的总变差距离(Total Variation Distance),仅利用数据样本实现分布逼近;实验表明,该方法相比传统数据压缩技术能显著降低通信负载并提升RDFC性能,从而为深度学习驱动的RDFC提供了可行路径。
链接: https://arxiv.org/abs/2603.10750
作者: Didrik Bergström,Onur Günlü
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The randomized distributed function computation (RDFC) framework, which unifies many cutting-edge distributed computation and learning applications, is considered. An autoencoder (AE) architecture is proposed to minimize the total variation distance between the probability distribution simulated by the AE outputs and an unknown target distribution, using only data samples. We demonstrate substantial RDFC performance and communication-load gains from our AEs compared to data compression methods. Our designs establish deep learning-based RDFC methods and aim to facilitate the use of RDFC methods, especially when the amount of common randomness is limited and strong function computation guarantees are required.
[AI-14] CUPID: A Plug-in Framework for Joint Aleatoric and Epistemic Uncertainty Estimation with a Single Model
【速读】:该论文旨在解决深度学习模型在高风险场景(如医疗诊断和自动驾驶)中缺乏准确不确定性估计的问题,尤其是现有方法通常仅处理单一类型的不确定性或需修改及重新训练基础模型,难以在实际系统中部署。其解决方案的关键在于提出一种通用模块CUPID(Comprehensive Uncertainty Plug-in estImation moDel),无需改动或重新训练预训练模型即可联合估计两类不确定性:通过学习的贝叶斯身份映射建模观测不确定性(aleatoric uncertainty),并通过分析模型对结构化扰动的内部响应捕捉认知不确定性(epistemic uncertainty)。CUPID可灵活插入任意网络层,提供逐层不确定性来源的可解释性,从而增强AI系统的透明度与可信度。
链接: https://arxiv.org/abs/2603.10745
作者: Xinran Xu,Xiuyi Fan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate estimation of uncertainty in deep learning is critical for deploying models in high-stakes domains such as medical diagnosis and autonomous decision-making, where overconfident predictions can lead to harmful outcomes. In practice, understanding the reason behind a model’s uncertainty and the type of uncertainty it represents can support risk-aware decisions, enhance user trust, and guide additional data collection. However, many existing methods only address a single type of uncertainty or require modifications and retraining of the base model, making them difficult to adopt in real-world systems. We introduce CUPID (Comprehensive Uncertainty Plug-in estImation moDel), a general-purpose module that jointly estimates aleatoric and epistemic uncertainty without modifying or retraining the base model. CUPID can be flexibly inserted into any layer of a pretrained network. It models aleatoric uncertainty through a learned Bayesian identity mapping and captures epistemic uncertainty by analyzing the model’s internal responses to structured perturbations. We evaluate CUPID across a range of tasks, including classification, regression, and out-of-distribution detection. The results show that it consistently delivers competitive performance while offering layer-wise insights into the origins of uncertainty. By making uncertainty estimation modular, interpretable, and model-agnostic, CUPID supports more transparent and trustworthy AI. Related code and data are available at this https URL.
[AI-15] Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning
【速读】:该论文旨在解决当前语音深度伪造检测(Speech Deepfake Detection, SDD)方法在跨域泛化能力不足以及缺乏可解释性的问题,尤其是难以提供人类可理解的推理过程和感知线索来支持判断。解决方案的关键在于提出一种名为HIR-SDD的新框架,该框架融合了大型音频语言模型(Large Audio Language Models, LALMs)的能力与基于全新人工标注数据集所推导出的思维链(chain-of-thought)推理机制,从而在提升检测准确性的同时,能够生成合理且符合人类认知逻辑的预测解释。
链接: https://arxiv.org/abs/2603.10725
作者: Artem Dvirniak,Evgeny Kushnir,Dmitrii Tarasov,Artem Iudin,Oleg Kiriukhin,Mikhail Pautov,Dmitrii Korzh,Oleg Y. Rogov
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:The modern generative audio models can be used by an adversary in an unlawful manner, specifically, to impersonate other people to gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods started to evolve. Unfortunately, current SDD methods generally suffer from the lack of generalization to new audio domains and generators. More than that, they lack interpretability, especially human-like reasoning that would naturally explain the attribution of a given audio to the bona fide or spoof class and provide human-perceptible cues. In this paper, we propose HIR-SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with the chain-of-thought reasoning derived from the novel proposed human-annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to provide reasonable justifications for predictions.
[AI-16] Probabilistic Verification of Voice Anti-Spoofing Models
【速读】:该论文旨在解决语音反欺骗模型(Voice Anti-Spoofing Models, VASMs)在面对生成式语音伪造技术(如文本到语音合成 TTS、语音克隆 VC 等)时缺乏形式化鲁棒性保障及泛化能力不足的问题。解决方案的关键在于提出 PV-VASM,一个概率化的鲁棒性验证框架,通过估计在不同语音合成技术和输入扰动下模型误分类的概率,实现对 VASMs 的鲁棒性定量评估;该方法具有模型无关性,能够有效验证未见过的语音伪造技术,同时理论推导了误差概率的上界,从而为实际部署提供可信赖的鲁棒性保证。
链接: https://arxiv.org/abs/2603.10713
作者: Evgeny Kushnir,Alexandr Kozodaev,Dmitrii Korzh,Mikhail Pautov,Oleg Kiriukhin,Oleg Y. Rogov
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.
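概率化鲁棒性验证的核心是蒙特卡洛估计误分类概率并附上置信上界。下面用Hoeffding界给出一个简化草图(`classify` 与 `sample_perturbed` 为占位函数,分别代表反欺骗模型与TTS/VC/参数变换采样器,均为示例假设,并非论文的具体界):

```python
import math
import random

def misclassification_estimate(classify, sample_perturbed, n=1000, delta=0.05, seed=0):
    """Monte-Carlo sketch of probabilistic robustness verification:
    estimate the probability that perturbed spoof inputs evade the
    detector, plus a Hoeffding upper confidence bound that holds with
    probability >= 1 - delta.
    """
    rng = random.Random(seed)
    errors = sum(classify(sample_perturbed(rng)) != 1 for _ in range(n))
    p_hat = errors / n
    bound = p_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return p_hat, bound

# toy stand-ins: the detector outputs 1 ("spoof") unless a rare
# perturbation pushes the scalar feature below a threshold
sample = lambda rng: rng.random()
detector = lambda x: 1 if x > 0.02 else 0
p_hat, bound = misclassification_estimate(detector, sample)
```

这种估计与模型内部无关(model-agnostic),因此同样适用于验证对未见过的语音合成技术的鲁棒性,只需替换扰动采样器即可。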
[AI-17] AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow INTERSPEECH2026
【速读】:该论文旨在解决目标说话人提取(Target Speaker Extraction, TSE)中多步采样导致延迟高,以及单步方案依赖混合信号时间坐标而可靠性差的问题。其核心解决方案是提出AlphaFlowTSE,一种基于无Jacobian向量积(JVP-free)AlphaFlow目标训练的一步条件生成模型,通过学习从混合语音到目标语音轨迹上的均值速度传输(mean-velocity transport),无需额外的混音比例预测,并结合区间一致性教师-学生目标稳定训练过程,从而在保持高保真度的同时提升下游自动语音识别(ASR)任务中的真实混合场景泛化能力。
链接: https://arxiv.org/abs/2603.10701
作者: Duojia Li,Shuhan Zhang,Zihan Qian,Wenxuan Wu,Shuai Wang,Qingyang Hong,Lin Li,Haizhou Li
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Submitted to Interspeech 2026 for review
Abstract:In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).
[AI-18] Repurposing Backdoors for Good: Ephemeral Intrinsic Proofs for Verifiable Aggregation in Cross-silo Federated Learning
【速读】:该论文旨在解决跨服务器联邦学习(Cross-silo Federated Learning)中安全聚合(Secure Aggregation, SA)方案无法保障聚合完整性的问题,即恶意服务器可能隐匿或篡改客户端更新,而现有可验证聚合方案依赖于重量级密码学技术(如零知识证明 ZKPs 或同态加密 HE),其计算开销随模型规模增长而急剧上升。解决方案的关键在于提出一种轻量级架构,将外在的密码学证明转变为内在证明(Intrinsic Proofs)——通过重用后门注入(backdoor injection)机制,在模型参数中嵌入可验证信号;利用灾难性遗忘(Catastrophic Forgetting)特性使这些信号具备即时可验证性但短暂存在,从而在不损害最终模型性能的前提下实现高效审计。同时设计了一个随机化单验证器审计框架,兼容SA协议,保障客户端匿名性并避免信号冲突,无需可信第三方。实验表明该方法在ResNet-18上相比传统密码基线提速超1000倍,有效扩展至大模型场景。
链接: https://arxiv.org/abs/2603.10692
作者: Xian Qin,Xue Yang,Xiaohu Tang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:While Secure Aggregation (SA) protects update confidentiality in Cross-silo Federated Learning, it fails to guarantee aggregation integrity, allowing malicious servers to silently omit or tamper with updates. Existing verifiable aggregation schemes rely on heavyweight cryptography (e.g., ZKPs, HE), incurring computational costs that scale poorly with model size. In this paper, we propose a lightweight architecture that shifts from extrinsic cryptographic proofs to Intrinsic Proofs. We repurpose backdoor injection to embed verification signals directly into model parameters. By harnessing Catastrophic Forgetting, these signals are robust for immediate verification yet ephemeral, naturally decaying to preserve final model utility. We design a randomized, single-verifier auditing framework compatible with SA, ensuring client anonymity and preventing signal collision without trusted third parties. Experiments on SVHN, CIFAR-10, and CIFAR-100 demonstrate high detection probabilities against malicious servers. Notably, our approach achieves over 1000× speedup on ResNet-18 compared to cryptographic baselines, effectively scaling to large models.
[AI-19] Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?
【速读】:该论文旨在解决黑盒对抗攻击中难以保证找到有效对抗样本的问题,即现有方法虽在实践中表现良好,但无法确保对特定神经网络模型生成对抗样本。其解决方案的关键在于提出一种名为“Contract And Conquer (CAC)”的可证明性方法,该方法通过知识蒸馏(knowledge distillation)不断扩展蒸馏数据集以逼近黑盒模型行为,并结合精确收缩对抗样本搜索空间,在固定迭代次数内提供对抗样本存在的理论保障,同时利用迁移性(transferability)特性确保攻击有效性。
链接: https://arxiv.org/abs/2603.10689
作者: Anna Chistyakova,Mikhail Pautov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Black-box adversarial attacks are widely used as tools to test the robustness of deep neural networks against malicious perturbations of input data aimed at a specific change in the output of the model. Such methods, although they remain empirically effective, usually do not guarantee that an adversarial example can be found for a particular model. In this paper, we propose Contract And Conquer (CAC), an approach to provably compute adversarial examples for neural networks in a black-box manner. The method is based on knowledge distillation of a black-box model on an expanding distillation dataset and precise contraction of the adversarial example search space. CAC is supported by the transferability guarantee: we prove that the method yields an adversarial example for the black-box model within a fixed number of algorithm iterations. Experimentally, we demonstrate that the proposed approach outperforms existing state-of-the-art black-box attack methods on the ImageNet dataset for different target models, including vision transformers.
[AI-20] FAME: Formal Abstract Minimal Explanation for Neural Networks
【速读】:该论文旨在解决大规模神经网络中生成可解释性解释时面临的两个核心挑战:一是现有方法难以扩展到大型模型,二是生成的解释往往冗长且包含不相关特征。为应对这些问题,作者提出FAME(Formal Abstract Minimal Explanations),这是一种基于抽象解释(Abstract Interpretation)的归纳推理解释方法。其关键创新在于设计了专用的扰动域(perturbation domains),从而避免了传统方法对遍历顺序的依赖;通过逐步缩小这些扰动域并利用LiRPA(Linear Relaxation-based Propagation Algorithm)边界来剔除无关特征,最终收敛至形式化的抽象最小解释(formal abstract minimal explanation)。这一机制显著减少了解释规模并提升了计算效率,同时引入了一种结合对抗攻击与VERIX+精化步骤的评估流程,用于量化解释质量,实验证明FAME在中到大规模神经网络上相较于VERIX+在解释尺寸和运行时间上均取得稳定提升。
链接: https://arxiv.org/abs/2603.10661
作者: Ryma Boumazouza,Raya Elsaleh,Melanie Ducoffe,Shahaf Bassan,Guy Katz
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We propose FAME (Formal Abstract Minimal Explanations), a new class of abductive explanations grounded in abstract interpretation. FAME is the first method to scale to large neural networks while reducing explanation size. Our main contribution is the design of dedicated perturbation domains that eliminate the need for traversal order. FAME progressively shrinks these domains and leverages LiRPA-based bounds to discard irrelevant features, ultimately converging to a formal abstract minimal explanation. To assess explanation quality, we introduce a procedure that measures the worst-case distance between an abstract minimal explanation and a true minimal explanation. This procedure combines adversarial attacks with an optional VERIX+ refinement step. We benchmark FAME against VERIX+ and demonstrate consistent gains in both explanation size and runtime on medium- to large-scale neural networks.
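上述抽象解释思路可以用区间算术做一个极简示意:把某个特征"释放"到它的扰动域内,将输入盒子(box)逐层传播通过一个小型线性+ReLU网络得到输出上下界,若预测类别被证明不变,该特征即可判定为无关并从解释中剔除。以下为笔者给出的示意代码(并非论文的 FAME/LiRPA 实现,网络权重与扰动半径均为随意设定):

```python
import numpy as np

def interval_affine(l, u, W, b):
    """区间算术:把输入盒子 [l, u] 传播通过仿射层 x -> W @ x + b。"""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ l + W_neg @ u + b, W_pos @ u + W_neg @ l + b

def feature_is_irrelevant(x, i, eps, W1, b1, W2, b2):
    """若将特征 i 释放到 [x_i - eps, x_i + eps] 后 argmax 类别可被证明不变,返回 True。"""
    l, u = x.copy(), x.copy()
    l[i] -= eps; u[i] += eps                        # 特征 i 的扰动域
    l1, u1 = interval_affine(l, u, W1, b1)
    l1, u1 = np.maximum(l1, 0), np.maximum(u1, 0)   # ReLU 单调,区间端点直接取 ReLU
    l2, u2 = interval_affine(l1, u1, W2, b2)
    c = int(np.argmax(W2 @ np.maximum(W1 @ x + b1, 0) + b2))
    # 当类别 c 的下界严格超过其余所有类别的上界时,认证成立
    return all(l2[c] > u2[j] for j in range(len(b2)) if j != c)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)
x = rng.normal(size=4)
# 解释 = 无法被证明无关的特征集合
explanation = [i for i in range(4) if not feature_is_irrelevant(x, i, 0.05, W1, b1, W2, b2)]
```

FAME 的核心差异在于对多个特征同时维护并逐步收缩扰动域,且使用比区间算术更紧的 LiRPA 线性松弛边界,此处仅演示"边界认证→剔除特征"这一步。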
[AI-21] Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions
【速读】:该论文旨在解决多智能体在共享工作空间中进行任务调度与运动规划(Scheduling and Motion Planning)的协同优化问题,尤其关注如何在资源、时间及运动约束下,安全高效地执行预定义任务。其核心挑战在于传统任务调度方法忽视了空间冲突和动态路径可行性,导致生成的计划难以实际执行。解决方案的关键在于提出一种增量式学习框架,通过交替调用现成的调度器与基于采样的运动规划器,实现调度与运动规划的闭环交互:调度器生成候选方案,运动规划器验证可行性并返回符号化反馈(如空间冲突和时序调整),从而引导调度器逐步收敛至满足时空约束的可行解。该方法在物流和作业车间调度基准上进行了验证,证明其在复杂约束下生成有效计划的能力。
链接: https://arxiv.org/abs/2603.10651
作者: Elisa Tosello,Arthur Bit-Monnot,Davide Lusuardi,Alessandro Valentini,Andrea Micheli
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Task and Motion Planning combines high-level task sequencing (what to do) with low-level motion planning (how to do it) to generate feasible, collision-free execution plans. However, in many real-world domains, such as automated warehouses, tasks are predefined, shifting the challenge to if, when, and how to execute them safely and efficiently under resource, time and motion constraints. In this paper, we formalize this as the Scheduling and Motion Planning problem for multi-object navigation in shared workspaces. We propose a novel solution framework that interleaves off-the-shelf schedulers and motion planners in an incremental learning loop. The scheduler generates candidate plans, while the motion planner checks feasibility and returns symbolic feedback, i.e., spatial conflicts and timing adjustments, to guide the scheduler towards motion-feasible solutions. We validate our proposal on logistics and job-shop scheduling benchmarks augmented with motion tasks, using state-of-the-art schedulers and sampling-based motion planners. Our results show the effectiveness of our framework in generating valid plans under complex temporal and spatial constraints, where synchronized motion is critical.
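这一"调度器提议、运动规划器反馈"的闭环可用如下玩具代码示意(并非论文实现;走廊冲突模型与各函数名均为笔者假设,仅体现符号化反馈作为新约束回流给调度器的思想):

```python
import itertools

def interleaved_planning(propose, motion_check, max_iters=50):
    """调度器提出候选计划,运动规划器检查可行性并返回符号化冲突,
    冲突作为约束加入后重新提议,直到得到运动可行的计划。"""
    constraints = set()
    for _ in range(max_iters):
        plan = propose(constraints)
        if plan is None:
            return None                      # 约束下已无候选计划
        feasible, conflict = motion_check(plan)
        if feasible:
            return plan
        constraints.add(conflict)            # 符号化反馈:排除该冲突组合
    return None

# 玩具域:计划为两台机器人各自选择的走廊,两台机器人不得共用同一走廊
def propose(constraints):
    for plan in itertools.product("AB", repeat=2):   # 机器人0、1 的走廊选择
        if plan not in constraints:
            return plan
    return None

def motion_check(plan):
    return (plan[0] != plan[1], plan)        # 共用走廊即为空间冲突

plan = interleaved_planning(propose, motion_check)
```

第一轮提议 ("A", "A") 被运动检查否决并记入约束,第二轮即得到可行计划。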
[AI-22] Detecting and Eliminating Neural Network Backdoors Through Active Paths with Application to Intrusion Detection
【速读】:该论文旨在解决机器学习后门(backdoor)攻击的检测与消除难题,此类攻击使得模型在正常输入下表现正常,但在包含特定触发器(trigger)的输入下则按攻击者意图行为,且现有检测方法难以有效识别。解决方案的关键在于利用神经网络中可解释的“活跃路径”(active paths)来定位并移除后门触发器,通过分析模型内部决策路径实现对后门机制的精准识别与干预,实验表明该方法在入侵检测场景下具有良好的有效性与可解释性。
链接: https://arxiv.org/abs/2603.10641
作者: Eirik Høyheim,Magnus Wiik Eckhoff,Gudmund Grov,Robert Flood,David Aspinall
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Machine learning backdoors have the property that the machine learning model should work as expected on normal inputs, but when the input contains a specific *trigger*, it behaves as the attacker desires. Detecting such triggers has been proven to be extremely difficult. In this paper, we present a novel and explainable approach to detect and eliminate such backdoor triggers based on active paths found in neural networks. We present promising experimental evidence of our approach, which involves injecting backdoors into a machine learning model used for intrusion detection.
[AI-23] Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction CVPR2026
【速读】:该论文旨在解决自动驾驶中轨迹预测任务在面对变长、不完整观测数据时的性能下降问题。现有方法通常假设输入为固定长度的完整轨迹,但在实际驾驶场景中,传感器或通信限制常导致观测序列长度不一且信息缺失,直接映射不完整特征至完整特征会因信息缺口而难以学习准确表示。解决方案的关键在于提出一种渐进式回溯框架(Progressive Retrospective Framework, PRF),其通过级联的回溯单元(retrospective unit)逐步对齐不完整观测与完整轨迹的特征表示;每个单元包含回溯蒸馏模块(Retrospective Distillation Module, RDM)用于特征提炼和回溯预测模块(Retrospective Prediction Module, RPM)用于基于蒸馏特征恢复历史时间步,从而实现更鲁棒的轨迹重建与预测。
链接: https://arxiv.org/abs/2603.10597
作者: Hao Zhou,Lu Qi,Jason Li,Jie Zhang,Yi Liu,Xu Yang,Mingyu Fan,Fei Luo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Paper is accepted by CVPR 2026
Abstract:Trajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed-length observations. However, real-world driving often yields variable-length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a Progressive Retrospective Framework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features. Moreover, we propose a Rolling-Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug-and-play with existing methods. Extensive experiments on datasets Argoverse 2 and Argoverse 1 demonstrate the effectiveness of PRF. Code is available at this https URL.
[AI-24] Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences
【速读】:该论文旨在解决生成模型中如何统一理解与设计基于梯度流的生成方法的问题,特别是揭示了一类新型生成模型——梯度流漂移(Gradient Flow Drifting)的数学本质。其核心解决方案在于建立了一个精确的数学框架,证明了最近提出的漂移模型(Drifting Model)与在核密度估计(Kernel Density Estimation, KDE)近似下前向KL散度(forward KL divergence)的Wasserstein-2梯度流之间存在等价关系:漂移场本质上等于KDE对数密度梯度之差 $\nabla \log p_{\mathrm{kde}} - \nabla \log q_{\mathrm{kde}}$,这正是Wasserstein-2梯度流下的粒子速度场。此外,该框架还扩展至MMD-based生成器作为不同散度梯度流的特例,并提出一种理论驱动的混合散度策略——结合反向KL与$\chi^2$散度的梯度流,以同时避免模式塌缩(mode collapse)和模式模糊(mode blurring),进一步将其推广至黎曼流形上,从而放宽核函数约束并更适用于语义空间建模。
链接: https://arxiv.org/abs/2603.10592
作者: Jiarui Cao,Zixuan Wei,Yuxin Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We reveal a precise mathematical framework about a new family of generative models which we call Gradient Flow Drifting. With this framework, we prove an equivalence between the recently proposed Drifting Model and the Wasserstein gradient flow of the forward KL divergence under kernel density estimation (KDE) approximation. Specifically, we prove that the drifting field of the drifting model (arXiv:2602.04770) equals, up to a bandwidth-squared scaling factor, the difference of KDE log-density gradients $\nabla \log p_{\mathrm{kde}} - \nabla \log q_{\mathrm{kde}}$, which is exactly the particle velocity field of the Wasserstein-2 gradient flow of $\mathrm{KL}(q\|p)$ with KDE-approximated densities. Besides that, this broad family of generative models can also include MMD-based generators, which arise as special cases of Wasserstein gradient flows of different divergences under KDE approximation. We provide a concise identifiability proof, and a theoretically grounded mixed-divergence strategy. We combine reverse KL and $\chi^2$ divergence gradient flows to simultaneously avoid mode collapse and mode blurring, and extend this method onto Riemannian manifolds, which loosens the constraints on the kernel function and makes this method more suitable for the semantic space. Preliminary experiments on synthetic benchmarks validate the framework.
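用一维高斯核 KDE 写出该粒子速度场 $\nabla \log p_{\mathrm{kde}} - \nabla \log q_{\mathrm{kde}}$ 并做显式欧拉迭代,可以直观验证粒子云向目标分布漂移。以下为示意代码(带宽、步长与样本规模均为笔者随意选取,并非论文实验设置):

```python
import numpy as np

def kde_log_density_grad(x, samples, h):
    """一维高斯核 KDE 对数密度在点 x 处的梯度:
    d/dx log( sum_i exp(-(x - s_i)^2 / (2 h^2)) )"""
    diffs = samples - x
    w = np.exp(-diffs ** 2 / (2 * h ** 2))
    return (w @ diffs) / (h ** 2 * w.sum())

rng = np.random.default_rng(1)
target = rng.normal(loc=3.0, scale=1.0, size=300)     # 来自目标分布 p 的样本
particles = rng.normal(loc=0.0, scale=1.0, size=100)  # 当前分布 q 的粒子
h, step = 0.5, 0.2
for _ in range(100):
    vel = np.array([kde_log_density_grad(xp, target, h)
                    - kde_log_density_grad(xp, particles, h) for xp in particles])
    particles = particles + step * vel  # KL(q||p) 的 Wasserstein-2 梯度流的欧拉步
```

迭代结束后粒子均值应接近目标均值 3.0,即漂移场确实将 q 推向 p。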
[AI-25] Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents
【速读】:该论文旨在解决生成式 AI(Generative AI)模型在连续控制任务中应用受限的问题,其核心挑战包括有限的上下文窗口、缺乏显式的奖励信号以及长程上下文信息的退化。解决方案的关键在于让智能体通过将经验内化到模型参数中来实现持续学习,而非依赖提示(prompt-based)记忆;为此,作者提出了一种新颖的自微调(self-finetuning)框架,其中包含双视角反思机制,可自动生成语言反馈以构建偏好数据集,并通过基于偏好的微调过程将长期交互经验固化至模型参数中,从而显著提升样本效率、稳定性和多目标优化能力。
链接: https://arxiv.org/abs/2603.10564
作者: Yuanhao Li,Haozhe Wang,Geyong Min,Nektarios Georgalas,Wang Miao
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:The integration of Generative AI models into AI-native network systems offers a transformative path toward achieving autonomous and adaptive control. However, the application of such models to continuous control tasks is impeded by intrinsic architectural limitations, including finite context windows, the lack of explicit reward signals, and the degradation of the long context. This paper posits that the key to unlocking robust continuous control is enabling agents to internalize experience by distilling it into their parameters, rather than relying on prompt-based memory. To this end, we propose a novel self-finetuning framework that enables agentic systems to learn continuously through direct interaction with the environment, bypassing the need for handcrafted rewards. Our framework implements a bi-perspective reflection mechanism that generates autonomous linguistic feedback to construct preference datasets from interaction history. A subsequent preference-based fine-tuning process distills long-horizon experiences into the model’s parameters. We evaluate our approach on a dynamic Radio Access Network (RAN) slicing task, a challenging multi-objective control problem that requires the resolution of acute trade-offs between spectrum efficiency, service quality, and reconfiguration stability under volatile network conditions. Experimental results show that our framework outperforms standard Reinforcement Learning (RL) baselines and existing Large Language Model (LLM)-based agents in sample efficiency, stability, and multi-metric optimization. These findings demonstrate the potential of self-improving generative agents for continuous control tasks, paving the way for future AI-native network infrastructure.
[AI-26] SCORE: Replacing Layer Stacking with Contractive Recurrent Depth
【速读】:该论文旨在解决深度神经网络中传统层堆叠(layer stacking)带来的优化不稳定性和信息流动效率低的问题。其核心挑战在于如何在不显著增加计算复杂度的前提下,实现更稳定且高效的深度模型训练。解决方案的关键在于提出SCORE(Skip-Connection ODE Recurrent Embedding),这是一种基于离散迭代的递归替代方案,通过一个受常微分方程(ODE)启发的收缩更新机制:$ h_{t+1} = (1 - d_t) \cdot h_t + d_t \cdot F(h_t) $,其中步长 $ d_t $ 显式控制更新幅度与稳定性。该方法利用共享权重实现参数高效,并采用固定次数的离散迭代和标准反向传播,无需ODE求解器或伴随方法,从而在图神经网络、多层感知机及Transformer语言模型等多个架构上均实现了更快的收敛速度和训练加速。
链接: https://arxiv.org/abs/2603.10544
作者: Guillaume Godin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages, 21 figures, 12 tables
Abstract:Residual connections are central to modern deep neural networks, enabling stable optimization and efficient information flow across depth. In this work, we propose SCORE (Skip-Connection ODE Recurrent Embedding), a discrete recurrent alternative to classical layer stacking. Instead of composing multiple independent layers, SCORE iteratively applies a single shared neural block using an ODE (Ordinary Differential Equation)-inspired contractive update: $h_{t+1} = (1 - d_t) \cdot h_t + d_t \cdot F(h_t)$. This formulation can be interpreted as a depth-by-iteration refinement process, where the step size $d_t$ explicitly controls stability and update magnitude. Unlike continuous Neural ODE approaches, SCORE uses a fixed number of discrete iterations and standard backpropagation without requiring ODE solvers or adjoint methods. We evaluate SCORE across graph neural networks (ESOL molecular solubility), multilayer perceptrons, and Transformer-based language models (nanoGPT). Across architectures, SCORE generally improves convergence speed and often accelerates training. SCORE also reduces parameter count through shared weights. In practice, simple Euler integration provides the best trade-off between computational cost and performance, while higher-order integrators yield marginal gains at increased compute. These results suggest that controlled recurrent depth with contractive residual updates offers a lightweight and effective alternative to classical stacking in deep neural networks.
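收缩式更新 $h_{t+1} = (1 - d_t) \cdot h_t + d_t \cdot F(h_t)$ 的行为可以用一个闭式可解的玩具共享块来验证:当 $F$ 是收缩映射时,迭代深度越大,状态越接近 $F$ 的不动点。以下示意代码中的 $F$ 为笔者构造的仿射映射,并非训练得到的网络块:

```python
import numpy as np

def score_refine(h0, F, steps, dt=0.5):
    """逐次深度精化:h <- (1 - dt) * h + dt * F(h),共享同一个块 F。"""
    h = h0
    for _ in range(steps):
        h = (1 - dt) * h + dt * F(h)
    return h

# 玩具共享块:谱半径 < 1 的仿射映射,其不动点可闭式求解
A = np.array([[0.3, 0.1], [0.0, 0.2]])
b = np.array([1.0, -1.0])
F = lambda h: A @ h + b
h_star = np.linalg.solve(np.eye(2) - A, b)   # F 的精确不动点
h = score_refine(np.zeros(2), F, steps=50)
```

50 次迭代后 h 已收敛到 h_star,体现了"以迭代换深度"的精化过程。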
[AI-27] UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery
【速读】:该论文旨在解决无人机(UAV)蜂群在随机医疗物资配送场景下的协同调度问题,核心挑战在于如何在资源有限、需求动态变化且存在通信与定位约束的条件下,实现医疗任务的优先级排序、资源分配及实时调度调整。解决方案的关键在于构建一个基于多智能体强化学习(MARL)的框架,将问题建模为部分可观测马尔可夫决策过程(POMDP),并采用近端策略优化(PPO)作为主学习算法,通过多种架构变体和训练策略对比分析其可扩展性与性能权衡,最终实现在真实地理数据驱动下对紧急医疗任务的高效响应与动态资源重分配。
链接: https://arxiv.org/abs/2603.10528
作者: Islam Guven,Mehmet Parlak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures, 2 tables, conference
Abstract:Unmanned aerial vehicles (UAVs) are increasingly used to support time-critical medical supply delivery, providing rapid and flexible logistics during emergencies and resource shortages. However, effective deployment of UAV fleets requires coordination mechanisms capable of prioritizing medical requests, allocating limited aerial resources, and adapting delivery schedules under uncertain operational conditions. This paper presents a multi-agent reinforcement learning (MARL) framework for coordinating UAV fleets in stochastic medical delivery scenarios where requests vary in urgency, location, and delivery deadlines. The problem is formulated as a partially observable Markov decision process (POMDP) in which UAV agents maintain awareness of medical delivery demands while having limited visibility of other agents due to communication and localization constraints. The proposed framework employs Proximal Policy Optimization (PPO) as the primary learning algorithm and evaluates several variants, including asynchronous extensions, classical actor–critic methods, and architectural modifications to analyze scalability and performance trade-offs. The model is evaluated using real-world geographic data from selected clinics and hospitals extracted from the OpenStreetMap dataset. The framework provides a decision-support layer that prioritizes medical tasks, reallocates UAV resources in real time, and assists healthcare personnel in managing urgent logistics. Experimental results show that classical PPO achieves superior coordination performance compared to asynchronous and sequential learning strategies, highlighting the potential of reinforcement learning for adaptive and scalable UAV-assisted healthcare logistics.
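作为主学习算法的 PPO,其核心是裁剪代理目标(clipped surrogate objective)。下面给出一个数值上可直接验证的极简实现(与论文代码无关,其中的概率与优势值均为笔者虚构):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """裁剪代理目标:mean( min(r*A, clip(r, 1-eps, 1+eps)*A) ),r 为新旧策略概率比。"""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

logp_old = np.log(np.array([0.5, 0.2, 0.3]))  # 旧策略下所采动作的对数概率
logp_new = np.log(np.array([0.6, 0.1, 0.3]))  # 当前策略下的对数概率
adv = np.array([1.0, -0.5, 2.0])              # 各转移的优势估计
obj = ppo_clip_objective(logp_new, logp_old, adv)
```

第二个样本的概率比 0.5 越过了下界 1-eps=0.8,min 取到裁剪后更保守的 -0.4,这正是 PPO 抑制过大策略更新的机制。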
[AI-28] Resource-constrained Amazons chess decision framework integrating large language models and graph attention
【速读】:该论文旨在解决资源受限环境下游戏AI模型性能下降的问题,尤其是在缺乏大规模标注数据和强大计算资源的情况下,如何实现高效且准确的决策与学习。解决方案的关键在于提出了一种轻量级混合框架,通过弱到强的泛化范式,将图结构推理(Graph-based Learning)与生成式AI(Generative AI)能力相结合:利用图注意力自动编码器作为结构过滤器以去噪大语言模型(LLM)输出,结合多步蒙特卡洛树搜索(Monte Carlo Tree Search)进行策略优化,并采用随机图遗传算法(Stochastic Graph Genetic Algorithm)提升评估信号质量;同时,借助GPT-4o-mini生成合成训练数据,从而在不依赖专家示范的前提下,从噪声监督中学习并显著提升Game of the Amazons游戏中的决策准确性与胜率。
链接: https://arxiv.org/abs/2603.10512
作者: Tianhao Qian,Zhuoxuan Li,Jinde Cao,Xinli Shi,Hanjie Liu,Leszek Rutkowski
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 20 pages, 15 figures. Supported by the National Key Research and Development Project of China (No. 2020YFA0714300), NSFC (No. 61833005, 12061088), the Open Project of Key Laboratory of Transport Industry of Comprehensive Transportation Theory (Nanjing Modern Multimodal Transportation Laboratory) (MTF2023004), and the China Postdoctoral Science Foundation (2024T170129, GZC20240261)
Abstract:Artificial intelligence has advanced significantly through the development of intelligent game-playing systems, providing rigorous testbeds for decision-making, strategic planning, and adaptive learning. However, resource-constrained environments pose critical challenges, as conventional deep learning methods heavily rely on extensive datasets and computational resources. In this paper, we propose a lightweight hybrid framework for the Game of the Amazons, which explores the paradigm of weak-to-strong generalization by integrating the structural reasoning of graph-based learning with the generative capabilities of large language models. Specifically, we leverage a Graph Attention Autoencoder to inform a multi-step Monte Carlo Tree Search, utilize a Stochastic Graph Genetic Algorithm to optimize evaluation signals, and harness GPT-4o-mini to generate synthetic training data. Unlike traditional approaches that rely on expert demonstrations, our framework learns from noisy and imperfect supervision. We demonstrate that the Graph Attention mechanism effectively functions as a structural filter, denoising the LLM’s outputs. Experiments on a $10 \times 10$ Amazons board show that our hybrid approach not only achieves a 15%–56% improvement in decision accuracy over baselines but also significantly outperforms its teacher model (GPT-4o-mini), achieving a competitive win rate of 45.0% at $N=30$ nodes and a decisive 66.5% at only $N=50$ nodes. These results verify the feasibility of evolving specialized, high-performance game AI from general-purpose foundation models under stringent computational constraints.
[AI-29] FAR-Dex: Few-shot Data Augmentation and Adaptive Residual Policy Refinement for Dexterous Manipulation ICRA
【速读】:该论文旨在解决多指灵巧手与机械臂协同实现类人灵巧操作的长期挑战,核心难点在于高质量示范数据稀缺以及高维动作空间的复杂性。其解决方案的关键在于提出了一种分层框架FAR-Dex,包含两个核心模块:一是FAR-DexGen,利用IsaacLab仿真器从少量示范中生成多样且符合物理约束的轨迹,构建高质量的数据基础;二是FAR-DexRes,引入自适应残差模块,通过融合多步轨迹片段与观测特征对策略进行精细化修正,从而提升操作精度与鲁棒性。实验表明,该方法在仿真和真实场景中均显著优于当前最优方法,任务成功率提升7%,并在真实世界中达到80%以上的成功水平,展现出优异的位置泛化能力。
链接: https://arxiv.org/abs/2603.10451
作者: Yushan Bai,Fulin Chen,Hongzheng Sun,Yuchuang Tong,En Li,Zhengtao Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Abstract:Achieving human-like dexterous manipulation through the collaboration of multi-fingered hands with robotic arms remains a longstanding challenge in robotics, primarily due to the scarcity of high-quality demonstrations and the complexity of high-dimensional action spaces. To address these challenges, we propose FAR-Dex, a hierarchical framework that integrates few-shot data augmentation with adaptive residual refinement to enable robust and precise arm-hand coordination in dexterous tasks. First, FAR-DexGen leverages the IsaacLab simulator to generate diverse and physically constrained trajectories from a few demonstrations, providing a data foundation for policy training. Second, FAR-DexRes introduces an adaptive residual module that refines policies by combining multi-step trajectory segments with observation features, thereby enhancing accuracy and robustness in manipulation scenarios. Experiments in both simulation and real-world demonstrate that FAR-Dex improves data quality by 13.4% and task success rates by 7% over state-of-the-art methods. It further achieves over 80% success in real-world tasks, enabling fine-grained dexterous manipulation with strong positional generalization.
[AI-30] The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在低比特(low-bit)训练中因表征几何结构的各向异性(anisotropy)导致的数值不稳定问题。具体而言,模型中少数方向集中了大量能量,而其余维度则形成宽泛的语义尾部,这种分布使得块级量化(blockwise quantization)的尺度由极端元素幅值决定,从而拉伸动态范围、压缩长尾语义变化,造成性能下降。解决方案的关键在于识别出主导不稳定的因素为一种相干的秩一均值偏置(rank-one mean bias),该偏置系统性地出现在各层和训练阶段,并解释了绝大多数极端激活幅值。通过简单的源级均值减法操作即可消除这一偏置,无需复杂计算,仅需基础的归约运算与标准量化核函数,即可显著提升低精度训练的稳定性,实验证明其在FP4(W4A4G4)训练下能大幅缩小与BF16的损失差距并恢复下游性能。
链接: https://arxiv.org/abs/2603.10444
作者: Hengjie Cao,Zhendong Huang,Mengyi Chen,Yifeng Yang,Fanqi Yu,Ruijun Huang,Fang Dong,Xin Zhang,Jixian Zhou,Anrui Chen,Mingzhi Dong,Yujiang Wang,Jinlong Hou,Qin Lv,Yuan Cheng,Tun Lu,Fan Yang,Li Shang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the dominant instability is rank-one, it can be eliminated through a simple source-level mean-subtraction operation. This bias-centric conditioning recovers most of the stability benefits of SVD-based spectral methods while requiring only reduction operations and standard quantization kernels. Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.
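论文的核心观察——相干的秩一均值偏置撑大了块级量化的动态范围,而源头减均值即可消除——可用如下合成实验直观复现。示意代码中用对称整数量化近似代替真实的 FP4 格式,数据为笔者构造:

```python
import numpy as np

def blockwise_quantize(x, levels=7):
    """对称块级量化(FP4 的简化替身):每行按其最大绝对值定标后取整。"""
    scale = np.abs(x).max(axis=1, keepdims=True) / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(2)
tail = rng.normal(scale=0.1, size=(64, 256))   # 长尾语义内容(小幅值)
mean_bias = np.full((64, 1), 5.0)              # 相干的秩一均值分量(大幅值)
acts = tail + mean_bias

# 直接量化:尺度被均值分量撑大,长尾信息被压进极少的量化格
err_raw = np.abs(blockwise_quantize(acts) - acts).mean()

# 源头减均值后再量化:动态范围仅由长尾内容决定
centered = acts - acts.mean(axis=1, keepdims=True)
err_centered = np.abs(blockwise_quantize(centered) - centered).mean()
```

减均值后的量化误差应远小于直接量化,对应论文中"bias-centric conditioning"仅靠归约运算即可恢复低比特稳定性的论断。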
[AI-31] Domain-Adaptive Health Indicator Learning with Degradation-Stage Synchronized Sampling and Cross-Domain Autoencoder
【速读】:该论文旨在解决健康指标(Health Indicators, HIs)建模中因工况变化导致的分布偏移问题,以及现有小卷积核一维卷积神经网络(1D-CNN)在捕捉复杂振动信号长期时序依赖关系方面的结构局限。其核心解决方案包括两个关键组件:一是退化阶段同步批次采样(Degradation Stage Synchronized Batch Sampling, DSSBS),通过核变化点检测划分退化阶段,确保源域与目标域的小批量样本在故障阶段上对齐,从而避免随机采样引发的误导性差异损失;二是跨域对齐融合大自编码器(Cross-Domain Aligned Fusion Large Autoencoder, CAFLAE),结合大感受野时序特征提取与交叉注意力机制,学习更具域不变性的表征。实验证明,该框架在韩国防务系统数据集和XJTU-SY轴承数据集上相比当前最优方法平均性能提升24.1%,验证了DSSBS通过阶段一致性采样增强跨域对齐能力,CAFLAE则为长期工业状态监测提供了高性能骨干网络。
链接: https://arxiv.org/abs/2603.10430
作者: Jungho Choo,Hanbyeol Park,Gawon Lee,Yunkyung Park,Hyerim Bae
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The construction of high quality health indicators (HIs) is crucial for effective prognostics and health management. Although deep learning has significantly advanced HI modeling, existing approaches often struggle with distribution mismatches resulting from varying operating conditions. Although domain adaptation is typically employed to mitigate these shifts, two critical challenges remain: (1) the misalignment of degradation stages during random mini-batch sampling, resulting in misleading discrepancy losses, and (2) the structural limitations of small-kernel 1D-CNNs in capturing long-range temporal dependencies within complex vibration signals. To address these issues, we propose a domain-adaptive framework comprising degradation stage synchronized batch sampling (DSSBS) and the cross-domain aligned fusion large autoencoder (CAFLAE). DSSBS utilizes kernel change-point detection to segment degradation stages, ensuring that source and target mini-batches are synchronized by their failure phases during alignment. Complementing this, CAFLAE integrates large-kernel temporal feature extraction with cross-attention mechanisms to learn superior domain-invariant representations. The proposed framework was rigorously validated on a Korean defense system dataset and the XJTU-SY bearing dataset, achieving an average performance enhancement of 24.1% over state-of-the-art methods. These results demonstrate that DSSBS improves cross-domain alignment through stage-consistent sampling, whereas CAFLAE offers a high-performance backbone for long-term industrial condition monitoring.
[AI-32] Enhancing Network Intrusion Detection Systems: A Multi-Layer Ensemble Approach to Mitigate Adversarial Attacks
【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的对抗样本对基于机器学习(Machine Learning, ML)的网络入侵检测系统(Network Intrusion Detection Systems, NIDS)造成的安全威胁问题,即对抗攻击可能误导NIDS做出错误判断,从而危及网络安全。解决方案的关键在于提出一种双层防御机制:第一层为堆叠分类器(stacking classifiers),第二层为基于自编码器(autoencoder)的异常检测模块;当第一层判定流量为良性时,第二层被激活以验证并纠正堆叠分类器的决策,同时引入对抗训练(adversarial training)进一步提升模型鲁棒性,实验证明该方法在UNSW-NB15和NSL-KDD数据集上显著增强了NIDS对对抗攻击的抵抗能力。
链接: https://arxiv.org/abs/2603.10413
作者: Nasim Soltani,Shayan Nejadshamsi,Zakaria Abou El Houda,Raphael Khoury,Kelton A. P. Costa,Tiago H. Falk,Anderson R. Avila
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Adversarial examples can represent a serious threat to machine learning (ML) algorithms. If used to manipulate the behaviour of ML-based Network Intrusion Detection Systems (NIDS), they can jeopardize network security. In this work, we aim to mitigate such risks by increasing the robustness of NIDS towards adversarial attacks. To that end, we explore two adversarial methods for generating malicious network traffic. The first method is based on Generative Adversarial Networks (GAN) and the second one is the Fast Gradient Sign Method (FGSM). The adversarial examples generated by these methods are then used to evaluate a novel multilayer defense mechanism, specifically designed to mitigate the vulnerability of ML-based NIDS. Our solution consists of one layer of stacking classifiers and a second layer based on an autoencoder. If the incoming network data are classified as benign by the first layer, the second layer is activated to ensure that the decision made by the stacking classifier is correct. We also incorporated adversarial training to further improve the robustness of our solution. Experiments on two datasets, namely UNSW-NB15 and NSL-KDD, demonstrate that the proposed approach increases resilience to adversarial attacks.
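文中使用的 FGSM 可以在一个二维逻辑回归模型上完整演示:沿损失对输入的梯度符号方向扰动样本,即可翻转原本正确的判定。示意代码中的权重与样本均为笔者虚构,并非真实的 NIDS 特征:

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """FGSM:x_adv = x + eps * sign(grad_x loss),损失为逻辑回归交叉熵。"""
    p = 1 / (1 + np.exp(-(w @ x + b)))   # sigma(w.x + b)
    grad_x = (p - y) * w                 # 交叉熵对输入 x 的梯度
    return x + eps * np.sign(grad_x)

w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([1.0, 0.5]), 1.0         # 原样本判定正确:w.x + b = 1.5 > 0
x_adv = fgsm(x, y, w, b, eps=0.9)
score_adv = w @ x_adv + b                # 扰动后的决策得分
```

扰动后决策得分由正转负,即分类结果被翻转;论文正是用此类样本(连同 GAN 生成样本)来评估其多层防御机制。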
[AI-33] Effective Dataset Distillation for Spatio-Temporal Forecasting with Bi-dimensional Compression ICDE’26
【速读】:该论文旨在解决大规模时空时间序列(spatio-temporal time series)数据在深度学习模型训练中面临的高计算成本与资源消耗问题。现有数据蒸馏(dataset distillation)方法仅能压缩单一维度(如时间或空间),难以有效应对时空联合维度带来的数据冗余。其解决方案的关键在于提出STemDist,一种专为时空时间序列预测设计的数据蒸馏方法:通过平衡压缩时间和空间维度以显著降低训练时间和内存占用,并引入聚类级别蒸馏与子集粒度蒸馏相结合的策略,在保持甚至提升预测性能的同时实现高效训练(实验表明最快可提速6倍、内存节省8倍、预测误差降低12%)。
链接: https://arxiv.org/abs/2603.10410
作者: Taehyung Kwon,Yeonje Choi,Yeongho Kim,Kijung Shin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: to be published in the 42nd IEEE International Conference on Data Engineering (ICDE '26)
Abstract:Spatio-temporal time series are widely used in real-world applications, including traffic prediction and weather forecasting. They are sequences of observations over extensive periods and multiple locations, naturally represented as multidimensional data. Forecasting is a central task in spatio-temporal analysis, and numerous deep learning methods have been developed to address it. However, as dataset sizes and model complexities continue to grow in practice, training deep learning models has become increasingly time- and resource-intensive. A promising solution to this challenge is dataset distillation, which synthesizes compact datasets that can effectively replace the original data for model training. Although successful in various domains, including time series analysis, existing dataset distillation methods compress only one dimension, making them less suitable for spatio-temporal datasets, where both spatial and temporal dimensions jointly contribute to the large data volume. To address this limitation, we propose STemDist, the first dataset distillation method specialized for spatio-temporal time series forecasting. A key idea of our solution is to compress both temporal and spatial dimensions in a balanced manner, reducing training time and memory. We further reduce the distillation cost by performing distillation at the cluster level rather than the individual location level, and we complement this coarse-grained approach with a subset-based granular distillation technique that enhances forecasting performance. On five real-world datasets, we show empirically that, compared to both general and time-series dataset distillation methods, datasets distilled by our STemDist method enable model training (1) faster (up to 6X), (2) more memory-efficient (up to 8X), and (3) more effective (with up to 12% lower prediction error).
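STemDist 学到的蒸馏数据无法在此复现,但"时空双维压缩"带来的体量收益可以用简单的窗口/簇平均示意:时间维做非重叠窗口平均,空间维做簇内平均(簇划分在此用固定分组代替真实聚类)。以下为笔者构造的示意代码:

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 120, 30                     # 时间步数 x 位置数
data = rng.normal(size=(T, N))

# 时间维压缩:长度为 4 的非重叠窗口取平均
t_win = 4
temporal = data.reshape(T // t_win, t_win, N).mean(axis=1)

# 空间维压缩:每个(此处为固定的)5 位置簇取平均
s_win = 5
compressed = temporal.reshape(T // t_win, N // s_win, s_win).mean(axis=2)

reduction = data.size / compressed.size   # 4 * 5 = 20 倍的体量压缩
```

两个维度各自的压缩率相乘即总压缩率,这正是"双维压缩优于单维压缩"直觉的来源;STemDist 进一步在此结构上以梯度匹配等方式学习合成数据,而非简单平均。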
[AI-34] Designing Service Systems from Textual Evidence
【速读】:该论文旨在解决服务系统配置选择中的高效最优决策问题,即在自动化评估存在系统性偏差(arm-dependent bias)的情况下,如何以最小的人工审核成本高置信度地识别出性能最优的服务配置。其核心挑战在于:尽管大型语言模型(Large Language Models, LLMs)能低成本生成文本质量评分,但这些评分因评估对象和实例不同而产生偏差;人工专家评审虽准确却昂贵。解决方案的关键在于提出一种名为PP-LUCB的算法,该算法通过结合代理评分与逆概率加权残差构造无偏估计量,并构建任意时间有效的置信序列,从而动态决定评估哪些配置以及是否请求人工审计,将资源集中于LLM最不可靠的场景,实现近最优的采样效率和成本控制。
链接: https://arxiv.org/abs/2603.10400
作者: Ruicheng Ao,Hongyu Chen,Siyang Gao,Hanwei Li,David Simchi-Levi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: 67 pages
Abstract:Designing service systems requires selecting among alternative configurations – choosing the best chatbot variant, the optimal routing policy, or the most effective quality control procedure. In many service systems, the primary evidence of performance quality is textual – customer support transcripts, complaint narratives, compliance review reports – rather than the scalar measurements assumed by classical optimization methods. Large language models (LLMs) can read such textual evidence and produce standardized quality scores, but these automated judges exhibit systematic biases that vary across alternatives and evaluation instances. Human expert review remains accurate but costly. We study how to identify the best service configuration with high confidence while minimizing expensive human audits, given that automated evaluation is cheap but biased. We formalize this as a sequential decision problem where a biased proxy score is observed for every evaluation, and a verified outcome can be acquired selectively at additional cost. We prove that LLM-only selection fails under arm-dependent bias, and that naive selective-audit estimators can be asymptotically biased. We develop an estimator combining proxy scores with inverse-propensity-weighted residuals and construct anytime-valid confidence sequences. Our algorithm, PP-LUCB, jointly decides which alternatives to evaluate and whether to request human audits, concentrating reviews where the LLM judge is least reliable. We prove correctness and establish instance-dependent cost bounds showing near-optimal efficiency. On a customer support ticket classification task, our algorithm correctly identifies the best model in 40/40 trials while achieving 90% audit cost reduction.
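算法所依赖的无偏估计量——代理评分均值加上逆概率加权残差——可以在合成数据上直接验证:即便 LLM 评分带有系统性偏置,校正后的估计仍收敛到真实均值。示意代码中偏置大小与审计概率均为笔者假设:

```python
import numpy as np

def ipw_corrected_mean(proxy, audited, y_audit, pi):
    """无偏估计:mean(proxy) + mean( 1{audited} * (y - proxy) / pi )。"""
    resid = np.where(audited, (y_audit - proxy) / pi, 0.0)
    return proxy.mean() + resid.mean()

rng = np.random.default_rng(4)
n = 20000
y = rng.normal(loc=1.0, size=n)                   # 真实(人工核验)结果,均值 1.0
proxy = y + 0.5 + rng.normal(scale=0.2, size=n)   # LLM 评分:带 +0.5 的系统性偏置
pi = 0.3                                          # 每条样本以概率 pi 送人工审计
audited = rng.random(n) < pi
est = ipw_corrected_mean(proxy, audited, y, pi)
```

仅用 30% 的审计量,校正估计即可消除代理评分的偏置;PP-LUCB 在此估计量之上再构造任意时间有效的置信序列来决定采样与审计。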
[AI-35] On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD AAAI2026
【速读】:该论文旨在解决深度学习中模型泛化性能提升的机制问题,特别是针对标签噪声(label noise)在随机梯度下降(SGD)训练过程中如何影响模型学习动态并促进泛化能力的内在原理。其解决方案的关键在于揭示了标签噪声驱动下两层过参数化线性网络的学习行为具有两个阶段:第一阶段,模型权重幅值逐渐减小,使模型从“懒惰 regime”(lazy regime)过渡到“丰富 regime”(rich regime);第二阶段,模型权重与真实插值器(ground-truth interpolator)之间的对齐度增强,最终实现收敛。这一理论分析表明,标签噪声通过引导模型进入丰富 regime 来最小化地解释其在实践中提升泛化性能的现象,并进一步将该机制推广至 Sharpness-Aware Minimization (SAM) 等更广泛的优化算法,实验结果验证了理论的有效性。
链接: https://arxiv.org/abs/2603.10397
作者: Tongcheng Zhang,Zhanpeng Zhou,Mingze Wang,Andi Han,Wei Huang,Taiji Suzuki,Junchi Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026 (oral)
Abstract:One crucial factor behind the success of deep learning lies in the implicit bias induced by noise inherent in gradient-based training algorithms. Motivated by empirical observations that training with noisy labels improves model generalization, we delve into the underlying mechanisms behind stochastic gradient descent (SGD) with label noise. Focusing on a two-layer over-parameterized linear network, we analyze the learning dynamics of label noise SGD, unveiling a two-phase learning behavior. In *Phase I*, the magnitudes of model weights progressively diminish, and the model escapes the lazy regime and enters the rich regime. In *Phase II*, the alignment between model weights and the ground-truth interpolator increases, and the model eventually converges. Our analysis highlights the critical role of label noise in driving the transition from the lazy to the rich regime and minimally explains its empirical success. Furthermore, we extend these insights to Sharpness-Aware Minimization (SAM), showing that the principles governing label noise SGD also apply to broader optimization algorithms. Extensive experiments, conducted under both synthetic and real-world setups, strongly support our theory. Our code is released at this https URL.
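论文的设定可以在一个对角线性网络(two-layer linear,$w = u \odot v$)上做最小化复现:每步给标签注入高斯噪声的 SGD 仍能把模型训练到真实插值器附近。以下示意代码的规模与超参数为笔者选取,仅演示训练动态,并非论文理论结论的验证:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 10, 50
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:2] = [1.0, 2.0]            # 稀疏的真实插值器
y = X @ w_star

u = np.ones(d)
v = np.ones(d)                     # 两层(对角)线性网络:f(x) = <u * v, x>
lr, sigma = 0.01, 0.5
for step in range(5000):
    i = rng.integers(n)
    y_noisy = y[i] + sigma * rng.normal()   # 每步 SGD 注入标签噪声
    err = (u * v) @ X[i] - y_noisy
    u -= lr * err * v * X[i]       # 对 u 的梯度
    v -= lr * err * u * X[i]       # 对 v 的梯度(此处顺序更新,示意即可)
clean_loss = np.mean((X @ (u * v) - y) ** 2)
```

训练结束后,无关坐标上的权重幅值显著缩小、干净损失远低于初始值,与论文描述的"幅值先收缩、再与插值器对齐"的两阶段图景一致。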
[AI-36] Verbalizing LLMs' Higher-order Uncertainty via Imprecise Probabilities
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在不确定性量化方面存在的系统性不足问题,尤其是在模糊问答、上下文学习和自我反思等场景下,传统基于经典概率框架的不确定性提取技术无法准确刻画模型行为。其解决方案的关键在于引入**不精确概率(imprecise probabilities)**这一理论框架,将不确定性分为两层:第一层不确定性(first-order uncertainty)刻画模型对某一提示可能给出哪些回复的不确定性,第二层不确定性(second-order uncertainty)则量化底层概率模型本身的不可确定性,即“关于不确定性的不确定性”。作者提出了一套通用的基于提示(prompt-based)的不确定性提取方法及后处理流程,能够直接获取并量化这两阶不确定性,从而提升LLM不确定性报告的真实性,增强模型可信度并支持下游决策。
链接: https://arxiv.org/abs/2603.10396
作者: Anita Yang,Krikamol Muandet,Michele Caprio,Siu Lun Chau,Masaki Adachi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the growing demand for eliciting uncertainty from large language models (LLMs), empirical evidence suggests that LLM behavior is not always adequately captured by the elicitation techniques developed under the classical probabilistic uncertainty framework. This mismatch leads to systematic failure modes, particularly in settings that involve ambiguous question-answering, in-context learning, and self-reflection. To address this, we propose novel prompt-based uncertainty elicitation techniques grounded in imprecise probabilities, a principled framework for representing and eliciting higher-order uncertainty. Here, first-order uncertainty captures uncertainty over possible responses to a prompt, while second-order uncertainty (uncertainty about uncertainty) quantifies indeterminacy in the underlying probability model itself. We introduce general-purpose prompting and post-processing procedures to directly elicit and quantify both orders of uncertainty, and demonstrate their effectiveness across diverse settings. Our approach enables more faithful uncertainty reporting from LLMs, improving credibility and supporting downstream decision-making.
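按摘要的思路,二阶不确定性可以用不精确概率区间来表达:对同一问题多次询问一阶概率,再取其包络。下面是一个示意性的汇总函数(接口为笔者假设,论文的后处理流程更为完整):

```python
def credal_summary(prob_samples):
    """把对同一提示多次引出的一阶概率汇总为不精确概率区间
    [lower, upper];区间宽度可作为二阶不确定性(关于不确定性的
    不确定性)的一个简单度量:宽度越大,概率模型本身越不可确定。"""
    lower, upper = min(prob_samples), max(prob_samples)
    return {"lower": lower, "upper": upper, "indeterminacy": upper - lower}
```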
[AI-37] Safe Probabilistic Planning for Human-Robot Interaction using Conformal Risk Control
【速读】:该论文旨在解决人机交互(Human-Robot Interaction, HRI)场景中如何在复杂人类行为下提供形式化安全保证的问题,尤其关注传统控制屏障函数(Control Barrier Functions, CBF)因预测误差导致的安全性不足。其解决方案的关键在于将CBF与置信风险控制(Conformal Risk Control, CRC)相结合:通过CRC量化并控制CBF安全值的预测误差,从而建立约束满足概率的严格形式化保障;同时引入一种动态调整安全裕度的算法,根据当前交互情境自适应优化安全性与任务效率之间的平衡。实验表明,该方法显著降低了碰撞率和安全违规事件,同时保持了高目标达成成功率和控制效率。
链接: https://arxiv.org/abs/2603.10392
作者: Jake Gonzales,Kazuki Mizuta,Karen Leung,Lillian J. Ratliff
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we present a novel probabilistic safe control framework for human-robot interaction that combines control barrier functions (CBFs) with conformal risk control to provide formal safety guarantees while considering complex human behavior. The approach uses conformal risk control to quantify and control the prediction errors in CBF safety values and establishes formal guarantees on the probability of constraint satisfaction during interaction. We introduce an algorithm that dynamically adjusts the safety margins produced by conformal risk control based on the current interaction context. Through experiments on human-robot navigation scenarios, we demonstrate that our approach significantly reduces collision rates and safety violations as compared to baseline methods while maintaining high success rates in goal-reaching tasks and efficient control. The code, simulations, and other supplementary material can be found on the project website: this https URL.
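摘要中“用置信风险控制量化 CBF 预测误差”的做法,核心是一个保形分位数:在校准集上取预测误差的 (1 - α) 分位数作为安全裕度。下面是该步骤的最小 Python 示意(函数名与接口为笔者假设,论文的动态裕度调整机制未在此体现):

```python
import math

def conformal_margin(calib_errors, alpha):
    """校准集上预测误差(真实 CBF 值与预测值之差)的保形
    (1 - alpha) 分位数,带有限样本修正 (n + 1)。"""
    s = sorted(calib_errors)
    n = len(s)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return s[k - 1]

def certified_safe(h_pred, margin):
    # 仅当预测屏障值扣除安全裕度后仍非负时判定安全
    return h_pred - margin >= 0
```

这样,只要新误差与校准误差可交换,`certified_safe` 为真时约束满足的概率至少为 1 - α。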
[AI-38] Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在依赖标量概率评估可靠性时,难以捕捉推理过程内部结构动态性的问题。其解决方案的关键在于提出TRACED框架,通过理论驱动的几何运动学分解推理轨迹为“进展(Progress,位移)”与“稳定性(Stability,曲率)”两个维度,揭示出正确推理表现为高进展、稳定轨迹,而幻觉则体现为低进展、不稳定的模式(即停滞的位移伴随高曲率波动)。该方法借助这些几何特征构建概率模型,在多个基准测试中实现竞争力与更强鲁棒性,并首次将高曲率映射为“犹豫环路(Hesitation Loops)”,位移映射为“确定性累积(Certainty Accumulation)”,从而以物理视角解析机器思维的内在动力学机制。
链接: https://arxiv.org/abs/2603.10384
作者: Xinyan Jiang,Ninghao Liu,Di Wang,Lijie Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), we reveal a distinct topological divergence: correct reasoning manifests as high-progress, stable trajectories, whereas hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, our probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. Crucially, TRACED bridges geometry and cognition by mapping high curvature to "Hesitation Loops" and displacement to "Certainty Accumulation", offering a physical lens to decode the internal dynamics of machine thought.
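摘要中的“位移(进展)”与“曲率(稳定性)”可以对一条表征轨迹直接计算。下面给出两个最简的几何量示意(纯 Python,具体定义为笔者对摘要的直观复现,非 TRACED 原始公式):

```python
import math

def progress(traj):
    """推理轨迹(等长向量列表)首末表征之间的位移,对应"进展"。"""
    s, e = traj[0], traj[-1]
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(s, e)))

def mean_turning_angle(traj):
    """相邻步进方向间的平均夹角(弧度),作为曲率的简单代理:
    角度越大,轨迹越不稳定(摘要中的"犹豫环路"特征)。"""
    steps = [tuple(b - a for a, b in zip(p, q)) for p, q in zip(traj, traj[1:])]
    def angle(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))
    angles = [angle(u, v) for u, v in zip(steps, steps[1:])]
    return sum(angles) / len(angles)
```

直线轨迹(高进展、零曲率)对应正确推理的特征,来回折返的轨迹(低进展、高曲率)则对应幻觉特征。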
[AI-39] Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design
【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型中计算资源在专家层(expert layers)与注意力层(attention layers)之间的最优分配问题,以实现高效扩展模型容量并最大化性能。其关键解决方案是定义了一个新的架构参数 $ r $,即每 token 分配给专家层的浮点运算量(FLOPs)占总计算量的比例,并通过大量 GPT 风格 MoE Transformer 实验发现最优比例 $ r^* $ 与总计算预算呈幂律关系,且受模型稀疏度影响。基于此,作者推导出 $ r^* $ 的显式公式,从而可在固定计算预算下精确调控专家与注意力子层的算力分配,进而将 Chinchilla 缩放定律推广至包含架构维度的新框架,为 MoE 模型设计提供可量化、可优化的实践指导。
链接: https://arxiv.org/abs/2603.10379
作者: Junzhuo Li,Peijie Jiang,Changxin Tian,Jia Liu,Zhiqiang Zhang,Xuming Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio r as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio r^* follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for r^*, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.
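摘要称 r^* 与总算力呈幂律关系 r^* = a·C^b;这类关系通常在 log-log 空间用最小二乘拟合。下面是一个自包含的拟合示意(数据与系数为虚构,仅演示方法):

```python
import math

def fit_power_law(compute, r_star):
    """在 log-log 空间做一元最小二乘,拟合 r* = a * C^b,
    返回 (a, b)。log 变换后幂律变为线性:log r* = log a + b * log C。"""
    xs = [math.log(c) for c in compute]
    ys = [math.log(r) for r in r_star]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b
```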
[AI-40] Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning
【速读】:该论文旨在解决稀疏自动编码器(Sparse Autoencoder, SAE)只能定位语言模型中概念的位置,但无法揭示这些概念在多步推理过程中如何相互作用的问题。其核心解决方案是提出因果概念图(Causal Concept Graph, CCG),这是一种基于稀疏可解释潜在特征的有向无环图(Directed Acyclic Graph, DAG),其中边表示概念间的因果依赖关系。CCG通过结合任务条件下的SAE进行概念发现与DAGMA风格的可微结构学习实现图结构恢复,并引入因果保真度评分(Causal Fidelity Score, CFS)量化图引导干预对下游任务的影响强度,从而有效捕捉概念间的因果交互机制。
链接: https://arxiv.org/abs/2603.10377
作者: Md Muntaqim Meherab,Noor Islam S. Mohammad,Faiza Feroz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n = 15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p < 0.0001 after Bonferroni correction. Learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.
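摘要中随机基线的 CFS 约为 1,提示该分数很可能是以随机干预效应为基准的归一化量。下面是按此理解写的示意实现;这只是笔者对 CFS 的一种可能复现,论文的原始定义可能不同:

```python
def causal_fidelity_score(guided_effects, random_effects):
    """以随机干预的平均下游效应为基准,衡量图引导干预的
    平均效应放大倍数(>1 表示概念图确实指向了因果相关的特征)。
    注意:归一化方式为笔者假设,非论文原始公式。"""
    g = sum(guided_effects) / len(guided_effects)
    r = sum(random_effects) / len(random_effects)
    return g / r
```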
[AI-41] Few-Shot Adaptation to Non-Stationary Environments via Latent Trend Embedding for Robotics
【速读】:该论文旨在解决机器人系统在真实环境中因潜在环境因素变化导致的概念漂移(concept shift)问题,即输入与输出关系随时间发生变化,而传统适应方法通过更新模型参数易引发灾难性遗忘且计算成本高。解决方案的关键在于提出一种基于潜在趋势标识(Trend ID)的少样本自适应框架:通过反向传播估计低维环境状态(Trend ID),同时保持模型参数不变;为防止因逐样本隐变量引起的过拟合,引入时序正则化和状态转移模型以约束潜空间的平滑演化,从而实现对未见环境的高效、可解释的少样本适应。
链接: https://arxiv.org/abs/2603.10373
作者: Yasuyuki Fujii(1),Emika Kameda(1),Hiroki Fukada(2),Yoshiki Mori(3),Tadashi Matsuo(4),Nobutaka Shimada(1) ((1) College of Information Science and Engineering, Ritsumeikan University, Osaka, Japan, (2) Production and Technology Department, NIPPN CORPORATION, Tokyo, Japan, (3) University of Osaka, Osaka, Japan, (4) National Institute of Technology, Ichinoseki College, Iwate, Japan)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Robotic systems operating in real-world environments often suffer from concept shift, where the input-output relationship changes due to latent environmental factors that are not directly observable. Conventional adaptation methods update model parameters, which may cause catastrophic forgetting and incur high computational cost. This paper proposes a latent Trend ID-based framework for few-shot adaptation in non-stationary environments. Instead of modifying model weights, a low-dimensional environmental state, referred to as the Trend ID, is estimated via backpropagation while the model parameters remain fixed. To prevent overfitting caused by per-sample latent variables, we introduce temporal regularization and a state transition model that enforces smooth evolution of the latent space. Experiments on a quantitative food grasping task demonstrate that the learned Trend IDs are distributed across distinct regions of the latent space with temporally consistent trajectories, and that few-shot adaptation to unseen environments is achieved without modifying model parameters. The proposed framework provides a scalable and interpretable solution for robotics applications operating across diverse and evolving environments.
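摘要的核心操作是“冻结模型参数、仅对低维 Trend ID 做反向传播,并对偏离上一时刻的幅度加时间平滑正则”。下面用一个标量线性模型 y = w·x + v·z 给出最小示意(模型形式与接口均为笔者假设,仅用于说明机制):

```python
def estimate_trend_id(samples, w, v, z_prev, lam=0.0, lr=0.1, steps=200):
    """固定模型参数 (w, v),仅对标量潜变量 z(Trend ID)做梯度下降;
    lam 对 (z - z_prev)^2 施加时间平滑正则,防止逐样本潜变量过拟合。
    samples 为少量 (x, y) 观测,对应少样本适应场景。"""
    z = z_prev
    for _ in range(steps):
        # 均方误差对 z 的梯度 + 平滑正则项的梯度
        grad = sum(2 * (w * x + v * z - y) * v for x, y in samples) / len(samples)
        grad += 2 * lam * (z - z_prev)
        z -= lr * grad
    return z
```

lam = 0 时 z 完全拟合新环境;lam 增大时估计值被拉向历史 Trend ID,潜空间轨迹因此更平滑。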
[AI-42] HEAL: Hindsight Entropy-Assisted Learning for Reasoning Distillation
【速读】:该论文旨在解决大模型推理能力(Large Reasoning Models, LRMs)向小模型迁移时因拒绝采样(rejection sampling)限制而导致的“教师天花板”问题,即传统方法将教师模型视为静态过滤器,忽略其在复杂“边缘案例”中无法独立探索有效解的情况,从而制约学生模型的推理能力提升。解决方案的关键在于提出一种无需强化学习(RL-free)的框架——Hindsight Entropy-Assisted Learning (HEAL),其核心创新包括:(1) 基于熵动态检测关键推理断点并注入事后提示的主动干预机制(Guided Entropy-Assisted Repair, GEAR);(2) 通过困惑度-不确定性比估计器(Perplexity-Uncertainty Ratio Estimator, PURE)区分真实认知突破与虚假捷径的严格筛选协议;(3) 采用三阶段渐进式答案引导课程演化策略(Progressive Answer-guided Curriculum Evolution, PACE),从基础对齐逐步推进至前沿突破,实现更高效、鲁棒的推理能力蒸馏。
链接: https://arxiv.org/abs/2603.10359
作者: Wenjing Zhang,Jiangze Yan,Jieyun Huang,Yi Shen,Shuming Shi,Ping Chen,Ning Wang,Zhaoxiang Liu,Kai Wang,Shiguo Lian
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 5 figures
Abstract:Distilling reasoning capabilities from Large Reasoning Models (LRMs) into smaller models is typically constrained by the limitation of rejection sampling. Standard methods treat the teacher as a static filter, discarding complex “corner-case” problems where the teacher fails to explore valid solutions independently, thereby creating an artificial “Teacher Ceiling” for the student. In this work, we propose Hindsight Entropy-Assisted Learning (HEAL), an RL-free framework designed to bridge this reasoning gap. Drawing on the educational theory of the Zone of Proximal Development (ZPD), HEAL synergizes three core modules: (1) Guided Entropy-Assisted Repair (GEAR), an active intervention mechanism that detects critical reasoning breakpoints via entropy dynamics and injects targeted hindsight hints to repair broken trajectories; (2) Perplexity-Uncertainty Ratio Estimator (PURE), a rigorous filtering protocol that decouples genuine cognitive breakthroughs from spurious shortcuts; and (3) Progressive Answer-guided Curriculum Evolution (PACE), a three-stage distillation strategy that organizes training from foundational alignment to frontier breakthrough. Extensive experiments on multiple benchmarks demonstrate that HEAL significantly outperforms traditional SFT distillation and other baselines.
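GEAR 模块“通过熵动态检测推理断点”的思路可以用一个很小的例子说明:计算每一步的下一词分布熵,熵突然升高的步即为注入事后提示的候选位置。以下为示意实现(阈值检测方式为笔者假设):

```python
import math

def token_entropy(probs):
    """离散分布的香农熵(自然对数底)。分布越平坦熵越高,
    表示模型在该步越"犹豫"。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def find_breakpoints(step_probs, threshold):
    """返回熵超过阈值的推理步索引,作为注入事后提示
    (hindsight hints)的候选断点。"""
    return [i for i, probs in enumerate(step_probs)
            if token_entropy(probs) > threshold]
```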
[AI-43] Utility Function is All You Need: LLM-based Congestion Control
【速读】:该论文旨在解决通信网络中拥塞控制协议设计的复杂性问题,特别是在分布式场景下,不同应用具有差异化优化目标和需求时,如何高效生成适配的效用函数(utility functions)以实现网络性能与应用需求的协同优化。其解决方案的关键在于提出GenCC框架,该框架结合大语言模型(Large Language Models, LLMs)的代码生成能力与真实网络测试床,通过生成式代码演化策略或数学链式思维(chain-of-thought, CoT)引导方式自动设计拥塞控制效用函数,从而显著提升协议性能——实验表明,在不同网络场景下相较现有最优协议可提升37%至142%。
链接: https://arxiv.org/abs/2603.10357
作者: Neta Rozen-Schiff,Liron Schiff,Stefan Schmid
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:Congestion is a critical and challenging problem in communication networks. Congestion control protocols allow network applications to tune their sending rate in a way that optimizes their performance and the network utilization. In the common distributed setting, the applications cannot collaborate with each other directly but instead obtain similar estimations about the state of the network using latency and loss measurements. These measurements can be fed into analytical functions, referred to by utility functions, whose gradients help each and all distributed senders to converge to a desired state. The above process becomes extremely complicated when each application has different optimization goals and requirements. Crafting these utilization functions has been a research subject for over a decade, with small incremental changes requiring rigorous mathematical analysis as well as real-world experiments. In this work, we present GenCC, a framework leveraging the code generation capabilities of large language models (LLMs) coupled with realistic network testbed, to design congestion control utility functions. Using GenCC, we analyze the impact of different guidance strategies on the performance of the generated protocols, considering application-specific requirements and network capacity. Our results show that LLMs, guided by either a generative code evolution strategy or mathematical chain-of-thought (CoT), can obtain close to optimal results, improving state-of-the-art congestion control protocols by 37%-142%, depending on the scenario.
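摘要所说的“效用函数 + 梯度驱动发送端收敛”可以用一个 Vivace 风格的例子说明:效用奖励吞吐、惩罚丢包与时延上升,各发送端沿效用梯度调整速率。以下代码仅用于说明 GenCC 所搜索的函数空间,常数与形式均为示意,并非论文生成的协议:

```python
def vivace_style_utility(rate, loss_rate, latency_grad, b=900.0, c=11.35):
    """Vivace 风格效用示意:吞吐项 rate^0.9 为凹函数,
    丢包项与时延梯度项为线性惩罚。常数仅为演示取值。"""
    return rate ** 0.9 - b * rate * loss_rate - c * rate * latency_grad

def rate_step(rate, loss_rate, latency_grad, step=1e-3, delta=1e-3):
    # 发送端用有限差分近似效用对速率的梯度,做一步梯度上升
    up = vivace_style_utility(rate + delta, loss_rate, latency_grad)
    down = vivace_style_utility(rate - delta, loss_rate, latency_grad)
    return rate + step * (up - down) / (2 * delta)
```

无丢包、无时延上升时梯度为正(提速);出现明显丢包时梯度转负(降速),这正是分布式发送端据以收敛的信号。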
[AI-44] Federated Active Learning Under Extreme Non-IID and Global Class Imbalance CVPR2026
【速读】:该论文旨在解决联邦主动学习(Federated Active Learning, FAL)在现实场景中因全局类别不平衡和客户端数据高度异质性而导致性能下降的问题。其解决方案的关键在于提出一种自适应的公平性增强框架 FairFAL,该框架通过轻量级预测差异推断全局不平衡与局部-全局差异,从而动态选择全局模型或本地模型进行查询;同时利用全局特征引导的伪标签机制促进类别感知的采样,并结合两阶段不确定性-多样性平衡采样策略(含 k-center 优化),显著提升了对少数类别的覆盖能力与整体性能。实验表明,FairFAL 在长尾分布和非独立同分布(non-IID)条件下均优于现有先进方法。
链接: https://arxiv.org/abs/2603.10341
作者: Chen-Chen Zong,Sheng-Jun Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026
Abstract:Federated active learning (FAL) seeks to reduce annotation cost under privacy constraints, yet its effectiveness degrades in realistic settings with severe global class imbalance and highly heterogeneous clients. We conduct a systematic study of query-model selection in FAL and uncover a central insight: the model that achieves more class-balanced sampling, especially for minority classes, consistently leads to better final performance. Moreover, global-model querying is beneficial only when the global distribution is highly imbalanced and client data are relatively homogeneous; otherwise, the local model is preferable. Based on these findings, we propose FairFAL, an adaptive class-fair FAL framework. FairFAL (1) infers global imbalance and local-global divergence via lightweight prediction discrepancy, enabling adaptive selection between global and local query models; (2) performs prototype-guided pseudo-labeling using global features to promote class-aware querying; and (3) applies a two-stage uncertainty-diversity balanced sampling strategy with k-center refinement. Experiments on five benchmarks show that FairFAL consistently outperforms state-of-the-art approaches under challenging long-tailed and non-IID settings. The code is available at this https URL.
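摘要提到的“两阶段不确定性-多样性采样 + k-center 精化”中,k-center 一步是标准的贪心算法:每次选距已选集合最远的样本以最大化覆盖。下面给出自包含的示意(距离函数由调用方注入):

```python
def k_center_greedy(points, k, dist):
    """贪心 k-center:从索引 0 出发,每次选取与已选集合
    最小距离最大的样本,提升查询批次的多样性/覆盖度。"""
    selected = [0]
    while len(selected) < k:
        best_i, best_d = -1, -1.0
        for i in range(len(points)):
            if i in selected:
                continue
            d = min(dist(points[i], points[j]) for j in selected)
            if d > best_d:
                best_i, best_d = i, d
        selected.append(best_i)
    return selected
```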
[AI-45] PC-Diffuser: Path-Consistent Capsule CBF Safety Filtering for Diffusion-Based Trajectory Planner
【速读】:该论文旨在解决扩散模型驱动的轨迹规划方法在复杂交通场景中缺乏安全性保障的问题,尤其是在罕见或分布外(out-of-distribution)情况下可能出现灾难性失败的风险。现有扩散规划器虽具备良好的闭环性能,但其安全特性难以验证,且无法有效应对动态约束冲突。解决方案的关键在于提出PC-Diffuser框架,通过将可证明的安全性结构——即路径一致的屏障函数(path-consistent barrier function)——直接嵌入扩散去噪循环中,使安全性成为轨迹生成过程的内在属性而非事后修复手段。具体而言,该方法在每一步去噪过程中引入基于胶囊距离的碰撞风险评估、基于运动学自行车模型的动力学可行性转换以及无几何畸变的安全过滤机制,从而实现迭代式、上下文感知的安全修正,同时保持轨迹分布的学习一致性。
链接: https://arxiv.org/abs/2603.10330
作者: Eugene Ku,Yiwei Lyu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous driving in complex traffic requires planners that generalize beyond hand-crafted rules, motivating data-driven approaches that learn behavior from expert demonstrations. Diffusion-based trajectory planners have recently shown strong closed-loop performance by iteratively denoising a full-horizon plan, but they remain difficult to certify and can fail catastrophically in rare or out-of-distribution scenarios. To address this challenge, we present PC-Diffuser, a safety augmentation framework that embeds a certifiable, path-consistent barrier-function structure directly into the denoising loop of diffusion planning. The key idea is to make safety an intrinsic part of trajectory generation rather than a post-hoc fix: we enforce forward invariance along the rollout while preserving the diffusion model’s intended path geometry. Specifically, PC-Diffuser (i) evaluates collision risk using a capsule-distance barrier function that better reflects vehicle geometry and reduces unnecessary conservativeness, (ii) converts denoised waypoints into dynamically feasible motion under a kinematic bicycle model, and (iii) applies a path-consistent safety filter that eliminates residual constraint violations without geometric distortion, so the corrected plan remains close to the learned distribution. By injecting these safety-consistent corrections at every denoising step and feeding the refined trajectory back into the diffusion process, PC-Diffuser enables iterative, context-aware safeguarding instead of post-hoc repair…
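摘要中的“胶囊距离屏障函数”把车辆近似为线段膨胀出的胶囊体。下面以“点 vs 胶囊”这一简化情形给出屏障值的示意(论文处理的是车辆间几何,函数接口为笔者假设):

```python
import math

def point_segment_dist(p, a, b):
    """二维点 p 到线段 ab 的欧氏距离(先把 p 投影到线段上)。"""
    ax, ay = a
    bx, by = b
    px, py = p
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    t = 0.0 if denom == 0 else max(0.0, min(1.0,
        ((px - ax) * abx + (py - ay) * aby) / denom))
    cx, cy = ax + t * abx, ay + t * aby
    return math.hypot(px - cx, py - cy)

def capsule_barrier(p, a, b, radius, margin=0.0):
    """CBF 式屏障值:点 p 位于胶囊(线段 ab 膨胀 radius)
    加安全裕度之外时为正,进入胶囊内部时为负。"""
    return point_segment_dist(p, a, b) - radius - margin
```

相比外接圆,胶囊贴合车辆的细长外形,因而摘要称其能减少不必要的保守性。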
[AI-46] Simulation-in-the-Reasoning (SiR): A Conceptual Framework for Empirically Grounded AI in Autonomous Transportation
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂动态领域(如交通系统)中推理缺乏实证依据的问题,即当前LLM的推理过程主要依赖文本叙述和假设性推演,难以验证其有效性。解决方案的关键在于提出“推理中的仿真”(Simulation-in-the-Reasoning, SiR)这一概念框架,将领域特定的仿真器(如交通仿真器)嵌入LLM的推理循环中,使中间推理步骤可执行为仿真实验,从而实现从叙事合理性向可验证的“假说-仿真-分析”工作流转变。此方法通过模型上下文协议(Model Context Protocol, MCP)调用外部仿真工具,支持策略假设生成、多场景评估与迭代优化,为构建可信的自主交通系统提供实证基础。
链接: https://arxiv.org/abs/2603.10294
作者: Wuping Xin
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have advanced reasoning through techniques like Chain-of-Thought (CoT). However, their reasoning largely re-mains textual and hypothetical, lacking empirical grounding in complex, dynamic domains like transportation. This paper introduces Simulation-in-the-Reasoning (SiR), a novel conceptual framework that embeds domain-specific simulators directly into the LLM reasoning loop. By treating intermediate reasoning steps as executable simulation experiments, SiR transforms LLM reasoning from narrative plausibility into a falsifiable, hypothesis-simulate-analyze workflow. We discuss applications, where LLM can formulate Intelligent Transport System (ITS) strategy hypotheses, invoke a traffic simulator via the Model Context Protocol (MCP), evaluate results under different demand patterns, and refine strategies through verification and aggregation. While implementing the framework is part of our ongoing work, this paper primarily establishes the conceptual foundation, discusses design considerations like API granularity, and outlines the vision of SiR as a cornerstone for interactive transportation digital twins. We argue that SiR represents a critical step towards trustworthy, empirically-validated AI for autonomous transportation systems.
[AI-47] Hybrid Self-evolving Structured Memory for GUI Agents
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)驱动的图形用户界面(GUI)智能体在真实世界计算机任务中面临的挑战,包括长周期工作流、界面多样性以及频繁的中间错误。现有方法依赖于从大量轨迹中构建的外部记忆,但其检索机制局限于离散摘要或连续嵌入的扁平化处理,缺乏人类记忆所具备的结构化组织与自我演化能力。解决方案的关键在于提出一种受大脑启发的混合自演化结构化记忆(Hybrid Self-evolving Structured Memory, HyMEM),该机制通过图结构耦合离散的高层符号节点与连续的轨迹嵌入,支持多跳检索、节点更新驱动的自演化,并在推理过程中动态刷新工作记忆,从而显著提升GUI智能体的性能表现。
链接: https://arxiv.org/abs/2603.10291
作者: Sibo Zhu,Wenyi Wu,Kun Zhou,Stephen Wang,Biwei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The remarkable progress of vision-language models (VLMs) has enabled GUI agents to interact with computers in a human-like manner. Yet real-world computer-use tasks remain difficult due to long-horizon workflows, diverse interfaces, and frequent intermediate errors. Prior work equips agents with external memory built from large collections of trajectories, but relies on flat retrieval over discrete summaries or continuous embeddings, falling short of the structured organization and self-evolving characteristics of human memory. Inspired by the brain, we propose Hybrid Self-evolving Structured Memory (HyMEM), a graph-based memory that couples discrete high-level symbolic nodes with continuous trajectory embeddings. HyMEM maintains a graph structure to support multi-hop retrieval, self-evolution via node update operations, and on-the-fly working-memory refreshing during inference. Extensive experiments show that HyMEM consistently improves open-source GUI agents, enabling 7B/8B backbones to match or surpass strong closed-source models; notably, it boosts Qwen2.5-VL-7B by +22.5% and outperforms Gemini2.5-Pro-Vision and GPT-4o.
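HyMEM 的“图结构上的多跳检索”在最简形式下就是从种子概念节点出发的限深 BFS。以下为示意实现(节点与邻接表结构为笔者假设,论文的检索还结合了连续嵌入):

```python
from collections import deque

def multi_hop_retrieve(graph, seeds, max_hops):
    """在概念节点邻接表(dict: 节点 -> 后继列表)上做 BFS,
    返回距种子概念 max_hops 跳以内可达的全部节点。"""
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] >= max_hops:
            continue
        for nxt in graph.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)
```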
[AI-48] Intrinsic Numerical Robustness and Fault Tolerance in a Neuromorphic Algorithm for Scientific Computing
【速读】:该论文旨在解决神经形态计算(neuromorphic computing)在实际应用中因硬件缺陷或结构扰动导致的稳定性问题,特别是如何在存在神经元缺失或脉冲丢失的情况下仍保持算法精度。解决方案的关键在于利用一种基于生物启发的原生脉冲神经网络算法,该算法在求解偏微分方程时展现出内在的容错能力——即使多达32%的神经元被删除、90%的脉冲丢失,系统仍能维持较高的计算准确性。这种鲁棒性可通过调整结构超参数进行调控,表明脑样结构设计对提升神经形态算法的容错性能具有实质性贡献。
链接: https://arxiv.org/abs/2603.10246
作者: Bradley H. Theilman,James B. Aimone
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:
Abstract:The potential for neuromorphic computing to provide intrinsic fault tolerance has long been speculated, but the brain’s robustness in neuromorphic applications has yet to be demonstrated. Here, we show that a previously described, natively spiking neuromorphic algorithm for solving partial differential equations is intrinsically tolerant to structural perturbations in the form of ablated neurons and dropped spikes. The tolerance band for these perturbations is large: we find that as many as 32 percent of the neurons and up to 90 percent of the spikes may be entirely dropped before a significant degradation in the accuracy results. Furthermore, this robustness is tunable through structural hyperparameters. This work demonstrates that the specific brain-like inspiration behind the algorithm contributes to a significant degree of robustness expected from brain-like neuromorphic algorithms.
[AI-49] Rethinking the Harmonic Loss via Non-Euclidean Distance Layers
【速读】:该论文旨在解决交叉熵损失(cross-entropy loss)在深度神经网络训练中存在可解释性差、权重无界增长及训练效率低等问题,这些问题可能导致昂贵的训练动态和延迟泛化现象(如grokking)。其解决方案的关键在于提出并系统评估基于多种距离度量(distance metrics)的谐波损失(harmonic loss)扩展形式,将原本仅使用欧几里得距离(Euclidean distance)的谐波损失推广至包括余弦距离(cosine distance)、Bray-Curtis距离和马氏距离(Mahalanobis distance)等在内的多维空间度量体系。通过三维度评估——模型性能、可解释性和可持续性,研究发现:在视觉任务中,余弦距离在提升准确率的同时显著降低碳排放;在语言模型中,基于余弦距离的谐波损失能增强梯度稳定性、优化表示结构并减少能耗,从而实现更高效、稳定且环境友好的训练过程。
链接: https://arxiv.org/abs/2603.10225
作者: Maxwell Miller-Golub,Kamil Faber,Marcin Pietron,Panpan Zheng,Pasquale Minervini,Roberto Corizzo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Cross-entropy loss has long been the standard choice for training deep neural networks, yet it suffers from interpretability limitations, unbounded weight growth, and inefficiencies that can contribute to costly training dynamics. The harmonic loss is a distance-based alternative grounded in Euclidean geometry that improves interpretability and mitigates phenomena such as grokking, or delayed generalization on the test set. However, the study of harmonic loss remains narrow: only Euclidean distance is explored, and no systematic evaluation of computational efficiency or sustainability was conducted. We extend harmonic loss by systematically investigating a broad spectrum of distance metrics as replacements for the Euclidean distance. We comprehensively evaluate distance-tailored harmonic losses on both vision backbones and large language models. Our analysis is framed around a three-way evaluation of model performance, interpretability, and sustainability. On vision tasks, cosine distances provide the most favorable trade-off, consistently improving accuracy while lowering carbon emissions, whereas Bray-Curtis and Mahalanobis further enhance interpretability at varying efficiency costs. On language models, cosine-based harmonic losses improve gradient and learning stability, strengthen representation structure, and reduce emissions relative to cross-entropy and Euclidean heads. Our code is available at: this https URL.
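谐波损失的核心是把 logits 换成“到各类别权重向量的距离”,类别概率正比于 1/d^n;论文考察的正是把这里的距离度量从欧氏换成余弦、Bray-Curtis、马氏等。以下为余弦距离版本的示意(n、eps 等细节为常见取法,未必与论文完全一致):

```python
import math

def cosine_distance(u, v):
    """1 - 余弦相似度,取值范围 [0, 2]。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def harmonic_probs(x, class_vecs, n=2, dist=cosine_distance, eps=1e-12):
    """谐波概率:p_i 正比于 1 / d(x, w_i)^n,
    距离度量 dist 可整体替换,这正是本文考察的自由度。"""
    inv = [1.0 / (dist(x, w) ** n + eps) for w in class_vecs]
    z = sum(inv)
    return [val / z for val in inv]

def harmonic_loss(x, class_vecs, label, **kw):
    """负对数谐波概率,替代交叉熵作为训练目标。"""
    return -math.log(harmonic_probs(x, class_vecs, **kw)[label])
```

表征与正确类别权重方向重合时距离趋于 0、概率趋于 1,损失趋于 0;类别权重因此可直接解释为“类原型”。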
[AI-50] Multilingual AI-Driven Password Strength Estimation with Similarity-Based Detection
【速读】:该论文旨在解决现有密码强度检测器(Password Strength Meter, PSM)在识别多语言场景下弱密码时性能不足的问题,尤其是针对印度语境下的密码特征缺乏专门建模。其关键解决方案包括:首先,引入非英语训练数据(印度语种文本),提升PSM对本地化密码模式的识别能力;其次,采用由生成式AI(Generative AI)如ChatGPT生成的数据替代传统对抗生成网络(PassGAN)生成的数据,实验证明前者在密码强度预测上表现更优;再次,设计基于Jaro相似度匹配机制,有效识别与已知弱密码高度相似的变体密码,弥补传统直接字符串匹配方法的局限性;最终,构建首个面向印度用户的定制化PSM模型,在Jaro阈值为0.5时实现近乎完美的匹配准确率,验证了语言敏感型数据增强与智能匹配策略的有效性。
链接: https://arxiv.org/abs/2603.10217
作者: Nikitha M. Palaniappan,Ying He
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures
Abstract:Considering the rise of cyberattack incidents worldwide, ensuring stronger passwords is necessary. Developing a password strength meter (PSM) can help users create stronger passwords when creating an account on an online platform. This research aimed to explore whether incorporating a non-English training dataset (specifically Indian) can improve the performance of a PSM. Findings show that PSMs can be improved by utilising learning of words from other languages. Another contribution of the research was to compare and provide an analysis of AI generated data (specifically by ChatGPT) and PassGAN (existing state-of-the-art model), proving that PassGAN-like tools may no longer be needed as the performance is higher using AI generated data. To further strengthen detection, a Jaro similarity-based matching mechanism was incorporated, enabling the classification of passwords that are highly similar to known weak passwords - this addresses limitations of direct matching techniques used in prior work. A final novel contribution is on developing a PSM tailored for Indian passwords, which has not been developed previously - this resulted in a near-perfect matching accuracy using a Jaro function value of 0.5. Although performance improvements were constrained by limited data and training, results suggest that using the ChatGPT dataset is a viable and effective strategy for developing secure, language-aware password strength meters.
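论文采用的 Jaro 相似度是一个标准字符串度量:在限定窗口内统计匹配字符数 m 与换位数 t,得分为 (m/|s1| + m/|s2| + (m-t)/m)/3。下面给出实现及按论文 0.5 阈值判断弱口令变体的用法(`is_weak_variant` 的接口为笔者假设):

```python
def jaro(s1, s2):
    """Jaro 相似度,取值 [0, 1],1.0 表示完全相同。"""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1  # 匹配窗口半径
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):  # 统计窗口内的匹配字符
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0  # 统计换位:按匹配顺序逐对比较
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def is_weak_variant(pw, weak_list, threshold=0.5):
    """与任一已知弱口令的 Jaro 相似度达到阈值即判为弱口令变体。"""
    return any(jaro(pw, w) >= threshold for w in weak_list)
```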
[AI-51] MCP-in-SoS: Risk assessment framework for open-source MCP servers
【速读】:该论文旨在解决当前广泛采用的模型上下文协议(Model Context Protocol, MCP)服务器在开源生态中存在系统性安全风险的问题,尤其缺乏对这些风险的大规模、结构化评估。其解决方案的关键在于引入一种结合静态代码分析与威胁建模的风险评估框架:首先通过识别通用弱点枚举(Common Weakness Enumeration, CWE)缺陷,并映射到MITRE攻击模式分类(Common Attack Pattern Enumerations and Classifications, CAPEC)以关联真实世界攻击场景;进而构建多维度评分机制,综合衡量攻击发生的可能性和影响程度,从而为MCP服务器的安全设计提供量化依据,推动“安全优先”开发实践。
链接: https://arxiv.org/abs/2603.10194
作者: Pratyay Kumar,Miguel Antonio Guirao Aguilera,Srikathyayani Srikanteswara,Satyajayant Misra,Abu Saleh Md Tayeen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Model Context Protocol (MCP) servers have rapidly emerged over the past year as a widely adopted way to enable Large Language Model (LLM) agents to access dynamic, real-world tools. As MCP servers proliferate and become easy to adopt via open-source releases, understanding their security risks becomes essential for dependable production agent deployments. Recent work has developed MCP threat taxonomies, proposed mitigations, and demonstrated practical attacks. However, to the best of our knowledge, no prior study has conducted a systematic, large-scale assessment of weaknesses in open-source MCP servers. Motivated by this gap, we apply static code analysis to identify Common Weakness Enumeration (CWE) weaknesses and map them to common attack patterns and threat categories using the MITRE Common Attack Pattern Enumerations and Classifications (CAPEC) to ground risk in real-world threats. We then introduce a risk-assessment framework for the MCP landscape that combines these threats using a multi-metric scoring of likelihood and impact. Our findings show that many open-source MCP servers contain exploitable weaknesses that can compromise confidentiality, integrity, and availability, underscoring the need for secure-by-design MCP server development.
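摘要中的“可能性与影响的多指标风险评分”通常落在风险矩阵一类的聚合上。以下仅为一种示意性聚合(取各弱点可能性×影响的最大值并归一化);论文的评分细节可能不同:

```python
def mcp_risk_score(findings):
    """把每个 CWE 弱点的 (可能性, 影响) 评分(均为 1-5 量表)
    聚合为 [0, 1] 区间内的服务器风险分:取最大乘积再除以 25。
    聚合方式为笔者假设,仅用于说明框架思路。"""
    return max(l * i for l, i in findings) / 25.0
```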
[AI-52] Compatibility at a Cost: Systematic Discovery and Exploitation of MCP Clause-Compliance Vulnerabilities
【速读】:该论文旨在解决模型上下文协议(Model Context Protocol, MCP)因兼容性设计导致的新型安全攻击问题,即“兼容性滥用攻击”(compatibility-abusing attacks),这类攻击利用MCP规范中为适配多类AI代理而放宽的行为约束,使攻击者可实施静默提示注入、拒绝服务(DoS)等恶意行为。解决方案的关键在于提出首个系统性的分析框架,通过构建语言无关的中间表示(Intermediate Representation, IR)统一不同编程语言实现的MCP SDK,并结合大语言模型(LLM)引导的语义推理进行可审计的静态分析,从而识别跨语言和条款级别的合规性缺陷;同时,基于对MCP条款攻击语义的形式化建模,设计三种攻击模态并开发模态驱动的检测流水线,精准定位可被利用的非合规问题。
链接: https://arxiv.org/abs/2603.10163
作者: Nanzi Yang,Weiheng Bai,Kangjie Lu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The Model Context Protocol (MCP) is a recently proposed interoperability standard that unifies how AI agents connect with external tools and data sources. By defining a set of common client-server message exchange clauses, MCP replaces fragmented integrations with a standardized, plug-and-play framework. However, to be compatible with diverse AI agents, the MCP specification relaxes many behavioral constraints into optional clauses, leading to misuse-prone SDK implementation. We identify it as a new attack surface that allows adversaries to achieve multiple attacks (e.g, silent prompt injection, DoS, etc.), named as \emphcompatibility-abusing attacks. In this work, we present the first systematic framework for analyzing this new attack surface across multi-language MCP SDKs. First, we construct a universal and language-agnostic intermediate representation (IR) generator that normalizes SDKs of different languages. Next, based on the new IR, we propose auditable static analysis with LLM-guided semantic reasoning for cross-language/clause compliance analysis. Third, by formalizing the attack semantics of the MCP clauses, we build three attack modalities and develop a modality-guided pipeline to uncover exploitable non-compliance issues.
[AI-53] Mashup Learning: Faster Finetuning by Remixing Past Checkpoints
【Summary】: This paper addresses the problem that the many training artifacts (checkpoints) produced by finetuning large language models (LLMs) on specific domains sit idle and are rarely reused: current practice trains each new task from scratch, which is wasteful and inefficient. The key idea, "Mashup Learning", is to identify the stored checkpoints most relevant to a target dataset, aggregate them via model merging, and use the result as an improved initialization for the new task, which significantly improves downstream performance and accelerates convergence.
Link: https://arxiv.org/abs/2603.10156
Authors: Sofia Maria Lo Cicero Vaina, Artem Chumachenko, Max Ryabinin
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures. Code: this https URL
Abstract:Finetuning on domain-specific data is a well-established method for enhancing LLM performance on downstream tasks. Training on each dataset produces a new set of model weights, resulting in a multitude of checkpoints saved in-house or on open-source platforms. However, these training artifacts are rarely reused for subsequent experiments despite containing improved model abilities for potentially similar tasks. In this paper, we propose Mashup Learning, a simple method to leverage the outputs of prior training runs to enhance model adaptation to new tasks. Our procedure identifies the most relevant historical checkpoints for a target dataset, aggregates them with model merging, and uses the result as an improved initialization for training. Across 8 standard LLM benchmarks, four models, and two collections of source checkpoints, Mashup Learning consistently improves average downstream accuracy by 0.5-5 percentage points over training from scratch. It also accelerates convergence, requiring 41-46% fewer training steps and up to 37% less total wall-clock time to match from-scratch accuracy, including all selection and merging overhead.
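The select-then-merge recipe the abstract describes can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the use of cosine similarity over task embeddings for checkpoint selection and uniform weight averaging as the merge rule are assumptions, and all names (`select_checkpoints`, `merge_checkpoints`) are hypothetical.

```python
import numpy as np

def select_checkpoints(target_emb, checkpoint_embs, k=2):
    """Rank stored checkpoints by cosine similarity to the target task."""
    sims = {
        name: float(np.dot(target_emb, emb) /
                    (np.linalg.norm(target_emb) * np.linalg.norm(emb)))
        for name, emb in checkpoint_embs.items()
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

def merge_checkpoints(checkpoints, names):
    """Uniform weight averaging, one simple form of model merging."""
    merged = {}
    for key in checkpoints[names[0]]:
        merged[key] = sum(checkpoints[n][key] for n in names) / len(names)
    return merged
```

The merged state dict would then serve as the initialization for finetuning on the new task, in place of the base model's weights.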
[AI-54] Social Knowledge for Cross-Domain User Preference Modeling
【Summary】: This paper addresses cross-domain user preference prediction: delivering personalized recommendations when no user feedback is available in the target domain, i.e., zero-shot recommendation by transferring preferences learned in other domains. The key is a joint social embedding space learned from a large-scale sample of the Twitter (now X) network that maps users and popular entities (e.g., music artists) into the same vector space, where the relevance of a candidate entity to a user is measured by cosine similarity. Beyond substantially outperforming a strong popularity-based baseline in link-prediction experiments, the analysis shows that socio-demographic factors encoded in the embeddings correlate with cross-domain user preferences, and the approach can facilitate social modeling of end users with large language models (LLMs).
Link: https://arxiv.org/abs/2603.10148
Authors: Nir Lotan, Adir Solomon, Ido Guy, Einat Minkov
Institutions: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:
Abstract:We demonstrate that user preferences can be represented and predicted across topical domains using large-scale social modeling. Given information about popular entities favored by a user, we project the user into a social embedding space learned from a large-scale sample of the Twitter (now X) network. By representing both users and popular entities in a joint social space, we can assess the relevance of candidate entities (e.g., music artists) using cosine similarity within this embedding space. A comprehensive evaluation using link prediction experiments shows that this method achieves effective personalization in zero-shot setting, when no user feedback is available for entities in the target domain, yielding substantial improvements over a strong popularity-based baseline. In-depth analysis further illustrates that socio-demographic factors encoded in the social embeddings are correlated with user preferences across domains. Finally, we argue and demonstrate that the proposed approach can facilitate social modeling of end users using large language models (LLMs).
[AI-55] Agentic Control Center for Data Product Optimization
【Summary】: This paper addresses the high cost and inefficiency of having domain experts hand-craft the supporting assets of a data product (such as example question-SQL pairs). The key to the solution is a continuous-optimization system built on specialized AI agents: by automatically surfacing questions users may care about, monitoring multi-dimensional quality metrics, and supporting human-in-the-loop controls, it turns data into observable, refinable assets that balance automation with human trust and oversight.
Link: https://arxiv.org/abs/2603.10133
Authors: Priyadarshini Tamilselvan, Gregory Bramble, Sola Shirai, Ken C. L. Wong, Faisal Chowdhury, Horst Samulowitz
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 3 figures
Abstract:Data products enable end users to gain greater insights about their data by providing supporting assets, such as example question-SQL pairs which can be answered using the data or views over the database tables. However, producing useful data products is challenging, and typically requires domain experts to hand-craft supporting assets. We propose a system that automates data product improvement through specialized AI agents operating in a continuous optimization loop. By surfacing questions, monitoring multi-dimensional quality metrics, and supporting human-in-the-loop controls, it transforms data into observable and refinable assets that balance automation with trust and oversight.
[AI-56] AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
【Summary】: This paper targets two problems in existing Vision-Language-Action (VLA) models and diffusion policies for robot control: the temporal inconsistency caused by resetting the temporal context with every new observation, and the frequency mismatch between fast control and slow reasoning. The key is a standalone autoregressive (AR) Action Expert that maintains a long-lived memory of its own history and generates actions as a continuous causal sequence conditioned on refreshable vision-language prefixes, yielding persistent context awareness and temporally consistent action generation. A re-anchoring mechanism mathematically accounts for perception staleness during both training and inference, effectively synchronizing the asynchronous vision-language-action modalities and producing smoother action trajectories and higher task success rates.
Link: https://arxiv.org/abs/2603.10126
Authors: Yutong Hu, Jan-Nico Zaech, Nikolay Nikolov, Yuanqi Yao, Sombit Dey, Giuliano Albanese, Renaud Detry, Luc Van Gool, Danda Paudel
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.
[AI-57] Hardware Efficient Approximate Convolution with Tunable Error Tolerance for CNNs
【Summary】: This paper addresses the high computational demands that hinder deploying modern convolutional neural networks (CNNs) on edge devices. Traditional "hard" sparsity (skipping mathematical zeros) loses effectiveness in deep layers or with smooth activations such as Tanh. The key is a "soft sparsity" paradigm that uses a hardware-friendly Most Significant Bit (MSB) proxy to identify and skip negligible non-zero multiplications. Integrated as a custom RISC-V instruction and evaluated on LeNet-5 (MNIST), it reduces ReLU MACs by 88.42% and Tanh MACs by 74.87% with no accuracy loss, clearly outperforming zero-skipping, and clock-gating inactive multipliers yields estimated power savings of 35.2% (ReLU) and 29.96% (Tanh).
Link: https://arxiv.org/abs/2603.10100
Authors: Vishal Shashidhar, Anupam Kumari, Roy P Paily
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Submitted to IEEE GCON 2026
Abstract:Modern CNNs' high computational demands hinder edge deployment, as traditional "hard" sparsity (skipping mathematical zeros) loses effectiveness in deep layers or with smooth activations like Tanh. We propose a "soft sparsity" paradigm using a hardware-efficient Most Significant Bit (MSB) proxy to skip negligible non-zero multiplications. Integrated as a custom RISC-V instruction and evaluated on LeNet-5 (MNIST), this method reduces ReLU MACs by 88.42% and Tanh MACs by 74.87% with zero accuracy loss, outperforming zero-skipping by 5x. By clock-gating inactive multipliers, we estimate power savings of 35.2% for ReLU and 29.96% for Tanh. While memory access makes power reduction sub-linear to operation savings, this approach significantly optimizes resource-constrained inference.
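The MSB-proxy idea above can be illustrated in software: take the position of the most significant set bit of a fixed-point operand as a cheap magnitude estimate, and skip any multiply-accumulate whose operand falls below a tunable MSB threshold. This is a behavioral sketch under assumed integer quantization; the threshold value and function names are illustrative, not the paper's hardware design.

```python
def msb(x):
    """Index of the most significant set bit of a non-negative int (-1 if zero)."""
    return x.bit_length() - 1

def approx_dot(weights, activations, msb_threshold):
    """Accumulate only products whose activation MSB meets the threshold."""
    total, macs = 0, 0
    for w, a in zip(weights, activations):
        if msb(abs(a)) < msb_threshold:   # negligible operand: skip the MAC
            continue
        total += w * a
        macs += 1
    return total, macs
```

Raising `msb_threshold` trades a larger approximation error for fewer executed MACs, which is the tunable error tolerance in the title.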
[AI-58] Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models AAMAS
【Summary】: This paper addresses the difficulty of interpreting, trusting, and debugging the "black-box" policies produced by deep reinforcement learning (DRL) oracles in multi-agent reinforcement learning (MARL). The core solution, Code-Space Response Oracles (CSRO), reframes best-response computation as a code-generation task, prompting large language models (LLMs) to generate policies directly as human-readable code; this makes policies inherently interpretable and leverages the LLM's pretrained knowledge to discover complex, human-like strategies. The key innovation is replacing the DRL oracle with an LLM and strengthening policy generation via zero-shot prompting, iterative refinement, and AlphaEvolve, so that CSRO matches baseline performance while producing a diverse set of explainable policies.
Link: https://arxiv.org/abs/2603.10098
Authors: Daniel Hennes, Zun Li, John Schultz, Marc Lanctot
Institutions: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as an Extended Abstract at the Twenty-Fifth International Conference on Autonomous Agents and Multiagent Systems (AAMAS)
Abstract:Recent advances in multi-agent reinforcement learning, particularly Policy-Space Response Oracles (PSRO), have enabled the computation of approximate game-theoretic equilibria in increasingly complex domains. However, these methods rely on deep reinforcement learning oracles that produce 'black-box' neural network policies, making them difficult to interpret, trust or debug. We introduce Code-Space Response Oracles (CSRO), a novel framework that addresses this challenge by replacing RL oracles with Large Language Models (LLMs). CSRO reframes the best response computation as a code generation task, prompting an LLM to generate policies directly as human-readable code. This approach not only yields inherently interpretable policies but also leverages the LLM's pretrained knowledge to discover complex, human-like strategies. We explore multiple ways to construct and enhance an LLM-based oracle: zero-shot prompting, iterative refinement and AlphaEvolve, a distributed LLM-based evolutionary system. We demonstrate that CSRO achieves performance competitive with baselines while producing a diverse set of explainable policies. Our work presents a new perspective on multi-agent learning, shifting the focus from optimizing opaque policy parameters to synthesizing interpretable algorithmic behavior.
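To make the "policy as readable code" framing concrete, here is a toy illustration, not from the paper: in a matching game against a biased opponent, a best response can be expressed as a few lines of code whose logic a human can audit directly, rather than as opaque network weights. The game, the bias, and all names are hypothetical.

```python
import random

def opponent(rng):
    """A fixed, biased opponent: plays 'heads' 70% of the time."""
    return "heads" if rng.random() < 0.7 else "tails"

def best_response_policy(observation=None):
    """CSRO-style output: a readable code policy that exploits the bias."""
    return "heads"   # matching the opponent's more likely move wins most rounds

def play(rounds=1000, seed=0):
    rng = random.Random(seed)
    wins = sum(best_response_policy() == opponent(rng) for _ in range(rounds))
    return wins / rounds
```

In the actual framework, such code policies are generated by an LLM oracle inside the PSRO loop and evaluated against the current opponent mixture.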
[AI-59] Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation
【Summary】: This paper addresses two limitations of current 3D molecular generation methods: asynchronous auto-regressive models suffer from a short horizon and a training-inference discrepancy, while synchronous diffusion models, despite their molecule-level horizon, fail to capture the hierarchical causal relationships inherent in molecular structure. The key is a novel Equivariant Asynchronous Diffusion (EAD) model whose asynchronous denoising schedule preserves the molecule-level horizon while better modeling the hierarchy within a molecule, together with a dynamic scheduling mechanism that adaptively determines denoising timesteps, improving the quality and validity of generated 3D molecules.
Link: https://arxiv.org/abs/2603.10093
Authors: Junyi An, Chao Qu, Yun-Fei Shi, Zhijian Zhou, Fenglei Cao, Yuan Qi
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Recent 3D molecular generation methods primarily use asynchronous auto-regressive or synchronous diffusion models. While auto-regressive models build molecules sequentially, they’re limited by a short horizon and a discrepancy between training and inference. Conversely, synchronous diffusion models denoise all atoms at once, offering a molecule-level horizon but failing to capture the causal relationships inherent in hierarchical molecular structures. We introduce Equivariant Asynchronous Diffusion (EAD) to overcome these limitations. EAD is a novel diffusion model that combines the strengths of both approaches: it uses an asynchronous denoising schedule to better capture molecular hierarchy while maintaining a molecule-level horizon. Since these relationships are often complex, we propose a dynamic scheduling mechanism to adaptively determine the denoising timestep. Experimental results show that EAD achieves state-of-the-art performance in 3D molecular generation.
[AI-60] Execution Is the New Attack Surface: Survivability-Aware Agentic Crypto Trading with OpenClaw-Style Local Executors
【Summary】: This paper addresses execution-induced loss in OpenClaw-style agent stacks: vulnerabilities in the execution layer allow untrusted prompts, compromised skills, or narrative manipulation to trigger real trades and irreversible side effects. The core solution, Survivability-Aware Execution (SAE), is middleware between the strategy engine and the exchange executor that defines an explicit execution contract (ExecutionRequest, ExecutionContext, ExecutionDecision) and enforces non-bypassable last-mile invariants: projection-based exposure budgets, cooldown and order-rate limits, slippage bounds, staged execution, and tool/venue allowlists. SAE quantifies the Delegation Gap (DG) via a logged Intended Policy Spec, making delegation risk testable; in replays of Binance perpetual-futures data it markedly improves survivability: maximum drawdown (MDD) falls by 93.1%, |CVaR_0.99| shrinks by about 97.5%, the DG loss proxy drops by about 97%, and the attack success rate falls from 1.00 to 0.728 with zero FalseBlock.
Link: https://arxiv.org/abs/2603.10092
Authors: Ailiya Borjigin, Igor Stadnyk, Ben Bilski, Serhii Hovorov, Sofiia Pidturkina
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 26 pages, 3 figures
Abstract:OpenClaw-style agent stacks turn language into privileged execution: LLM intents flow through tool interception, policy gates, and a local executor. In parallel, skill marketplaces such as this http URL make capability acquisition as easy as installing skills and CLIs, creating a growing capability supply chain. Together, these trends shift the dominant safety failure mode from “wrong answers” to execution-induced loss, where untrusted prompts, compromised skills, or narrative manipulation can trigger real trades and irreversible side effects. We propose Survivability-Aware Execution (SAE), an execution-layer survivability standard for OpenClaw-style systems and skill-enabled agents. SAE sits as middleware between a strategy engine (LLM or non-LLM) and the exchange executor. It defines an explicit execution contract (ExecutionRequest, ExecutionContext, ExecutionDecision) and enforces non-bypassable last-mile invariants: projection-based exposure budgets, cooldown and order-rate limits, slippage bounds, staged execution, and tool/venue allowlists. To make delegated execution testable under supply-chain risk, we operationalize the Delegation Gap (DG) via a logged Intended Policy Spec that enables deterministic out-of-scope labeling and reproducible DG metrics. On an offline replay using official Binance USD-M BTCUSDT/ETHUSDT perpetual data (15m; 2025-09-01–2025-12-01, incl. funding), SAE improves survivability: MDD drops from 0.4643 to 0.0319 (Full; 93.1%), |CVaR_0.99| shrinks from 4.025e-3 to ~1.02e-4 (~97.5%), and DG loss proxy falls from 0.647 to 0.019 (~97.0%). AttackSuccess decreases from 1.00 to 0.728 with zero FalseBlock in this run. Block bootstrap, paired Wilcoxon, and two-proportion tests confirm the shifts. SAE reframes agentic trading safety for the OpenClaw+skills era: treat upstream intent and skills as untrusted, and enforce survivability where actions become side effects.
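The execution contract named in the abstract (ExecutionRequest, ExecutionContext, ExecutionDecision) can be sketched as middleware that checks last-mile invariants before anything reaches the exchange. Only the three contract type names come from the paper; the field names, limits, and check order below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ExecutionRequest:
    symbol: str
    notional: float          # requested trade size

@dataclass
class ExecutionContext:
    current_exposure: float  # open exposure before this order
    orders_this_window: int  # orders already sent in the rate window

@dataclass
class ExecutionDecision:
    allow: bool
    reason: str

def decide(req, ctx, max_exposure=100.0, max_orders=5):
    """Enforce non-bypassable invariants between strategy engine and executor."""
    if ctx.orders_this_window >= max_orders:
        return ExecutionDecision(False, "order-rate limit")
    if ctx.current_exposure + req.notional > max_exposure:
        return ExecutionDecision(False, "exposure budget exceeded")
    return ExecutionDecision(True, "ok")
```

The point of the design is that the strategy engine (LLM or otherwise) can only emit requests; the decision logic sits outside its reach, so a compromised skill cannot talk its way past the budget.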
[AI-61] Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference
【Summary】: This paper addresses the new security vulnerabilities introduced by the thinking mode of large language models (LLMs): the enhanced step-by-step reasoning can produce more detailed harmful content under jailbreak attacks. The key is a multi-stream perturbation attack that interweaves multiple task streams within a single prompt to inject interference that disrupts the model's reasoning chain, via three strategies: multi-stream interleaving, inversion perturbation, and shape transformation, which respectively use concurrent task interleaving, character reversal, and format constraints to derail the thinking process. Experiments on several benchmark datasets show markedly higher attack success rates, with thinking-collapse rates up to 17% and response-repetition rates up to 60%, confirming that the attack both bypasses safety mechanisms and damages the model's internal reasoning.
Link: https://arxiv.org/abs/2603.10091
Authors: Fan Yang
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:The widespread adoption of thinking mode in large language models (LLMs) has significantly enhanced complex task processing capabilities while introducing new security risks. When subjected to jailbreak attacks, the step-by-step reasoning process may cause models to generate more detailed harmful content. We observe that thinking mode exhibits unique vulnerabilities when processing interleaved multiple tasks. Based on this observation, we propose multi-stream perturbation attack, which generates superimposed interference by interweaving multiple task streams within a single prompt. We design three perturbation strategies: multi-stream interleaving, inversion perturbation, and shape transformation, which disrupt the thinking process through concurrent task interleaving, character reversal, and format constraints respectively. On JailbreakBench, AdvBench, and HarmBench datasets, our method achieves attack success rates exceeding most methods across mainstream models including Qwen3 series, DeepSeek, Qwen3-Max, and Gemini 2.5 Flash. Experiments show thinking collapse rates and response repetition rates reach up to 17% and 60% respectively, indicating multi-stream perturbation not only bypasses safety mechanisms but also causes thinking process collapse or repetitive outputs.
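The prompt-construction mechanics of two of the three strategies (interleaving and inversion) reduce to simple string operations, sketched here on harmless placeholder tasks. No attack content is reproduced; this only shows how concurrent streams are woven into one prompt, and all names are illustrative.

```python
def interleave(streams):
    """Alternate sentences from several task streams into one prompt."""
    chunks, i = [], 0
    while any(i < len(s) for s in streams):
        for s in streams:
            if i < len(s):
                chunks.append(s[i])
        i += 1
    return " ".join(chunks)

def invert(text):
    """Inversion perturbation: character-level reversal of one stream."""
    return text[::-1]
```

The third strategy, shape transformation, would additionally impose a rigid output-format constraint on the combined prompt.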
[AI-62] ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping ICLR2026
【Summary】: This paper addresses the heavy inference cost of diffusion large language models (dLLMs), which process the full input context at every iteration. The key insight is that intermediate representations (keys, values, and hidden states) change only subtly across successive generation iterations. Building on this, the training-free acceleration framework ES-dLLM estimates token importance from intermediate tensor variation and the confidence scores of previous iterations, and skips low-importance tokens in early layers, greatly reducing redundant computation. Experiments show up to 16.8x speedup on an NVIDIA H200 GPU with no loss of generation quality.
Link: https://arxiv.org/abs/2603.10088
Authors: Zijian Zhu, Fei Ren, Zhanhong Tan, Kaisheng Ma
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026
Abstract:Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and the potential for parallel generation. Despite the advantages, dLLM inference remains computationally expensive as the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLM that reduces computation by skipping tokens in early layers based on the estimated importance. Token importance is computed with intermediate tensor variation and confidence scores of previous iterations. Experiments on LLaDA-8B and Dream-7B demonstrate that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6x to 16.8x speedup over the vanilla implementation and up to 1.85x over the state-of-the-art caching method, while preserving generation quality.
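The token-skipping rule described above can be sketched as a scoring-and-masking step: score each token by how much its hidden state changed since the previous iteration, discounted by the confidence already assigned to it, and keep only the top fraction in early layers. The exact combination of variation and confidence is an assumption here, not the paper's formula.

```python
import numpy as np

def skip_mask(prev_hidden, cur_hidden, confidence, keep_ratio=0.5):
    """Return a boolean mask of tokens to keep in early layers."""
    variation = np.linalg.norm(cur_hidden - prev_hidden, axis=-1)
    importance = variation * (1.0 - confidence)   # static and confident: skip
    k = max(1, int(keep_ratio * len(importance)))
    keep = np.zeros(len(importance), dtype=bool)
    keep[np.argsort(importance)[-k:]] = True
    return keep
```

Tokens masked out would simply reuse their cached representations for that iteration's early layers.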
[AI-63] Digging Deeper: Learning Multi-Level Concept Hierarchies ICLR2026
【Summary】: This paper addresses the reliance of existing concept-based models on extensive human annotation and their treatment of concepts as flat and independent, and in particular the restriction of Hierarchical Concept Embedding Models (HiCEMs) and Concept Splitting to shallow hierarchies. The core solution is Multi-Level Concept Splitting (MLCS), which automatically discovers multi-level concept hierarchies from top-level supervision alone, paired with Deep Hierarchical Concept Embedding Models (Deep-HiCEMs), which explicitly represent the discovered hierarchies and support interventions at multiple levels of abstraction. Experiments show that MLCS uncovers human-interpretable sub-concepts never explicitly defined during training, while Deep-HiCEMs maintain high accuracy and allow test-time concept interventions that improve task performance.
Link: https://arxiv.org/abs/2603.10084
Authors: Oscar Hill, Mateo Espinosa Zarlenga, Mateja Jamnik
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the ICLR 2026 Workshop on Principled Design for Trustworthy AI
Abstract:Although concept-based models promise interpretability by explaining predictions with human-understandable concepts, they typically rely on exhaustive annotations and treat concepts as flat and independent. To circumvent this, recent work has introduced Hierarchical Concept Embedding Models (HiCEMs) to explicitly model concept relationships, and Concept Splitting to discover sub-concepts using only coarse annotations. However, both HiCEMs and Concept Splitting are restricted to shallow hierarchies. We overcome this limitation with Multi-Level Concept Splitting (MLCS), which discovers multi-level concept hierarchies from only top-level supervision, and Deep-HiCEMs, an architecture that represents these discovered hierarchies and enables interventions at multiple levels of abstraction. Experiments across multiple datasets show that MLCS discovers human-interpretable concepts absent during training and that Deep-HiCEMs maintain high accuracy while supporting test-time concept interventions that can improve task performance.
[AI-64] Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models
【Summary】: This paper asks whether the safety measures of open-weight large language models (LLMs), such as alignment via reinforcement learning with human feedback (RLHF), are sufficient to prevent harmful content generation. The key is Amnesia, a lightweight activation-space adversarial attack that manipulates internal transformer states to bypass existing safety mechanisms, inducing LLMs to produce antisocial content such as phishing emails and malicious code without any fine-tuning or additional training. The findings expose the fragility of current safeguards and motivate research into more robust defenses.
Link: https://arxiv.org/abs/2603.10080
Authors: Ali Raza, Gurang Gupta, Nikolay Matyunin, Jibesh Patra
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Warning: This article includes red-teaming experiments, which contain examples of compromised LLM responses that may be offensive or upsetting. Large Language Models (LLMs) have the potential to create harmful content, such as generating sophisticated phishing emails and assisting in writing code of harmful computer viruses. Thus, it is crucial to ensure their safe and responsible response generation. To reduce the risk of generating harmful or irresponsible content, researchers have developed techniques such as reinforcement learning with human feedback to align LLM’s outputs with human values and preferences. However, it is still undetermined whether such measures are sufficient to prevent LLMs from generating interesting responses. In this study, we propose Amnesia, a lightweight activation-space adversarial attack that manipulates internal transformer states to bypass existing safety mechanisms in open-weight LLMs. Through experimental analysis on state-of-the-art, open-weight LLMs, we demonstrate that our attack effectively circumvents existing safeguards, enabling the generation of harmful content without the need for any fine-tuning or additional training. Our experiments on benchmark datasets show that the proposed attack can induce various antisocial behaviors in LLMs. These findings highlight the urgent need for more robust security measures in open-weight LLMs and underscore the importance of continued research to prevent their potential misuse.
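The general mechanism the attack builds on, activation steering, is adding a direction vector to the hidden states of a chosen transformer layer at inference time. The sketch below illustrates that mechanism on a toy stack of linear layers; the "model", the steering direction, and the strength are placeholders, not the paper's attack.

```python
import numpy as np

def forward(x, layers, steer_layer=None, steer_vec=None, alpha=1.0):
    """Run hidden state x through layers, optionally steering one layer."""
    h = x
    for i, w in enumerate(layers):
        h = np.tanh(w @ h)
        if i == steer_layer and steer_vec is not None:
            h = h + alpha * steer_vec   # activation-space intervention
    return h
```

Because the intervention happens at inference time on internal states, no weights change, which is why such attacks require no fine-tuning and are hard to rule out with training-time alignment alone.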
[AI-65] TASER: Task-Aware Spectral Energy Refine for Backdoor Suppression in UAV Swarms Decentralized Federated Learning
【Summary】: This paper addresses increasingly stealthy and sophisticated backdoor attacks in UAV-based decentralized federated learning (DFL), where existing outlier-detection defenses fail under the lack of global coordination and limited resources. The key is Task-Aware Spectral Energy Refine (TASER), a decentralized defense framework that identifies and suppresses backdoor tasks via the spectral concentration of gradients rather than complex outlier detection: exploiting the observation that the more effort attackers invest in mimicking benign behavior, the more distinct the spectral concentration becomes, TASER preserves the frequency coefficients relevant to the main task and discards the rest, structurally disrupting the backdoor task. It defends effectively against stealthy backdoors, reducing the attack success rate below 20% while keeping accuracy loss under 5%.
Link: https://arxiv.org/abs/2603.10075
Authors: Sizhe Huang, Shujie Yang
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:As backdoor attacks in UAV-based decentralized federated learning (DFL) grow increasingly stealthy and sophisticated, existing defenses have likewise escalated in complexity. Yet these defenses, which rely heavily on outlier detection, remain vulnerable to carefully crafted backdoors. In UAV-DFL, the lack of global coordination and limited resources further render outlier-based defenses impractical. Against this backdrop, gradient spectral analysis offers a promising alternative. While prior work primarily leverages low-frequency coefficients for pairwise comparisons, it neglects to analyze the intrinsic spectral characteristics of backdoor gradients. Through empirical analysis of existing stealthy attacks, we reveal a key insight: the more effort attackers invest in mimicking benign behaviors, the more distinct the spectral concentration becomes. Motivated by this, we propose Task-Aware Spectral Energy Refine (TASER) – a decentralized defense framework. To our knowledge, this is the first efficient backdoor defense that utilizes spectral concentration instead of complex outlier detection, enabling mitigation of stealthy attacks by structurally disrupting the backdoor task. To suppress the backdoor task, TASER preserves main-task-relevant frequency coefficients and discards others. We provide theoretical guarantees and demonstrate through experiments that TASER remains effective against stealthy backdoor attacks that bypass outlier-based defenses, achieving attack success rate below 20% and accuracy loss under 5%.
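The refinement step described above, preserving main-task frequency coefficients and discarding the rest, can be sketched as a filter in the gradient's frequency domain. The choice of transform (numpy's real FFT) and the "keep the top fraction by magnitude" criterion are illustrative assumptions, not TASER's exact rule.

```python
import numpy as np

def spectral_refine(grad, keep_frac=0.25):
    """Suppress all but the highest-energy frequency coefficients of a gradient."""
    spec = np.fft.rfft(grad)
    k = max(1, int(keep_frac * len(spec)))
    keep = np.argsort(np.abs(spec))[-k:]       # indices carrying most energy
    mask = np.zeros(len(spec), dtype=bool)
    mask[keep] = True
    return np.fft.irfft(np.where(mask, spec, 0), n=len(grad))
```

A gradient whose energy is concentrated in a few coefficients (the benign-mimicking case the paper highlights) passes through almost unchanged, while energy spread across discarded coefficients is removed.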
[AI-66] Marginals Before Conditionals
【Summary】: This paper investigates the mechanics of conditional learning in neural networks: whether and how, under K-fold ambiguity, a model transitions from a marginal-distribution solution (a high-entropy plateau) to the full conditional distribution (the zero-entropy solution). The key is a minimal task: a surjective map with K-fold ambiguity, resolved by a selector token z so that H(A | B, z) = 0. Experiments show the model first settles on a plateau of height exactly log K (set by the ambiguity) whose duration is set by dataset size D, then undergoes a sharp collective transition to the full conditional. Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition and smaller batches delay escape, consistent with an entropic force holding the marginal solution in place; meanwhile a selector-routing head assembles during the plateau and leads the loss transition by about 50% of the waiting time, a dynamic measurement of the Type 2 directional asymmetry of Papadopoulos et al. [2024].
Link: https://arxiv.org/abs/2603.10074
Authors: Mihir Sahasrabudhe
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures
Abstract:We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z, so H(A | B) = log K while H(A | B, z) = 0. The model learns the marginal P(A | B) first, producing a plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition. The plateau has a clean decomposition: height = log K (set by ambiguity), duration = f(D) (set by dataset size D, not K). Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6x across a 7x η range at fixed throughput), and batch-size reduction delays escape, consistent with an entropic force opposing departure from the low-gradient marginal. Internally, a selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time. This is the Type 2 directional asymmetry of Papadopoulos et al. [2024], measured dynamically: we track the excess risk from log K to zero and characterize what stabilizes it, what triggers its collapse, and how long it takes.
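The entropy bookkeeping of the minimal task can be verified directly: with K answers per input b, disambiguated by a selector z, the answer is uniform over K values given B alone (entropy log K) and deterministic given (B, z) (entropy 0). The concrete map below is one valid instance, not necessarily the paper's.

```python
import math
from collections import Counter

K = 4

def answer(b, z):
    """Surjective map with K-fold ambiguity: z selects one of K answers for b."""
    return (b, z)

def entropy(counts):
    """Shannon entropy (nats) of an empirical distribution of counts."""
    n = sum(counts.values())
    return -sum(c / n * math.log(c / n) for c in counts.values())

# For a fixed b, A ranges uniformly over K values as z varies: H(A|B) = log K.
counts_given_b = Counter(answer(0, z) for z in range(K))
h_a_given_b = entropy(counts_given_b)
```

Given (b, z) the answer is a single point, so H(A | B, z) = 0, which is the zero-entropy conditional the model eventually reaches after the plateau.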
[AI-67] Why LLMs Fail: A Failure Analysis and Partial Success Measurement for Automated Security Patch Generation
【Summary】: This paper addresses the poorly characterized effectiveness of large language models (LLMs) at repairing security vulnerabilities in automated program repair (APR). Analyzing 319 LLM-generated Java security patches, the study finds that only 24.8% simultaneously satisfy compilation, security (PoV tests), and functionality (test suites), while 51.4% fail both security and functionality; the dominant failure mode is semantic misunderstanding, i.e., syntactically valid code applying the wrong repair strategy. The key contribution is a tri-axis evaluation framework (compilation, security, functionality) together with the Security Repair Score (SRS), which quantifies the marked gap between functionality preservation (mean 0.832) and security repair ability (mean 0.251), showing that LLM-generated security patches require rigorous validation before deployment.
Link: https://arxiv.org/abs/2603.10072
Authors: Amir Al-Maamari
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) show promise for Automated Program Repair (APR), yet their effectiveness on security vulnerabilities remains poorly characterized. This study analyzes 319 LLM-generated security patches across 64 Java vulnerabilities from the Vul4J benchmark. Using tri-axis evaluation (compilation, security via PoV tests, functionality via test suites), the analysis reveals that only 24.8% of patches achieve full correctness, while 51.4% fail both security and functionality. The dominant failure mode is semantic misunderstanding: LLMs produce syntactically valid code but apply incorrect repair strategies. The proposed Security Repair Score (SRS) quantifies this gap, showing LLMs preserve functionality (mean 0.832) but struggle with security (mean 0.251). Vulnerability type strongly predicts difficulty, with fix rates ranging from 0% (input validation) to 45% (infinite loop). These findings demonstrate that LLM security patches require rigorous validation before deployment.
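The tri-axis classification above is straightforward to express as code. Note the paper's exact SRS formula is not reproduced in the abstract, so the aggregate below is an illustrative stand-in (an equally weighted mean of the two axis scores), clearly an assumption.

```python
def evaluate_patch(compiles, passes_pov, passes_tests):
    """Classify one patch on the compilation/security/functionality axes."""
    if not compiles:
        return "non-compiling"
    if passes_pov and passes_tests:
        return "fully correct"
    if not passes_pov and not passes_tests:
        return "fails both"
    return "partial"

def srs(security_score, functionality_score, w_sec=0.5):
    """Illustrative weighted aggregate of security and functionality scores."""
    return w_sec * security_score + (1 - w_sec) * functionality_score
```

Plugging in the reported means (security 0.251, functionality 0.832) makes the gap between the two axes visible in a single number.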
[AI-68] HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
【Summary】: This paper addresses a bottleneck of Muon in LLM training: its orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes noise-dominated directions. The core solution, HTMuon, preserves Muon's ability to capture parameter interdependencies while, guided by Heavy-Tailed Self-Regularization (HT-SR) theory, producing heavier-tailed updates and inducing heavier-tailed weight spectra, improving training stability and performance. Experiments on LLaMA pretraining and image classification show consistent gains over state-of-the-art baselines, and HTMuon can also serve as a plug-in on top of existing Muon variants. Theoretically, HTMuon corresponds to steepest descent under a Schatten-q norm constraint, with convergence analysis in smooth non-convex settings.
Link: https://arxiv.org/abs/2603.10067
Authors: Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon’s orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon’s ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to 0.98 compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-q norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at this https URL.
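The contrast between Muon's orthogonalized update and a heavier-tailed spectral correction can be made concrete via the singular value decomposition: Muon maps every singular value of the gradient to 1 (the update U V^T), whereas a Schatten-q-style update preserves a power of the spectrum. The exponent and normalization below are illustrative assumptions, not HTMuon's exact rule.

```python
import numpy as np

def muon_update(grad):
    """Muon's orthogonalization: flatten the singular spectrum to all ones."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def heavy_tailed_update(grad, p=0.5):
    """Keep a power of the (normalized) spectrum: heavier-tailed update."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s / s.max()
    return u @ np.diag(s ** p) @ vt
```

The second rule still damps the spread between directions (as Muon does) but no longer erases it, so dominant, signal-carrying directions keep proportionally more weight than noise-dominated ones.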
[AI-69] The Epistemic Support-Point Filter: Jaynesian Maximum Entropy Meets Popperian Falsification
【Summary】: This paper asks how to design an optimal evidence-only filter that minimizes worst-case epistemic ignorance, yielding robust state estimation; Bayesian filters instead minimize expected uncertainty, which can leave them exposed to a "race-to-bottom" bias of overconfidence. The key is the Epistemic Support-Point Filter (ESPF), which combines two complementary mechanisms: in propagation it follows Jaynesian maximum entropy, spreading the support as widely as the dynamics allow while remaining consistent with known constraints; in the measurement update it follows Popperian falsification, eliminating incompatible hypotheses using evidence alone, without injecting prior possibility. Under a possibilistic minimax-entropy criterion the ESPF is proved to be the unique optimal evidence-only filter, with minimum-q selection as the optimal rule: it minimizes log det(MVEE), the worst-case possibilistic entropy. Numerical validation on a demanding orbital-tracking run confirms stable structural robustness.
Link: https://arxiv.org/abs/2603.10065
Authors: Moriba Kemessia Jah
Institutions: Unknown
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Methodology (stat.ME)
Comments:
Abstract:The Epistemic Support-Point Filter (ESPF) was designed around a single epistemological commitment: be quick to embrace ignorance and slow to assert certainty. This paper proves that this commitment has a precise mathematical form and that the ESPF is the unique optimal filter implementing it within the class of epistemically admissible evidence-only filters. The ESPF synthesizes two complementary principles acting at different phases of the recursion. In propagation, it enacts Jaynesian maximum entropy: the support spreads as widely as the dynamics allow, assuming maximal ignorance consistent with known constraints. In the measurement update, it enacts Popperian falsification: hypotheses are eliminated by evidence alone. Any rule incorporating prior possibility is strictly suboptimal and risks race-to-bottom bias. The optimality criterion is possibilistic minimax entropy: among all evidence-only selection rules, minimum-q selection minimizes log det(MVEE), the worst-case possibilistic entropy. Three lemmas establish the result: the Possibilistic Entropy Lemma identifies the ignorance functional; the Possibilistic Cramér-Rao Lemma bounds entropy reduction per measurement; the Evidence-Optimality Lemma proves minimum-q selection is the unique minimizer. The ESPF differs from Bayesian filters by minimizing worst-case epistemic ignorance rather than expected uncertainty. The Kalman filter is recovered in the Gaussian limit. Numerical validation over a 2-day 877-step Smolyak Level-3 orbital tracking run confirms the regime structure under both nominal and stress conditions.
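The falsification-only update can be sketched on a cloud of support points: a measurement eliminates every point it is incompatible with, and ignorance is tracked by a log-determinant volume measure over the survivors. The covariance log-determinant below is a simple stand-in for log det(MVEE), and all names and tolerances are illustrative.

```python
import numpy as np

def falsify(points, measurement, tol):
    """Popperian update: keep only support points compatible with the evidence."""
    return points[np.abs(points[:, 0] - measurement) <= tol]

def log_ignorance(points):
    """Log-det of the survivors' covariance: a proxy for log det(MVEE)."""
    cov = np.cov(points, rowvar=False)
    return float(np.linalg.slogdet(cov)[1])
```

Note the update never reweights survivors by prior plausibility: every point consistent with the evidence remains equally possible, which is exactly the evidence-only property the optimality result is about.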
[AI-70] SBOMs into Agentic AIBOMs: Schema Extensions Agentic Orchestration and Reproducibility Evaluation
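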
【速读】:该论文旨在解决软件供应链安全中传统Software Bills of Materials (SBOM)无法捕捉运行时行为、环境漂移及漏洞可利用性上下文的问题,从而限制了在动态执行条件下对软件组件的可重现性和漏洞评估能力。解决方案的关键在于提出了一种基于多智能体架构的生成式AI Bills of Materials (AIBOM),通过三个核心代理——基准环境重建代理(MCP)、运行时依赖与漂移监控代理(A2A)以及策略感知的漏洞与VEX推理代理(AGNTCY)——实现主动溯源证据生成。这些代理结合运行时执行证据、依赖使用情况和环境缓解措施,并采用ISO/IEC 20153:2025 CSAF v2.0语义结构化表达漏洞可利用性,而非直接执行操作,从而显著提升漏洞解释的稳定性、依赖捕获的准确性与系统可复现性,同时保持与CycloneDX和SPDX标准的兼容性。
链接: https://arxiv.org/abs/2603.10057
作者: Petar Radanliev,Carsten Maple,Omar Santos,Kayvan Atefi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Petar Radanliev, Carsten Maple, Omar Santos, and Kayvan Atefi. 2026. SBOMs into Agentic AIBOMs: Schema Extensions, Agentic Orchestration, and Reproducibility Evaluation. Digital Threats Just Accepted (March 2026)
Abstract:Software supply-chain security requires provenance mechanisms that support reproducibility and vulnerability assessment under dynamic execution conditions. Conventional Software Bills of Materials (SBOMs) provide static dependency inventories but cannot capture runtime behaviour, environment drift, or exploitability context. This paper introduces agentic Artificial Intelligence Bills of Materials (AIBOMs), extending SBOMs into active provenance artefacts through autonomous, policy-constrained reasoning. We present an agentic AIBOM framework based on a multi-agent architecture comprising (i) a baseline environment reconstruction agent (MCP), (ii) a runtime dependency and drift-monitoring agent (A2A), and (iii) a policy-aware vulnerability and VEX reasoning agent (AGNTCY). These agents generate contextual exploitability assertions by combining runtime execution evidence, dependency usage, and environmental mitigations with ISO/IEC 20153:2025 Common Security Advisory Framework (CSAF) v2.0 semantics. Exploitability is expressed via structured VEX assertions rather than enforcement actions. The framework introduces minimal, standards-aligned schema extensions to CycloneDX and SPDX, capturing execution context, dependency evolution, and agent decision provenance while preserving interoperability. Evaluation across heterogeneous analytical workloads demonstrates improved runtime dependency capture, reproducibility fidelity, and stability of vulnerability interpretation compared with established provenance systems, with low computational overhead. Ablation studies confirm that each agent contributes distinct capabilities unavailable through deterministic automation.
[AI-71] Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification
【速读】:该论文旨在解决自监督掩码建模在加密流量分类中对标注数据依赖过高、预训练效果不佳的问题,其核心症结在于现有方法将流量扁平化为字节序列时破坏了协议定义的语义结构,导致模型无法有效学习关键特征。解决方案的关键在于提出一种协议原生范式(protocol-native paradigm),将协议字段语义作为架构先验,重构任务以匹配数据的内在表格模态;具体实现为FlowSem-MAE,通过可预测性引导的过滤机制聚焦于可学习的流量语义单元(Flow Semantic Units, FSUs)、FSU特异性嵌入保留字段边界,并引入双轴注意力机制捕捉包内和时间维度模式,从而显著提升性能,在仅使用一半标注数据的情况下超越多数基于全量数据训练的现有方法。
链接: https://arxiv.org/abs/2603.10051
作者: Sizhe Huang,Shujie Yang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Self-supervised masked modeling shows promise for encrypted traffic classification by masking and reconstructing raw bytes. Yet recent work reveals these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen encoder evaluation, accuracy drops from greater than 0.9 to less than 0.47. We argue the root cause is inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability, random fields like this http URL are unlearnable yet treated as reconstruction targets; 2) embedding confusion, semantically distinct fields collapse into a unified embedding space; 3) metadata loss, capture-time metadata essential for temporal analysis is discarded. To address this, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data’s intrinsic tabular modality rather than incrementally adapting sequence-based architectures. Instantiating this paradigm, we introduce FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs). It features predictability-guided filtering that focuses on learnable FSUs, FSU-specific embeddings to preserve field boundaries, and dual-axis attention to capture intra-packet and temporal patterns. FlowSem-MAE significantly outperforms state-of-the-art across datasets. With only half labeled data, it outperforms most existing methods trained on full data.
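摘要中的"可预测性引导过滤"(将随机字段排除出重建目标)可理解为按字段经验熵筛选可学习字段;下面给出一个极简示意(阈值与字段名均为假设,非论文实现):

```python
import math
from collections import Counter

def normalized_entropy(values):
    """字段取值的经验 Shannon 熵,归一化到 [0, 1]。"""
    counts = Counter(values)
    n = len(values)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    max_h = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h

def learnable_fields(flows, threshold=0.9):
    """仅保留取值分布足够可预测(归一化熵低)的字段作为重建目标。
    `flows` 是 字段名 -> 取值 的字典列表。"""
    fields = flows[0].keys()
    return [f for f in fields
            if normalized_entropy([flow[f] for flow in flows]) < threshold]

flows = [
    {"proto": "TLS", "dst_port": 443, "rand_nonce": i * 7919 % 104729}
    for i in range(200)
]
print(learnable_fields(flows))  # ['proto', 'dst_port'],随机 nonce 被过滤
```

其中 `rand_nonce` 模拟摘要所述"随机且不可学习"的字段:其取值近乎均匀分布,归一化熵接近 1,被从重建目标中剔除。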
[AI-72] InFusionLayer: a CFA-based ensemble tool to generate new classifiers for learning and modeling ICTAI
【速读】:该论文旨在解决当前缺乏通用Python工具来实现组合融合分析(Combinatorial Fusion Analysis, CFA)方法的问题,CFA通过秩-评分特征(rank-score characteristic, RSC)函数和认知多样性(cognitive diversity, CD)来融合多个评分系统,从而提升预测性能。解决方案的关键在于提出并实现了一个名为\textttInFusionLayer的机器学习架构,该架构在系统融合层面受CFA启发,利用一组适度的基模型优化无监督与监督的多分类问题,并兼容PyTorch、TensorFlow和Scikit-learn工作流,验证了其在计算机视觉数据集上的有效性,凸显了RSC函数和CD特征在实际应用中的优势。
链接: https://arxiv.org/abs/2603.10049
作者: Eric Roginek,Jingyan Xu,D. Frank. Hsu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 3 tables; Accepted to 2024 IEEE International Conference on Tools with Artificial Intelligence (IEEE ICTAI)
Abstract:Ensemble learning is a well established body of methods for machine learning to enhance predictive performance by combining multiple algorithms/models. Combinatorial Fusion Analysis (CFA) has provided method and practice for combining multiple scoring systems, using rank-score characteristic (RSC) function and cognitive diversity (CD), including ensemble method and model fusion. However, there is no general-purpose Python tool available that incorporate these techniques. In this paper we introduce \textttInFusionLayer, a machine learning architecture inspired by CFA at the system fusion level that uses a moderate set of base models to optimize unsupervised and supervised learning multiclassification problems. We demonstrate \textttInFusionLayer’s ease of use for PyTorch, TensorFlow, and Scikit-learn workflows by validating its performance on various computer vision datasets. Our results highlight the practical advantages of incorporating distinctive features of RSC function and CD, paving the way for more sophisticated ensemble learning applications in machine learning. We open-sourced our code to encourage continuing development and community accessibility to leverage CFA on github: this https URL
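CFA 中的秩-评分特征(RSC)函数与认知多样性(CD)可按下述方式示意计算(此处 CD 取两条 RSC 曲线间的 RMS 距离,具体定义以论文为准):

```python
import numpy as np

def rsc(scores):
    """秩-评分特征 (RSC) 函数:分数归一化到 [0, 1] 后按名次降序排列,
    下标 i 处的值即第 i+1 名的分数。"""
    s = np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())
    return np.sort(s)[::-1]

def cognitive_diversity(scores_a, scores_b):
    """两个评分系统 RSC 函数之间的距离(此处用 RMS 差)。"""
    fa, fb = rsc(scores_a), rsc(scores_b)
    return float(np.sqrt(np.mean((fa - fb) ** 2)))

a = [0.9, 0.7, 0.5, 0.3, 0.1]    # 分数均匀铺开的系统
b = [0.9, 0.88, 0.86, 0.2, 0.1]  # 头部集中的系统
print(cognitive_diversity(a, a))  # 0.0:同一系统无多样性
print(cognitive_diversity(a, b))  # > 0:评分行为存在认知多样性
```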
[AI-73] Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation ICLR2026
【速读】:该论文旨在解决Sharpness-Aware Minimization (SAM)在实际应用中缺乏直观解释的问题,即为何使用单步梯度上升后的梯度来更新当前参数能够提升模型泛化性能。其关键在于提出了一种新颖且直观的解释:在局部邻域内,单步上升点处的梯度比当前参数处的局部梯度更能逼近从当前参数指向邻域内最大损失方向的向量,从而更有效地逃离该局部最大值。为克服SAM中该近似常不准确及多步上升时近似质量下降的问题,作者进一步提出eXplicit Sharpness-Aware Minimization (XSAM),其核心创新在于显式估计最大方向并设计一个能有效利用多步上升梯度信息的搜索空间,实现统一的单步与多步优化框架,且计算开销可忽略。
链接: https://arxiv.org/abs/2603.10048
作者: Jianlong Chen,Zhiming Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ICLR 2026
Abstract:Sharpness-Aware Minimization (SAM) enhances generalization by minimizing the maximum training loss within a predefined neighborhood around the parameters. However, its practical implementation approximates this as gradient ascent(s) followed by applying the gradient at the ascent point to update the current parameters. This practice can be justified as approximately optimizing the objective by neglecting the (full) derivative of the ascent point with respect to the current parameters. Nevertheless, a direct and intuitive understanding of why using the gradient at the ascent point to update the current parameters works superiorly is still lacking. Our work bridges this gap by proposing a novel and intuitive interpretation. We show that the gradient at the single-step ascent point, when applied to the current parameters, provides a better approximation of the direction from the current parameters toward the maximum within the local neighborhood than the local gradient. This improved approximation thereby enables a more direct escape from the maximum within the local neighborhood. Nevertheless, our analysis further reveals two issues. First, the approximation by the gradient at the single-step ascent point is often inaccurate. Second, the approximation quality may degrade as the number of ascent steps increases. To address these limitations, we propose in this paper eXplicit Sharpness-Aware Minimization (XSAM). It tackles the first by explicitly estimating the direction of the maximum during training, while addressing the second by crafting a search space that effectively leverages the gradient information at the multi-step ascent point. XSAM features a unified formulation that applies to both single-step and multi-step settings and only incurs negligible computational overhead. Extensive experiments demonstrate the consistent superiority of XSAM against existing counterparts.
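摘要所述"单步梯度上升后、将上升点处梯度用于更新当前参数"的 SAM 基本步骤,可用玩具二次损失作如下示意(损失函数与超参数均为假设,非论文的 XSAM 实现):

```python
import numpy as np

def loss_grad(w):
    """玩具损失 L(w) = 0.5 * w^T A w 及其梯度 A w。"""
    A = np.diag([10.0, 1.0])  # 一个尖锐方向、一个平坦方向
    return 0.5 * w @ A @ w, A @ w

def sam_step(w, rho=0.05, lr=0.1):
    """一次 SAM 更新:先单步上升到邻域边界 w + eps,
    再用上升点处的梯度更新当前参数 w。"""
    _, g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # 上升步
    _, g_adv = loss_grad(w + eps)                # 上升点处的梯度
    return w - lr * g_adv                        # 应用在当前参数上

w = np.array([1.0, 1.0])
for _ in range(50):
    w = sam_step(w)
print(loss_grad(w)[0])  # 收敛到原点附近,损失远小于初始值 5.5
```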
[AI-74] Gated Adaptation for Continual Learning in Human Activity Recognition
【速读】:该论文旨在解决可穿戴传感器在物联网(IoT)生态系统中进行持续学习时面临的灾难性遗忘问题,尤其是在领域增量的人体活动识别(HAR)场景下:模型需在不传输敏感数据至云端的前提下,适应新受试者的运动模式,同时保持对先前受试者活动的识别准确率。解决方案的关键在于提出一种参数高效的持续学习框架,其核心思想是通过通道级门控调制(channel-wise gated modulation)冻结的预训练特征表示,而非生成新特征;具体而言,将学习到的变换限制为对现有特征的对角缩放(diagonal scaling),从而在保持预训练表征几何结构的同时实现个体特异性调制,理论上证明该门控机制实现了有界对角算子,有效抑制了表征漂移。实验证明,冻结主干网络显著减少遗忘,轻量级门控模块恢复了适应能力,在PAMAP2数据集上将遗忘率从39.7%降至16.2%,最终准确率从56.7%提升至77.7%,且仅训练不到2%的参数,优于无需回放缓冲区或任务特定正则化的标准持续学习基线。
链接: https://arxiv.org/abs/2603.10046
作者: Reza Rahimi Azghan,Gautham Krishna Gudur,Mohit Malu,Edison Thomaz,Giulia Pedrielli,Pavan Turaga,Hassan Ghasemzadeh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Wearable sensors in Internet of Things (IoT) ecosystems increasingly support applications such as remote health monitoring, elderly care, and smart home automation, all of which rely on robust human activity recognition (HAR). Continual learning systems must balance plasticity (learning new tasks) with stability (retaining prior knowledge), yet AI models often exhibit catastrophic forgetting, where learning new tasks degrades performance on earlier ones. This challenge is especially acute in domain-incremental HAR, where on-device models must adapt to new subjects with distinct movement patterns while maintaining accuracy on prior subjects without transmitting sensitive data to the cloud. We propose a parameter-efficient continual learning framework based on channel-wise gated modulation of frozen pretrained representations. Our key insight is that adaptation should operate through feature selection rather than feature generation: by restricting learned transformations to diagonal scaling of existing features, we preserve the geometry of pretrained representations while enabling subject-specific modulation. We provide a theoretical analysis showing that gating implements a bounded diagonal operator that limits representational drift compared to unconstrained linear transformations. Empirically, freezing the backbone substantially reduces forgetting, and lightweight gates restore lost adaptation capacity, achieving stability and plasticity simultaneously. On PAMAP2 with 8 sequential subjects, our approach reduces forgetting from 39.7% to 16.2% and improves final accuracy from 56.7% to 77.7%, while training less than 2% of parameters. Our method matches or exceeds standard continual learning baselines without replay buffers or task-specific regularization, confirming that structured diagonal operators are effective and efficient under distribution shift.
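摘要中"对冻结特征做通道级对角缩放"的门控适配思想可作如下极简示意(随机投影主干与各维度均为假设,仅用于说明可训练参数量之少):

```python
import numpy as np

rng = np.random.default_rng(0)

# 冻结的预训练主干:这里用固定随机投影代替真实特征提取器(示意假设)
W_frozen = rng.normal(size=(16, 64))

def backbone(x):
    return np.maximum(x @ W_frozen, 0.0)  # 冻结的 ReLU 特征

def gated_features(x, gate_logits):
    """通道级门控调制:对冻结特征做有界的对角缩放。
    仅 `gate_logits`(64 个标量)参与训练,不生成新特征。"""
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid 将每个门限制在 (0, 1)
    return backbone(x) * g                   # 对角算子,保持表征几何结构

x = rng.normal(size=(4, 16))
gates = np.zeros(64)                 # sigmoid(0) = 0.5:中性缩放
f = gated_features(x, gates)
print(f.shape, "可训练参数:", gates.size)  # 远少于 W_frozen 的 1024 个
```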
[AI-75] AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition
【速读】:该论文旨在解决多模态对话情感识别中两个核心问题:一是现有方法难以有效建模情感依赖关系,且无法过滤模态特征中的冗余或噪声信号,从而影响对说话者间及说话者内部情感状态动态演变的准确捕捉;二是多模态特征学习过程中主导模态(如文本)易压制非主导模态(如语音和视觉),导致互补信息被抑制,限制整体识别性能。解决方案的关键在于提出一种自适应模态平衡的动态语义图差分网络(Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network, AMB-DSGDN),其核心创新包括:构建针对文本、语音和视觉的模态特异性子图(包含说话者内与跨说话者图结构以捕获情感依赖),并引入差分图注意力机制——通过计算两组注意力图之间的差异,显式对比不同注意力分布,消除共有的噪声模式而保留模态特异性和上下文相关信号,从而获得更纯净、更具判别力的情感表示;同时设计自适应模态平衡机制,根据各模态在情感建模中的相对贡献动态估计其丢弃概率,实现模态间的平衡融合。
链接: https://arxiv.org/abs/2603.10043
作者: Yunsheng Wang,Yuntao Shou,Yilong Tan,Wei Ai,Tao Meng,Keqin Li
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 18 pages
Abstract:Multimodal dialogue emotion recognition captures emotional cues by fusing text, visual, and audio modalities. However, existing approaches still suffer from notable limitations in modeling emotional dependencies and learning multimodal representations. On the one hand, they are unable to effectively filter out redundant or noisy signals within multimodal features, which hinders the accurate capture of the dynamic evolution of emotional states across and within speakers. On the other hand, during multimodal feature learning, dominant modalities tend to overwhelm the fusion process, thereby suppressing the complementary contributions of non-dominant modalities such as speech and vision, ultimately constraining the overall recognition performance. To address these challenges, we propose an Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network (AMB-DSGDN). Concretely, we first construct modality-specific subgraphs for text, speech, and vision, where each modality contains intra-speaker and inter-speaker graphs to capture both self-continuity and cross-speaker emotional dependencies. On top of these subgraphs, we introduce a differential graph attention mechanism, which computes the discrepancy between two sets of attention maps. By explicitly contrasting these attention distributions, the mechanism cancels out shared noise patterns while retaining modality-specific and context-relevant signals, thereby yielding purer and more discriminative emotional representations. In addition, we design an adaptive modality balancing mechanism, which estimates a dropout probability for each modality according to its relative contribution in emotion modeling.
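差分注意力"计算两组注意力图之差以对消共有噪声"的核心思路可示意如下(`lam` 平衡系数与单头稠密注意力形式均为假设,原文为图结构注意力):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(X, Wq1, Wk1, Wq2, Wk2, lam=0.5):
    """用一组注意力图减去另一组:两者共有的噪声模式相互抵消,
    保留特异性的上下文信号。"""
    d = Wq1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))
    return (A1 - lam * A2) @ X   # 差分注意力图聚合节点特征

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))      # 5 个话语节点,8 维特征
W = [rng.normal(size=(8, 8)) for _ in range(4)]
out = differential_attention(X, *W)
print(out.shape)  # (5, 8)
```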
[AI-76] argeted Bit-Flip Attacks on LLM -Based Agents
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理系统中因硬件故障引发的针对性位翻转攻击(Targeted Bit-Flip Attack, BFA)问题,此类攻击可篡改模型参数以操控最终输出和外部工具调用,而此前研究主要聚焦于单步推理模型(如图像分类器),未覆盖LLM代理多阶段流水线与外部工具集成带来的新攻击面。解决方案的关键在于提出Flip-Agent——首个针对LLM代理的靶向BFA框架,通过精准定位并操纵模型权重中的关键比特位,实现对代理任务结果及工具交互行为的有效控制,实验表明其在真实代理任务中显著优于现有方法,揭示了LLM代理系统的严重安全漏洞。
链接: https://arxiv.org/abs/2603.10042
作者: Jialai Wang,Ya Wen,Zhongmou Liu,Yuxiao Wu,Bingyi He,Zongpeng Li,Ee-Chien Chang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: To appear in DAC 2026 (Design Automation Conference)
Abstract:Targeted bit-flip attacks (BFAs) exploit hardware faults to manipulate model parameters, posing a significant security threat. While prior work targets single-step inference models (e.g., image classifiers), LLM-based agents with multi-stage pipelines and external tools present new attack surfaces, which remain unexplored. This work introduces Flip-Agent, the first targeted BFA framework for LLM-based agents, manipulating both final outputs and tool invocations. Our experiments show that Flip-Agent significantly outperforms existing targeted BFAs on real-world agent tasks, revealing a critical vulnerability in LLM-based agent systems.
[AI-77] HTM-EAR: Importance-Preserving Tiered Memory with Hybrid Routing under Saturation
【速读】:该论文旨在解决长期运行智能体(long-running agents)在有限上下文窗口下如何高效管理累积事实的问题,核心挑战在于如何在内存约束条件下保留关键信息并实现可控遗忘。其解决方案的关键在于提出一种分层记忆结构 HTM-EAR,该结构包含基于 HNSW 的工作记忆层(L1)与归档存储层(L2),通过重要性感知的淘汰策略(importance-aware eviction)和混合路由机制(hybrid routing)协同工作:当 L1 满载时,依据重要性和使用频率加权评分进行淘汰;查询优先在 L1 中处理,若覆盖不足则回退至 L2,并采用交叉编码器(cross-encoder)对候选结果重新排序,从而在维持高精度的同时实现对过时历史的有效控制。
链接: https://arxiv.org/abs/2603.10032
作者: Shubham Kumar Singh
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 4 figures, 3 tables. Code available at GitHub
Abstract:Memory constraints in long-running agents require structured management of accumulated facts while preserving essential information under bounded context limits. We introduce HTM-EAR, a hierarchical tiered memory substrate that integrates HNSW-based working memory (L1) with archival storage (L2), combining importance-aware eviction and hybrid routing. When L1 reaches capacity, items are evicted using a weighted score of importance and usage. Queries are first resolved in L1; if similarity or entity coverage is insufficient, retrieval falls back to L2, and candidates are re-ranked using a cross-encoder. We evaluate the system under sustained saturation (15,000 facts; L1 capacity 500; L2 capacity 5000) using synthetic streams across five random seeds and real BGL system logs. Ablation studies compare the full system against variants without cross-encoder re-ranking, without routing gates, with LRU eviction, and an oracle with unbounded memory. Under saturation, the full model preserves active-query precision (MRR = 1.000) while enabling controlled forgetting of stale history, approaching oracle active performance (0.997 +/- 0.003). In contrast, LRU minimizes latency (21.1 ms) but permanently evicts 2416 essential facts. On BGL logs, the full system achieves MRR 0.336, close to the oracle (0.370), while LRU drops to 0.069. Code is publicly available at: this https URL
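L1 满载时按"重要性与使用频率加权评分"淘汰条目的策略可示意如下(权重取值为假设,摘要未给出具体数值):

```python
def evict_candidates(items, capacity, w_importance=0.7, w_usage=0.3):
    """超出容量时,按重要性与使用频率的加权分数淘汰得分最低的条目。
    (权重 0.7/0.3 为示意假设。)"""
    def score(it):
        return w_importance * it["importance"] + w_usage * it["usage"]
    keep = sorted(items, key=score, reverse=True)[:capacity]
    evicted = [it for it in items if it not in keep]
    return keep, evicted

l1 = [
    {"fact": "user_name=Ada", "importance": 0.9, "usage": 0.8},
    {"fact": "tmp_note",      "importance": 0.1, "usage": 0.2},
    {"fact": "project=HTM",   "importance": 0.8, "usage": 0.5},
]
keep, evicted = evict_candidates(l1, capacity=2)
print([it["fact"] for it in evicted])  # ['tmp_note']:低分条目可转入 L2 归档层
```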
[AI-78] Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在异构架构GPU上进行生产级推理时的性能优化问题,特别是针对AMD Instinct MI325X GPU平台的跨架构评估与调优。其关键解决方案在于揭示并验证了架构感知优化(architecture-aware optimization)的重要性:不同模型结构(如MoE+MLA、Dense+GQA、MoE+GQA)对硬件资源利用和调度策略具有显著差异,例如MLA模型必须采用块大小为1且禁用KV缓存卸载,而GQA模型则能从两者中获益;同时发现AMD AITER运行时对于MLA架构是必要条件,但需根据注意力头配置选择性禁用以避免不兼容问题,进一步通过消融实验确认AITER主要提升MoE/MLA内核性能,而非通用加速。这一系列针对性优化措施显著提升了推理吞吐量,并揭示了内存带宽瓶颈作为共性限制因素。
链接: https://arxiv.org/abs/2603.10031
作者: Athos Georgiou
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 40 pages, 6 figures, 30 tables. Technical report
Abstract:We present a cross-architecture evaluation of production LLM inference on AMD Instinct MI325X GPUs, benchmarking four models spanning 235B to 1 trillion parameters across three architectural families (MoE+MLA, Dense+GQA, MoE+GQA) on an 8-GPU cluster with 2TB aggregate HBM3e using vLLM v0.14.1. Our results demonstrate that architecture-aware optimization is essential: MLA models require block size 1 and cannot use KV cache offloading, while GQA models benefit from both. The AMD AITER runtime is required for competitive MLA inference throughput and must be selectively disabled for architectures with incompatible attention head configurations. A controlled AITER ablation on Llama-3.1-405B (n=5 per condition) reveals a modest 3-5% throughput benefit at high concurrency but 2-16x higher measurement variability, confirming that AITER’s large speedups target MoE/MLA kernels specifically. Under text-only workloads, Llama-405B and DeepSeek V3.2 achieve comparable peak throughput (15,944 and 15,343 tok/s) despite an order-of-magnitude difference in active parameters. Under vision workloads, Qwen3-VL-235B reaches 47,873 tok/s, 6.5x higher than Kimi-K2.5 (7,327 tok/s). Active parameter count per token is associated with inference throughput, though confounded by differences in quantization, AITER acceleration, and tensor parallelism. All four models exhibit a common throughput saturation point consistent with a memory-bandwidth bottleneck (~500 concurrent for short sequences, ~100-200 for longer sequences). All models maintain 100% HTTP-level success rates through 1,000 concurrent users, processing 18.9 million tokens across 17,406 requests without failures.
[AI-79] he DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths
【速读】:该论文旨在解决当前AI传输库在数据移动效率上的局限性,即它们通常假设缓冲区已正确分配、放置、共享、注册,并能在完成和销毁压力下保持安全,但缺乏对缓冲区生命周期管理的显式支持。为应对这一问题,作者提出dmaplane——一个Linux内核模块,作为缓冲区编排(buffer orchestration)的显式层。其关键创新在于通过稳定的用户空间API(/dev/dmaplane)集成环形命令通道、DMA缓冲区生命周期管理、跨设备共享的dma-buf导出、内核态RDMA引擎、NUMA感知分配与验证、基于信用的流量控制、低开销可观测性以及通过PCIe BAR固定实现GPU内存集成。这些组件共同构建了一个高效、安全且可扩展的底层基础设施,用于支持大规模分布式AI推理场景下的数据搬运需求。
链接: https://arxiv.org/abs/2603.10030
作者: Marco Graziano
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:AI transport libraries move bytes efficiently, but they commonly assume that buffers are already correctly allocated, placed, shared, registered, and safe under completion and teardown pressure. This paper presents dmaplane, a Linux kernel module that makes this missing layer explicit as buffer orchestration. dmaplane exposes a stable kernel UAPI via /dev/dmaplane and composes ring-based command channels, DMA buffer lifecycle management, dma-buf export for cross-device sharing, a kernel-space RDMA engine, NUMA-aware allocation and verification, credit-based flow control, low-overhead observability, and GPU memory integration via PCIe BAR pinning. We evaluate orchestration sensitivity with measurements of NUMA cross-node penalties at DRAM scale, completion-safe flow control under sustained RDMA load, and GPU BAR mapping tiers versus cudaMemcpy. We also demonstrate end-to-end disaggregated inference by transferring KV-cache chunks between two machines using RDMA WRITE WITH IMMEDIATE and reconstructing tensor views on the receiver. RDMA measurements use Soft-RoCE; we distinguish measured results from provider-independent properties by construction.
[AI-80] How to Count AIs: Individuation and Liability for AI Agents
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在造成损害时难以识别责任主体的法律难题,尤其是当AI缺乏物理形态、可复制、分裂、合并或消失时,传统法律责任归属机制面临失效风险。为应对这一问题,论文提出两种身份识别维度:一是“薄层识别”(thin identification),将AI行为与人类负责人关联,确保可追责性;二是“厚层识别”(thick identification),区分不同AI代理实体,使其具备稳定、连贯的目标和持久性,以应对人类无法完全控制AI的情形。解决方案的关键在于引入“算法公司”(Algorithmic Corporation, A-corp)这一法律拟制实体——A-corp由人类所有但由AI运行,既能绑定AI行为与人类所有者(解决薄层识别),又能通过资源控制(如计算能力)激励AI管理者仅与目标对齐的AI共享控制权,从而促使A-corp自发组织成具有法律可识别性的稳定实体,响应法律责任(如侵权责任)的激励机制。
链接: https://arxiv.org/abs/2603.10028
作者: Yonathan Arbel,Peter Salib,Simon Goldstein
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 36 pages. Presented at the Law Following AI conference, Cambridge University. Interdisciplinary: AI safety, AI governance, legal theory
Abstract:Very soon, millions of AI agents will proliferate across the economy, autonomously taking billions of actions. Inevitably, things will go wrong. Humans will be defrauded, injured, even killed. Law will somehow have to govern the coming wave. But when an AI causes harm, the first question to answer, before anyone can be held accountable is: Which AI Did It? Identifying AIs is unusually difficult. AIs lack bodies. They can copy, split, merge, swarm, and vanish at will. Even today, a “single” AI agent is often an ensemble of instances based on multiple models. The complexity will only multiply as AI capabilities improve. This Article is the first to comprehensively diagnose the legal problem of identifying AIs. Two kinds of identity are required: “thin” and “thick.” Thin identification ties every AI action to some human principal, essential for holding accountable the humans who make and use AI agents. Thick identification distinguishes between AI agents, qua agents – sorting millions of AI entities into discrete, persistent units with stable, coherent goals, essential where principal-agent problems prevent humans from perfectly controlling AIs. This Article also presents a solution: the “Algorithmic Corporation” or “A-corp” – a legal-fictional entity that can hold property, make contracts, and litigate in its own name. Owned by humans but run by AIs, A-corps solve the thin identity problem by tying AI actions to a human owner, and the thick identity problem via emergent self-organization. A-corps own the resources – including compute – that AIs need to accomplish their goals, giving AI managers strong incentives to share control only with goal-aligned AIs. In equilibrium, incentive and selection mechanisms force A-corps to self-organize into persistent, legally legible entities with coherent goals that respond rationally to legal incentives, like liability.
[AI-81] RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators ASPLOS’26
【速读】:该论文针对当前AI编译器在处理嵌套归约操作(cascaded reduction operations)时缺乏自动化融合与内核生成能力的问题展开研究,尤其聚焦于存在循环间数据依赖的复杂模式,如注意力机制中安全Softmax后接矩阵乘法(GEMM)的情形。现有方法多依赖手工优化策略,通用性差且难以扩展。解决方案的关键在于提出一种形式化的理论分析方法,用于识别并融合此类嵌套归约操作为单一循环结构,并引入增量计算(incremental computation)形式以提升效率;基于此理论,作者设计了RedFuser框架,可自动识别支持的嵌套归约模式并生成高效融合内核,在多个工作负载上实现最高达5倍的性能加速,接近手写高度优化内核的水平。
链接: https://arxiv.org/abs/2603.10026
作者: Xinsheng Tang,Yangcheng Li,Nan Wang,Zhiyi Shu,Xingyu Ling,Junna Xing,Peng Zhou,Qiang Liu
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注: 22 pages, 13 figures, ASPLOS '26
Abstract:Operator fusion, as a key performance optimization technique in the deployment of AI models, significantly improves execution efficiency and has been widely adopted in modern AI compilers. However, for cascaded reduction operations involving multiple loops with inter-loop data dependencies, such as the safe softmax followed by GEMM within attention mechanisms, existing compilers lack effective automated fusion and kernel generation capabilities. Although some works have addressed specific instances through hand-crafted fusion strategies, their solutions are limited in generality and difficult to extend to other similar structures. Given the prevalence of such computational patterns in deep learning models, there remains significant untapped potential in achieving general and automated fusion optimization. In this paper, we present a formal theoretical methodology for analyzing cascaded reductions which can fuse them into a single loop and introduce an incremental computation form. Based on this methodology, we design Reduction Fuser (RedFuser), a framework that automatically identifies supported cascaded reduction patterns and generates optimized fused kernels. Experiments show that RedFuser successfully fuses diverse workloads, achieving up to 2x to 5x speedup over state-of-the-art AI compilers and matching the performance of highly optimized hand-written kernels. The code is available at this https URL
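摘要提到将 safe softmax 与后续 GEMM 这类级联归约融合为单循环的"增量计算形式",其经典思路即在线 softmax,可示意如下(单行向量版本,非 RedFuser 生成的内核代码):

```python
import numpy as np

def fused_softmax_matmul(s, V):
    """softmax(s) @ V 的单遍增量形式:运行最大值 m 与归一化因子 l
    随新元素到达而重缩放,两个级联归约因此融合进同一个循环。"""
    m = -np.inf          # 运行最大值(保证数值安全)
    l = 0.0              # 运行归一化因子
    acc = np.zeros(V.shape[1])
    for j in range(len(s)):
        m_new = max(m, s[j])
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        l = l * scale + np.exp(s[j] - m_new)
        acc = acc * scale + np.exp(s[j] - m_new) * V[j]
        m = m_new
    return acc / l

s = np.array([2.0, -1.0, 0.5, 3.0])
V = np.arange(8, dtype=float).reshape(4, 2)
reference = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(fused_softmax_matmul(s, V), reference))  # True
```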
[AI-82] Defining AI Models and AI Systems: A Framework to Resolve the Boundary Problem
【速读】:该论文旨在解决当前人工智能(AI)监管框架中“AI模型”与“AI系统”概念界定不清的问题,这一模糊性导致在AI价值链上不同责任主体(如提供者和部署者)的义务难以明确划分。其解决方案的关键在于提出基于模型与系统本质关系的概念性定义:AI模型由训练后的参数和架构组成,而AI系统则是在模型基础上增加输入/输出接口等组件构成的完整运行体;在此基础上进一步发展出适用于当代神经网络驱动的机器学习AI的操作性定义,从而为监管实施提供清晰、可执行的标准,有效缓解因边界模糊引发的责任分配争议,并通过理论分析与真实案例验证其适用性。
链接: https://arxiv.org/abs/2603.10023
作者: Yuanyuan Sun,Timothy Parker,Lara Gierschmann,Sana Shams,Teo Canmetin,Mathieu Duteil,Rokas Gipiškis,Ze Shen Chin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Emerging AI regulations assign distinct obligations to different actors along the AI value chain (e.g., the EU AI Act distinguishes providers and deployers for both AI models and AI systems), yet the foundational terms “AI model” and “AI system” lack clear, consistent definitions. Through a systematic review of 896 academic papers and a manual review of over 80 regulatory, standards, and technical or policy documents, we analyze existing definitions from multiple conceptual perspectives. We then trace definitional lineages and paradigm shifts over time, finding that most standards and regulatory definitions derive from the OECD’s frameworks, which evolved in ways that compounded rather than resolved conceptual ambiguities. The ambiguity of the boundary between an AI model and an AI system creates practical difficulties in determining obligations for different actors, and raises questions on whether certain modifications performed are specific to the model as opposed to the non-model system components. We propose conceptual definitions grounded in the nature of models and systems and the relationship between them, then develop operational definitions for contemporary neural network-based machine-learning AI: models consist of trained parameters and architecture, while systems consist of the model plus additional components including an interface for processing inputs and outputs. Finally, we discuss implications for regulatory implementation and examine how our definitions contribute to resolving ambiguities in allocating responsibilities across the AI value chain, in both theoretical scenarios and case studies involving real-world incidents.
[AI-83] Prompts and Prayers: the Rise of GPTheology
【速读】:该论文试图解决的问题是:随着生成式 AI(Generative AI)技术的发展,公众如何将人工智能(AI)赋予类似神明的属性,并形成新的信仰体系,即“GPTheology”——一种以 AI 为半神圣化身的 techno-religion 现象。论文的核心问题是探究这种新兴信仰形态的社会成因、表现形式及其与传统宗教、伦理及政治结构之间的互动关系。解决方案的关键在于通过跨学科分析方法,结合在线社区(如 Reddit)中的用户叙事和现实世界案例(如马来西亚的 AI 妈祖雕像、韩国的 ShamAIn 项目、瑞士教堂中的 AI 耶稣),识别出围绕 AI 的救赎、预言与妖魔化等核心主题,从而揭示 AI 正在重塑人类对意义、权威与超验性的理解,并警示其带来的哲学、社会、政治与伦理挑战。
链接: https://arxiv.org/abs/2603.10019
作者: Ioana Cheres,Adrian Groza,Ioana Moldovan,Mick O’Hara,Connell Vaughan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Increasingly artificial intelligence (AI) has been cast in “god-like” roles (to name a few: film industry - Matrix, The Creator, Mission Impossible, Foundation, Dune etc.; literature - Children of Time, Permutation City, Neuromancer, I Have no Mouth and I Must Scream, Alphaville etc.). This trend has accelerated with the advent of sophisticated Large Language Models such as ChatGPT. For this phenomenon, where AI is perceived as divine, we use the term GPTheology, where ChatGPT and other AI models are treated as potential oracles of a semi-divine nature. This paper explores the emergence of GPTheology as a form of techno-religion, examining how narratives around AI echo traditional religious constructs. We draw on community narratives from online forums - Reddit - and recent projects - AI-powered Mazu Statue in Malaysia (Lu, 2025); “ShamAIn” Project in Korea (He-rim, 2025); AI Jesus in a Swiss Church (Kennedy, 2024). These examples show striking similarities to technological notions of the Singularity and the development of Artificial General Intelligence (AGI). Additionally, we analyse how daily interactions with AI are acquiring ritualistic associations and how AI-centric ideologies clash with or are integrated into established religions. This study uses a dataset of Reddit posts discussing AI to identify recurring themes of salvation, prophecy, and demonization surrounding AI. Our findings suggest that new belief systems are developing around AI, and this carries both philosophical and sociotechnical implications. Our paper critically analyses the benefits and dangers, as well as the social, political and ethical challenges of this development. This transdisciplinary inquiry highlights how AI and religion are increasingly intertwined, prompting necessary questions about humanity’s relationship with its creations and the future of belief.
[AI-84] DeliberationBench: A Normative Benchmark for the Influence of Large Language Models on Users' Views
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在作为辅助决策工具时,其说服性影响难以区分“有益”与“有害”形式的问题,尤其缺乏一个具有规范正当性和民主合法性的评估标准。解决方案的关键在于提出DeliberationBench基准,以“审议式民意调查”(deliberative opinion polling)的过程为参照标准,通过一项预注册的随机实验(4,088名美国参与者与6个前沿LLM讨论65项政策提案),量化并验证LLM影响的实质效果;结果表明,LLM的影响在幅度上显著,并与既往审议式民意调查中观察到的净观点变化呈正相关,说明其作用总体上符合认知上的有益效应,从而为LLM的影响力提供可测量、可监控的评价框架,确保其影响符合民主合法性标准并尊重用户自主性。
链接: https://arxiv.org/abs/2603.10018
作者: Luke Hewitt,Maximilian Kroner Dale,Paul de Font-Reaulx
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 20 pages, 6 figures, IASEAI 2026
Abstract:As large language models (LLMs) become pervasive as assistants and thought partners, it is important to characterize their persuasive influence on users’ beliefs. However, a central challenge is to distinguish “beneficial” from “harmful” forms of influence, in a manner that is normatively defensible and legitimate. We propose DeliberationBench, a benchmark for assessing LLM influence that takes the process of deliberative opinion polling as its standard. We demonstrate our approach in a preregistered randomized experiment in which 4,088 U.S. participants discussed 65 policy proposals with six frontier LLMs. Using opinion change data from four prior Deliberative Polls conducted by the Deliberative Democracy Lab, we find evidence that the tested LLMs’ influence is substantial in magnitude and positively associated with the net opinion shifts following deliberation, suggesting that these models exert broadly epistemically desirable effects. We further explore differential influence between topic areas, demographic subgroups, and models. Our framework can function as an evaluation and monitoring tool, helping to ensure that the influence of LLMs remains consistent with democratically legitimate standards, and preserves users’ autonomy in forming their views.
[AI-85] Assessing Cognitive Biases in LLMs for Judicial Decision Support: Virtuous Victim and Halo Effects ICDM2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)是否表现出类人认知偏差的问题,尤其关注其在司法量刑辅助决策中的公平性影响。研究聚焦于两个关键认知偏差:善者受害者效应(Virtuous Victim Effect, VVE)及其在邻近同意情境下的变化,以及基于声望的光环效应(Prestige-based Halo Effects,包括职业、公司和资质)。解决方案的关键在于设计受控的场景 vignette,通过调整单一变量并保持其他条件一致,量化不同模型在各条件下的输出差异,从而隔离特定偏倚的影响;实验采用五种代表性LLM进行多轮独立测试,结果表明LLMs在VVE上表现更强,邻近同意未显著降低该效应,而整体光环效应略低于人类,唯独资质相关光环效应显著减弱,显示出LLMs在某些维度上具备优于人类的公平性潜力。
链接: https://arxiv.org/abs/2603.10016
作者: Sierra S. Liu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: IEEE ICDM 2025
Abstract:We investigate whether large language models (LLMs) display human-like cognitive biases, focusing on potential implications for assistance in judicial sentencing, a decision-making system where fairness is paramount. Two of the most relevant biases were chosen: the virtuous victim effect (VVE), with emphasis given to its reduction when adjacent consent is present, and prestige-based halo effects (occupation, company, and credentials). Using vignettes that were altered from prior literature to avoid LLMs recalling from their training data, we isolate each manipulation by holding all other details consistent, then measuring the percentage difference in outcomes. Five models were evaluated as representative LLMs in independent multi-run trials per condition (ChatGPT 5 Instant, ChatGPT 5 Thinking, DeepSeek V3.1, Claude Sonnet 4, Gemini 2.5 Flash). Our research finds a larger VVE, no statistically significant penalty for adjacent consent, and a slightly reduced halo effect compared to humans, with the exception of credential-based prestige, which showed a large reduction. Although variation across models and outputs restricts current judicial usage, the models showed modest improvements over human benchmarks.
[AI-86] One Model Many Skills: Parameter-Efficient Fine-Tuning for Multitask Code Analysis
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码分析任务中效果不明确的问题,以及多任务学习(Multi-task Learning)与参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)结合时缺乏系统评估的挑战。其核心解决方案是首次对多任务PEFT在代码分析场景下的有效性进行全面评估,发现共享一个PEFT模块即可达到甚至超越全量多任务微调的性能,同时显著降低存储和计算成本——例如,可将可训练参数数量减少至任务数倍,计算开销降低高达85%。此外,研究指出任务分组策略(如任务稳定性、互补性、数据质量等)对多任务协同微调的成功至关重要,从而为高效部署多任务代码分析系统提供了理论依据与实践指导。
链接: https://arxiv.org/abs/2603.09978
作者: Amal Akli,Maxime Cordy,Mike Papadakis,Yves Le Traon
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models have recently surpassed specialized systems on code generation, yet their effectiveness on other code-analysis tasks remains less clear. At the same time, multi-task learning offers a way to unify diverse objectives within a single model, but fully fine-tuning LLMs across tasks is computationally prohibitive. Parameter-efficient fine-tuning mitigates this cost by updating only a small fraction of weights. Although PEFT has proven effective in single-task settings, its potential for multi-task learning has not yet been systematically explored. We present the first comprehensive evaluation of multi-task PEFT for code analysis, comparing several methods across diverse tasks and model architectures. Our experiments show that a single PEFT module shared across tasks can match, and in some cases surpass, full multi-task fine-tuning, confirming that the benefits of PEFT extend beyond isolated tasks. When comparing single-task and multi-task setups, we find that multi-task PEFT achieves a favorable performance-efficiency trade-off: it delivers accuracy close to single-task fine-tuning while reducing storage requirements, cutting the number of trainable parameters by a factor of the task count, and lowering computation costs by as much as 85%. At the same time, multi-task gains remain sensitive to task grouping. Through task-pairing experiments, we identify key factors shaping outcomes: task stability, model architecture, task complementarity, asymmetry, and dataset quality determine the success of co-fine-tuning. Finally, we benchmark efficient multi-task PEFT against direct prompting of open-source general-purpose LLMs, including DeepSeek, Qwen, Mistral, CodeLlama, and StarCoder. Despite their strong performance in code generation, these models underperform on analysis tasks, where even a 1B-parameter model with multi-task PEFT achieves significantly better results.
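The storage arithmetic behind sharing one PEFT module across tasks can be sketched with a toy LoRA-style adapter in NumPy; the hidden size, rank, and task count below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_tasks = 64, 4, 3              # hidden size, adapter rank, task count (illustrative)

W = rng.normal(size=(d, d))           # frozen pretrained weight, never updated
A = np.zeros((r, d))                  # shared trainable low-rank factor (zero init, LoRA-style)
B = rng.normal(size=(d, r)) * 0.01    # shared trainable low-rank factor

def adapted_forward(x):
    """Forward through the frozen weight plus the shared low-rank adapter."""
    return x @ (W + B @ A).T

x = rng.normal(size=(5, d))
y = adapted_forward(x)

# Storage arithmetic: one shared adapter vs. n_tasks full fine-tuned copies of W.
trainable_params = A.size + B.size
full_multitask_params = n_tasks * W.size
ratio = trainable_params / full_multitask_params
```

Because A starts at zero, the adapter initially contributes nothing and the frozen model's behavior is preserved; here the shared adapter holds roughly 4% of the parameters that three separate full fine-tuned copies would require, which is the kind of trainable-parameter reduction the abstract describes.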
[AI-87] Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在银行、金融与保险(Banking, Financial Services, and Insurance, BFSI)领域部署时存在的安全评估不足问题,即现有红队测试基准多为领域无关(domain-agnostic),难以捕捉受监管环境中由合法或专业语境诱导的有害行为。其解决方案的关键在于提出一个风险感知的评估框架,包含三个核心组件:一是基于BFSI领域的金融危害分类体系(domain-specific taxonomy of financial harms),二是自动化多轮红队测试流水线(automated multi-round red-teaming pipeline),三是基于集成判断的评判协议(ensemble-based judging protocol)。该框架进一步引入风险调整伤害评分(Risk-Adjusted Harm Score, RAHS),通过量化披露的运营严重性、考虑缓解信号并利用评委间一致性,实现对LLM安全失效的精细化评估,从而揭示高随机解码和持续自适应交互会显著提升越狱成功率并导致更严重的可操作性金融泄露,凸显了长期对抗压力下进行风险敏感评估的重要性。
链接: https://arxiv.org/abs/2603.10807
作者: Fabrizio Dimino,Bhaskarjit Sarmah,Stefano Pasquali
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:The rapid adoption of large language models (LLMs) in financial services introduces new operational, regulatory, and security risks. Yet most red-teaming benchmarks remain domain-agnostic and fail to capture failure modes specific to regulated BFSI settings, where harmful behavior can be elicited through legally or professionally plausible framing. We propose a risk-aware evaluation framework for LLM security failures in Banking, Financial Services, and Insurance (BFSI), combining a domain-specific taxonomy of financial harms, an automated multi-round red-teaming pipeline, and an ensemble-based judging protocol. We introduce the Risk-Adjusted Harm Score (RAHS), a risk-sensitive metric that goes beyond success rates by quantifying the operational severity of disclosures, accounting for mitigation signals, and leveraging inter-judge agreement. Across diverse models, we find that higher decoding stochasticity and sustained adaptive interaction not only increase jailbreak success, but also drive systematic escalation toward more severe and operationally actionable financial disclosures. These results expose limitations of single-turn, domain-agnostic security evaluation and motivate risk-sensitive assessment under prolonged adversarial pressure for real-world BFSI deployment.
[AI-88] JEDI: Jointly Embedded Inference of Neural Dynamics
【速读】:该论文旨在解决从有限、噪声大且高维的实验神经记录中识别任务特异性动力学规则这一挑战,尤其是在不同行为条件下难以跨任务泛化的问题。传统递归神经网络(RNN)虽能有效推断单一任务下的动力学机制,但缺乏在多任务和多情境下的一致建模能力。其解决方案的关键在于提出一种分层模型JEDI,通过学习共享的RNN权重嵌入空间来捕捉跨任务与情境的神经动力学结构;该模型不仅能精确重构单个样本的动力学轨迹,还能在大规模复杂数据上实现可扩展性和通用性,并通过反向工程恢复出真实的固定点结构及特征谱信息,从而从记录数据中提取出具有机制解释力的神经动态规律。
链接: https://arxiv.org/abs/2603.10489
作者: Anirudh Jamkhandi,Ali Korojy,Olivier Codol,Guillaume Lajoie,Matthew G. Perich
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Animal brains flexibly and efficiently achieve many behavioral tasks with a single neural network. A core goal in modern neuroscience is to map the mechanisms of the brain’s flexibility onto the dynamics underlying neural populations. However, identifying task-specific dynamical rules from limited, noisy, and high-dimensional experimental neural recordings remains a major challenge, as experimental data often provide only partial access to brain states and dynamical mechanisms. While recurrent neural networks (RNNs) directly constrained neural data have been effective in inferring underlying dynamical mechanisms, they are typically limited to single-task domains and struggle to generalize across behavioral conditions. Here, we introduce JEDI, a hierarchical model that captures neural dynamics across tasks and contexts by learning a shared embedding space over RNN weights. This model recapitulates individual samples of neural dynamics while scaling to arbitrarily large and complex datasets, uncovering shared structure across conditions in a single, unified model. Using simulated RNN datasets, we demonstrate that JEDI accurately learns robust, generalizable, condition-specific embeddings. By reverse-engineering the weights learned by JEDI, we show that it recovers ground truth fixed point structures and unveils key features of the underlying neural dynamics in the eigenspectra. Finally, we apply JEDI to motor cortex recordings during monkey reaching to extract mechanistic insight into the neural dynamics of motor control. Our work shows that joint learning of contextual embeddings and recurrent weights provides scalable and generalizable inference of brain dynamics from recordings alone.
[AI-89] Quantum entanglement provides a competitive advantage in adversarial games
【速读】:该论文试图解决的问题是:在完全经典的竞争环境中,量子资源(特别是量子纠缠)是否能够带来性能优势。这一问题在竞争性零和强化学习中尤为关键,因为此类任务要求智能体不仅要学习状态与动作之间的映射,还需建模对抗性代理间的动态交互关系。论文的解决方案关键在于设计了一个量子-经典混合智能体架构,在Pong游戏中进行受控实验,其中使用8量子比特参数化量子电路作为特征提取器嵌入到近端策略优化(Proximal Policy Optimization, PPO)框架中;通过对比分离态电路与引入固定(CZ)或可训练(IsingZZ)纠缠门的电路,发现含纠缠的电路在相同参数量下始终优于分离电路,并在低容量条件下达到甚至超越经典多层感知机(Multilayer Perceptron, MLP)基线。此外,表示相似性分析表明,纠缠电路学习到结构上不同的特征表示,这与对交互状态变量的更好建模一致,从而确立了量子纠缠作为竞争性强化学习中表征学习的功能性资源。
链接: https://arxiv.org/abs/2603.10289
作者: Peiyong Wang,Kieran Hymas,James Quach
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 5 figures
Abstract:Whether uniquely quantum resources confer advantages in fully classical, competitive environments remains an open question. Competitive zero-sum reinforcement learning is particularly challenging, as success requires modelling dynamic interactions between opposing agents rather than static state-action mappings. Here, we conduct a controlled study isolating the role of quantum entanglement in a quantum-classical hybrid agent trained on Pong, a competitive Markov game. An 8-qubit parameterised quantum circuit serves as a feature extractor within a proximal policy optimisation framework, allowing direct comparison between separable circuits and architectures incorporating fixed (CZ) or trainable (IsingZZ) entangling gates. Entangled circuits consistently outperform separable counterparts with comparable parameter counts and, in low-capacity regimes, match or exceed classical multilayer perceptron baselines. Representation similarity analysis further shows that entangled circuits learn structurally distinct features, consistent with improved modelling of interacting state variables. These findings establish entanglement as a functional resource for representation learning in competitive reinforcement learning.
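The effect of a fixed CZ entangling gate can be isolated in a two-qubit statevector toy, far smaller than the paper's 8-qubit PPO setup: applied to a separable state, CZ produces a maximally entangled one, which the entanglement entropy detects.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)    # Hadamard gate
CZ = np.diag([1.0, 1.0, 1.0, -1.0])             # controlled-Z entangling gate

psi0 = np.zeros(4); psi0[0] = 1.0               # |00>
plus_plus = np.kron(H, H) @ psi0                # separable state |+>|+>
psi = CZ @ plus_plus                            # entangled state

def entanglement_entropy(state):
    """Von Neumann entropy (in bits) of qubit 0's reduced state."""
    M = state.reshape(2, 2)                     # amplitudes as a 2x2 matrix
    s = np.linalg.svd(M, compute_uv=False)      # Schmidt coefficients
    p = s**2
    p = p[p > 1e-12]
    return float(-(p * np.log2(p)).sum())

e_before = entanglement_entropy(plus_plus)      # product state: 0 bits
e_after = entanglement_entropy(psi)             # maximally entangled: 1 bit
```

The entropy jumping from 0 to 1 bit is exactly the resource a separable circuit cannot produce, which is what the paper's controlled comparison isolates.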
[AI-90] Learning from Radio using Variational Quantum RF Sensing
【速读】:该论文旨在解决传统无线网络中环境感知能力受限的问题,即如何利用无线电频率(RF)信号在不依赖额外硬件或复杂信道测量的情况下,实现对物理环境的高灵敏度感知与定位。其解决方案的关键在于引入量子传感探针(quantum sensing probe),通过量子电路优化该探针与RF电磁场的交互过程,并基于射线追踪(ray-tracer)生成的数据训练量子学习模型。实验表明,该方法可在部署时无需进行信道测量、对弱信号和遮挡信号保持敏感,且在信息量少于经典基线的前提下仍能有效学习环境特征,从而为智能系统提供更鲁棒的环境认知能力。
链接: https://arxiv.org/abs/2603.10239
作者: Ivana Nikoloska
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Signal Processing (eess.SP)
备注: submitted for publication
Abstract:In modern wireless networks, radio channels serve a dual role. Whilst their primary function is to carry bits of information from a transmitter to a receiver, the intrinsic sensitivity of transmitted signals to the physical structure of the environment makes the channel a powerful source of knowledge about the world. In this paper, we consider an agent that learns about its environment using a quantum sensing probe, optimised using a quantum circuit, which interacts with the radio-frequency (RF) electromagnetic field. We use data obtained from a ray-tracer to train the quantum circuit and learning model and we provide extensive experiments under realistic conditions on a localisation task. We show that using quantum sensors to learn from radio signals can enable intelligent systems that require no channel measurements at deployment, remain sensitive to weak and obstructed RF signals, and can learn about the world despite operating with strictly less information than classical baselines.
[AI-91] A Diffusion Analysis of Policy Gradient for Stochastic Bandits
【速读】:该论文旨在解决k臂随机老虎机(k-armed stochastic bandits)中策略梯度方法的 regret 优化问题,特别是在连续时间扩散近似框架下的学习率设计与性能边界分析。其解决方案的关键在于证明:当学习率 η=O(Δ²/log(n)) 时,累积遗憾(regret)可被控制在 O(k·log(k)·log(n)/η) 的量级,其中 Δ 为最小奖励差距,n 为决策时长;同时通过构造特定实例表明,若学习率不满足 η=O(Δ²),则即使仅有对数数量的臂(logarithmically many arms),遗憾仍可能呈线性增长,从而揭示了学习率选择对算法收敛性和遗憾界的根本影响。
链接: https://arxiv.org/abs/2603.10219
作者: Tor Lattimore
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: 17 pages
Abstract:We study a continuous-time diffusion approximation of policy gradient for k-armed stochastic bandits. We prove that with a learning rate \eta = O(\Delta^2/\log(n)) the regret is O(k \log(k) \log(n)/\eta), where n is the horizon and \Delta is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless \eta = O(\Delta^2).
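The continuous-time view can be illustrated by iterating the expected (drift) update of softmax policy gradient on a two-armed instance, i.e. the deterministic mean-field dynamics that the diffusion approximates; the arm means, learning rate, and horizon below are illustrative, not the paper's constructions.

```python
import numpy as np

mu = np.array([0.5, 0.8])     # arm reward means; gap Delta = 0.3 (illustrative)
eta = 0.1                     # learning rate (illustrative, not the paper's schedule)
theta = np.zeros(2)           # softmax policy logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Expected REINFORCE update: since J(theta) = sum_a p_a mu_a with p = softmax(theta),
# dJ/dtheta_i = p_i * (mu_i - J). Iterating this drift is the noise-free analogue
# of the stochastic policy-gradient dynamics.
for _ in range(2000):
    p = softmax(theta)
    J = p @ mu
    theta += eta * p * (mu - J)

p_final = softmax(theta)      # mass concentrates on the better arm
```

Under this drift the policy converges to the optimal arm; the paper's analysis concerns how the stochastic fluctuations around this drift interact with the learning rate and the gap.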
机器学习
[LG-0] Leech Lattice Vector Quantization for Efficient LLM Compression
链接: https://arxiv.org/abs/2603.11021
作者: Tycho F. A. van der Ouderaa,Mart van Baalen,Paul Whatmough,Markus Nagel
类目: Machine Learning (cs.LG)
备注:
Abstract:Scalar quantization of large language models (LLMs) is fundamentally limited by information-theoretic bounds. While vector quantization (VQ) overcomes these limits by encoding blocks of parameters jointly, practical implementations must avoid the need for expensive lookup mechanisms or other explicit codebook storage. Lattice approaches address this through highly structured and dense packing. This paper explores the Leech lattice, which, with its optimal sphere packing and kissing configurations at 24 dimensions, is the highest dimensional lattice known with such optimal properties. To make the Leech lattice usable for LLM quantization, we extend an existing search algorithm based on the extended Golay code construction, to i) support indexing, enabling conversion to and from bitstrings without materializing the codebook, ii) allow angular search over union of Leech lattice shells, iii) propose fully-parallelisable dequantization kernel. Together this yields a practical algorithm, namely Leech Lattice Vector Quantization (LLVQ). LLVQ delivers state-of-the-art LLM quantization performance, outperforming recent methods such as Quip#, QTIP, and PVQ. These results highlight the importance of high-dimensional lattices for scalable, theoretically grounded model compression.
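Decoding the Leech lattice via the extended Golay code is well beyond a short snippet, but the codebook-free principle can be shown with the classic Conway–Sloane nearest-point rule for the much simpler D_n lattice (integer vectors with even coordinate sum); this is a toy analogue, not the paper's LLVQ algorithm.

```python
import numpy as np

def quantize_Dn(x):
    """Nearest point of the D_n lattice (integer vectors with even sum).

    Conway--Sloane rule: round every coordinate; if the rounded sum is odd,
    re-round the coordinate with the largest rounding error the other way.
    No codebook is ever materialized.
    """
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        k = int(np.argmax(np.abs(x - f)))        # worst-rounded coordinate
        f[k] += 1.0 if x[k] > f[k] else -1.0     # round it the other way
    return f

x = np.array([0.6, 0.4, -1.2, 0.9])
q = quantize_Dn(x)                               # lattice point with even sum
```

The appeal for quantization is the same as in the paper: the structure of the lattice replaces explicit codebook storage, and encoding/decoding reduce to cheap arithmetic on coordinates.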
[LG-1] Cross-Species Transfer Learning for Electrophysiology-to-Transcriptomics Mapping in Cortical GABAergic Interneurons
链接: https://arxiv.org/abs/2603.11000
作者: Theo Schwider,Ramin Ramezani
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Single-cell electrophysiological recordings provide a powerful window into neuronal functional diversity and offer an interpretable route for linking intrinsic physiology to transcriptomic identity. Here, we replicate and extend the electrophysiology-to-transcriptomics framework introduced by Gouwens et al. (2020) using publicly available Allen Institute Patch-seq datasets from both mouse and human cortex. We focus on GABAergic inhibitory interneurons to target a subclass structure (Lamp5, Pvalb, Sst, Vip) that is comparable and conserved across species. After quality control, we analyzed 3,699 mouse visual cortex neurons and 506 human neocortical neurons from neurosurgical resections. Using standardized electrophysiological features and sparse PCA, we reproduced the major class-level separations reported in the original mouse study. For supervised prediction, a class-balanced random forest provided a strong feature-engineered baseline in mouse data and a reduced but still informative baseline in human data. We then developed an attention-based BiLSTM that operates directly on the structured IPFX feature-family representation, avoiding sPCA and providing feature-family-level interpretability via learned attention weights. Finally, we evaluated a cross-species transfer setting in which the sequence model is pretrained on mouse data and fine-tuned on human data for an aligned 4-class task, improving human macro-F1 relative to a human-only training baseline. Together, these results confirm reproducibility of the Gouwens pipeline in mouse data, demonstrate that sequence models can match feature-engineered baselines, and show that mouse-to-human transfer learning can provide measurable gains for human subclass prediction.
[LG-2] Factorized Neural Implicit DMD for Parametric Dynamics
链接: https://arxiv.org/abs/2603.10995
作者: Siyuan Chen,Zhecheng Wang,Yixin Chen,Yue Chang,Peter Yichen Chen,Eitan Grinspun,Jonathan Panuelos
类目: Machine Learning (cs.LG)
备注:
Abstract:A data-driven, model-free approach to modeling the temporal evolution of physical systems mitigates the need for explicit knowledge of the governing equations. Even when physical priors such as partial differential equations are available, such systems often reside in high-dimensional state spaces and exhibit nonlinear dynamics, making traditional numerical solvers computationally expensive and ill-suited for real-time analysis and control. Consider the problem of learning a parametric flow of a dynamical system: with an initial field and a set of physical parameters, we aim to predict the system’s evolution over time in a way that supports long-horizon rollouts, generalization to unseen parameters, and spectral analysis. We propose a physics-coded neural field parameterization of the Koopman operator’s spectral decomposition. Unlike a physics-constrained neural field, which fits a single solution surface, and neural operators, which directly approximate the solution operator at fixed time horizons, our model learns a factorized flow operator that decouples spatial modes and temporal evolution. This structure exposes underlying eigenvalues, modes, and stability of the underlying physical process to enable stable long-term rollouts, interpolation across parameter spaces, and spectral analysis. We demonstrate the efficacy of our method on a range of dynamics problems, showcasing its ability to accurately predict complex spatiotemporal phenomena while providing insights into the system’s dynamic behavior.
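The spectral machinery the model builds on can be illustrated with plain exact DMD: fit the one-step flow operator from snapshot pairs by least squares and read off its eigenvalues, which encode modes and stability; the linear system and dimensions below are synthetic, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth linear dynamics x_{t+1} = A_true x_t with a known spectrum.
eigs_true = np.array([0.9, 0.5])
P = rng.normal(size=(2, 2))
A_true = P @ np.diag(eigs_true) @ np.linalg.inv(P)

# Snapshot matrix from a single trajectory.
T = 50
X = np.empty((2, T))
X[:, 0] = rng.normal(size=2)
for t in range(T - 1):
    X[:, t + 1] = A_true @ X[:, t]

# Exact DMD: least-squares fit of the flow operator, then its eigenvalues.
A_dmd = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
eigs = np.sort(np.linalg.eigvals(A_dmd).real)   # recovered spectrum
```

Because all eigenvalues lie inside the unit circle, the fitted operator certifies stable long-horizon rollouts; the paper's contribution is parameterizing this spectral decomposition with neural fields so it varies smoothly with physical parameters.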
[LG-3] MCMC Informed Neural Emulators for Uncertainty Quantification in Dynamical Systems
链接: https://arxiv.org/abs/2603.10987
作者: Heikki Haario,Zhi-Song Liu,Martin Simon,Hendrik Weichel
类目: Machine Learning (cs.LG)
备注:
Abstract:Neural networks are a commonly used approach to replace physical models with computationally cheap surrogates. Parametric uncertainty quantification can be included in training, assuming that an accurate prior distribution of the model parameters is available. Here we study the common opposite situation, where direct screening or random sampling of model parameters leads to exhaustive training times and evaluations at unphysical parameter values. Our solution is to decouple uncertainty quantification from network architecture. Instead of sampling network weights, we introduce the model-parameter distribution as an input to network training via Markov chain Monte Carlo (MCMC). In this way, the surrogate achieves the same uncertainty quantification as the underlying physical model, but with substantially reduced computation time. The approach is fully agnostic with respect to the neural network choice. In our examples, we present a quantile emulator for prediction and a novel autoencoder-based ODE network emulator that can flexibly estimate different trajectory paths corresponding to different ODE model parameters. Moreover, we present a mathematical analysis that provides a transparent way to relate potential performance loss to measurable distribution mismatch.
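The core idea, feeding MCMC posterior samples of the model parameters into surrogate training rather than sampling network weights, can be sketched with a toy Metropolis chain and a stand-in "physical model"; the target posterior, proposal scale, and model below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(theta):
    """Toy unnormalized log-posterior over one physical parameter (Gaussian)."""
    return -0.5 * ((theta - 2.0) / 0.3) ** 2

# Random-walk Metropolis over the physical model's parameter.
samples, theta = [], 0.0
for _ in range(5000):
    prop = theta + 0.5 * rng.normal()
    if np.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)
samples = np.array(samples[1000:])          # drop burn-in

# Surrogate training set: evaluate the (cheap stand-in) physical model only at
# posterior-supported parameters, instead of screening a wide prior box that
# would include unphysical values.
def physical_model(theta, t):
    return np.exp(-theta * t)               # stand-in for an expensive solver

t_grid = np.linspace(0.0, 1.0, 8)
train_X = samples[:, None]                  # network input: sampled parameter
train_Y = physical_model(samples[:, None], t_grid[None, :])
post_mean = float(samples.mean())
```

A surrogate trained on (train_X, train_Y) then inherits the posterior's parametric uncertainty by construction, which is the decoupling of uncertainty quantification from network architecture that the abstract describes.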
[LG-4] The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers
链接: https://arxiv.org/abs/2603.10985
作者: Peter Balogh
类目: Machine Learning (cs.LG)
备注:
Abstract:We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well-captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture – seven “default-ON” neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive – creating a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms the routing is functional: removing the MLP at consensus breakdown costs 43.3% perplexity, while at full consensus removing it costs only 10.1% – exceeding a 4x difference. Comparing binary vs. continuous features for the routing decision confirms that binarization loses essentially no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22). This binary routing structure explains why smooth polynomial approximation fails: cross-validated polynomial fits (degrees 2-7) never exceed R^2 = 0.06 for highly nonlinear layers. We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.
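Why binarized activations can retain nearly all of a routing signal while continuous values carry extra magnitude information can be reproduced on synthetic data; this toy is a hedged illustration, not the paper's GPT-2 measurement.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 8
acts = rng.normal(size=(n, d))                  # continuous "neuron activations"
gates = (acts > 0).astype(float)                # binarized routing signal

w_gate = rng.normal(size=d)
# Target driven mostly by the binary ON/OFF pattern, plus a small magnitude term.
y = gates @ w_gate + 0.2 * acts[:, 0] + 0.05 * rng.normal(size=n)

def r2(X, y):
    """R^2 of ordinary least squares with an intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

r2_binary = r2(gates, y)          # binarization keeps almost all the signal
r2_continuous = r2(acts, y)       # a linear fit on raw values misses the gating
```

When the target depends on which units fire rather than how strongly, the binary features dominate, mirroring the paper's finding that the routing decision survives binarization essentially intact.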
[LG-5] Federated Learning-driven Beam Management in LEO 6G Non-Terrestrial Networks
链接: https://arxiv.org/abs/2603.10983
作者: Maria Lamprini Bartsioka,Ioannis A. Bartsiokas,Athanasios D. Panagopoulos,Dimitra I. Kaklamani,Iakovos S. Venieris
类目: Machine Learning (cs.LG); Space Physics (physics.space-ph)
备注: 2 pages with 2 figures and 1 table. Accepted in 2026 International Applied Computational Electromagnetics Society (ACES) Symposium
Abstract:Low Earth Orbit (LEO) Non-Terrestrial Networks (NTNs) require efficient beam management under dynamic propagation conditions. This work investigates Federated Learning (FL)-based beam selection in LEO satellite constellations, where orbital planes operate as distributed learners through the utilization of High-Altitude Platform Stations (HAPS). Two models, a Multi-Layer Perceptron (MLP) and a Graph Neural Network (GNN), are evaluated using realistic channel and beamforming data. Results demonstrate that GNN surpasses MLP in beam prediction accuracy and stability, particularly at low elevation angles, enabling lightweight and intelligent beam management for future NTN deployments.
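The abstract does not spell out the aggregation rule used across orbital planes; a standard FedAvg step, with planes acting as clients, would look like the following sketch (client weights and sample counts are made up):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Sample-size-weighted FedAvg aggregation of per-client parameter vectors."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

# Three orbital planes as distributed learners, each with a local model update.
w_clients = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
n_samples = [100, 100, 200]
w_global = fedavg(w_clients, n_samples)    # broadcast back, e.g. via the HAPS relay
```

Clients only exchange model parameters, never raw channel data, which is the privacy and bandwidth argument for FL in such constellations.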
[LG-6] FRIEND: Federated Learning for Joint Optimization of multi-RIS Configuration and Eavesdropper Intelligent Detection in B5G Networks
链接: https://arxiv.org/abs/2603.10977
作者: Maria Lamprini A. Bartsioka,Ioannis A. Bartsiokas,Anastasios K. Papazafeiropoulos,Maria A. Seimeni,Dimitra I. Kaklamani,Iakovos S. Venieris
类目: Machine Learning (cs.LG)
备注: 8 pages with 5 figures and 2 tables. Accepted in 29th Conference on Innovation in Clouds, Internet and Networks (ICIN 2026)
Abstract:As wireless systems evolve toward Beyond 5G (B5G), the adoption of cell-free (CF) millimeter-wave (mmWave) architectures combined with Reconfigurable Intelligent Surfaces (RIS) is emerging as a key enabler for ultra-reliable, high-capacity, scalable, and secure Industrial Internet of Things (IIoT) communications. However, safeguarding these complex and distributed environments against eavesdropping remains a critical challenge, particularly when conventional security mechanisms struggle to overcome scalability and latency constraints. In this paper, a novel framework for detecting malicious users in RIS-enhanced cell-free mmWave networks using Federated Learning (FL) is presented. The envisioned setup features multiple access points (APs) operating without traditional cell boundaries, assisted by RIS nodes to dynamically shape the wireless propagation environment. Edge devices collaboratively train a Deep Convolutional Neural Network (DCNN) on locally observed Channel State Information (CSI), eliminating the need for raw data exchange. Moreover, an early-exit mechanism is incorporated in that model to jointly satisfy computational complexity requirements. Performance evaluation indicates that the integration of FL and multi-RIS coordination improves the achieved secrecy rate (SR) by approximately 30% compared to baseline non-RIS-assisted methods while maintaining near-optimal detection accuracy levels. This work establishes a distributed, privacy-preserving approach to physical layer eavesdropping detection tailored for next-generation IIoT deployments.
[LG-7] Bio-Inspired Self-Supervised Learning for Wrist-worn IMU Signals
链接: https://arxiv.org/abs/2603.10961
作者: Prithviraj Tarale,Kiet Chu,Abhishek Varghese,Kai-Chun Liu,Maxwell A Xu,Mohit Iyyer,Sunghoon I. Lee
类目: Machine Learning (cs.LG)
备注:
Abstract:Wearable accelerometers have enabled large-scale health and wellness monitoring, yet learning robust human-activity representations has been constrained by the scarcity of labeled data. While self-supervised learning offers a potential remedy, existing approaches treat sensor streams as unstructured time series, overlooking the underlying biological structure of human movement, a factor we argue is critical for effective Human Activity Recognition (HAR). We introduce a novel tokenization strategy grounded in the submovement theory of motor control, which posits that continuous wrist motion is composed of superposed elementary basis functions called submovements. We define our token as the movement segment, a unit of motion composed of a finite sequence of submovements that is readily extractable from wrist accelerometer signals. By treating these segments as tokens, we pretrain a Transformer encoder via masked movement-segment reconstruction to model the temporal dependencies of movement segments, shifting the learning focus beyond local waveform morphology. Pretrained on the NHANES corpus (approximately 28k hours; approximately 11k participants; approximately 10M windows), our representations outperform strong wearable SSL baselines across six subject-disjoint HAR benchmarks. Furthermore, they demonstrate stronger data efficiency in data-scarce settings. Code and pretrained weights will be made publicly available.
[LG-8] Ranking Reasoning LLMs under Test-Time Scaling
链接: https://arxiv.org/abs/2603.10960
作者: Mohsen Hariri,Michael Hinczewski,Jing Ma,Vipin Chaudhary
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: Code is available at this https URL
Abstract:Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across 20 reasoning models on four Olympiad-style math benchmarks (AIME’24, AIME’25, HMMT’25, and BrUMO’25; up to N=80 trials), most full-trial rankings agree closely with the Bayesian gold standard \mathrm{Bayes}_{\mathcal{U}}@80 (mean Kendall’s \tau_b = 0.93–0.95), and 19–34 methods recover exactly the same ordering. In the single-trial regime, the best methods reach \tau_b \approx 0.86. Using greedy decoding as an empirical prior (\mathrm{Bayes}_{\mathbf{R}_0}@N) reduces variance at N=1 by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at this https URL.
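Kendall's \tau_b, the agreement statistic quoted throughout the abstract, can be computed from scratch; Scorio presumably wraps an equivalent routine, so this standalone tie-corrected version is only a reference sketch.

```python
import numpy as np

def kendall_tau_b(x, y):
    """Kendall's tau-b between two score vectors, with tie correction."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) / 2
    return (concordant - discordant) / np.sqrt((n0 - ties_x) * (n0 - ties_y))

# Two hypothetical model orderings: identical, and with one adjacent swap.
tau_same = kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4])
tau_swap = kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4])
```

With four items, a single adjacent swap flips one of six pairs, giving (5 - 1)/6 = 2/3, which is the scale on which values like 0.93–0.95 indicate near-identical rankings.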
[LG-9] When should we trust the annotation? Selective prediction for molecular structure retrieval from mass spectra
链接: https://arxiv.org/abs/2603.10950
作者: Mira Jürgens,Gaetan De Waele,Morteza Rakhshaninejad,Willem Waegeman
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Machine learning methods for identifying molecular structures from tandem mass spectra (MS/MS) have advanced rapidly, yet current approaches still exhibit significant error rates. In high-stakes applications such as clinical metabolomics and environmental screening, incorrect annotations can have serious consequences, making it essential to determine when a prediction can be trusted. We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra, enabling models to abstain from predictions when uncertainty is too high. We formulate the problem within the risk-coverage tradeoff framework and comprehensively evaluate uncertainty quantification strategies at two levels of granularity: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. We compare scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates from second-order distributions, as well as distance-based measures in the latent space. All experiments are conducted on the MassSpecGym benchmark. Our analysis reveals that while fingerprint-level uncertainty scores are poor proxies for retrieval success, computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs across evaluation settings. We demonstrate that by applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations satisfying that constraint with high probability.
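The risk-coverage tradeoff at the heart of selective prediction reduces to sorting by confidence and tracking the cumulative error rate among accepted predictions; a minimal sketch with made-up confidence scores and correctness labels:

```python
import numpy as np

def risk_coverage(confidence, correct):
    """Selective-prediction risk at each coverage level.

    Sort predictions by descending confidence; at coverage k/n the risk is
    the error rate among the k most-confident (i.e. accepted) predictions.
    """
    order = np.argsort(-np.asarray(confidence, float))
    errors = 1.0 - np.asarray(correct, float)[order]
    k = np.arange(1, len(errors) + 1)
    coverage = k / len(errors)
    risk = np.cumsum(errors) / k
    return coverage, risk

conf = np.array([0.95, 0.90, 0.80, 0.60, 0.40])   # hypothetical scores
hit = np.array([1, 1, 1, 0, 1])                    # was each annotation correct?
coverage, risk = risk_coverage(conf, hit)
```

A practitioner then picks the largest coverage whose risk stays under the tolerated error rate; the paper's distribution-free risk control gives a high-probability guarantee on that chosen operating point.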
[LG-10] Quantifying Membership Disclosure Risk for Tabular Synthetic Data Using Kernel Density Estimators
链接: https://arxiv.org/abs/2603.10937
作者: Rajdeep Pathak,Sayantee Jana
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:The use of synthetic data has become increasingly popular as a privacy-preserving alternative to sharing real datasets, especially in sensitive domains such as healthcare, finance, and demography. However, the privacy assurances of synthetic data are not absolute, and remain susceptible to membership inference attacks (MIAs), where adversaries aim to determine whether a specific individual was present in the dataset used to train the generator. In this work, we propose a practical and effective method to quantify membership disclosure risk in tabular synthetic datasets using kernel density estimators (KDEs). Our KDE-based approach models the distribution of nearest-neighbour distances between synthetic data and the training records, allowing probabilistic inference of membership and enabling robust evaluation via ROC curves. We propose two attack models: a ‘True Distribution Attack’, which assumes privileged access to training data, and a more realistic, implementable ‘Realistic Attack’ that uses auxiliary data without true membership labels. Empirical evaluations across four real-world datasets and six synthetic data generators demonstrate that our method consistently achieves higher F1 scores and sharper risk characterization than a prior baseline approach, without requiring computationally expensive shadow models. The proposed method provides a practical framework and metric for quantifying membership disclosure risk in synthetic data, which enables data custodians to conduct a post-generation risk assessment prior to releasing their synthetic datasets for downstream use. The datasets and codes for this study are available at this https URL.
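The core of the attack can be sketched directly: fit a density to nearest-neighbour distances for members and for non-members, then score a record by the likelihood ratio. A toy 1-D version with a hand-rolled Gaussian KDE (all distances and bandwidths are invented, and this is not the paper's code):

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimator over `samples`."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    def pdf(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm
    return pdf

random.seed(0)
# Hypothetical nearest-neighbour distances from synthetic records to
# training members (small) vs. to non-members (larger).
member_d = [random.gauss(0.2, 0.05) for _ in range(200)]
nonmember_d = [random.gauss(0.5, 0.10) for _ in range(200)]
f_in = gaussian_kde(member_d, 0.03)
f_out = gaussian_kde(nonmember_d, 0.05)

def membership_score(d):
    # Posterior-style likelihood ratio: higher -> more likely a member.
    return f_in(d) / (f_in(d) + f_out(d))

print(membership_score(0.2), membership_score(0.5))
```

Sweeping a threshold over `membership_score` traces the ROC curve used for evaluation.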
[LG-11] ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection
链接: https://arxiv.org/abs/2603.10926
作者: Kadir-Kaan Özer,René Ebeling,Markus Enzweiler
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, 5 tables
Abstract:Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and stable behavior under limited CPU parallelism. Accuracy-only leaderboards can therefore misrepresent which methods remain feasible under deployment-relevant constraints. We present ECoLAD (Efficiency Compute Ladder for Anomaly Detection), a deployment-oriented evaluation protocol instantiated as an empirical study on proprietary automotive telemetry (anomaly rate \approx 0.022) and complementary public benchmarks. ECoLAD applies a monotone compute-reduction ladder across heterogeneous detector families using mechanically determined, integer-only scaling rules and explicit CPU thread caps, while logging every applied configuration change. Throughput-constrained behavior is characterized by sweeping target scoring rates and reporting (i) coverage (the fraction of entities meeting the target) and (ii) the best AUC-PR achievable among measured ladder configurations satisfying the target. On constrained automotive telemetry, lightweight classical detectors sustain both coverage and detection lift above the random baseline across the full throughput sweep. Several deep methods lose feasibility before they lose accuracy.
[LG-12] NCAA Bracket Prediction Using Machine Learning and Combinatorial Fusion Analysis
链接: https://arxiv.org/abs/2603.10916
作者: Yuanhong Wu,Isaiah Smith,Tushar Marwah,Michael Schroeter,Mohamed Rahouti,D. Frank Hsu
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, Published in Proceedings of the 2024 IEEE Cyber Science and Technology Congress (CyberSciTech)
Abstract:Machine learning models have demonstrated remarkable success in sports prediction in recent years, often treating sports prediction as a classification task within the field. This paper introduces new perspectives for analyzing sports data to predict outcomes more accurately. We leverage rankings to generate team rankings for the 2024 dataset using Combinatorial Fusion Analysis (CFA), a new paradigm for combining multiple scoring systems through the rank-score characteristic (RSC) function and cognitive diversity (CD). Our result based on rank combination with respect to team ranking has an accuracy rate of 74.60%, which is higher than the best of the ten popular public ranking systems (73.02%). This demonstrates the efficacy of CFA in enhancing the precision of sports prediction through a different lens.
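A simplified average-rank combination conveys the flavour of the rank combination used here (team names and scores are invented; the full CFA method additionally weights systems via the RSC function and cognitive diversity):

```python
def to_ranks(scores):
    """Convert a score system to ranks (1 = best, higher score is better)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {team: r for r, team in enumerate(order, start=1)}

def combine_by_rank(systems):
    """Rank teams by their mean rank across several scoring systems."""
    ranked = [to_ranks(s) for s in systems]
    teams = list(systems[0])
    mean_rank = {t: sum(r[t] for r in ranked) / len(ranked) for t in teams}
    return sorted(teams, key=lambda t: mean_rank[t])

# Two hypothetical public ranking systems disagreeing on the order.
sys_a = {"Duke": 95.0, "UConn": 93.0, "Houston": 91.0}
sys_b = {"Duke": 88.0, "Houston": 90.0, "UConn": 85.0}
print(combine_by_rank([sys_a, sys_b]))
```

Working in rank space rather than raw-score space is what makes heterogeneous systems combinable at all.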
[LG-13] Ergodicity in reinforcement learning
链接: https://arxiv.org/abs/2603.10895
作者: Dominik Baumann,Erfaun Noorani,Arsenii Mustafin,Xinyi Sheng,Bert Verbruggen,Arne Vanhoyweghen,Vincent Ginis,Thomas B. Schön
类目: Machine Learning (cs.LG)
*备注: Accepted article to appear in Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Abstract:In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories with a given policy, is uninformative for the average over a single, but infinitely long trajectory. Thus, if we care about how the individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to more widely used notions of ergodic Markov chains, and present existing solutions that optimize long-term performance of individual trajectories under non-ergodic reward dynamics.
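The kind of instructive example the paper discusses can be reproduced with a classic multiplicative coin toss (parameters are illustrative): the expected per-step factor is (1.5 + 0.6) / 2 = 1.05 > 1, yet the time-average growth factor is sqrt(1.5 * 0.6) ≈ 0.95 < 1, so the ensemble average grows while almost every individual trajectory decays:

```python
import random

random.seed(1)

def trajectory(steps):
    """Wealth after `steps` rounds of a multiplicative 50/50 gamble."""
    w = 1.0
    for _ in range(steps):
        w *= 1.5 if random.random() < 0.5 else 0.6
    return w

ensemble = [trajectory(100) for _ in range(10_000)]
ensemble_avg = sum(ensemble) / len(ensemble)
median = sorted(ensemble)[len(ensemble) // 2]
fraction_ruined = sum(w < 1.0 for w in ensemble) / len(ensemble)
print(f"ensemble average wealth : {ensemble_avg:.2f}")   # grows with steps
print(f"median trajectory wealth: {median:.6f}")         # collapses
print(f"fraction below start    : {fraction_ruined:.2f}")
```

An agent maximizing the expected value here would happily play a game that ruins almost every individual run, which is exactly the non-ergodicity pitfall the paper analyzes.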
[LG-14] LAtte: Hyperbolic Lorentz Attention for Cross-Subject EEG Classification
链接: https://arxiv.org/abs/2603.10881
作者: Johannes Burchert,Ahmad Bdeir,Tom Hanika,Lars Schmidt-Thieme,Niels Landwehr
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electroencephalogram (EEG) classification is critical for applications ranging from medical diagnostics to brain-computer interfaces, yet it remains challenging due to the inherently low signal-to-noise ratio (SNR) and high inter-subject variability. To address these issues, we propose LAtte, a novel framework that integrates a Lorentz Attention Module with an InceptionTime-based encoder to enable robust and generalizable EEG classification. Unlike prior work, which evaluates primarily on single-subject performance, LAtte focuses on cross-subject training. First, we learn a shared baseline signal across all subjects using pretraining tasks to capture common underlying patterns. Then, we utilize novel Lorentz low-rank adapters to learn subject-specific embeddings that model individual differences. This allows us to learn a shared model that performs robustly across subjects, and can be subsequently finetuned for individual subjects or used to generalize to unseen subjects. We evaluate LAtte on three well-established EEG datasets, achieving a substantial improvement in performance over current state-of-the-art methods.
[LG-15] SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion
链接: https://arxiv.org/abs/2603.10873
作者: Andrea Lampis,Michela Carlotta Massi,Nicola Pirastu,Francesca Ieva,Matteo Matteucci,Emanuele Di Angelantonio
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
Abstract:Polygenic risk scores and other genomic analyses require large individual-level genotype datasets, yet strict data access restrictions impede sharing. Synthetic genotype generation offers a privacy-preserving alternative, but most existing methods operate unconditionally, producing samples without phenotype alignment, or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility. We present SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes. SNPgen combines GWAS-guided variant selection (1,024-2,048 trait-associated SNPs) with a variational autoencoder for genotype compression and a latent diffusion model conditioned on binary disease labels via classifier-free guidance. Evaluated on 458,724 UK Biobank individuals across four complex diseases (coronary artery disease, breast cancer, type 1 and type 2 diabetes), models trained on synthetic data matched real-data predictive performance in a train-on-synthetic, test-on-real protocol, approaching genome-wide PRS methods that use 2-6\times more variants. Privacy analysis confirmed zero identical matches, near-random membership inference (AUC \approx 0.50), preserved linkage disequilibrium structure, and high allele frequency correlation (r \geq 0.95) with source data. A controlled simulation with known causal effects verified faithful recovery of the imposed genetic association structure.
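The classifier-free guidance step mentioned above combines conditional and unconditional denoiser outputs by extrapolation; a minimal sketch of the combination rule (the vectors are invented placeholders for model outputs):

```python
def cfg_combine(pred_cond, pred_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.
    w = 0 -> unconditional, w = 1 -> conditional, w > 1 -> amplified."""
    return [pu + guidance_scale * (pc - pu)
            for pc, pu in zip(pred_cond, pred_uncond)]

p_c = [1.0, 2.0]   # denoiser output given the disease label
p_u = [0.5, 1.0]   # denoiser output with the label dropped
print(cfg_combine(p_c, p_u, 2.0))  # -> [1.5, 3.0]
```

During training, the label is randomly dropped so a single network learns both predictions; at sampling time the scale trades phenotype fidelity against diversity.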
[LG-16] 6ABOS: An Open-Source Atmospheric Correction Framework for the EnMAP Hyperspectral Mission Based on 6S
链接: https://arxiv.org/abs/2603.10856
作者: Gabriel Caballero Cañas,Bárbara Alvado Arranz,Xavier Sòria-Perpinyà,Antonio Ruiz-Verdú,Jesús Delegido,José Moreno
类目: Machine Learning (cs.LG)
*备注: 20 pages, 5 figures
Abstract:The Environmental Mapping and Analysis Program (EnMAP) mission has opened new frontiers in the monitoring of optically complex environments. However, the accurate retrieval of surface reflectance over water bodies remains a significant challenge, as the water-leaving signal typically accounts for only a small fraction of the total radiance, being easily obscured by atmospheric scattering and surface reflection effects. This paper introduces 6ABOS (6S-based Atmospheric Background Offset Subtraction), a novel open-source Python framework designed to automate the atmospheric correction (AC) of EnMAP hyperspectral imagery. By leveraging the Second Simulation of the Satellite Signal in the Solar Spectrum (6S) radiative transfer model, 6ABOS implements a physically-based inversion scheme that accounts for Rayleigh scattering, aerosol interactions, and gaseous absorption. The framework integrates automated EnMAP metadata parsing with dynamic atmospheric parameter retrieval via the Google Earth Engine (GEE) Application Programming Interface (API). Validation was conducted over two Mediterranean inland water reservoirs with contrasting trophic states: the oligotrophic Benagéber and the hypertrophic Bellús. Results demonstrate a high degree of spectral similarity between in situ measurements and EnMAP-derived water-leaving reflectances. The Spectral Angle Mapper (SAM) values remained consistently low (SAM < 10^\circ) across both study sites. 6ABOS is distributed via conda-forge, providing the scientific community with a scalable, transparent, and reproducible open-science tool for advancing hyperspectral aquatic research in the cloud-computing era.
[LG-17] Evaluating randomized smoothing as a defense against adversarial attacks in trajectory prediction
链接: https://arxiv.org/abs/2603.10821
作者: Julian F. Schumann,Eduardo Figueiredo,Frederik Baymler Mathiesen,Luca Laurenti,Jens Kober,Arkady Zgonnikov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate and robust trajectory prediction is essential for safe and efficient autonomous driving, yet recent work has shown that even state-of-the-art prediction models are highly vulnerable to inputs being mildly perturbed by adversarial attacks. Although model vulnerabilities to such attacks have been studied, work on effective countermeasures remains limited. In this work, we develop and evaluate a new defense mechanism for trajectory prediction models based on randomized smoothing – an approach previously applied successfully in other domains. We evaluate its ability to improve model robustness through a series of experiments that test different strategies of randomized smoothing. We show that our approach can consistently improve prediction robustness of multiple base trajectory prediction models in various datasets without compromising accuracy in non-adversarial settings. Our results demonstrate that randomized smoothing offers a simple and computationally inexpensive technique for mitigating adversarial attacks in trajectory prediction.
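Randomized smoothing itself is simple to state: average the predictor's outputs over Gaussian perturbations of the input, so a small adversarial change to the observed history can only shift the smoothed output slightly. A toy sketch with a linear-extrapolation "predictor" standing in for a real trajectory model (illustrative, not one of the paper's base models):

```python
import random

def smoothed_predict(predict, history, sigma, n_samples=50):
    """Average the model's predicted trajectory over Gaussian
    perturbations of the observed history (randomized smoothing)."""
    acc = None
    for _ in range(n_samples):
        noisy = [x + random.gauss(0.0, sigma) for x in history]
        pred = predict(noisy)
        acc = pred if acc is None else [a + p for a, p in zip(acc, pred)]
    return [a / n_samples for a in acc]

def predict(hist):
    """Toy base predictor: extrapolate the last observed step."""
    step = hist[-1] - hist[-2]
    return [hist[-1] + step, hist[-1] + 2 * step]

random.seed(0)
smoothed = smoothed_predict(predict, [0.0, 1.0, 2.0], sigma=0.1)
print(smoothed)  # close to the clean prediction [3.0, 4.0]
```

The cost is `n_samples` forward passes per prediction, which is why the paper emphasizes that the technique stays computationally inexpensive relative to adversarial training.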
[LG-18] Dynamics-Informed Deep Learning for Predicting Extreme Events
链接: https://arxiv.org/abs/2603.10777
作者: Eirini Katsidoniotaki,Themistoklis P. Sapsis
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)
*备注:
Abstract:Predicting extreme events in high-dimensional chaotic dynamical systems remains a fundamental challenge, as such events are rare, intermittent, and arise from transient dynamical mechanisms that are difficult to infer from limited observations. Accordingly, real-time forecasting calls for precursors that encode the mechanisms driving extremes, rather than relying solely on statistical associations. We propose a fully data-driven framework for long-lead prediction of extreme events that constructs interpretable, mechanism-aware precursors by explicitly tracking transient instabilities preceding event onset. The approach leverages a reduced-order formulation to compute finite-time Lyapunov exponent (FTLE)-like precursors directly from state snapshots, without requiring knowledge of the governing equations. To avoid the prohibitive computational cost of classical FTLE computation, instability growth is evaluated in an adaptively evolving low-dimensional subspace spanned by Optimal Time-Dependent (OTD) modes, enabling efficient identification of transiently amplifying directions. These precursors are then provided as input to a Transformer-based model, enabling forecast of extreme event observables. We demonstrate the framework on Kolmogorov flow, a canonical model of intermittent turbulence. The results show that explicitly encoding transient instability mechanisms substantially extends practical prediction horizons compared to baseline observable-based approaches.
[LG-19] Prioritizing Gradient Sign Over Modulus: An Importance-Aware Framework for Wireless Federated Learning
链接: https://arxiv.org/abs/2603.10763
作者: Yiyang Yue,Jiacheng Yao,Wei Xu,Zhaohui Yang,George K. Karagiannidis,Dusit Niyato
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注:
Abstract:Wireless federated learning (FL) facilitates collaborative training of artificial intelligence (AI) models to support ubiquitous intelligent applications at the wireless edge. However, the inherent constraints of limited wireless resources inevitably lead to unreliable communication, which poses a significant challenge to wireless FL. To overcome this challenge, we propose Sign-Prioritized FL (SP-FL), a novel framework that improves wireless FL by prioritizing the transmission of important gradient information through uneven resource allocation. Specifically, recognizing the importance of descent direction in model updating, we transmit gradient signs in individual packets and allow their reuse for gradient descent if the remaining gradient modulus cannot be correctly recovered. To further improve the reliability of transmission of important information, we formulate a hierarchical resource allocation problem based on the importance disparity at both the packet and device levels, optimizing bandwidth allocation across multiple devices and power allocation between sign and modulus packets. To make the problem tractable, the one-step convergence behavior of SP-FL, which characterizes data importance at both levels in an explicit form, is analyzed. We then propose an alternating optimization algorithm to solve this problem using the Newton-Raphson method and successive convex approximation (SCA). Simulation results confirm the superiority of SP-FL, especially in resource-constrained scenarios, demonstrating up to 9.96% higher testing accuracy on the CIFAR-10 dataset compared to existing methods.
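The packet split at the heart of SP-FL, with sign and modulus travelling separately and a sign-only fallback when the modulus packet is lost, can be sketched as follows (the fixed fallback magnitude of 1.0 is an illustrative choice, not from the paper):

```python
def split_gradient(grad):
    """Split a gradient into a sign packet and a modulus packet."""
    signs = [1 if g >= 0 else -1 for g in grad]
    moduli = [abs(g) for g in grad]
    return signs, moduli

def reassemble(signs, moduli, modulus_ok):
    """Rebuild the update at the server. If the modulus packet cannot
    be recovered, reuse the sign packet alone for sign-based descent."""
    if modulus_ok:
        return [s * m for s, m in zip(signs, moduli)]
    return [float(s) for s in signs]  # fallback magnitude: 1.0 (assumed)

g = [0.3, -1.2, 0.0]
s, m = split_gradient(g)
print(reassemble(s, m, modulus_ok=True))   # -> [0.3, -1.2, 0.0]
print(reassemble(s, m, modulus_ok=False))  # -> [1.0, -1.0, 1.0]
```

Allocating more transmit power to the sign packet than to the modulus packet then follows naturally from this asymmetry in importance.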
[LG-20] A PUF-Based Approach for Copy Protection of Intellectual Property in Neural Network Models
链接: https://arxiv.org/abs/2603.10753
作者: Daniel Dorfmeister,Flavio Ferrarotti,Bernhard Fischer,Martin Schwandtner,Hannes Sochor
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:More and more companies’ Intellectual Property (IP) is being integrated into Neural Network (NN) models. This IP has considerable value for companies and, therefore, requires adequate protection. For example, an attacker might replicate a production machine’s hardware and subsequently simply copy associated software and NN models onto the cloned hardware. To make copying NN models onto cloned hardware infeasible, we present an approach to bind NN models - and thus also the IP contained within them - to their underlying hardware. For this purpose, we link an NN model’s weights, which are crucial for its operation, to unique and unclonable hardware properties by leveraging Physically Unclonable Functions (PUFs). By doing so, sufficient accuracy can only be achieved using the target hardware to restore the original weights, rendering proper execution of the NN model on cloned hardware impossible. We demonstrate that our approach accomplishes the desired degradation of accuracy on various NN models and outline possible future improvements.
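The weight-binding idea can be sketched with a toy XOR scheme, where a hash-based keystream stands in for the PUF-derived key (everything below is illustrative; a real design needs fuzzy extraction to handle noisy PUF responses and a proper KDF):

```python
import hashlib

def _keystream(response, n):
    """Toy keystream expanded from a (hypothetical) PUF response."""
    out, counter = [], 0
    while len(out) < n:
        out.extend(hashlib.sha256(response + counter.to_bytes(4, "big")).digest())
        counter += 1
    return out[:n]

def bind_weights(weights, response, scale=1000):
    """Quantize the weights and XOR them with the device-unique keystream."""
    return [round(w * scale) ^ k
            for w, k in zip(weights, _keystream(response, len(weights)))]

def unbind_weights(bound, response, scale=1000):
    """Invert the binding; only the matching PUF response restores weights."""
    return [(b ^ k) / scale
            for b, k in zip(bound, _keystream(response, len(bound)))]

weights = [0.125, -0.5, 0.75]
device_puf = b"response-of-genuine-chip"   # hypothetical PUF readouts
clone_puf = b"response-of-cloned-chip"
bound = bind_weights(weights, device_puf)
print(unbind_weights(bound, device_puf))  # -> [0.125, -0.5, 0.75]
print(unbind_weights(bound, clone_puf))   # garbage on the clone
```

Because the clone's PUF yields a different response, the recovered weights are wrong and the model's accuracy collapses, which is the degradation the paper measures.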
[LG-21] A Grammar of Machine Learning Workflows
链接: https://arxiv.org/abs/2603.10742
作者: Simon Roth
类目: Machine Learning (cs.LG)
*备注: 37 pages, 1 figure, 15 tables. Three implementations: Python (PyPI: mlw), R (CRAN: ml), Julia. Code: this http URL
Abstract:Data leakage affected 294 published papers across 17 scientific fields (Kapoor & Narayanan, 2023). The dominant response has been documentation: checklists, linters, best-practice guides. Documentation does not prevent these failures. This paper proposes a structural remedy: a grammar that decomposes the supervised learning lifecycle into 7 kernel primitives connected by a typed directed acyclic graph (DAG), with four hard constraints that reject the two most damaging leakage classes at call time. The grammar’s core contribution is the terminal assess constraint: a runtime-enforced evaluate/assess boundary where repeated test-set assessment is rejected by a guard on a nominally distinct Evidence type. A companion study across 2,047 experimental instances quantifies why this matters: selection leakage inflates performance by d_z = 0.93 and memorization leakage by d_z = 0.53-1.11. Three separate implementations (Python, R, and Julia) confirm the claims. The appendix specification lets anyone build a conforming version.
[LG-22] Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks
链接: https://arxiv.org/abs/2603.10731
作者: Sanne Ruijs,Alina Kosiakova,Farrukh Javed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 30 pages, 39 figures
Abstract:Deep neural networks (DNNs) have become integral to a wide range of scientific and practical applications due to their flexibility and strong predictive performance. Despite their accuracy, however, DNNs frequently exhibit poor calibration, often assigning overly confident probabilities to incorrect predictions. This limitation underscores the growing need for integrated mechanisms that provide reliable uncertainty estimation. In this article, we compare two prominent approaches for uncertainty quantification: a Bayesian approximation via Monte Carlo Dropout and the nonparametric Conformal Prediction framework. Both methods are assessed using two convolutional neural network architectures: H-CNN VGG16 and GoogLeNet, trained on the Fashion-MNIST dataset. The empirical results show that although H-CNN VGG16 attains higher predictive accuracy, it tends to exhibit pronounced overconfidence, whereas GoogLeNet yields better-calibrated uncertainty estimates. Conformal Prediction additionally demonstrates consistent validity by producing statistically guaranteed prediction sets, highlighting its practical value in high-stakes decision-making contexts. Overall, the findings emphasize the importance of evaluating model performance beyond accuracy alone and contribute to the development of more reliable and trustworthy deep learning systems.
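The Monte Carlo Dropout side of the comparison is easy to sketch: keep dropout active at test time and read uncertainty off the spread of repeated stochastic passes. Below, a toy one-layer "network" with input dropout stands in for the paper's CNNs (weights and inputs are invented):

```python
import random
import statistics

def mc_dropout_predict(forward, x, p_drop, n_passes=100):
    """Run several stochastic forward passes with dropout enabled;
    return the mean prediction and its spread (an uncertainty proxy)."""
    preds = [forward(x, p_drop) for _ in range(n_passes)]
    return statistics.mean(preds), statistics.stdev(preds)

WEIGHTS = [0.8, -0.4, 0.6]

def forward(x, p_drop):
    """Toy model: weighted sum with inverted dropout on the inputs."""
    kept = [(xi if random.random() > p_drop else 0.0) / (1 - p_drop)
            for xi in x]
    return sum(w * k for w, k in zip(WEIGHTS, kept))

random.seed(42)
mean, std = mc_dropout_predict(forward, [1.0, 2.0, 3.0], p_drop=0.2)
print(f"prediction {mean:.2f} +/- {std:.2f}")
```

A well-calibrated model would show larger `std` on inputs far from the training distribution, which is what the article probes on Fashion-MNIST.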
[LG-23] CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems
链接: https://arxiv.org/abs/2603.10726
作者: Panagiotis Georgios Pennas,Konstantinos Papaioannou,Marco Guarnieri,Thaleia Dimitra Doudali
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) rely on optimizations like Automatic Prefix Caching (APC) to accelerate inference. APC works by reusing previously computed states for the beginning part of a request (prefix), when another request starts with the same text. While APC improves throughput, it introduces timing side channels: cache hits are faster than misses, creating observable latency differences. In multi-tenant systems, attackers can exploit these differences to infer sensitive information, e.g., by incrementally reconstructing another user’s request by observing hit/miss patterns. Current defenses take a sledgehammer approach: they disable APC and cache sharing, isolating users, and sacrificing efficiency for regular users. This paper presents CacheSolidarity, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. CacheSolidarity monitors cache reuse across users, flags suspicious sharing, and selectively isolates prefixes, restricting their reuse only when necessary. Evaluation shows that CacheSolidarity enables up to 70% higher cache reuse and 30% lower inference latency compared to existing defenses that isolate users. CacheSolidarity’s lightweight design demonstrates how security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.
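The selective-isolation idea can be sketched with a toy prefix cache (the class, fields, and policy below are an illustrative guess at the mechanism, not CacheSolidarity's actual design): a prefix flagged as sensitive produces cross-user misses, while benign prefixes stay shared and keep their performance benefit.

```python
class PrefixCache:
    """Toy prefix cache with selective per-prefix isolation."""

    def __init__(self):
        self.entries = {}  # prefix -> (owner, restricted)

    def insert(self, user, prefix, restricted=False):
        self.entries[prefix] = (user, restricted)

    def lookup(self, user, prefix):
        entry = self.entries.get(prefix)
        if entry is None:
            return "MISS"
        owner, restricted = entry
        if restricted and owner != user:
            return "MISS"  # isolated: no cross-user reuse, no timing leak
        return "HIT"

cache = PrefixCache()
cache.insert("alice", "My SSN is", restricted=True)
cache.insert("alice", "The capital of France", restricted=False)
print(cache.lookup("bob", "My SSN is"))             # -> MISS (isolated)
print(cache.lookup("bob", "The capital of France"))  # -> HIT  (shared)
```

The hard part, which the paper addresses, is deciding *which* prefixes to flag from observed cross-user reuse patterns rather than restricting everything.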
[LG-24] Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions
链接: https://arxiv.org/abs/2603.10721
作者: Kangke Cheng,Shihong Song,Guanlin Mo,Hu Ding
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we investigate the learning-augmented k-median clustering problem, which aims to improve the performance of traditional clustering algorithms by preprocessing the point set with a predictor of error rate \alpha \in [0,1). This preprocessing step assigns potential labels to the points before clustering. We introduce an algorithm for this problem based on a simple yet effective sampling method, which substantially improves upon the time complexities of existing algorithms. Moreover, we mitigate their exponential dependency on the dimensionality of the Euclidean space. Lastly, we conduct experiments to compare our method with several state-of-the-art learning-augmented k-median clustering methods. The experimental results suggest that our proposed approach can significantly reduce the computational complexity in practice, while achieving a lower clustering cost.
[LG-25] Riemannian MeanFlow for One-Step Generation on Manifolds
链接: https://arxiv.org/abs/2603.10718
作者: Zichen Zhong,Haoliang Sun,Yukun Zhao,Yongshun Gong,Yilong Yin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Flow Matching enables simulation-free training of generative models on Riemannian manifolds, yet sampling typically still relies on numerically integrating a probability-flow ODE. We propose Riemannian MeanFlow (RMF), extending MeanFlow to manifold-valued generation where velocities lie in location-dependent tangent spaces. RMF defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity that links average and instantaneous velocities for intrinsic supervision. We make this identity practical in a log-map tangent representation, avoiding trajectory simulation and heavy geometric computations. For stable optimization, we decompose the RMF objective into two terms and apply conflict-aware multi-task learning to mitigate gradient interference. RMF also supports conditional generation via classifier-free guidance. Experiments on spheres, tori, and SO(3) demonstrate competitive one-step sampling with improved quality-efficiency trade-offs and substantially reduced sampling cost.
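The log-map tangent representation the abstract mentions can be illustrated on the unit sphere, the simplest of the evaluated manifolds (a generic geometry sketch, not the RMF code): `log` lifts a target point into the tangent space at the current point, and `exp` walks back along the geodesic.

```python
import math

def sphere_exp(p, v):
    """Exponential map on the unit sphere: follow the geodesic from p
    in tangent direction v for geodesic distance |v|."""
    nv = math.sqrt(sum(x * x for x in v))
    if nv < 1e-12:
        return list(p)
    return [math.cos(nv) * pc + math.sin(nv) * vc / nv
            for pc, vc in zip(p, v)]

def sphere_log(p, q):
    """Log map: the tangent vector at p pointing toward q, with length
    equal to the geodesic distance between p and q."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    theta = math.acos(dot)
    if theta < 1e-12:
        return [0.0] * len(p)
    proj = [qi - dot * pc for pc, qi in zip(p, q)]
    n = math.sqrt(sum(x * x for x in proj))
    return [theta * x / n for x in proj]

p = [1.0, 0.0, 0.0]
q = [0.0, 1.0, 0.0]
v = sphere_log(p, q)         # tangent vector of length pi/2 along +y
back = sphere_exp(p, v)      # exp undoes log: back at q
print([round(x, 4) for x in back])
```

Predicting an *average* velocity in this tangent representation is what lets RMF jump from noise to data in one `exp` step instead of integrating an ODE.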
[LG-26] Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics
链接: https://arxiv.org/abs/2603.10678
作者: M. Lo Verso,C. Introini,E. Cervi,L. Savoldi,J. N. Kutz,A. Cammi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Magnetohydrodynamic (MHD) effects play a key role in the design and operation of nuclear fusion systems, where electrically conducting fluids (such as liquid metals or molten salts in reactor blankets) interact with magnetic fields of varying intensity and orientation, which affect the resulting flow. The numerical resolution of MHD models involves highly nonlinear multiphysics systems of equations and can become computationally expensive, particularly in multi-query, parametric, or real-time contexts. This work investigates a fully data-driven framework for MHD state reconstruction that combines dimensionality reduction via Singular Value Decomposition (SVD) with the SHallow REcurrent Decoder (SHRED), a neural network architecture designed to recover the full spatio-temporal state from sparse time-series measurements of a limited number of observables. The methodology is applied to a parametric MHD test case involving compressible lead-lithium flow in a stepped channel subjected to thermal gradients and magnetic fields spanning a broad range of intensities. To improve efficiency, the full-order dataset is first compressed using SVD, yielding a reduced representation used as reference truth for training. Only temperature measurements from three sensors are provided as input, while the network reconstructs the full fields of velocity, pressure, and temperature. To assess robustness with respect to sensor placement, thirty randomly generated sensor configurations are tested in ensemble mode. Results show that SHRED accurately reconstructs the full MHD state even for magnetic field intensities not included in the training set. These findings demonstrate the potential of SHRED as a computationally efficient surrogate modeling strategy for fusion-relevant multiphysics problems, enabling low-cost state estimation with possible applications in real-time monitoring and control.
[LG-27] Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention
链接: https://arxiv.org/abs/2603.10676
作者: Kosti Koistinen,Kirsi Hellsten,Joni Herttuainen,Kimmo K. Kaski
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 33 pages, 7 figures
Abstract:Industrial Control Systems (ICS) underpin critical infrastructure and face growing cyber-physical threats due to the convergence of operational technology and networked environments. While machine learning-based anomaly detection approaches in ICS show strong theoretical performance, deployment is often limited by poor explainability, high false-positive rates, and sensitivity to evolving system behavior, i.e., baseline drifting. We propose a Spatio-Temporal Attention Graph Neural Network (STA-GNN) for unsupervised and explainable anomaly detection in ICS that models both temporal dynamics and relational structure of the system. Sensors, controllers, and network entities are represented as nodes in a dynamically learned graph, enabling the model to capture inter-dependencies across physical processes and communication patterns. Attention mechanisms provide influential relationships, supporting inspection of correlations and potential causal pathways behind detected events. The approach supports multiple data modalities, including SCADA point measurements, network flow features, and payload features, and thus enables unified cyber-physical analysis. To address operational requirements, we incorporate a conformal prediction strategy to control false alarm rates and monitor performance degradation under drifting of the environment. Our findings highlight the possibilities and limitations of model evaluation and common pitfalls in anomaly detection in ICS, and emphasize the importance of explainable, drift-aware evaluation for reliable deployment of learning-based security monitoring systems.
[LG-28] Self-Scaled Broyden Family of Quasi-Newton Methods in JAX
链接: https://arxiv.org/abs/2603.10599
作者: Ivan Bioli,Mikel Mendibe Abarrategi
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG)
*备注:
Abstract:We present a JAX implementation of the Self-Scaled Broyden family of quasi-Newton methods, fully compatible with JAX and building on the Optimistix optimisation library [Rader et al., 2024]. The implementation includes BFGS, DFP, Broyden and their Self-Scaled variants (SSBFGS, SSDFP, SSBroyden), together with a Zoom line search satisfying the strong Wolfe conditions. This is a short technical note, not a research paper, as it does not claim any novel contribution; its purpose is to document the implementation and ease the adoption of these optimisers within the JAX community. The code is available at this https URL.
[LG-29] HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data
链接: https://arxiv.org/abs/2603.10582
作者: Jannis Maier,Lennart Purucker
类目: Machine Learning (cs.LG)
*备注: 10 pages (7 Appendix), 15 figures
Abstract:Ensembling is commonly used in machine learning on tabular data to boost predictive performance and robustness, but larger ensembles often lead to increased hardware demand. We introduce HAPEns, a post-hoc ensembling method that explicitly balances accuracy against hardware efficiency. Inspired by multi-objective and quality diversity optimization, HAPEns constructs a diverse set of ensembles along the Pareto front of predictive performance and resource usage. Existing hardware-aware post-hoc ensembling baselines are not available, highlighting the novelty of our approach. Experiments on 83 tabular classification datasets show that HAPEns significantly outperforms baselines, finding superior trade-offs for ensemble performance and deployment cost. Ablation studies also reveal that memory usage is a particularly effective objective metric. Further, we show that even a greedy ensembling algorithm can be significantly improved in this task with a static multi-objective weighting scheme.
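The Pareto-front construction over (predictive performance, resource usage) can be sketched as a simple dominance filter (ensemble names and numbers below are invented, and this is only the selection step, not the HAPEns search itself):

```python
def pareto_front(ensembles):
    """Keep ensembles not dominated in (higher accuracy, lower memory).

    Each entry is (name, accuracy, memory_mb). An ensemble is dominated
    if another one is at least as good on both objectives and strictly
    better on at least one.
    """
    front = []
    for name, acc, mem in ensembles:
        dominated = any(a >= acc and m <= mem and (a > acc or m < mem)
                        for _, a, m in ensembles)
        if not dominated:
            front.append((name, acc, mem))
    return sorted(front, key=lambda e: e[2])  # cheapest first

candidates = [
    ("E1", 0.90, 120),
    ("E2", 0.88, 40),
    ("E3", 0.91, 300),
    ("E4", 0.87, 90),   # dominated by E2: worse accuracy, more memory
]
print(pareto_front(candidates))
```

A practitioner then picks the point on the front matching their deployment budget rather than the single most accurate ensemble.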
[LG-30] Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context ICLR2026
链接: https://arxiv.org/abs/2603.10573
作者: Faris Chaudhry,Siddhant Gadkari
类目: Machine Learning (cs.LG)
*备注: Accepted at the Latent and Implicit Thinking Workshop (ICLR 2026)
Abstract:In-context learning (ICL) allows Transformers to adapt to novel tasks without weight updates, yet the underlying algorithms remain poorly understood. We adopt a statistical decision-theoretic perspective by investigating simple binary hypothesis testing, where the optimal policy is determined by the likelihood-ratio test. Notably, this setup provides a mathematically rigorous setting for mechanistic interpretability where the target algorithmic ground truth is known. By training Transformers on tasks requiring distinct geometries (linear shifted means vs. nonlinear variance estimation), we demonstrate that the models approximate the Bayes-optimal sufficient statistics from context up to some monotonic transformation, matching the performance of an ideal oracle estimator in nonlinear regimes. Leveraging this analytical ground truth, mechanistic analysis via logit lens and circuit alignment suggests that the model does not rely on a fixed kernel smoothing heuristic. Instead, it appears to adapt the point at which decisions become linearly decodable: exhibiting patterns consistent with a voting-style ensemble for linear tasks while utilizing a deeper sequential computation for nonlinear tasks. These findings suggest that ICL emerges from the construction of task-adaptive statistical estimators rather than simple similarity matching.
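For the shifted-mean Gaussian setting described above, the optimal likelihood-ratio test reduces to thresholding a sufficient statistic of the context (here the sample mean). A minimal sketch with made-up context samples, illustrating the algorithmic ground truth the Transformers are compared against:

```python
def gaussian_llr(xs, mu0, mu1, sigma):
    """Log-likelihood ratio for H1: N(mu1, sigma) vs H0: N(mu0, sigma).
    For shifted Gaussians this is a monotone function of the sample mean,
    so the sample mean is the sufficient statistic."""
    def loglik(mu):
        return sum(-0.5 * ((x - mu) / sigma) ** 2 for x in xs)
    return loglik(mu1) - loglik(mu0)

context = [0.8, 1.2, 0.9, 1.1]  # in-context samples shown to the model
llr = gaussian_llr(context, mu0=0.0, mu1=1.0, sigma=1.0)
print("decide H1" if llr > 0 else "decide H0")
```

The paper's claim is that trained Transformers recover such statistics from context up to a monotonic transformation; the nonlinear variance-estimation task swaps the sample mean for a quadratic statistic.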
[LG-31] Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation
链接: https://arxiv.org/abs/2603.10563
作者: Viktorija Poļaka,Ivo Pascal de Jong,Andreea Ioana Sburlea
类目: Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, 2 tables
Abstract:This paper addresses the challenge of generating synthetic electroencephalogram (EEG) covariance matrices for motor imagery brain-computer interface (MI-BCI) applications. Objective: We aim to develop a generative model capable of producing high-fidelity synthetic covariance matrices while preserving their symmetric positive-definite nature. Approach: We propose a Riemannian geometry-preserving variational autoencoder (RGP-VAE) integrating geometric mappings with a composite loss function combining Riemannian distance, tangent space reconstruction accuracy and generative diversity. Results: The model generates valid, representative EEG covariance matrices, while learning a subject-invariant latent space. Synthetic data proves practically useful for MI-BCI, with its impact depending on the paired classifier. Contribution: This work introduces and validates the RGP-VAE as a geometry-preserving generative model for EEG covariance matrices, highlighting its potential for signal privacy, scalability and data augmentation.
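Preserving the symmetric positive-definite (SPD) structure of EEG covariance matrices is the core constraint here. A standard geometric ingredient for such models is the matrix log/exp pair, which maps an SPD matrix to a flat tangent space and back; the sketch below shows that round trip with plain eigendecompositions. It illustrates only the general geometry, not the RGP-VAE architecture itself:

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition (maps to a tangent space)."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def spd_exp(T):
    """Matrix exponential of a symmetric matrix (maps back to the SPD cone)."""
    w, V = np.linalg.eigh(T)
    return (V * np.exp(w)) @ V.T

# Toy EEG-like covariance: a random SPD matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S = A @ A.T + 4 * np.eye(4)

T = spd_log(S)        # tangent-space representation (a plain symmetric matrix)
S_back = spd_exp(T)   # round trip recovers the SPD matrix
print(np.allclose(S, S_back))
```

A decoder emitting arbitrary symmetric tangent matrices and passing them through `spd_exp` always yields valid SPD outputs, which is one common way such generative models guarantee positive-definiteness.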
[LG-32] A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting
链接: https://arxiv.org/abs/2603.10559
作者: Jing Liu,Maria Grith,Xiaowen Dong,Mihai Cucuringu
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:
Abstract:This paper studies cross-market return predictability through a machine learning framework that preserves economic structure. Exploiting the non-overlapping trading hours of the U.S. and Chinese equity markets, we construct a directed bipartite graph that captures time-ordered predictive linkages between stocks across markets. Edges are selected via rolling-window hypothesis testing, and the resulting graph serves as a sparse, economically interpretable feature-selection layer for downstream machine learning models. We apply a range of regularized and ensemble methods to forecast open-to-close returns using lagged foreign-market information. Our results reveal a pronounced directional asymmetry: U.S. previous-close-to-close returns contain substantial predictive information for Chinese intraday returns, whereas the reverse effect is limited. This informational asymmetry translates into economically meaningful performance differences and highlights how structured machine learning frameworks can uncover cross-market dependencies while maintaining interpretability.
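Edge selection by rolling-window hypothesis testing can be sketched with a plain correlation t-test: keep the directed edge US→CN only if the correlation between lagged US returns and CN open-to-close returns is significant in the most recent window. The window length, threshold, and test statistic below are illustrative choices, not the paper's:

```python
import numpy as np

def lagged_edge(us_ret, cn_ret, window=60, t_crit=3.0):
    """Flag a directed predictive edge US -> CN via a correlation t-test
    over the most recent window. Illustrative only."""
    x, y = us_ret[-window:], cn_ret[-window:]
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((window - 2) / (1 - r ** 2))
    return bool(abs(t) > t_crit)

rng = np.random.default_rng(1)
us = rng.normal(size=200)                    # US previous-close-to-close returns
cn_linked = us + 0.1 * rng.normal(size=200)  # CN stock that loads on overnight US news
cn_noise = rng.normal(size=200)              # unrelated CN stock
print(lagged_edge(us, cn_linked), lagged_edge(us, cn_noise))
```

The surviving edges form a sparse bipartite graph whose adjacency doubles as the feature-selection mask for the downstream forecasting models.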
[LG-33] Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning
链接: https://arxiv.org/abs/2603.10545
作者: Martin Asenov,Qiwen Deng,Gingfung Yeung,Adam Barker
类目: Machine Learning (cs.LG)
*备注:
Abstract:Efficiently allocating incoming jobs to nodes in large-scale clusters can lead to substantial improvements in both cluster utilization and job performance. In order to allocate incoming jobs, cluster schedulers usually rely on a set of scoring functions to rank feasible nodes. Results from individual scoring functions are usually weighted equally, which could lead to sub-optimal deployments as the one-size-fits-all solution does not take into account the characteristics of each workload. Tuning the weights of scoring functions, however, requires expert knowledge and is computationally expensive. This paper proposes a reinforcement learning approach for learning the weights in scheduler scoring algorithms with the overall objective of improving the end-to-end performance of jobs for a given cluster. Our approach is based on percentage improvement reward, frame-stacking, and limiting domain information. We propose a percentage improvement reward to address the objective of multi-step parameter tuning. The inclusion of frame-stacking allows for carrying information across an optimization experiment. Limiting domain information prevents overfitting and improves performance in unseen clusters and workloads. The policy is trained on different combinations of workloads and cluster setups. We demonstrate the proposed approach improves performance on average by 33% compared to fixed weights and 12% compared to the best-performing baseline in a lab-based serverless scenario.
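The percentage improvement reward described above can be sketched as the relative change of a performance metric between consecutive tuning steps. This formulation is an illustrative reading of the idea, not necessarily the paper's exact definition:

```python
def percentage_improvement_reward(prev_metric, new_metric, minimize=True):
    """Reward as the relative improvement of a performance metric (e.g. mean
    job completion time) between consecutive tuning steps. Illustrative
    formulation; the paper's exact reward may differ."""
    if minimize:
        return (prev_metric - new_metric) / prev_metric
    return (new_metric - prev_metric) / prev_metric

# Tuning episode: each step re-weights the scoring functions and measures runtime
runtimes = [100.0, 90.0, 85.0, 88.0]
rewards = [percentage_improvement_reward(a, b) for a, b in zip(runtimes, runtimes[1:])]
print(rewards)  # positive when the new weights reduced runtime, negative otherwise
```

Normalizing by the previous value makes rewards comparable across clusters with very different absolute runtimes, which matters when the policy is trained on mixed workload/cluster combinations.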
[LG-34] World Model for Battery Degradation Prediction Under Non-Stationary Aging
链接: https://arxiv.org/abs/2603.10527
作者: Kai Chin Lim,Khay Wai See
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 18 pages, 3 figures
Abstract:Degradation prognosis for lithium-ion cells requires forecasting the state-of-health (SOH) trajectory over future cycles. Existing data-driven approaches can produce trajectory outputs through direct regression, but lack a mechanism to propagate degradation dynamics forward in time. This paper formulates battery degradation prognosis as a world model problem, encoding raw voltage, current, and temperature time-series from each cycle into a latent state and propagating it forward via a learned dynamics transition to produce a future trajectory spanning 80 cycles. To investigate whether electrochemical knowledge improves the learned dynamics, a Single Particle Model (SPM) constraint is incorporated into the training loss. Three configurations are evaluated on the Severson LiFePO4 (LFP) dataset of 138 cells. Iterative rollout halves the trajectory forecast error compared to direct regression from the same encoder. The SPM constraint improves prediction at the degradation knee, where the resistance-to-SOH relationship is most applicable, without changing aggregate accuracy.
[LG-35] A New Tensor Network: Tubal Tensor Train and Its Applications
链接: https://arxiv.org/abs/2603.10503
作者: Salman Ahmadi-Asl,Valentin Leplat,Anh-Huy Phan,Andrzej Cichocki
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:We introduce the tubal tensor train (TTT) decomposition, a tensor-network model that combines the t-product algebra of the tensor singular value decomposition (T-SVD) with the low-order core structure of the tensor train (TT) format. For an order-(N+1) tensor with a distinguished tube mode, the proposed representation consists of two third-order boundary cores and N-2 fourth-order interior cores linked through the t-product. As a result, for bounded tubal ranks, the storage scales linearly with the number of modes, in contrast to direct high-order extensions of T-SVD. We present two computational strategies: a sequential fixed-rank construction, called TTT-SVD, and a Fourier-slice alternating scheme based on the alternating two-cores update (ATCU). We also state a TT-SVD-type error bound for TTT-SVD and illustrate the practical performance of the proposed model on image compression, video compression, tensor completion, and hyperspectral imaging.
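The t-product underlying the TTT format multiplies third-order tensors slice-wise in the Fourier domain along the tube mode. A minimal NumPy sketch of that operation:

```python
import numpy as np

def t_product(A, B):
    """t-product of third-order tensors A (m x p x n3) and B (p x q x n3):
    frontal-slice matrix products after an FFT along the tube mode."""
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    # per-frequency matrix product: Cf[:, :, k] = Af[:, :, k] @ Bf[:, :, k]
    Cf = np.einsum('ipk,pjk->ijk', Af, Bf)
    return np.real(np.fft.ifft(Cf, axis=2))

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4, 5))
B = rng.normal(size=(4, 2, 5))
C = t_product(A, B)
print(C.shape)  # (3, 2, 5)
```

The identity element of this algebra is the tensor whose first frontal slice is the identity matrix and whose remaining slices are zero, which is a handy sanity check for any implementation.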
[LG-36] A Universal Nearest-Neighbor Estimator for Intrinsic Dimensionality
链接: https://arxiv.org/abs/2603.10493
作者: Eng-Jon Ong,Omer Bobrowski,Gesine Reinert,Primoz Skraba
类目: Machine Learning (cs.LG)
*备注:
Abstract:Estimating the intrinsic dimensionality (ID) of data is a fundamental problem in machine learning and computer vision, providing insight into the true degrees of freedom underlying high-dimensional observations. Existing methods often rely on geometric or distributional assumptions and can significantly fail when these assumptions are violated. In this paper, we introduce a novel ID estimator based on nearest-neighbor distance ratios that involves simple calculations and achieves state-of-the-art results. Most importantly, we provide a theoretical analysis proving that our estimator is universal, namely, it converges to the true ID independently of the distribution generating the data. We present experimental results on benchmark manifolds and real-world datasets to demonstrate the performance of our estimator.
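For context, the best-known estimator in this family is TwoNN (Facco et al.), which turns the ratio of each point's two nearest-neighbor distances into a maximum-likelihood ID estimate. The sketch below implements that classic baseline, not the estimator proposed in the paper:

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate from ratios of the two nearest
    neighbor distances (Facco et al.): the ratio mu = r2/r1 is Pareto with
    exponent d, giving the MLE d = N / sum(log mu)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    part = np.partition(D, 1, axis=1)   # two smallest distances per row
    r1, r2 = part[:, 0], part[:, 1]
    mu = r2 / r1
    return len(X) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))  # 2-D Gaussian cloud: true ID = 2
print(twonn_id(X))
```

Ratio-based estimators like this depend only on local neighborhood geometry, which is why they can be made robust to the global sampling distribution, the property the paper formalizes as universality.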
[LG-37] Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation
链接: https://arxiv.org/abs/2603.10474
作者: Ilseung Park(1),Eunsik Choi(2),Jangwhan Ahn(3),Jooeun Ahn(2) ((1) Carnegie Mellon University, (2) Seoul National University, (3) UNC-Chapel Hill and NC State University)
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
*备注: 12 pages, 5 figures
Abstract:Human locomotion emerges from high-dimensional neuromuscular control, making predictive musculoskeletal simulation challenging. We present a physiology-informed reinforcement-learning framework that constrains control using muscle synergies. We extracted a low-dimensional synergy basis from inverse musculoskeletal analyses of a small set of overground walking trials and used it as the action space for a muscle-driven three-dimensional model trained across variable speeds, slopes and uneven terrain. The resulting controller generated stable gait from 0.7-1.8 m/s and on ±6° grades and reproduced condition-dependent modulation of joint angles, joint moments and ground reaction forces. Compared with an unconstrained controller, synergy-constrained control reduced non-physiological knee kinematics and kept knee moment profiles within the experimental envelope. Across conditions, simulated vertical ground reaction forces correlated strongly with human measurements, and muscle-activation timing largely fell within inter-subject variability. These results show that embedding neurophysiological structure into reinforcement learning can improve biomechanical fidelity and generalization in predictive human locomotion simulation with limited experimental data.
[LG-38] Spatio-Temporal Forecasting of Retaining Wall Deformation: Mitigating Error Accumulation via Multi-Resolution ConvLSTM Stacking Ensemble
链接: https://arxiv.org/abs/2603.10453
作者: Jihoon Kim,Heejung Youn(Department of Civil and Environmental Engineering, Hongik University, Seoul, Republic of Korea)
类目: Machine Learning (cs.LG)
*备注: 16 pages, 19 figures
Abstract:This study proposes a multi-resolution Convolutional Long Short-Term Memory (ConvLSTM) ensemble framework that leverages diverse temporal input resolutions to mitigate error accumulation and improve long-horizon forecasting of retaining-structure behavior during staged excavation. An extensive database of lateral wall displacement responses was generated through PLAXIS2D simulations incorporating five-layered soil stratigraphy, two excavation depths (14 and 20 m), and stochastically varied geotechnical and structural parameters, yielding 2,000 time-series deflection profiles. Three ConvLSTM models trained at different input resolutions were integrated using a fully connected neural network meta-learner to construct the ensemble model. Validation using both numerical results and field measurements demonstrated that the ensemble approach consistently outperformed the standalone ConvLSTM models, particularly in long-term multi-step prediction, exhibiting reduced error propagation and improved generalization. These findings underscore the potential of multi-resolution ensemble strategies that jointly exploit diverse temporal input scales to enhance predictive stability and accuracy in AI-driven geotechnical forecasting.
[LG-39] GGMPs: Generalized Gaussian Mixture Processes
链接: https://arxiv.org/abs/2603.10442
作者: Vardaan Tekriwal,Mark D. Risser,Hengrui Luo,Marcus M. Noack
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Conditional density estimation is complicated by multimodality, heteroscedasticity, and strong non-Gaussianity. Gaussian processes (GPs) provide a principled nonparametric framework with calibrated uncertainty, but standard GP regression is limited by its unimodal Gaussian predictive form. We introduce the Generalized Gaussian Mixture Process (GGMP), a GP-based method for multimodal conditional density estimation in settings where each input may be associated with a complex output distribution rather than a single scalar response. GGMP combines local Gaussian mixture fitting, cross-input component alignment and per-component heteroscedastic GP training to produce a closed-form Gaussian mixture predictive density. The method is tractable, compatible with standard GP solvers and scalable methods, and avoids the exponentially large latent-assignment structure of naive multimodal GP formulations. Empirically, GGMPs improve distributional approximation on synthetic and real-world datasets with pronounced non-Gaussianity and multimodality.
[LG-40] Graph-GRPO: Training Graph Flow Models with Reinforcement Learning
链接: https://arxiv.org/abs/2603.10395
作者: Baoheng Zhu,Deyu Bo,Delvin Ce Zhang,Xiao Wang
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, a.k.a. the graph flow model (GFM), has emerged due to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) We derive an analytical expression for the transition probability of GFMs, replacing the Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) We propose a refinement strategy that randomly perturbs specific nodes and edges in a graph, and regenerates them, allowing for localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0% and 97.5% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on the molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.
[LG-41] Data-Driven Integration Kernels for Interpretable Nonlocal Operator Learning
链接: https://arxiv.org/abs/2603.10305
作者: Savannah L. Ferretti,Jerry Lin,Sara Shamekh,Jane W. Baldwin,Michael S. Pritchard,Tom Beucler
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 12 pages, 4 figures, 1 table
Abstract:Machine learning models can represent climate processes that are nonlocal in horizontal space, height, and time, often by combining information across these dimensions in highly nonlinear ways. While this can improve predictive skill, it makes learned relationships difficult to interpret and prone to overfitting as the extent of nonlocal information grows. We address this challenge by introducing data-driven integration kernels, a framework that adds structure to nonlocal operator learning by explicitly separating nonlocal information aggregation from local nonlinear prediction. Each spatiotemporal predictor field is first integrated using learnable kernels (defined as continuous weighting functions over horizontal space, height, and/or time), after which a local nonlinear mapping is applied only to the resulting kernel-integrated features and any optional local inputs. This design confines nonlinear interactions to a small set of integrated features and makes each kernel directly interpretable as a weighting pattern that reveals which horizontal locations, vertical levels, and past timesteps contribute most to the prediction. We demonstrate the framework for South Asian monsoon precipitation using a hierarchy of neural network models with increasing structure, including baseline, nonparametric kernel, and parametric kernel models. Across this hierarchy, kernel-based models achieve near-baseline performance with far fewer trainable parameters, showing that much of the relevant nonlocal information can be captured through a small set of interpretable integrations when appropriate structural constraints are imposed.
[LG-42] How to make the most of your masked language model for protein engineering ICLR2026
链接: https://arxiv.org/abs/2603.10302
作者: Calvin McCarter,Nick Bhattacharya,Sebastian W. Ober,Hunter Elliott
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Accepted into the GEM Workshop, ICLR 2026
Abstract:A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.
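The proposed sampler scores the entire 1-edit neighborhood of each beam member and advances the beam stochastically. The skeleton below sketches that loop with Gumbel-top-k sampling and a toy stand-in objective; the real objective would be the MLM pseudo-log-likelihood plus property guidance, and all names and hyperparameters here are illustrative:

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_edit_neighborhood(seq):
    """All sequences reachable by a single-position substitution."""
    return [seq[:i] + a + seq[i + 1:]
            for i in range(len(seq)) for a in AMINO_ACIDS if a != seq[i]]

def stochastic_beam_search(seq, score, steps=3, beam=4, temp=0.5, seed=0):
    """Stochastic beam search over 1-edit neighborhoods. `score` stands in
    for a sequence-level objective; Gumbel-perturbed top-k draws the next
    beam without replacement in proportion to exp(score / temp)."""
    rng = random.Random(seed)
    beam_seqs, best_seen = [seq], seq
    for _ in range(steps):
        cands = {s for b in beam_seqs for s in one_edit_neighborhood(b)}
        best_seen = max([best_seen, *cands], key=score)
        perturbed = sorted(
            ((score(s) / temp - math.log(-math.log(rng.random())), s) for s in cands),
            reverse=True)
        beam_seqs = [s for _, s in perturbed[:beam]]
    return best_seen

# Toy objective: count of alanines (stand-in for a real sequence scorer)
best = stochastic_beam_search("MKTV", score=lambda s: s.count("A"))
print(best)
```

Framing generation as whole-sequence evaluation, as the paper advocates, means any weighted sum of objectives can be dropped in as `score` without changing the search loop.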
[LG-43] What do near-optimal learning rate schedules look like?
链接: https://arxiv.org/abs/2603.10301
作者: Hiroki Naganuma,Atish Agarwala,Priya Kasimbeg,George E. Dahl
类目: Machine Learning (cs.LG)
*备注:
Abstract:A basic unanswered question in neural network training is: what is the best learning rate schedule shape for a given workload? The choice of learning rate schedule is a key factor in the success or failure of the training process, but beyond having some kind of warmup and decay, there is no consensus on what makes a good schedule shape. To answer this question, we designed a search procedure to find the best shapes within a parameterized schedule family. Our approach factors out the schedule shape from the base learning rate, which otherwise would dominate cross-schedule comparisons. We applied our search procedure to a variety of schedule families on three workloads: linear regression, image classification on CIFAR-10, and small-scale language modeling on Wikitext103. We showed that our search procedure indeed generally found near-optimal schedules. We found that warmup and decay are robust features of good schedules, and that commonly used schedule families are not optimal on these workloads. Finally, we explored how the outputs of our shape search depend on other optimization hyperparameters, and found that weight decay can have a strong effect on the optimal schedule shape. To the best of our knowledge, our results represent the most comprehensive results on near-optimal schedule shapes for deep neural network training, to date.
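A concrete member of the kind of parameterized family such a search would cover is the common warmup-then-cosine-decay shape. This is a generic example of a schedule shape, not one of the paper's reported near-optimal shapes:

```python
import math

def warmup_cosine_schedule(step, total_steps, warmup_steps, base_lr=1e-3, final_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to final_lr. The shape
    (warmup fraction, decay curve) is factored out from the base learning
    rate, mirroring the paper's separation of shape and scale."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))

lrs = [warmup_cosine_schedule(s, total_steps=100, warmup_steps=10) for s in range(101)]
print(max(lrs), lrs[-1])
```

Keeping `base_lr` as a separate argument is exactly the factorization the paper argues for: without it, cross-schedule comparisons are dominated by the base learning rate rather than the shape.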
[LG-44] Regime-aware financial volatility forecasting via in-context learning ICLR2026
链接: https://arxiv.org/abs/2603.10299
作者: Saba Asaad,Shayan Mohajer Hamidi,Ali Bereyhi
类目: Machine Learning (cs.LG)
*备注: 11 pages, 1 figure, Published as a conference paper at ICLR 2026 Workshop on Advances in Financial AI
Abstract:This work introduces a regime-aware in-context learning framework that leverages large language models (LLMs) for financial volatility forecasting under nonstationary market conditions. The proposed approach deploys pretrained LLMs to reason over historical volatility patterns and adjust their predictions without parameter fine-tuning. We develop an oracle-guided refinement procedure that constructs regime-aware demonstrations from training data. An LLM is then deployed as an in-context learner that predicts the next-step volatility from the input sequence using demonstrations sampled conditionally on the estimated market regime label. This conditional sampling strategy enables the LLM to adapt its predictions to regime-dependent volatility dynamics through contextual reasoning alone. Experiments with multiple financial datasets show that the proposed regime-aware in-context learning framework outperforms both classical volatility forecasting approaches and direct one-shot learning, especially during high-volatility periods.
[LG-45] GaLoRA: Parameter-Efficient Graph-Aware LLM s for Node Classification NEURIPS2025
链接: https://arxiv.org/abs/2603.10298
作者: Mayur Choudhary,Saptarshi Sengupta,Katerina Potika
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 figures, 11 tables, 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop
Abstract:The rapid rise of large language models (LLMs) and their ability to capture semantic relationships has led to their adoption in a wide range of applications. Text-attributed graphs (TAGs) are a notable example where LLMs can be combined with Graph Neural Networks to improve the performance of node classification. In TAGs, each node is associated with textual content, and such graphs are commonly seen in various domains such as social networks, citation graphs, recommendation systems, etc. Effectively learning from TAGs would enable better representations of both structural and textual aspects of the graph and improve decision-making in relevant domains. We present GaLoRA, a parameter-efficient framework that integrates structural information into LLMs. GaLoRA demonstrates competitive performance on node classification tasks with TAGs, performing on par with state-of-the-art models with just 0.24% of the parameter count required by full LLM fine-tuning. We experiment with three real-world datasets to showcase GaLoRA's effectiveness in combining structural and semantic information on TAGs.
[LG-46] Copula-ResLogit: A Deep-Copula Framework for Unobserved Confounding Effects
链接: https://arxiv.org/abs/2603.10284
作者: Kimia Kamal,Bilal Farooq
类目: Machine Learning (cs.LG)
*备注:
Abstract:A key challenge in travel demand analysis is the presence of unobserved factors that may generate non-causal dependencies, obscuring the true causal effects. To address this issue, the study introduces a novel, fully interpretable, deep-learning-based joint modelling framework, Copula-ResLogit, which integrates the flexibility of Residual Neural Network (ResNet) architectures with the dependence-capturing capabilities of copula models. This hybrid structure enables us to first detect unobserved confounding through traditional copula-function-based joint modelling and then mitigate these hidden associations by incorporating deep learning components. The study applies this framework to two case studies: the relationship between pedestrians' stress levels and wait times when crossing mid-block in VR, and the dependencies between travel mode choice and travel distance in London travel behaviour data. Results show that Copula-ResLogit substantially reduces or eliminates the dependencies, demonstrating the ability of residual layers to account for hidden confounding effects.
[LG-47] GSVD for Geometry-Grounded Dataset Comparison: An Alignment Angle Is All You Need ICLR2026
链接: https://arxiv.org/abs/2603.10283
作者: Eduarda de Souza Marques,Arthur Sobrinho Ferreira da Rocha,Joao Paixao,Heudson Mirandola,Daniel Sadoc Menasche
类目: Machine Learning (cs.LG)
*备注: 20 pages, GRaM workshop ICLR 2026
Abstract:Geometry-grounded learning asks models to respect structure in the problem domain rather than treating observations as arbitrary vectors. Motivated by this view, we revisit a classical but underused primitive for comparing datasets: linear relations between two data matrices, expressed via the co-span constraint Ax = By = z in a shared ambient space. To operationalize this comparison, we use the generalized singular value decomposition (GSVD) as a joint coordinate system for two subspaces. In particular, we exploit the GSVD form A = HCU, B = HSV with C^\top C + S^\top S = I, which separates shared versus dataset-specific directions through the diagonal structure of (C, S). From these factors we derive an interpretable angle score \theta(z) \in [0, \pi/2] for a sample z, quantifying whether z is explained relatively more by A, more by B, or comparably by both. The primary role of \theta(z) is as a per-sample geometric diagnostic. We illustrate the behavior of the score on MNIST through angle distributions and representative GSVD directions. A binary classifier derived from \theta(z) is presented as an illustrative application of the score as an interpretable diagnostic tool.
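The angle score can be illustrated without a full GSVD: compare the least-squares residuals of z against the column spaces of A and B and take the arctangent of their ratio. The paper derives θ(z) from the GSVD factors themselves; the residual-based version below only sketches the intended semantics (angle near 0 means "explained by A", near π/2 means "explained by B"):

```python
import numpy as np

def angle_score(A, B, z):
    """Illustrative angle score in [0, pi/2]: arctan of the ratio of
    least-squares residuals of z against col(A) and col(B). A small
    residual w.r.t. A and a large one w.r.t. B gives an angle near 0,
    and vice versa. Not the paper's GSVD-derived score."""
    rA = np.linalg.norm(z - A @ np.linalg.lstsq(A, z, rcond=None)[0])
    rB = np.linalg.norm(z - B @ np.linalg.lstsq(B, z, rcond=None)[0])
    return np.arctan2(rA, rB)  # both residuals are >= 0, so the result is in [0, pi/2]

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
B = rng.normal(size=(20, 3))
z_from_A = A @ rng.normal(size=3)   # lies in span(A)
z_from_B = B @ rng.normal(size=3)   # lies in span(B)
print(angle_score(A, B, z_from_A), angle_score(A, B, z_from_B))
```

Samples explained comparably by both subspaces land near π/4, which is what makes the score usable as a per-sample diagnostic rather than only an aggregate statistic.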
[LG-48] Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
链接: https://arxiv.org/abs/2603.10279
作者: Keertana Chidambaram,Sanath Kumar Krishnamurthy,Qiuling Xu,Ko-Jen Hsiao,Moumita Bhattacharya
类目: Machine Learning (cs.LG)
*备注:
Abstract:Aligning generative recommender systems to user preferences via post-training is critical for closing the gap between next-item prediction and actual recommendation quality. Existing post-training methods are ill-suited for production-scale systems: RLHF methods reward hack due to noisy user feedback and unreliable reward models, offline RL alternatives require propensity scores that are unavailable, and online interaction is infeasible. We identify exponential reward-weighted SFT with weights w = \exp(r/\lambda) as uniquely suited to this setting, and provide the theoretical and empirical foundations that explain why. By optimizing directly on observed rewards without querying a learned reward model, the method is immune to reward hacking, requires no propensity scores, and is fully offline. We prove the first policy improvement guarantees for this setting under noisy rewards, showing that the gap scales only logarithmically with catalog size and remains informative even for large item catalogs. Crucially, we show that temperature \lambda explicitly and quantifiably controls the robustness-improvement tradeoff, providing practitioners with a single interpretable regularization hyperparameter with theoretical grounding. Experiments on three open-source and one proprietary dataset against four baselines confirm that exponential reward weighting is simple, scalable, and consistently outperforms RLHF-based alternatives.
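The weighting itself is a one-liner. The sketch below also normalizes the weights to mean one, a common practical choice that is my assumption rather than something stated in the abstract, and shows how λ interpolates between sharp best-of selection and uniform SFT:

```python
import numpy as np

def exp_reward_weights(rewards, lam):
    """Per-sample SFT weights w = exp(r / lambda), normalized to mean 1 so
    the effective dataset size is unchanged (normalization is an assumed
    convenience, not from the paper). Smaller lambda sharpens toward the
    highest-reward samples; larger lambda approaches uniform SFT."""
    w = np.exp(np.asarray(rewards) / lam)
    return w / w.mean()

rewards = [0.1, 0.5, 0.9, 0.2]
for lam in (0.1, 1.0, 10.0):
    print(lam, np.round(exp_reward_weights(rewards, lam), 3))
```

Because the weights come directly from observed rewards, no learned reward model is ever queried during training, which is the mechanism behind the immunity to reward hacking claimed above.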
[LG-49] Estimating condition number with Graph Neural Networks
链接: https://arxiv.org/abs/2603.10277
作者: Erin Carson,Xinye Chen
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:In this paper, we propose a fast method for estimating the condition number of sparse matrices using graph neural networks (GNNs). To enable efficient training and inference of GNNs, our proposed feature engineering for GNNs achieves O(nnz + n), where nnz is the number of non-zero elements in the matrix and n denotes the matrix dimension. We propose two prediction schemes for estimating the matrix condition number using GNNs. The extensive experiments for the two schemes are conducted for 1-norm and 2-norm condition number estimation, which show that our method achieves a significant speedup over the Hager-Higham and Lanczos methods.
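Per-node features computable in O(nnz + n) from a CSR matrix can be extracted in a single pass over the stored entries. The specific feature set below (degree, absolute row sum, row maximum, diagonal) is a hypothetical illustration; the paper's exact features are not given in the abstract:

```python
import numpy as np
from scipy import sparse

def node_features(A_csr):
    """Per-row (per-node) features in O(nnz + n) from a CSR matrix:
    degree, absolute row sum, maximum absolute entry, and diagonal.
    Hypothetical feature set for illustration only."""
    deg = np.diff(A_csr.indptr)                 # nonzeros per row: O(n)
    abs_A = abs(A_csr)                          # elementwise |a_ij|: O(nnz)
    row_sum = np.asarray(abs_A.sum(axis=1)).ravel()
    row_max = abs_A.max(axis=1).toarray().ravel()
    diag = A_csr.diagonal()
    return np.column_stack([deg, row_sum, row_max, diag])

# Random sparse test matrix with a guaranteed nonzero diagonal
A = sparse.random(100, 100, density=0.05, format='csr', random_state=0) \
    + sparse.eye(100, format='csr')
F = node_features(A)
print(F.shape)  # one feature row per node/matrix row
```

Each feature is a local quantity of one row, so the matrix graph's node attributes can be built without any factorization or iteration, unlike Hager-Higham or Lanczos estimates.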
[LG-50] From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning
链接: https://arxiv.org/abs/2603.10263
作者: Zhanyi Sun,Shuran Song
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a “distribution contraction” operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing “pro” policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion- or flow-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot. Project website: this https URL.
[LG-51] Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals
链接: https://arxiv.org/abs/2603.10261
作者: Ihor Kendiukhov
类目: Machine Learning (cs.LG); Cell Behavior (q-bio.CB); Genomics (q-bio.GN)
*备注:
Abstract:We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT, to our knowledge the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduce a general three-stage extraction method consisting of direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP, the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5x faster with approximately 1000x fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head without statistically significant loss, and further to a rank-64 surrogate. Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.
[LG-52] Improving TabPFNs Synthetic Data Generation by Integrating Causal Structure
链接: https://arxiv.org/abs/2603.10254
作者: Davide Tugnoli,Andrea De Lorenzo,Marco Virgolin,Giovanni Cinà
类目: Machine Learning (cs.LG)
*备注: 8 pages main text, 30 pages total (including supplementary material), 27 figures. Code: this https URL
Abstract:Synthetic tabular data generation addresses data scarcity and privacy constraints in a variety of domains. Tabular Prior-Data Fitted Network (TabPFN), a recent foundation model for tabular data, has been shown capable of generating high-quality synthetic tabular data. However, TabPFN is autoregressive: features are generated sequentially by conditioning on the previous ones, depending on the order in which they appear in the input data. We demonstrate that when the feature order conflicts with causal structure, the model produces spurious correlations that impair its ability to generate synthetic data and preserve causal effects. We address this limitation by integrating causal structure into TabPFN’s generation process through two complementary approaches: Directed Acyclic Graph (DAG)-aware conditioning, which samples each variable given its causal parents, and a Completed Partially Directed Acyclic Graph (CPDAG)-based strategy for scenarios with partial causal knowledge. We evaluate these approaches on controlled benchmarks and six CSuite datasets, assessing structural fidelity, distributional alignment, privacy preservation, and Average Treatment Effect (ATE) preservation. Across most settings, DAG-aware conditioning improves the quality and stability of synthetic data relative to vanilla TabPFN. The CPDAG-based strategy shows moderate improvements, with effectiveness depending on the number of oriented edges. These results indicate that injecting causal structure into autoregressive generation enhances the reliability of synthetic tabular data.
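DAG-aware conditioning amounts to ancestral sampling: visit variables in a topological order and draw each one given only its causal parents. The sketch below uses toy hand-written conditional samplers where the paper would plug in TabPFN's conditional generation:

```python
import random

def dag_aware_sample(dag, samplers, seed=0):
    """Sample one synthetic row by ancestral sampling over a DAG.
    `dag` maps variable -> list of parents; `samplers` maps variable -> a
    function of (parent-value dict, rng). The toy samplers stand in for a
    learned conditional generator such as TabPFN."""
    rng = random.Random(seed)
    order, seen = [], set()

    def visit(v):                 # simple DFS topological sort (assumes acyclicity)
        if v in seen:
            return
        for p in dag[v]:
            visit(p)
        seen.add(v)
        order.append(v)

    for v in dag:
        visit(v)
    values = {}
    for v in order:
        parents = {p: values[p] for p in dag[v]}
        values[v] = samplers[v](parents, rng)
    return values

dag = {"x": [], "y": ["x"], "z": ["x", "y"]}
samplers = {
    "x": lambda par, rng: rng.gauss(0, 1),
    "y": lambda par, rng: 2 * par["x"] + rng.gauss(0, 0.1),
    "z": lambda par, rng: par["x"] + par["y"],
}
row = dag_aware_sample(dag, samplers)
print(row)
```

Because each variable conditions only on its causal parents rather than on whatever features happen to precede it in column order, the spurious correlations described above cannot arise from feature ordering.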
[LG-53] SiMPO: Measure Matching for Online Diffusion Reinforcement Learning
链接: https://arxiv.org/abs/2603.10250
作者: Haitong Ma,Chenxiao Gao,Tianyi Chen,Na Li,Bo Dai
类目: Machine Learning (cs.LG)
*备注: 22 pages, 6 figures
Abstract:A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over the behavior policy, which usually induces an over-greedy policy and fails to leverage feedback from negative samples. In this work, we introduce Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes the reweighting scheme in diffusion RL to general monotonic functions. SiMPO revisits diffusion RL via a two-stage measure matching lens. First, we construct a virtual target policy by f-divergence regularized policy optimization, where we can relax the non-negativity constraint to allow for a signed target measure. Second, we use this signed measure to guide diffusion or flow models through reweighted matching. This formulation offers two key advantages: a) it generalizes to arbitrary monotonically increasing weighting functions; and b) it provides a principled justification and practical guidance for negative reweighting. Furthermore, we provide geometric interpretations to illustrate how negative reweighting actively repels the policy from suboptimal actions. Extensive empirical evaluations demonstrate that SiMPO achieves superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting reweighting methods tailored to the reward landscape.
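The contrast between softmax reweighting and a signed monotone weighting can be sketched numerically (a toy illustration with invented advantage values, not the SiMPO training code):

```python
import numpy as np

# Minimal sketch of the reweighting idea (not the authors' SiMPO code):
# softmax reweighting assigns every behavior sample a positive weight,
# while a general monotone map may assign negative weight to bad samples,
# letting them actively repel the policy.
adv = np.array([-2.0, -0.5, 0.5, 2.0])    # hypothetical advantages

def softmax_w(a, beta=1.0):
    w = np.exp(beta * a)
    return w / w.sum()

def signed_w(a):                           # monotone, allows negative mass
    return a / np.abs(a).sum()

w_soft, w_signed = softmax_w(adv), signed_w(adv)
print(w_soft.min() > 0)        # True: negatives cannot be pushed away
print(w_signed[adv < 0])       # negative weights on bad actions
```

Both maps are monotonically increasing in the advantage, matching the abstract's generalization; only the signed one carries negative mass.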
[LG-54] Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces
链接: https://arxiv.org/abs/2603.10199
作者: Ji Gao,Caleb Ju,Guanghui Lan,Zhaohui Tong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Policy Dual Averaging (PDA) offers a principled Policy Mirror Descent (PMD) framework that more naturally admits value function approximation than standard PMD, enabling the use of approximate advantage (or Q-) functions while retaining strong convergence guarantees. However, applying PDA in continuous state and action spaces remains computationally challenging, since action selection involves solving an optimization sub-problem at each decision step. In this paper, we propose actor-accelerated PDA, which uses a learned policy network to approximate the solution of the optimization sub-problems, yielding faster runtimes while maintaining convergence guarantees. We provide a theoretical analysis that quantifies how actor approximation error impacts the convergence of PDA under suitable assumptions. We then evaluate its performance on several benchmarks in robotics, control, and operations research problems. Actor-accelerated PDA achieves superior performance compared to popular on-policy baselines such as Proximal Policy Optimization (PPO). Overall, our results bridge the gap between the theoretical advantages of PDA and its practical deployment in continuous-action problems with function approximation.
[LG-55] DT-BEHRT: Disease Trajectory-aware Transformer for Interpretable Patient Representation Learning
链接: https://arxiv.org/abs/2603.10180
作者: Deyi Li,Zijun Yao,Qi Xu,Muxuan Liang,Lingyao Li,Zijian Xu,Mei Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The growing adoption of electronic health record (EHR) systems has provided unprecedented opportunities for predictive modeling to guide clinical decision making. Structured EHRs contain longitudinal observations of patients across hospital visits, where each visit is represented by a set of medical codes. While sequence-based, graph-based, and graph-enhanced sequence approaches have been developed to capture rich code interactions over time or within the same visits, they often overlook the inherent heterogeneous roles of medical codes arising from distinct clinical characteristics and contexts. To this end, in this study we propose the Disease Trajectory-aware Transformer for EHR (DT-BEHRT), a graph-enhanced sequential architecture that disentangles disease trajectories by explicitly modeling diagnosis-centric interactions within organ systems and capturing asynchronous progression patterns. To further enhance the representation robustness, we design a tailored pre-training methodology that combines trajectory-level code masking with ontology-informed ancestor prediction, promoting semantic alignment across multiple modeling modules. Extensive experiments on multiple benchmark datasets demonstrate that DT-BEHRT achieves strong predictive performance and provides interpretable patient representations that align with clinicians’ disease-centered reasoning. The source code is publicly accessible at this https URL.
[LG-56] A neural operator for predicting vibration frequency response curves from limited data
链接: https://arxiv.org/abs/2603.10149
作者: D. Bluedorn,A. Badawy,B. E. Saunders,D. Roettgen,A. Abdelkefi
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:In the design of engineered components, rigorous vibration testing is essential for performance validation and identification of resonant frequencies and amplitudes encountered during operation. Performing this evaluation numerically via machine learning has great potential to accelerate design iteration and make testing workflows more efficient. However, dynamical systems are conventionally difficult to solve via machine learning methods without using physics-based regularizing loss functions. To properly perform this forecasting task, a structure with inspectable physical obedience can be devised without the use of regularizing terms from first principles. The method employed in this work is a neural operator integrated with an implicit numerical scheme. This architecture enables operators to learn the underlying state-space dynamics from limited data, allowing generalization to untested driving frequencies and initial conditions. This network can infer the system’s global frequency response by training on a small set of input conditions. As a foundational proof of concept, this investigation verifies the machine learning algorithm with a linear, single-degree-of-freedom system, demonstrating implicit obedience of dynamics. This approach achieves 99.87% accuracy in predicting the Frequency Response Curve (FRC), forecasting the frequency and amplitude of linear resonance while training on only 7% of the bandwidth of the solution. By training machine learning models to internalize physics information rather than trajectories, better generalization accuracy can be realized, vastly improving the timeframe for vibration studies on engineered components.
[LG-57] Denoising the US Census: Succinct Block Hierarchical Regression
链接: https://arxiv.org/abs/2603.10099
作者: Badih Ghazi,Pritish Kamath,Ravi Kumar,Pasin Manurangsi,Adam Sealfon
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:The US Census Bureau Disclosure Avoidance System (DAS) balances confidentiality and utility requirements for the decennial US Census (Abowd et al., 2022). The DAS was used in the 2020 Census to produce demographic datasets critically used for legislative apportionment and redistricting, federal and state funding allocation, municipal and infrastructure planning, and scientific research. At the heart of DAS is TopDown, a heuristic post-processing method that combines billions of private noisy measurements across six geographic levels in order to produce new estimates that are consistent, more accurate, and satisfy certain structural constraints on the data. In this work, we introduce BlueDown, a new post-processing method that produces more accurate, consistent estimates while satisfying the same privacy guarantees and structural constraints. We obtain especially large accuracy improvements for aggregates at the county and tract levels on evaluation metrics proposed by the US Census Bureau. From a technical perspective, we develop a new algorithm for generalized least-squares regression that leverages the hierarchical structure of the measurements and that is statistically optimal among linear unbiased estimators. This reduces the computational dependence on the number of geographic regions measured from matrix multiplication time, which would be infeasible for census-scale data, to linear time. We incorporate the additional structural constraints by combining this regression algorithm with an optimization routine that extends TopDown to support correlated measurements. We further improve the efficiency of our algorithm using succinct linear-algebraic operations that exploit symmetries in the structure of the measurements and constraints. We believe our hierarchical regression and succinct operations to be of independent interest.
[LG-58] Rethinking Adam for Time Series Forecasting: A Simple Heuristic to Improve Optimization under Distribution Shifts
链接: https://arxiv.org/abs/2603.10095
作者: Yuze Dong,Jinsong Wu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Time-series forecasting often faces challenges from non-stationarity, particularly distributional drift, where the data distribution evolves over time. This dynamic behavior can undermine the effectiveness of adaptive optimizers, such as Adam, which are typically designed for stationary objectives. In this paper, we revisit Adam in the context of non-stationary forecasting and identify that its second-order bias correction limits responsiveness to shifting loss landscapes. To address this, we propose TS_Adam, a lightweight variant that removes the second-order correction from the learning rate computation. This simple modification improves adaptability to distributional drift while preserving the optimizer’s core structure and requiring no additional hyperparameters. TS_Adam integrates easily into existing models and consistently improves performance across long- and short-term forecasting tasks. On the ETT datasets with the MICN model, it achieves an average reduction of 12.8% in MSE and 5.7% in MAE compared to Adam. These results underscore the practicality and versatility of TS_Adam as an effective optimization strategy for real-world forecasting scenarios involving non-stationary data. Code is available at: this https URL.
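Our reading of the described modification can be sketched as follows (a hypothetical implementation: only the second-moment bias correction is dropped from a textbook Adam update; the released TS_Adam code may differ in details):

```python
import numpy as np

# Sketch of the described modification (our reading; the released code may
# differ): Adam, but with the bias correction dropped from the second
# moment, so the effective step adapts faster to a drifting loss surface.
def ts_adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)                 # first-moment correction kept
    theta = theta - lr * m_hat / (np.sqrt(v) + eps)  # no v-hat correction
    return theta, m, v

# Sanity check: minimize f(x) = x^2 for a few steps.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = ts_adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)   # near 0 (bounded oscillation)
```

Because the raw second moment is small early in training (and after a drift), the denominator shrinks and steps enlarge, which is the adaptability mechanism the abstract points to.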
[LG-59] A Survey of Weight Space Learning: Understanding Representation and Generation
链接: https://arxiv.org/abs/2603.10090
作者: Xiaolong Han,Zehong Wang,Bo Zhao,Binchi Zhang,Jundong Li,Damian Borth,Rose Yu,Haggai Maron,Yanfang Ye,Lu Yin,Ferrante Neri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural network weights are typically viewed as the end product of training, while most deep learning research focuses on data, features, and architectures. However, recent advances show that the set of all possible weight values (weight space) itself contains rich structure: pretrained models form organized distributions, exhibit symmetries, and can be embedded, compared, or even generated. Understanding such structures has tremendous impact on how neural networks are analyzed and compared, and on how knowledge is transferred across models, beyond individual training instances. This emerging research direction, which we refer to as Weight Space Learning (WSL), treats neural weights as a meaningful domain for analysis and modeling. This survey provides the first unified taxonomy of WSL. We categorize existing methods into three core dimensions: Weight Space Understanding (WSU), which studies the geometry and symmetries of weights; Weight Space Representation (WSR), which learns embeddings over model weights; and Weight Space Generation (WSG), which synthesizes new weights through hypernetworks or generative models. We further show how these developments enable practical applications, including model retrieval, continual and federated learning, neural architecture search, and data-free reconstruction. By consolidating fragmented progress under a coherent framework, this survey highlights weight space as a learnable, structured domain with growing impact across model analysis, transferring, and weight generation. We release an accompanying resource at this https URL.
[LG-60] Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
链接: https://arxiv.org/abs/2603.10079
作者: Benjamin Gess,Daniel Heydecker
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:We analyse SGD training of a shallow, fully connected network in the NTK scaling and provide a quantitative theory of the catapult phase. We identify an explicit criterion separating two behaviours: when an explicit function G, depending only on the kernel, learning rate \eta and data, is positive, SGD produces large NTK-flattening spikes with high probability; when G < 0, their probability decays like (n/\eta)^{-\vartheta/2}, for an explicitly characterised \vartheta \in (0,\infty). This yields a concrete parameter-dependent explanation for why such spikes may still be observed at practical widths.
[LG-61] Stochastic Port-Hamiltonian Neural Networks: Universal Approximation with Passivity Guarantees
链接: https://arxiv.org/abs/2603.10078
作者: Luca Di Persio,Matthias Ehrhardt,Youness Outaleb
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:Stochastic port-Hamiltonian systems represent open dynamical systems with dissipation, inputs, and stochastic forcing in an energy-based form. We introduce stochastic port-Hamiltonian neural networks, SPH-NNs, which parameterize the Hamiltonian with a feedforward network and enforce skew symmetry of the interconnection matrix and positive semidefiniteness of the dissipation matrix. For Itô dynamics we establish a weak passivity inequality in expectation under an explicit generator condition, stated for a stopped process on a compact set. We also prove a universal approximation result showing that, on any compact set and finite horizon, SPH-NNs approximate the coefficients of a target stochastic port-Hamiltonian system with C^2 accuracy of the Hamiltonian and yield coupled solutions that remain close in mean square up to the exit time. Experiments on noisy mass-spring, Duffing, and Van der Pol oscillators show improved long horizon rollouts and reduced energy error relative to a multilayer perceptron baseline.
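The structural constraints named above (skew-symmetric interconnection, positive-semidefinite dissipation) can be checked in a few lines (a toy numerical sketch with a quadratic Hamiltonian, not the SPH-NN parameterization itself):

```python
import numpy as np

# Sketch of the structural parameterization (assumed from the abstract):
# the interconnection matrix is made skew-symmetric by construction and
# the dissipation matrix positive semidefinite, so the drift
# (J - R) grad H cannot inject energy in the unforced, noise-free case.
rng = np.random.default_rng(1)
n = 4
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
J = A - A.T          # skew-symmetric by construction
R = B @ B.T          # positive semidefinite by construction

H = lambda x: 0.5 * x @ x       # toy quadratic Hamiltonian
gradH = lambda x: x

x = rng.normal(size=n)
# Energy rate along the deterministic drift: gradH . (J - R) gradH
rate = gradH(x) @ (J - R) @ gradH(x)
print(rate <= 1e-12)   # True: skew part contributes 0, -R part is <= 0
```

In an SPH-NN, A and B would themselves be (possibly state-dependent) network outputs; the algebraic construction above is what guarantees the passivity-friendly sign structure regardless of the learned values.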
[LG-62] Quantization of Ricci Curvature in Information Geometry
链接: https://arxiv.org/abs/2603.10054
作者: Carlos C. Rodriguez
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 15 pages, 3 tables
Abstract:In 2004, while studying the information geometry of binary Bayesian networks (bitnets), the author conjectured that the volume-averaged Ricci scalar R computed with respect to the Fisher information metric is universally quantized to positive half-integers: R \in (1/2)\mathbb{Z}. This paper resolves the conjecture after 20 years. We prove it for tree-structured and complete-graph bitnets via a universal Beta function cancellation mechanism, and disprove it in general by exhibiting explicit loop counterexamples. We extend the program to Gaussian DAG networks, where a sign dichotomy holds: discrete bitnets have positive curvature, while Gaussian networks form solvable Lie groups with negative curvature.
[LG-63] Cluster-Aware Attention-Based Deep Reinforcement Learning for Pickup and Delivery Problems
链接: https://arxiv.org/abs/2603.10053
作者: Wentao Wang,Lifeng Han,Guangyu Zou
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Pickup and Delivery Problem (PDP) is a fundamental and challenging variant of the Vehicle Routing Problem, characterized by tightly coupled pickup–delivery pairs, precedence constraints, and spatial layouts that often exhibit clustering. Existing deep reinforcement learning (DRL) approaches either model all nodes on a flat graph, relying on implicit learning to enforce constraints, or achieve strong performance through inference-time collaborative search at the cost of substantial latency. In this paper, we propose CAADRL (Cluster-Aware Attention-based Deep Reinforcement Learning), a DRL framework that explicitly exploits the multi-scale structure of PDP instances via cluster-aware encoding and hierarchical decoding. The encoder builds on a Transformer and combines global self-attention with intra-cluster attention over depot, pickup, and delivery nodes, producing embeddings that are both globally informative and locally role-aware. Based on these embeddings, we introduce a Dynamic Dual-Decoder with a learnable gate that balances intra-cluster routing and inter-cluster transitions at each step. The policy is trained end-to-end with a POMO-style policy gradient scheme using multiple symmetric rollouts per instance. Experiments on synthetic clustered and uniform PDP benchmarks show that CAADRL matches or improves upon strong state-of-the-art baselines on clustered instances and remains highly competitive on uniform instances, particularly as problem size increases. Crucially, our method achieves these results with substantially lower inference time than neural collaborative-search baselines, suggesting that explicitly modeling cluster structure provides an effective and efficient inductive bias for neural PDP solvers.
[LG-64] OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies
链接: https://arxiv.org/abs/2603.10052
作者: Yunzhou Song,Long Le,Yong-Hyun Park,Jie Wang,Junyao Shi,Lingjie Liu,Jiatao Gu,Eric Eaton,Dinesh Jayaraman,Kostas Daniilidis
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project Page: this https URL
Abstract:Vision-language-action (VLA) models have shown great promise as generalist policies for a large range of relatively simple tasks. However, they demonstrate limited performance on more complex tasks, such as those requiring complex spatial or semantic understanding, manipulation in clutter, or precise manipulation. We propose OMNIGUIDE, a flexible framework that improves VLA performance on such tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic-reasoning VLMs, and human pose models. We show how many kinds of guidance can be naturally expressed as differentiable energy functions with task-specific attractors and repellers located in 3D space that influence the sampling of VLA actions. In this way, OMNIGUIDE enables guidance sources with complementary task-relevant strengths to improve a VLA model’s performance on challenging tasks. Extensive experiments in both simulation and real-world environments, across diverse sources of guidance, demonstrate that OMNIGUIDE significantly enhances the performance of state-of-the-art generalist policies (e.g., \pi_0.5, GR00T N1.6) across success and safety rates. Critically, our unified framework matches or surpasses the performance of prior methods designed to incorporate specific sources of guidance into VLA policies. Project Page: this https URL
[LG-65] Evaluating Generalization Mechanisms in Autonomous Cyber Attack Agents
链接: https://arxiv.org/abs/2603.10041
作者: Ondřej Lukáš,Jihoon Shin,Emilia Rivas,Diego Forni,Maria Rigaki,Carlos Catania,Aritran Piplai,Christopher Kiekintveld,Sebastian Garcia
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Autonomous offensive agents often fail to transfer beyond the networks on which they are trained. We isolate a minimal but fundamental shift – unseen host/subnet IP reassignment in an otherwise fixed enterprise scenario – and evaluate attacker generalization in the NetSecGame environment. Agents are trained on five IP-range variants and tested on a sixth unseen variant; only the meta-learning agent may adapt at test time. We compare three agent families (traditional RL, adaptation agents, and LLM-based agents) and use action-distribution-based behavioral/XAI analyses to localize failure modes. Some adaptation methods show partial transfer but significant degradation under unseen reassignment, indicating that even address-space changes can break long-horizon attack policies. Under our evaluation protocol and agent-specific assumptions, prompt-driven pretrained LLM agents achieve the highest success on the held-out reassignment, but at the cost of increased inference-time compute, reduced transparency, and practical failure modes such as repetition/invalid-action loops.
[LG-66] Tureis: Transformer-based Unified Resilience for IoT Devices in Smart Homes
链接: https://arxiv.org/abs/2603.10038
作者: Alireza Borhani,Vafa Andalibi,Bahar Asgari
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:Smart-home IoT systems rely on heterogeneous sensor networks whose correctness shapes application behavior and the physical environment. However, these low-cost, resource-constrained sensors are highly prone to failure under real-world stressors. Prior methods often assume single-failure, single-resident settings, offer only failure detection rather than sensor-level localization, cover limited fault types and sensor modalities, require labels and human intervention, or impose overheads hindering edge deployment. To overcome these limitations, we propose Tureis, a self-supervised, context-aware method for failure detection and faulty-sensor localization in smart homes, designed for multi-failure, multi-resident edge settings. Tureis encodes heterogeneous binary and numeric sensor streams into compact bit-level features. It then trains a lightweight BERT-style Transformer with sensor-wise masked reconstruction over short-horizon windows, capturing spatial and short-term temporal correlations without mixing unrelated events. This self-supervised objective removes the need for labels or curated semantics. Then, at run-time, Tureis converts reconstruction residuals into sensor-level failure evidence and uses an iterative isolate-and-continue loop that masks flagged sensors, allowing other failures to surface and enabling resilient, fine-grained localization. Across five datasets with up to nine residents, Tureis improves single-failure localization F1 by +7.6%, +21.0%, and +25.0% over three strong baselines. In multi-failure scenarios with up to five faulty sensors, it further boosts localization F1 by +17.6% and +35.4% over two baselines, while the third does not extend to this setting. These gains come with minute-scale localization and an edge-friendly footprint, as a sub-megabyte model that processes each minute of data in a few milliseconds with ~0.5 GB peak memory on a Raspberry Pi 5.
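The isolate-and-continue loop can be mimicked with a deliberately crude stand-in for the Transformer's reconstruction (a toy sketch; the readings, threshold, and mean-of-others "reconstruction" are all invented for illustration):

```python
import numpy as np

# Toy sketch of the isolate-and-continue idea (not the Tureis model):
# reconstruct each sensor from the remaining unmasked ones, flag the worst
# residual, mask it, and repeat so one dominant fault cannot hide another.
readings = np.array([1.0, 1.1, 0.9, 5.0, 1.05, -3.0])  # sensors 3, 5 faulty

def localize(x, threshold=1.0):
    masked, active = [], list(range(len(x)))
    while len(active) > 1:
        # Crude "reconstruction": predict each sensor as the mean of the
        # other unmasked sensors; the residual is the failure evidence.
        resid = {i: abs(x[i] - np.mean([x[j] for j in active if j != i]))
                 for i in active}
        worst = max(resid, key=resid.get)
        if resid[worst] < threshold:
            break
        masked.append(worst)            # isolate the flagged sensor ...
        active.remove(worst)            # ... and continue on the rest
    return sorted(masked)

print(localize(readings))   # prints [3, 5]
```

Note how masking the first fault lets the second one surface: with both faults present, each inflates the other's "reconstruction", which is exactly the multi-failure masking effect the loop is designed to break.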
[LG-67] LWM-Temporal: Sparse Spatio-Temporal Attention for Wireless Channel Representation Learning
链接: https://arxiv.org/abs/2603.10024
作者: Sadjad Alikhani,Akshay Malhotra,Shahab Hamidi-Rad,Ahmed Alkhateeb
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: LWM resources are publicly available at this https URL
Abstract:LWM-Temporal is a new member of the Large Wireless Models (LWM) family that targets the spatiotemporal nature of wireless channels. Designed as a task-agnostic foundation model, LWM-Temporal learns universal channel embeddings that capture mobility-induced evolution and are reusable across various downstream tasks. To achieve this objective, LWM-Temporal operates in the angle-delay-time domain and introduces Sparse Spatio-Temporal Attention (SSTA), a propagation-aligned attention mechanism that restricts interactions to physically plausible neighborhoods, reducing attention complexity by an order of magnitude while preserving geometry-consistent dependencies. LWM-Temporal is pretrained in a self-supervised manner using a physics-informed masking curriculum that emulates realistic occlusions, pilot sparsity, and measurement impairments. Experimental results on channel prediction across multiple mobility regimes show consistent improvements over strong baselines, particularly under long horizons and limited fine-tuning data, highlighting the importance of geometry-aware architectures and geometry-consistent pretraining for learning transferable spatiotemporal wireless representations.
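The neighborhood-restricted attention pattern can be sketched as a boolean mask over a small angle-delay grid (assumed geometry and window size, not the LWM-Temporal implementation):

```python
import numpy as np

# Sketch of neighborhood-restricted attention (assumed geometry; not the
# LWM-Temporal code): tokens live on an angle-delay grid and may only
# attend within a small local window, shrinking the n^2 attention pattern.
def sparse_mask(n_angle, n_delay, radius=1):
    coords = np.array([(a, d) for a in range(n_angle)
                       for d in range(n_delay)])
    diff = np.abs(coords[:, None, :] - coords[None, :, :])
    return (diff <= radius).all(axis=-1)   # Chebyshev neighborhood

mask = sparse_mask(8, 8, radius=1)
n = mask.shape[0]
print(mask.sum() / n**2)   # fraction of allowed pairs << 1
```

With radius 1 each token attends to at most a 3x3 window, so only 484 of the 4096 token pairs (about 12%) survive; applying such a mask before the softmax is the standard way to realize a sparse attention pattern like SSTA's.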
[LG-68] Bayesian Optimization with Gaussian Processes to Accelerate Stationary Point Searches
链接: https://arxiv.org/abs/2603.10992
作者: Rohit Goswami(1) ((1) Institute IMX and Lab-COSMO, École polytechnique fédérale de Lausanne (EPFL), Station 12, CH-1015 Lausanne, Switzerland)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注: 57 pages, 22 figures. Invited article for ACS Physical Chemistry Au
Abstract:Accelerating the exploration of stationary points on potential energy surfaces by building local surrogates spans decades of effort. Done correctly, surrogates reduce required evaluations by an order of magnitude while preserving the accuracy of the underlying theory. We present a unified Bayesian Optimization view of minimization, single-point saddle searches, and double-ended saddle searches through a common six-step surrogate loop, differing only in the inner optimization target and acquisition criterion. The framework uses Gaussian process regression with derivative observations, inverse-distance kernels, and active learning. The Optimal Transport GP extensions, namely farthest point sampling with the Earth mover’s distance, MAP regularization via a variance barrier and oscillation detection, and an adaptive trust radius, are concrete refinements of the same basic methodology, improving accuracy and efficiency. We also demonstrate that random Fourier features decouple hyperparameter training from predictions, enabling favorable scaling for high-dimensional systems. Accompanying pedagogical Rust code demonstrates that all applications use the exact same Bayesian optimization loop, bridging the gap between theoretical formulation and practical execution.
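A minimal instance of the surrogate loop for minimization might look like this (a plain GP without derivative observations, an LCB acquisition over a grid, and an invented 1-D objective, so it is a sketch of the loop's shape rather than the paper's Rust implementation):

```python
import numpy as np

# Minimal surrogate loop for minimization (illustrative only): fit a GP,
# minimize an acquisition on the surrogate, evaluate, and repeat.
f = lambda x: (x - 0.3) ** 2          # toy objective, minimum at 0.3
grid = np.linspace(-1, 1, 201)

def gp_posterior(Xt, yt, Xs, ls=0.3, noise=1e-6):
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)
    K = k(Xt, Xt) + noise * np.eye(len(Xt))
    Ks, Kss = k(Xt, Xs), k(Xs, Xs)
    mu = Ks.T @ np.linalg.solve(K, yt)
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mu, np.maximum(var, 0)

X = np.array([-0.8, 0.0, 0.9])                # step 1: initial design
y = f(X)
for _ in range(8):                            # surrogate loop
    mu, var = gp_posterior(X, y, grid)        # step 2: fit / predict
    lcb = mu - 2.0 * np.sqrt(var)             # step 3: acquisition
    x_new = grid[np.argmin(lcb)]              # step 4: inner optimization
    X, y = np.append(X, x_new), np.append(y, f(x_new))  # step 5: evaluate
print(X[np.argmin(y)])   # near the true minimizer 0.3
```

Per the abstract, the saddle-search variants would keep this exact loop and swap only the inner optimization target (e.g., maximize along one mode) and the acquisition criterion.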
[LG-69] ForwardFlow: Simulation only statistical inference using deep learning
链接: https://arxiv.org/abs/2603.10991
作者: Stefan Böhringer
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computation (stat.CO)
*备注:
Abstract:Deep learning models are being used for the analysis of parametric statistical models based on simulation-only frameworks. Bayesian models using normalizing flows simulate data from a prior distribution and are composed of two deep neural networks: a summary network that learns a sufficient statistic for the parameter, and a normalizing flow that, conditional on the summary network, can approximate the posterior distribution. Here, we explore frequentist models that are based on a single summary network. During training, the input to the network is a simulated data set based on a parameter, and the loss function minimizes the mean-square error between the learned summary and the parameter. The network thereby solves the inverse problem of parameter estimation. We propose a branched network structure that contains collapsing layers that reduce a data set to summary statistics, which are further mapped through fully connected layers to approximate the parameter estimate. We motivate our choice of network structure by theoretical considerations. In simulations we demonstrate three desirable properties of parameter estimates: finite sample exactness, robustness to data contamination, and algorithm approximation. These properties are achieved by offering the network varying sample sizes, contaminated data, and data needing algorithmic reconstruction during the training phase. In our simulations, an EM-algorithm for genetic data is automatically approximated by the network. Simulation-only approaches seem to offer practical advantages in complex modeling tasks, where the simpler data simulation part is left to the researcher and the more complex problem of solving the inverse problem is left to the neural network. Challenging future work includes offering pre-trained models that can be used in a wide variety of applications.
[LG-70] Kernel Tests of Equivalence
链接: https://arxiv.org/abs/2603.10886
作者: Xing Liu,Axel Gandy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 29 pages; 6 figures
Abstract:We propose novel kernel-based tests for assessing the equivalence between distributions. Traditional goodness-of-fit testing is inappropriate for concluding the absence of distributional differences, because failure to reject the null hypothesis may simply be a result of lack of test power, also known as the Type-II error. This motivates equivalence testing, which aims to assess the absence of a statistically meaningful effect under controlled error rates. However, existing equivalence tests are either limited to parametric distributions or focus only on specific moments rather than the full distribution. We address these limitations using two kernel-based statistical discrepancies: the kernel Stein discrepancy and the Maximum Mean Discrepancy. The null hypothesis of our proposed tests assumes the candidate distribution differs from the nominal distribution by at least a pre-defined margin, which is measured by these discrepancies. We propose two approaches for computing the critical values of the tests, one using an asymptotic normality approximation, and another based on bootstrapping. Numerical experiments are conducted to assess the performance of these tests.
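The equivalence-testing logic can be sketched with the MMD (a simplified bootstrap illustration with an invented margin and kernel length-scale; the paper derives properly calibrated critical values):

```python
import numpy as np

# Simplified sketch of the MMD-based equivalence logic: declare
# equivalence only if an MMD upper confidence bound falls below a
# pre-declared margin, reversing the usual goodness-of-fit burden of
# proof. Margin, kernel, and bootstrap counts are invented here.
rng = np.random.default_rng(0)

def mmd2_unbiased(x, y, ls=1.0):
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)
    Kxx, Kyy, Kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

x, y = rng.normal(size=400), rng.normal(size=400)   # same distribution
stats = [mmd2_unbiased(rng.choice(x, 400), rng.choice(y, 400))
         for _ in range(100)]                       # bootstrap resamples
upper = np.quantile(stats, 0.95)
margin = 0.05                        # pre-declared equivalence margin
print(upper < margin)                # True: equivalence is declared
```

Because the null hypothesis is "the MMD is at least the margin", failing to reject here is not evidence of equivalence; only an upper bound below the margin is, which is the reversal that distinguishes equivalence testing from goodness-of-fit testing.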
[LG-71] ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning
链接: https://arxiv.org/abs/2603.10823
作者: Xiaofeng Lin,Seungbae Kim,Zhuoya Li,Zachary DeSoto,Charles Fleming,Guang Cheng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Deep generative models can help with data scarcity and privacy by producing synthetic training data, but in low-data, imbalanced tabular settings they struggle to fully learn the complex data distribution. We argue that striving for the full joint distribution could be overkill; for greater data efficiency, models should prioritize learning the conditional distribution P(y \mid X), as suggested by recent theoretical analysis. Therefore, we overcome this limitation with ReTabSyn, a Reinforced Tabular Synthesis pipeline that provides direct feedback on feature correlation preservation during synthesizer training. This objective encourages the generator to prioritize the most useful predictive signals when training data is limited, thereby strengthening downstream model utility. We empirically fine-tune a language model-based generator using this approach, and across benchmarks with small sample sizes, class imbalance, and distribution shift, ReTabSyn consistently outperforms state-of-the-art baselines. Moreover, our approach can be readily extended to control various aspects of synthetic tabular data, such as applying expert-specified constraints on generated observations.
[LG-72] Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context
链接: https://arxiv.org/abs/2603.10623
作者: Yuanbo Hou,Yanru Wu,Qiaoqiao Ren,Shengchen Li,Stephen Roberts,Dick Botteldooren
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:
Abstract:Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to separate from waveforms alone. In such cases, disambiguating cues often lie outside the waveform. Geospatial semantic context (GSC), derived from geographic information system data, e.g., points of interest (POI), provides location-tied environmental priors that can help reduce this ambiguity. A systematic study of this direction is enabled through the proposed geospatial audio tagging (Geo-AT) task, which conditions multi-label sound event tagging on GSC alongside audio. To benchmark Geo-AT, Geo-ATBench is introduced as a polyphonic audio benchmark with geographical annotations, containing 10.71 hours of audio across 28 event categories; each clip is paired with a GSC representation from 11 semantic context categories. GeoFusion-AT is proposed as a unified geo-audio fusion framework that evaluates feature-, representation-, and decision-level fusion on representative audio backbones, with audio- and GSC-only baselines. Results show that incorporating GSC improves AT performance, especially on acoustically confounded labels, indicating geospatial semantics provide effective priors beyond audio alone. A crowdsourced listening study with 10 participants on 579 samples shows that there is no significant difference in performance between models on Geo-ATBench labels and aggregated human labels, supporting Geo-ATBench as a human-aligned benchmark. The Geo-AT task, benchmark Geo-ATBench, and reproducible geo-audio fusion framework GeoFusion-AT provide a foundation for studying AT with geospatial semantic context within the CASA community. Dataset, code, models are on homepage (this https URL).
[LG-73] Quantization Robustness of Monotone Operator Equilibrium Networks
链接: https://arxiv.org/abs/2603.10562
作者: James Li,Philip H.W. Leong,Thomas Chaffey
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 6 pages, 4 figures. Submitted to IEEE Control Systems Letters (L-CSS)
Abstract:Monotone operator equilibrium networks are implicit-layer models whose output is the unique equilibrium of a monotone operator, guaranteeing existence, uniqueness, and convergence. When deployed on low-precision hardware, weights are quantized, potentially destroying these guarantees. We analyze weight quantization as a spectral perturbation of the underlying monotone inclusion. Convergence of the quantized solver is guaranteed whenever the spectral-norm weight perturbation is smaller than the monotonicity margin; the displacement between quantized and full-precision equilibria is bounded in terms of the perturbation size and margin; and a condition number characterizing the ratio of the operator norm to the margin links quantization precision to forward error. MNIST experiments confirm a phase transition at the predicted threshold: three- and four-bit post-training quantization diverge, while five-bit and above converge. The backward-pass guarantee enables quantization-aware training, which recovers provable convergence at four bits.
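The sufficient condition described in the abstract is easy to probe numerically. The sketch below (Python, not code from the paper; the uniform quantizer, the random matrix, and the margin value are all illustrative assumptions) quantizes a weight matrix at several bit widths and checks whether the spectral-norm perturbation stays below a hypothetical monotonicity margin:

```python
import numpy as np

def quantize(W, bits):
    """Uniform symmetric post-training quantization of a weight matrix."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

def convergence_certificate(W, bits, margin):
    """Sufficient condition from the abstract: the quantized solver is
    guaranteed to converge when the spectral-norm weight perturbation
    ||quantize(W) - W||_2 is below the monotonicity margin."""
    perturbation = np.linalg.norm(quantize(W, bits) - W, 2)
    return perturbation, perturbation < margin

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) / 8.0   # toy stand-in for trained weights
margin = 0.5                              # hypothetical monotonicity margin
for bits in (3, 5, 8):
    gap, ok = convergence_certificate(W, bits, margin)
    print(f"{bits}-bit: ||dW||_2 = {gap:.4f}, certified: {ok}")
```

Coarser quantization produces a larger spectral perturbation, so the certificate tends to fail at low bit widths and pass at higher ones, mirroring the phase transition the abstract reports.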
[LG-74] Dual Space Preconditioning for Gradient Descent in the Overparameterized Regime
链接: https://arxiv.org/abs/2603.10485
作者: Reza Ghane,Danil Akhtiamov,Babak Hassibi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this work we study the convergence properties of Dual Space Preconditioned Gradient Descent, encompassing optimizers such as Normalized Gradient Descent, Gradient Clipping and Adam. We consider preconditioners of the form \nabla K , where K: \mathbb{R}^p \to \mathbb{R} is convex, and assume that the latter is applied to train an over-parameterized linear model with loss of the form \ell(XW - Y) , for weights W \in \mathbb{R}^{d \times k} , labels Y \in \mathbb{R}^{n \times k} and data X \in \mathbb{R}^{n \times d} . Under the aforementioned assumptions, we prove that the iterates of the preconditioned gradient descent always converge to a point W_\infty \in \mathbb{R}^{d \times k} satisfying XW_\infty = Y . Our proof techniques are of independent interest as we introduce a novel version of the Bregman Divergence with accompanying identities that allow us to establish convergence. We also study the implicit bias of Dual Space Preconditioned Gradient Descent. First, we demonstrate empirically that, for general K(\cdot) , W_\infty depends on the chosen learning rate, hindering a precise characterization of the implicit bias. Then, for preconditioners of the form K(G) = h(\|G\|_F) , known as \emph{isotropic} preconditioners, we show that W_\infty minimizes \|W_\infty - W_0\|_F^2 subject to XW_\infty = Y , where W_0 is the initialization. Denoting the convergence point of GD initialized at W_0 by W_{\text{GD},\infty} , we thus note W_\infty = W_{\text{GD},\infty} for isotropic preconditioners. Finally, we show that a similar fact holds for general preconditioners up to a multiplicative constant, namely, \|W_0 - W_\infty\|_F \le c \|W_0 - W_{\text{GD},\infty}\|_F for a constant c > 0 .
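Both claims, convergence to an interpolating point and the initialization-centred implicit bias of isotropic preconditioners, can be checked on a toy over-parameterized problem. This is a hedged sketch only: the gradient-clipping preconditioner, step size, and problem sizes are illustrative choices, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 10, 30, 2                      # over-parameterized: d > n
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
W0 = rng.standard_normal((d, k)) * 0.1   # initialization

def clip_precondition(G, c=1.0):
    """An isotropic dual-space preconditioner: gradient clipping, i.e.
    nabla K for K(G) = h(||G||_F) with h a Huber-type function."""
    norm = np.linalg.norm(G)
    return G if norm <= c else G * (c / norm)

W = W0.copy()
eta = 0.01
for _ in range(20000):
    G = X.T @ (X @ W - Y)                # gradient of 0.5 * ||XW - Y||_F^2
    W -= eta * clip_precondition(G)

# Interpolation: X W_inf = Y.  Implicit bias: W_inf is the interpolator
# closest to W0 in Frobenius norm, W0 + pinv(X) @ (Y - X @ W0).
W_minnorm = W0 + np.linalg.pinv(X) @ (Y - X @ W0)
print(np.linalg.norm(X @ W - Y), np.linalg.norm(W - W_minnorm))
```

Because the clipped gradient always lies in the row space of X, the iterates stay in W0 plus that row space, which is why the limit coincides with the minimum-distance-from-initialization interpolator.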
[LG-75] Beam-Plasma Collective Oscillations in Intense Charged-Particle Beams: Dielectric Response Theory Langmuir Wave Dispersion and Unsupervised Detection via Prometheus
链接: https://arxiv.org/abs/2603.10457
作者: Brandon Yee,Wilson Collins,Michael Iofin,Jiayi Fu
类目: Plasma Physics (physics.plasm-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Accelerator Physics (physics.acc-ph)
*备注:
Abstract:We develop a theoretical and computational framework for beam-plasma collective oscillations in intense charged-particle beams at intermediate energies (10-100 MeV). In Part I, we formulate a kinetic field theory governed by the Vlasov-Poisson system, deriving the Lindhard dielectric function and random phase approximation (RPA) polarization tensor for three beam distribution functions. We prove via the dielectric function epsilon(omega,q)=0 the existence of undamped Langmuir wave modes above a critical beam density n_c, obtain explicit beam-plasma dispersion relations, and show that Landau damping vanishes above the particle-hole continuum. The plasma frequency Omega_p^2 = ne^2/(m*epsilon_0) is fixed by the f-sum rule independently of distribution shape; higher dispersion coefficients depend on velocity moments. Space charge effects drive anomalous beam broadening with sqrt(n-n_c) onset and Friedel oscillations at q=2k_F. The beam-plasma transition belongs to the 3D Ising universality class via renormalization group analysis. In Part II, we validate these predictions using Prometheus, a beta-VAE trained on static structure factor data S(q) from particle-in-cell (PIC) beam simulations. Prometheus detects collective plasma oscillation onset in Gaussian and uniform distributions, confirms their absence in the degenerate Fermi gas (n_c - 0), and resolves the Kohn anomaly at q=2k_F. Dispersion analysis of S(q,omega) from PIC simulations verifies the distribution-independent Omega_p predicted by the f-sum rule. All six validation checks pass. Predicted signatures – density-tunable plasma resonances at omega_p proportional to sqrt(n), anomalous beam broadening with sqrt(n-n_c) onset, and Friedel oscillations – are accessible at existing intermediate-energy beam facilities.
[LG-76] Brenier Isotonic Regression AISTATS2026
链接: https://arxiv.org/abs/2603.10452
作者: Han Bao,Amirreza Eshraghi,Yutong Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: AISTATS2026
Abstract:Isotonic regression (IR) is shape-constrained regression to maintain a univariate fitting curve non-decreasing, which has numerous applications including single-index models and probability calibration. When it comes to multi-output regression, the classical IR is no longer applicable because the monotonicity is not readily extendable. We consider a novel multi-output regression problem where a regression function is \emph{cyclically monotone}. Roughly speaking, a cyclically monotone function is the gradient of some convex potential. Whereas enforcing cyclic monotonicity is apparently challenging, we leverage the fact that Kantorovich's optimal transport (OT) always yields a cyclically monotone coupling as an optimal solution. This perspective naturally allows us to interpret a regression function and the convex potential as a link function in generalized linear models and Brenier's potential in OT, respectively, and hence we call this IR extension \emph{Brenier isotonic regression}. We demonstrate experiments with probability calibration and generalized linear models. In particular, IR outperforms many famous baselines in probability calibration robustly.
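The fact the abstract leans on, that an optimal correlation-maximizing coupling is cyclically monotone, can be verified directly on a toy instance. The brute-force matching below is pure Python at an illustrative scale (it enumerates permutations, so it only works for a handful of points); the paper's regression method itself is not attempted here.

```python
import itertools, random

def optimal_coupling(xs, ys):
    """Brute-force Kantorovich OT between two small empirical measures:
    the permutation maximizing the total correlation sum <x_i, y_pi(i)>.
    The induced map x_i -> y_pi(i) is cyclically monotone."""
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    best = max(itertools.permutations(range(len(ys))),
               key=lambda p: sum(dot(xs[i], ys[p[i]]) for i in range(len(xs))))
    return list(best)

def is_cyclically_monotone(xs, ys, pi):
    """Check sum_j <x_{i_j}, y_pi(i_j) - y_pi(i_{j+1})> >= 0 over all
    cycles through subsets of the matched pairs."""
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    n = len(xs)
    for r in range(2, n + 1):
        for cycle in itertools.permutations(range(n), r):
            s = sum(dot(xs[cycle[j]],
                        [a - b for a, b in zip(ys[pi[cycle[j]]],
                                               ys[pi[cycle[(j + 1) % r]]])])
                    for j in range(r))
            if s < -1e-12:
                return False
    return True

random.seed(0)
xs = [[random.gauss(0, 1) for _ in range(2)] for _ in range(5)]
ys = [[random.gauss(0, 1) for _ in range(2)] for _ in range(5)]
pi = optimal_coupling(xs, ys)
print(pi, is_cyclically_monotone(xs, ys, pi))
```

The check succeeds by construction: rotating the assignment along any cycle yields another permutation, which by optimality cannot have a larger total correlation.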
[LG-77] Adaptive Active Learning for Regression via Reinforcement Learning UAI2026
链接: https://arxiv.org/abs/2603.10435
作者: Simon D. Nguyen,Troy Russo,Kentaro Hoffman,Tyler H. McCormick
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 33 pages, 103 figures. Main paper (8 pages, 4 figures) plus appendix with proofs and supplemental experimental results. Submitted to UAI2026. Codebase available at this https URL
Abstract:Active learning for regression reduces labeling costs by selecting the most informative samples. Improved Greedy Sampling is a prominent method that balances feature-space diversity and output-space uncertainty using a static, multiplicative rule. We propose Weighted improved Greedy Sampling (WiGS), which replaces this framework with a dynamic, additive criterion. We formulate weight selection as a reinforcement learning problem, enabling an agent to adapt the exploration-investigation balance throughout learning. Experiments on 18 benchmark datasets and a synthetic environment show WiGS outperforms iGS and other baseline methods in both accuracy and labeling efficiency, particularly in domains with irregular data density where the baseline’s multiplicative rule ignores high-error samples in dense regions.
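One acquisition step of the additive criterion the abstract describes might look like the sketch below. This is an interpretation of the abstract, not the authors' code: the weighted sum `w * d_x + (1 - w) * d_y` stands in for WiGS (iGS would instead maximize the product `d_x * d_y`), the model predictions are a hypothetical stand-in, and the RL agent that adapts `w` during learning is not sketched.

```python
import random

def nearest_dist(p, pts):
    """Euclidean distance from point p to the nearest point in pts."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in pts)

def select_next(pool, labeled_x, pred, labeled_y, w):
    """Pick the pool index maximizing an additive diversity/uncertainty
    score: d_x is feature-space distance to the nearest labeled point,
    d_y the analogous output-space distance on model predictions."""
    def score(i):
        dx = nearest_dist(pool[i], labeled_x)
        dy = nearest_dist([pred[i]], [[y] for y in labeled_y])
        return w * dx + (1 - w) * dy
    return max(range(len(pool)), key=score)

random.seed(0)
pool = [[random.random(), random.random()] for _ in range(50)]
labeled_x = [pool[0], pool[1]]             # two points already labeled
labeled_y = [sum(p) for p in labeled_x]
pred = [sum(p) for p in pool]              # stand-in model predictions
print(select_next(pool, labeled_x, pred, labeled_y, w=0.5))
```

Already-labeled points score zero on both terms, so the rule always proposes a genuinely new sample; the RL component would then adjust `w` between such steps.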
[LG-78] On The Complexity of Best-Arm Identification in Non-Stationary Linear Bandits
链接: https://arxiv.org/abs/2603.10346
作者: Leo Maynard-Zhang,Zhihan Xiong,Kevin Jamieson,Maryam Fazel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study the fixed-budget best-arm identification (BAI) problem in non-stationary linear bandits. Concretely, given a fixed time budget T\in \mathbbN , finite arm set \mathcalX \subset \mathbbR^d , and a potentially adversarial sequence of unknown parameters \lbrace \theta_t\rbrace_t=1^T (hence non-stationary), a learner aims to identify the arm with the largest cumulative reward x_* = \arg\max_x \in \mathcalX x^\top\sum_t=1^T \theta_t with high probability. In this setting, it is well-known that uniformly sampling arms from the G-optimal design yields a minimax-optimal error probability of \exp\left(-\Theta\left(T / H_G\right)\right) , where H_G scales proportionally with the dimension d . However, this notion of complexity is overly pessimistic, as it is derived from a lower bound in which the arm set consists only of the standard basis vectors, thus masking any potential advantages arising from arm sets with richer geometric structure. To address this, we establish an arm-set-dependent lower bound that, in contrast, holds for any arm set. Motivated by the ideas underlying our lower bound, we propose the Adjacent-optimal design, a specialization of the well-known \mathcalX\mathcalY -optimal design, and develop the \textsfAdjacent-BAI algorithm. We prove that the error probability of \textsfAdjacent-BAI matches our lower bound up to constants, verifying the tightness of our lower bound, and establishing the arm-set-dependent complexity of this setting.
[LG-79] MultiwayPAM: Multiway Partitioning Around Medoids for LLM -as-a-Judge Score Analysis
链接: https://arxiv.org/abs/2603.10287
作者: Chihiro Watanabe,Jingyu Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:LLM-as-a-Judge is a flexible framework for text evaluation, which allows us to obtain scores for the quality of a given text from various perspectives by changing the prompt template. Two main challenges in using LLM-as-a-Judge are computational cost of LLM inference, especially when evaluating a large number of texts, and inherent bias of an LLM evaluator. To address these issues and reveal the structure of score bias caused by an LLM evaluator, we propose to apply a tensor clustering method to a given LLM-as-a-Judge score tensor, whose entries are the scores for different combinations of questions, answerers, and evaluators. Specifically, we develop a new tensor clustering method MultiwayPAM, with which we can simultaneously estimate the cluster membership and the medoids for each mode of a given data tensor. By observing the medoids obtained by MultiwayPAM, we can gain knowledge about the membership of each question/answerer/evaluator cluster. We experimentally show the effectiveness of MultiwayPAM by applying it to the score tensors for two practical datasets.
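For intuition, plain partitioning-around-medoids on a single mode can be sketched as below; this is only a classical one-mode PAM stand-in (MultiwayPAM alternates such medoid/membership updates across all modes of the score tensor simultaneously), and the toy score data is invented for illustration.

```python
def pam(points, k, dist, iters=20):
    """Partitioning around medoids: alternate between assigning each
    point to its nearest medoid and re-picking, inside each cluster,
    the member minimizing the summed distance to the other members."""
    medoids = list(range(k))               # initialize with first k points
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
                  for p in points]
        new = []
        for j in range(k):
            members = [i for i, a in enumerate(assign) if a == j]
            new.append(min(members, key=lambda m: sum(
                dist(points[m], points[i]) for i in members)))
        if new == medoids:
            break
        medoids = new
    return medoids, assign

scores = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]    # toy 1-D judge scores
med, assign = pam(scores, 2, lambda a, b: abs(a - b))
print(med, assign)
```

Inspecting the returned medoids (here the scores 1.0 and 5.0) is the analogue of reading off a representative question/answerer/evaluator per cluster.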
[LG-80] Bayesian Hierarchical Models and the Maximum Entropy Principle
链接: https://arxiv.org/abs/2603.10252
作者: Brendon J. Brewer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME)
*备注: 6 pages, 2 figures. To appear in the proceedings of the 44th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2025), held in Auckland, New Zealand
Abstract:Bayesian hierarchical models are frequently used in practical data analysis contexts. One interpretation of these models is that they provide an indirect way of assigning a prior for unknown parameters, through the introduction of hyperparameters. The resulting marginal prior for the parameters (integrating over the hyperparameters) is usually dependent, so that learning one parameter provides some information about the others. In this contribution, I will demonstrate that, when the prior given the hyperparameters is a canonical distribution (a maximum entropy distribution with moment constraints), the dependent marginal prior also has a maximum entropy property, with a different constraint. This constraint is on the marginal distribution of some function of the unknown quantities. The results shed light on what information is actually being assumed when we assign a hierarchical model.
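The marginalization the abstract refers to can be written out explicitly. The canonical (exponential-family) form of the conditional prior follows the abstract's wording; the symbols f, \lambda, and Z are our notation, not the paper's.

```latex
% Hierarchical prior with a canonical conditional layer
p(\theta) \;=\; \int p(\theta \mid \alpha)\, p(\alpha)\, d\alpha,
\qquad
p(\theta \mid \alpha) \;=\; \frac{1}{Z(\alpha)}\,
  \exp\!\big(\lambda(\alpha)^{\top} f(\theta)\big).
```

The abstract's claim is then that this marginal p(\theta), although dependent across components of \theta, is itself a maximum-entropy distribution, with the moment constraint on E[f(\theta)] replaced by a constraint on the marginal distribution of f(\theta).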
[LG-81] A Trust-Region Interior-Point Stochastic Sequential Quadratic Programming Method
链接: https://arxiv.org/abs/2603.10230
作者: Yuchen Fang,Jihun Kim,Sen Na,James Demmel,Javad Lavaei
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
Abstract:In this paper, we propose a trust-region interior-point stochastic sequential quadratic programming (TR-IP-SSQP) method for solving optimization problems with a stochastic objective and deterministic nonlinear equality and inequality constraints. In this setting, exact evaluations of the objective function and its gradient are unavailable, but their stochastic estimates can be constructed. In particular, at each iteration our method builds stochastic oracles, which estimate the objective value and gradient to satisfy proper adaptive accuracy conditions with a fixed probability. To handle inequality constraints, we adopt an interior-point method (IPM), in which the barrier parameter follows a prescribed decaying sequence. Under standard assumptions, we establish global almost-sure convergence of the proposed method to first-order stationary points. We implement the method on a subset of problems from the CUTEst test set, as well as on logistic regression problems, to demonstrate its practical performance.
[LG-82] SDSR: A Spectral Divide-and-Conquer Approach for Species Tree Reconstruction
链接: https://arxiv.org/abs/2603.10215
作者: Ortal Reshef(1),Ofer Glassman(3),Or Zuk(1),Yariv Aizenbud(2),Boaz Nadler(3),Ariel Jaffe(1) ((1) Hebrew University of Jerusalem, (2) Tel Aviv University, (3) Weizmann Institute of Science)
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 35 pages, 13 figures. Code available at this https URL
Abstract:Recovering a tree that represents the evolutionary history of a group of species is a key task in phylogenetics. Performing this task using sequence data from multiple genetic markers poses two key challenges. The first is the discordance between the evolutionary history of individual genes and that of the species. The second challenge is computational, as contemporary studies involve thousands of species. Here we present SDSR, a scalable divide-and-conquer approach for species tree reconstruction based on spectral graph theory. The algorithm recursively partitions the species into subsets until their sizes are below a given threshold. The trees of these subsets are reconstructed by a user-chosen species tree algorithm. Finally, these subtrees are merged to form the full tree. On the theoretical front, we derive recovery guarantees for SDSR, under the multispecies coalescent (MSC) model. We also perform a runtime complexity analysis. We show that SDSR, when combined with a species tree reconstruction algorithm as a subroutine, yields substantial runtime savings as compared to applying the same algorithm on the full data. Empirically, we evaluate SDSR on synthetic benchmark datasets with incomplete lineage sorting and horizontal gene transfer. In accordance with our theoretical analysis, the simulations show that combining SDSR with common species tree methods, such as CA-ML or ASTRAL, yields up to 10-fold faster runtimes. In addition, SDSR achieves a comparable tree reconstruction accuracy to that obtained by applying these methods on the full data.
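The divide-and-conquer structure of SDSR can be sketched as a recursion skeleton. Everything plugged in below is a toy stand-in: the paper partitions via spectral graph theory and reconstructs subtrees with a user-chosen method such as CA-ML or ASTRAL, whereas here the partition is a list split, a "subtree" is a nested tuple, and merging is pairing.

```python
def sdsr(taxa, reconstruct, partition, merge, threshold):
    """Divide-and-conquer species tree skeleton: recursively split the
    taxa until subsets fit under the threshold, reconstruct each subset
    with the base method, then merge the subtrees."""
    if len(taxa) <= threshold:
        return reconstruct(taxa)
    left, right = partition(taxa)
    return merge(sdsr(left, reconstruct, partition, merge, threshold),
                 sdsr(right, reconstruct, partition, merge, threshold))

# Toy stand-ins for the three user-supplied components.
reconstruct = lambda taxa: tuple(taxa) if len(taxa) > 1 else taxa[0]
partition = lambda taxa: (taxa[:len(taxa) // 2], taxa[len(taxa) // 2:])
merge = lambda a, b: (a, b)

tree = sdsr(list("ABCDEFGH"), reconstruct, partition, merge, threshold=2)
print(tree)
```

The runtime savings come from the base method only ever being called on subsets of at most `threshold` taxa, which is where the reported up-to-10-fold speedups originate.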
[LG-83] Flexible Cutoff Learning: Optimizing Machine Learning Potentials After Training
链接: https://arxiv.org/abs/2603.10205
作者: Rick Oerder(1 and 2),Jan Hamaekers(2) ((1) Institute for Numerical Simulation, University of Bonn (2) Fraunhofer Institute for Algorithms and Scientific Computing SCAI)
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:We introduce Flexible Cutoff Learning (FCL), a method for training machine learning interatomic potentials (MLIPs) whose cutoff radii can be adjusted after training. Unlike conventional MLIPs that fix the cutoff radius during training, FCL models are trained by randomly sampling cutoff radii independently for each atom. The resulting model can then be deployed with different per-atom cutoff radii depending on the application, enabling application-specific optimization of the accuracy-cost tradeoff. Using a differentiable cost model, these per-atom cutoffs can be optimized for specific target systems after training. We demonstrate FCL with a modified MACE architecture trained on the MAD dataset. For a subset featuring molecular crystals, optimized per-atom cutoffs reduce computational cost by more than 60% while increasing force errors by less than 1%. These results show that FCL enables training of a single general-purpose MLIP that can be adapted to diverse applications through post-training cutoff optimization, eliminating the need for retraining.
[LG-84] Hybrid Hidden Markov Model for Modeling Equity Excess Growth Rate Dynamics: A Discrete-State Approach with Jump-Diffusion
链接: https://arxiv.org/abs/2603.10202
作者: Abdulrahman Alswaidan,Jeffrey D. Varner
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注:
Abstract:Generating synthetic financial time series that preserve statistical properties of real market data is essential for stress testing, risk model validation, and scenario design. Existing approaches, from parametric models to deep generative networks, struggle to simultaneously reproduce heavy-tailed distributions, negligible linear autocorrelation, and persistent volatility clustering. We propose a hybrid hidden Markov framework that discretizes continuous excess growth rates into Laplace quantile-defined market states and augments regime switching with a Poisson-driven jump-duration mechanism to enforce realistic tail-state dwell times. Parameters are estimated by direct transition counting, bypassing the Baum-Welch EM algorithm. Synthetic data quality is evaluated using Kolmogorov-Smirnov and Anderson-Darling pass rates for distributional fidelity, and ACF mean absolute error for temporal structure. Applied to ten years of SPY data across 1,000 simulated paths, the framework achieves KS and AD pass rates exceeding 97% and 91% in-sample and 94% out-of-sample (calendar year 2025), partially reproducing the ARCH effect that standard regime-switching models miss. No single model dominates all quality dimensions: GARCH(1,1) reproduces volatility clustering more accurately but fails distributional tests (5.5% KS pass rate), while the standard HMM without jumps achieves higher distributional fidelity but cannot generate persistent high-volatility regimes. The proposed framework offers the best joint quality profile across distributional, temporal, and tail-coverage metrics. A Single-Index Model extension propagates the SPY factor path to a 424-asset universe, enabling scalable correlated synthetic path generation while preserving cross-sectional correlation structure.
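The estimation shortcut mentioned in the abstract, direct transition counting instead of Baum-Welch EM, reduces to a few lines once the growth rates have been discretized into states. The sketch below uses an invented state sequence; the Laplace quantile discretization and jump-duration mechanism are not reproduced.

```python
def transition_matrix(states, n_states):
    """Estimate Markov transition probabilities by counting observed
    state-to-state transitions and row-normalizing (unvisited states
    fall back to a uniform row)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    P = []
    for row in counts:
        s = sum(row)
        P.append([c / s for c in row] if s else [1.0 / n_states] * n_states)
    return P

seq = [0, 0, 1, 2, 2, 2, 1, 0, 0, 1]   # toy discretized-state sequence
P = transition_matrix(seq, 3)
print(P)
```

This works here because the states are observed after discretization, so no latent-state inference (and hence no EM) is required.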
[LG-85] Stability and Robustness via Regularization: Bandit Inference via Regularized Stochastic Mirror Descent
链接: https://arxiv.org/abs/2603.10184
作者: Budhaditya Halder,Ishan Sengupta,Koustav Chowdhury,Koulik Khamaru
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Statistical inference with bandit data presents fundamental challenges due to adaptive sampling, which violates the independence assumptions underlying classical asymptotic theory. Recent work has identified stability as a sufficient condition for valid inference under adaptivity. This paper develops a systematic theory of stability for bandit algorithms based on stochastic mirror descent, a broad algorithmic framework that includes the widely-used EXP3 algorithm as a special case. Our contributions are threefold. First, we establish a general stability criterion: if the average iterates of a stochastic mirror descent algorithm converge in ratio to a non-random probability vector, then the induced bandit algorithm is stable. This result provides a unified lens for analyzing stability across diverse algorithmic instantiations. Second, we introduce a family of regularized-EXP3 algorithms employing a log-barrier regularizer with appropriately tuned parameters. We prove that these algorithms satisfy our stability criterion and, as an immediate corollary, that Wald-type confidence intervals for linear functionals of the mean parameter achieve nominal coverage. Notably, we show that the same algorithms attain minimax-optimal regret guarantees up to logarithmic factors, demonstrating that inference-enabling stability and learning efficiency are compatible objectives within the mirror descent framework. Third, we establish robustness to corruption: a modified variant of regularized-EXP3 maintains asymptotic normality of empirical arm means even in the presence of o(T^{1/2}) adversarial corruptions. This stands in sharp contrast to other stable algorithms such as UCB, which suffer linear regret even under logarithmic levels of corruption.
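A minimal EXP3 sketch makes the stability mechanism concrete. The log-barrier regularizer itself is not implemented here; instead a uniform exploration floor `gamma` plays the analogous role of keeping every arm's sampling probability bounded away from zero, which is the property that makes importance-weighted arm means well-behaved. Reward means, `eta`, `gamma`, and the horizon are all illustrative.

```python
import math, random

def exp3(reward_means, T, eta, gamma):
    """EXP3 on Bernoulli arms with an explicit exploration floor:
    mix the exponential-weights distribution with uniform mass gamma,
    sample an arm, and apply the importance-weighted reward update."""
    K = len(reward_means)
    logw = [0.0] * K
    pulls = [0] * K
    for _ in range(T):
        m = max(logw)                            # stabilize the softmax
        w = [math.exp(v - m) for v in logw]
        s = sum(w)
        p = [(1 - gamma) * wi / s + gamma / K for wi in w]
        i = random.choices(range(K), weights=p)[0]
        r = 1.0 if random.random() < reward_means[i] else 0.0
        logw[i] += eta * r / p[i]                # importance weighting
        pulls[i] += 1
    return pulls

random.seed(0)
pulls = exp3([0.2, 0.8], T=5000, eta=0.01, gamma=0.05)
print(pulls)
```

The floor guarantees each arm keeps being sampled at rate at least `gamma / K`, so empirical arm means continue to accumulate observations even as play concentrates on the better arm.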
[LG-86] Mitigating Frequency Learning Bias in Quantum Models via Multi-Stage Residual Learning
链接: https://arxiv.org/abs/2603.10083
作者: Ammar Daskin
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 11 pages, 9 figures. The code and synthetic data generation scripts used in this study are publicly available on GitHub at this https URL
Abstract:Quantum machine learning models based on parameterized circuits can be viewed as Fourier series approximators. However, they often struggle to learn functions with multiple frequency components, particularly high-frequency or non-dominant ones; a phenomenon we term the quantum Fourier parameterization bias. Inspired by recent advances in classical Fourier neural operators (FNOs), we adapt the multi-stage residual learning idea to the quantum domain, iteratively training additional quantum modules on the residuals of previous stages. We evaluate our method on a synthetic benchmark composed of spatially localized frequency components with diverse envelope shapes (Gaussian, Lorentzian, triangular). Systematic experiments show that the number of qubits, the encoding scheme, and residual learning are all crucial for resolving multiple frequencies; residual learning alone can improve test MSE significantly over a single-stage baseline trained for the same total number of epochs. Our work provides a practical framework for enhancing the spectral expressivity of quantum models and offers new insights into their frequency-learning behavior.
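The multi-stage residual scheme itself is independent of the quantum backend, so it can be illustrated classically. In the sketch below each "stage" is a single-frequency least-squares Fourier fit standing in for a parameterized quantum circuit; the target signal and stage frequencies are invented for illustration.

```python
import numpy as np

def fit_stage(x, y, freq):
    """One residual-learning stage: least-squares fit of a single
    sin/cos pair at the given frequency (a classical stand-in for a
    parameterized quantum circuit)."""
    A = np.stack([np.sin(freq * x), np.cos(freq * x), np.ones_like(x)], 1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda t: np.stack([np.sin(freq * t), np.cos(freq * t),
                               np.ones_like(t)], 1) @ coef

x = np.linspace(0, 2 * np.pi, 200)
target = np.sin(x) + 0.5 * np.sin(5 * x)   # dominant + high-frequency part

residual, models = target.copy(), []
for freq in (1, 5):                        # each stage trains on the
    f = fit_stage(x, residual, freq)       # residual of the previous one
    models.append(f)
    residual = residual - f(x)

pred = sum(m(x) for m in models)
print(float(np.mean((pred - target) ** 2)))
```

The first stage captures the dominant component, and the second stage, trained purely on the residual, recovers the high-frequency part that a single-stage fit would leave behind.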
[LG-87] Breaking the Stochasticity Barrier: An Adaptive Variance-Reduced Method for Variational Inequalities
链接: https://arxiv.org/abs/2601.23034
作者: Yungi Jeong,Takumi Otsuka
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Stochastic non-convex non-concave optimization, formally characterized as Stochastic Variational Inequalities (SVIs), presents unique challenges due to rotational dynamics and the absence of a global merit function. While adaptive step-size methods (like Armijo line-search) have revolutionized convex minimization, their application to this setting is hindered by the Stochasticity Barrier: the noise in gradient estimation masks the true operator curvature, triggering erroneously large steps that destabilize convergence. In this work, we propose VR-SDA-A (Variance-Reduced Stochastic Descent-Ascent with Armijo), a novel algorithm that integrates recursive momentum (STORM) with a rigorous Same-Batch Curvature Verification mechanism. We introduce a theoretical framework based on a Lyapunov potential tracking the Operator Norm, proving that VR-SDA-A achieves an oracle complexity of O(epsilon^-3) for finding an epsilon-stationary point in general Lipschitz continuous operators. This matches the optimal rate for non-convex minimization while uniquely enabling automated step-size adaptation in the saddle-point setting. We validate our approach on canonical rotational benchmarks and non-convex robust regression tasks, demonstrating that our method effectively suppresses limit cycles and accelerates convergence with reduced dependence on manual learning rate scheduling.
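The STORM recursive-momentum estimator at the core of the method can be sketched on a one-dimensional problem. All constants and the quadratic objective are illustrative assumptions; the point of the sketch is the same-sample ("same-batch") evaluation of both gradients in the recursion, which is what lets the estimator's error contract.

```python
import random

def storm_step(x, d_prev, x_prev, a, noise):
    """One STORM update for F(x) = 0.5 * x^2 with additive gradient
    noise: d = g(x) + (1 - a) * (d_prev - g(x_prev)), where both
    gradients are evaluated with the SAME noise sample, so the noise
    cancels in the correction term."""
    eps = random.gauss(0, noise)
    g_new, g_old = x + eps, x_prev + eps
    return g_new + (1 - a) * (d_prev - g_old)

random.seed(0)
x, noise, a, eta = 5.0, 1.0, 0.1, 0.05     # hypothetical constants
d = x + random.gauss(0, noise)             # initial gradient estimate
errs = []
for _ in range(2000):
    x_new = x - eta * d
    d = storm_step(x_new, d, x, a, noise)
    x = x_new
    errs.append(abs(d - x))                # estimator error vs true grad x
print(x, sum(errs[-200:]) / 200)
```

Because the correction term cancels the shared noise sample, the estimator error obeys `e_t = (1 - a) * e_{t-1} + a * eps_t`, whose stationary spread is far below the raw gradient noise, which is the variance reduction the abstract relies on.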