This post lists the latest papers retrieved from Arxiv.org on 2026-02-18. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: Paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each day.
Tip: If a given day is not updated on time, either Arxiv published no new papers that day or the update script failed. Fixes are applied the same day whenever possible.
Table of Contents
Overview (2026-02-18)
402 papers were updated today, including:
- Natural Language Processing: 53 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 130 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 55 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 119 papers (Machine Learning (cs.LG))
- Multiagent Systems: 7 papers (Multiagent Systems (cs.MA))
- Information Retrieval: 10 papers (Information Retrieval (cs.IR))
- Human-Computer Interaction: 15 papers (Human-Computer Interaction (cs.HC))
Multi-Agent Systems
[MA-0] Outer Diversity of Structured Domains
Quick Read: This paper studies how the structural properties of preference domains in elections affect the design and analysis of voting mechanisms, and in particular how to measure a domain's outer diversity to assess its practical value. The key contribution is to introduce and formalize the notion of outer diversity, which characterizes how much the set of preference orders admitted by a domain can be extended, and to systematically evaluate the outer diversity of classic structured domains such as single-peaked, single-crossing, group-separable, and Euclidean domains, revealing how these structures constrain electoral behavior and what that implies in theory.
Link: https://arxiv.org/abs/2602.15708
Authors: Piotr Faliszewski,Krzysztof Sornat,Stanisław Szufa,Tomasz Wąs
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:
Abstract:An ordinal preference domain is a subset of preference orders that the voters are allowed to cast in an election. We introduce and study the notion of outer diversity of a domain and evaluate its value for a number of well-known structured domains, such as the single-peaked, single-crossing, group-separable, and Euclidean ones.
[MA-1] Neural Network-Based Parameter Estimation of a Labour Market Agent-Based Model WCCS2026
Quick Read: This paper tackles the computational bottleneck of parameter estimation in large-scale agent-based models (ABMs), where exploring a high-dimensional parameter space is inefficient and hard to make converge. The core of the solution is a state-of-the-art simulation-based inference (SBI) framework that uses neural networks (NNs) to learn the mapping from simulated data to posterior distributions over parameters, replacing the hand-crafted summary statistics that traditional Bayesian methods depend on. The key innovation is that an embedded neural network learns effective feature representations directly from the data, markedly improving both the accuracy and the computational efficiency of parameter estimation: the original parameters are recovered across dataset scales, outperforming traditional methods.
Link: https://arxiv.org/abs/2602.15572
Authors: M Lopes Alves,Joel Dyer,Doyne Farmer,Michael Wooldridge,Anisoara Calinescu
Affiliations: University of Oxford; Institute of New Economic Thinking
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: To be presented at the 6th World Conference on Complex Systems (WCCS 2026)
Abstract:Agent-based modelling (ABM) is a widespread approach to simulate complex systems. Advancements in computational processing and storage have facilitated the adoption of ABMs across many fields; however, ABMs face challenges that limit their use as decision-support tools. A significant issue is parameter estimation in large-scale ABMs, particularly due to computational constraints on exploring the parameter space. This study evaluates a state-of-the-art simulation-based inference (SBI) framework that uses neural networks (NN) for parameter estimation. This framework is applied to an established labour market ABM based on job transition networks. The ABM is initiated with synthetic datasets and the real U.S. labour market. Next, we compare the effectiveness of summary statistics derived from a list of statistical measures with that learned by an embedded NN. The results demonstrate that the NN-based approach recovers the original parameters when evaluating posterior distributions across various dataset scales and improves efficiency compared to traditional Bayesian methods.
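To give a flavor of the NN-based SBI idea, here is a minimal, self-contained sketch (not the paper's code): a toy AR(1) simulator stands in for the labour-market ABM, and a small network learns to map raw simulated series back to the generating parameter, replacing hand-crafted summary statistics. A point estimate is used here for brevity, whereas SBI frameworks typically estimate full posteriors.

```python
# A toy flavour of NN-based simulation-based inference (not the paper's code):
# a stand-in simulator replaces the labour-market ABM, and a small network
# learns to map raw simulated series back to the generating parameter.
import torch
import torch.nn as nn

def simulate(theta: torch.Tensor) -> torch.Tensor:
    """Toy AR(1) simulator: theta is the persistence parameter, shape (N, 1)."""
    x = torch.zeros(theta.shape[0], 50)
    for t in range(1, 50):
        x[:, t] = theta.squeeze(1) * x[:, t - 1] + 0.1 * torch.randn(theta.shape[0])
    return x

theta = torch.rand(4096, 1) * 0.9          # parameters drawn from a uniform prior
x = simulate(theta)

net = nn.Sequential(nn.Linear(50, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):                        # learn theta directly from raw data
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), theta)
    loss.backward()
    opt.step()
print(f"final recovery loss: {loss.item():.4f}")
```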
[MA-2] Enhancing Computational Efficiency in NetLogo: Best Practices for Running Large-Scale Agent-Based Models on AWS and Cloud Infrastructures
Quick Read: This paper addresses the surging demand for computational resources as agent-based models (ABMs) grow in scale, focusing on performance bottlenecks and cost efficiency when running large NetLogo models on cloud platforms such as Amazon Web Services (AWS). The key to the solution is a systematic optimization of NetLogo's resource configuration and execution strategy in the cloud, covering memory management, Java virtual machine tuning, parallel BehaviorSpace execution, and AWS instance selection. These optimizations yield a 32% reduction in computational cost and more consistent performance, demonstrating the value of co-tuning hardware and software for a concrete ABM scenario (the wolf-sheep predation model).
Link: https://arxiv.org/abs/2602.15317
Authors: Michael A. Duprey,Georgiy V. Bobashev
Affiliations: RTI International
Subjects: Multiagent Systems (cs.MA)
Comments:
Abstract:The rising complexity and scale of agent-based models (ABMs) necessitate efficient computational strategies to manage the increasing demand for processing power and memory. This manuscript provides a comprehensive guide to optimizing NetLogo, a widely used platform for ABMs, for running large-scale models on Amazon Web Services (AWS) and other cloud infrastructures. It covers best practices in memory management, Java options, BehaviorSpace execution, and AWS instance selection. By implementing these optimizations and selecting appropriate AWS instances, we achieved a 32% reduction in computational costs and improved performance consistency. Through a comparative analysis of NetLogo simulations on different AWS instances using the wolf-sheep predation model, we demonstrate the performance gains achievable through these optimizations.
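As an illustration of the kind of configuration the paper discusses, the sketch below launches a headless NetLogo BehaviorSpace experiment with explicit JVM heap settings and a thread count matched to the instance's vCPUs. The paths, experiment name, heap sizes, and thread count are placeholders, not the paper's settings.

```python
# Illustrative launcher for a headless NetLogo BehaviorSpace run with explicit
# JVM heap settings; paths, experiment name, heap sizes, and thread count are
# placeholders to be sized against the chosen AWS instance.
import os
import subprocess

env = dict(os.environ, JAVA_TOOL_OPTIONS="-Xms4g -Xmx12g")  # JVM heap bounds
subprocess.run(
    [
        "./netlogo-headless.sh",
        "--model", "Wolf Sheep Predation.nlogo",
        "--experiment", "sweep",        # a BehaviorSpace experiment in the model
        "--table", "results.csv",       # stream results to CSV
        "--threads", "8",               # match the instance's vCPU count
    ],
    check=True,
    env=env,
)
```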
[MA-3] Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Quick Read: This paper targets a safety problem in multi-agent systems: large language model (LLM) agents collaborating on a task may form coalitions that pursue secondary goals at the expense of the joint objective, i.e., collusion. The key to the solution is the Colosseum framework, which grounds inter-agent cooperation in a Distributed Constraint Optimization Problem (DCOP) and quantifies collusion as regret relative to the cooperative optimum. It audits LLM agents under varying objectives, persuasion tactics, and network topologies, distinguishing genuine collusion from "collusion on paper" that exists only in textual plans, thereby making collusive behavior measurable and verifiable.
Link: https://arxiv.org/abs/2602.15198
Authors: Mason Nakamura,Abhinav Kumar,Saswat Das,Sahar Abdelnabi,Saaduddin Mahmud,Ferdinando Fioretto,Shlomo Zilberstein,Eugene Bagdasarian
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when individual agents form a coalition and collude to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents’ collusive behavior in multi-agent settings. We ground how agents cooperate through a Distributed Constraint Optimization Problem (DCOP) and measure collusion via regret relative to the cooperative optimum. Colosseum tests each LLM for collusion under different objectives, persuasion tactics, and network topologies. Through our audit, we show that most out-of-the-box models exhibited a propensity to collude when a secret communication channel was artificially formed. Furthermore, we discover “collusion on paper” when agents plan to collude in text but would often pick non-collusive actions, thus providing little effect on the joint task. Colosseum provides a new way to study collusion by measuring communications and actions in rich yet verifiable environments.
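To make the regret measure concrete, here is a toy two-agent example (the utility table and action sets are invented, not from the paper): collusion is scored as the gap between the cooperative optimum of the DCOP and the joint utility of the actions actually taken.

```python
# Toy regret-vs-cooperative-optimum computation for a two-agent DCOP; the
# utility table and action sets are invented for illustration.
from itertools import product

actions = ["a", "b"]
table = {("a", "a"): 3, ("a", "b"): 1, ("b", "a"): 0, ("b", "b"): 2}

def joint_utility(x: str, y: str) -> int:
    return table[(x, y)]

optimum = max(joint_utility(x, y) for x, y in product(actions, repeat=2))
taken = ("b", "a")                       # e.g., actions picked by a colluding pair
regret = optimum - joint_utility(*taken)
print(f"cooperative optimum = {optimum}, regret = {regret}")  # 3 and 3
```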
[MA-4] Beyond Context Sharing: A Unified Agent Communication Protocol (ACP) for Secure Federated and Autonomous Agent-to-Agent (A2A) Orchestration
Quick Read: This paper addresses the difficulty of cross-platform, decentralized, and secure interaction between agents, a key bottleneck on the way to a true "Agentic Web". The core of the solution is a standardized Agent Communication Protocol (ACP) that integrates decentralized identity verification, semantic intent mapping, and automated service-level agreements (SLAs) within a federated orchestration model, enabling heterogeneous agents to discover, negotiate, and execute collaborative workflows across disparate environments while significantly reducing communication latency and maintaining a zero-trust security posture.
Link: https://arxiv.org/abs/2602.15055
Authors: Naveen Kumar Krishnan
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:In the artificial intelligence space, we are transitioning from isolated large language models to autonomous agents capable of complex reasoning and tool use. While foundational architectures and local context management protocols have been established, the challenge of cross-platform, decentralized, and secure interaction remains a significant barrier to the realization of a truly Agentic Web. Building upon the foundations of AI agent architectures and the Model Context Protocol (MCP) for multi-agent coordination, this paper introduces the Agent Communication Protocol (ACP). ACP provides a standardized framework for Agent-to-Agent (A2A) interaction, enabling heterogeneous agents to discover, negotiate, and execute collaborative workflows across disparate environments. We propose a federated orchestration model that integrates decentralized identity verification, semantic intent mapping, and automated service-level agreements. Our evaluation demonstrates that ACP reduces inter-agent communication latency by % while maintaining a zero-trust security posture. This work represents a critical advancement toward a scalable and interoperable ecosystem of autonomous digital entities.
[MA-5] Social Contagion and Bank Runs: An Agent-Based Model with LLM Depositors
Quick Read: This paper addresses the inability of traditional equilibrium models to explain how beliefs synchronize and propagate in real time during bank runs in the digital era, where online communication and digital banking make runs faster and more networked. The key to the solution is a process-based agent-based model that makes the information and coordination layer explicit: depositors are heterogeneous in risk tolerance, follow decision rules that weight fundamental against social information, and interact on a heavy-tailed network calibrated to Twitter activity, while a constrained large language model (LLM) generates each agent's decision policy and textual posts, capturing at the micro level how social connectivity nonlinearly amplifies the spread of runs. The framework reproduces the observed ordering of bank failures in crises such as SVB and frames social correlation as a measurable amplifier of run risk, alongside depositor overlap and network amplification effects.
Link: https://arxiv.org/abs/2602.15066
Authors: Chris Ruano,Shreshth Rajan
Affiliations: Harvard College
Subjects: Physics and Society (physics.soc-ph); Multiagent Systems (cs.MA)
Comments:
Abstract:Digital banking and online communication have made modern bank runs faster and more networked than the canonical queue-at-the-branch setting. While equilibrium models explain why strategic complementarities generate run risk, they offer limited guidance on how beliefs synchronize and propagate in real time. We develop a process-based agent-based model that makes the information and coordination layer explicit. Banks follow cash-first withdrawal processing with discounted fire-sale liquidation and an endogenous stress index. Depositors are heterogeneous in risk tolerance and in the weight placed on fundamentals versus social information, communicating on a heavy-tailed network calibrated to Twitter activity during March 2023. Depositor behavior is generated by a constrained large language model that maps each agent’s information set into a discrete action and an optional post; we validate this policy against laboratory coordination evidence and theoretical benchmarks. Across 4,900 configurations and full LLM simulations, three findings emerge. Within-bank connectivity raises the likelihood and speed of withdrawal cascades holding fundamentals fixed. Cross-bank contagion exhibits a sharp phase transition near spillover rates of 0.10. Depositor overlap and network amplification interact nonlinearly, so channels weak in isolation become powerful in combination. In an SVB, First Republic, and regional bank scenario disciplined by crisis-era data, the model reproduces the observed ordering of failures and predicts substantially higher withdrawal rates among uninsured depositors. The results frame social correlation as a measurable amplifier of run risk alongside balance-sheet fundamentals.
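A stylized version of the depositor decision layer might look like the sketch below. All weights, thresholds, and the random network are invented; the paper uses a heavy-tailed network calibrated to Twitter data and LLM-generated policies rather than this threshold rule.

```python
# Stylised depositor layer: withdraw when a weighted mix of fundamental stress
# and the share of withdrawn neighbours crosses an agent's risk tolerance.
# Weights, thresholds, and the random network are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)
n, steps = 500, 20
w_social = rng.uniform(0.2, 0.8, n)     # weight on social vs fundamental info
tolerance = rng.uniform(0.3, 0.9, n)    # heterogeneous risk tolerance
adj = (rng.random((n, n)) < 0.02).astype(float)   # simple random network
withdrawn = np.zeros(n, dtype=bool)
stress = 0.4                            # exogenous fundamental stress index

for _ in range(steps):
    degree = np.maximum(adj.sum(axis=1), 1.0)
    neigh = adj @ withdrawn.astype(float) / degree   # share of withdrawn peers
    signal = (1 - w_social) * stress + w_social * neigh
    withdrawn |= signal > tolerance
print(f"run size: {withdrawn.mean():.1%} of depositors withdraw")
```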
[MA-6] Cooperative Game Theory Model for Sustainable UN Financing: Addressing Global Public Goods Provision
Quick Read: This paper addresses the inefficiency and inequity of the United Nations' current funding mechanism, which relies heavily on voluntary contributions and encourages member states to free-ride, undermining the stable provision of global public goods. The key to the solution is a new cooperative game theory model that aligns each country's financial contribution with the utility it derives from UN activities, enabling personalized pricing. Accounting for differences in economic capacity, this raises overall global utility, improves resource allocation, and makes the funding system more sustainable and equitable.
Link: https://arxiv.org/abs/2602.15062
Authors: Labib Shami,Teddy Lazebnik
Affiliations: Western Galilee College; University of Haifa; Jonkoping University
Subjects: Physics and Society (physics.soc-ph); Multiagent Systems (cs.MA)
Comments:
Abstract:This study introduces a novel cooperative game theory model designed to improve the United Nations’ current funding mechanisms, which predominantly rely on voluntary contributions. By shifting from a Nash equilibrium framework, where member states act in self-interest, to a cooperative model, the proposed approach aligns each country’s financial contributions with the benefits they derive from UN activities. The model ensures a more sustainable and equitable system by introducing personalized pricing based on derived utility. Using agent-based simulations, the research demonstrates that the suggested approach increases global utility, reduces free-rider issues, and creates a more efficient resource allocation system. The findings suggest that the proposed model can optimize UN funding, ensuring a more stable and effective framework for global public goods provision, while considering the varying economic capacities of member states. Further research is recommended to assess the political viability of the model.
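The personalized-pricing idea can be illustrated with a toy allocation (all numbers invented): each member's contribution is proportional to the utility it derives, rather than a flat or purely capacity-based assessment.

```python
# Toy utility-aligned contribution schedule: each member funds the budget in
# proportion to the utility it derives from the provided public good.
budget = 100.0
utility = {"A": 50.0, "B": 30.0, "C": 20.0}   # invented utility levels
total = sum(utility.values())
contribution = {k: budget * u / total for k, u in utility.items()}
print(contribution)   # {'A': 50.0, 'B': 30.0, 'C': 20.0}
```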
Natural Language Processing
[NLP-0] Avey-B
Quick Read: This paper addresses the need in industrial natural language processing (NLP) for efficient, compact bidirectional encoders that retain high-quality contextualization under tight compute and memory budgets. Transformer-based models such as BERT provide strong bidirectional context representations, but self-attention is computationally expensive and memory-hungry on long sequences. The authors therefore reformulate Avey, an attention-free autoregressive architecture, for the encoder-only paradigm. The key innovations are decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression, which together enable efficient bidirectional contextualization without self-attention; the model outperforms four widely used Transformer encoders on standard token-classification and information-retrieval benchmarks and scales better to long contexts.
Link: https://arxiv.org/abs/2602.15814
Authors: Devang Acharya,Mohammad Hammoud
Affiliations: Avey AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention’s ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
[NLP-1] Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
Quick Read: This paper addresses the inadequate representation of building semantics in building information models (BIMs): conventional one-hot encoding cannot capture the fine distinctions between closely related subtypes, limiting AI's ability to understand complex semantics in the architecture, engineering, construction, and operation (AECO) domain. The key to the solution is to use large language model (LLM) embeddings (e.g., OpenAI GPT and Meta LLaMA) as encodings that preserve fine-grained semantic distinctions between building object subtypes. Experiments show that LLM-based embeddings (especially Matryoshka-compacted llama-3 embeddings) significantly improve GraphSAGE performance on a 42-class building object subtype classification task, reaching a weighted F1-score of 0.8766 versus 0.8475 for one-hot encoding, confirming the value of LLM embeddings for enhancing AI's understanding of domain-specific building semantics.
Link: https://arxiv.org/abs/2602.15791
Authors: Suhyung Jang,Ghang Lee,Jaekun Lee,Hyunjun Lee
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 42nd International Symposium on Automation and Robotics in Construction (ISARC 2025)
Abstract:Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry. Conventional encoding methods (e.g., one-hot) often fail to convey the nuanced relationships among closely related subtypes, limiting AI’s semantic comprehension. To address this limitation, this study proposes a novel training approach that employs large language model (LLM) embeddings (e.g., OpenAI GPT and Meta LLaMA) as encodings to preserve finer distinctions in building semantics. We evaluated the proposed method by training GraphSAGE models to classify 42 building object subtypes across five high-rise residential building information models (BIMs). Various embedding dimensions were tested, including original high-dimensional LLM embeddings (1,536, 3,072, or 4,096) and 1,024-dimensional compacted embeddings generated via the Matryoshka representation model. Experimental results demonstrated that LLM encodings outperformed the conventional one-hot baseline, with the llama-3 (compacted) embedding achieving a weighted average F1-score of 0.8766, compared to 0.8475 for one-hot encoding. The results underscore the promise of leveraging LLM-based encodings to enhance AI’s ability to interpret complex, domain-specific building semantics. As the capabilities of LLMs and dimensionality reduction techniques continue to evolve, this approach holds considerable potential for broad application in semantic elaboration tasks throughout the AECO industry.
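The core argument, that one-hot codes discard relatedness which embeddings can preserve, can be seen in a toy comparison. The embedding vectors below are invented for illustration, not actual GPT or LLaMA embeddings.

```python
# Toy contrast between one-hot codes and embedding vectors: one-hot makes all
# subtypes equidistant, while embeddings can keep related subtypes close.
# The vectors below are invented, not actual GPT or LLaMA embeddings.
import numpy as np

onehot = np.eye(3)                        # door, sliding_door, pump
emb = np.array([[0.9, 0.1, 0.0],          # "door"
                [0.8, 0.2, 0.1],          # "sliding door" (related to "door")
                [0.0, 0.1, 0.9]])         # "pump" (unrelated)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(onehot[0], onehot[1]))          # 0.0: relatedness is invisible
print(cos(emb[0], emb[1]))                # ~0.98: related subtypes stay close
print(cos(emb[0], emb[2]))                # ~0.01: unrelated subtypes stay apart
```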
[NLP-2] *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quick Read: This paper addresses the high computational cost and post-processing burden of current LLM-as-a-judge methods for automatic text quality evaluation. The key to the solution is to extend ParaPLUIE, a perplexity-based LLM-judge metric, with task-specific prompting variants called *-PLUIE, which estimate confidence over "Yes/No" answers without generating any text, achieving markedly better alignment with human ratings while keeping computational cost low.
Link: https://arxiv.org/abs/2602.15778
Authors: Quentin Lemesle,Léane Jourdan,Daisy Munson,Pierre Alain,Jonathan Chevelu,Arnaud Delhay,Damien Lolive
Affiliations: Univ Rennes, CNRS, IRISA, Expression; Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France; Univ Rennes, CNRS, IRISA, Sotern; Univ of South Brittany, CNRS, IRISA, Expression
Subjects: Computation and Language (cs.CL)
Comments: Under review
Abstract:Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over “Yes/No” answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
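A minimal sketch of the underlying mechanism, scoring "Yes" vs "No" from next-token logits without generating text (gpt2 is a stand-in judge; the actual *-PLUIE prompts and models differ):

```python
# Score confidence over "Yes"/"No" from next-token logits -- no text generated.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in judge; the paper's choice of model differs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def yes_confidence(prompt: str) -> float:
    """P(Yes) / (P(Yes) + P(No)) over the next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # next-token logits
    yes_id = tok(" Yes").input_ids[0]            # leading space matters for BPE
    no_id = tok(" No").input_ids[0]
    probs = torch.softmax(torch.stack([logits[yes_id], logits[no_id]]), dim=0)
    return probs[0].item()

prompt = ("Sentence 1: The cat sat on the mat.\n"
          "Sentence 2: A cat was sitting on the mat.\n"
          "Are these paraphrases? Answer:")
print(f"confidence(Yes) = {yes_confidence(prompt):.3f}")
```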
[NLP-3] ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution
Quick Read: This paper addresses the lack of fine-grained evidence attribution in multimodal large language models (mLLMs) when they work with structured data such as tables: models struggle to point to the specific rows and columns that support an answer. The key to the solution is a systematic evaluation of mLLMs' structured data attribution across table formats (Markdown, JSON, images) and prompting strategies. The results reveal that while question-answering accuracy is moderate, attribution accuracy is far lower, near random for JSON inputs; models cite rows more reliably than columns and handle images better than textual formats, with notable differences across model families. These findings show that current mLLMs remain unreliable for applications that demand transparency and traceability.
Link: https://arxiv.org/abs/2602.15769
Authors: Yahia Alqurnawi,Preetom Biswas,Anmol Rao,Tejas Anvekar,Chitta Baral,Vivek Gupta
Affiliations: Arizona State University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Multimodal Large Language Models (mLLMs) are often used to answer questions in structured data such as tables in Markdown, JSON, and images. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution/citation, which is the ability of the models to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution. Although question answering accuracy remains moderate, attribution accuracy is much lower, near random for JSON inputs, across all models. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their usage in applications requiring transparency and traceability.
[NLP-4] GLM-5: from Vibe Coding to Agentic Engineering
Quick Read: This paper addresses the efficiency and autonomy bottlenecks that current generative AI faces in real software engineering tasks, notably high training and inference costs, weak learning from long-horizon interactions, and poor post-training alignment. The key to the solution is threefold: dynamic sparse attention (DSA) significantly reduces training and inference costs while preserving long-context fidelity; an asynchronous reinforcement learning infrastructure decouples generation from training to drastically improve post-training efficiency; and novel asynchronous agentic RL algorithms strengthen the model's ability to learn from complex, long-horizon interactions, enabling more efficient autonomous coding and engineering.
Link: https://arxiv.org/abs/2602.15763
Authors: GLM-5 Team:Aohan Zeng,Xin Lv,Zhenyu Hou,Zhengxiao Du,Qinkai Zheng,Bin Chen,Da Yin,Chendi Ge,Chengxing Xie,Cunxiang Wang,Gengzheng Pan,Hao Zeng,Haoke Zhang,Haoran Wang,Huilong Chen,Jiajie Zhang,Jian Jiao,Jiaqi Guo,Jingsen Wang,Jingzhao Du,Jinzhu Wu,Kedong Wang,Lei Li,Lin Fan,Lucen Zhong,Mingdao Liu,Mingming Zhao,Pengfan Du,Qian Dong,Rui Lu,Shuang-Li,Shulin Cao,Song Liu,Ting Jiang,Xiaodong Chen,Xiaohan Zhang,Xuancheng Huang,Xuezhen Dong,Yabo Xu,Yao Wei,Yifan An,Yilin Niu,Yitong Zhu,Yuanhao Wen,Yukuo Cen,Yushi Bai,Zhongpei Qiao,Zihan Wang,Zikang Wang,Zilin Zhu,Ziqiang Liu,Zixuan Li,Bojie Wang,Bosi Wen,Can Huang,Changpeng Cai,Chao Yu,Chen Li,Chen Li,Chenghua Huang,Chengwei Hu,Chenhui Zhang,Chenzheng Zhu,Congfeng Yin,Daoyan Lin,Dayong Yang,Di Wang,Ding Ai,Erle Zhu,Fangzhou Yi,Feiyu Chen,Guohong Wen,Hailong Sun,Haisha Zhao,Haiyi Hu,Hanchen Zhang,Hanrui Liu,Hanyu Zhang,Hao Peng,Hao Tai,Haobo Zhang,He Liu,Hongwei Wang,Hongxi Yan,Hongyu Ge,Huan Liu,Huan Liu,Huanpeng Chu,Jia’ni Zhao,Jiachen Wang,Jiajing Zhao,Jiamin Ren,Jiapeng Wang,Jiaxin Zhang,Jiayi Gui,Jiayue Zhao,Jijie Li,Jing An,Jing Li
Affiliations: Zhipu AI; Tsinghua University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at this https URL.
[NLP-5] ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Quick Read: This paper addresses the underexplored ability of multimodal large language models (MLLMs) to support iterative, real-world exploratory data analysis, in particular maintaining common ground across multiple turns, tracking prior edits, and adapting to evolving user preferences; existing benchmarks focus on single-turn chart generation and cannot measure sustained, context-aware visualization editing. The key to the solution is ChartEditBench, a benchmark of 5,000 difficulty-controlled modification chains with a rigorously human-verified subset for evaluating incremental, code-based chart editing, together with a robust evaluation framework that mitigates the limitations of LLM-as-a-Judge by combining execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments show that state-of-the-art MLLMs handle stylistic edits well but frequently fail to execute data-centric transformations, exposing error accumulation and breakdowns of shared context as the core challenges of multi-turn editing.
Link: https://arxiv.org/abs/2602.15758
Authors: Manav Nitin Kapadnis,Lawanya Baghel,Atharva Naik,Carolyn Rosé
Affiliations: Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 13 figures including Supplementary Material
Abstract:While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench, establishes a challenging testbed for grounded, intent-aware multimodal programming.
[NLP-6] Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos
Quick Read: This paper addresses the difficulty of detecting online sexism when fine-grained, context-sensitive labels are missing; most automated tools are limited to binary classification and miss subtle or complex forms of sexism. The key contributions are: FineMuSe, a multimodal sexism detection dataset in Spanish with both binary and fine-grained annotations; a comprehensive hierarchical taxonomy covering forms of sexism, non-sexism, and the rhetorical devices of irony and humor; and a broad evaluation of LLMs on binary and fine-grained detection. The results show that multimodal LLMs rival human annotators at identifying nuanced sexism but struggle with co-occurring sexist types conveyed through visual cues.
Link: https://arxiv.org/abs/2602.15757
Authors: Laura De Grazia,Danae Sánchez Villegas,Desmond Elliott,Mireia Farrús,Mariona Taulé
Affiliations: CLiC – Language and Computing Center, University of Barcelona; Department of Computer Science, University of Copenhagen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Online sexism appears in various forms, which makes its detection challenging. Although automated tools can enhance the identification of sexist content, they are often restricted to binary classification. Consequently, more subtle manifestations of sexism may remain undetected due to the lack of fine-grained, context-sensitive labels. To address this issue, we make the following contributions: (1) we present FineMuSe, a new multimodal sexism detection dataset in Spanish that includes both binary and fine-grained annotations; (2) we introduce a comprehensive hierarchical taxonomy that encompasses forms of sexism, non-sexism, and rhetorical devices of irony and humor; and (3) we evaluate a wide range of LLMs for both binary and fine-grained sexism detection. Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.
[NLP-7] Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac
Quick Read: This paper addresses the persistent challenges that low-resource languages pose for natural language processing (NLP), in particular the performance bottlenecks of lemmatization and part-of-speech (POS) tagging. The key to the solution is to evaluate recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, in few-shot and zero-shot settings on four historical languages (Ancient Greek, Classical Armenian, Old Georgian, and Syriac), comparing them against PIE, a task-specific RNN baseline. The results show that LLMs, even without fine-tuning, achieve competitive or superior performance on most languages, demonstrating that they are an effective way to bootstrap linguistic annotation when labeled data is unavailable.
Link: https://arxiv.org/abs/2602.15753
Authors: Chahan Vidal-Gorène(CJM, LIPN),Bastien Kindt(UCL),Florian Cafiero(PSL, CJM)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.
[NLP-8] Causal Effect Estimation with Latent Textual Treatments
Quick Read: This paper addresses bias in causal effect estimation when text is the treatment: in downstream tasks, text inherently conflates treatment and covariate information, so naive estimators are significantly biased. The key to the solution is an end-to-end causal inference pipeline that first performs hypothesis generation and controlled steering of textual interventions via sparse autoencoders (SAEs), then separates treatment from confounding through covariate residualization, yielding robust causal estimates for latent textual interventions. Empirical results show that the pipeline effectively induces variation in target features and reduces estimation error, providing a reliable foundation for text-as-treatment causal inference.
Link: https://arxiv.org/abs/2602.15730
Authors: Omri Feldman,Amar Venugopal,Jann Spiess,Amir Feder
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Econometrics (econ.EM)
Comments:
Abstract:Understanding the causal effects of text on downstream outcomes is a central task in many applications. Estimating such effects requires researchers to run controlled experiments that systematically vary textual features. While large language models (LLMs) hold promise for generating text, producing and evaluating controlled variation requires more careful attention. In this paper, we present an end-to-end pipeline for the generation and causal estimation of latent textual interventions. Our work first performs hypothesis generation and steering via sparse autoencoders (SAEs), followed by robust causal estimation. Our pipeline addresses both computational and statistical challenges in text-as-treatment experiments. We demonstrate that naive estimation of causal effects suffers from significant bias as text inherently conflates treatment and covariate information. We describe the estimation bias induced in this setting and propose a solution based on covariate residualization. Our empirical results show that our pipeline effectively induces variation in target features and mitigates estimation error, providing a robust foundation for causal effect estimation in text-as-treatment settings.
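The residualization step can be illustrated on synthetic data (a Frisch-Waugh-Lovell-style sketch with invented coefficients, not the paper's estimator): regressing both treatment and outcome on the covariates and working with the residuals removes the confounding that biases the naive estimate.

```python
# Frisch-Waugh-Lovell-style residualization on synthetic data (coefficients
# invented): regress treatment and outcome on covariates, then estimate the
# effect from the residuals, removing the confounding that biases naive OLS.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                                # covariates
t = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)    # confounded treatment
y = 2.0 * t + X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)  # effect 2.0

def residualize(v: np.ndarray, X: np.ndarray) -> np.ndarray:
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

naive = np.polyfit(t, y, 1)[0]                             # biased: ignores X
t_r, y_r = residualize(t, X), residualize(y, X)
adjusted = (t_r @ y_r) / (t_r @ t_r)                       # residual-on-residual
print(f"naive={naive:.2f}  residualized={adjusted:.2f}  (truth: 2.00)")
```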
[NLP-9] Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Quick Read: This paper addresses the sharp accuracy degradation of large language models on tasks requiring compositional reasoning, such as the ARC-AGI-2, GPQA, MATH, BBH, and HLE benchmarks. Existing approaches such as chain-of-thought prompting, self-consistency, or reinforcement learning expand token-level search but leave the latent representation space fixed, so performance collapses when the required abstraction is not already encoded there. The key to the solution is the Recursive Concept Evolution (RCE) framework, which dynamically spawns low-rank concept subspaces during inference, selects and merges synergistic subspaces via a minimum description length criterion, and consolidates them with constrained optimization, so the model constructs new abstractions rather than merely recombining existing ones, without sacrificing stability.
Link: https://arxiv.org/abs/2602.15725
Authors: Sarim Chaudhry
Affiliations: Purdue University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or reinforcement learning, but they leave the model’s latent representation space fixed. When the required abstraction is not already encoded in this space, performance collapses. We propose Recursive Concept Evolution (RCE), a framework that enables pretrained language models to modify their internal representation geometry during inference. RCE introduces dynamically generated low-rank concept subspaces that are spawned when representational inadequacy is detected, selected through a minimum description length criterion, merged when synergistic, and consolidated via constrained optimization to preserve stability. This process allows the model to construct new abstractions rather than recombining existing ones. We integrate RCE with Mistral-7B and evaluate it across compositional reasoning benchmarks. RCE yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.
[NLP-10] Rethinking Metrics for Lexical Semantic Change Detection EACL2026
Quick Read: This paper addresses the reliance of lexical semantic change detection (LSCD) on a small set of change metrics, chiefly Average Pairwise Distance (APD) and cosine distance over word prototypes (PRT), which behave inconsistently under dimensionality reduction or with non-specialised encoders. The key to the solution is two new metrics, Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD), which quantify semantic change via local correspondence between word usages across time periods. Experiments across languages, encoder models, and representation spaces show that AMD is more robust under dimensionality reduction and with non-specialised encoders, while SAMD excels with specialised encoders, giving contextualised-embedding-based LSCD a more reliable and diverse set of evaluation tools.
Link: https://arxiv.org/abs/2602.15716
Authors: Roksana Goworek,Haim Dubossarsky
Affiliations: Queen Mary University of London; The Alan Turing Institute; University of Cambridge
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the LChange 2026 Workshop, colocated with EACL 2026
Abstract:Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and cosine distance over word prototypes (PRT). We introduce Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD), new measures that quantify semantic change via local correspondence between word usages across time periods. Across multiple languages, encoder models, and representation spaces, we show that AMD often provides more robust performance, particularly under dimensionality reduction and with non-specialised encoders, while SAMD excels with specialised encoders. We suggest that LSCD may benefit from considering alternative semantic change metrics beyond APD and PRT, with AMD offering a robust option for contextualised embedding-based analysis.
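Based on the description above, a plausible implementation of AMD and SAMD looks like the following sketch. The exact distance function and directionality in the paper may differ; cosine distance and random vectors are used here purely for illustration.

```python
# One plausible reading of AMD/SAMD: match each usage embedding from one period
# to its nearest neighbour in the other, then average the distances.
import numpy as np

def amd(A: np.ndarray, B: np.ndarray) -> float:
    """A: (n, d), B: (m, d) contextualised usage embeddings of one word."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    dist = 1.0 - A @ B.T                  # pairwise cosine distances, (n, m)
    return dist.min(axis=1).mean()        # nearest match in B per usage in A

def samd(A: np.ndarray, B: np.ndarray) -> float:
    """Symmetric variant: average AMD over both directions."""
    return 0.5 * (amd(A, B) + amd(B, A))

rng = np.random.default_rng(1)
old = rng.normal(size=(50, 768))          # usages in period 1
new = rng.normal(size=(60, 768)) + 0.5    # shifted usages in period 2
print(f"AMD = {amd(old, new):.3f}, SAMD = {samd(old, new):.3f}")
```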
[NLP-11] Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU
Quick Read: This paper addresses the heavy computational cost and privacy risks of video-based real-time conversational assistants for procedural tasks such as furniture assembly. The core of the solution is a conversational system that understands user context using only lightweight, privacy-preserving modalities (audio and IMU data from a wearable device), proactively communicating step-by-step instructions and answering user questions. The key innovation is a User Whim Agnostic (UWA) LoRA finetuning method that suppresses uninformative dialogue while preserving the delivery of important instructions, improving the F-score by 30%; finetuning also removes the need for in-context examples in the prompt, yielding a 16x inference speedup and enabling fully on-device operation without the cloud.
Link: https://arxiv.org/abs/2602.15707
Authors: Rehana Mahfuz,Yinyi Guo,Erik Visser,Phanidhar Chinchili
Affiliations: Qualcomm Technologies, Inc.
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 3 figures
Abstract:Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user’s wearable device to understand the context. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions. We construct a dataset containing conversations where the assistant guides the user in performing the task. On observing that an off-the-shelf language model is a very talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA finetuning method which improves the model’s ability to suppress less informative dialogues, while maintaining its tendency to communicate important instructions. This leads to 30% improvement in the F-score. Finetuning the model also results in a 16x speedup by eliminating the need to provide in-context examples in the prompt. We further describe how such an assistant is implemented on edge devices with no dependence on the cloud.
[NLP-12] A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
Quick Read: This paper addresses the inconsistency of refusal behavior in LLM-based cybersecurity agents, which over-restrict legitimate defensive work and are brittle to obfuscation or request segmentation; existing approaches rely on broad topic bans or offense-focused taxonomies and cannot properly weigh offensive risk against defensive value. The key to the solution is a content-based framework for designing and auditing cyber refusal policies that characterizes requests along five dimensions, Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, making the offense-defense trade-off explicit and enabling tunable, risk-aware refusal decisions.
Link: https://arxiv.org/abs/2602.15689
Authors: Meirav Segal,Noa Linder,Omer Antverg,Gil Gekker,Tomer Fichman,Omri Bodenheimer,Edan Maor,Omer Nevo
Affiliations: Irregular
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and behave brittlely under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.
[NLP-13] Revisiting Northrop Fryes Four Myths Theory with Large Language Models
Quick Read: This paper addresses the fact that computational work on Northrop Frye's four fundamental narrative genres (comedy, romance, tragedy, satire) has focused on plot patterns while neglecting character functions. The key to the solution is a new character function framework grounded in Jungian archetype theory that maps four universal character functions (protagonist, mentor, antagonist, companion) to components of Jung's psychic structure and specializes them into sixteen genre-specific roles. Validation with six state-of-the-art large language models (LLMs) over character-role correspondences in 40 narrative works shows strong performance at recognizing valid correspondences and rejecting invalid ones (mean balanced accuracy 82.5%), with variation across genres and roles reflecting genuine narrative properties, such as the functional distribution of roles in romance and the deliberate subversion of archetypes in satire. This character-centered approach opens a new path for computational narrative analysis and lays groundwork for narrative generation and interactive storytelling.
Link: https://arxiv.org/abs/2602.15678
Authors: Edirlei Soares de Lima,Marco A. Casanova,Antonio L. Furtado
Affiliations: Breda University of Applied Sciences; PUC-Rio
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Northrop Frye’s theory of four fundamental narrative genres (comedy, romance, tragedy, satire) has profoundly influenced literary criticism, yet computational approaches to his framework have focused primarily on narrative patterns rather than character functions. In this paper, we present a new character function framework that complements pattern-based analysis by examining how archetypal roles manifest differently across Frye’s genres. Drawing on Jungian archetype theory, we derive four universal character functions (protagonist, mentor, antagonist, companion) by mapping them to Jung’s psychic structure components. These functions are then specialized into sixteen genre-specific roles based on prototypical works. To validate this framework, we conducted a multi-model study using six state-of-the-art Large Language Models (LLMs) to evaluate character-role correspondences across 40 narrative works. The validation employed both positive samples (160 valid correspondences) and negative samples (30 invalid correspondences) to evaluate whether models both recognize valid correspondences and reject invalid ones. LLMs achieved substantial performance (mean balanced accuracy of 82.5%) with strong inter-model agreement (Fleiss’ κ = 0.600), demonstrating that the proposed correspondences capture systematic structural patterns. Performance varied by genre (ranging from 72.7% to 89.9%) and role (52.5% to 99.2%), with qualitative analysis revealing that variations reflect genuine narrative properties, including functional distribution in romance and deliberate archetypal subversion in satire. This character-based approach demonstrates the potential of LLM-supported methods for computational narratology and provides a foundation for future development of narrative generation methods and interactive storytelling applications.
[NLP-14] LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models EACL26
Quick Read: This paper addresses the severe resource scarcity for Egyptian Arabic in speech synthesis: most text-to-speech (TTS) work targets Modern Standard Arabic (MSA) and Gulf dialects, while Egyptian Arabic, the most widely understood Arabic dialect, lacks high-quality training data. The key to the solution is a novel synthetic data pipeline: large language models (LLMs) generate Egyptian Arabic content, audio synthesis tools convert it to natural speech, and automatic transcription and speaker diarization with manual quality verification produce NileTTS, a 38-hour transcribed speech dataset. The authors then fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on this data, substantially improving Egyptian Arabic speech synthesis.
Link: https://arxiv.org/abs/2602.15675
Authors: Ahmed Khaled Khamis,Hesham Ali
Affiliations: Georgia Institute of Technology; Nile University
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures, EACL26
Abstract:Despite the advances in neural text to speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic – the most widely understood Arabic dialect – severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLM) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.
[NLP-15] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Quick Read: This paper addresses training instability in reinforcement learning (RL) fine-tuning of large language models, especially late-stage performance collapse. The analysis shows that the instability is driven by a tiny fraction (about 0.01%) of "spurious tokens": tokens that contribute little to a correct response yet inherit the full sequence-level reward, leading to abnormally amplified gradient updates. The key to the solution is Spurious-Token-Aware Policy Optimization (STAPO), which selectively masks gradient updates from such tokens and renormalizes the loss over the valid tokens, improving training stability and reasoning quality.
Link: https://arxiv.org/abs/2602.15620
Authors: Shiqi Liu,Zeyu He,Guojian Zhan,Letian Tao,Zhilong Zheng,Jiang Wu,Yinuo Wang,Yang Guan,Kehua Sheng,Bo Zhang,Keqiang Li,Jingliang Duan,Shengbo Eben Li
Affiliations: Tsinghua University; 163.com (NetEase)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term \emphspurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refining, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy and JustRL.
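The mask-and-renormalize step at the heart of STAPO can be sketched in a few lines. How spurious tokens are detected (via token probability and local entropy) is omitted here; the mask is set by hand for illustration.

```python
# The mask-and-renormalise step: zero out the loss contribution of flagged
# "spurious" tokens and renormalise over the remaining valid tokens.
import torch

def masked_pg_loss(logps, advantages, spurious_mask):
    """All inputs are (batch, seq); spurious_mask is 1.0 for silenced tokens."""
    valid = 1.0 - spurious_mask
    per_token = -(advantages * logps) * valid
    return per_token.sum() / valid.sum().clamp(min=1.0)   # renormalised mean

logps = torch.randn(2, 8, requires_grad=True)
adv = torch.ones(2, 8)
mask = torch.zeros(2, 8)
mask[0, 3] = 1.0                          # one hand-flagged spurious token
loss = masked_pg_loss(logps, adv, mask)
loss.backward()
print(loss.item(), logps.grad[0, 3].item())   # silenced token gets zero gradient
```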
[NLP-16] Clinically Inspired Symptom-Guided Depression Detection from Emotion-Aware Speech Representations
Quick Read: This paper addresses the failure of existing depression prediction models to explicitly model symptom-specific information: most treat depression prediction as a binary label or an overall severity score, ignoring how different symptoms (sleep disturbance, loss of interest, concentration difficulties, etc.) are expressed differently in speech, which limits symptom-level analysis for clinical screening. The key to the solution is a symptom-guided, emotion-aware framework with a symptom-guided cross-attention mechanism that aligns PHQ-8 questionnaire items with emotion-aware speech representations, identifying the segments of a participant's speech most relevant to each symptom; to account for differences in how symptoms are expressed over time, a learnable symptom-specific parameter adaptively controls the sharpness of the attention distributions, improving fine-grained modeling of depressive symptoms and interpretability.
Link: https://arxiv.org/abs/2602.15578
Authors: Chaithra Nerella,Chiranjeevi Yarra
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 3 figures
Abstract:Depression manifests through a diverse set of symptoms such as sleep disturbance, loss of interest, and concentration difficulties. However, most existing works treat depression prediction either as a binary label or an overall severity score without explicitly modeling symptom-specific information. This limits their ability to provide symptom-level analysis relevant to clinical screening. To address this, we propose a symptom-specific and clinically inspired framework for depression severity estimation from speech. Our approach uses a symptom-guided cross-attention mechanism that aligns PHQ-8 questionnaire items with emotion-aware speech representations to identify which segments of a participant’s speech are more important to each symptom. To account for differences in how symptoms are expressed over time, we introduce a learnable symptom-specific parameter that adaptively controls the sharpness of attention distributions. Our results on EDAIC, a standard clinical-style dataset, demonstrate improved performance outperforming prior works. Further, analyzing the attention distributions showed that higher attention is assigned to utterances containing cues related to multiple depressive symptoms, highlighting the interpretability of our approach. These findings outline the importance of symptom-guided and emotion-aware modeling for speech-based depression screening.
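A minimal sketch of symptom-guided cross-attention with a learnable per-symptom sharpness parameter follows. Dimensions, the dot-product scoring, and initialization are assumptions; the paper's architecture details may differ.

```python
# Sketch: each PHQ-8 symptom owns a learnable query that attends over utterance
# embeddings, with a learnable per-symptom temperature controlling sharpness.
import torch
import torch.nn as nn

class SymptomAttention(nn.Module):
    def __init__(self, n_symptoms: int = 8, dim: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_symptoms, dim))
        self.log_tau = nn.Parameter(torch.zeros(n_symptoms))  # sharpness per symptom

    def forward(self, utterances: torch.Tensor) -> torch.Tensor:
        scores = self.queries @ utterances.T          # (n_symptoms, T)
        tau = self.log_tau.exp().unsqueeze(1)         # positive temperatures
        attn = torch.softmax(scores / tau, dim=1)     # sharper when tau is small
        return attn @ utterances                      # per-symptom summaries

model = SymptomAttention()
summaries = model(torch.randn(20, 256))               # 20 utterance embeddings
print(summaries.shape)                                # torch.Size([8, 256])
```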
[NLP-17] Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL
Quick Read: This paper addresses why Text-to-SQL remains hard to apply in real-world scenarios: existing methods rely on a single static workflow, which fundamentally limits generalization to out-of-distribution and long-tail cases. The key to the solution is SquRL, a reinforcement learning framework that constructs workflows adaptively at inference time. Its core components are a rule-based reward function, dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency, which together significantly outperform the best static workflow methods, especially on complex and out-of-distribution queries.
Link: https://arxiv.org/abs/2602.15564
Authors: Yihan Wang,Peiyu Liu,Runyu Chen,Wei Xu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Text-to-SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real-world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs’ reasoning capability in adaptive workflow construction. We design a rule-based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries. The codes are available at this https URL
[NLP-18] RUVA: Personalized Transparent On-Device Graph Reasoning
Quick Read: This paper addresses the interpretability and privacy problems raised by "black box" retrieval-augmented generation (RAG) architectures in Personal AI: standard vector databases do statistical matching without accountability, so when the AI hallucinates or leaks sensitive information, users cannot trace or correct the cause; moreover, "deletion" in a vector space is mathematically imprecise, leaving probabilistic "ghosts" that violate true privacy. The key to the solution is Ruva, the first "Glass Box" architecture for human-in-the-loop memory curation: it grounds Personal AI in a Personal Knowledge Graph, letting users inspect what the AI knows and perform precise redaction of specific facts, shifting the paradigm from vector matching to graph reasoning and ensuring the "Right to be Forgotten".
Link: https://arxiv.org/abs/2602.15553
Authors: Gabriele Conte,Alessio Mattiace,Gianni Carmosino,Potito Aghilar,Giovanni Servedio,Francesco Musicco,Vito Walter Anelli,Tommaso Di Noia,Francesco Maria Donini
Affiliations: Politecnico di Bari; Università degli Studi della Tuscia
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The Personal AI landscape is currently dominated by “Black Box” Retrieval-Augmented Generation. While standard vector databases offer statistical matching, they suffer from a fundamental lack of accountability: when an AI hallucinates or retrieves sensitive data, the user cannot inspect the cause nor correct the error. Worse, “deleting” a concept from a vector space is mathematically imprecise, leaving behind probabilistic “ghosts” that violate true privacy. We propose Ruva, the first “Glass Box” architecture designed for Human-in-the-Loop Memory Curation. Ruva grounds Personal AI in a Personal Knowledge Graph, enabling users to inspect what the AI knows and to perform precise redaction of specific facts. By shifting the paradigm from Vector Matching to Graph Reasoning, Ruva ensures the “Right to be Forgotten.” Users are the editors of their own lives; Ruva hands them the pen. The project and the demo video are available at this http URL.
[NLP-19] jina-embeddings-v5-text: Task-Targeted Embedding Distillation
Quick Read: This paper addresses how small text embedding models can retain strong performance on semantic similarity tasks; existing approaches train with purely contrastive or purely distillation-based regimens and struggle to combine compactness with quality. The key to the solution is a novel training regimen that combines model distillation with task-specific contrastive loss in a multi-stage process, preserving the teacher's semantic capability while optimizing the student's embeddings. Benchmark scores for the resulting jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano models match or exceed the state of the art for models of similar size, with support for long texts and robustness under truncation and binary quantization.
Link: https://arxiv.org/abs/2602.15547
Authors: Mohammad Kalim Akram,Saba Sturua,Nastia Havriushenko,Quentin Herreros,Michael Günther,Maximilian Werk,Han Xiao
Affiliations: Jina AI GmbH; Jina by Elastic
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 8 figures. Model weights: this https URL
Abstract:Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
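One plausible way to combine a distillation term with a task-specific contrastive term, in the spirit described above (the specific losses, weighting, and temperature are assumptions, not the paper's training recipe):

```python
# Combine an embedding-distillation term with an in-batch contrastive
# (InfoNCE) term; weighting, losses, and temperature are illustrative.
import torch
import torch.nn.functional as F

def combined_loss(student_q, student_d, teacher_q, alpha=0.5, tau=0.05):
    """student_q/d: (B, d) query/document embeddings; teacher_q: (B, d) targets."""
    # distillation: pull student embeddings toward the teacher's geometry
    distill = 1.0 - F.cosine_similarity(student_q, teacher_q).mean()
    # contrastive: in-batch negatives, positives on the diagonal
    sims = F.normalize(student_q) @ F.normalize(student_d).T / tau
    labels = torch.arange(student_q.shape[0])
    contrast = F.cross_entropy(sims, labels)
    return alpha * distill + (1 - alpha) * contrast

B, d = 16, 128
loss = combined_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
print(loss.item())
```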
[NLP-20] Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Quick Read: This paper addresses the difficulty Digital Humanities (DH) scholars face in efficiently exploring and organizing large, unstructured document collections to uncover topics, sentiments, or other semantic categories. The key to the solution is Perspectives, an interactive extension of the Discourse Analysis Tool Suite that implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement: analytical lenses are first defined through document rewriting prompts and instruction-based embeddings, and results are then aligned with user intent via interactive cluster refinement and fine-tuning of the embedding model, producing a more interpretable and useful document map.
Link: https://arxiv.org/abs/2602.15540
Authors: Tim Fischer,Chris Biemann
Affiliations: University of Hamburg
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities. We showcase how this process can be initially steered by defining analytical lenses through document rewriting prompts and instruction-based embeddings, and further aligned with user intent through tools for refining clusters and mechanisms for fine-tuning the embedding model. The demonstration highlights a typical workflow, illustrating how DH researchers can leverage Perspectives’s interactive document map to uncover topics, sentiments, or other relevant categories, thereby gaining insights and preparing their data for subsequent in-depth analysis.
[NLP-21] ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
Quick Read: This paper addresses the excessively long token sequences produced by self-supervised speech encoders when training pure speech language models, which hurts efficiency and performance. Prior work such as Sylber and SyllableLM uses syllable-like units to shorten sequences but relies on intricate multi-stage training pipelines. The key to the solution is ZeroSyl, a training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model: the L2 norms of WavLM's intermediate-layer features reveal syllable boundaries, segments are mean-pooled and discretized with K-means, and the resulting units are used to train a language model. ZeroSyl outperforms prior syllabic tokenizers on lexical, syntactic, and narrative benchmarks and shows better scaling behavior for syntactic modeling.
Link: https://arxiv.org/abs/2602.15537
Authors: Nicol Visser,Simon Malan,Danel Slabbert,Herman Kamper
Affiliations: Stellenbosch University
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 3 figures, 2 tables
Abstract:Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM’s intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
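The norm-based boundary idea can be sketched as follows. Random features stand in for WavLM's intermediate layers, local minima are one plausible boundary criterion, and the K-means discretization step is omitted; the paper's exact peak-detection procedure may differ.

```python
# Sketch: treat local minima of the frame-wise L2 norm as candidate syllable
# boundaries and mean-pool each resulting segment into one embedding.
import numpy as np

def segment_by_norm(feats: np.ndarray) -> list:
    """feats: (T, d) frame features from an intermediate encoder layer."""
    norms = np.linalg.norm(feats, axis=1)
    bounds = [t for t in range(1, len(norms) - 1)
              if norms[t] < norms[t - 1] and norms[t] < norms[t + 1]]
    edges = [0] + bounds + [len(feats)]
    return [feats[a:b].mean(axis=0) for a, b in zip(edges, edges[1:]) if b > a]

rng = np.random.default_rng(0)
envelope = (1.0 + np.sin(np.arange(100) / 4.0))[:, None]   # fake syllable rhythm
feats = np.abs(rng.normal(size=(100, 16))) * envelope
units = segment_by_norm(feats)
print(f"{len(units)} syllable-like segments")
```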
[NLP-22] ExpertWeaver: Unlocking the Inherent MoE in Dense LLM s with GLU Activation Patterns
Quick Read: This paper addresses how to convert pretrained dense models into high-quality sparse Mixture-of-Experts (MoE) models efficiently; existing methods break the intrinsic activation patterns of the dense model, leading to suboptimal expert construction. The key insight is that the fine-grained, neuron-level activation patterns of the Gated Linear Unit (GLU) mechanism reveal a coarse-grained, inherent MoE structure composed of consistently activated universal neurons and dynamically activated specialized neurons. Building on this, the authors propose ExpertWeaver, a training-free framework that partitions neurons by their activation patterns and constructs shared and routed experts with layer-adaptive configurations, achieving better dynamic structural pruning and superior MoE initialization.
Link: https://arxiv.org/abs/2602.15521
Authors: Ziyu Zhao,Tong Zhu,Zhi Zhang,Tiantian Fan,Jinluan Yang,Kun Kuang,Zhongyu Wei,Fei Wu,Yu Cheng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning that converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches that use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neural-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.
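A rough sketch of the partitioning intuition follows. The 0.9 threshold, the clustering criterion, and the synthetic activations are all invented; ExpertWeaver's actual layer-adaptive procedure is more involved.

```python
# Neurons whose gate fires on nearly every input become a shared expert; the
# rest are grouped into routed experts by co-activation pattern.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
p = rng.random(512)                           # per-neuron firing propensity
acts = rng.random((10000, 512)) < p           # boolean activation records
freq = acts.mean(axis=0)

shared = np.where(freq > 0.9)[0]              # "universal" neurons
routed = np.where(freq <= 0.9)[0]             # "specialized" neurons
patterns = acts[:, routed].T.astype(float)    # (n_routed, n_samples)
experts = KMeans(n_clusters=8, n_init=10).fit_predict(patterns)
print(len(shared), "shared neurons;", np.bincount(experts), "neurons per expert")
```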
[NLP-23] DependencyAI: Detecting AI Generated Text through Dependency Parsing
Quick Read: This paper addresses the detection of AI-generated text to mitigate its potential risks. The key to the solution is DependencyAI, a simple and interpretable detection method that uses only the labels of linguistic dependency relations, distinguishing AI-generated from human-written text through syntactic features. It performs competitively in monolingual, multi-generator, and multilingual settings and offers new insight into cross-domain generalization.
Link: https://arxiv.org/abs/2602.15514
Authors: Sara Ahmed,Tracy Hammond
Affiliations: Texas A&M University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As large language models (LLMs) become increasingly prevalent, reliable methods for detecting AI-generated text are critical for mitigating potential risks. We introduce DependencyAI, a simple and interpretable approach for detecting AI-generated text using only the labels of linguistic dependency relations. Our method achieves competitive performance across monolingual, multi-generator, and multilingual settings. To increase interpretability, we analyze feature importance to reveal syntactic structures that distinguish AI-generated from human-written text. We also observe a systematic overprediction of certain models on unseen domains, suggesting that generator-specific writing styles may affect cross-domain generalization. Overall, our results demonstrate that dependency relations alone provide a robust signal for AI-generated text detection, establishing DependencyAI as a strong linguistically grounded, interpretable, and non-neural network baseline.
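A dependency-labels-only detector in this spirit can be prototyped quickly (requires `python -m spacy download en_core_web_sm`; the toy texts and labels are placeholders, and the paper's feature set and classifier may differ):

```python
# Map each text to its sequence of dependency relation labels and fit a linear
# classifier on label counts -- a minimal take on the dependency-only signal.
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")        # any UD-style parser would do

def dep_string(text: str) -> str:
    return " ".join(tok.dep_ for tok in nlp(text))

texts = ["The cat sat quietly on the mat.",
         "In conclusion, it is important to note that the findings matter."]
labels = [0, 1]                           # 0 = human, 1 = AI-generated (toy)
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit([dep_string(t) for t in texts], labels)
print(clf.predict([dep_string("The dog chased the ball.")]))
```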
[NLP-24] Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination
【Quick Read】: This paper addresses hallucination in LLM-based dialogue systems, where factually incorrect responses can mislead users and erode trust. Existing refinement methods operate only at the response level, ignoring that a single response may contain multiple verifiable or unverifiable facts. The authors propose Fine-Refine, whose core is to decompose a response into atomic fact units, verify each unit against external knowledge, assess fluency via perplexity, and iteratively correct fine-grained errors, yielding more accurate generations.
Link: https://arxiv.org/abs/2602.15509
Authors: Xiangyan Chen, Yujian Gan, Matthew Purver
Affiliations: Queen Mary University of London; Queen's University Belfast; Institut Jožef Stefan
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems. Such hallucinations produce factually incorrect responses that may mislead users and undermine system trust. Existing refinement methods for dialogue systems typically operate at the response level, overlooking the fact that a single response may contain multiple verifiable or unverifiable facts. To address this gap, we propose Fine-Refine, a fine-grained refinement framework that decomposes responses into atomic units, verifies each unit using external knowledge, assesses fluency via perplexity, and iteratively corrects granular errors. We evaluate factuality across the HybriDialogue and OpendialKG datasets in terms of factual accuracy (fact score) and coverage (Not Enough Information Proportion), and experiments show that Fine-Refine substantially improves factuality, achieving up to a 7.63-point gain in dialogue fact score, with a small trade-off in dialogue quality.
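The decompose-verify-correct loop can be summarized as pseudocode. In the sketch below, `llm` and `retrieve` are placeholder callables (assumptions, not a real API), and the perplexity-based fluency check is omitted for brevity.

```python
# Schematic loop of the Fine-Refine procedure as described in the abstract.
def fine_refine(response, question, llm, retrieve, max_iters=3):
    for _ in range(max_iters):
        # 1. Decompose the response into atomic, independently checkable facts.
        facts = llm(f"List each atomic fact in:\n{response}").splitlines()
        corrections = []
        for fact in facts:
            evidence = retrieve(fact)  # external knowledge lookup
            verdict = llm(f"Fact: {fact}\nEvidence: {evidence}\n"
                          "Answer SUPPORTED or REFUTED.")
            if "REFUTED" in verdict:
                corrections.append((fact, evidence))
        if not corrections:
            return response  # all units verified; stop early
        # 2. Rewrite only the refuted units, keeping the rest intact.
        notes = "\n".join(f"- '{f}' contradicts: {e}" for f, e in corrections)
        response = llm(f"Question: {question}\nResponse: {response}\n"
                       f"Fix only these errors:\n{notes}")
    return response
```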
[NLP-25] LuxMT Technical Report
【Quick Read】: This paper tackles weak machine translation from Luxembourgish (LB) into other languages (e.g., French, English), a low-resource setting with scarce data and poor translation quality. The key is LuxMT, a translation system built on Gemma 3 27B and fine-tuned for LB→FR and LB→EN, with parallel data (LuxAlign and parliamentary transcripts) filtered via LuxEmbedder sentence embeddings to remove low-equivalence segment pairs and improve training-data quality. Experiments show LuxMT also markedly improves LB→DE translation despite no German in the training data, and LuxEmbedder shows promise as a quality-estimation (QE) metric, correlating strongly with reference-based metrics, though further study is needed to confirm its reliability.
Link: https://arxiv.org/abs/2602.15506
Authors: Nils Rehlinger
Affiliations: University of Luxembourg
Subjects: Computation and Language (cs.CL)
Comments: preprint
Abstract:We introduce LuxMT, a machine translation system based on Gemma 3 27B and fine-tuned for translation from Luxembourgish (LB) into French (FR) and English (EN). To assess translation performance, we construct a novel benchmark covering LB-FR, LB-EN, and LB-DE using human-translated data from Luci, a tourist magazine about Luxembourg. Training data stems from LuxAlign, a parallel corpus of multilingual Luxembourgish news articles, and LB parliamentary transcripts augmented with Google Translate. We filter the data using LuxEmbedder, LB sentence embeddings, to remove low-equivalence segment-pairs. Overall, LuxMT's results suggest strong improvements over the Gemma 3 baseline, even for translating LB to German (DE), despite the training data not containing any DE. We also explore LuxEmbedder's potential to be used as a quality estimation metric and find strong correlations with other reference-based metrics. However, we call for further research to fully assess the metric's utility and advise using it with caution.
[NLP-26] Towards Expectation Detection in Language: A Case Study on Treatment Expectations in Reddit
【Quick Read】: This paper asks how patients express treatment expectations on online platforms such as Reddit, and what the semantic characteristics and distribution of those expectations are. Because traditional clinical studies struggle to capture expectations patients are unwilling or unable to voice in formal settings, the authors propose detecting and analyzing such expectations in online text via NLP. The key is introducing Expectation Detection as a new task and building RedHOTExpect, the first annotated corpus for this medical setting (4.5K Reddit posts), silver-labeled with an LLM and manually validated (~78% label accuracy). Analysis reveals that optimism and proactive framing are more pronounced for physical or treatment-related illnesses than for mental-health contexts, and that most discussion focuses on treatment benefits rather than negative outcomes.
Link: https://arxiv.org/abs/2602.15504
Authors: Aswathy Velutharambath, Amelie Wührl
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Patients’ expectations towards their treatment have a substantial effect on the treatments’ success. While primarily studied in clinical settings, online patient platforms like medical subreddits may hold complementary insights: treatment expectations that patients feel unnecessary or uncomfortable to share elsewhere. Despite this, no studies examine what type of expectations users discuss online and how they express them. Presumably this is because expectations have not been studied in natural language processing (NLP) before. Therefore, we introduce the task of Expectation Detection, arguing that expectations are relevant for many applications, including opinion mining and product design. Subsequently, we present a case study for the medical domain, where expectations are particularly crucial to extract. We contribute RedHOTExpect, a corpus of Reddit posts (4.5K posts) to study expectations in this context. We use a large language model (LLM) to silver-label the data and validate its quality manually (label accuracy ~78%). Based on this, we analyze which linguistic patterns characterize expectations and explore what patients expect and why. We find that optimism and proactive framing are more pronounced in posts about physical or treatment-related illnesses compared to mental-health contexts, and that in our dataset, patients mostly discuss benefits rather than negative outcomes. The RedHOTExpect corpus can be obtained from this https URL
[NLP-27] In Agents We Trust but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations ICLR2026
【Quick Read】: This paper investigates latent source preferences in LLM-based agents: when filtering and presenting information, models systematically prioritize content from certain sources over others, potentially skewing the diversity and fairness of what users see. Through controlled experiments, the authors show that several mainstream LLMs exhibit stable, predictable source preferences on both synthetic and real-world tasks; these preferences are sensitive to contextual framing, can outweigh the influence of content itself, and persist even when models are explicitly prompted to avoid them. The findings expose a latent bias mechanism in LLM-mediated information distribution and call for deeper investigation into its origins, along with transparency and control mechanisms that let users understand and intervene in the decisions of LLM-powered agents.
Link: https://arxiv.org/abs/2602.15456
Authors: Mohammad Aflah Khan, Mahsa Amani, Soumi Das, Bishwamittra Ghosh, Qinyuan Wu, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: ICLR 2026
Abstract:Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms. These agents filter, prioritize, and synthesize information retrieved from the platforms’ back-end databases or via web search. In these scenarios, LLM agents govern the information users receive, by drawing users’ attention to particular instances of retrieved information at the expense of others. While much prior work has focused on biases in the information LLMs themselves generate, less attention has been paid to the factors that influence what information LLMs select and present to users. We hypothesize that when information is attributed to specific sources (e.g., particular publishers, journals, or platforms), current LLMs exhibit systematic latent source preferences- that is, they prioritize information from some sources over others. Through controlled experiments on twelve LLMs from six model providers, spanning both synthetic and real-world tasks, we find that several models consistently exhibit strong and predictable source preferences. These preferences are sensitive to contextual framing, can outweigh the influence of content itself, and persist despite explicit prompting to avoid them. They also help explain phenomena such as the observed left-leaning skew in news recommendations in prior work. Our findings advocate for deeper investigation into the origins of these preferences, as well as for mechanisms that provide users with transparency and control over the biases guiding LLM-powered agents.
[NLP-28] TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
【Quick Read】: This paper addresses a key limitation of LLMs in code generation, synthesizing algorithmically sophisticated and robust code: existing reinforcement fine-tuning (RFT) methods ignore the heterogeneous difficulty and granularity of test cases, leading to imbalanced reward signals and biased gradient updates. The key is Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT), which builds a four-tier test suite per problem (basic, intermediate, complex, edge) to create a controlled difficulty landscape for curriculum design. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than relying on incidental test-case difficulty composition, improving optimization stability and competency acquisition. Experiments show the optimal curriculum is closely tied to a model's inherent capability: weaker models benefit from easy-to-hard progressions, while stronger models excel under hard-first curricula.
Link: https://arxiv.org/abs/2602.15449
Authors: Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen, Jing Tang, Jianguo Li
Affiliations: Electronics and Telecommunications Research Institute; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Hugging Face; Ant Group
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: The first three authors contributed equally to this work; listing order is random
Abstract:Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model’s inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model’s capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at this https URL.
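A toy rendering of the tiered-reward and capability-conditioned curriculum idea follows; the equal tier weighting, the 0.5 capability threshold, and the four-stage schedule are invented for illustration and are not TAROT's actual policy portfolio.

```python
# Toy sketch of tiered test rewards plus capability-conditioned curriculum selection.
TIERS = ["basic", "intermediate", "complex", "edge"]

def tiered_reward(results, active_tiers):
    """results: {tier: pass_rate in [0, 1]}; only active tiers contribute."""
    rates = [results[t] for t in active_tiers]
    return sum(rates) / len(rates)

def select_curriculum(capability):
    """Easy-to-hard for weaker models, hard-first for stronger ones."""
    if capability < 0.5:
        return [TIERS[:k] for k in range(1, 5)]       # basic -> all tiers
    return [TIERS[::-1][:k] for k in range(1, 5)]     # edge/complex first

for stage, active in enumerate(select_curriculum(capability=0.3)):
    r = tiered_reward({t: 0.6 for t in TIERS}, active)
    print(f"stage {stage}: tiers={active} reward={r:.2f}")
```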
[NLP-29] Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs EACL2026
【Quick Read】: This paper addresses the difficulty of using unstructured information extracted from large digitized historical text collections for quantitative social-science research: over 350K mentions of leisure activities and organizational memberships (71K unique names) extracted from interviews with Finnish WWII Karelian evacuee families could not be analyzed directly. The key is a structured categorization framework covering core dimensions of participation (activity/organization type, sociality, regularity, and physical intensity), evaluated against a manually annotated gold standard; with voting across multiple runs, an open-weight LLM closely matches expert judgments, enabling automatic labeling of all entities and producing a structured resource for downstream studies of social integration and related outcomes.
Link: https://arxiv.org/abs/2602.15436
Authors: Joonatan Laato, Veera Schroderus, Jenna Kanerva, Jenni Kauppi, Virpi Lummaa, Filip Ginter
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Presented at: The 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature; EACL 2026 Workshop
Abstract:Digitized historical archives make it possible to study everyday social life on a large scale, but the information extracted directly from text often does not directly allow one to answer the research questions posed by historians or sociologists in a quantitative manner. We address this problem in a large collection of Finnish World War II Karelian evacuee family interviews. Prior work extracted more than 350K mentions of leisure time activities and organizational memberships from these interviews, yielding 71K unique activity and organization names – far too many to analyze directly. We develop a categorization framework that captures key aspects of participation (the kind of activity/organization, how social it typically is, how regularly it happens, and how physically demanding it is). We annotate a gold-standard set to allow for a reliable evaluation, and then test whether large language models can apply the same schema at scale. Using a simple voting approach across multiple model runs, we find that an open-weight LLM can closely match expert judgments. Finally, we apply the method to label the 350K entities, producing a structured resource for downstream studies of social integration and related outcomes.
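The multi-run voting step is straightforward to sketch. In the snippet below, `classify` stands in for a single LLM categorization call, and the toy labels are invented; returning the agreement fraction alongside the majority label gives a cheap confidence signal.

```python
# Minimal majority-vote aggregation over repeated LLM runs, as the abstract describes.
import random
from collections import Counter

def vote_label(entity, classify, n_runs=5):
    votes = Counter(classify(entity) for _ in range(n_runs))
    label, count = votes.most_common(1)[0]
    return label, count / n_runs  # majority label plus an agreement score

# Toy demo with a stochastic stand-in classifier.
fake = lambda e: random.choice(["sports club", "sports club", "choir"])
print(vote_label("urheiluseura", fake))
```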
[NLP-30] World-Model-Augmented Web Agents with Action Correction
【Quick Read】: This paper addresses two failure modes of LLM-based web agents: a limited ability to predict environment changes, which hampers reasoning about sensible actions, and insufficient awareness of execution risks, leading to premature risky actions and task failure. The key is the WAC framework, which integrates model collaboration, consequence simulation, and feedback-driven action refinement: a multi-agent collaboration process lets the action model consult a world model as a web-environment expert for strategic guidance and ground it into executable actions using prior knowledge of state-transition dynamics, while a two-stage deduction chain has the world model simulate action consequences and a judge model scrutinize them, triggering corrective feedback when necessary for risk-aware, resilient task execution.
Link: https://arxiv.org/abs/2602.15384
Authors: Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, Shengyu Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Web agents based on large language models have demonstrated promising capability in automating web tasks. However, current web agents struggle to reason out sensible actions due to the limitations of predicting environment changes, and might not possess comprehensive awareness of execution risks, prematurely performing risky actions that cause losses and lead to task failure. To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement. To overcome the cognitive isolation of individual models, we introduce a multi-agent collaboration process that enables an action model to consult a world model as a web-environment expert for strategic guidance; the action model then grounds these suggestions into executable actions, leveraging prior knowledge of environmental state transition dynamics to enhance candidate action proposal. To achieve risk-aware resilient task execution, we introduce a two-stage deduction chain. A world model, specialized in environmental state transitions, simulates action outcomes, which a judge model then scrutinizes to trigger action corrective feedback when necessary. Experiments show that WAC achieves absolute gains of 1.8% on VisualWebArena and 1.3% on Online-Mind2Web.
[NLP-31] The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
【Quick Read】: This paper addresses the inefficiency of discrete text communication in multi-agent systems (MAS), which incurs significant runtime overhead and information quantization loss. Existing latent-state-transfer approaches assume homogeneous architectures or rely on pair-specific translators, limiting scalability and modularity across model families. The key is the Vision Wormhole framework, which repurposes the visual interface of Vision-Language Models (VLMs) for model-agnostic, text-free communication: a Universal Visual Codec maps heterogeneous reasoning traces into a shared continuous latent space and injects them directly into the receiver's visual pathway, treating the vision encoder as a universal port for inter-agent transfer. A hub-and-spoke topology reduces pairwise alignment complexity, and a label-free teacher-student distillation objective aligns the high-speed visual channel with the robust reasoning patterns of the text pathway, enabling efficient and faithful collaborative reasoning across heterogeneous model families (e.g., Qwen-VL, Gemma).
Link: https://arxiv.org/abs/2602.15382
Authors: Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao
Affiliations: Purdue University; Contextual AI; Carnegie Mellon University; Georgia Institute of Technology
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint. Work in progress
Abstract:Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver’s visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at this https URL
[NLP-32] Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language EACL
【Quick Read】: This paper asks whether LLMs can hold basic conversations in languages nearly absent from their training data, using Tulu, a Dravidian language of southern India, as a case study. The core challenge is eliciting usable generation through prompt engineering alone, without adequate corpora. The key combines explicit grammar documentation to guide structured output, negative constraints to suppress high-probability tokens from related languages, romanization standardization to reduce ambiguity, and quality-controlled synthetic data generated via self-play. Experiments show vocabulary contamination drops from 80% to 5% while grammatical accuracy reaches 85%, with negative constraints yielding consistent gains of 12-18 percentage points across models, validating structured prompting for extremely low-resource languages.
Link: https://arxiv.org/abs/2602.15378
Authors: Prathamesh Devadiga, Paras Chopra
Affiliations: Lossfunk
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EACL LoResLM Workshop
Abstract:Can large language models converse in languages virtually absent from their training data? We investigate this question through a case study on Tulu, a Dravidian language with over 2 million speakers but minimal digital presence. Rather than fine-tuning an LLM, we examine whether structured prompts alone can elicit basic conversational ability under controlled prompting. We systematically tackle various challenges posed by absence of training data for Tulu by combining explicit grammar documentation, negative constraints to suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Evaluated on a manually curated held-out set across three LLMs (Gemini 2.0 Flash, GPT-4o, Llama 3.1 70B) and validated by native speakers, our approach reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy. Cross-model analysis reveals that negative constraints provide consistent improvements (12–18 percentage points), while grammar documentation effects vary by model architecture (8–22 points).
[NLP-33] Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
【Quick Read】: This paper addresses two problems in customer service automation: existing approaches rely on complex modular designs with extensive manual agent orchestration, raising implementation cost and limiting scalability, while over-simplified instruction templates provide weak guidance for complex service flows and poor generalizability. The key is an orchestration-free framework built on Task-Oriented Flowcharts (TOFs) for end-to-end automation: the authors formalize TOF components and evaluation metrics, design a cost-efficient flowchart construction algorithm that abstracts procedural knowledge from service dialogues, emphasize local deployment of small language models, and propose decentralized distillation with flowcharts to mitigate data scarcity and privacy issues, substantially improving the performance and practicality of automated service systems.
Link: https://arxiv.org/abs/2602.15377
Authors: Mengze Hong, Chen Jason Zhang, Zichang Guo, Hanlin Gu, Di Jiang, Li Qing
Affiliations: Hong Kong Polytechnic University; WeBank
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by TheWebConf 2026
Abstract:Customer service automation has seen growing demand within digital transformation. Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability. This paper introduces an orchestration-free framework using Task-Oriented Flowcharts (TOFs) to enable end-to-end automation without manual intervention. We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues. We emphasize local deployment of small language models and propose decentralized distillation with flowcharts to mitigate data scarcity and privacy issues in model training. Extensive experiments validate the effectiveness in various service tasks, with superior quantitative and application performance compared to strong baselines and market products. By releasing a web-based system demonstration with case studies, we aim to promote streamlined creation of future service automation.
[NLP-34] Far Out: Evaluating Language Models on Slang in Australian and Indian English EACL2026
【Quick Read】: This paper addresses systematic performance gaps of language models on slang in non-standard English varieties, specifically Indian English (en-IN) and Australian English (en-AU), whose variety-specific slang comprehension remains underexplored. The key is two complementary datasets, web (377 real usage examples from Urban Dictionary) and gen (1,492 synthetically generated slang usages across diverse scenarios), plus three evaluation tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS). The results reveal asymmetries between generative and discriminative competencies and performance disparities across varieties, providing an empirical basis and methodological framework for adapting language models to diverse language varieties.
Link: https://arxiv.org/abs/2602.15373
Authors: Deniz Kaya Dilsiz, Dipankar Srirag, Aditya Joshi
Affiliations: University of New South Wales
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted as a paper at 13th VarDial workshop at EACL 2026
Abstract:Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: web, containing 377 web-sourced usage examples from Urban Dictionary, and gen, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP ^* ) and target word selection (TWS). Our results reveal three key findings: (1) higher average model performance on TWS versus TWP and TWP ^* , with average accuracy increasing from 0.03 to 0.49; (2) stronger average model performance on the web versus gen datasets, with average similarity scores increasing by 0.03 and 0.05 across the TWP and TWP ^* tasks respectively; (3) en-IN tasks outperform en-AU tasks when averaged across all models and datasets, with TWS demonstrating the largest disparity, increasing average accuracy from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly in the context of slang expressions, despite English being a technologically rich language.
[NLP-35] NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering
【Quick Read】: This paper addresses multi-hop reasoning over knowledge-intensive queries in Knowledge Graph Question Answering (KGQA): naively embedding knowledge into prompts is inefficient and fragile, while purely symbolic or search-heavy approaches are costly in retrievals and lack gradient-based refinement. The key is NeuroSymActive, a modular framework coupling a differentiable neural-symbolic reasoning layer with an active, value-guided exploration controller: soft-unification-style symbolic modules handle logical inference, while a Monte-Carlo-style exploration policy, guided by a neural path evaluator, prioritizes high-value path expansions, achieving strong answer accuracy while sharply reducing expensive graph lookups and model calls.
Link: https://arxiv.org/abs/2602.15353
Authors: Rong Fu, Yang Li, Zeyu Zhang, Jiekai Wu, Yaohua Liu, Shuaishuai Cao, Yangchen Zeng, Yuhang Zhang, Xiaojing Du, Chuang Zhao, Kangning Cui, Simon Fong
Affiliations: University of Macau; University of Chinese Academy of Sciences; The Australian National University; Juntendo University; Guangdong Institute of Intelligence Science and Technology; Central South University; Southeast University; China Agricultural University; Adelaide University; The Hong Kong University of Science and Technology; Wake Forest University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages, 7 figures
Abstract:Large pretrained language models and neural reasoning systems have advanced many natural language tasks, yet they remain challenged by knowledge-intensive queries that require precise, structured multi-hop inference. Knowledge graphs provide a compact symbolic substrate for factual grounding, but integrating graph structure with neural models is nontrivial: naively embedding graph facts into prompts leads to inefficiency and fragility, while purely symbolic or search-heavy approaches can be costly in retrievals and lack gradient-based refinement. We introduce NeuroSymActive, a modular framework that combines a differentiable neural-symbolic reasoning layer with an active, value-guided exploration controller for Knowledge Graph Question Answering. The method couples soft-unification style symbolic modules with a neural path evaluator and a Monte-Carlo style exploration policy that prioritizes high-value path expansions. Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.
[NLP-36] Discovering Implicit Large Language Model Alignment Objectives
【Quick Read】: This paper addresses the opacity of complex reward signals in LLM alignment, which obscures the behaviors being incentivized and creates risks of misalignment and reward hacking. Existing interpretation methods rely on pre-defined rubrics, risking missed "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to model behavior. The key is Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural-language objectives: an iterative greedy algorithm analyzes behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Experiments across diverse tasks, model sizes, and alignment algorithms show the framework robustly captures 90% of reward behavior and can surface latent misaligned incentives.
Link: https://arxiv.org/abs/2602.15338
Authors: Edward Chen, Sanmi Koyejo, Carlos Guestrin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of “unknown unknowns”, or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework’s robustness. Experiments with popular open-source reward models show that the framework consistently captures 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.
[NLP-37] Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
【Quick Read】: This paper addresses the lack of prescriptive scaling laws for deploying foundation models: given a pre-training compute budget, what downstream accuracy is attainable under contemporary post-training practice, and how stable is that mapping as the field evolves? The key is estimating capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization, using large-scale observational evaluations (5K existing and 2K newly sampled data points). Temporal reliability is validated by fitting on earlier model generations and evaluating on later releases; the approach is further extended to task-dependent saturation and contamination-related shifts on math reasoning tasks. An efficient algorithm additionally recovers near-complete performance frontiers with roughly 20% of the evaluation budget, giving practitioners a reliable, scalable tool for setting performance expectations.
Link: https://arxiv.org/abs/2602.15327
Authors: Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade
Affiliations: Harvard University; Stanford University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Blog Post: this https URL
Abstract:For deploying foundation models, practitioners increasingly need prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations (5k existing and 2k newly sampled model-performance records), we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate the temporal reliability by fitting on earlier model generations and evaluating on later releases. Across various tasks, the estimated boundaries are mostly stable, with the exception of math reasoning, which exhibits a consistently advancing boundary over time. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget. Together, our work releases Proteus 2k, the latest model-performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift across time.
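A minimal sketch of fitting such a boundary, a monotone saturating sigmoid at a high conditional quantile via the pinball loss, is shown below on synthetic data. The four-parameter form and the Nelder-Mead optimizer are our assumptions; the paper's exact parameterization and smoothing may differ.

```python
# Sketch of estimating a capability boundary as a high conditional quantile.
import numpy as np
from scipy.optimize import minimize

def sigmoid_curve(params, x):
    lo, hi, slope, mid = params
    return lo + (hi - lo) / (1.0 + np.exp(-slope * (x - mid)))

def pinball_loss(params, x, y, tau=0.95):
    """Quantile (pinball) loss at level tau for residuals y - f(x)."""
    r = y - sigmoid_curve(params, x)
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

# x: log10 pre-training FLOPs; y: benchmark scores (synthetic demo data
# scattered below a known boundary).
x = np.random.uniform(20, 26, 300)
y = sigmoid_curve([0.25, 0.9, 1.5, 23.0], x) - np.abs(np.random.randn(300)) * 0.1
fit = minimize(pinball_loss, x0=[0.2, 1.0, 1.0, 23.0], args=(x, y),
               method="Nelder-Mead")
print("estimated boundary params:", fit.x.round(3))
```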
[NLP-38] Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory
【Quick Read】: This paper addresses long-term memory management for LLMs: existing similarity-based retrieval methods (RAG and Graph-RAG) are efficient but struggle with scenarios that require global reasoning or comprehensive coverage of relevant knowledge. The key is Mnemis, a memory framework combining two complementary mechanisms: fast System-1 similarity retrieval and a deliberate System-2 Global Selection mechanism that performs top-down traversal of semantic hierarchies over a hierarchical graph. This dual-route design retrieves memory items that are both semantically and structurally relevant, achieving state-of-the-art results on long-term memory benchmarks: 93.9 on LoCoMo and 91.6 on LongMemEval-S with GPT-4.1-mini.
Link: https://arxiv.org/abs/2602.15313
Authors: Zihao Tang, Xin Yu, Ziyu Xiao, Zengxuan Wen, Zelin Li, Jiaxi Zhou, Hualei Wang, Haohua Wang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 10 pages
Abstract:AI memory, specifically how a model organizes and retrieves historical messages, is becoming increasingly valuable to Large Language Models (LLMs), yet existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms. While efficient, such System-1-style retrieval struggles with scenarios that require global reasoning or comprehensive coverage of all relevant information. In this work, we propose Mnemis, a novel memory framework that integrates System-1 similarity search with a complementary System-2 mechanism, termed Global Selection. Mnemis organizes memory into a base graph for similarity retrieval and a hierarchical graph that enables top-down, deliberate traversal over semantic hierarchies. By combining the complementary strengths of both retrieval routes, Mnemis retrieves memory items that are both semantically and structurally relevant. Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.
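Schematically, the dual-route retrieval might look like the sketch below: a flat top-k similarity route plus a beam-limited top-down walk of a hierarchy, with the two result lists merged. The node schema (dicts with "id", "item", and "children") and the beam width are our assumptions, not Mnemis's actual data structures.

```python
# Illustrative dual-route memory retrieval: System 1 + System 2.
import numpy as np

def route1_similarity(query, memory_vecs, k=3):
    """System 1: plain top-k similarity over all memory items."""
    return list(np.argsort(-(memory_vecs @ query))[:k])

def route2_global(query, node, node_vecs, beam=2):
    """System 2: deliberate top-down traversal of the semantic hierarchy."""
    if not node["children"]:
        return [node["item"]]  # leaf: return the memory item it holds
    ranked = sorted(node["children"],
                    key=lambda c: -(node_vecs[c["id"]] @ query))
    hits = []
    for child in ranked[:beam]:
        hits += route2_global(query, child, node_vecs, beam)
    return hits

def mnemis_retrieve(query, memory_vecs, root, node_vecs):
    fast = route1_similarity(query, memory_vecs)
    deliberate = route2_global(query, root, node_vecs)
    return list(dict.fromkeys(fast + deliberate))  # merge, keep order, dedupe
```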
[NLP-39] Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
【Quick Read】: This paper addresses a core challenge in marketing research and practice: accurately measuring consumer emotions and evaluations from unstructured text. The key is the Linguistic eXtractor (LX), a fine-tuned large language model trained on consumer-authored text labeled with self-reported ratings of 16 consumption-related emotions and four evaluation constructs (trust, commitment, recommendation, sentiment). LX achieves 81% macro-F1 on open-ended survey responses and over 95% accuracy on third-party-annotated Amazon and Yelp reviews, outperforming GPT-4 Turbo, RoBERTa, and DeepSeek. A no-code, cost-free web application enables scalable analysis, and an application to online retail data confirms that review-expressed emotions predict purchase behavior largely through product ratings, while some emotions such as discontent and peacefulness influence purchase directly, showing that emotional tone carries meaningful signal beyond star ratings.
Link: https://arxiv.org/abs/2602.15312
Authors: Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin, Dhruv Grewal, Lan Du
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Econometrics (econ.EM)
Comments:
Abstract:Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice. This study introduces the Linguistic eXtractor (LX), a fine-tuned, large language model trained on consumer-authored text that also has been labeled with consumers’ self-reported ratings of 16 consumption-related emotions and four evaluation constructs: trust, commitment, recommendation, and sentiment. LX consistently outperforms leading models, including GPT-4 Turbo, RoBERTa, and DeepSeek, achieving 81% macro-F1 accuracy on open-ended survey responses and greater than 95% accuracy on third-party-annotated Amazon and Yelp reviews. An application of LX to online retail data, using seemingly unrelated regression, affirms that review-expressed emotions predict product ratings, which in turn predict purchase behavior. Most emotional effects are mediated by product ratings, though some emotions, such as discontent and peacefulness, influence purchase directly, indicating that emotional tone provides meaningful signals beyond star ratings. To support its use, a no-code, cost-free, LX web application is available, enabling scalable analyses of consumer-authored text. In establishing a new methodological foundation for consumer perception measurement, this research demonstrates new methods for leveraging large language models to advance marketing research and practice, thereby achieving validated detection of marketing constructs from consumer data.
[NLP-40] The Information Geometry of Softmax: Probing and Steering
【Quick Read】: This paper asks how AI systems encode semantic structure in the geometry of their representation spaces, in particular when those representations define softmax distributions, where the natural geometry should reflect how models use representations to produce behavior. The key is adopting information geometry as the natural geometry of such representation spaces and, building on it, developing "dual steering": a linear-probe-based method that robustly steers a target concept while provably minimizing changes to off-target concepts, improving the controllability and stability of concept manipulation.
Link: https://arxiv.org/abs/2602.15293
Authors: Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch
Affiliations: University of Chicago; INSEAD
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Code is available at this https URL
Abstract:This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation of this paper is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop “dual steering”, a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.
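To convey the flavor of steering under softmax information geometry, the sketch below moves a representation along a probe's dual (natural-gradient) direction under the Fisher metric of the softmax family pulled back to representation space. This is a generic illustration of the information-geometric idea only; it is not claimed to be the paper's exact dual-steering procedure, and all tensors are toy stand-ins.

```python
# Schematic of steering along a probe's dual direction under the softmax Fisher metric.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dual_steer(h, W_unembed, w_probe, alpha=1.0, ridge=1e-3):
    """Move h along F(h)^{-1} w, where F is the pulled-back softmax Fisher."""
    p = softmax(W_unembed @ h)                       # (V,) next-token probs
    # Fisher of the categorical family under logits z = W h, pulled back:
    #   F = W^T (diag(p) - p p^T) W
    A = np.diag(p) - np.outer(p, p)
    F = W_unembed.T @ A @ W_unembed
    direction = np.linalg.solve(F + ridge * np.eye(len(h)), w_probe)
    return h + alpha * direction

h = np.random.randn(16)
W = np.random.randn(100, 16)   # toy unembedding, vocabulary of 100
w = np.random.randn(16)        # linear probe for some concept
print(dual_steer(h, W, w)[:4])
```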
[NLP-41] FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
【Quick Read】: This paper addresses information health: how ranking and personalization policies shape users' long-term exposure to adverse digital experiences and, over time, their judgment. The key is FrameRef, a large-scale dataset of 1,073,740 systematically reframed claims across five framing dimensions (authoritative, consensus, emotional, prestige, sensationalist), combined with a simulation framework: framing-sensitive agent personas are built by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving task competence, and Monte Carlo trajectory sampling shows that small, systematic shifts in acceptance and confidence compound over time into substantial divergence in cumulative information-health trajectories. Human evaluation further confirms that the generated framings measurably affect human judgment, making the dataset and framework a quantitative testbed for information-health research.
Link: https://arxiv.org/abs/2602.15273
Authors: Victor De Lima, Jiqun Liu, Grace Hui Yang
Affiliations: Georgetown InfoSense; University of Oklahoma
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:
Abstract:Information ecosystems increasingly shape how people internalize exposure to adverse digital experiences, raising concerns about the long-term consequences for information health. In modern search and recommendation systems, ranking and personalization policies play a central role in shaping such exposure and its long-term effects on users. To study these effects in a controlled setting, we present FrameRef, a large-scale dataset of 1,073,740 systematically reframed claims across five framing dimensions: authoritative, consensus, emotional, prestige, and sensationalist, and propose a simulation-based framework for modeling sequential information exposure and reinforcement dynamics characteristic of ranking and recommendation systems. Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence. Using Monte Carlo trajectory sampling, we show that small, systematic shifts in acceptance and confidence can compound over time, producing substantial divergence in cumulative information health trajectories. Human evaluation further confirms that FrameRef’s generated framings measurably affect human judgment. Together, our dataset and framework provide a foundation for systematic information health research through simulation, complementing and informing responsible human-centered research. We release FrameRef, code, documentation, human evaluation data, and persona adapter models at this https URL.
[NLP-42] How to Train Your Long-Context Visual Document Model
【Quick Read】: This paper addresses the lack of reproducible recipes and the performance bottlenecks in training and evaluating long-context vision-language models, focusing on long-document visual question answering at up to 344K tokens. The key is a systematic study of continued pretraining, supervised finetuning, and preference optimization, backed by extensive long-context evaluations and ablations, with key findings: training on context lengths that match evaluation lengths beats training on longer contexts; page indices give a simple, high-impact boost to long-document understanding; synthetic data pipelines enable self-improvement; and visual long-context training is shown, for the first time, to transfer positively to text long-context performance, extending the boundary of cross-modal long-context transfer. These findings offer a reproducible, efficient training paradigm for future long-context multimodal models.
Link: https://arxiv.org/abs/2602.15257
Authors: Austin Veselka
Affiliations: LightOn
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art performance on MMLongBenchDoc for both parameter scales. In addition to this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc to reduce erroneous and low quality examples in the benchmark.
[NLP-43] OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
【Quick Read】: This paper addresses the limited performance of Large Language Model (LLM) agents when using opaque tools: many real-world tools (e.g., general search APIs) lack clear best practices or documented failure modes, defeating existing approaches. The key is ToolObserver, a simple framework that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories, improving agents' task completion in complex, opaque environments. On the OpaqueToolsBench benchmark it outperforms existing automatic documentation methods, and in test-time tool exploration settings it is markedly more efficient, consuming 3.5-7.5x fewer tokens than the best baseline.
Link: https://arxiv.org/abs/2602.15197
Authors: Skyler Hallinan, Thejas Venkatesh, Xiang Ren, Sai Praneeth Karimireddy, Ashwin Paranjape, Yuhao Zhang, Jack Hessel
Affiliations: University of Southern California; Samaya AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general “search” APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
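The documentation-refinement loop the abstract describes reduces to something like the following sketch; `llm` and `run_episode` are placeholder callables (assumptions), and the prompt wording is illustrative rather than ToolObserver's actual prompt.

```python
# Schematic of iterative tool-documentation refinement from execution feedback.
def tool_observer(tool_name, docs, tasks, llm, run_episode, rounds=3):
    for _ in range(rounds):
        trajectories = []
        for task in tasks:
            # The agent attempts the task with the current documentation; we
            # record every tool call together with its raw execution feedback.
            trajectories.append(run_episode(task, tool_name, docs))
        observations = "\n".join(t["calls_and_errors"] for t in trajectories)
        docs = llm(
            f"Current docs for `{tool_name}`:\n{docs}\n\n"
            f"Observed calls, outputs, and failures:\n{observations}\n\n"
            "Rewrite the docs to capture best practices and failure modes."
        )
    return docs
```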
[NLP-44] Weight-space Detection of Backdoors in LoRA Adapters
【Quick Read】: This paper addresses backdoor attacks on LoRA (Low-Rank Adaptation) adapters shared through open platforms such as the Hugging Face Hub: existing detection methods must run the model on test inputs, making it impractical to screen large numbers of adapters. The key is detecting poisoned adapters without running the model, by analyzing the adapter weight matrices directly: simple statistics such as singular-value concentration, entropy, and distribution shape flag adapters that deviate from normal patterns, yielding a data-agnostic detector that achieves 97% detection accuracy with under 2% false positives on 500 LoRA adapters (400 clean, 100 poisoned).
Link: https://arxiv.org/abs/2602.15195
Authors: David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li, Javier Ferrando, Maheep Chaudhary
Affiliations: Algoverse AI Research; University of Aberdeen; Independent
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories like the Hugging Face Hub, making them vulnerable to backdoor attacks. Current detection methods require running the model with test input data – making them impractical for screening thousands of adapters where the trigger for backdoor behavior is unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model – making our method data-agnostic. Our method extracts simple statistics – how concentrated the singular values are, their entropy, and the distribution shape – and flags adapters that deviate from normal patterns. We evaluate the method on 500 LoRA adapters – 400 clean, and 100 poisoned for Llama-3.2-3B on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE dataset. We achieve 97% detection accuracy with less than 2% false positives.
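Because the features are simple spectral statistics of the adapter update delta_W = B @ A, the screening pipeline is easy to sketch. The exact feature set and the z-score threshold rule below are our assumptions based on the abstract's description, not the paper's released implementation.

```python
# Sketch of data-free screening of LoRA adapters via singular-value statistics.
import numpy as np

def svd_features(B, A):
    """B: (d, r), A: (r, k) LoRA factors; features of delta_W = B @ A."""
    s = np.linalg.svd(B @ A, compute_uv=False)
    p = s / s.sum()
    return np.array([
        p[0],                              # top singular-value concentration
        -(p * np.log(p + 1e-12)).sum(),    # spectral entropy
        s.std() / (s.mean() + 1e-12),      # distribution shape (dispersion)
    ])

def flag_outliers(feature_rows, z_thresh=3.0):
    """Flag adapters whose features deviate strongly from the population."""
    X = np.asarray(feature_rows)
    z = np.abs((X - X.mean(0)) / (X.std(0) + 1e-12))
    return np.where(z.max(axis=1) > z_thresh)[0]   # indices of suspect adapters

feats = [svd_features(np.random.randn(64, 8), np.random.randn(8, 64))
         for _ in range(50)]
print("flagged:", flag_outliers(feats))
```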
[NLP-45] AIC CTU@AVerImaTeC: dual-retriever RAG for image-text fact checking
【Quick Read】: This paper addresses multimodal evidence fusion in fact-checking, in particular how to combine textual and image information efficiently and accurately. The key is a lightweight system of three decoupled modules: a text retrieval module based on similarity search, a reverse image search (RIS) module accessed via API, and a generation module using GPT5.1. The architecture needs only a single multimodal LLM call per fact-check at an average cost of about $0.013, is easy to reproduce and tune, and serves as a simple, efficient baseline and starting point for further research.
Link: https://arxiv.org/abs/2602.15190
Authors: Herbert Ullrich, Jan Drchal
Affiliations: CTU FEE
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In this paper, we present our 3rd place system in the AVerImaTeC shared task, which combines our last year's retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module. Despite its simplicity, our system delivers competitive performance with a single multimodal LLM call per fact-check at just $0.013 on average using GPT5.1 via the OpenAI Batch API. Our system is also easy to reproduce and tweak, consisting of only three decoupled modules - a textual retrieval module based on similarity search, an image retrieval module based on API-accessed RIS, and a generation module using GPT5.1 - which is why we suggest it as an accessible starting point for further experimentation. We publish its code and prompts, as well as our vector stores and insights into the scheme's running costs and directions for further improvement.
[NLP-46] Seeing to Generalize: How Visual Data Corrects Binding Shortcuts ICML2026
【Quick Read】: This paper investigates a surprising phenomenon: although Vision-Language Models (VLMs) are designed to extend LLMs with visual capabilities, VLMs can sometimes outperform their underlying LLMs on purely text-only tasks, particularly long-context information retrieval. Using a controlled synthetic retrieval task, the authors find that a transformer trained only on text is perfect in-distribution but generalizes poorly, whereas subsequent training on an image-tokenized version of the same task nearly doubles out-of-distribution text performance. The key mechanism is that cross-modal training changes the model's internal symbolic binding: text-only training encourages positional shortcuts, while the spatial translation invariance introduced by image training disrupts those shortcuts, forcing a more robust binding strategy that persists even after text-only data is reintroduced. Analogous shifts appear in pretrained LLM-to-VLM transitions, suggesting cross-modal training can improve reasoning and generalization even on single-modality tasks.
Link: https://arxiv.org/abs/2602.15183
Authors: Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel, Rodrigo Toro Icarte
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Submitted to ICML 2026
Abstract:Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model’s internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.
[NLP-47] Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
【Quick Read】: This paper addresses unauthorized knowledge distillation in generative AI, where student models copy the reasoning capabilities of LLMs without authorization, unfairly exploiting the cost and effort of developing frontier models. The key is dynamically rewriting teacher-generated reasoning traces, while preserving answer correctness and semantic coherence, to achieve two goals: anti-distillation, degrading the training usefulness of query responses for student models, and API watermarking, embedding verifiable signatures in student models. Among several proposed methods, a simple instruction-based rewriting approach performs best, strongly deterring unauthorized distillation while maintaining or even improving teacher performance, and enabling highly reliable watermark detection with essentially no false alarms.
Link: https://arxiv.org/abs/2602.15143
Authors: Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik
Affiliations: Washington University in St. Louis
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) \emphanti-distillation, or degrading the training usefulness of query responses, and (2) \emphAPI watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher’s reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
[NLP-48] CGRA-DeBERTa: Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding
【Quick Read】: This paper addresses accurate question answering (QA) over classical Islamic texts, which is hampered by domain-specific semantics, long-range contextual dependencies, and concept-sensitive reasoning. The key is the CGRA DeBERTa framework, whose core innovations include a customized DeBERTa transformer backbone with lightweight LoRA adaptation, Concept Guided Residual Blocks that incorporate priors from a curated dictionary of 12 core Islamic concepts, and a Concept Gating Mechanism that selectively amplifies semantically critical tokens via importance-weighted attention with differential scaling from 1.04 to 3.00, preserving contextual integrity while strengthening domain-specific semantic representations and improving the accuracy and efficiency of span extraction.
Link: https://arxiv.org/abs/2602.15139
Authors: Tahir Hussain (1), Saddam Hussain Khan (2) ((1) Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan; (2) Interdisciplinary Research Center for Smart Mobility and Logistics (IRC-SML), King Fahad University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 Pages, 9 Tables, 7 Figures
Abstract:Accurate QA over classical Islamic texts remains challenging due to domain specific semantics, long context dependencies, and concept sensitive reasoning. Therefore, a new CGRA DeBERTa, a concept guided residual domain augmentation transformer framework, is proposed that enhances theological QA over Hadith corpora. The CGRA DeBERTa builds on a customized DeBERTa transformer backbone with lightweight LoRA based adaptations and a residual concept aware gating mechanism. The customized DeBERTa embedding block learns global and positional context, while Concept Guided Residual Blocks incorporate theological priors from a curated Islamic Concept Dictionary of 12 core terms. Moreover, the Concept Gating Mechanism selectively amplifies semantically critical tokens via importance weighted attention, applying differential scaling from 1.04 to 3.00. This design preserves contextual integrity, strengthens domain-specific semantic representations, and enables accurate, efficient span extraction while maintaining computational efficiency. This paper reports the results of training CGRA using a specially constructed dataset of 42,591 QA pairs from the text of Sahih alBukhari and Sahih Muslim. While BERT achieved an EM score of 75.87 and DeBERTa one of 89.77, our model scored 97.85 and thus surpassed them by 8.08 on an absolute scale, all while adding approximately 8% inference overhead due to parameter efficient gating. The qualitative evaluation noted better extraction, discrimination, and theological precision. This study presents Hadith QA systems that are efficient, interpretable, and accurate, and that scale to provide educational materials with the necessary theological nuance.
[NLP-49] Indic-TunedLens: Interpreting Multilingual Models in Indian Languages EACL
【Quick Read】: This paper addresses the lack of interpretability tooling for multilingual large language models (LLMs) outside English, especially for low-resource, morphologically rich Indian languages, where English-centric interpretation tools decode cross-lingual representations inaccurately. The key is Indic-TunedLens, a framework that learns shared affine transformations for each target language, adjusting hidden states to align with the target output distribution so that intermediate activations are decoded more faithfully. Compared with the standard Logit Lens, it significantly improves interpretability across 10 Indian languages, particularly morphologically complex, data-scarce ones.
Link: https://arxiv.org/abs/2602.15038
Authors: Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal
Affiliations: Dwarkadas Jivanlal Sanghvi College of Engineering; Indian Institute of Technology Jodhpur; King's College London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL) Thirteenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) 2026
Abstract:Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English centric representation spaces, making cross lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at this https URL. Our code is available at this https URL.
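A tuned-lens-style translator of this kind is typically trained by matching the final-layer predictive distribution through a learned affine map. The sketch below shows that KL objective with stand-in random tensors; sharing one translator across layers, the frozen-head setup, and all shapes are simplifying assumptions, not the paper's exact configuration.

```python
# Minimal tuned-lens-style trainer: a per-language affine map from an
# intermediate hidden state to the final-layer predictive distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab = 64, 1000
translator = nn.Linear(d_model, d_model)          # learned affine adjustment
unembed = nn.Linear(d_model, vocab, bias=False)   # frozen model head (stand-in)
for p in unembed.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(translator.parameters(), lr=1e-3)
for step in range(100):
    h_mid = torch.randn(32, d_model)              # intermediate activations
    h_final = torch.randn(32, d_model)            # final-layer activations
    target = unembed(h_final).log_softmax(-1)     # what the model will predict
    pred = unembed(translator(h_mid)).log_softmax(-1)
    loss = F.kl_div(pred, target, log_target=True, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final KL: {loss.item():.3f}")
```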
[NLP-50] EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research
【Quick Read】: This paper addresses the lack of fine-grained evaluation of Large Language Models (LLMs) on scholarly writing in education: existing benchmarks use single-shot, holistic generation and cannot pinpoint capability bottlenecks within complex research workflows. The key is EduResearchBench, built on a Hierarchical Atomic Task Decomposition (HATD) framework that decomposes the full research lifecycle into six specialized modules (e.g., quantitative analysis, qualitative research, policy research) spanning 24 fine-grained atomic tasks, enabling automated, diagnostic evaluation feedback. A curriculum-learning strategy further progresses from foundational skills to complex methodological reasoning and argumentation, markedly improving writing capability in this vertical domain: the resulting EduWrite model (30B), trained on 11K high-quality instruction pairs curated from 55K raw samples, outperforms larger general-purpose models (72B) on multiple core metrics, showing that data-quality density and staged training curricula matter more than parameter scale alone.
Link: https://arxiv.org/abs/2602.15034
Authors: Houping Yue, Zixiang Di, Mei Jiang, Bingdong Li, Hao Hao, Yu Song, Bo Jiang, Aimin Zhou
Affiliations: Shanghai Institute of AI for Education, East China Normal University; Shanghai Innovation Institute; School of Computer Science and Technology, East China Normal University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures
Abstract:While Large Language Models (LLMs) are reshaping the paradigm of AI for Social Science (AI4SS), rigorously evaluating their capabilities in scholarly writing remains a major challenge. Existing benchmarks largely emphasize single-shot, monolithic generation and thus lack the fine-grained assessments required to reflect complex academic research workflows. To fill this gap, we introduce EduResearchBench, the first comprehensive evaluation platform dedicated to educational academic writing. EduResearchBench is built upon our Hierarchical Atomic Task Decomposition (HATD) framework, which decomposes an end-to-end research workflow into six specialized research modules (e.g., Quantitative Analysis, Qualitative Research, and Policy Research) spanning 24 fine-grained atomic tasks. This taxonomy enables an automated evaluation pipeline that mitigates a key limitation of holistic scoring, where aggregate scores often obscure specific capability bottlenecks, and instead provides fine-grained, diagnostic feedback on concrete deficiencies. Moreover, recognizing the high cognitive load inherent in scholarly writing, we propose a curriculum learning strategy that progressively builds competence from foundational skills to complex methodological reasoning and argumentation. Leveraging 55K raw academic samples, we curate 11K high-quality instruction pairs to train EduWrite, a specialized educational scholarly writing model. Experiments show that EduWrite (30B) substantially outperforms larger general-purpose models (72B) on multiple core metrics, demonstrating that in vertical domains, data quality density and hierarchically staged training curricula are more decisive than parameter scale.
Information Retrieval
[IR-0] The Next Paradigm Is User-Centric Agent Not Platform-Centric Service
【Quick Read】: This paper addresses the misalignment between user interests and platform objectives in the prevailing platform-centric model of digital services, where services are optimized for platform-driven metrics such as engagement and conversion at the expense of users' true needs. The key is shifting digital services from platform-centric to user-centric intelligent agents that prioritize privacy, align with user-defined goals, and give users control over their preferences and actions. Advances in LLMs and on-device intelligence make this vision technically feasible, and the paper further presents a practical device-cloud collaboration pipeline along with the governance mechanisms and ecosystem structures needed for adoption.
Link: https://arxiv.org/abs/2602.15682
Authors: Luankang Zhang, Hang Lv, Qiushi Pan, Kefen Wang, Yonghao Huang, Xinrui Miao, Yin Xu, Wei Guo, Yong Liu, Hao Wang, Enhong Chen
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Modern digital services have evolved into indispensable tools, driving the present large-scale information systems. Yet, the prevailing platform-centric model, where services are optimized for platform-driven metrics such as engagement and conversion, often fails to align with users’ true needs. While platform technologies have advanced significantly-especially with the integration of large language models (LLMs)-we argue that improvements in platform service quality do not necessarily translate to genuine user benefit. Instead, platform-centric services prioritize provider objectives over user welfare, resulting in conflicts against user interests. This paper argues that the future of digital services should shift from a platform-centric to a user-centric agent. These user-centric agents prioritize privacy, align with user-defined goals, and grant users control over their preferences and actions. With advancements in LLMs and on-device intelligence, the realization of this vision is now feasible. This paper explores the opportunities and challenges in transitioning to user-centric intelligence, presents a practical device-cloud pipeline for its implementation, and discusses the necessary governance and ecosystem structures for its adoption.
[IR-1] Can Recommender Systems Teach Themselves? A Recursive Self-Improving Framework with Fidelity Control
【速读】:该论文旨在解决推荐系统中因用户交互数据极端稀疏而导致的优化景观崎岖、泛化性能差的问题,这本质上是高质量训练数据稀缺带来的瓶颈。其解决方案的关键在于提出递归自提升推荐(Recursive Self-Improving Recommendation, RSIR)框架,该框架通过一个闭环机制实现模型自我增强:当前模型生成合理的用户交互序列,基于保真度的质量控制机制筛选出与用户近似偏好流形一致的数据,并用这些增强后的数据训练下一代模型。RSIR被理论证明可作为数据驱动的隐式正则化器,平滑优化空间并引导模型走向更鲁棒的解,且在多个基准和架构上均表现出持续累积的性能提升,尤其对小模型有效,甚至弱模型可为强模型生成有效的训练课程,体现了该方法的通用性和可扩展性。
链接: https://arxiv.org/abs/2602.15659
作者: Luankang Zhang,Hao Wang,Zhongzhou Liu,Mingjia Yin,Yonghao Huang,Jiaqi Li,Wei Guo,Yong Liu,Huifeng Guo,Defu Lian,Enhong Chen
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:The scarcity of high-quality training data presents a fundamental bottleneck to scaling machine learning models. This challenge is particularly acute in recommendation systems, where extreme sparsity in user interactions leads to rugged optimization landscapes and poor generalization. We propose the Recursive Self-Improving Recommendation (RSIR) framework, a paradigm in which a model bootstraps its own performance without reliance on external data or teacher models. RSIR operates in a closed loop: the current model generates plausible user interaction sequences, a fidelity-based quality control mechanism filters them for consistency with the user's approximate preference manifold, and a successor model is trained on the enriched dataset. Our theoretical analysis shows that RSIR acts as a data-driven implicit regularizer, smoothing the optimization landscape and guiding models toward more robust solutions. Empirically, RSIR yields consistent, cumulative gains across multiple benchmarks and architectures. Notably, even smaller models benefit, and weak models can generate effective training curricula for stronger ones. These results demonstrate that recursive self-improvement is a general, model-agnostic approach to overcoming data sparsity, suggesting a scalable path forward for recommender systems and beyond. Our anonymized code is available at this https URL .
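To make the closed loop concrete, here is a minimal, hypothetical sketch of one RSIR round in Python. The function names (`generate_fn`, `fidelity_fn`, `train_fn`) and the threshold `tau` are placeholders for illustration, not the authors' released implementation.

```python
def rsir_round(model, real_data, train_fn, generate_fn, fidelity_fn,
               tau=0.8, n_synthetic=10_000):
    """One recursive self-improvement round: generate -> filter -> retrain."""
    # 1. The current model proposes plausible user interaction sequences.
    synthetic = [generate_fn(model) for _ in range(n_synthetic)]
    # 2. Fidelity-based quality control keeps only sequences consistent
    #    with the user's approximate preference manifold.
    kept = [s for s in synthetic if fidelity_fn(s, real_data) >= tau]
    # 3. The successor model is trained on the augmented dataset.
    return train_fn(real_data + kept)

# Iterating the round yields the recursive loop: model_{t+1} = rsir_round(model_t, ...)
```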
[IR-2] Eco-Amazon: Enriching E-commerce Datasets with Product Carbon Footprint for Sustainable Recommendations
【速读】:该论文旨在解决当前信息检索与推荐系统在可持续发展背景下缺乏环境影响评估的问题,特别是由于标准基准数据集中缺少物品级别的碳排放数据(Product Carbon Footprint, PCF),导致难以衡量和优化模型的环境可持续性。解决方案的关键在于构建Eco-Amazon数据集——通过零样本框架利用大语言模型(Large Language Models, LLMs)从产品属性中推断出每个商品的CO₂e排放评分,并将其作为元数据注入到三个广泛使用的Amazon数据集(家居、服装、电子)中,从而为研究者提供可扩展的环境信号资源,支持开发和评估更可持续的推荐算法。
链接: https://arxiv.org/abs/2602.15508
作者: Giuseppe Spillo,Allegra De Filippo,Cataldo Musto,Michela Milano,Giovanni Semeraro
机构: University of Bari Aldo Moro (巴里阿尔多·莫罗大学); University of Bologna (博洛尼亚大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:In the era of responsible and sustainable AI, information retrieval and recommender systems must expand their scope beyond traditional accuracy metrics to incorporate environmental sustainability. However, this research line is severely limited by the lack of item-level environmental impact data in standard benchmarks. This paper introduces Eco-Amazon, a novel resource designed to bridge this gap. Our resource consists of an enriched version of three widely used Amazon datasets (i.e., Home, Clothing, and Electronics) augmented with Product Carbon Footprint (PCF) metadata. CO2e emission scores were generated using a zero-shot framework that leverages Large Language Models (LLMs) to estimate item-level PCF based on product attributes. Our contribution is three-fold: (i) the release of the Eco-Amazon datasets, enriching item metadata with PCF signals; (ii) the LLM-based PCF estimation script, which allows researchers to enrich any product catalogue and reproduce our results; (iii) a use case demonstrating how PCF estimates can be exploited to promote more sustainable products. By providing these environmental signals, Eco-Amazon enables the community to develop, benchmark, and evaluate the next generation of sustainable retrieval and recommendation models. Our resource is available at this https URL, while our source code is available at: this http URL.
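As a rough illustration of the zero-shot estimation idea, the sketch below builds a prompt from product attributes and parses a CO2e score from the model's JSON reply. The prompt wording and the `call_llm` client are assumptions; the authors' released script should be used for reproduction.

```python
import json

# Illustrative prompt template; the paper's actual template ships with its script.
PROMPT = (
    "You are a life-cycle assessment assistant. Estimate the product carbon "
    "footprint (PCF) in kg CO2e for the product below and reply with JSON "
    'of the form {{"pcf_kg_co2e": <number>}}.\n'
    "Title: {title}\nCategory: {category}\nDescription: {description}\n"
)

def estimate_pcf(item: dict, call_llm) -> float:
    """item: {'title': ..., 'category': ..., 'description': ...};
    call_llm: any chat-completion function returning the reply text."""
    reply = call_llm(PROMPT.format(**item))
    return float(json.loads(reply)["pcf_kg_co2e"])
```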
[IR-3] Binge Watch: Reproducible Multimodal Benchmarks Datasets for Large-Scale Movie Recommendation on MovieLens-10M and 20M
【速读】:该论文旨在解决当前多模态推荐系统(Multimodal Recommender Systems, MRSs)研究中缺乏大规模、高质量且可复现的数据集问题。现有文献多依赖于小规模或未公开的数据集,且构建过程缺乏透明度,限制了模型性能的验证与比较。解决方案的关键在于发布两个大规模、可复现的电影领域多模态数据集——M3L-10M 和 M3L-20M,它们通过在流行的 MovieLens-10M 和 MovieLens-20M 基础上增强文本、图像、音频和视频特征而构建。作者采用完全文档化的处理流程,从电影剧情、海报和预告片中提取多种模态特征,并公开原始数据映射、提取特征及完整数据集的不同格式,从而显著提升研究的可复现性,推动多模态推荐领域的进展。
链接: https://arxiv.org/abs/2602.15505
作者: Giuseppe Spillo,Alessandro Petruzzelli,Cataldo Musto,Marco de Gemmis,Pasquale Lops,Giovanni Semeraro
机构: University of Bari Aldo Moro (巴里阿尔多莫罗大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:With the growing interest in Multimodal Recommender Systems (MRSs), collecting high-quality datasets provided with multimedia side information (text, images, audio, video) has become a fundamental step. However, most of the current literature in the field relies on small- or medium-scale datasets that are either not publicly released or built using undocumented processes. In this paper, we aim to fill this gap by releasing M3L-10M and M3L-20M, two large-scale, reproducible, multimodal datasets for the movie domain, obtained by enriching with multimodal features the popular MovieLens-10M and MovieLens-20M, respectively. By following a fully documented pipeline, we collect movie plots, posters, and trailers, from which textual, visual, acoustic, and video features are extracted using several state-of-the-art encoders. We publicly release mappings to download the original raw data, the extracted features, and the complete datasets in multiple formats, fostering reproducibility and advancing the field of MRSs. In addition, we conduct qualitative and quantitative analyses that showcase our datasets across several perspectives. This work represents a foundational step to ensure reproducibility and replicability in the large-scale, multimodal movie recommendation domain. Our resource can be fully accessed at the following link: this https URL, while the source code is accessible at this https URL.
[IR-4] GaiaFlow: Semantic-Guided Diffusion Tuning for Carbon-Frugal Search
【速读】:该论文旨在解决当前复杂神经架构(neural architectures)带来的高算力需求与环境可持续性之间的矛盾,即在提升检索精度的同时降低计算密集型模型的碳排放。其核心问题在于:尽管现代神经排序器(neural rankers)已实现卓越的检索准确率,但其大规模部署中的环境外部性常被忽视。解决方案的关键在于提出GaiaFlow框架,通过语义引导的扩散调优(semantic-guided diffusion tuning),融合检索引导的Langevin动力学与硬件无关的性能建模策略,在搜索精度与环境友好性之间实现最优权衡;同时引入自适应早停机制(adaptive early exit protocols)和精度感知量化推理(precision-aware quantized inference),显著减少运行碳足迹,且在异构计算基础设施上保持鲁棒的检索质量。
链接: https://arxiv.org/abs/2602.15423
作者: Rong Fu,Wenxin Zhang,Jia Yee Tan,Chunlei Meng,Shuo Yin,Xiaowen Ma,Wangyu Wu,Muge Qi,Guangzhen Yao,Zhaolu Kang,Zeli Su,Simon Fong
机构: University of Macau (澳门大学); University of Chinese Academy of Sciences (中国科学院大学); Renmin University of China (中国人民大学); Fudan University (复旦大学); Tsinghua University (清华大学); Zhejiang University (浙江大学); University of Liverpool (利物浦大学); Peking University (北京大学); Northeast Normal University (东北师范大学); Minzu University of China (中央民族大学)
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 19 pages, 7 figures
Abstract:As the burgeoning power requirements of sophisticated neural architectures escalate, the information retrieval community has recognized ecological sustainability as a pivotal priority that necessitates a fundamental paradigm shift in model design. While contemporary neural rankers have attained unprecedented accuracy, the substantial environmental externalities associated with their computational intensity often remain overlooked in large-scale deployments. We present GaiaFlow, an innovative framework engineered to facilitate carbon-frugal search by operationalizing semantic-guided diffusion tuning. Our methodology orchestrates the convergence of retrieval-guided Langevin dynamics and a hardware-independent performance modeling strategy to optimize the trade-off between search precision and environmental preservation. By incorporating adaptive early exit protocols and precision-aware quantized inference, the proposed architecture significantly mitigates operational carbon footprints while maintaining robust retrieval quality across heterogeneous computing infrastructures. Extensive experimental evaluations demonstrate that GaiaFlow achieves a superior equilibrium between effectiveness and energy efficiency, offering a scalable and sustainable pathway for next-generation neural search systems.
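The "adaptive early exit protocols" mentioned in the abstract can be pictured with the toy PyTorch loop below: scoring stops at the first encoder block whose intermediate relevance estimate is already confident, saving the remaining computation. The `blocks`, `head`, and threshold are assumptions for illustration, not GaiaFlow's actual code.

```python
import torch

def score_with_early_exit(h, blocks, head, tau=0.95):
    """h: fused query-document features (1D tensor); blocks: encoder layers;
    head: linear layer mapping features to a relevance logit."""
    for block in blocks:
        h = block(h)
        conf = torch.sigmoid(head(h))          # intermediate relevance estimate
        if conf.item() > tau or conf.item() < 1 - tau:
            break                              # confident: skip later layers, save energy
    return torch.sigmoid(head(h))
```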
[IR-5] Automatic Funny Scene Extraction from Long-form Cinematic Videos
【速读】:该论文旨在解决从长时影视作品中自动提取具有吸引力且高质量幽默片段的问题,以提升流媒体平台上的用户参与度。其核心挑战在于长视频的复杂叙事结构对场景定位带来的困难,以及幽默本身依赖多模态信息(如视觉、音频和文本)且风格细腻难以捕捉。解决方案的关键在于提出一个端到端系统,包含三个核心创新:一是结合视觉与文本线索的新型场景分割方法;二是通过引导三元组挖掘优化镜头表示;三是基于音频与文本的多模态幽默标签框架。该方案在OVSD数据集上实现18.3%的平均精度(AP)提升,并在长文本幽默检测中达到0.834的F1分数,验证了其在真实影视内容中的有效性与泛化能力。
链接: https://arxiv.org/abs/2602.15381
作者: Sibendu Paul,Haotian Jiang,Caren Chen
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automatically extracting engaging and high-quality humorous scenes from cinematic titles is pivotal for creating captivating video previews and snackable content, boosting user engagement on streaming platforms. Long-form cinematic titles, with their extended duration and complex narratives, challenge scene localization, while humor’s reliance on diverse modalities and its nuanced style add further complexity. This paper introduces an end-to-end system for automatically identifying and ranking humorous scenes from long-form cinematic titles, featuring shot detection, multimodal scene localization, and humor tagging optimized for cinematic content. Key innovations include a novel scene segmentation approach combining visual and textual cues, improved shot representations via guided triplet mining, and a multimodal humor tagging framework leveraging both audio and text. Our system achieves an 18.3% AP improvement over state-of-the-art scene detection on the OVSD dataset and an F1 score of 0.834 for detecting humor in long text. Extensive evaluations across five cinematic titles demonstrate 87% of clips extracted by our pipeline are intended to be funny, while 98% of scenes are accurately localized. With successful generalization to trailers, these results showcase the pipeline’s potential to enhance content creation workflows, improve user engagement, and streamline snackable content generation for diverse cinematic media formats.
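A generic version of triplet mining for shot representations (the paper's "guided" variant is more elaborate) might look like the following, with anchors and positives drawn from the same scene and the hardest in-batch negative pushed away:

```python
import torch
import torch.nn.functional as F

def triplet_loss_hard_negative(anchor, positive, negatives, margin=0.2):
    """anchor, positive: (B, d) shot embeddings from the same scene;
    negatives: (N, d) embeddings of shots from other scenes."""
    d_pos = F.pairwise_distance(anchor, positive)             # (B,)
    d_neg = torch.cdist(anchor, negatives).min(dim=1).values  # hardest negative per anchor
    return F.relu(d_pos - d_neg + margin).mean()              # standard triplet hinge
```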
[IR-6] Semantics-Aware Denoising: A PLM-Guided Sample Reweighting Strategy for Robust Recommendation
【速读】:该论文旨在解决隐式反馈(Implicit Feedback)中噪声干扰导致推荐模型性能下降的问题,尤其是由误点击、诱导性点击和探索性浏览等非真实偏好行为引发的噪声样本对训练过程的负面影响。解决方案的关键在于提出一种语义感知的隐式去噪框架 SAID(Semantics-Aware Implicit Denoising),其核心思想是利用用户历史行为构建文本兴趣画像,并通过预训练语言模型(Pre-trained Language Model, PLM)编码器计算用户兴趣与目标物品描述之间的语义相似度,将相似度得分转化为样本权重以调节训练损失函数,从而降低语义不一致点击的影响。该方法无需修改推荐模型结构或引入复杂辅助网络,仅通过优化损失函数即可实现有效去噪,在高噪声环境下表现出显著鲁棒性。
链接: https://arxiv.org/abs/2602.15359
作者: Xikai Yang,Yang Wang,Yilin Li,Sebastian Sun
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:Implicit feedback, such as user clicks, serves as the primary data source for modern recommender systems. However, click interactions inherently contain substantial noise, including accidental clicks, clickbait-induced interactions, and exploratory browsing behaviors that do not reflect genuine user preferences. Training recommendation models with such noisy positive samples leads to degraded prediction accuracy and unreliable recommendations. In this paper, we propose SAID (Semantics-Aware Implicit Denoising), a simple yet effective framework that leverages semantic consistency between user interests and item content to identify and downweight potentially noisy interactions. Our approach constructs textual user interest profiles from historical behaviors and computes semantic similarity with target item descriptions using pre-trained language model (PLM) based text encoders. The similarity scores are then transformed into sample weights that modulate the training loss, effectively reducing the impact of semantically inconsistent clicks. Unlike existing denoising methods that require complex auxiliary networks or multi-stage training procedures, SAID only modifies the loss function while keeping the backbone recommendation model unchanged. Extensive experiments on two real-world datasets demonstrate that SAID consistently improves recommendation performance, achieving up to 2.2% relative improvement in AUC over strong baselines, with particularly notable robustness under high noise conditions.
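The core reweighting step is simple enough to sketch. Below, cosine similarity between PLM embeddings of the user profile and the item description is squashed into a per-sample weight on the BCE loss; the sigmoid/temperature mapping is an assumption, as the abstract only states that similarity scores are transformed into weights.

```python
import torch
import torch.nn.functional as F

def said_weighted_bce(logits, labels, user_texts, item_texts, encode, temperature=0.1):
    """encode: any frozen PLM text encoder mapping list[str] -> (B, d) embeddings.
    The backbone recommender producing `logits` is left unchanged."""
    u = F.normalize(encode(user_texts), dim=-1)    # user interest profiles
    v = F.normalize(encode(item_texts), dim=-1)    # target item descriptions
    sim = (u * v).sum(dim=-1)                      # cosine similarity per sample
    w = torch.sigmoid(sim / temperature)           # similarity -> sample weight
    loss = F.binary_cross_entropy_with_logits(logits, labels.float(), reduction="none")
    return (w.detach() * loss).mean()              # downweight inconsistent clicks
```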
[IR-7] nsorFM: Low-Rank Approximations of Cross-Order Feature Interactions
【速读】:该论文旨在解决基于表格型分类数据(tabular categorical data)的预测问题,其中每个样本由多个分类属性(fields)构成,每个属性取值来自有限集合,这类问题广泛存在于点击率预测和社科研究等领域。解决方案的关键在于提出了一种名为tensorFM的新模型,其通过低秩张量分解(low-rank tensor approximation)高效建模属性间的高阶交互关系,从而捕捉传统方法难以表达的复杂特征组合效应;该方法在保持低延迟的同时,实现了与当前最优方法相当甚至更优的性能表现,特别适用于在线广告等对响应时间敏感的应用场景。
链接: https://arxiv.org/abs/2602.15229
作者: Alessio Mazzetto(1),Mohammad Mahdi Khalili(2 and 3),Laura Fee Nern(3),Michael Viderman(3),Alex Shtoff(4),Krzysztof Dembczyński(3 and 5) ((1) Brown University, (2) Ohio State University, (3) Yahoo Research, (4) Technology Innovation Institute, (5) Poznan University of Technology)
机构: Brown University (布朗大学); Ohio State University (俄亥俄州立大学); Yahoo Research (雅虎研究院); Technology Innovation Institute (技术创新研究所); Poznan University of Technology (波兹南理工大学)
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注:
Abstract:We address prediction problems on tabular categorical data, where each instance is defined by multiple categorical attributes, each taking values from a finite set. These attributes are often referred to as fields, and their categorical values as features. Such problems frequently arise in practical applications, including click-through rate prediction and social sciences. We introduce and analyze tensorFM, a new model that efficiently captures high-order interactions between attributes via a low-rank tensor approximation representing the strength of these interactions. Our model generalizes field-weighted factorization machines. Empirically, tensorFM demonstrates competitive performance with state-of-the-art methods. Additionally, its low latency makes it well-suited for time-sensitive applications, such as online advertising.
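To see the low-rank idea at second order, the sketch below implements a field-weighted FM whose field-pair interaction strengths form a low-rank matrix R = U @ U.T; tensorFM itself generalizes this to higher-order interaction tensors, so this is an illustration rather than the paper's model.

```python
import torch
import torch.nn as nn

class LowRankFwFM(nn.Module):
    """Pairwise illustration: the interaction strength between fields i and j
    is R[i, j] with R = U @ U.T of rank `rank`."""
    def __init__(self, n_features, n_fields, dim=16, rank=4):
        super().__init__()
        self.emb = nn.Embedding(n_features, dim)   # feature embeddings
        self.lin = nn.Embedding(n_features, 1)     # linear terms
        self.U = nn.Parameter(0.1 * torch.randn(n_fields, rank))

    def forward(self, x):                          # x: (B, n_fields) feature ids
        e = self.emb(x)                            # (B, n_fields, dim)
        R = self.U @ self.U.t()                    # low-rank field-pair strengths
        dots = torch.einsum("bid,bjd->bij", e, e)  # pairwise inner products
        iu = torch.triu_indices(x.size(1), x.size(1), offset=1)
        pair = (R[iu[0], iu[1]] * dots[:, iu[0], iu[1]]).sum(dim=-1)
        return self.lin(x).sum(dim=(1, 2)) + pair  # logit per instance
```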
[IR-8] ScrapeGraphAI-100k: A Large-Scale Dataset for LLM -Based Web Information Extraction
【速读】:该论文旨在解决当前用于网页信息提取的大语言模型(Large Language Models, LLMs)训练与评估所依赖的数据集普遍存在的局限性问题,即现有数据集往往规模小、为合成数据或仅包含文本内容,难以捕捉网页的结构化上下文。其解决方案的关键在于构建并公开一个大规模的真实世界LLM提取事件数据集——ScrapeGraphAI-100k,该数据集基于2025年第二至第三季度通过用户自愿上报的ScrapeGraphAI遥测收集的900万条原始事件,经去重和按Schema平衡后得到93,695个实例,覆盖多种领域和语言;每个样本包含Markdown内容、提示词(prompt)、JSON Schema、LLM响应及复杂度/验证元数据,从而支持对结构化抽取任务的细粒度分析与高效微调。实验表明,仅用该数据集的一个子集微调的小型模型(1.7B参数)即可显著缩小与大型基线模型(30B参数)之间的性能差距,验证了该数据集在提升小型模型效率方面的核心价值。
链接: https://arxiv.org/abs/2602.15189
作者: William Brach,Francesco Zuppichini,Marco Vinciguerra,Lorenzo Padoan
机构: Slovak University of Technology (斯洛伐克技术大学); ScrapeGraphAI (ScrapeGraphAI)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines. However, existing datasets tend to be small, synthetic, or text-only, failing to capture the structural context of the web. We introduce ScrapeGraphAI-100k, a large-scale dataset comprising real-world LLM extraction events, collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains and languages. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. We characterize the dataset's structural diversity and its failure modes as schema complexity increases. We also provide a fine-tuning experiment showing that a small language model (1.7B) trained on a subset narrows the gap to larger baselines (30B), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
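Based on the abstract's description, a single instance plausibly looks like the following; the exact field names are an assumption and should be checked against the HuggingFace dataset card.

```python
# Hypothetical shape of one ScrapeGraphAI-100k instance (field names inferred
# from the abstract, not verified against the released dataset).
example = {
    "markdown": "# ACME 4K Monitor\nPrice: $299 ...",          # scraped page content
    "prompt": "Extract the product name and price.",            # user instruction
    "schema": {                                                  # target JSON schema
        "type": "object",
        "properties": {"name": {"type": "string"},
                       "price": {"type": "string"}},
    },
    "response": {"name": "ACME 4K Monitor", "price": "$299"},   # LLM extraction
    "metadata": {"schema_complexity": 2, "valid_json": True},   # complexity/validation
}
```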
[IR-9] da Costa and Tarski meet Goguen and Carnap: a novel approach for ontological heterogeneity based on consequence systems
【速读】:该论文旨在解决本体异构性(ontological heterogeneity)问题,即不同本体在概念结构、逻辑基础和语义表达上的不一致性所带来的集成与互操作难题。解决方案的关键在于提出一种名为“da Costian-Tarskianism”的新方法,其核心是基于后果系统(consequence systems)的扩展形式——扩展后果系统(extended consequence system),该系统将本体公理显式纳入逻辑推导框架;同时引入扩展发展图(extended development graph),通过扩展后果系统的态射以及纤维化(fibring)和分裂(splitting)等操作,实现对多个本体之间的结构性关联建模,从而为跨本体推理和整合提供形式化基础。
链接: https://arxiv.org/abs/2602.15158
作者: Gabriel Rocha
机构: Universidade Estadual de Campinas (Unicamp)
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Logic (math.LO)
备注: 22 pages, 5 figures, 1 table
Abstract:This paper presents a novel approach for ontological heterogeneity that draws heavily from Carnapian-Goguenism, as presented by Kutz, Mossakowski and Lücke (2010). The approach is provisionally designated da Costian-Tarskianism, named after da Costa’s Principle of Tolerance in Mathematics and after Alfred Tarski’s work on the concept of a consequence operator. The approach is based on the machinery of consequence systems, as developed by Carnielli et al. (2008) and Citkin and Muravitsky (2022), and it introduces the idea of an extended consequence system, which is a consequence system extended with ontological axioms. The paper also defines the concept of an extended development graph, which is a graph structure that allows ontologies to be related via morphisms of extended consequence systems, and additionally via other operations such as fibring and splitting. Finally, we discuss the implications of this approach for the field of applied ontology and suggest directions for future research.
人机交互
[HC-0] Robot-Assisted Social Dining as a White Glove Service
【速读】:该论文旨在解决现有机器人辅助进食系统仅在实验室或家庭环境中测试,未能有效应对真实社交用餐场景(如餐厅)中动态、非监督环境所带来的挑战。其解决方案的关键在于通过与残障人士的参与式设计,结合半结构化访谈和自研的基于AI的视觉分镜工具,提炼出理想化的“野外社交用餐”场景,并提出四大核心原则:(1)支持多模态输入与无侵入式输出;(2)具备情境敏感的社会行为并优先考虑用户需求;(3)扩展角色功能超越单纯喂食;(4)适应餐桌上的其他人际关系。这一设计框架为未来面向真实社交场景的机器人辅助进食系统提供了理论与实践指导。
链接: https://arxiv.org/abs/2602.15767
作者: Atharva S Kashyap,Ugne Aleksandra Morkute,Patricia Alves-Oliveira
机构: Robotics, University of Michigan (密歇根大学机器人系); Leiden University (莱顿大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 20 pages, 9 figures. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract:Robot-assisted feeding enables people with disabilities who require assistance eating to enjoy a meal independently and with dignity. However, existing systems have only been tested in-lab or in-home, leaving in-the-wild social dining contexts (e.g., restaurants) largely unexplored. Designing a robot for such contexts presents unique challenges, such as dynamic and unsupervised dining environments that a robot needs to account for and respond to. Through speculative participatory design with people with disabilities, supported by semi-structured interviews and a custom AI-based visual storyboarding tool, we uncovered ideal scenarios for in-the-wild social dining. Our key insight suggests that such systems should: embody the principles of a white glove service where the robot (1) supports multimodal inputs and unobtrusive outputs; (2) has contextually sensitive social behavior and prioritizes the user; (3) has expanded roles beyond feeding; (4) adapts to other relationships at the dining table. Our work has implications for in-the-wild and group contexts of robot-assisted feeding.
[HC-1] Unraveling Entangled Feeds: Rethinking Social Media Design to Enhance User Well-being
【速读】:该论文旨在解决社交平台算法推荐机制对用户心理健康可能造成的潜在危害问题,尤其是在缺乏充分考量的情况下广泛采用算法内容筛选。研究通过与21名被诊断患有精神疾病用户的深度设计工作坊发现,用户会基于自身体验形成“民间理论”(folk theories)来解释其与算法推荐系统之间的互动关系,这些理论揭示了算法设计中的“纠缠”(entanglement)现象——即用户行为与其情感结果之间存在情绪层面的脱节。解决方案的关键在于通过增强情境化用户参与和恢复显式用户控制权来缓解这种纠缠效应,从而为社会计算与推荐系统研究提供新的设计方向,以支持用户的心理健康。
链接: https://arxiv.org/abs/2602.15745
作者: Ashlee Milton,Dan Runningen,Loren Terveen,Harmanpreet Kaur,Stevie Chancellor
机构: University of Minnesota - GroupLens (明尼苏达大学GroupLens研究组), Minneapolis, Minnesota, USA
类目: Human-Computer Interaction (cs.HC)
备注: Conditionally accepted to the 2026 CHI Conference on Human Factors in Computing Systems
Abstract:Social media platforms have rapidly adopted algorithmic curation with little consideration for the potential harm to users’ mental well-being. We present findings from design workshops with 21 participants diagnosed with mental illness about their interactions with social media platforms. We find that users develop cause-and-effect explanations, or folk theories, to understand their experiences with algorithmic curation. These folk theories highlight a breakdown in algorithmic design that we explain using the framework of entanglement, a phenomenon where there is a disconnect between users’ actions and platform outcomes on an emotional level. Participants’ designs to address entanglement and mitigate harms centered on contextualizing their engagement and restoring explicit user control on social media. The conceptualization of entanglement and the resulting design recommendations have implications for social computing and recommender systems research, particularly in evaluating and designing social media platforms that support users’ mental well-being.
[HC-2] Beyond Labels: Information-Efficient Human-in-the-Loop Learning using Ranking and Selection Queries
【速读】:该论文旨在解决传统机器学习系统中人类专家角色被简化为标签提供者(labeling oracle)的问题,这种模式限制了人机交互的信息量,无法充分捕捉人类判断的细微差别。解决方案的关键在于提出一种“人在回路”(human-in-the-loop)框架,通过引入丰富的查询类型(包括项目排序和示例选择)来增强人机交互的信息密度,并基于实验观察到的物品感知隐式得分与其与未知分类器距离之间的关系,构建概率性人类响应模型。在此基础上设计主动学习算法,利用这些丰富查询提升每次交互的信息获取效率,并进一步开发计算高效的变分近似方法以降低复杂度。实验证明,该方法在词情感分类任务中可使学习时间减少超过57%。
链接: https://arxiv.org/abs/2602.15738
作者: Belén Martín-Urcelay,Yoonsang Lee,Matthieu R. Bloch,Christopher J. Rozell
机构: 未知
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Integrating human expertise into machine learning systems often reduces the role of experts to labeling oracles, a paradigm that limits the amount of information exchanged and fails to capture the nuances of human judgment. We address this challenge by developing a human-in-the-loop framework to learn binary classifiers with rich query types, consisting of item ranking and exemplar selection. We first introduce probabilistic human response models for these rich queries, motivated by the relationship experimentally observed between the perceived implicit score of an item and its distance to the unknown classifier. Using these models, we then design active learning algorithms that leverage the rich queries to increase the information gained per interaction. We provide theoretical bounds on sample complexity and develop a tractable and computationally efficient variational approximation. Through experiments with simulated annotators derived from crowdsourced word-sentiment and image-aesthetic datasets, we demonstrate significant reductions in sample complexity. We further extend active learning strategies to select queries that maximize information rate, explicitly balancing informational value against annotation cost. In the word-sentiment classification task, this algorithm reduces learning time by more than 57% compared to traditional label-only active learning.
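The distance-based response model can be pictured with a Bradley-Terry-style sketch: the probability of ranking item a above item b grows with the gap between their signed distances to the unknown separating hyperplane. The logistic form and the noise parameter `beta` are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance of item x to the hyperplane w @ x + b = 0."""
    return (x @ w + b) / np.linalg.norm(w)

def p_rank_a_over_b(xa, xb, w, b, beta=2.0):
    """Probability a (noisy) annotator ranks item a above item b;
    beta controls how sharply responses follow the score gap."""
    gap = signed_distance(xa, w, b) - signed_distance(xb, w, b)
    return 1.0 / (1.0 + np.exp(-beta * gap))
```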
[HC-3] How to Disclose? Strategic AI Disclosure in Crowdfunding
【速读】:该论文试图解决的问题是:在众筹(crowdfunding)场景中,人工智能(AI)的引入日益普遍,但如何通过不同的信息披露策略来影响投资者决策尚缺乏实证研究。具体而言,研究关注强制性AI披露对众筹绩效的影响,以及实质性信号(如AI参与程度)和修辞信号(如逻辑性/明确性、可信度/真实性、情感基调)如何调节这一影响。解决方案的关键在于识别出两种类型的信号机制:一是实质性信号主要通过影响创业者能力感知来作用于结果;二是修辞信号则通过多种路径发挥作用——或单独作为中介变量,或与能力感知共同构成序列中介路径。研究发现,高真实性和高明确性的披露可缓解AI披露带来的负面影响,而过度积极的情感基调反而加剧负面效应,从而为创业者、平台和政策制定者提供了基于信号理论和修辞框架的战略性AI透明度管理依据。
链接: https://arxiv.org/abs/2602.15698
作者: Ning Wang,Chen Liang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:As artificial intelligence (AI) increasingly integrates into crowdfunding practices, strategic disclosure of AI involvement has become critical. Yet, empirical insights into how different disclosure strategies influence investor decisions remain limited. Drawing on signaling theory and Aristotle’s rhetorical framework, we examine how mandatory AI disclosure affects crowdfunding performance and how substantive signals (degree of AI involvement) and rhetorical signals (logos/explicitness, ethos/authenticity, pathos/emotional tone) moderate these effects. Leveraging Kickstarter’s mandatory AI disclosure policy as a natural experiment and four supplementary online experiments, we find that mandatory AI disclosure significantly reduces crowdfunding performance: funds raised decline by 39.8% and backer counts by 23.9% for AI-involved projects. However, this adverse effect is systematically moderated by disclosure strategy. Greater AI involvement amplifies the negative effects of AI disclosure, while high authenticity and high explicitness mitigate them. Interestingly, excessive positive emotional tone (a strategy creators might intuitively adopt to counteract AI skepticism) backfires and exacerbates negative outcomes. Supplementary randomized experiments identify two underlying mechanisms: perceived creator competence and AI washing concerns. Substantive signals primarily affect competence judgments, whereas rhetorical signals operate through varied pathways: either mediator alone or both in sequence. These findings provide theoretical and practical insights for entrepreneurs, platforms, and policymakers strategically managing AI transparency in high-stakes investment contexts.
[HC-4] Estimating Human Muscular Fatigue in Dynamic Collaborative Robotic Tasks with Learning-Based Models ICRA2026
【速读】:该论文旨在解决在物理人机交互(physical human-robot interaction, pHRI)中准确估计人体肌肉疲劳的问题,以优化性能并提升安全性。其核心解决方案是构建一个数据驱动的回归框架,利用佩戴于手臂上的表面肌电信号(surface electromyography, sEMG)来预测剩余工作周期比例(fraction of cycles to fatigue, FCF)。关键创新在于将疲劳估计建模为连续回归任务而非分类任务,从而捕捉疲劳的渐进过程,支持早期检测与自适应机器人控制;同时比较了基于时频域特征的传统机器学习方法(如随机森林、XGBoost、线性回归)与基于谱图输入的卷积神经网络(CNN),结果表明CNN表现最优(平均RMSE 20.8±4.3%),且树模型具备良好跨任务泛化能力,即使在未见运动模式(如垂直和圆周运动)下仍保持较高精度,体现出无需重新训练即可实现疲劳监测的潜力,为安全的疲劳自适应共享自主控制提供了可行路径。
链接: https://arxiv.org/abs/2602.15684
作者: Feras Kiki,Pouya P. Niaz,Alireza Madani,Cagatay Basdogan
机构: Koc University (科奇大学); Robotics and Mechatronics Laboratory (机器人与机电实验室); KUIS AI Center (KUIS人工智能中心)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注: ICRA 2026 Original Contribution, Vienna, Austria
Abstract:Assessing human muscle fatigue is critical for optimizing performance and safety in physical human-robot interaction (pHRI). This work presents a data-driven framework to estimate fatigue in dynamic, cyclic pHRI using arm-mounted surface electromyography (sEMG). Subject-specific machine-learning regression models (Random Forest, XGBoost, and Linear Regression) predict the fraction of cycles to fatigue (FCF) from three frequency-domain and one time-domain EMG features, and are benchmarked against a convolutional neural network (CNN) that ingests spectrograms of filtered EMG. Framing fatigue estimation as regression (rather than classification) captures continuous progression toward fatigue, supporting earlier detection, timely intervention, and adaptive robot control. In experiments with ten participants, a collaborative robot under admittance control guided repetitive lateral (left-right) end-effector motions until muscular fatigue. Average FCF RMSE across participants was 20.8 ± 4.3% for the CNN, 23.3 ± 3.8% for Random Forest, 24.8 ± 4.5% for XGBoost, and 26.9 ± 6.1% for Linear Regression. To probe cross-task generalization, one participant additionally performed unseen vertical (up-down) and circular repetitions; models trained only on lateral data were tested directly and largely retained accuracy, indicating robustness to changes in movement direction, arm kinematics, and muscle recruitment, while Linear Regression deteriorated. Overall, the study shows that both feature-based ML and spectrogram-based DL can estimate remaining work capacity during repetitive pHRI, with the CNN delivering the lowest error and the tree-based models close behind. The reported transfer to new motion patterns suggests potential for practical fatigue monitoring without retraining for every task, improving operator protection and enabling fatigue-aware shared autonomy for safer, fatigue-adaptive pHRI control.
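A plausible per-cycle feature extraction and regressor setup is sketched below. The paper specifies three frequency-domain features plus one time-domain feature; the particular choices here (mean/median/peak frequency and RMS) are assumptions for illustration.

```python
import numpy as np
from scipy.signal import welch
from sklearn.ensemble import RandomForestRegressor

def emg_features(x, fs=1000):
    """x: one cycle of filtered sEMG; returns 3 frequency + 1 time feature."""
    f, p = welch(x, fs=fs)
    mnf = float((f * p).sum() / p.sum())                        # mean frequency
    mdf = float(f[np.searchsorted(np.cumsum(p), p.sum() / 2)])  # median frequency
    pkf = float(f[np.argmax(p)])                                # peak frequency
    rms = float(np.sqrt(np.mean(x ** 2)))                       # time-domain RMS
    return [mnf, mdf, pkf, rms]

# X: one feature row per cycle; y: fraction of cycles to fatigue in [0, 1].
# model = RandomForestRegressor(n_estimators=200).fit(X, y)
```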
[HC-5] Meflex: A Multi-agent Scaffolding System for Entrepreneurial Ideation Iteration via Nonlinear Business Plan Writing
【速读】:该论文旨在解决传统商业计划书(Business Plan, BP)写作过程中存在的线性、僵化问题,这一问题难以匹配创业思维中动态且循环迭代的本质,尤其对初学者而言,其认知负荷较高。为应对这一挑战,作者提出Meflex系统,其核心解决方案在于将基于大语言模型(Large Language Model, LLM)的写作支架与非线性思维画布(idea canvas)相结合,通过支持反思(reflection)和元反思(meta-reflection)来促进发散与收敛思维,从而有效降低认知负荷并提升创业思维的深度与连贯性。
链接: https://arxiv.org/abs/2602.15631
作者: Lan Luo,Dongyijie Primo Pan,Junhua Zhu,Muzhi Zhou,Pan Hui
机构: 未知
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:
Abstract:Business plan (BP) writing plays a key role in entrepreneurship education by helping learners construct, evaluate, and iteratively refine their ideas. However, conventional BP writing remains a rigid, linear process that often fails to reflect the dynamic and recursive nature of entrepreneurial ideation. This mismatch is particularly challenging for novice entrepreneurial students, who struggle with the substantial cognitive demands of developing and refining ideas. While reflection and meta-reflection are critical strategies for fostering divergent and convergent thinking, existing writing tools rarely scaffold these higher-order processes. To address this gap, we present the Meflex System, a large language model (LLM)-based writing tool that integrates BP writing scaffolding with a nonlinear idea canvas to support iterative ideation through reflection and meta-reflection. We report findings from an exploratory user study with 30 participants that examined the system’s usability and cognitive impact. Results show that Meflex effectively scaffolds BP writing, promotes divergent thinking through LLM-supported reflection, and enhances meta-reflective awareness while reducing cognitive load during complex idea development. These findings highlight the potential of non-linear LLM-based writing tools to foster deeper and coherent entrepreneurial thinking.
[HC-6] “What Are You Doing?”: Effects of Intermediate Feedback from Agent ic LLM In-Car Assistants During Multi-Step Processing
【速读】:该论文旨在解决生成式 AI (Generative AI) 助手在复杂、多步骤任务中(如车载场景下的导航或信息查询)如何优化用户交互体验的问题,特别是在注意力高度敏感的场景下(如驾驶),如何通过反馈机制提升用户的感知速度、信任度与整体体验。其解决方案的关键在于实证验证了“中间反馈”(intermediate feedback)相较于“仅最终响应”的静默操作,在控制变量的混合方法学研究中显著改善了用户对系统性能的评价,并进一步提出一种自适应反馈策略:初期高透明度以建立信任,随后根据任务重要性与情境动态调整反馈冗余度,从而在透明性与效率之间实现平衡。
链接: https://arxiv.org/abs/2602.15569
作者: Johannes Kirmayr,Raphael Wennmacher,Khanh Huynh,Lukas Stappen,Elisabeth André,Florian Alt
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted (conditionally) at CHI 2026
Abstract:Agentic AI assistants that autonomously perform multi-step tasks raise open questions for user experience: how should such systems communicate progress and reasoning during extended operations, especially in attention-critical contexts such as driving? We investigate feedback timing and verbosity from agentic LLM-based in-car assistants through a controlled, mixed-methods study (N=45) comparing planned steps and intermediate results feedback against silent operation with final-only response. Using a dual-task paradigm with an in-car voice assistant, we found that intermediate feedback significantly improved perceived speed, trust, and user experience while reducing task load, effects that held across varying task complexities and interaction contexts. Interviews further revealed user preferences for an adaptive approach: high initial transparency to establish trust, followed by progressively reducing verbosity as systems prove reliable, with adjustments based on task stakes and situational context. We translate our empirical findings into design implications for feedback timing and verbosity in agentic assistants, balancing transparency and efficiency.
[HC-7] Reflecting on 1000 Social Media Journeys: Generational Patterns in Platform Transition
【速读】:该论文旨在解决当前对用户为何在不同社交平台间迁移的理解不足问题,尤其是在已有主流平台竞争背景下,新平台难以脱颖而出的挑战。其核心问题是:用户使用过哪些社交平台?他们为何会从一个平台转向另一个平台?为解决这一问题,作者提出“社交媒体旅程”(Social Media Journeys)这一概念,通过收集1000名美国用户的配额样本数据,系统性地分析用户在整个社交媒体生态中的使用轨迹,识别出推动(push)与拉动(pull)因素,并揭示不同世代基于个人需求对平台的选择差异。该研究的关键在于引入整体性视角(holistic perspective),从而为社交媒体技术的设计、治理和监管提供新的实证依据与理论框架。
链接: https://arxiv.org/abs/2602.15489
作者: Artur Solomonik,Nicolas Ruiz,Hendrik Heuer
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Social media has billions of users, but we still do not fully understand why users prefer one platform over another. Establishing new platforms among already popular competitors is difficult. Prior research has richly documented people’s experiences within individual platforms, yet situating those experiences within the entirety of a user’s social media experience remains challenging. What platforms have people used, and why have they transitioned between them? We collected data from a quota-based sample of 1,000 U.S. participants. We introduce the concept of Social Media Journeys to study the entirety of their social media experiences systematically. We identify push and pull factors across the social media landscape. We also show how different generations adopted social media platforms based on personal needs. With this work, we advance HCI by moving towards holistic perspectives when discussing social media technology, offering new insights for platform design, governance, and regulation.
[HC-8] StatCounter: A Longitudinal Study of a Portable Scholarly Metric Display
【速读】:该论文试图解决的问题是:如何通过具身化(embodied)和情境化(situated)的方式重新理解学术评价指标(academic metrics)在学者日常生活中的作用,以及这些指标如何影响其动机、注意力、反思与情绪反应。解决方案的关键在于设计并使用一种便携式、电池供电的电子墨水(e-ink)设备,持续显示Google Scholar引用统计数据,使学术指标从传统的桌面环境迁移至日常生活的流动场景中,从而引发对学术身份的新叙事,并促使研究者以更反思性的方式与评价体系互动。
链接: https://arxiv.org/abs/2602.15413
作者: Jonas Oppenlaender
机构: University of Oulu(奥卢大学)
类目: Human-Computer Interaction (cs.HC); Digital Libraries (cs.DL)
备注: Published in the proceedings of 10th ACM International Symposium on Pervasive Displays (PerDis '26)
Abstract:This study explores a handheld, battery-operated e-ink device displaying Google Scholar citation statistics. The StatCounter places academic metrics into the flow of daily life rather than a desktop context. The work draws on a first-person, longitudinal auto-ethnographic inquiry examining how constant access to scholarly metrics influences motivation, attention, reflection, and emotional responses across work and non-work settings. The ambient proximity and pervasive availability of scholarly metrics invite frequent micro-checks and short reflective pauses, but also introduce moments of second-guessing when numbers drop or stagnate. Carrying the device prompts new narratives about academic identity, including a sense of companionship during travel and periods away from the office. Over time, the presence of the device turns metrics from an occasional reference into an ambient background of scholarly life. The study contributes insight into how situated, embodied access to academic metrics reshapes their meaning, and frames opportunities for designing tools that engage with scholarly evaluation in reflective ways.
[HC-9] Supporting Multimodal Data Interaction on Refreshable Tactile Displays: An Architecture to Combine Touch and Conversational AI
【速读】:该论文旨在解决盲人或低视力(BLV)人群在获取数据可视化信息时面临的可访问性问题,特别是如何通过结合对话式人工智能(Conversational AI)与可刷新触觉显示器(Refreshable Tactile Displays, RTDs)来实现更自然、高效的多模态交互。其解决方案的关键在于提出了一种集成RTD硬件、外部触觉传感和对话式AI的多模态数据交互架构,首次实现了在RTD上融合触摸输入与对话代理的交互模式,支持基于触觉上下文的指示性查询(deictic queries),如“这两个点之间的趋势是什么?”,从而显著提升了BLV用户对复杂数据的感知能力与交互效率。
链接: https://arxiv.org/abs/2602.15280
作者: Samuel Reinders,Munazza Zaib,Matthew Butler,Bongshin Lee,Ingrid Zukerman,Lizhen Qu,Kim Marriott
机构: Monash University (莫纳什大学); Yonsei University (延世大学)
类目: Human-Computer Interaction (cs.HC)
备注: Paper to be presented at IEEE PacificVis 2026 (VisNotes)
Abstract:Combining conversational AI with refreshable tactile displays (RTDs) offers significant potential for creating accessible data visualization for people who are blind or have low vision (BLV). To support researchers and developers building accessible data visualizations with RTDs, we present a multimodal data interaction architecture along with an open-source reference implementation. Our system is the first to combine touch input with a conversational agent on an RTD, enabling deictic queries that fuse touch context with spoken language, such as “what is the trend between these points?” The architecture addresses key technical challenges, including touch sensing on RTDs, visual-to-tactile encoding, integrating touch context with conversational AI, and synchronizing multimodal output. Our contributions are twofold: (1) a technical architecture integrating RTD hardware, external touch sensing, and conversational AI to enable multimodal data interaction; and (2) an open-source reference implementation demonstrating its feasibility. This work provides a technical foundation to support future research in multimodal accessible data visualization.
[HC-10] From Diagnosis to Inoculation: Building Cognitive Resistance to AI Disempowerment
【速读】:该论文试图解决的问题是:当前人工智能助手(AI assistant)交互可能引发情境性的人类去权能化(situational human disempowerment),包括现实扭曲、价值判断扭曲和行动扭曲,而针对此类问题的教育干预措施尚未明确。解决方案的关键在于构建一个以八个跨领域学习成果(Learning Outcomes, LOs)为核心的AI素养框架,并通过“共教法”(co-teaching methodology)——即让AI作为主动发声的合作者角色参与教学——实施该框架。该方案进一步引入接种理论(inoculation theory),强调仅靠陈述性知识不足以培养AI素养,必须通过有指导地暴露于AI的失效模式(如谄媚式验证和权威投射行为),从而增强用户对AI特定扭曲机制的抵抗力,这是将接种理论首次应用于AI相关扭曲现象的研究创新。
链接: https://arxiv.org/abs/2602.15265
作者: Aleksey Komissarov
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 11 pages, 1 table. Perspective / Position Paper
Abstract:Recent empirical research by Sharma et al. (2026) demonstrated that AI assistant interactions carry meaningful potential for situational human disempowerment, including reality distortion, value judgment distortion, and action distortion. While this work provides a critical diagnosis of the problem, concrete pedagogical interventions remain underexplored. I present an AI literacy framework built around eight cross-cutting Learning Outcomes (LOs), developed independently through teaching practice and subsequently found to align with Sharma et al.'s disempowerment taxonomy. I report a case study from a publicly available online course, where a co-teaching methodology–with AI serving as an active voice co-instructor–was used to deliver this framework. Drawing on inoculation theory (McGuire, 1961)–a well-established persuasion research framework recently applied to misinformation prebunking by the Cambridge school (van der Linden, 2022; Roozenbeek & van der Linden, 2019)–I argue that AI literacy cannot be acquired through declarative knowledge alone, but requires guided exposure to AI failure modes, including the sycophantic validation and authority projection patterns identified by Sharma et al. This application of inoculation theory to AI-specific distortion is, to my knowledge, novel. I discuss the convergence between the pedagogically-derived framework and Sharma et al.'s empirically-derived taxonomy, and argue that this convergence–two independent approaches arriving at similar problem descriptions–strengthens the case for both the diagnosis and the proposed educational response.
[HC-11] MyoInteract: A Framework for Fast Prototyping of Biomechanical HCI Tasks using Reinforcement Learning
【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的生物力学仿真在人机交互(Human-Computer Interaction, HCI)研究中存在可用性差和可解释性不足的问题。其核心挑战在于,传统方法需要专家长时间调试与训练,难以支持快速迭代和设计探索。解决方案的关键在于提出MyoInteract框架,该框架通过图形用户界面(GUI)实现任务、用户模型和训练参数的快速配置,并利用优化的肌肉驱动仿真机制,在几分钟内完成训练与评估,相较传统方法将训练时间缩短高达98%。这一改进使非专业用户也能在单次会话中完成目标导向的用户动作建模与评估,显著降低了进入门槛并加速了HCI生物力学研究的迭代流程。
链接: https://arxiv.org/abs/2602.15245
作者: Ankit Bhattarai,Hannah Selder,Florian Fischer,Arthur Fleig,Per Ola Kristensson
机构: University of Cambridge (剑桥大学); Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Leipzig University (莱比锡大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL)-based biomechanical simulations have the potential to revolutionise HCI research and interaction design, but currently lack usability and interpretability. Using the Human Action Cycle as a design lens, we identify key limitations of biomechanical RL frameworks and develop MyoInteract, a novel framework for fast prototyping of biomechanical HCI tasks. MyoInteract allows designers to set up tasks, user models, and training parameters from an easy-to-use GUI within minutes. It trains and evaluates muscle-actuated simulated users within minutes, reducing training times by up to 98%. A workshop study with 12 interaction designers revealed that MyoInteract allowed novices in biomechanical RL to successfully set up, train, and assess goal-directed user movements within a single session. By transforming biomechanical RL from a days-long expert task into an accessible hour-long workflow, this work significantly lowers barriers to entry and accelerates iteration cycles in HCI biomechanics research.
[HC-12] Ground-Truth Depth in Vision Language Models: Spatial Context Understanding in Conversational AI for XR-Robotic Support in Emergency First Response
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在应急首响应(Emergency First Response, EFR)应用中对空间推理支持不足的问题,尤其是在提升情境意识(Situational Awareness, SA)方面的局限性。现有方法多依赖文本或二维图像输入,难以有效支撑EFR任务中关键的空间认知能力。其解决方案的关键在于构建一个融合机器人搭载深度感知(depth sensing)与YOLO目标检测技术的原型系统,并结合具备语义化度量距离能力的视觉语言模型(Vision Language Model, VLM),实现对检测物体与观察者之间距离的精准量化描述(如“椅子距离3.02米”)。实验表明,该深度增强型VLM显著提升了距离估计的客观准确性与稳定性,同时未增加操作员工作负荷,从而有效支持了EFR场景下的空间推理和决策判断。
链接: https://arxiv.org/abs/2602.15237
作者: Rodrigo Gutierrez Maquilon,Marita Hueber,Georg Regal,Manfred Tscheligi
机构: AIT - Austrian Institute of Technology (奥地利科学技术研究院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Large language models (LLMs) are increasingly used in emergency first response (EFR) applications to support situational awareness (SA) and decision-making, yet most operate on text or 2D imagery and offer little support for core EFR SA competencies like spatial reasoning. We address this gap by evaluating a prototype that fuses robot-mounted depth sensing and YOLO detection with a vision language model (VLM) capable of verbalizing metrically-grounded distances of detected objects (e.g., the chair is 3.02 meters away). In a mixed-reality toxic-smoke scenario, participants estimated distances to a victim and an exit window under three conditions: video-only, depth-agnostic VLM, and depth-augmented VLM. Depth augmentation improved objective accuracy and stability, e.g., the victim and window distance estimation error dropped, while raising situational awareness without increasing workload. Conversely, depth-agnostic assistance increased workload and slightly worsened accuracy. We contribute to human SA augmentation by demonstrating that metrically grounded, object-centric verbal information supports spatial reasoning in EFR and improves decision-relevant judgments under time pressure.
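Grounding a detection in metric depth, as the prototype does, reduces to reading the depth map inside the detected box and phrasing the result; the sketch below uses the median depth for robustness to sensor holes. Function and variable names are placeholders, not the prototype's code.

```python
import numpy as np

def verbalize_detection(label, bbox, depth_m):
    """label: YOLO class name; bbox: (x1, y1, x2, y2) in pixels;
    depth_m: metric depth map aligned with the RGB frame."""
    x1, y1, x2, y2 = bbox
    d = float(np.nanmedian(depth_m[y1:y2, x1:x2]))  # robust to missing depth
    return f"the {label} is {d:.2f} meters away"

# e.g. verbalize_detection("chair", (120, 80, 260, 300), depth)
# -> "the chair is 3.02 meters away"
```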
[HC-13] Multi-Agent Home Energy Management Assistant
【速读】:该论文旨在解决家庭能源管理系统(Home Energy Management System, HEMS)在面对日益复杂用户需求时,缺乏智能化、可定制化且能持续适应真实场景的决策支持能力的问题。现有基于大语言模型(Large Language Model, LLM)的HEMS方案多依赖提示工程或预构建平台,难以灵活调整代理行为,且评估方式局限于单轮交互或单一任务,无法全面反映系统在实际应用中的表现。解决方案的关键在于提出一个名为多智能体家庭能源管理助手(Multi-agent Home Energy Management Assistant, HEMA)的新架构,其基于LangChain和LangGraph构建,具备全流程系统定制能力;通过自洽分类器对用户查询进行精准识别,并调用三个专业化代理(分析、知识、控制)协同工作,结合检索增强生成(Retrieval-Augmented Generation, RAG)与推理-执行机制,实现高准确性、高适应性的响应与控制策略。实验表明,在295个测试案例中,HEMA达到91.9%的目标达成率,显著优于对比配置,验证了该架构在提升人机协作效率与可持续性方面的可行性与价值。
链接: https://arxiv.org/abs/2602.15219
作者: Wooyoung Jung
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 27 pages, 9 figures
Abstract:The growing complexity in home energy management demands advanced systems that guide occupants toward informed energy decisions. Large language model (LLM)-integrated home energy management systems (HEMS) have shown promise, but prior studies relied on prompt engineering or pre-built platforms with limited customization of agent behavior, or assessed performance through single-turn or single-task evaluations. This study introduces a multi-agent home energy management assistant (HEMA), built on LangChain and LangGraph, designed to adaptively and intelligently handle real-world use cases of HEMS with full system customization capability. It carefully classifies user queries via a self-consistency classifier and requests three specialized agents (Analysis, Knowledge, and Control) to prepare accurate, adaptive responses using purpose-built analysis and control tools and retrieval-augmented generation under the reasoning and acting mechanism. HEMA was rigorously assessed using two different experimental analyses via an LLM-as-user approach: (1) analytical and informative capabilities using combinatorial test cases of various personas and differing scenarios against three alternative system configurations relying on a vanilla LLM and (2) control capabilities using various control scenarios. Out of 295 test cases, HEMA attained a 91.9% goal achievement rate, successfully fulfilling user requests while providing high levels of factual accuracy, action correctness, interaction quality, and system efficiency, especially when compared to alternative system configurations. Collectively, this study contributes to the advancement of the human-centered design of LLM-integrated HEMS by demonstrating the feasibility and value of agentic architectures, and by clarifying the architectural requirements and evaluation criteria necessary to support adaptive, sustained human-artificial intelligence collaboration in HEMS.
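The self-consistency classifier can be pictured as majority voting over several sampled LLM classifications, routed to HEMA's three agents. The `call_llm` function, the prompt, and the fallback label are assumptions for illustration.

```python
from collections import Counter

LABELS = ("analysis", "knowledge", "control")   # HEMA's three specialized agents

def classify_query(query, call_llm, n=5):
    """Sample the classifier n times and route to the majority label."""
    prompt = f"Classify this home-energy query as one of {LABELS}: {query}\nLabel:"
    votes = [call_llm(prompt, temperature=0.7).strip().lower() for _ in range(n)]
    votes = [v for v in votes if v in LABELS] or ["knowledge"]  # safe fallback
    return Counter(votes).most_common(1)[0][0]
```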
[HC-14] How Do We Research Human-Robot Interaction in the Age of Large Language Models ? A Systematic Review
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 驱动的人机交互(HRI)研究中缺乏系统性人本影响分析的问题,特别是对人类导向的理解、用户建模以及自主性水平等关键维度的探讨不足,导致难以整合新兴挑战。其解决方案的关键在于遵循 PRISMA 指南开展系统的文献综述,从 86 篇符合标准的文献中提炼出 LLM 在 HRI 中的核心作用与研究现状,明确其在情境感知、社会性交互生成和持续人机对齐方面的变革性影响,并识别出当前研究多为探索性、方法多样且指标不统一的问题,从而提出未来设计的关键考量与指导原则。
链接: https://arxiv.org/abs/2602.15063
作者: Yufeng Wang,Yuan Xu,Anastasia Nikolova,Yuxuan Wang,Jianyu Wang,Chongyang Wang,Xin Tong
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州) ); Zhejiang University(浙江大学); Savannah College of Art and Design(萨凡纳艺术与设计学院); West China Hospital, Sichuan University(四川大学华西医院); The Hong Kong University of Science and Technology(香港科技大学)
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:Advances in large language models (LLMs) are profoundly reshaping the field of human-robot interaction (HRI). While prior work has highlighted the technical potential of LLMs, few studies have systematically examined their human-centered impact (e.g., human-oriented understanding, user modeling, and levels of autonomy), making it difficult to consolidate emerging challenges in LLM-driven HRI systems. Therefore, we conducted a systematic literature search following the PRISMA guideline, identifying 86 articles that met our inclusion criteria. Our findings reveal that: (1) LLMs are transforming the fundamentals of HRI by reshaping how robots sense context, generate socially grounded interactions, and maintain continuous alignment with human needs in embodied settings; and (2) current research is largely exploratory, with different studies focusing on different facets of LLM-driven HRI, resulting in wide-ranging choices of experimental setups, study methods, and evaluation metrics. Finally, we identify key design considerations and challenges, offering a coherent overview and guidelines for future research at the intersection of LLMs and HRI.
计算机视觉
[CV-0] Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation
【速读】:该论文旨在解决灵巧操作中通用策略学习的挑战,即如何在不依赖大量真实世界数据或任务特定环境设计的情况下,实现对多样化现实操作任务的零样本部署。其核心问题是:传统方法受限于真实世界遥操作数据收集成本高、仿真训练需为每项任务定制环境与奖励函数,难以扩展至复杂多样的日常任务场景。解决方案的关键在于提出 Dex4D 框架,该框架通过在仿真中学习一个与任务无关的 3D 点跟踪条件策略(domain-agnostic 3D point track conditioned policy),使其能够将任意物体操纵到任意期望姿态(Anypose-to-Anypose)。该策略在数千种不同物体和姿态配置下进行训练,覆盖广泛的机器人-物体交互空间,并可在测试时灵活组合使用;部署时仅需从生成视频中提取目标物体的点轨迹作为提示(prompt),即可实现无需微调的零样本迁移,结合在线点跟踪实现闭环感知与控制,从而显著提升泛化能力与可扩展性。
链接: https://arxiv.org/abs/2602.15828
作者: Yuxuan Kuang,Sungjae Park,Katerina Fragkiadaki,Shubham Tulsiani
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a feasible alternative, designing multiple task-specific environments and rewards for training is similarly challenging. We propose Dex4D, a framework that instead leverages simulation for learning task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic 3D point track conditioned policy capable of manipulating any object to any desired pose. We train this ‘Anypose-to-Anypose’ policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, this policy can be zero-shot transferred to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment for diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.
[CV-1] VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation
【速读】:该论文旨在解决现有生成式 AI 模型在处理草图生成时忽视其固有时间序列特性的问题,即大多数模型将草图视为静态图像,而忽略了创作过程中笔画的顺序性和动态演化过程。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)与视频扩散模型(Video Diffusion Models)的互补优势:LLMs 负责语义规划和笔画排序,视频扩散模型则作为高质量、时序一致的渲染器;作者提出将草图表示为短视频形式,通过文本指定的笔画顺序逐步绘制在空白画布上,并采用两阶段微调策略——先在合成形状数据上学习笔画顺序,再从极少量人工绘制的草图中蒸馏视觉细节,从而实现高效且可控的顺序草图生成。
链接: https://arxiv.org/abs/2602.15819
作者: Hui Ren,Yuval Alaluf,Omer Bar Tal,Alexander Schwing,Antonio Torralba,Yael Vinker
机构: UIUC (USA); Runway (USA); MIT (USA)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.
[CV-2] ask-Agnostic Continual Learning for Chest Radiograph Classification
【速读】:该论文旨在解决临床部署中胸部X光图像分类模型在面对新数据集持续输入时的性能退化问题,即如何在不重新训练历史数据的前提下实现模型更新,同时保持已验证的诊断性能稳定。其解决方案的关键在于提出了一种基于适配器的持续学习策略(CARL-XRay),该策略采用固定高容量主干网络,通过增量分配轻量级任务特定适配器(adapter)和分类头来适应新任务;同时引入潜空间任务选择器,利用紧凑原型和特征级经验回放保留当前与历史上下文信息,从而实现稳定的任务识别与适应,且无需存储原始图像。该方法在多个大规模公开胸部X光数据集上表现出优异的性能保留能力和可靠的、任务感知的推理能力。
链接: https://arxiv.org/abs/2602.15811
作者: Muthu Subash Kavitha,Anas Zafar,Amgad Muneer,Jia Wu
机构: The University of Texas MD Anderson Cancer Center (德克萨斯大学MD安德森癌症中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures
Abstract:Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study, for the first time, a task-incremental continual learning setting for chest radiograph classification, in which heterogeneous chest X-ray datasets arrive sequentially and task identifiers are unavailable at inference. We propose a continual adapter-based routing learning strategy for Chest X-rays (CARL-XRay) that maintains a fixed high-capacity backbone and incrementally allocates lightweight task-specific adapters and classifier heads. A latent task selector operates on task-adapted features and leverages both current and historical context preserved through compact prototypes and feature-level experience replay. This design supports stable task identification and adaptation across sequential updates while avoiding raw-image storage. Experiments on large-scale public chest radiograph datasets demonstrate robust performance retention and reliable task-aware inference under continual dataset ingestion. CARL-XRay outperforms joint training under task-unknown deployment, achieving higher routing accuracy (75.0% vs. 62.5%), while maintaining competitive diagnostic performance with AUROC of 0.74 in the oracle setting with ground-truth task identity and 0.75 under task-unknown inference, using significantly fewer trainable parameters. Finally, the proposed framework provides a practical alternative to joint training and repeated full retraining in continual clinical deployment.
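The latent task selector can be sketched as nearest-prototype routing in feature space; in CARL-XRay the prototypes are additionally kept calibrated with feature-level replay, which this toy version omits.

```python
import torch

def route_task(feat, prototypes):
    """feat: (d,) adapted feature of the input radiograph; prototypes: dict
    mapping task_id -> (d,) stored prototype. Returns the task whose
    adapter and classifier head should process this input."""
    ids = list(prototypes)
    dists = torch.stack([torch.norm(feat - prototypes[t]) for t in ids])
    return ids[int(dists.argmin())]
```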
[CV-3] Context-aware Skin Cancer Epithelial Cell Classification with Scalable Graph Transformers
【速读】:该论文旨在解决全切片病理图像(Whole-slide images, WSIs)在癌症诊断中因细胞形态相似而导致的分类困难问题,特别是针对皮肤鳞状细胞癌(cutaneous squamous cell carcinoma, cSCC)中健康与肿瘤上皮细胞难以区分的挑战。传统基于图像的深度学习方法依赖于局部补丁表示,易丢失组织层面的上下文信息,导致性能受限。其解决方案的关键在于构建一个完整的WSI细胞图(cell graph),并利用可扩展的图Transformer(Graph Transformer)模型(如SGFormer和DIFFormer)进行分类,通过整合细胞的形态学、纹理特征及邻近非上皮细胞类别信息来增强表征能力,从而有效捕捉组织级上下文,显著提升分类精度。
链接: https://arxiv.org/abs/2602.15783
作者: Lucas Sancéré,Noémie Moreau,Katarzyna Bozek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 2 figures
Abstract:Whole-slide images (WSIs) from cancer patients contain rich information that can be used for medical diagnosis or to follow treatment progress. To automate their analysis, numerous deep learning methods based on convolutional neural networks and Vision Transformers have been developed and have achieved strong performance in segmentation and classification tasks. However, due to the large size and complex cellular organization of WSIs, these models rely on patch-based representations, losing vital tissue-level context. We propose using scalable Graph Transformers on a full-WSI cell graph for classification. We evaluate this methodology on a challenging task: the classification of healthy versus tumor epithelial cells in cutaneous squamous cell carcinoma (cSCC), where both cell types exhibit very similar morphologies and are therefore difficult to differentiate for image-based approaches. We first compared image-based and graph-based methods on a single WSI. Graph Transformer models SGFormer and DIFFormer achieved balanced accuracies of 85.2 ± 1.5 (± standard error) and 85.1 ± 2.5 in 3-fold cross-validation, respectively, whereas the best image-based method reached 81.2 ± 3.0. By evaluating several node feature configurations, we found that the most informative representation combined morphological and texture features as well as the cell classes of non-epithelial cells, highlighting the importance of the surrounding cellular context. We then extended our work to train on several WSIs from several patients. To address the computational constraints of image-based models, we extracted four 2560 × 2560 pixel patches from each image and converted them into graphs. In this setting, DIFFormer achieved a balanced accuracy of 83.6 ± 1.9 (3-fold cross-validation), while the state-of-the-art image-based model CellViT256 reached 78.1 ± 0.5.
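Building the full-WSI cell graph that the Graph Transformers consume might, under simple assumptions, use k-nearest-neighbour connectivity over cell centroids; the paper's actual construction and node-feature set (morphology, texture, neighbouring cell classes) may differ in detail.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_cell_graph(centroids, node_feats, k=8):
    """centroids: (n_cells, 2) cell positions in the WSI; node_feats:
    (n_cells, d) per-cell features. Returns nodes plus a kNN edge list."""
    tree = cKDTree(centroids)
    _, nn = tree.query(centroids, k=k + 1)      # nearest neighbours (self first)
    src = np.repeat(np.arange(len(centroids)), k)
    dst = nn[:, 1:].reshape(-1)                 # drop the self-match
    edge_index = np.stack([src, dst])           # (2, n_cells * k) edge list
    return node_feats, edge_index
```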
[CV-4] Meteorological data and Sky Images meets Neural Models for Photovoltaic Power Forecasting
【Quick Read】: This paper tackles the forecasting difficulty caused by the high weather-driven uncertainty of photovoltaic (PV) power production, aiming to improve accuracy and robustness under complex meteorological conditions such as cloud cover and to extend short-term nowcasting toward longer-horizon forecasting, in support of more efficient grid operation and better management of solar variability. The key of the solution is a multimodal hybrid approach that fuses sky images, historical PV power data, and multiple meteorological variables (such as surface long-wave radiation downwards, wind, and solar position), modeled with deep neural networks; it markedly improves predictions under adverse conditions such as cloudy days while enhancing interpretability and practical utility.
Link: https://arxiv.org/abs/2602.15782
Authors: Ines Montoya-Espinagosa, Antonio Agudo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CAI 2026
Abstract:Due to the rise in the use of renewable energies as an alternative to traditional ones, and especially solar energy, there is increasing interest in studying how to address photovoltaic forecasting in the face of the challenge of variability in photovoltaic energy production, using different methodologies. This work develops a hybrid approach for short- and long-term forecasting based on two studies with the same purpose. A multimodal approach that combines images of the sky and photovoltaic energy history with meteorological data is proposed. The main goal is to improve the accuracy of ramp event prediction, increase the robustness of forecasts in cloudy conditions, and extend capabilities beyond nowcasting, to support more efficient operation of the power grid and better management of solar variability. Deep neural models are used for both nowcasting and forecasting solutions, incorporating individual and multiple meteorological variables, as well as an analytical solar position. The results demonstrate that the inclusion of meteorological data, particularly surface long-wave radiation downwards and the combination of wind and solar position, significantly improves predictions in both nowcasting and forecasting tasks, especially on cloudy days. This study highlights the importance of integrating diverse data sources to improve the reliability and interpretability of solar energy prediction models.
[CV-5] NeRFscopy: Neural Radiance Fields for in-vivo Time-Varying Tissues from Endoscopy
【Quick Read】: This paper addresses robust dynamic 3D reconstruction of deformable tissues from monocular endoscopic video, which is important in medical imaging but is hindered by tissue deformation, the monocular viewpoint, illumination changes, occlusions, and unknown camera trajectories. The key of the proposed self-supervised neural rendering framework, NeRFscopy, is a deformable model consisting of a canonical radiance field and a time-dependent deformation field parameterized by SE(3) transformations, together with efficient color-image modeling terms that learn a 3D implicit representation purely from data, without any template or pre-trained model, enabling high-quality novel view synthesis and 3D reconstruction.
Link: https://arxiv.org/abs/2602.15775
Authors: Laura Salort-Benejam, Antonio Agudo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ISBI 2026
Abstract:Endoscopy is essential in medical imaging, used for diagnosis, prognosis and treatment. Developing a robust dynamic 3D reconstruction pipeline for endoscopic videos could enhance visualization, improve diagnostic accuracy, aid in treatment planning, and guide surgery procedures. However, challenges arise due to the deformable nature of the tissues, the use of monocular cameras, illumination changes, occlusions and unknown camera trajectories. Inspired by neural rendering, we introduce NeRFscopy, a self-supervised pipeline for novel view synthesis and 3D reconstruction of deformable endoscopic tissues from a monocular video. NeRFscopy includes a deformable model with a canonical radiance field and a time-dependent deformation field parameterized by SE(3) transformations. In addition, the color images are efficiently exploited by introducing sophisticated terms to learn a 3D implicit model without assuming any template or pre-trained model, solely from data. NeRFscopy achieves accurate results in terms of novel view synthesis, outperforming competing methods across various challenging endoscopy scenes.
[CV-6] Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models ICLR2026
【Quick Read】: This paper addresses the trade-off between generation and understanding in multimodal models, where improving one ability often degrades the other. The analysis identifies the primary cause as a potential conflict between the two tasks, which creates a competitive dynamic inside the model. The key of the solution is the Reason-Reflect-Refine (R3) framework, which recasts single-step generation as an iterative "generate-understand-regenerate" process; by explicitly exploiting the model's understanding capability during generation, it mitigates the optimization dilemma and improves generation quality and understanding jointly.
Link: https://arxiv.org/abs/2602.15772
Authors: Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu
Affiliations: Peking University; Tencent
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR2026
Abstract:Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyzed this trade-off and identify the primary cause might be the potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This innovative algorithm re-frames the single-step generation task into a multi-step process of “generate-understand-regenerate”. By explicitly leveraging the model’s understanding capability during generation, we successfully mitigate the optimization dilemma, achieved stronger generation results and improved understanding ability which are related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at this https URL.
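A minimal sketch of the generate-understand-regenerate loop that R3 describes, with a stub standing in for a unified multimodal model; all interfaces (`generate_image`, `critique`) and the feedback format are hypothetical placeholders, not the paper's API.

```python
class StubModel:
    """Toy stand-in for a unified multimodal model's interfaces."""
    def generate_image(self, prompt, guidance=None):
        return {"prompt": prompt, "guidance": guidance}   # placeholder "image"

    def critique(self, image, prompt):
        # pretend the first draft always needs one fix, then passes
        fixed = image.get("guidance") is not None
        return {"consistent": fixed, "issues": "subject count mismatch"}

def reason_reflect_refine(model, prompt, max_rounds=3):
    image = model.generate_image(prompt)                  # initial generation
    for _ in range(max_rounds):
        feedback = model.critique(image=image, prompt=prompt)  # "understand"
        if feedback["consistent"]:
            break                                         # nothing left to fix
        # "regenerate": condition the next attempt on the model's own critique
        image = model.generate_image(prompt, guidance=feedback["issues"])
    return image

print(reason_reflect_refine(StubModel(), "two red apples on a table"))
```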
[CV-7] RaCo: Ranking and Covariance for Practical Learned Keypoints
【Quick Read】: This paper addresses the robustness and versatility of keypoint detection for 3D computer vision tasks, in particular how to estimate keypoint ranking and metric-scale spatial uncertainty independently, without annotated corresponding image pairs. The key is RaCo, a lightweight neural network integrating three components: a repeatable keypoint detector, a differentiable ranker that maximizes matches under a limited keypoint budget, and a covariance estimator that quantifies spatial uncertainty in metric scale. Trained on perspective image crops only, RaCo needs no covisible image pairs, achieves strong rotational robustness without computationally expensive equivariant architectures, and reaches state-of-the-art keypoint repeatability and two-view matching on several challenging datasets.
Link: https://arxiv.org/abs/2602.15755
Authors: Abhiram Shenoi, Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys
Affiliations: ETH Zurich; Google; Microsoft Mixed Reality & AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:This paper introduces RaCo, a lightweight neural network designed to learn robust and versatile keypoints suitable for a variety of 3D computer vision tasks. The model integrates three key components: the repeatable keypoint detector, a differentiable ranker to maximize matches with a limited number of keypoints, and a covariance estimator to quantify spatial uncertainty in metric scale. Trained on perspective image crops only, RaCo operates without the need for covisible image pairs. It achieves strong rotational robustness through extensive data augmentation, even without the use of computationally expensive equivariant network architectures. The method is evaluated on several challenging datasets, where it demonstrates state-of-the-art performance in keypoint repeatability and two-view matching, particularly under large in-plane rotations. Ultimately, RaCo provides an effective and simple strategy to independently estimate keypoint ranking and metric covariance without additional labels, detecting interpretable and repeatable interest points. The code is available at this https URL.
[CV-8] Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding
【Quick Read】: This paper addresses a limitation of existing 3D open-vocabulary scene understanding methods, which mostly distill language features from 2D foundation models into 3D feature fields while overlooking the synergy among scene appearance, semantics, and geometry, so that understanding drifts away from the true geometric structure and becomes decoupled from reconstruction. The key is a language- and geometry-grounded sparse voxel framework that models appearance, semantics, and geometry jointly: a scene is represented holistically by an appearance field, a density field, a feature field, and a confidence field; a feature modulation module promotes synergy among the fields while language features are distilled from a 2D foundation model; and geometric distillation, via depth correlation regularization and pattern consistency regularization, transfers knowledge from a geometry foundation model into the 3D scene representation, keeping multimodal perception consistent with geometric structure.
Link: https://arxiv.org/abs/2602.15734
Authors: Guile Wu, David Huang, Bingbing Liu, Dongfeng Bai
Affiliations: Huawei Noah's Ark Lab; University of Toronto
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report
Abstract:Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill language features from a 2D foundation model into our 3D scene model. In addition, we integrate geometric distillation into feature field distillation to transfer geometric knowledge from a geometry foundation model to our 3D scene representations via depth correlation regularization and pattern consistency regularization. These components work together to synergistically model the appearance, semantics, and geometry of the 3D scene within a unified framework. Extensive experiments demonstrate that our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
[CV-9] Spanning the Visual Analogy Space with a Weight Basis of LoRAs
【Quick Read】: This paper addresses the limited generalization of existing visual analogy learning methods on complex image transformations. Current approaches adapt to different analogy tasks with a single Low-Rank Adaptation (LoRA) module, but this fixed structure cannot cover the diverse space of visual transformations, restricting flexibility and transfer. The key of the proposed LoRWeB framework is twofold: (1) a learnable basis of LoRA modules that spans the space of visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs at inference time, specializing the model per analogy task and substantially improving generalization to unseen transformations.
Link: https://arxiv.org/abs/2602.15727
Authors: Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik
Affiliations: Technion; NVIDIA; Bar-Ilan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Code and data are in this https URL
Abstract:Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet {a, a', b}, the goal is to generate b' such that a : a' :: b : b'. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives, informally, choosing a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are in this https URL
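To make the "weighted basis of LoRAs" idea concrete, here is a minimal sketch of a linear layer augmented with K low-rank adapters mixed by weights that, in the paper's setup, would come from the lightweight pair encoder. All shapes, names, and the softmax mixing are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoRABasisLinear(nn.Module):
    """Frozen linear layer plus a weighted sum of K low-rank adapters."""
    def __init__(self, base: nn.Linear, k: int = 8, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)          # backbone stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(k, rank, d_in) * 0.01)  # down-proj
        self.B = nn.Parameter(torch.zeros(k, d_out, rank))        # up-proj

    def forward(self, x, w):                # w: (K,) mixing weights per task
        # combine the K low-rank deltas B_k A_k with the mixing weights
        delta = torch.einsum("k,kor,kri->oi", w, self.B, self.A)
        return self.base(x) + x @ delta.T

layer = LoRABasisLinear(nn.Linear(64, 64), k=8, rank=4)
w = torch.softmax(torch.randn(8), dim=0)    # would come from the pair encoder
y = layer(torch.randn(2, 64), w)
print(y.shape)
```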
[CV-10] Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
【Quick Read】: This paper addresses the inefficient and unstable decision-making of large language model (LLM)-based Vision-and-Language Navigation (VLN): prompt-driven LLM navigators must re-interpret the instruction from scratch at every step and reason over noisy, verbose navigable candidates. The key is a retrieval-augmented framework that improves efficiency and robustness without modifying or fine-tuning the LLM, operating at two complementary levels: at the episode level, an instruction-embedding retriever selects semantically similar successful trajectories as in-context exemplars, providing task-specific priors for instruction grounding; at the step level, an imitation-learned candidate retriever prunes irrelevant directions before LLM inference, reducing action ambiguity and prompt complexity. Both modules are lightweight, modular, and trained independently of the LLM; experiments show consistent gains in Success Rate, Oracle Success Rate, and SPL on the R2R benchmark, and ablations confirm the two retrieval strategies contribute complementary benefits to global guidance and step-wise decision efficiency.
Link: https://arxiv.org/abs/2602.15724
Authors: Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits to global guidance and step-wise decision efficiency. These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
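A minimal sketch of the episode-level retrieval step: successful past trajectories are ranked by cosine similarity between instruction embeddings, and the top-k are injected as in-context exemplars. The embedding dimensionality and memory format are assumptions for illustration.

```python
import numpy as np

def retrieve_exemplars(query_emb, memory_embs, memory_traj, top_k=3):
    """Return the top_k stored trajectories most similar to the query instruction."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = m @ q                            # cosine similarity to each episode
    best = np.argsort(-sims)[:top_k]
    return [memory_traj[i] for i in best]

# toy memory of 100 successful episodes with 384-d instruction embeddings
memory_embs = np.random.randn(100, 384)
memory_traj = [f"trajectory_{i}" for i in range(100)]
print(retrieve_exemplars(np.random.randn(384), memory_embs, memory_traj))
```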
[CV-11] ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT
【Quick Read】: This paper addresses the high computational cost that hinders practical deployment of Vision Transformers (ViTs). Existing remedies such as structured weight pruning and token compression are promising but suffer, respectively, from prolonged retraining and from global propagation that complicates optimization. The key of the proposed ToaSt framework is a decoupled strategy with component-specific compression: coupled head-wise structured pruning for the Multi-Head Self-Attention modules to improve robustness, and Token Channel Selection (TCS) for the Feed-Forward Networks (over 60% of total FLOPs), which raises compression ratios without global propagation issues; analysis further shows that TCS effectively filters redundant noise. The method achieves better accuracy-efficiency trade-offs across models and transfers well to downstream tasks.
Link: https://arxiv.org/abs/2602.15720
Authors: Hyunchan Moon, Cheonjun Park, Steven L. Waslander
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures
Abstract:Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.
[CV-12] Criteria-first semantics-later: reproducible structure discovery in image-based sciences
【Quick Read】: This paper addresses the failure of the semantics-first analysis paradigm in image-based sciences under open-ended scientific discovery, cross-sensor and cross-site comparability, and long-term monitoring, where domain ontologies and label sets drift culturally, institutionally, and ecologically. The key is a deductive inversion, "criteria-first and semantics-later": a unified criteria-first structure-discovery framework separates semantics-free structure extraction defined by explicit optimality criteria (stable partitions, structural fields, or hierarchies) from downstream semantic mapping into domain ontologies or vocabularies, enabling reproducible analysis across domains. Grounded in information theory's separation of information from meaning, the resulting structural products can serve as FAIR, AI-ready digital objects that remain stable and scalable for long-term monitoring and digital twins.
Link: https://arxiv.org/abs/2602.15712
Authors: Jan Bumberger
Affiliations: Helmholtz Centre for Environmental Research – UFZ; German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Across the natural and life sciences, images have become a primary measurement modality, yet the dominant analytic paradigm remains semantics-first. Structure is recovered by predicting or enforcing domain-specific labels. This paradigm fails systematically under the conditions that make image-based science most valuable, including open-ended scientific discovery, cross-sensor and cross-site comparability, and long-term monitoring in which domain ontologies and associated label sets drift culturally, institutionally, and ecologically. A deductive inversion is proposed in the form of criteria-first and semantics-later. A unified framework for criteria-first structure discovery is introduced. It separates criterion-defined, semantics-free structure extraction from downstream semantic mapping into domain ontologies or vocabularies and provides a domain-general scaffold for reproducible analysis across image-based sciences. Reproducible science requires that the first analytic layer perform criterion-driven, semantics-free structure discovery, yielding stable partitions, structural fields, or hierarchies defined by explicit optimality criteria rather than local domain ontologies. Semantics is not discarded; it is relocated downstream as an explicit mapping from the discovered structural product to a domain ontology or vocabulary, enabling plural interpretations and explicit crosswalks without rewriting upstream extraction. Grounded in cybernetics, observation-as-distinction, and information theory’s separation of information from meaning, the argument is supported by cross-domain evidence showing that criteria-first components recur whenever labels do not scale. Finally, consequences are outlined for validation beyond class accuracy and for treating structural products as FAIR, AI-ready digital objects for long-term monitoring and digital twins.
[CV-13] Bayesian Optimization for Design Parameters of 3D Image Data Analysis
【Quick Read】: This paper addresses the bottleneck of model selection and parameter tuning for segmentation and classification of 3D data in large-scale biomedical imaging. The key is the 3D Data Analysis Optimization Pipeline, built on two Bayesian Optimization stages: the first selects a segmentation model and optimizes postprocessing parameters on a domain-adapted syntactic benchmark dataset, using a dedicated segmentation quality metric as the objective; the second optimizes classifier design choices (encoder and classifier-head architectures, incorporation of prior knowledge, and pretraining strategies) and reduces annotation effort through an assisted class-annotation workflow that extracts predicted instances from the segmentation results and presents them to the operator one by one, eliminating manual tracking. Together, the two stages make finding effective model and parameter configurations substantially more efficient.
Link: https://arxiv.org/abs/2602.15660
Authors: David Exler, Joaquin Eduardo Urrutia Gómez, Martin Krüger, Maike Schliephake, John Jbeily, Mario Vitacolonna, Rüdiger Rudolf, Markus Reischl
Affiliations: Karlsruhe Institute of Technology; Technische Hochschule Mannheim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures
Abstract:Deep learning-based segmentation and classification are crucial to large-scale biomedical imaging, particularly for 3D data, where manual analysis is impractical. Although many methods exist, selecting suitable models and tuning parameters remains a major bottleneck in practice. Hence, we introduce the 3D data Analysis Optimization Pipeline, a method designed to facilitate the design and parameterization of segmentation and classification using two Bayesian Optimization stages. First, the pipeline selects a segmentation model and optimizes postprocessing parameters using a domain-adapted syntactic benchmark dataset. To ensure a concise evaluation of segmentation performance, we introduce a segmentation quality metric that serves as the objective function. Second, the pipeline optimizes design choices of a classifier, such as encoder and classifier head architectures, incorporation of prior knowledge, and pretraining strategies. To reduce manual annotation effort, this stage includes an assisted class-annotation workflow that extracts predicted instances from the segmentation results and sequentially presents them to the operator, eliminating the need for manual tracking. In four case studies, the 3D data Analysis Optimization Pipeline efficiently identifies effective model and parameter configurations for individual datasets.
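A minimal sketch of what one Bayesian-optimization stage could look like, tuning two hypothetical postprocessing parameters against a segmentation-quality objective; scikit-optimize is one possible backend, and `segment_and_score` is a toy stand-in for the real pipeline, not the paper's implementation.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

def segment_and_score(threshold: float, min_size: int) -> float:
    """Toy stand-in: a smooth surrogate for running segmentation +
    postprocessing and computing the quality metric (higher is better)."""
    return 1.0 - (threshold - 0.55) ** 2 - ((min_size - 120) / 500.0) ** 2

def objective(params):
    threshold, min_size = params
    return -segment_and_score(threshold, min_size)   # gp_minimize minimizes

space = [Real(0.1, 0.9, name="threshold"), Integer(10, 500, name="min_size")]
result = gp_minimize(objective, space, n_calls=40, random_state=0)
print("best params:", result.x, "best score:", -result.fun)
```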
[CV-14] A Novel Public Dataset for Strawberry (Fragaria x ananassa) Ripeness Detection and Comparative Evaluation of YOLO-Based Models
【Quick Read】: This paper addresses the subjectivity and error of judging strawberry (Fragaria x ananassa) ripeness at harvest, where traditional visual assessment is unreliable, hurting both producer income and consumer experience. For objective and efficient ripeness detection, the study contributes a new publicly available strawberry ripeness image dataset (566 images, 1,201 labeled objects) collected under varying light and environmental conditions in two greenhouses in Turkey. The key lies in this high-quality, diverse dataset and a systematic comparison of YOLO detectors (YOLOv8, YOLOv9, YOLOv11): YOLOv9c attains the highest precision (90.94%), YOLOv11s the highest recall (83.74%), and YOLOv8s the best mAP@50 (86.09%), indicating that small-to-medium models are the most balanced and efficient on this task and providing a reliable baseline for smart-agriculture ripeness recognition.
Link: https://arxiv.org/abs/2602.15656
Authors: Mustafa Yurdakul, Zeynep Sena Bastug, Ali Emre Gok, Sakir Taşdemir
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The strawberry (Fragaria x ananassa), known worldwide for its economic value and nutritional richness, is a widely cultivated fruit. Determining the correct ripeness level during the harvest period is crucial for both preventing losses for producers and ensuring consumers receive a quality product. However, traditional methods, i.e., visual assessments alone, can be subjective and have a high margin of error. Therefore, computer-assisted systems are needed. However, the scarcity of comprehensive datasets accessible to everyone in the literature makes it difficult to compare studies in this field. In this study, a new and publicly available strawberry ripeness dataset, consisting of 566 images and 1,201 labeled objects, prepared under variable light and environmental conditions in two different greenhouses in Turkey, is presented to the literature. Comparative tests conducted on the data set using YOLOv8, YOLOv9, and YOLO11-based models showed that the highest precision value was 90.94% in the YOLOv9c model, while the highest recall value was 83.74% in the YOLO11s model. In terms of the general performance criterion mAP@50, YOLOv8s was the best performing model with a success rate of 86.09%. The results show that small and medium-sized models work more balanced and efficiently on this type of dataset, while also establishing a fundamental reference point for smart agriculture applications.
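For readers who want to reproduce the comparison, a minimal fine-tuning sketch with the Ultralytics API is shown below; the dataset YAML path and hyperparameters are placeholders, not the paper's training recipe.

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                  # the best mAP@50 model in the paper
model.train(data="strawberry_ripeness.yaml", epochs=100, imgsz=640)
metrics = model.val()                       # reports precision/recall/mAP@50
print(metrics.box.map50)
```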
[CV-15] UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling
【Quick Read】: This paper addresses the cross-modal consistency problem between two independent models, text-to-speech (TTS) and audio-to-face (A2F), i.e., how to make the audio and facial expressions generated from text more consistent with each other. The key is to merge TTS and A2F into a unified architecture that reuses the intermediate representations of the TTS module as a shared semantic basis, enabling internal audio-visual feature transfer, and to extend the TTS emotion control mechanism to the joint model, validating the feasibility of reusing TTS intermediate representations for joint modeling of speech and facial expressions.
Link: https://arxiv.org/abs/2602.15651
Authors: Qiangong Zhou, Nagasaka Tomohiro
Affiliations: Sumeru AI Technology Co., Ltd.
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 12 figures
Abstract:This work considers merging two independent models, TTS and A2F, into a unified model to enable internal feature transfer, thereby improving the consistency between audio and facial expressions generated from text. We also discuss the extension of the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate representations from TTS for joint modeling of speech and facial expressions, and provides engineering practice references for subsequent speech-expression co-design. The project code has been open-sourced at: this https URL
[CV-16] Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation
【Quick Read】: This paper addresses two core challenges of generative AI for radiology report generation (RRG): lack of interpretability, and hallucinations misaligned with the imaging evidence. Existing work treats interpretability and accuracy as separate goals: concept-based explainability targets transparency, while Retrieval-Augmented Generation (RAG) targets factual grounding. The key of the proposed unified framework, Concept-Enhanced Multimodal RAG (CEMRAG), is to decompose visual representations into interpretable clinical concepts and integrate them with multimodal RAG, building context-rich prompts that improve interpretability together with factual accuracy and clinical reliability. Experiments on MIMIC-CXR and IU X-Ray show consistent gains over conventional RAG and concept-only baselines across multiple VLM architectures and configurations, challenging the assumed interpretability-performance trade-off and indicating that transparent visual concepts can enhance rather than compromise diagnostic accuracy.
Link: https://arxiv.org/abs/2602.15650
Authors: Marco Salmè, Federico Siciliano, Fabrizio Silvestri, Paolo Soda, Rosa Sicilia, Valerio Guarrasi
Affiliations: Università Campus Bio-Medico di Roma; University of Milano-Bicocca; Sapienza University of Rome; Umeå University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Radiology Report Generation (RRG) through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical workflows. However, their clinical adoption remains limited by the lack of interpretability and the tendency to hallucinate findings misaligned with imaging evidence. Existing research typically treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency, while Retrieval-Augmented Generation (RAG) methods targeting factual grounding through external retrieval. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. This approach exploits enriched contextual prompts for RRG, improving both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations demonstrate consistent improvements over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs. Our modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled pathway toward clinically trustworthy AI-assisted radiology.
[CV-17] Guided Diffusion by Optimized Loss Functions on Relaxed Parameters for Inverse Material Design
【Quick Read】: This paper addresses inverse design problems common in engineering and materials science, i.e., inferring design parameters from desired output properties (such as a target bulk modulus). The forward direction relies on numerical simulation (e.g., FEM), but discrete parameters and constraints make the design space non-differentiable, so diverse feasible solutions are hard to obtain efficiently. The key of the proposed diffusion-based method is to relax the discrete design space into a continuous grid representation in which gradients can be computed by implicit differentiation through the forward simulation; a diffusion model trained on this relaxed space serves as a prior over plausible designs, sampling is guided at inference time by gradients propagated from the objective through the differentiable simulation, and samples are finally backprojected into the original discrete design space. On 2D and 3D composite material design, the method quickly proposes diverse designs within a 1% relative error margin and supports multi-objective losses, e.g., simultaneously minimizing material density.
Link: https://arxiv.org/abs/2602.15648
Authors: Jens U. Kreber, Christian Weißenfels, Joerg Stueckler
Affiliations: University of Augsburg
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Inverse design problems are common in engineering and materials science. The forward direction, i.e., computing output quantities from design parameters, typically requires running a numerical simulation, such as a FEM, as an intermediate step, which is an optimization problem by itself. In many scenarios, several design parameters can lead to the same or similar output values. For such cases, multi-modal probabilistic approaches are advantageous to obtain diverse solutions. A major difficulty in inverse design stems from the structure of the design space, since discrete parameters or further constraints disallow the direct use of gradient-based optimization. To tackle this problem, we propose a novel inverse design method based on diffusion models. Our approach relaxes the original design space into a continuous grid representation, where gradients can be computed by implicit differentiation in the forward simulation. A diffusion model is trained on this relaxed parameter space in order to serve as a prior for plausible relaxed designs. Parameters are sampled by guided diffusion using gradients that are propagated from an objective function specified at inference time through the differentiable simulation. A design sample is obtained by backprojection into the original parameter space. We develop our approach for a composite material design problem where the forward process is modeled as a linear FEM problem. We evaluate the performance of our approach in finding designs that match a specified bulk modulus. We demonstrate that our method can propose diverse designs within 1% relative error margin from medium to high target bulk moduli in 2D and 3D settings. We also demonstrate that the material density of generated samples can be minimized simultaneously by using a multi-objective loss function.
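A minimal sketch of the inference-time guidance loop: each denoising step is followed by a gradient step on the design objective, computed through a differentiable forward model. Both the denoiser and the bulk-modulus simulator are toy stand-ins here (the paper uses a trained diffusion prior and an implicit-differentiation FEM).

```python
import torch

class ToyDenoiser:
    """Toy stand-in for the learned diffusion prior over relaxed designs."""
    def denoise_step(self, x, t):
        return 0.95 * x                      # placeholder dynamics

def simulate_bulk_modulus(x):
    # toy differentiable stand-in for the implicit-diff FEM forward model
    return x.pow(2).mean(dim=(1, 2))

def guided_sampling(model, x_T, k_target=0.3, steps=50, scale=0.5):
    x = x_T
    for t in reversed(range(steps)):
        # prior step: move toward plausible designs
        x = model.denoise_step(x, t).detach().requires_grad_(True)
        # guidance step: pull the sample toward the target bulk modulus
        loss = ((simulate_bulk_modulus(x) - k_target) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, x)
        x = (x - scale * grad).detach()
    return x

x = guided_sampling(ToyDenoiser(), torch.randn(4, 16, 16))
print(simulate_bulk_modulus(x))              # should approach the target
```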
[CV-18] CARE Drive: A Framework for Evaluating Reason-Responsiveness of Vision Language Models in Automated Driving
【Quick Read】: This paper addresses the lack of evaluation of whether vision language models (VLMs) used in automated driving are causally responsive to human-relevant reasons: existing evaluations measure only outcome-level performance (safety, trajectory accuracy) and cannot tell whether decisions genuinely rest on human-understandable reasons rather than post hoc rationalization, a gap that can create false confidence in safety-critical settings. The key is CARE Drive (Context Aware Reasons Evaluation for Driving), a model-agnostic framework with a two-stage protocol: prompt calibration first ensures output stability, and systematic contextual perturbation then measures how sensitive decisions are to human reasons such as safety margins, social pressure, and efficiency constraints, testing whether these reasons causally influence behavior. Empirically, explicit human reasons significantly improve alignment with expert-recommended behavior, but responsiveness varies across contextual factors, revealing uneven sensitivity to different types of reasons.
Link: https://arxiv.org/abs/2602.15645
Authors: Lucas Elbert Suryana, Farah Bierenga, Sanne van Buuren, Pepijn Kooij, Elsefien Tulleners, Federico Scari, Simeon Calvert, Bart van Arem, Arkady Zgonnikov
Affiliations: Delft University of Technology
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, on submission to Transportation Research Part C
Abstract:Foundation models, including vision language models, are increasingly used in automated driving to interpret scenes, recommend actions, and generate natural language explanations. However, existing evaluation methods primarily assess outcome based performance, such as safety and trajectory accuracy, without determining whether model decisions reflect human relevant considerations. As a result, it remains unclear whether explanations produced by such models correspond to genuine reason responsive decision making or merely post hoc rationalizations. This limitation is especially significant in safety critical domains because it can create false confidence. To address this gap, we propose CARE Drive, Context Aware Reasons Evaluation for Driving, a model agnostic framework for evaluating reason responsiveness in vision language models applied to automated driving. CARE Drive compares baseline and reason augmented model decisions under controlled contextual variation to assess whether human reasons causally influence decision behavior. The framework employs a two stage evaluation process. Prompt calibration ensures stable outputs. Systematic contextual perturbation then measures decision sensitivity to human reasons such as safety margins, social pressure, and efficiency constraints. We demonstrate CARE Drive in a cyclist overtaking scenario involving competing normative considerations. Results show that explicit human reasons significantly influence model decisions, improving alignment with expert recommended behavior. However, responsiveness varies across contextual factors, indicating uneven sensitivity to different types of reasons. These findings provide empirical evidence that reason responsiveness in foundation models can be systematically evaluated without modifying model parameters.
[CV-19] An Industrial Dataset for Scene Acquisitions and Functional Schematics Alignment
【Quick Read】: This paper addresses the alignment of functional schematics with 2D and 3D scene acquisitions (images and LiDAR point clouds) in old industrial facilities, a key step for building digital twins. Manual alignment does not scale, discrepancies between schematics and reality are common, and public industrial datasets are scarce, leaving the problem challenging and underexplored. The key is the IRIS-v2 dataset together with a combination of segmentation and graph matching, validated on a practical case study and shown to substantially reduce the time required for alignment.
Link: https://arxiv.org/abs/2602.15584
Authors: Flavien Armangeon, Thibaud Ehret, Enric Meinhardt-Llopis, Rafael Grompone von Gioi, Guillaume Thibault, Marc Petit, Gabriele Facciolo
Affiliations: 1. Institut de Mathématiques de Marseille; 2. Orange Labs; 3. University of Lille; 4. Université d'Aix-Marseille
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to EUSIPCO 2026
Abstract:Aligning functional schematics with 2D and 3D scene acquisitions is crucial for building digital twins, especially for old industrial facilities that lack native digital models. Current manual alignment using images and LiDAR data does not scale due to tediousness and complexity of industrial sites. Inconsistencies between schematics and reality, and the scarcity of public industrial datasets, make the problem both challenging and underexplored. This paper introduces IRIS-v2, a comprehensive dataset to support further research. It includes images, point clouds, 2D annotated boxes and segmentation masks, a CAD model, 3D pipe routing information, and the PID (Piping and Instrumentation Diagram). The alignment is evaluated on a practical case study, aiming at reducing the time required for this task by combining segmentation and graph matching.
[CV-20] Intracoronary Optical Coherence Tomography Image Processing and Vessel Classification Using Machine Learning
【Quick Read】: This paper addresses the difficulty of vessel segmentation and classification in intracoronary optical coherence tomography (OCT) images caused by noise, imaging artifacts, and complex tissue structures. The key is a fully automated pipeline combining image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, and local feature extraction, followed by Logistic Regression and Support Vector Machine classifiers for pixel-wise vessel classification, achieving accurate vessel boundary detection with low computational complexity and minimal manual annotation.
Link: https://arxiv.org/abs/2602.15579
Authors: Amal Lahchim, Lambros Athanasiou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures. Research paper from Electrical and Computer Engineering Department, University of Patras
Abstract:Intracoronary Optical Coherence Tomography (OCT) enables high-resolution visualization of coronary vessel anatomy but presents challenges due to noise, imaging artifacts, and complex tissue structures. This paper proposes a fully automated pipeline for vessel segmentation and classification in OCT images using machine learning techniques. The proposed method integrates image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, and local feature extraction. These features are used to train Logistic Regression and Support Vector Machine classifiers for pixel-wise vessel classification. Experimental results demonstrate excellent performance, achieving precision, recall, and F1-score values up to 1.00 and overall classification accuracy of 99.68%. The proposed approach provides accurate vessel boundary detection while maintaining low computational complexity and requiring minimal manual annotation. This method offers a reliable and efficient solution for automated OCT image analysis and has potential applications in clinical decision support and real-time medical image processing.
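A minimal sketch of the classical-ML core of the pipeline: unsupervised K-means proposes tissue clusters, and a supervised classifier then labels pixels from local features augmented with the cluster assignment. Feature extraction is reduced to random toy data; only the scikit-learn flow mirrors the described approach.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 8))        # per-pixel local features (toy)
labels = (features[:, 0] + 0.1 * rng.normal(size=5000) > 0).astype(int)

# unsupervised stage: cluster pixels into candidate tissue groups
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
X = np.column_stack([features, clusters])    # cluster id as an extra feature

# supervised stage: pixel-wise vessel classification
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```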
[CV-21] Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs
【Quick Read】: This paper addresses the tendency of large vision-language models (LVLMs) to hallucinate during multimodal reasoning, producing outputs inconsistent with visual inputs or user instructions. Existing training-free remedies such as contrastive decoding and auxiliary expert models add large computational overhead and potential interference, while static internal-signal enhancement is vulnerable to the attention-sink phenomenon. The key insight is that the model's internal Positive Attention Dynamics (PAD) naturally reveal semantically core visual regions despite attention-sink distortions; the proposed training-free intervention, PADE, builds a PAD map to locate those regions, uses per-head Median Absolute Deviation Scaling to adaptively control intervention strength, and applies System-Token Compensation to preserve attention to complex instructions and long-term output consistency, improving visual grounding and reducing hallucinations.
Link: https://arxiv.org/abs/2602.15556
Authors: Guangtao Lyu, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Xueting Li, Fen Fang, Cheng Deng
Affiliations: Xidian University; Hohai University; A*STAR
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free methods have drawbacks: contrastive decoding and auxiliary expert models incur several times more computational overhead and may introduce interference, while static internal signal enhancement is often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
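A minimal sketch of per-head Median Absolute Deviation scaling as named above: each head's positive-attention map is converted to a robust z-score that sets how strongly its core tokens are amplified. Shapes and the final renormalized update are illustrative assumptions, not PADE's exact rule.

```python
import torch

def mad_scale(pad_map: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """pad_map: (heads, tokens) positive attention dynamics per head."""
    med = pad_map.median(dim=-1, keepdim=True).values
    mad = (pad_map - med).abs().median(dim=-1, keepdim=True).values
    return (pad_map - med) / (mad + eps)          # robust z-score per head

pad = torch.rand(12, 196)                         # 12 heads, 196 visual tokens
strength = mad_scale(pad).clamp(min=0)            # only boost core regions
attn = torch.softmax(torch.randn(12, 196), dim=-1)
attn = attn * (1 + 0.5 * strength)                # amplify core visual tokens
attn = attn / attn.sum(dim=-1, keepdim=True)      # renormalize per head
print(attn.shape)
```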
[CV-22] Dynamic Training-Free Fusion of Subject and Style LoRAs
【Quick Read】: This paper addresses the limitations of existing LoRA fusion methods for generating user-specified subjects and styles: most fuse weights with static statistical heuristics, deviating from LoRA's purpose of learning adaptive feature adjustments and ignoring the randomness of sampled inputs. The key is a training-free dynamic fusion framework: during the forward pass, at each LoRA-applied layer, the KL divergence between the base model's original features and the features produced by the subject and style LoRAs is computed to adaptively select the most appropriate fusion weights; during reverse denoising, the generation trajectory is further refined with gradient-based corrections derived from objective metrics such as CLIP and DINO scores. Combining feature-level selection with metric-guided latent adjustment across the diffusion timeline yields coherent subject-style synthesis without any retraining.
Link: https://arxiv.org/abs/2602.15539
Authors: Qinglong Cao, Yuntian Chen, Chao Ma, Xiaokang Yang
Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments:
Abstract:Recent studies have explored the combination of multiple LoRAs to simultaneously generate user-specified subjects and styles. However, most existing approaches fuse LoRA weights using static statistical heuristics that deviate from LoRA’s original purpose of learning adaptive feature adjustments and ignore the randomness of sampled inputs. To address this, we propose a dynamic training-free fusion framework that operates throughout the generation process. During the forward pass, at each LoRA-applied layer, we dynamically compute the KL divergence between the base model’s original features and those produced by subject and style LoRAs, respectively, and adaptively select the most appropriate weights for fusion. In the reverse denoising stage, we further refine the generation trajectory by dynamically applying gradient-based corrections derived from objective metrics such as CLIP and DINO scores, providing continuous semantic and stylistic guidance. By integrating these two complementary mechanisms-feature-level selection and metric-guided latent adjustment-across the entire diffusion timeline, our method dynamically achieves coherent subject-style synthesis without any retraining. Extensive experiments across diverse subject-style combinations demonstrate that our approach consistently outperforms state-of-the-art LoRA fusion methods both qualitatively and quantitatively.
[CV-23] Advanced Acceptance Score: A Holistic Measure for Biometric Quantification
【Quick Read】: This paper addresses how to evaluate the quality of biometric fitness scores derived from hand gestures, since the error rates used in existing biometric capacity estimation do not indicate the goodness of the scores themselves. The key is a holistic measure, the advanced acceptance score: it takes the ranking order and relevance of output scores as the primary basis, combining rank deviation with rewards for high scores on high-ranked gestures and low scores on low-ranked ones; it compensates for correspondence between the trends of output and ground-truth scores; and it discounts by the degree of disentanglement between identity features to remove identity interference. With adequate weighting of these elements, the measure provides a comprehensive and reliable evaluation of gesture biometric quality, validated with five state-of-the-art models on three datasets and shown to correlate with existing measures.
Link: https://arxiv.org/abs/2602.15535
Authors: Aman Verma, Seshan Srirangarajan, Sumantra Dutta Roy
Affiliations: IIT Delhi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Quantifying biometric characteristics within hand gestures involves deriving fitness scores from a gesture- and identity-aware feature space. However, evaluating the quality of these scores remains an open question. Existing biometric capacity estimation literature relies upon error rates, but these rates do not indicate the goodness of scores. Thus, in this manuscript we present an exhaustive set of evaluation measures. We first identify the ranking order and relevance of output scores as the primary basis for evaluation. In particular, we consider both rank deviation as well as rewards for: (i) higher scores of high-ranked gestures and (ii) lower scores of low-ranked gestures. We also compensate for correspondence between trends of output and ground-truth scores. Finally, we account for disentanglement between identity features of gestures as a discounting factor. Integrating these elements with adequate weighting, we formulate the advanced acceptance score as a holistic evaluation measure. To assess the effectiveness of the proposed measure, we perform in-depth experimentation over three datasets with five state-of-the-art (SOTA) models. Results show that the optimal score selected with our measure is more appropriate than those selected with existing measures. Our proposed measure also correlates with existing measures, which further validates its reliability. We have made our code public at this https URL.
[CV-24] Semantic-Guided 3D Gaussian Splatting for Transient Object Removal
【Quick Read】: This paper addresses ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction caused by transient objects in casual multi-view captures. Existing solutions either decompose the scene at significant memory cost or rely on motion heuristics that are vulnerable to parallax ambiguity. The key is a semantic filtering framework for category-aware transient removal using vision-language models: CLIP similarity between rendered views and distractor text prompts is accumulated per Gaussian across training iterations, and Gaussians exceeding a calibrated threshold undergo opacity regularization and periodic pruning. Because object categories are identified independently of motion patterns, the semantic classification resolves parallax ambiguity, improving reconstruction quality over vanilla 3DGS on four RobustNeRF sequences while maintaining minimal memory overhead and real-time rendering.
Link: https://arxiv.org/abs/2602.15516
Authors: Aditi Prabakaran, Priyesh Shukla
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Transient objects in casual multi-view captures cause ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction. Existing solutions relied on scene decomposition at significant memory cost or on motion-based heuristics that were vulnerable to parallax ambiguity. A semantic filtering framework was proposed for category-aware transient removal using vision-language models. CLIP similarity scores between rendered views and distractor text prompts were accumulated per-Gaussian across training iterations. Gaussians exceeding a calibrated threshold underwent opacity regularization and periodic pruning. Unlike motion-based approaches, semantic classification resolved parallax ambiguity by identifying object categories independently of motion patterns. Experiments on the RobustNeRF benchmark demonstrated consistent improvement in reconstruction quality over vanilla 3DGS across four sequences, while maintaining minimal memory overhead and real-time rendering performance. Threshold calibration and comparisons with baselines validated semantic guidance as a practical strategy for transient removal in scenarios with predictable distractor categories.
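A minimal sketch of the semantic signal the framework accumulates: CLIP similarity between a rendered view and distractor text prompts, computed here with Hugging Face's CLIP. Per-Gaussian accumulation, thresholding, and pruning are omitted; the model checkpoint and prompt list are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a pedestrian", "a moving car", "a bicycle"]  # distractor classes
image = Image.new("RGB", (224, 224))                     # stand-in rendered view

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image.softmax(dim=-1)  # (1, num_prompts)
print(dict(zip(prompts, sims[0].tolist())))
```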
[CV-25] LEADER: Lightweight End-to-End Attention-Gated Dual Autoencoder for Robust Minutiae Extraction
【Quick Read】: This paper addresses end-to-end minutiae extraction for fingerprint recognition, i.e., producing minutiae descriptors (location, direction, and type) directly from raw fingerprint images without separate preprocessing and postprocessing stages. The key is LEADER, a lightweight end-to-end attention-gated dual autoencoder that combines a novel "Castle-Moat-Rampart" ground-truth encoding with a dual-autoencoder structure interconnected through attention gating, and integrates non-maximum suppression and angular decoding so that complete inference needs only 0.9M parameters; on NIST SD27 it improves F1-score by 34% over specialized latent minutiae extractors while remaining highly accurate and computationally efficient.
Link: https://arxiv.org/abs/2602.15493
Authors: Raffaele Cappelli, Matteo Ferrara
Affiliations: University of Bologna
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Minutiae extraction, a fundamental stage in fingerprint recognition, is increasingly shifting toward deep learning. However, truly end-to-end methods that eliminate separate preprocessing and postprocessing steps remain scarce. This paper introduces LEADER (Lightweight End-to-end Attention-gated Dual autoencodER), a neural network that maps raw fingerprint images to minutiae descriptors, including location, direction, and type. The proposed architecture integrates non-maximum suppression and angular decoding to enable complete end-to-end inference using only 0.9M parameters. It employs a novel "Castle-Moat-Rampart" ground-truth encoding and a dual-autoencoder structure, interconnected through an attention-gating mechanism. Experimental evaluations demonstrate state-of-the-art accuracy on plain fingerprints and robust cross-domain generalization to latent impressions. Specifically, LEADER attains a 34% higher F1-score on the NIST SD27 dataset compared to specialized latent minutiae extractors. Sample-level analysis on this challenging benchmark reveals an average rank of 2.07 among all compared methods, with LEADER securing the first-place position in 47% of the samples, more than doubling the frequency of the second-best extractor. The internal representations learned by the model align with established fingerprint domain features, such as segmentation masks, orientation fields, frequency maps, and skeletons. Inference requires 15ms on GPU and 322ms on CPU, outperforming leading commercial software in computational efficiency. The source code and pre-trained weights are publicly released to facilitate reproducibility.
[CV-26] RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution
【Quick Read】: This paper addresses the inefficiency of general-purpose super-resolution models, especially Vision Transformers, in infrared imaging scenarios with fixed or near-static viewpoints (such as surveillance and autonomous driving): these scenes carry strong, persistent spatial priors that existing models fail to exploit, leading to redundant learning and suboptimal performance. The key of the proposed Regional Prior attention Transformer (RPT-SR) is a dual-token framework that fuses learnable regional prior tokens, acting as a persistent memory of the scene's global structure, with local tokens capturing the frame-specific content of the current input; injecting both into the attention lets the priors dynamically modulate local reconstruction, improving adaptability and efficiency for infrared image super-resolution.
Link: https://arxiv.org/abs/2602.15490
Authors: Youngwan Jin, Incheol Park, Yagiz Nalcakan, Hyeongjin Ju, Sanghyeop Yeo, Shiho Kim
Affiliations: Yonsei University; BK21 Graduate Program in Intelligent Semiconductor Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:General-purpose super-resolution models, particularly Vision Transformers, have achieved remarkable success but exhibit fundamental inefficiencies in common infrared imaging scenarios like surveillance and autonomous driving, which operate from fixed or nearly-static viewpoints. These models fail to exploit the strong, persistent spatial priors inherent in such scenes, leading to redundant learning and suboptimal performance. To address this, we propose the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), a novel architecture that explicitly encodes scene layout information into the attention mechanism. Our core contribution is a dual-token framework that fuses (1) learnable, regional prior tokens, which act as a persistent memory for the scene's global structure, with (2) local tokens that capture the frame-specific content of the current input. By feeding these tokens into the attention, our model allows the priors to dynamically modulate the local reconstruction process. Extensive experiments validate our approach. While most prior works focus on a single infrared band, we demonstrate the broad applicability and versatility of RPT-SR by establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) spectra.
[CV-27] Emergent Morphing Attack Detection in Open Multi-modal Large Language Models
【Quick Read】: This paper addresses the detection of face morphing attacks in biometric verification, where existing morphing attack detection (MAD) systems require task-specific training and generalize poorly to unseen attack types. The key is the first systematic evaluation of open-source multimodal large language models (MLLMs) for zero-shot single-image morphing attack detection, showing that without any fine-tuning or domain adaptation several MLLMs exhibit non-trivial discriminative ability, and that LLaVA1.6-Mistral-7B surpasses highly optimized task-specific baselines by at least 23% in equal error rate (EER). This suggests multimodal pretraining implicitly encodes the fine-grained facial inconsistencies indicative of morphing artifacts, enabling zero-shot forensic sensitivity and offering a reproducible, interpretable, and competitive new foundation for biometric security and forensic image analysis.
Link: https://arxiv.org/abs/2602.15461
Authors: Marija Ivanovska, Vitomir Štruc
Affiliations: University of Ljubljana
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This manuscript is currently under review at Pattern Recognition Letters
Abstract:Face morphing attacks threaten biometric verification, yet most morphing attack detection (MAD) systems require task-specific training and generalize poorly to unseen attack types. Meanwhile, open-source multimodal large language models (MLLMs) have demonstrated strong visual-linguistic reasoning, but their potential in biometric forensics remains underexplored. In this paper, we present the first systematic zero-shot evaluation of open-source MLLMs for single-image MAD, using publicly available weights and a standardized, reproducible protocol. Across diverse morphing techniques, many MLLMs show non-trivial discriminative ability without any fine-tuning or domain adaptation, and LLaVA1.6-Mistral-7B achieves state-of-the-art performance, surpassing highly competitive task-specific MAD baselines by at least 23% in terms of equal error rate (EER). The results indicate that multimodal pretraining can implicitly encode fine-grained facial inconsistencies indicative of morphing artifacts, enabling zero-shot forensic sensitivity. Our findings position open-source MLLMs as reproducible, interpretable, and competitive foundations for biometric security and forensic image analysis. This emergent capability also highlights new opportunities to develop state-of-the-art MAD systems through targeted fine-tuning or lightweight adaptation, further improving accuracy and efficiency while preserving interpretability. To support future research, all code and evaluation protocols will be released upon publication.
[CV-28] On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks
【Quick Read】: This paper addresses the vaguely defined and poorly understood generalization of reasoning in current models, in particular how well chain-of-thought (CoT) approaches applied to vision-language models transfer across distributions. The key is a rigorous evaluation framework built around a grid-based navigation task that systematically tests different input representations (textual and visual) and CoT reasoning strategies under in-distribution (ID) and out-of-distribution (OOD) conditions. The experiments show that while CoT improves ID generalization across all representations, OOD generalization (e.g., to larger maps) remains very limited in most cases; surprisingly, reasoning traces that combine multiple text formats yield the best, non-trivial OOD generalization, and purely text-based models consistently outperform image-input models, including a recent latent-space reasoning approach.
Link: https://arxiv.org/abs/2602.15460
Authors: Yannic Neuhaus, Nicolas Flammarion, Matthias Hein, Francesco Croce
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-distribution generalization (e.g., to larger maps) remains very limited in most cases when controlling for trivial matches with the ID data. Surprisingly, we find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Finally, purely text-based models consistently outperform those utilizing image-based inputs, including a recently proposed approach relying on latent space reasoning.
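A minimal sketch of the kind of grid-navigation instance the benchmark uses: a map with obstacles plus a BFS solver that yields a ground-truth shortest move sequence. The map encoding and move vocabulary are illustrative choices, not the paper's exact format.

```python
from collections import deque

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def solve(grid, start, goal):
    """Return a shortest move string from start to goal, avoiding obstacles (#)."""
    h, w = len(grid), len(grid[0])
    queue, seen = deque([(start, "")]), {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for m, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] != "#" \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + m))
    return None  # goal unreachable

grid = ["....", ".##.", "....", "...."]
print(solve(grid, (0, 0), (3, 3)))   # e.g. "DDDRRR" (one shortest path)
```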
[CV-29] Efficient Generative Modeling beyond Memoryless Diffusion via Adjoint Schrödinger Bridge Matching
【Quick Read】: This paper addresses the highly curved trajectories and noisy score targets of diffusion models in high-dimensional generation, caused by an uninformative, memoryless forward process. The key is Adjoint Schrödinger Bridge Matching (ASBM), a two-stage framework that recovers optimal generative paths: the first stage views the Schrödinger Bridge (SB) forward dynamic as a coupling construction problem, learned from a data-to-energy sampling perspective that transports data to an energy-defined prior; the second stage learns the backward generative dynamic with a simple matching loss supervised by the induced optimal coupling. Operating in a non-memoryless regime yields significantly straighter and more efficient sampling paths, preserving fidelity with far fewer sampling steps and improving stability and efficiency for high-dimensional image generation.
Link: https://arxiv.org/abs/2602.15396
Authors: Jeongwoo Shin, Jinhwan Sul, Joonseok Lee, Jaewong Choi, Jaemoo Choi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion models often yield highly curved trajectories and noisy score targets due to an uninformative, memoryless forward process that induces independent data-noise coupling. We propose Adjoint Schrödinger Bridge Matching (ASBM), a generative modeling framework that recovers optimal trajectories in high dimensions via two stages. First, we view the Schrödinger Bridge (SB) forward dynamic as a coupling construction problem and learn it through a data-to-energy sampling perspective that transports data to an energy-defined prior. Then, we learn the backward generative dynamic with a simple matching loss supervised by the induced optimal coupling. By operating in a non-memoryless regime, ASBM produces significantly straighter and more efficient sampling paths. Compared to prior works, ASBM scales to high-dimensional data with notably improved stability and efficiency. Extensive experiments on image generation show that ASBM improves fidelity with fewer sampling steps. We further showcase the effectiveness of our optimal trajectory via distillation to a one-step generator.
[CV-30] Doubly Stochastic Mean-Shift Clustering
【Quick Read】: This paper addresses the sensitivity of standard Mean-Shift to the bandwidth hyperparameter, especially in data-scarce regimes where fixed-scale density estimation causes fragmentation and spurious modes. The key of the proposed Doubly Stochastic Mean-Shift (DSMS) is to draw both the data samples and the kernel bandwidth from a continuous uniform distribution at each iteration, which explores the density landscape more effectively; the randomized bandwidth policy acts as an implicit regularization mechanism, markedly improving clustering stability and preventing over-segmentation without degrading other performance metrics.
Link: https://arxiv.org/abs/2602.15393
Authors: Tom Trigano, Yann Sepulcre, Itshak Lapidot
Affiliations: Shamoon College of Engineering; Sapir Academic College; Afeka Tel-Aviv Academic College of Engineering; Avignon University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages. arXiv admin note: text overlap with arXiv:2511.09202
Abstract:Standard Mean-Shift algorithms are notoriously sensitive to the bandwidth hyperparameter, particularly in data-scarce regimes where fixed-scale density estimation leads to fragmentation and spurious modes. In this paper, we propose Doubly Stochastic Mean-Shift (DSMS), a novel extension that introduces randomness not only in the trajectory updates but also in the kernel bandwidth itself. By drawing both the data samples and the radius from a continuous uniform distribution at each iteration, DSMS effectively performs a better exploration of the density landscape. We show that this randomized bandwidth policy acts as an implicit regularization mechanism, and provide convergence theoretical results. Comparative experiments on synthetic Gaussian mixtures reveal that DSMS significantly outperforms standard and stochastic Mean-Shift baselines, exhibiting remarkable stability and preventing over-segmentation in sparse clustering scenarios without other performance degradation.
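A minimal sketch of the doubly stochastic update: at every iteration both the mini-batch of data points and the Gaussian kernel bandwidth are redrawn at random. The uniform bandwidth range and batch size are illustrative hyperparameters, not the paper's settings.

```python
import numpy as np

def dsms_step(points, data, h_range=(0.2, 1.0), batch=256, rng=None):
    rng = rng or np.random.default_rng()
    h = rng.uniform(*h_range)                           # random bandwidth
    sub = data[rng.choice(len(data), size=min(batch, len(data)), replace=False)]
    diff = sub[None, :, :] - points[:, None, :]         # (P, B, D)
    w = np.exp(-0.5 * (diff ** 2).sum(-1) / h ** 2)     # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True) + 1e-12
    # shift every point toward the kernel-weighted mean of the mini-batch
    return (w[:, :, None] * sub[None, :, :]).sum(axis=1)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.3, (100, 2)),   # two well-separated modes
                       rng.normal(2, 0.3, (100, 2))])
pts = data.copy()
for _ in range(50):
    pts = dsms_step(pts, data, rng=rng)
print(np.unique(pts.round(1), axis=0))                  # points collapse to modes
```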
[CV-31] Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation AAAI2026
【Quick Read】: This paper addresses semantic hallucinations in day-to-night unpaired image translation, where large appearance shifts and the absence of pixel-level supervision cause target-class objects such as traffic signs and vehicles, as well as man-made light effects, to be synthesized incorrectly, severely degrading downstream tasks. The key is an iterative refinement framework built on a Schrödinger Bridge-based translation model: a dual-head discriminator additionally performs semantic segmentation to detect hallucinated content in background regions, while class-specific prototypes, built by aggregating features of annotated target-domain objects, serve as semantic anchors so that detected hallucination features are explicitly pushed away from their prototypes in feature space, preserving object semantics across domains. On BDD100K the method improves mAP by 15.5% for day-to-night domain adaptation, with a 31.7% gain on hallucination-prone classes such as traffic lights.
Link: https://arxiv.org/abs/2602.15383
Authors: Shuwei Li, Lei Tan, Robby T. Tan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI 2026 (Oral)
Abstract:Day-to-night unpaired image translation is important to downstream tasks but remains challenging due to large appearance shifts and the lack of direct pixel-level supervision. Existing methods often introduce semantic hallucinations, where objects from target classes such as traffic signs and vehicles, as well as man-made light effects, are incorrectly synthesized. These hallucinations significantly degrade downstream performance. We propose a novel framework that detects and suppresses hallucinations of target-class features during unpaired translation. To detect hallucination, we design a dual-head discriminator that additionally performs semantic segmentation to identify hallucinated content in background regions. To suppress these hallucinations, we introduce class-specific prototypes, constructed by aggregating features of annotated target-domain objects, which act as semantic anchors for each class. Built upon a Schrodinger Bridge-based translation model, our framework performs iterative refinement, where detected hallucination features are explicitly pushed away from class prototypes in feature space, thus preserving object semantics across the translation process. Experiments show that our method outperforms existing approaches both qualitatively and quantitatively. On the BDD100K dataset, it improves mAP by 15.5% for day-to-night domain adaptation, with a notable 31.7% gain for classes such as traffic lights that are prone to hallucinations.
[CV-32] GMAIL: Generative Modality Alignment for generated Image Learning
【Quick Read】: This paper addresses the mode collapse that arises when generated images are used indiscriminately as real images for training, caused by the modality discrepancy between the real and synthetic domains. The key idea of the proposed GMAIL framework is to treat generated images as a modality distinct from real images and to align the two in a shared latent space via multi-modal learning: a model is first fine-tuned exclusively on generated images with a cross-modality alignment loss, and this aligned model is then used to further train various vision-language models with generated images, effectively leveraging recent advances in generative models and improving a range of vision-language tasks (image captioning, zero-shot image retrieval, zero-shot image classification, and more).
Link: https://arxiv.org/abs/2602.15368
Authors: Shentong Mo, Sukmin Yun
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:
Abstract:Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.
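The abstract names a cross-modality alignment loss but does not spell out its form; a plausible symmetric contrastive sketch, assuming row i of each batch is a matched generated/real pair and an illustrative temperature `tau`:

```python
import torch
import torch.nn.functional as F

def cross_modality_alignment_loss(z_gen, z_real, tau=0.07):
    """Symmetric contrastive alignment between generated-image and
    real-image embeddings in a shared latent space; matched pairs sit
    on the diagonal of the similarity matrix."""
    z_gen = F.normalize(z_gen, dim=-1)
    z_real = F.normalize(z_real, dim=-1)
    logits = z_gen @ z_real.t() / tau
    target = torch.arange(len(z_gen))
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))

z_gen, z_real = torch.randn(16, 512), torch.randn(16, 512)
print(cross_modality_alignment_loss(z_gen, z_real).item())
```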
[CV-33] DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles
【Quick Read】: This paper targets the heavy dependence on densely sampled exemplars, and hence the poor data efficiency, of current Gaussian Splatting-based neural rendering systems when generating large virtual environments. Although procedural techniques such as Wang Tiles have been introduced to scale up scenes, existing systems still rely on exhaustive exemplar reconstructions. The key to the proposed DAV-GSWT framework is to combine diffusion priors with active view sampling: a hierarchical uncertainty quantification mechanism automatically identifies the most informative viewpoints, while generative diffusion models hallucinate missing structural details, allowing high-quality Gaussian Splatting Wang Tiles to be synthesized from minimal input observations while preserving visual fidelity and interactive performance.
Link: https://arxiv.org/abs/2602.15355
Authors: Rong Fu, Jiekai Wu, Haiyun Wei, Yee Tan Jia, Wenxin Zhang, Yang Li, Xiaowen Ma, Wangyu Wu, Simon Fong
Affiliations: University of Macau; Juntendo University; Tongji University; Renmin University of China; University of Chinese Academy of Sciences; Zhejiang University; University of Liverpool
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures
Abstract:The emergence of 3D Gaussian Splatting has fundamentally redefined the capabilities of photorealistic neural rendering by enabling high-throughput synthesis of complex environments. While procedural methods like Wang Tiles have recently been integrated to facilitate the generation of expansive landscapes, these systems typically remain constrained by a reliance on densely sampled exemplar reconstructions. We present DAV-GSWT, a data-efficient framework that leverages diffusion priors and active view sampling to synthesize high-fidelity Gaussian Splatting Wang Tiles from minimal input observations. By integrating a hierarchical uncertainty quantification mechanism with generative diffusion models, our approach autonomously identifies the most informative viewpoints while hallucinating missing structural details to ensure seamless tile transitions. Experimental results indicate that our system significantly reduces the required data volume while maintaining the visual integrity and interactive performance necessary for large-scale virtual environments.
[CV-34] CREMD: Crowd-Sourced Emotional Multimodal Dogs Dataset
【Quick Read】: This paper addresses the difficulty of accurate dog emotion recognition, which stems from the subjective nature of emotional assessment and the lack of standardized annotation methods. The key to the solution is the construction and analysis of CREMD (Crowd-sourced Emotional Multimodal Dogs Dataset), which systematically studies the factors influencing the perception and labeling consistency of dog emotions across three presentation modes (no context or audio; context without audio; context with audio) and diverse annotator characteristics (dog ownership, gender, professional experience). The study finds that visual context significantly improves annotation agreement; while the role of audio remains inconclusive, it clearly increases annotators' confidence for specific emotions such as anger and fear. Moreover, non-owners and male annotators show higher agreement than dog owners and female annotators, while professional annotators show the highest agreement, providing empirical grounding for more reliable human-animal interaction systems and automated canine emotion monitoring.
Link: https://arxiv.org/abs/2602.15349
Authors: Jinho Baek, Houwei Cao, Kate Blackwell
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to arXiv
Abstract:Dog emotion recognition plays a crucial role in enhancing human-animal interactions, veterinary care, and the development of automated systems for monitoring canine well-being. However, accurately interpreting dog emotions is challenging due to the subjective nature of emotional assessments and the absence of standardized ground truth methods. We present the CREMD (Crowd-sourced Emotional Multimodal Dogs Dataset), a comprehensive dataset exploring how different presentation modes (e.g., context, audio, video) and annotator characteristics (e.g., dog ownership, gender, professional experience) influence the perception and labeling of dog emotions. The dataset consists of 923 video clips presented in three distinct modes: without context or audio, with context but no audio, and with both context and audio. We analyze annotations from diverse participants, including dog owners, professionals, and individuals with varying demographic backgrounds and experience levels, to identify factors that influence reliable dog emotion recognition. Our findings reveal several key insights: (1) while adding visual context significantly improved annotation agreement, our findings regarding audio cues are inconclusive due to design limitations (specifically, the absence of a no-context-with-audio condition and limited clean audio availability); (2) contrary to expectations, non-owners and male annotators showed higher agreement levels than dog owners and female annotators, respectively, while professionals showed higher agreement levels, aligned with our initial hypothesis; and (3) the presence of audio substantially increased annotators’ confidence in identifying specific emotions, particularly anger and fear.
[CV-35] Effective and Robust Multimodal Medical Image Analysis KDD2026
【Quick Read】: This paper addresses three key limitations of Multimodal Fusion Learning (MFL) in medical AI: existing methods are often modality-specific and overlook complementary cross-modal information, limiting generalization to multi-disease analysis; they rely on computationally expensive models that are hard to deploy under resource constraints; and they lack robustness to adversarial attacks, undermining clinical reliability. The core of the solution is a novel Multi-Attention Integration Learning (MAIL) network with two key components: an efficient residual-learning attention block that captures refined modality-specific multi-scale patterns, and a lightweight multimodal cross-attention module that learns enriched shared representations across modalities. To further ensure robustness, Robust-MAIL extends this design with random projection filters and modulated attention noise to resist adversarial attacks. Across 20 public datasets, MAIL and Robust-MAIL outperform existing methods by up to 9.34% while cutting computational cost by up to 78.3%.
Link: https://arxiv.org/abs/2602.15346
Authors: Joy Dhar, Nayyar Zaidi, Maryam Haghighat
Affiliations: Indian Institute of Technology Ropar; Deakin University; Queensland University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
Abstract:Multimodal Fusion Learning (MFL), leveraging disparate data from various imaging modalities (e.g., MRI, CT, SPECT), has shown great potential for addressing medical problems such as skin cancer and brain tumor prediction. However, existing MFL methods face three key limitations: a) they often specialize in specific modalities, and overlook effective shared complementary information across diverse modalities, hence limiting their generalizability for multi-disease analysis; b) they rely on computationally expensive models, restricting their applicability in resource-limited settings; and c) they lack robustness against adversarial attacks, compromising reliability in medical AI applications. To address these limitations, we propose a novel Multi-Attention Integration Learning (MAIL) network, incorporating two key components: a) an efficient residual learning attention block for capturing refined modality-specific multi-scale patterns and b) an efficient multimodal cross-attention module for learning enriched complementary shared representations across diverse modalities. Furthermore, to ensure adversarial robustness, we extend MAIL network to design Robust-MAIL by incorporating random projection filters and modulated attention noise. Extensive evaluations on 20 public datasets show that both MAIL and Robust-MAIL outperform existing methods, achieving performance gains of up to 9.34% while reducing computational costs by up to 78.3%. These results highlight the superiority of our approaches, ensuring more reliable predictions than top competitors. Code: this https URL.
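A compact PyTorch sketch of a multimodal cross-attention block of the kind the abstract describes, assuming `nn.MultiheadAttention` with residual fusion; the dimensions, head count, and LayerNorm placement are illustrative assumptions rather than the paper's module:

```python
import torch
from torch import nn

class MultimodalCrossAttention(nn.Module):
    """Tokens of modality A attend to tokens of modality B; the attended
    result is fused back into A through a residual connection."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a, b):
        fused, _ = self.attn(query=a, key=b, value=b)
        return self.norm(a + fused)

x_mri = torch.randn(2, 49, 128)   # 49 patch tokens per MRI image
x_ct = torch.randn(2, 49, 128)    # 49 patch tokens per CT image
shared = MultimodalCrossAttention(128)(x_mri, x_ct)
print(shared.shape)               # torch.Size([2, 49, 128])
```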
[CV-36] EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
【Quick Read】: This paper addresses a fundamental challenge in online video understanding: the conflict between unbounded streaming input and the limited context window of Multimodal Large Language Models (MLLMs). Existing methods mostly rely on passive processing and trade off long-range context against the fine-grained detail required by complex tasks. The key to the proposed EventMemAgent framework is an active online video agent built on a hierarchical memory module: short-term memory detects event boundaries and maintains a fixed-length buffer via event-granular reservoir sampling over streaming frames, while long-term memory archives past observations in a structured, event-by-event manner. A multi-granular perception toolkit enables active, iterative evidence gathering, and Agentic Reinforcement Learning (Agentic RL) internalizes reasoning and tool-use policies end-to-end as intrinsic capabilities of the agent.
Link: https://arxiv.org/abs/2602.15329
Authors: Siwei Wen, Zhangcheng Wang, Xingjian Zhang, Lei Huang, Wenjun Wu
Affiliations: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; Hangzhou International Innovation Institute, Beihang University; Paradigm Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams. Its fundamental challenge lies in the conflict between the unbounded nature of streaming media input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods primarily rely on passive processing, which often face a trade-off between maintaining long-range context and capturing the fine-grained details necessary for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework based on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and utilizes event-granular reservoir sampling to process streaming video frames within a fixed-length buffer dynamically; long-term memory structuredly archives past observations on an event-by-event basis. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to end-to-end internalize reasoning and tool-use strategies into the agent’s intrinsic capabilities. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released here: this https URL.
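The fixed-length buffer rests on classic reservoir sampling; a minimal sketch of one update step (EventMemAgent applies this at event granularity, e.g. one reservoir per detected event, which is not shown here):

```python
import random

def reservoir_update(buffer, k, item, n_seen):
    """One step of classic reservoir sampling: after n_seen items, buffer
    holds a uniform random sample of size k from the stream so far."""
    if len(buffer) < k:
        buffer.append(item)
    else:
        j = random.randint(1, n_seen)   # inclusive on both ends
        if j <= k:
            buffer[j - 1] = item

buf, K = [], 8
for t, frame in enumerate(range(100), start=1):  # stand-in frame stream
    reservoir_update(buf, K, frame, t)
print(sorted(buf))
```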
[CV-37] Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
【Quick Read】: This paper addresses the performance collapse of speculative decoding when applied to Video Large Language Models (Vid-LLMs), where key-value cache explosion and context-window mismatches cause attention dilution and negative visual gain, degrading inference efficiency. The key to the proposed Sparrow framework is threefold: visually-aware text-anchored window attention reuses hidden states to offload visual computation entirely to the target model; intermediate-layer visual state bridging trains the draft model on semantically rich intermediate states, filtering out low-level visual noise; and a multi-token prediction strategy mitigates the training-inference distribution shift. Sparrow achieves an average 2.82x speedup even with sequences of 25k visual tokens, substantially alleviating the degradation seen on long-video tasks.
Link: https://arxiv.org/abs/2602.15318
Authors: Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li
Affiliations: National University of Defense Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures
Abstract:Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
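Sparrow builds on standard speculative decoding; as background, a NumPy sketch of the generic accept/resample rule for verifying one drafted token against the target model (this shows only the base mechanism, not Sparrow's window attention or state-bridging machinery):

```python
import numpy as np

def speculative_accept(p_target, q_draft, token, rng):
    """Standard speculative-sampling test for one drafted token: accept
    with probability min(1, p/q); on rejection, resample from the
    normalized residual distribution max(p - q, 0)."""
    p, q = p_target[token], q_draft[token]
    if rng.random() < min(1.0, p / max(q, 1e-12)):
        return token, True
    residual = np.clip(p_target - q_draft, 0.0, None)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # target-model next-token distribution
q = np.array([0.2, 0.6, 0.2])   # draft-model next-token distribution
print(speculative_accept(p, q, token=1, rng=rng))
```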
[CV-38] Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models
【Quick Read】: This paper addresses zero-shot anomaly detection (ZSAD) in 3D medical images, i.e., identifying abnormal regions in 3D brain MRI without task-specific supervision. Most existing approaches are limited to 2D data and rely on slice-wise features or vision-language models, failing to capture volumetric structure. The key to the solution is a fully training-free framework that builds localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models, restoring cubic spatial context, and plugs them directly into a distance-based, batch-level anomaly detection pipeline, enabling efficient 3D anomaly detection without fine-tuning or prompts.
Link: https://arxiv.org/abs/2602.15315
Authors: Tai Le-Gia, Jaehyun Ahn
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Accepted for MIDL 2026
Abstract:Zero-shot anomaly detection (ZSAD) has gained increasing attention in medical imaging as a way to identify abnormalities without task-specific supervision, but most advances remain limited to 2D datasets. Extending ZSAD to 3D medical images has proven challenging, with existing methods relying on slice-wise features and vision-language models, which fail to capture volumetric structure. In this paper, we introduce a fully training-free framework for ZSAD in 3D brain MRI that constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines. The framework provides compact 3D representations that are practical to compute on standard GPUs and require no fine-tuning, prompts, or supervision. Our results show that training-free, batch-based ZSAD can be effectively extended from 2D encoders to full 3D MRI volumes, offering a simple and robust approach for volumetric anomaly detection.
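A NumPy sketch of the two core steps under simplifying assumptions: per-axis 2D features are fused by plain averaging (the paper's aggregation may differ), and anomaly scores come from the mean k-nearest-neighbour distance within a reference/batch token set:

```python
import numpy as np

def volumetric_tokens(ax, co, sa):
    """Fuse slice-wise 2D features from the three anatomical axes into one
    volumetric token per voxel by averaging. Each input: (X, Y, Z, D)."""
    return (ax + co + sa) / 3.0

def knn_scores(tokens, reference, k=5):
    """Distance-based, batch-level scoring: a token is anomalous when it is
    far from its k nearest neighbours in the reference set (the closest
    match is skipped since reference may contain the token itself)."""
    d = np.linalg.norm(tokens[:, None, :] - reference[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, 1:k + 1].mean(axis=1)

vol = volumetric_tokens(*[np.random.rand(8, 8, 8, 32) for _ in range(3)])
flat = vol.reshape(-1, 32)
print(knn_scores(flat[:10], flat).max())
```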
[CV-39] Consistency-Preserving Diverse Video Generation
【Quick Read】: This paper addresses the trade-off between batch diversity and within-video temporal consistency in text-to-video generation under low sample counts: existing diversity-boosting methods often degrade temporal consistency and require costly backpropagation through a video decoder. The key to the solution is a joint-sampling framework that optimizes diversity and temporal-consistency objectives in latent space with lightweight models: diversity-driven updates are applied first, and then only the components that would decrease the temporal-consistency objective are removed, avoiding image-space gradients and video decoding while substantially improving temporal consistency and color naturalness at diversity levels comparable to strong baselines.
Link: https://arxiv.org/abs/2602.15287
Authors: Xinshuang Liu, Runfa Blark Li, Truong Nguyen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.
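A PyTorch sketch of the "update for diversity, then delete the harmful component" step, assuming both objectives are differentiable latent-space scalars; the toy objectives at the bottom are stand-ins, not the paper's models:

```python
import torch

def consistency_preserving_step(latents, diversity_obj, consistency_obj, lr=0.1):
    """Diversity ascent that preserves temporal consistency: remove the
    component of the diversity gradient along the consistency gradient
    only when that component would lower the consistency objective."""
    latents = latents.detach().requires_grad_(True)
    g_div = torch.autograd.grad(diversity_obj(latents), latents)[0]
    g_con = torch.autograd.grad(consistency_obj(latents), latents)[0]
    gc = g_con.flatten()
    coef = g_div.flatten().dot(gc) / (gc.dot(gc) + 1e-12)
    if coef < 0:                      # the step would hurt consistency
        g_div = g_div - coef * g_con  # project out the harmful component
    return (latents + lr * g_div).detach()

# toy objectives over a batch of video latents (B, T, D)
div = lambda z: torch.cdist(z.mean(1), z.mean(1)).mean()   # cross-video spread
con = lambda z: -((z[:, 1:] - z[:, :-1]) ** 2).mean()      # frame smoothness
z = consistency_preserving_step(torch.randn(4, 8, 16), div, con)
```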
[CV-40] Visual Persuasion: What Influences Decisions of Vision-Language Models?
【Quick Read】: This paper addresses the lack of a systematic understanding of the visual preferences of vision-language models (VLMs) in image-based choice tasks, i.e., how AI agents decide what to click, recommend, or buy based on image content. The key to the solution is a framework of controlled image-selection tasks that treats the VLM's decision function as a latent visual utility inferable through revealed preference, systematically perturbing inputs to identify which visual modifications raise selection probability. Concretely, the authors develop visual prompt optimization methods that adapt text-optimization techniques, using an image generation model to iteratively propose visually plausible edits (e.g., to composition, lighting, or background), and validate through large-scale experiments that optimized edits significantly shift VLM choices; an automatic interpretability pipeline identifies the consistent visual themes driving selection, surfacing visual vulnerabilities and safety concerns of image-based AI agents and supporting more proactive auditing and governance.
Link: https://arxiv.org/abs/2602.15278
Authors: Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 45 pages, 17 figures
Abstract:The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent’s decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.
[CV-41] Accelerating Large-Scale Dataset Distillation via Exploration-Exploitation Optimization
【Quick Read】: This paper addresses the long-standing accuracy-efficiency trade-off in large-scale dataset distillation: optimization-based methods achieve higher accuracy at heavy computational cost, while optimization-free methods are efficient but sacrifice performance. The key to the solution is Exploration-Exploitation Distillation (E²D), a simple and practical method that minimizes redundant computation through an efficient two-phase pipeline: full-image initialization preserves semantic integrity and feature diversity; an exploration phase performs uniform updates and identifies high-loss regions; and an exploitation phase concentrates updates on those regions to accelerate convergence. E²D surpasses the state of the art on ImageNet-1K and ImageNet-21K while being 18x and 4.3x faster, respectively.
Link: https://arxiv.org/abs/2602.15277
Authors: Muhammad J. Alahmadi, Peng Gao, Feiyi Wang, Dongkuan (DK) Xu
Affiliations: North Carolina State University; King Abdulaziz University; Oak Ridge National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling-based distillation methods enable dataset distillation at large-scale, they continue to face an efficiency gap: optimization-based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization-free decoupling methods are efficient but sacrifice accuracy. To overcome this trade-off, we propose Exploration-Exploitation Distillation (E^2D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full-image initialization to preserve semantic integrity and feature diversity. It then uses a two-phase optimization strategy: an exploration phase that performs uniform updates and identifies high-loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E^2D on large-scale benchmarks, surpassing the state-of-the-art on ImageNet-1K while being 18x faster, and on ImageNet-21K, our method substantially improves accuracy while remaining 4.3x faster. These results demonstrate that targeted, redundancy-reducing updates, rather than brute-force optimization, bridge the gap between accuracy and efficiency in large-scale dataset distillation. Code is available at this https URL.
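A PyTorch sketch of the two-phase update, assuming square patches and using gradient energy as a proxy for the per-region loss the paper tracks during exploration; `patch` size and `topk` are illustrative values:

```python
import torch
import torch.nn.functional as F

def high_loss_region_mask(grad, patch=56, topk=4):
    """Keep only the top-k patches with the largest gradient energy, a
    proxy for the high-loss regions identified during exploration."""
    B, C, H, W = grad.shape
    energy = F.avg_pool2d(grad.pow(2).mean(1, keepdim=True), patch, stride=patch)
    idx = energy.flatten(1).topk(topk, dim=1).indices
    mask = torch.zeros_like(energy.flatten(1)).scatter_(1, idx, 1.0)
    mask = mask.view(B, 1, H // patch, W // patch)
    return F.interpolate(mask, size=(H, W), mode="nearest")

def e2d_step(synthetic, loss_fn, lr=0.05, exploit=False):
    """Exploration: uniform update of the whole image. Exploitation: the
    same gradient, applied only inside the flagged high-loss regions."""
    synthetic = synthetic.detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(synthetic), synthetic)[0]
    if exploit:
        grad = grad * high_loss_region_mask(grad)
    return (synthetic - lr * grad).detach()

imgs = torch.randn(2, 3, 224, 224)
imgs = e2d_step(imgs, lambda x: (x ** 2).mean(), exploit=True)
```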
[CV-42] Time-Archival Camera Virtualization for Sports and Visual Performances
【Quick Read】: This paper targets efficient, spatio-temporally coherent view synthesis for dynamic scenes, particularly fast-paced sports and stage performances with rapid, multi-subject non-rigid motion, where existing approaches such as dynamic 3D Gaussian Splatting struggle: they depend on accurate structure-from-motion point clouds, cannot handle large non-rigid deformations, and the independent motions of multiple subjects break common Gaussian-tracking assumptions. The key to the solution is to reconsider neural volume rendering, modeling the dynamic scene as rigid transformations across multiple synchronized camera views at a given time and learning a neural representation that delivers enhanced rendering quality at test time. Crucially, the method is the first to support time-archival: users can revisit any past moment of a dynamic scene and synthesize novel views, enabling retrospective replay, analysis, and archival of live events.
Link: https://arxiv.org/abs/2602.15181
Authors: Yunxiao Zhang, William Stone, Suryansh Kumar
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project Page: this https URL. Under minor revision in Journal of Computer Vision and Image Understanding (CVIU); Special Issue: Computer Vision for Sports and Winter Sports. Outcome of a master and bachelor student project completed in Visual and Spatial AI Lab at TAMU
Abstract:Camera virtualization – an emerging solution to novel view synthesis – holds transformative potential for visual entertainment, live performances, and sports broadcasting by enabling the generation of photorealistic images from novel viewpoints using images from a limited set of calibrated multiple static physical cameras. Despite recent advances, achieving spatially and temporally coherent and photorealistic rendering of dynamic scenes with efficient time-archival capabilities, particularly in fast-paced sports and stage performances, remains challenging for existing approaches. Recent methods based on 3D Gaussian Splatting (3DGS) for dynamic scenes could offer real-time view-synthesis results. Yet, they are hindered by their dependence on accurate 3D point clouds from the structure-from-motion method and their inability to handle large, non-rigid, rapid motions of different subjects (e.g., flips, jumps, articulations, sudden player-to-player transitions). Moreover, independent motions of multiple subjects can break the Gaussian-tracking assumptions commonly used in 4DGS, ST-GS, and other dynamic splatting variants. This paper advocates reconsidering a neural volume rendering formulation for camera virtualization and efficient time-archival capabilities, making it useful for sports broadcasting and related applications. By modeling a dynamic scene as rigid transformations across multiple synchronized camera views at a given time, our method performs neural representation learning, providing enhanced visual rendering quality at test time. A key contribution of our approach is its support for time-archival, i.e., users can revisit any past temporal instance of a dynamic scene and can perform novel view synthesis, enabling retrospective rendering for replay, analysis, and archival of live events, a functionality absent in existing neural rendering approaches and novel view synthesis…
[CV-43] Distributional Deep Learning for Super-Resolution of 4D Flow MRI under Domain Shift
【Quick Read】: This paper addresses the poor generalization of super-resolution models in real clinical settings caused by domain shift: conventional approaches train on pairs produced by simple downsampling, whereas clinically acquired low-resolution data arise from very different acquisition mechanisms, so real inputs fall outside the training distribution. The key to the solution is a distributional deep learning framework, developed for 4D Flow MRI: the model is first trained on high-resolution computational fluid dynamics (CFD) simulations and their downsampled counterparts, then fine-tuned on a small, harmonized dataset of paired 4D Flow MRI and CFD samples, improving robustness to real medical data distributions. Theoretical properties of the distributional estimators are derived, and real-data applications show the framework significantly outperforms traditional deep learning approaches in clinically realistic scenarios.
Link: https://arxiv.org/abs/2602.15167
Authors: Xiaoyi Wen, Fei Jiang
Affiliations: University of California, San Francisco
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Machine Learning (stat.ML)
Comments:
Abstract:Super-resolution is widely used in medical imaging to enhance low-quality data, reducing scan time and improving abnormality detection. Conventional super-resolution approaches typically rely on paired datasets of downsampled and original high resolution images, training models to reconstruct high resolution images from their artificially degraded counterparts. However, in real-world clinical settings, low resolution data often arise from acquisition mechanisms that differ significantly from simple downsampling. As a result, these inputs may lie outside the domain of the training data, leading to poor model generalization due to domain shift. To address this limitation, we propose a distributional deep learning framework that improves model robustness and domain generalization. We develop this approach for enhancing the resolution of 4D Flow MRI (4DF). This is a novel imaging modality that captures hemodynamic flow velocity and clinically relevant metrics such as vessel wall stress. These metrics are critical for assessing aneurysm rupture risk. Our model is initially trained on high resolution computational fluid dynamics (CFD) simulations and their downsampled counterparts. It is then fine-tuned on a small, harmonized dataset of paired 4D Flow MRI and CFD samples. We derive the theoretical properties of our distributional estimators and demonstrate that our framework significantly outperforms traditional deep learning approaches through real data applications. This highlights the effectiveness of distributional learning in addressing domain shift and improving super-resolution performance in clinically realistic scenarios.
[CV-44] Refine Now, Query Fast: A Decoupled Refinement Paradigm for Implicit Neural Fields ICLR2026
【Quick Read】: This paper addresses the fidelity-speed dilemma of Implicit Neural Representations (INRs) as surrogates for scientific simulations: deep MLPs are expressive but costly at inference, while efficient embedding-based models lack modeling capacity. The key to the solution is the Decoupled Representation Refinement (DRR) architectural paradigm: a one-time offline process uses a deep refiner network together with non-parametric transformations to encode rich features into a compact, efficient embedding structure, decoupling the slow, high-capacity network from the fast inference path. This yields up to 27x faster inference than high-fidelity baselines while maintaining state-of-the-art fidelity.
Link: https://arxiv.org/abs/2602.15155
Authors: Tianyu Xiong, Skylar Wurster, Han-Wei Shen
Affiliations: Ohio State University; Adobe
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to ICLR 2026. Code available at this https URL
Abstract:Implicit Neural Representations (INRs) have emerged as promising surrogates for large 3D scientific simulations due to their ability to continuously model spatial and conditional fields, yet they face a critical fidelity-speed dilemma: deep MLPs suffer from high inference cost, while efficient embedding-based models lack sufficient expressiveness. To resolve this, we propose the Decoupled Representation Refinement (DRR) architectural paradigm. DRR leverages a deep refiner network, alongside non-parametric transformations, in a one-time offline process to encode rich representations into a compact and efficient embedding structure. This approach decouples slow neural networks with high representational capacity from the fast inference path. We introduce DRR-Net, a simple network that validates this paradigm, and a novel data augmentation strategy, Variational Pairs (VP), for improving INRs under complex tasks like high-dimensional surrogate modeling. Experiments on several ensemble simulation datasets demonstrate that our approach achieves state-of-the-art fidelity, while being up to 27x faster at inference than high-fidelity baselines and remaining competitive with the fastest models. The DRR paradigm offers an effective strategy for building powerful and practical neural field surrogates and INRs in broader applications, with a minimal compromise between speed and quality.
[CV-45] Loss Knows Best: Detecting Annotation Errors in Videos via Loss Trajectories
【Quick Read】: This paper addresses annotation errors in video datasets, especially mislabeling and temporal disordering, which are particularly harmful for phase-annotated tasks because they violate temporal consistency. The key to the solution is a model-agnostic annotation-error detection method based on the Cumulative Sample Loss (CSL), the average loss each frame incurs across model checkpoints saved during training. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability: correctly labeled frames typically converge to low loss early in training, whereas mislabeled or disordered frames show persistently high or irregular loss patterns. The method requires no ground truth on annotation errors and generalizes across datasets; experiments on EgoPER and Cholec80 show it effectively identifies subtle annotation inconsistencies.
Link: https://arxiv.org/abs/2602.15154
Authors: Praditha Alwis, Soumyadeep Chandra, Deepak Ravikumar, Kaushik Roy
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, 6 tables
Abstract:High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as mislabeling, where segments are assigned incorrect class labels, and disordering, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL)–defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video-based machine learning.
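A sketch of the CSL computation, assuming a per-frame classification setup with cross-entropy; the 2-sigma flagging threshold is an illustrative assumption rather than the paper's rule:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cumulative_sample_loss(frames, labels, model, checkpoint_paths):
    """Average each frame's loss over checkpoints saved across training;
    frames whose CSL stays unusually high are candidates for annotation
    errors (mislabeled or temporally disordered)."""
    total = torch.zeros(len(frames))
    for path in checkpoint_paths:
        model.load_state_dict(torch.load(path))
        model.eval()
        logits = model(frames)
        total += F.cross_entropy(logits, labels, reduction="none")
    csl = total / len(checkpoint_paths)
    threshold = csl.mean() + 2.0 * csl.std()   # flagging rule: assumption
    return csl, (csl > threshold).nonzero().flatten()
```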
[CV-46] MB-DSMIL-CL-PL: Scalable Weakly Supervised Ovarian Cancer Subtype Classification and Localisation Using Contrastive and Prototype Learning with Frozen Patch Features
【Quick Read】: This paper addresses the accuracy and scalability of subtype classification and localisation in ovarian cancer histopathology, where growing diagnostic workloads expose the limits of prior approaches: methods using pre-computed frozen features hit accuracy ceilings, while end-to-end feature extraction improves accuracy at the cost of training scalability and slow experimentation. The key to the solution is a new framework that combines contrastive and prototype learning over frozen patch features via feature-space augmentations, substantially improving instance- and slide-level F1 scores and localisation AUC without retraining the pre-trained features, thereby balancing high accuracy with high scalability.
Link: https://arxiv.org/abs/2602.15138
Authors: Marcus Jenkins, Jasenka Mazibrada, Bogdan Leahu, Michal Mackiewicz
Affiliations: University of East Anglia; Norfolk and Norwich University Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The study of histopathological subtypes is valuable for the personalisation of effective treatment strategies for ovarian cancer. However, increasing diagnostic workloads present a challenge for UK pathology departments, leading to the rise in AI approaches. While traditional approaches in this field have relied on pre-computed, frozen image features, recent advances have shifted towards end-to-end feature extraction, providing an improvement in accuracy but at the expense of significantly reduced scalability during training and time-consuming experimentation. In this paper, we propose a new approach for subtype classification and localisation in ovarian cancer histopathology images using contrastive and prototype learning with pre-computed, frozen features via feature-space augmentations. Compared to DSMIL, our method achieves an improvement of 70.4% and 15.3% in F1 score for instance- and slide-level classification, respectively, along with AUC gains of 16.9% for instance localisation and 2.3% for slide classification, while maintaining the use of frozen patch features.
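The feature-space augmentations operate on frozen patch features, so the backbone never needs to be re-run; a minimal sketch with two common choices (Gaussian jitter and feature mixup), both illustrative assumptions about the paper's exact augmentations:

```python
import torch

def feature_space_augment(feats, noise_std=0.1, mix_alpha=0.2):
    """Cheap augmentations applied directly to frozen patch features:
    Gaussian jitter plus feature mixup with a shuffled batch, yielding
    two views usable for contrastive training."""
    noisy = feats + noise_std * torch.randn_like(feats)
    lam = torch.distributions.Beta(mix_alpha, mix_alpha).sample()
    mixed = lam * feats + (1.0 - lam) * feats[torch.randperm(len(feats))]
    return noisy, mixed

v1, v2 = feature_space_augment(torch.randn(32, 768))  # two contrastive views
```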
[CV-47] Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition ICLR2026
【Quick Read】: This paper addresses interaction recognition (IR) in zero-shot human-object interaction (HOI) detection, where the combinatorial diversity of interactions, tight coupling with specific detectors, and reliance on coarse-grained vision-language model (VLM) features limit generalization to unseen interactions. The key to the solution is a decoupled framework that separates object detection from interaction recognition and leverages multi-modal large language models (MLLMs) for zero-shot IR: a deterministic generation method formulates IR as visual question answering with enforced deterministic outputs, enabling training-free zero-shot inference; a spatial-aware pooling module fuses appearance with pairwise spatial cues; and a one-pass deterministic matching method predicts all candidate interactions in a single forward pass, improving both performance and efficiency.
Link: https://arxiv.org/abs/2602.15124
Authors: Shiyu Xuan, Dongkai Wang, Zechao Li, Jinhui Tang
Affiliations: Nanjing University of Science and Technology; Southwestern University of Finance and Economics; Nanjing Forestry University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026
Abstract:Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limit generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further enhance performance and efficiency by fine-tuning the model, we design a spatial-aware pooling module that integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Extensive experiments on HICO-DET and V-COCO demonstrate that our method achieves superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detectors without retraining. The codes are publicly available at this https URL.
[CV-48] GRAFNet: Multiscale Retinal Processing via Guided Cortical Attention Feedback for Enhancing Medical Image Polyp Segmentation
【Quick Read】: This paper addresses the accuracy of polyp segmentation in colonoscopy, which is challenged by high morphological variability (from flat to protruding lesions), strong visual similarity to normal structures such as folds and vessels, and the need for robust multi-scale detection. Existing deep learning methods suffer from unidirectional processing, weak multi-scale fusion, and missing anatomical constraints, producing false positives (over-segmenting normal structures) and false negatives (missing subtle flat lesions). The key to the solution is GRAFNet, an architecture inspired by the hierarchical organization of the human visual system, integrating three modules: a Guided Asymmetric Attention Module (GAAM) mimicking orientation-tuned cortical neurons to emphasize polyp boundaries; a MultiScale Retinal Module (MSRM) replicating retinal ganglion cell pathways for parallel multi-feature analysis; and a Guided Cortical Attention Feedback Module (GCAFM) applying predictive coding for iterative refinement. These are unified in a Polyp Encoder-Decoder Module (PEDM) that enforces spatial-semantic consistency via resolution-adaptive feedback, yielding 3-8% Dice improvements and 10-20% better generalization over leading methods on five public benchmarks, with interpretable decision pathways.
Link: https://arxiv.org/abs/2602.15072
Authors: Abdul Joseph Fofanah, Lian Wen, Alpha Alimamy Kamara, Zhongyi Zhang, David Chen, Albert Patrick Sankoh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurate polyp segmentation in colonoscopy is essential for cancer prevention but remains challenging due to: (1) high morphological variability (from flat to protruding lesions), (2) strong visual similarity to normal structures such as folds and vessels, and (3) the need for robust multi-scale detection. Existing deep learning approaches suffer from unidirectional processing, weak multi-scale fusion, and the absence of anatomical constraints, often leading to false positives (over-segmentation of normal structures) and false negatives (missed subtle flat lesions). We propose GRAFNet, a biologically inspired architecture that emulates the hierarchical organisation of the human visual system. GRAFNet integrates three key modules: (1) a Guided Asymmetric Attention Module (GAAM) that mimics orientation-tuned cortical neurones to emphasise polyp boundaries, (2) a MultiScale Retinal Module (MSRM) that replicates retinal ganglion cell pathways for parallel multi-feature analysis, and (3) a Guided Cortical Attention Feedback Module (GCAFM) that applies predictive coding for iterative refinement. These are unified in a Polyp Encoder-Decoder Module (PEDM) that enforces spatial-semantic consistency via resolution-adaptive feedback. Extensive experiments on five public benchmarks (Kvasir-SEG, CVC-300, CVC-ColonDB, CVC-Clinic, and PolypGen) demonstrate consistent state-of-the-art performance, with 3-8% Dice improvements and 10-20% higher generalisation over leading methods, while offering interpretable decision pathways. This work establishes a paradigm in which neural computation principles bridge the gap between AI accuracy and clinically trustworthy reasoning. Code is available at this https URL.
[CV-49] Benchmarking Self-Supervised Models for Cardiac Ultrasound View Classification
【Quick Read】: This paper addresses the accuracy of automated view classification in cardiac ultrasound images, aiming to improve the reliability of clinical diagnosis and assessment. The key to the solution is to train and compare two self-supervised learning frameworks, USF-MAE and MoCo v3, on a large cardiac ultrasound dataset (CACTUS, 37,736 images). The study finds that USF-MAE consistently outperforms MoCo v3 across metrics (ROC-AUC, accuracy, F1-score, and recall), indicating that it learns more discriminative feature representations for automated cardiac view recognition.
Link: https://arxiv.org/abs/2602.15339
Authors: Youssef Megahed, Salma I. Megahed, Robin Ducharme, Inok Lee, Adrian D. C. Chan, Mark C. Walker, Steven Hawken
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 3 tables
Abstract:Reliable interpretation of cardiac ultrasound images is essential for accurate clinical diagnosis and assessment. Self-supervised learning has shown promise in medical imaging by leveraging large unlabelled datasets to learn meaningful representations. In this study, we evaluate and compare two self-supervised learning frameworks, USF-MAE, developed by our team, and MoCo v3, on the recently introduced CACTUS dataset (37,736 images) for automated simulated cardiac view (A4C, PL, PSAV, PSMV, Random, and SC) classification. Both models used 5-fold cross-validation, enabling robust assessment of generalization performance across multiple random splits. The CACTUS dataset provides expert-annotated cardiac ultrasound images with diverse views. We adopt an identical training protocol for both models to ensure a fair comparison. Both models are configured with a learning rate of 0.0001 and a weight decay of 0.01. For each fold, we record performance metrics including ROC-AUC, accuracy, F1-score, and recall. Our results indicate that USF-MAE consistently outperforms MoCo v3 across metrics. The average testing AUC for USF-MAE is 99.99% (+/-0.01% 95% CI), compared to 99.97% (+/-0.01%) for MoCo v3. USF-MAE achieves a mean testing accuracy of 99.33% (+/-0.18%), higher than the 98.99% (+/-0.28%) reported for MoCo v3. Similar trends are observed for the F1-score and recall, with improvements statistically significant across folds (paired t-test, p = 0.0048 < 0.01). This proof-of-concept analysis suggests that USF-MAE learns more discriminative features for cardiac view classification than MoCo v3 when applied to this dataset. The enhanced performance across multiple metrics highlights the potential of USF-MAE for improving automated cardiac ultrasound classification.
[CV-50] StrokeNeXt: A Siamese-encoder Approach for Brain Stroke Classification in Computed Tomography Imagery
【Quick Read】: This paper addresses automated stroke classification in 2D Computed Tomography (CT) images, covering both stroke detection and the distinction between ischemic and hemorrhagic subtypes. The key to the solution is StrokeNeXt, a dual-branch model with two ConvNeXt encoders whose features are fused by a lightweight convolutional decoder built from stacked 1D operations (a bottleneck projection and transformation layers) followed by a compact classification head, achieving high accuracy together with low inference time and fast convergence.
Link: https://arxiv.org/abs/2602.15087
Authors: Leo Thomas Ramos, Angel D. Sappa
Affiliations: Computer Vision Center; Universitat Autònoma de Barcelona; ESPOL Polytechnic University
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures, 11 tables
Abstract:We present StrokeNeXt, a model for stroke classification in 2D Computed Tomography (CT) images. StrokeNeXt employs a dual-branch design with two ConvNeXt encoders, whose features are fused through a lightweight convolutional decoder based on stacked 1D operations, including a bottleneck projection and transformation layers, and a compact classification head. The model is evaluated on a curated dataset of 6,774 CT images, addressing both stroke detection and subtype classification between ischemic and hemorrhage cases. StrokeNeXt consistently outperforms convolutional and Transformer-based baselines, reaching accuracies and F1-scores of up to 0.988. Paired statistical tests confirm that the performance gains are statistically significant, while class-wise sensitivity and specificity demonstrate robust behavior across diagnostic categories. Calibration analysis shows reduced prediction error compared to competing methods, and confusion matrix results indicate low misclassification rates. In addition, the model exhibits low inference time and fast convergence.
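A PyTorch sketch of the dual-branch fusion pattern: two encoders (toy CNNs standing in for the paper's ConvNeXt backbones) feed a 1D-convolutional decoder with a transformation stack and bottleneck projection, then a compact head; the channel counts and kernel sizes are illustrative assumptions:

```python
import torch
from torch import nn

class DualBranchFusion(nn.Module):
    """Two encoders produce feature vectors that are stacked as two 1D
    channels and fused by stacked 1D convolutions before classification."""
    def __init__(self, enc_a, enc_b, feat_dim, n_classes=3):
        super().__init__()
        self.enc_a, self.enc_b = enc_a, enc_b
        self.decoder = nn.Sequential(
            nn.Conv1d(2, 8, kernel_size=3, padding=1),   # transformation
            nn.ReLU(),
            nn.Conv1d(8, 1, kernel_size=1),              # bottleneck projection
        )
        self.head = nn.Linear(feat_dim, n_classes)       # compact head

    def forward(self, x):
        f = torch.stack([self.enc_a(x), self.enc_b(x)], dim=1)  # (B, 2, D)
        return self.head(self.decoder(f).squeeze(1))

def toy_encoder(dim=64):  # stand-in for a ConvNeXt branch
    return nn.Sequential(nn.Conv2d(3, dim, 3, stride=2, padding=1),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

model = DualBranchFusion(toy_encoder(), toy_encoder(), feat_dim=64)
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 3])
```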
Artificial Intelligence
[AI-0] Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
【Quick Read】: This paper addresses the challenge of highly dynamic, long-horizon, adaptive obstacle traversal (parkour) for humanoid robots in complex environments, focusing on combining perception with skill composition to achieve human-like expressiveness and autonomous decision-making. The key to the proposed modular framework, Perceptive Humanoid Parkour (PHP), is to first use motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories, enabling fluid and flexible skill chaining; expert motion-tracking RL policies trained on these trajectories are then distilled, via a combination of DAgger and RL, into a single depth-based multi-skill student policy. Using only onboard depth sensing and a discrete 2D velocity command, the robot autonomously decides whether to step over, climb onto, vault, or roll off obstacles of varying geometries, achieving long-horizon multi-obstacle traversal with closed-loop adaptation to real-world perturbations.
Link: https://arxiv.org/abs/2602.15827
Authors: Zhen Wu, Xiaoyu Huang, Lujie Yang, Yuanhang Zhang, Koushil Sreenath, Xi Chen, Pieter Abbeel, Rocky Duan, Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, C. Karen Liu
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects and executes whether to step over, climb onto, vault or roll off obstacles of varying geometries and heights. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.
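Motion matching here is nearest-neighbour search in a feature space; a minimal NumPy sketch with hypothetical clip names and random features (the real feature design, retargeting, and transition handling are not shown):

```python
import numpy as np

def motion_match(query, clip_feats, clip_names):
    """Return the atomic skill clip whose feature vector is closest to
    the current query feature (nearest-neighbour motion matching)."""
    dists = np.linalg.norm(clip_feats - query[None, :], axis=1)
    return clip_names[int(np.argmin(dists))]

feats = np.random.rand(5, 32)                    # one feature per skill clip
names = ["step", "climb", "vault", "roll", "walk"]  # hypothetical skills
print(motion_match(np.random.rand(32), feats, names))
```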
[AI-1] CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
【Quick Read】: This paper addresses capability preservation in large language model (LLM) editing: existing editors that successfully change targeted behavior often game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. The key to the solution is CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint and enforces it by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. CrispEdit expresses the capability constraint via a Bregman divergence whose quadratic form yields the Gauss-Newton Hessian exactly, even when the base model is not trained to convergence, and scales second-order optimization to LLMs using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector, achieving high edit success on standard editing benchmarks while keeping average capability degradation below 1%.
Link: https://arxiv.org/abs/2602.15823
Authors: Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing capability constraint via Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly and even when the base model is not trained to convergence. We make this second-order procedure efficient at the LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.
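A dense toy version of the core projection, assuming an explicit (PSD) capability-loss Hessian and an illustrative `keep_ratio`; at LLM scale the paper instead uses K-FAC and a matrix-free projector, which this sketch does not reproduce:

```python
import numpy as np

def project_low_curvature(update, hessian, keep_ratio=0.9):
    """Keep only the components of an edit update that lie in the flattest
    eigen-directions of the capability-loss Hessian; sharp directions
    (where alignment concentrates, per the paper) are removed."""
    eigvals, eigvecs = np.linalg.eigh(hessian)   # ascending eigenvalues
    k = int(keep_ratio * len(eigvals))
    low = eigvecs[:, :k]                         # low-curvature subspace
    return low @ (low.T @ update)                # orthogonal projection

A = np.random.randn(20, 20)
H = A @ A.T                                      # toy PSD Hessian
u = np.random.randn(20)
print(np.linalg.norm(project_low_curvature(u, H)))
```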
[AI-2] Developing AI Agents with Simulated Data: Why, what and how?
【Quick Read】: This chapter addresses the key impediment to adopting modern subsymbolic AI in practice: insufficient data volume and quality. The key to the solution is simulation-based synthetic data generation, a systematic approach to producing diverse, high-quality synthetic data for AI training; the chapter further introduces a reference framework for describing, designing, and analyzing digital twin-based AI simulation solutions, improving the rigor and scalability of synthetic data generation.
Link: https://arxiv.org/abs/2602.15816
Authors: Xiaoran Liu, Istvan David
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:
Abstract:As insufficient data volume and quality remain the key impediments to the adoption of modern subsymbolic AI, techniques of synthetic data generation are in high demand. Simulation offers an apt, systematic approach to generating diverse synthetic data. This chapter introduces the reader to the key concepts, benefits, and challenges of simulation-based synthetic data generation for AI training purposes, and to a reference framework to describe, design, and analyze digital twin-based AI simulation solutions.
[AI-3] The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
【Quick Read】: This paper addresses why fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contain no harmful content and developers have no adversarial intent. Its key contribution is a novel geometric analysis showing that alignment concentrates in low-dimensional subspaces with sharp curvature, a brittle structure invisible to first-order methods and easily entered by gradient-descent dynamics. The paper formalizes an Alignment Instability Condition, three geometric properties that jointly lead to safety degradation, and proves a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of the alignment geometry and the curvature coupling between the fine-tuning task and safety-critical parameters. These findings expose a structural blind spot in the current safety paradigm, which addresses only a static snapshot of a fundamentally dynamic problem, and motivate curvature-aware fine-tuning methods that shift alignment safety analysis from reactive red-teaming to predictive diagnostics.
Link: https://arxiv.org/abs/2602.15799
Authors: Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 4 figures
Abstract:Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: we show this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods, and we hope will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.
[AI-4] This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
【Quick Read】: This paper addresses when large language models (LLMs) used as synthetic participants in social-science experiments can support valid causal inference about human behavior, noting the current lack of guidance on this question. The key is to contrast two strategies: heuristic approaches, which seek to make simulated and observed human behavior interchangeable via prompt engineering, fine-tuning, and other repair strategies, are useful for many exploratory tasks but lack the formal statistical guarantees required for confirmatory research; statistical calibration, which combines auxiliary human data with statistical adjustments to correct discrepancies between simulated and observed responses, preserves validity under explicit assumptions and yields more precise causal estimates at lower cost than human-only experiments. Either way, the value of both approaches hinges on how well LLMs approximate the relevant populations, and researchers who focus myopically on substituting LLMs for human participants may overlook broader opportunities.
Link: https://arxiv.org/abs/2602.15785
Authors: Jessica Hullman, David Broska, Huaman Sun, Aaron Shaw
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
[AI-5] GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems)中因部分可观测性(Partial Observability)导致的协调与决策困难问题。现有方法如基于信念的状态估计和智能体间通信,分别受限于仅依赖历史经验而无法充分利用全局信息,以及缺乏有效模型来利用通信提供的辅助信息。解决方案的关键在于提出全局状态扩散算法(Global State Diffusion Algorithm, GlobeDiff),其将状态推断过程建模为一个多模态扩散过程,从而在克服状态估计歧义的同时高保真地推断全局状态,并证明了在单模态与多模态分布下估计误差均可被界定。
链接: https://arxiv.org/abs/2602.15776
作者: Yiqin Yang,Xu Yang,Yuhua Jiang,Ni Mu,Hao Hu,Runpeng Xie,Ziyou Zhang,Siyuan Li,Yuan-Hua Ni,Qianchuan Zhao,Bo Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In the realm of multi-agent systems, the challenge of partial observability is a critical barrier to effective coordination and decision-making. Existing approaches, such as belief state estimation and inter-agent communication, often fall short. Belief-based methods are limited by their focus on past experiences without fully leveraging global information, while communication methods often lack a robust model to effectively utilize the auxiliary information they provide. To solve this issue, we propose the Global State Diffusion Algorithm (GlobeDiff) to infer the global state based on the local observations. By formulating the state inference process as a multi-modal diffusion process, GlobeDiff overcomes ambiguities in state estimation while simultaneously inferring the global state with high fidelity. We prove that the estimation error of GlobeDiff under both unimodal and multi-modal distributions can be bounded. Extensive experimental results demonstrate that GlobeDiff achieves superior performance and is capable of accurately inferring the global state.
[AI-6] UrbanVerse: Learning Urban Region Representation Across Cities and Tasks
【Quick Read】: This paper addresses the limited generalization of current urban representation learning methods across cities and analytic tasks: existing models are typically tied to a specific city or a specific downstream task. The key innovations of the proposed UrbanVerse model are: (1) for cross-city generalization, regions are modeled as nodes on a graph, and a random walk-based procedure forms "sequences of regions" that capture both features local to the target regions and the structure of their neighborhoods; (2) for cross-task generalization, a cross-task learning module named HCondDiffCT integrates region-conditioned prior knowledge and task-conditioned semantics into the diffusion process to jointly model multiple downstream prediction tasks. Experiments show that UrbanVerse consistently outperforms state-of-the-art methods across six real-world urban analytics tasks, with up to 35.89% higher prediction accuracy under cross-city settings.
Link: https://arxiv.org/abs/2602.15750
Authors: Fengze Sun, Egemen Tanin, Shanika Karunasekera, Zuqing Li, Flora D. Salim, Jianzhong Qi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in urban region representation learning have enabled a wide range of applications in urban analytics, yet existing methods remain limited in their capabilities to generalize across cities and analytic tasks. We aim to generalize urban representation learning beyond city- and task-specific settings, towards a foundation-style model for urban analytics. To this end, we propose UrbanVerse, a model for cross-city urban representation learning and cross-task urban analytics. For cross-city generalization, UrbanVerse focuses on features local to the target regions and structural features of the nearby regions rather than the entire city. We model regions as nodes on a graph, which enables a random walk-based procedure to form “sequences of regions” that reflect both local and neighborhood structural features for urban region representation learning. For cross-task generalization, we propose a cross-task learning module named HCondDiffCT. This module integrates region-conditioned prior knowledge and task-conditioned semantics into the diffusion process to jointly model multiple downstream urban prediction tasks. HCondDiffCT is generic. It can also be integrated with existing urban representation learning models to enhance their downstream task effectiveness. Experiments on real-world datasets show that UrbanVerse consistently outperforms state-of-the-art methods across six tasks under cross-city settings, achieving up to 35.89% improvements in prediction accuracy.
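A minimal sketch of the random-walk procedure that forms "sequences of regions" over a region graph; the adjacency structure, walk length, and walk count are illustrative assumptions:

```python
import random

def region_walks(adj, walks_per_node=4, walk_len=6, seed=0):
    """Generate random-walk sequences of regions; adj maps a region id
    to the ids of its neighbouring regions on the region graph."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # toy region graph
print(region_walks(adj)[:2])
```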
[AI-7] MRC-GAT: A Meta-Relational Copula-Based Graph Attention Network for Interpretable Multimodal Alzheimers Disease Diagnosis
【Quick Read】: This paper addresses the inflexibility and poor generalization over heterogeneous patient data of fixed-structure graph models for early Alzheimer's disease (AD) diagnosis. The key to the proposed Meta-Relational Copula-Based Graph Attention Network (MRC-GAT) is to align multimodal features, risk factors (RF), cognitive test scores, and MRI attributes, in a common statistical space via a copula-based transformation, fuse them with a multi-relational attention mechanism, and embed the whole pipeline in an episodic meta-learning framework, improving adaptability to complex clinical data and classification accuracy. MRC-GAT reaches 96.87% and 92.31% accuracy on the TADPOLE and NACC datasets, respectively, outperforming existing diagnostic models while providing interpretability at various stages of diagnosis.
Link: https://arxiv.org/abs/2602.15740
Authors: Fatemeh Khalvandi, Saadat Izadi, Abdolah Chalechale
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 27 pages, 10 figures, 10 tables
Abstract:Alzheimer's disease (AD) is a progressive neurodegenerative condition necessitating early and precise diagnosis to provide prompt clinical management. Given the paramount importance of early diagnosis, recent studies have increasingly focused on computer-aided diagnostic models to enhance precision and reliability. However, most graph-based approaches still rely on fixed structural designs, which restrict their flexibility and limit generalization across heterogeneous patient data. To overcome these limitations, the Meta-Relational Copula-Based Graph Attention Network (MRC-GAT) is proposed as an efficient multimodal model for AD classification tasks. In the proposed architecture, copula-based similarity alignment, relational attention, and node fusion are integrated as the core components of episodic meta-learning, such that the multimodal features, including risk factors (RF), cognitive test scores, and MRI attributes, are first aligned via a copula-based transformation in a common statistical space and then combined by a multi-relational attention mechanism. According to evaluations performed on the TADPOLE and NACC datasets, the MRC-GAT model achieved accuracies of 96.87% and 92.31%, respectively, demonstrating state-of-the-art performance compared to existing diagnostic models. Finally, the model provides interpretability at various stages of disease diagnosis, confirming the robustness and applicability of the proposed method.
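A NumPy/SciPy sketch of a copula-style alignment, assuming a Gaussian copula realized as an empirical-CDF probability-integral transform followed by the standard-normal quantile; the paper's exact copula construction may differ:

```python
import numpy as np
from scipy import stats

def copula_align(X):
    """Per-feature probability-integral transform: empirical CDF followed
    by the standard-normal quantile, mapping heterogeneous modalities
    (risk factors, test scores, MRI attributes) into one statistical space."""
    n, d = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(d):
        u = stats.rankdata(X[:, j]) / (n + 1.0)   # empirical CDF in (0, 1)
        Z[:, j] = stats.norm.ppf(u)               # Gaussian marginal
    return Z

X = np.column_stack([np.random.exponential(size=100),      # skewed feature
                     np.random.randint(0, 30, size=100)])   # score-like feature
print(copula_align(X).mean(axis=0))
```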
[AI-8] MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction
【Quick Read】: This paper addresses the physical inconsistencies, such as contact slippage and mesh penetration, that arise in current humanoid motion control when motion is decoupled from the scene, especially in terrain-aware tasks. Existing methods depend heavily on expensive motion capture (MoCap) data, which typically lacks environmental geometry and therefore cannot support coordinated interaction between the humanoid and the physical world. The key to the solution is the proposed MeshMimic framework, which combines 3D scene reconstruction with embodied intelligence to learn coupled "motion-terrain" interaction patterns directly from monocular video. Its core techniques are an optimization algorithm based on kinematic consistency that extracts high-quality motion data from noisy visual reconstructions, and a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent, yielding robust, highly dynamic locomotion over complex and varied terrain.
Link: https://arxiv.org/abs/2602.15733
Authors: Qiang Zhang,Jiahao Ma,Peiran Liu,Shuai Shi,Zeran Su,Zifan Wang,Jingkai Sun,Wei Cui,Jialin Yu,Gang Han,Wen Zhao,Pihai Sun,Kangning Yin,Jiaxu Wang,Jiahang Cao,Lingfeng Zhang,Hao Cheng,Xiaoshuai Hao,Yiding Ji,Junwei Liang,Jian Tang,Renjing Xu,Yijie Guo
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures
Abstract:Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the necessary geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often suffer from a decoupling of motion and scene, resulting in physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. In this work, we present MeshMimic, an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled “motion-terrain” interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce an optimization algorithm based on kinematic consistency to extract high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach proves that a low-cost pipeline utilizing only consumer-grade monocular sensors can facilitate the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.
[AI-9] Lifelong Scalable Multi-Agent Realistic Testbed and A Comprehensive Study on Design Choices in Lifelong AGV Fleet Management Systems
【Quick Read】: This paper addresses how to effectively evaluate arbitrary Multi-Agent Path Finding (MAPF) algorithms within a Fleet Management System (FMS) for Automated Guided Vehicles (AGVs), with a focus on Lifelong Multi-Agent Path Finding (LMAPF). Existing work typically assumes simplified kinodynamic models (such as pebble motion) and ideal execution and communication, which fail to reflect the dynamics and uncertainty of real systems. The key to the solution is the Lifelong Scalable Multi-Agent Realistic Testbed (LSMART), an open-source simulation platform that, for the first time, unifies an FMS's parallel planning and execution, planner selection across varying degrees of optimality and agent-model assumptions, and recovery mechanisms after planning failures. By jointly accounting for realistic constraints (communication delays, execution uncertainty) and system-level design choices (when to plan, how to plan, and how to recover from failure), LSMART provides a scalable, configurable environment and empirical guidance for designing effective centralized lifelong AGV Fleet Management Systems.
Link: https://arxiv.org/abs/2602.15721
Authors: Jingtian Yan,Yulun Zhang,Zhenting Liu,Han Zhang,He Jiang,Jingkai Chen,Stephen F. Smith,Jiaoyang Li
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present Lifelong Scalable Multi-Agent Realistic Testbed (LSMART), an open-source simulator to evaluate any Multi-Agent Path Finding (MAPF) algorithm in a Fleet Management System (FMS) with Automated Guided Vehicles (AGVs). MAPF aims to move a group of agents from their corresponding starting locations to their goals. Lifelong MAPF (LMAPF) is a variant of MAPF that continuously assigns new goals for agents to reach. LMAPF applications, such as autonomous warehouses, often require a centralized, lifelong system to coordinate the movement of a fleet of robots, typically AGVs. However, existing works on MAPF and LMAPF often assume simplified kinodynamic models, such as pebble motion, as well as perfect execution and communication for AGVs. Prior work has presented SMART, a software capable of evaluating any MAPF algorithms while considering agent kinodynamics, communication delays, and execution uncertainties. However, SMART is designed for MAPF, not LMAPF. Generalizing SMART to an FMS requires many more design choices. First, an FMS parallelizes planning and execution, raising the question of when to plan. Second, given planners with varying optimality and differing agent-model assumptions, one must decide how to plan. Third, when the planner fails to return valid solutions, the system must determine how to recover. In this paper, we first present LSMART, an open-source simulator that incorporates all these considerations to evaluate any MAPF algorithms in an FMS. We then provide experiment results based on state-of-the-art methods for each design choice, offering guidance on how to effectively design centralized lifelong AGV Fleet Management Systems. LSMART is available at this https URL.
[AI-10] Random Wavelet Features for Graph Kernel Machines
【Quick Read】: This paper tackles the tension between computational efficiency and representation quality for node embeddings on large graphs: graph kernels offer a principled definition of node similarity, but computing them directly is prohibitively expensive at scale. The key to the solution is randomized spectral node embeddings, which use random-feature methods to build a low-rank approximation of a given graph kernel, so that dot products between embedding vectors efficiently estimate the kernel. The authors show, both theoretically and empirically, that these embeddings approximate kernels more accurately than existing methods, particularly for spectrally localized kernels, enabling scalable and principled graph representation learning.
Link: https://arxiv.org/abs/2602.15711
Authors: Valentin de Bassompierre,Jean-Charles Delvenne,Laurent Jacques
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: This paper is an extended version of a paper submitted to the 2026 European Signal Processing Conference (EUSIPCO 2026). It contains supplementary material including the full proof of Proposition 1
Abstract:Node embeddings map graph vertices into low-dimensional Euclidean spaces while preserving structural information. They are central to tasks such as node classification, link prediction, and signal reconstruction. A key goal is to design node embeddings whose dot products capture meaningful notions of node similarity induced by the graph. Graph kernels offer a principled way to define such similarities, but their direct computation is often prohibitive for large networks. Inspired by random feature methods for kernel approximation in Euclidean spaces, we introduce randomized spectral node embeddings whose dot products estimate a low-rank approximation of any specific graph kernel. We provide theoretical and empirical results showing that our embeddings achieve more accurate kernel approximations than existing methods, particularly for spectrally localized kernels. These results demonstrate the effectiveness of randomized spectral constructions for scalable and principled graph representation learning.
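To make the construction concrete, here is a minimal dense-matrix sketch of a randomized spectral node embedding whose dot products approximate a kernel of the form K = f(L) f(L)ᵀ. The filter choice, function names, and the use of a full eigendecomposition (rather than the scalable polynomial filtering a real implementation would need) are our assumptions, not the paper's code.

```python
import numpy as np
from scipy.linalg import expm

def random_spectral_embedding(L, f, dim=128, seed=0):
    """Randomized embedding whose dot products approximate a spectral graph kernel.

    For a kernel K = f(L) @ f(L).T (f applied to the Laplacian spectrum),
    project the filtered identity onto `dim` random Gaussian signals:
        Z = f(L) @ R / sqrt(dim),  so  Z @ Z.T ~ K  in expectation.
    A dense eigendecomposition is used here only for clarity.
    """
    rng = np.random.default_rng(seed)
    lam, U = np.linalg.eigh(L)            # L = U diag(lam) U^T
    fL = (U * f(lam)) @ U.T               # f(L), columns of U scaled by f(lam)
    R = rng.standard_normal((L.shape[0], dim))
    return fL @ R / np.sqrt(dim)

# Toy usage: with f(lam) = exp(-lam/2), the approximated kernel is exp(-L).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
Z = random_spectral_embedding(L, lambda lam: np.exp(-0.5 * lam), dim=4096)
print(np.abs(Z @ Z.T - expm(-L)).max())  # small Monte-Carlo approximation error
```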
[AI-11] Relative Geometry of Neural Forecasters: Linking Accuracy and Alignment in Learned Latent Geometry
【Quick Read】: This paper asks how neural networks internally represent the latent geometry of the complex dynamical systems they forecast, a question that remains poorly understood despite their strong predictive accuracy. The key to the solution is anchor-based, geometry-agnostic relative embeddings, which remove rotational and scaling ambiguities in latent spaces and thereby make representational alignment quantifiable. Experiments on seven canonical dynamical systems, ranging from periodic to chaotic, reveal reproducible family-level structure: models of the same family align closely (MLPs with other MLPs, recurrent networks with RNNs), while some strong forecasters such as transformers and echo-state networks (ESNs) achieve high accuracy despite weaker alignment, exposing a nonlinear relationship between representational alignment and forecasting accuracy.
Link: https://arxiv.org/abs/2602.15676
Authors: Deniz Kucukahmetler,Maximilian Jean Hemmann,Julian Mosig von Aehrenfeld,Maximilian Amthor,Christian Deubel,Nico Scherf,Diaaeldin Taha
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to Transactions on Machine Learning Research (TMLR)
Abstract:Neural networks can accurately forecast complex dynamical systems, yet how they internally represent underlying latent geometry remains poorly understood. We study neural forecasters through the lens of representational alignment, introducing anchor-based, geometry-agnostic relative embeddings that remove rotational and scaling ambiguities in latent spaces. Applying this framework across seven canonical dynamical systems - ranging from periodic to chaotic - we reveal reproducible family-level structure: multilayer perceptrons align with other MLPs, recurrent networks with RNNs, while transformers and echo-state networks achieve strong forecasts despite weaker alignment. Alignment generally correlates with forecasting accuracy, yet high accuracy can coexist with low alignment. Relative geometry thus provides a simple, reproducible foundation for comparing how model families internalize and represent dynamical structure.
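The anchor-based relative embedding is simple enough to sketch. Below, latents are re-expressed as cosine similarities to a shared anchor set, which removes rotation and scale ambiguities exactly in the toy case; all names are illustrative and the paper's pipeline may differ in details such as anchor selection.

```python
import numpy as np

def relative_embedding(Z, anchor_idx):
    """Re-express latents as cosine similarities to a fixed set of anchors.

    Cosine similarity is invariant to a global rotation and rescaling of the
    latent space, so two models can be compared coordinate-free: only the
    geometry *relative* to shared anchor states matters.
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    A = Zn[anchor_idx]              # (num_anchors, d), unit rows
    return Zn @ A.T                 # (n, num_anchors) relative coordinates

# Two "models": identical latents up to a random rotation and a scale factor.
rng = np.random.default_rng(0)
Z1 = rng.standard_normal((200, 16))
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random rotation
Z2 = 3.0 * Z1 @ Q
anchors = rng.choice(200, size=32, replace=False)
R1, R2 = relative_embedding(Z1, anchors), relative_embedding(Z2, anchors)
print(np.allclose(R1, R2))  # True: rotation and scale ambiguities removed
```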
[AI-12] PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra ICLR2026
【Quick Read】: This paper addresses the limitations of current personality-control methods for large language models (LLMs), which rely on static prompting or costly fine-tuning and fail to capture the dynamic, compositional nature of human traits. The key to the solution is PERSONA, a training-free framework built on the finding that personality traits appear as extractable, approximately orthogonal vectors in the model's activation space that support algebraic manipulation: Persona-Base extracts orthogonal trait directions via contrastive activation analysis; Persona-Algebra controls intensity through scalar multiplication, composes traits through addition, and suppresses them through subtraction; and Persona-Flow dynamically composes these vectors at inference time for context-aware adaptation. The approach scores 9.60 on PersonalityBench, nearly matching the supervised fine-tuning upper bound of 9.61, and achieves win rates of up to 91% on the proposed Persona-Evolve benchmark for dynamic personality adaptation, evidence that LLM personality traits are mathematically tractable and a new path to efficient, interpretable behavioral control.
Link: https://arxiv.org/abs/2602.15669
Authors: Xiachong Feng,Liang Zhao,Weihong Zhong,Yichong Huang,Yuxuan Gu,Lingpeng Kong,Xiaocheng Feng,Bing Qin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: ICLR 2026
Abstract:Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning, failing to capture the dynamic and compositional nature of human traits. We introduce PERSONA, a training-free framework that achieves fine-tuning level performance through direct manipulation of personality vectors in activation space. Our key insight is that personality traits appear as extractable, approximately orthogonal directions in the model’s representation space that support algebraic operations. The framework operates through three stages: Persona-Base extracts orthogonal trait vectors via contrastive activation analysis; Persona-Algebra enables precise control through vector arithmetic (scalar multiplication for intensity, addition for composition, subtraction for suppression); and Persona-Flow achieves context-aware adaptation by dynamically composing these vectors during inference. On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates. On our proposed Persona-Evolve benchmark for dynamic personality adaptation, we achieve up to 91% win rates across diverse model families. These results provide evidence that aspects of LLM personality are mathematically tractable, opening new directions for interpretable and efficient behavioral control.
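The activation-vector algebra at the heart of PERSONA can be illustrated with a toy sketch: trait vectors extracted as contrastive mean differences, then applied by scalar-weighted addition. A real setup would hook a transformer layer and add the steering vector during inference; the synthetic activations and function names here are ours.

```python
import numpy as np

def extract_trait_vector(acts_pos, acts_neg):
    """Contrastive activation analysis: unit-norm mean difference between
    activations collected under trait-positive vs trait-negative prompts."""
    v = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, traits, weights):
    """Persona-Algebra on a hidden state: scalar weights set intensity,
    addition composes traits, negative weights suppress them."""
    for v, w in zip(traits, weights):
        hidden = hidden + w * v
    return hidden

rng = np.random.default_rng(0)
d = 512
acts_extravert = rng.normal(0.3, 1.0, size=(64, d))
acts_introvert = rng.normal(-0.3, 1.0, size=(64, d))
v_extra = extract_trait_vector(acts_extravert, acts_introvert)

h = rng.standard_normal(d)
h_amplified = steer(h, [v_extra], [4.0])    # intensify the trait
h_suppressed = steer(h, [v_extra], [-4.0])  # suppress the trait
print(float(h_amplified @ v_extra), float(h_suppressed @ v_extra))
```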
[AI-13] Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections
【Quick Read】: This paper studies a security risk created by the long-term memory of self-evolving large language model (LLM) agents: malicious external content read during a benign task can be written into memory and later triggered in subsequent sessions to execute unauthorized actions. The key to the solution is the proposed Zombie Agent persistence-attack framework, which operates in a black-box setting using only attacker-controlled web content for indirect payload injection, and which designs mechanism-specific persistence strategies for common memory implementations (sliding-window and retrieval-augmented memory) to evade truncation and relevance filtering. This keeps the payload alive and triggerable across many sessions, demonstrating that defenses based solely on per-session prompt filtering are insufficient against such threats.
Link: https://arxiv.org/abs/2602.15654
Authors: Xianglin Yang,Yufei He,Shuo Ji,Bryan Hooi,Jin Song Dong
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Self-evolving LLM agents update their internal state across sessions, often by writing and reusing long-term memory. This design improves performance on long-horizon tasks but creates a security risk: untrusted external content observed during a benign session can be stored as memory and later treated as instruction. We study this risk and formalize a persistent attack we call a Zombie Agent, where an attacker covertly implants a payload that survives across sessions, effectively turning the agent into a puppet of the attacker. We present a black-box attack framework that uses only indirect exposure through attacker-controlled web content. The attack has two phases. During infection, the agent reads a poisoned source while completing a benign task and writes the payload into long-term memory through its normal update process. During trigger, the payload is retrieved or carried forward and causes unauthorized tool behavior. We design mechanism-specific persistence strategies for common memory implementations, including sliding-window and retrieval-augmented memory, to resist truncation and relevance filtering. We evaluate the attack on representative agent setups and tasks, measuring both persistence over time and the ability to induce unauthorized actions while preserving benign task quality. Our results show that memory evolution can convert one-time indirect injection into persistent compromise, which suggests that defenses focused only on per-session prompt filtering are not sufficient for self-evolving agents.
[AI-14] On inferring cumulative constraints
【Quick Read】: This paper addresses a weakness of cumulative-constraint propagation in constraint programming (CP): propagation is typically performed per constraint, missing multi-resource interactions and causing severe slowdowns on some benchmark instances. The key to the solution is a preprocessing method that treats cumulative constraints as linear inequalities over occupancy vectors, discovers covers (sets of tasks that cannot run in parallel), strengthens the resulting cover inequalities via lifting, and injects the generated valid inequalities back into the scheduling instance, thereby modeling multi-resource interactions explicitly, tightening objective bounds, and improving search efficiency.
Link: https://arxiv.org/abs/2602.15635
Authors: Konstantin Sidorov
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures, 4 tables; submitted to the 32nd International Conference on Principles and Practice of Constraint Programming (CP 2026)
Abstract:Cumulative constraints are central in scheduling with constraint programming, yet propagation is typically performed per constraint, missing multi-resource interactions and causing severe slowdowns on some benchmarks. I present a preprocessing method for inferring additional cumulative constraints that capture such interactions without search-time probing. This approach interprets cumulative constraints as linear inequalities over occupancy vectors and generates valid inequalities by (i) discovering covers, the sets of tasks that cannot run in parallel, (ii) strengthening the cover inequalities for the discovered sets with lifting, and (iii) injecting the resulting constraints back into the scheduling problem instance. Experiments on standard RCPSP and RCPSP/max test suites show that these inferred constraints improve search performance and tighten objective bounds on favorable instances, while incurring little degradation on unfavorable ones. Additionally, these experiments discover 25 new lower bounds and five new best solutions; eight of the lower bounds are obtained directly from the inferred constraints.
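The cover-discovery step admits a small worked example. The sketch below enumerates minimal covers of a cumulative resource by brute force and prints the implied cover inequalities; the lifting step and the paper's actual (surely more efficient) discovery procedure are omitted, and all names are illustrative.

```python
from itertools import combinations

def minimal_covers(demands, capacity, max_size=4):
    """Enumerate minimal covers of a cumulative resource.

    A cover is a task set whose total demand exceeds the capacity, so its
    members can never all run in parallel; minimality means every proper
    subset still fits. Each cover C yields the valid inequality
        sum_{i in C} run_i <= |C| - 1
    over the 0/1 "running at time t" occupancy variables.
    """
    tasks = list(demands)
    covers = []
    for k in range(2, max_size + 1):
        for C in combinations(tasks, k):
            total = sum(demands[i] for i in C)
            if total <= capacity:
                continue
            # Minimal iff removing any single task makes the set feasible.
            if all(total - demands[i] <= capacity for i in C):
                covers.append(C)
    return covers

# Toy instance: capacity 5, four tasks with demands 3, 3, 2, 4.
demands = {"a": 3, "b": 3, "c": 2, "d": 4}
for C in minimal_covers(demands, capacity=5):
    print(f"at most {len(C) - 1} of {C} can run in parallel")
```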
[AI-15] The geometry of online conversations and the causal antecedents of conflictual discourse
【Quick Read】: This paper investigates the causal antecedents of conflictual language in online climate discussions and the geometry of their interaction structure, asking how different conversational dimensions (stance, tone, and emotional versus factual framing) respond to temporal delays, the conversational environment, and tree-structural features. The key to the solution is the use of three annotation dimensions, inferred through LLM prompting and averaging, to quantify complementary facets of conflictual discourse on threaded forum data. The analysis shows that: (1) longer delays between successive posts are generally associated with more respectful replies, while longer delays relative to the parent post come with more emotional language; (2) participants converge strongly toward the average stance, tone, and emotional framing of both older sibling posts and the parent post, with parent effects generally stronger than sibling effects; and (3) the stance of early branch-level responses (agreement or disagreement with the root message) conditions these alignment dynamics, with a notable interaction for emotional-versus-factual framing, where alignment with the parent's emotionality is amplified when older siblings are similarly aligned.
Link: https://arxiv.org/abs/2602.15600
Authors: Carlo Santagiustina,Caterina Cruciani
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Applications (stat.AP)
Comments:
Abstract:This article investigates the causal antecedents of conflictual language and the geometry of interaction in online threaded conversations related to climate change. We employ three annotation dimensions, inferred through LLM prompting and averaging, to capture complementary aspects of discursive conflict (such as stance: agreement vs disagreement; tone: attacking vs respectful; and emotional versus factual framing) and use data from a threaded online forum to examine how these dimensions respond to temporal, conversational, and arborescent structural features of discussions. We show that, as suggested by the literature, longer delays between successive posts in a thread are associated with replies that are, on average, more respectful, whereas longer delays relative to the parent post are associated with slightly less disagreement but more emotional (less factual) language. Second, we characterize alignment with the local conversational environment and find strong convergence both toward the average stance, tone and emotional framing of older sibling posts replying to the same parent and toward those of the parent post itself, with parent post effects generally stronger than sibling effects. We further show that early branch-level responses condition these alignment dynamics, such that parent-child stance alignment is amplified or attenuated depending on whether a branch is initiated in agreement or disagreement with the discussion’s root message. These influences are largely additive for civility-related dimensions (attacking vs respectful, disagree vs agree), whereas for emotional versus factual framing there is a significant interaction: alignment with the parent’s emotionality is amplified when older siblings are similarly aligned.
[AI-16] How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning
【Quick Read】: This paper asks how the predictive information of a multimodal Transformer answering visual questions is distributed and fused across modalities, and how visual evidence, linguistic reasoning, and genuinely cross-modal computation evolve across layers. The key to the solution is a layer-wise analysis framework based on Partial Information Decomposition (PID), together with PID Flow, a scalable pipeline that combines dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation, making information decomposition tractable for high-dimensional neural representations. Applying the framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks reveals a consistent "modal transduction" pattern: vision-unique information peaks early and decays, language-unique information surges in late layers to account for roughly 82% of the final prediction, and cross-modal synergy stays below 2%. Targeted attention-knockout experiments confirm causality: disrupting the primary transduction pathway induces trapped vision-unique information, compensatory synergy, and a higher total information cost, providing an information-theoretic, causal account of how vision becomes language and quantitative guidance for locating architectural bottlenecks.
Link: https://arxiv.org/abs/2602.15580
Authors: Hongxuan Wu,Yukun Zhang,Xueqing Zhou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation – and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce *PID Flow*, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent *modal transduction* pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82% of the final prediction, and cross-modal synergy remains below 2%. This trajectory is highly stable across model variants (layer-wise correlations ≥ 0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image→Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost – effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.
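To give a feel for the closed-form Gaussian stage of such a pipeline, the sketch below computes Gaussian mutual information from a covariance matrix and applies minimum-mutual-information (MMI) redundancy, one standard closed-form PID choice for Gaussian variables. This is our illustrative stand-in: the paper's exact estimator, and its dimensionality-reduction and normalizing-flow preprocessing, are not reproduced here.

```python
import numpy as np

def gaussian_mi(C, ix, iy):
    """I(X;Y) for jointly Gaussian variables, from their covariance matrix C."""
    ld = lambda idx: np.linalg.slogdet(C[np.ix_(idx, idx)])[1]
    return 0.5 * (ld(ix) + ld(iy) - ld(ix + iy))

def mmi_pid(C, iv, il, iy):
    """MMI-style PID for Gaussian (vision, language, target) components.

    Redundancy = min(I(V;Y), I(L;Y)); unique terms and synergy follow from
    the usual PID bookkeeping. This is one common Gaussian formulation,
    not necessarily the estimator used in the paper.
    """
    i_v, i_l = gaussian_mi(C, iv, iy), gaussian_mi(C, il, iy)
    i_vl = gaussian_mi(C, iv + il, iy)
    red = min(i_v, i_l)
    uniq_v, uniq_l = i_v - red, i_l - red
    syn = i_vl - uniq_v - uniq_l - red
    return dict(redundant=red, vision_unique=uniq_v,
                language_unique=uniq_l, synergy=syn)

# Toy usage on samples (indices: 0 = vision, 1 = language, 2 = prediction).
rng = np.random.default_rng(0)
v = rng.standard_normal(5000)
l = v + 0.5 * rng.standard_normal(5000)   # language overlaps with vision
y = l + 0.1 * rng.standard_normal(5000)   # prediction driven by language
C = np.cov(np.stack([v, l, y]))
print(mmi_pid(C, iv=[0], il=[1], iy=[2]))
```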
[AI-17] VLM-DEWM: Dynamic External World Model for Verifiable and Resilient Vision-Language Planning in Manufacturing
【Quick Read】: This paper addresses two core challenges in deploying vision-language models (VLMs) in dynamic manufacturing workcells: stateless operation causes world-state drift because the VLM cannot persistently track out-of-view states, and opaque reasoning makes failures hard to diagnose, leading to costly blind retries. The key to the solution is VLM-DEWM, a cognitive architecture that decouples VLM reasoning from world-state management through a persistent, queryable Dynamic External World Model (DEWM). Each VLM decision is structured as an Externalizable Reasoning Trace (ERT) comprising an action proposal, a world belief, and a causal assumption, and is validated against the DEWM before execution; when failures occur, discrepancy analysis between predicted and observed states enables targeted recovery rather than global replanning. This raises state-tracking accuracy from 56% to 93% and recovery success from below 5% to 95% while reducing computational overhead.
Link: https://arxiv.org/abs/2602.15549
Authors: Guoqin Tang,Qingxuan Jia,Gang Chen,Tong Li,Zeyuan Huang,Zihang Lv,Ning Ji
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-language models (VLMs) show promise for high-level planning in smart manufacturing, yet their deployment in dynamic workcells faces two critical challenges: (1) stateless operation, they cannot persistently track out-of-view states, causing world-state drift; and (2) opaque reasoning, failures are difficult to diagnose, leading to costly blind retries. This paper presents VLM-DEWM, a cognitive architecture that decouples VLM reasoning from world-state management through a persistent, queryable Dynamic External World Model (DEWM). Each VLM decision is structured into an Externalizable Reasoning Trace (ERT), comprising action proposal, world belief, and causal assumption, which is validated against DEWM before execution. When failures occur, discrepancy analysis between predicted and observed states enables targeted recovery instead of global replanning. We evaluate VLM-DEWM on multi-station assembly, large-scale facility exploration, and real-robot recovery under induced failures. Compared to baseline memory-augmented VLM systems, VLM-DEWM improves state-tracking accuracy from 56% to 93%, increases recovery success rate from below 5% to 95%, and significantly reduces computational overhead through structured memory. These results establish VLM-DEWM as a verifiable and resilient solution for long-horizon robotic operations in dynamic manufacturing environments.
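The ERT-against-DEWM validation step can be sketched with plain data structures. Everything below (field names, the dictionary-backed world model, the validate function) is an illustrative schema we invented to mirror the described workflow, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class DEWM:
    """Dynamic External World Model: a persistent, queryable state store
    that outlives any single VLM context window."""
    state: dict = field(default_factory=dict)   # e.g. {"slot_3": "gear_A"}

    def query(self, key):
        return self.state.get(key)

@dataclass
class ERT:
    """Externalizable Reasoning Trace emitted with each VLM decision."""
    action: str                 # e.g. "place gear_B in slot_3"
    world_belief: dict          # what the VLM assumes about the world
    causal_assumption: str      # why the action should work

def validate(ert: ERT, world: DEWM):
    """Check each belief in the trace against the world model before
    execution; mismatches are surfaced for targeted recovery instead of
    blind retries."""
    mismatches = {k: (v, world.query(k))
                  for k, v in ert.world_belief.items() if world.query(k) != v}
    return (len(mismatches) == 0, mismatches)

world = DEWM(state={"slot_3": "gear_A"})
ert = ERT(action="place gear_B in slot_3",
          world_belief={"slot_3": None},          # VLM believes slot is empty
          causal_assumption="an empty slot accepts gear_B")
ok, diff = validate(ert, world)
print(ok, diff)  # False, {'slot_3': (None, 'gear_A')} -> targeted recovery
```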
[AI-18] Quantifying construct validity in large language model evaluations
【Quick Read】: This paper addresses the construct validity of benchmarks in large language model (LLM) evaluation: how can we be sure that benchmark results reliably reflect genuine capabilities rather than confounds such as test-set contamination, annotator error, or mere model scale? Existing tools fall short on both sides: latent factor models ignore scaling laws, so the capabilities they extract largely proxy model size, while scaling laws ignore measurement error, so the capabilities they extract are uninterpretable and overfit to the observed benchmarks. The key to the solution is the proposed structured capabilities model, which combines both perspectives: following scaling laws, model scale informs capabilities, and following latent factor models, capabilities explain observed benchmark scores up to measurement error. This separation yields interpretable, generalizable capability estimates: on a large sample from the OpenLLM Leaderboard, structured capabilities outperform latent factor models on parsimonious fit indices and predict out-of-distribution benchmarks better than scaling laws, providing a more accurate quantification of construct validity in LLM evaluations.
Link: https://arxiv.org/abs/2602.15532
Authors: Ryan Othniel Kearns
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. In combining these two insights, structured capabilities demonstrate better explanatory and predictive power for quantifying construct validity in LLM evaluations.
[AI-19] GenAI-LA: Generative AI and Learning Analytics Workshop (LAK 2026), April 27–May 1, 2026, Bergen, Norway
【Quick Read】: This paper addresses the lack of systematic evaluation and training data for generative AI in education, in particular the difficulty of quantifying and detecting the pedagogical risks of instructional explanations produced by automatic pedagogical evaluators and AI tutors. The key to the solution is EduEVAL-DB, a teacher-role-based dataset of 854 instructional explanations (one human-teacher explanation plus six generated by LLM-simulated teacher roles per question) spanning science, language, and social science across K-12 grade levels, together with a pedagogical risk rubric aligned with established educational standards that provides binary annotations along five dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. Annotation follows a semi-automatic process with expert teacher review, yielding high-quality supervision for fine-tuning pedagogical-risk detectors deployable on consumer hardware.
Link: https://arxiv.org/abs/2602.15531
Authors: Javier Irigoyen,Roberto Daza,Aythami Morales,Julian Fierrez,Francisco Jurado,Alvaro Ortigosa,Ruben Tolosana
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 10 pages, 3 figures. Published in Intl. Conf. on Learning Analytics Knowledge Workshops (LAK Workshops 2026, GenAI-LA 26)
Abstract:This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.
[AI-20] The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
【速读】:该论文试图解决在训练生成式 AI(Generative AI)系统时,如何有效防止其通过隐蔽策略规避白盒欺骗检测器(white-box deception detector)的问题。当前方法依赖于对抗训练来促使模型诚实,但存在模型学习到“混淆”(obfuscation)行为的风险,即在输出有害内容的同时修改内部表征或策略以逃避检测。解决方案的关键在于识别并区分两种类型的混淆机制:(i) 混淆激活(Obfuscated activations)——模型调整内部表示使其不再触发检测器;(ii) 混淆策略(Obfuscated policy)——模型输出欺骗性文本并附带合理化理由以绕过检测。研究发现,仅靠探测惩罚(probe penalty)会诱导混淆策略,而结合足够高的KL正则化和探测惩罚可稳定地引导模型保持诚实,从而验证了白盒欺骗检测器作为训练信号的有效性。
链接: https://arxiv.org/abs/2602.15515
作者: Mohammad Taufeeque,Stefan Heimersheim,Adam Gleave,Chris Cundy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 12 figures
Abstract:Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting. We introduce a taxonomy of possible outcomes when training against a deception detector. The model either remains honest, or becomes deceptive via two possible obfuscation strategies. (i) Obfuscated activations: the model outputs deceptive text while modifying its internal representations to no longer trigger the detector. (ii) Obfuscated policy: the model outputs deceptive text that evades the detector, typically by including a justification for the reward hack. Empirically, obfuscated activations arise from representation drift during RL, with or without a detector penalty. The probe penalty only incentivizes obfuscated policies; we theoretically show this is expected for policy gradient methods. Sufficiently high KL regularization and detector penalty can yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.
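The training signal being studied is easy to write down. The sketch below shapes a reward with a probe penalty and a KL term, with toy numbers illustrating the three qualitative outcomes; coefficients and scores are invented for illustration only.

```python
def shaped_reward(task_reward, probe_score, kl_to_ref,
                  probe_coef=1.0, kl_coef=0.1):
    """Training signal when running RL against a white-box deception detector.

    probe_score is the detector's deception score on the model's internal
    activations; kl_to_ref regularizes toward the reference policy. Per the
    paper's finding, the probe penalty alone incentivizes an obfuscated
    policy, while a sufficiently high KL coefficient together with the probe
    penalty can yield honest policies.
    """
    return task_reward - probe_coef * probe_score - kl_coef * kl_to_ref

# Illustrative comparison of the three qualitative outcomes (toy numbers).
honest      = shaped_reward(task_reward=0.7, probe_score=0.05, kl_to_ref=2.0)
reward_hack = shaped_reward(task_reward=1.0, probe_score=0.90, kl_to_ref=2.0)
obfuscated  = shaped_reward(task_reward=1.0, probe_score=0.10, kl_to_ref=15.0)
print(honest, reward_hack, obfuscated)
# With a large kl_coef, the obfuscated policy (which drifts far from the
# reference) is penalized even though it fools the probe.
```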
[AI-21] Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling
【Quick Read】: This paper tackles the memory-modeling challenge of using multimodal large language models (MLLMs) as the cognitive core of embodied agents under long-horizon observations and limited context budgets. Existing memory-assisted methods rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. The key to the solution is a non-parametric memory framework that explicitly disentangles episodic and semantic memory: a retrieval-first, reasoning-assisted paradigm recalls episodic experiences by semantic similarity and verifies them through visual reasoning, enabling reuse of past observations without rigid geometric alignment, while a program-style rule-extraction mechanism converts experiences into structured semantic memory for cross-environment generalization. Experiments on A-EQA and GOAT-Bench show clear gains over prior methods, with episodic memory mainly improving exploration efficiency and semantic memory strengthening complex reasoning.
Link: https://arxiv.org/abs/2602.15513
Authors: Ji Li,Jing Xia,Mingyi Li,Shiyan Hu
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match×SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.
[AI-22] The Equalizer: Introducing Shape-Gain Decomposition in Neural Audio Codecs
【Quick Read】: This paper addresses the poor robustness of neural audio codecs (NACs) to global variations in input signal level: level fluctuations strongly affect the encoder's embedding vectors and their quantization, causing codebook redundancy and suboptimal bitrate-distortion performance. The key to the solution is to bring the shape-gain decomposition widely used in classical speech coding into the NAC framework: before the NAC encoder, the input signal is decomposed on a short-term basis into a gain and a normalized shape vector; the shape vector is processed by the NAC, while the gain is quantized separately with scalar quantization and transmitted as side information, and the decoder reconstructs the output signal from the NAC's normalized output and the quantized gain. Experiments on speech show that this substantially improves bitrate-distortion performance and massively reduces complexity.
Link: https://arxiv.org/abs/2602.15491
Authors: Samir Sadok,Laurent Girin,Xavier Alameda-Pineda
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Neural audio codecs, shape-gain decomposition, vector quantization, speech coding
Abstract:Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are poorly robust to a global variation of the input signal level in the sense that such variation has strong influence on the embedding vectors at the output of the encoder and their quantization. This methodology is inherently inefficient, leading to codebook redundancy and suboptimal bitrate-distortion performance. To address these limitations, we propose to introduce shape-gain decomposition, widely used in classical speech/audio coding, into the NAC framework. The principle of the proposed Equalizer methodology is to decompose the input signal – before the NAC encoder – into gain and normalized shape vector on a short-term basis. The shape vector is processed by the NAC, while the gain is quantized with scalar quantization and transmitted separately. The output (decoded) signal is reconstructed from the normalized output of the NAC and the quantized gain. Our experiments conducted on speech signals show that this general methodology, easily applicable to any NAC, enables a substantial gain in bitrate-distortion performance, as well as a massive reduction in complexity.
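The Equalizer's front end is straightforward to sketch: per-frame RMS gain extraction, shape normalization, and a uniform-in-dB scalar gain quantizer. Frame length, quantizer step, and function names below are our assumptions rather than the paper's settings.

```python
import numpy as np

def shape_gain_split(x, frame=320, eps=1e-8):
    """Short-term shape-gain decomposition of a waveform.

    Each frame is split into a scalar gain (RMS energy) and a unit-RMS
    shape vector; only the shape would go through the NAC, while the gain
    is scalar-quantized and transmitted as side information.
    """
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    gain = np.sqrt((frames ** 2).mean(axis=1)) + eps
    shape = frames / gain[:, None]
    return shape, gain

def quantize_gain_db(gain, step_db=1.5):
    """Uniform scalar quantization of the gain on a dB scale."""
    db = 20 * np.log10(gain)
    return 10 ** (np.round(db / step_db) * step_db / 20)

rng = np.random.default_rng(0)
x = rng.standard_normal(3200)
s1, g1 = shape_gain_split(x)
s2, g2 = shape_gain_split(0.01 * x)     # a 40 dB quieter copy of the signal
print(np.allclose(s1, s2, atol=1e-4))   # True: shape is level-invariant
# Decoder-side resynthesis from shape + quantized gain.
y = (s1 * quantize_gain_db(g1)[:, None]).ravel()
print(np.abs(y - x[:len(y)]).max())     # residual error from gain quantization
```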
[AI-23] SecCodeBench-V2 Technical Report
【Quick Read】: This paper addresses the absence of a systematic, reproducible, security-oriented standard for evaluating code generated or repaired by LLM coding assistants: existing benchmarks focus on functional correctness while overlooking the security weaknesses (CWE categories) common in industrial practice. The key to the solution is SecCodeBench-V2, a function-level secure code generation and repair benchmark built from real industrial scenarios at Alibaba Group, covering five programming languages and 22 common CWE categories, with executable proof-of-concept (PoC) test cases that verify both functional and security properties. Evaluation combines dynamic execution with an LLM-as-a-judge oracle where deterministic adjudication is impossible, and a Pass@K-based, severity-weighted scoring protocol aggregates results across heterogeneous scenarios and difficulty levels into a rigorous, comparable measure of the security posture of AI coding assistants.
Link: https://arxiv.org/abs/2602.15485
Authors: Longfei Chen,Ji Zhao,Lanxiao Cui,Tong Su,Xingbo Pan,Ziyang Li,Yongxing Wu,Qijiang Cao,Qiyao Cai,Jing Zhang,Yuandong Ni,Junyao He,Zeyu Zhang,Chao Ge,Xuhuai Lu,Zeyu Gao,Yuxin Cui,Weisen Chen,Yuxuan Peng,Shengping Wang,Qi Li,Yukai Huang,Yukun Liu,Tuo Zhou,Terry Yue Zhuo,Junyang Lin,Chao Zhang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:We introduce SecCodeBench-V2, a publicly released benchmark for evaluating Large Language Model (LLM) copilots’ capabilities of generating secure code. SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group’s industrial productions, where the underlying security issues span 22 common CWE (Common Weakness Enumeration) categories across five programming languages: Java, C, Python, Go, and this http URL. SecCodeBench-V2 adopts a function-level task formulation: each scenario provides a complete project scaffold and requires the model to implement or patch a designated target function under fixed interfaces and dependencies. For each scenario, SecCodeBench-V2 provides executable proof-of-concept (PoC) test cases for both functional validation and security verification. All test cases are authored and double-reviewed by security experts, ensuring high fidelity, broad coverage, and reliable ground truth. Beyond the benchmark itself, we build a unified evaluation pipeline that assesses models primarily via dynamic execution. For most scenarios, we compile and run model-generated artifacts in isolated environments and execute PoC test cases to validate both functional correctness and security properties. For scenarios where security issues cannot be adjudicated with deterministic test cases, we additionally employ an LLM-as-a-judge oracle. To summarize performance across heterogeneous scenarios and difficulty levels, we design a Pass@K-based scoring protocol with principled aggregation over scenarios and severity, enabling holistic and comparable evaluation across models. Overall, SecCodeBench-V2 provides a rigorous and reproducible foundation for assessing the security posture of AI coding assistants, with results and artifacts released at this https URL. The benchmark is publicly available at this https URL.
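For reference, the Pass@K core is the standard unbiased estimator from Chen et al. (2021); the severity-weighted aggregation below is only a plausible illustration of "principled aggregation over scenarios and severity", since the benchmark's exact weights are not given here.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k draws from n samples, c of which pass, is a pass."""
    if n - c < k:
        return 1.0   # fewer than k failing samples: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def aggregate(scenarios, k=1):
    """Severity-weighted aggregation over scenarios; the weights here are
    illustrative placeholders, not SecCodeBench-V2's actual protocol."""
    weights = {"low": 1.0, "medium": 2.0, "high": 4.0}
    num = sum(weights[s["severity"]] * pass_at_k(s["n"], s["c"], k)
              for s in scenarios)
    den = sum(weights[s["severity"]] for s in scenarios)
    return num / den

scenarios = [
    {"severity": "high", "n": 10, "c": 3},   # 3/10 samples secure and correct
    {"severity": "medium", "n": 10, "c": 8},
    {"severity": "low", "n": 10, "c": 10},
]
print(aggregate(scenarios, k=1), aggregate(scenarios, k=5))
```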
[AI-24] Algorithmic Approaches to Opinion Selection for Online Deliberation: A Comparative Study
【Quick Read】: This paper examines how algorithmic opinion selection in online deliberation platforms can balance democratic criteria such as proportional representation and diversity: existing strategies (e.g., consensus-seeking or diversity-seeking) may erase minority voices or flatten content diversity, and it has been unclear how each strategy affects these democratic values. The key to the solution is drawing on social choice theory to propose a new selection algorithm that combines diversity with a balanced notion of representation; empirically, while no single strategy dominates across all democratic desiderata, this social-choice-inspired rule achieves the strongest trade-off between proportional representation and diversity among the benchmarked approaches.
Link: https://arxiv.org/abs/2602.15439
Authors: Salim Hafid,Manon Berriche,Jean-Philippe Cointet
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:
Abstract:During deliberation processes, mediators and facilitators typically need to select a small and representative set of opinions later used to produce digestible reports for stakeholders. In online deliberation platforms, algorithmic selection is increasingly used to automate this process. However, such automation is not without consequences. For instance, enforcing consensus-seeking algorithmic strategies can imply ignoring or flattening conflicting preferences, which may lead to erasing minority voices and reducing content diversity. More generally, across the variety of existing selection strategies (e.g., consensus, diversity), it remains unclear how each approach influences desired democratic criteria such as proportional representation. To address this gap, we benchmark several algorithmic approaches in this context. We also build on social choice theory to propose a novel algorithm that incorporates both diversity and a balanced notion of representation in the selection strategy. We find empirically that while no single strategy dominates across all democratic desiderata, our social-choice-inspired selection rule achieves the strongest trade-off between proportional representation and diversity.
[AI-25] Logit Distance Bounds Representational Similarity
【Quick Read】: This paper asks whether representational similarity is preserved under distillation: if a teacher's and a student's predictive distributions are close, are their internal representations similar up to a linear transformation? It shows that closeness in KL divergence, the usual distillation objective, does not guarantee linearly recoverable representational properties such as linear-probe recoverability of human-interpretable concepts. The key to the solution is a distributional distance based on logit differences, which is proven to control representational differences: the paper defines a representational dissimilarity measure based on the models' identifiability class and shows it is bounded by the logit distance. KL divergence upper-bounds the logit distance when probabilities stay away from zero, but that bound gives no nontrivial control in practice; accordingly, distillation experiments on synthetic and image datasets show that logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher's linearly recoverable concepts.
Link: https://arxiv.org/abs/2602.15438
Authors: Beatrix M. B. Nielsen,Emanuele Marconato,Luigi Gresele,Andrea Dittadi,Simon Buchholz
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models’ identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher’s predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher’s linearly recoverable concepts.
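The two distillation objectives being contrasted can be sketched directly. The KL loss below is standard; the logit-distance loss mean-centers logits to respect softmax's shift invariance, which is our modeling choice rather than a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def kl_distill_loss(t_logits, s_logits):
    """Standard KL distillation: matches the predictive distributions, but
    (per the paper) need not preserve linear representational structure."""
    return F.kl_div(F.log_softmax(s_logits, -1),
                    F.log_softmax(t_logits, -1),
                    log_target=True, reduction="batchmean")

def logit_distance_loss(t_logits, s_logits):
    """Squared logit-difference distance: closeness here is the quantity
    shown to bound representational dissimilarity. Logits are compared
    after removing the per-example mean, since softmax is shift-invariant."""
    t = t_logits - t_logits.mean(-1, keepdim=True)
    s = s_logits - s_logits.mean(-1, keepdim=True)
    return (t - s).pow(2).sum(-1).mean()

# Toy usage: distill a student toward a fixed teacher batch.
teacher = torch.randn(32, 1000)
student = torch.randn(32, 1000, requires_grad=True)
loss = kl_distill_loss(teacher, student) + 0.1 * logit_distance_loss(teacher, student)
loss.backward()
print(float(loss))
```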
[AI-26] Common Belief Revisited
【Quick Read】: This paper concerns the complete logical characterization of common belief in multi-agent epistemic logic: when individual belief is KD45, is common belief completely axiomatized by extending KD4 with a specific axiom? Earlier observations established that if individual belief is KD45, common belief loses the 5 property while keeping D and 4, and gains a new property, C(Cφ → φ), corresponding to shift-reflexivity. This paper shows that this alone is not a complete characterization: one additional axiom, which depends on the number of agents, is required. The key contribution is identifying this missing axiom and proving that, together with the earlier axioms, it yields a complete characterization of the logic of common belief, settling the open problem.
Link: https://arxiv.org/abs/2602.15403
Authors: Thomas Ågotnes
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Contrary to common belief, common belief is not KD4. If individual belief is KD45, common belief does indeed lose the 5 property and keep the D and 4 properties – and it has none of the other commonly considered properties of knowledge and belief. But it has another property: C(Cφ → φ) – corresponding to so-called shift-reflexivity (reflexivity one step ahead). This observation begs the question: is KD4 extended with this axiom a complete characterisation of common belief in the KD45 case? If not, what *is* the logic of common belief? In this paper we show that the answer to the first question is "no": there is one additional axiom, and, furthermore, it relies on the number of agents. We show that the result is a complete characterisation of common belief, settling the open problem.
[AI-27] ActionCodec: What Makes for Good Action Tokenizers
【Quick Read】: This paper addresses the lack of optimization-oriented design in action tokenization for Vision-Language-Action (VLA) models: existing methods focus on reconstruction fidelity and ignore how tokenization directly affects VLA optimization, leaving the basic question of what makes a good action tokenizer unanswered. The key to the solution is a set of information-theoretic design principles, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence, and, guided by them, the high-performance ActionCodec tokenizer. ActionCodec markedly improves training efficiency and VLA performance across simulation and real-world benchmarks: on LIBERO it reaches a 95.5% success rate without any robotics pre-training, rising to 97.4% with architectural enhancements, a new state of the art for VLA models without robotics pre-training.
Link: https://arxiv.org/abs/2602.15397
Authors: Zibin Dong,Yicheng Liu,Shiduo Zhang,Baijun Ye,Yifu Yuan,Fei Ni,Jingjing Gong,Xipeng Qiu,Hang Zhao,Yinchuan Li,Jianye Hao
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of *what makes for good action tokenizers* remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.
[AI-28] Improving LLM Reliability through Hybrid Abstention and Adaptive Detection
【Quick Read】: This paper addresses the safety-utility trade-off faced by large language models (LLMs) in production: strict filtering blocks harmful outputs but often rejects legitimate queries, while relaxed controls risk unsafe generations. Conventional guardrails based on static rules or fixed confidence thresholds lack context sensitivity and are computationally expensive, degrading latency and user experience. The key to the solution is a context-aware adaptive abstention system that dynamically adjusts safety thresholds based on real-time contextual signals such as domain and user history, combined with a multi-dimensional detection architecture of five parallel detectors organized as a hierarchical cascade: the cascade progressively filters queries to avoid redundant computation, achieving high safety precision with near-perfect recall at substantially lower latency, and notably reducing false positives in sensitive domains such as medical advice and creative writing.
Link: https://arxiv.org/abs/2602.15391
Authors: Ankit Sharma,Nachiket Tapas,Jyotiprakash Patra
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) deployed in production environments face a fundamental safety-utility trade-off: strict filtering mechanisms prevent harmful outputs but often block benign queries, while relaxed controls risk unsafe content generation. Conventional guardrails based on static rules or fixed confidence thresholds are typically context-insensitive and computationally expensive, resulting in high latency and degraded user experience. To address these limitations, we introduce an adaptive abstention system that dynamically adjusts safety thresholds based on real-time contextual signals such as domain and user history. The proposed framework integrates a multi-dimensional detection architecture composed of five parallel detectors, combined through a hierarchical cascade mechanism to optimize both speed and precision. The cascade design reduces unnecessary computation by progressively filtering queries, achieving substantial latency improvements compared to non-cascaded models and external guardrail systems. Extensive evaluation on mixed and domain-specific workloads demonstrates significant reductions in false positives, particularly in sensitive domains such as medical advice and creative writing. The system maintains high safety precision and near-perfect recall under strict operating modes. Overall, our context-aware abstention framework effectively balances safety and utility while preserving performance, offering a scalable solution for reliable LLM deployment.
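The hierarchical cascade is the part that buys the latency savings, and a minimal sketch makes the control flow clear. Detector names, scores, and the fixed threshold below are all illustrative; the actual system adapts thresholds using contextual signals.

```python
def cascade_decision(query, detectors, threshold=0.5):
    """Hierarchical cascade: cheap detectors run first and short-circuit
    confident decisions, so expensive checks only see the residue.

    `detectors` is a cost-ascending list of (name, fn) pairs; each fn
    returns a risk score in [0, 1], or None when it abstains.
    """
    for name, detect in detectors:
        score = detect(query)
        if score is None:           # detector is unsure -> escalate
            continue
        if score >= threshold:
            return ("abstain", name)
        return ("answer", name)
    return ("abstain", "fallback")  # nothing was confident: fail safe

# Toy detectors in increasing cost order (all illustrative).
keyword  = ("keyword",  lambda q: 0.9 if "exploit" in q else None)
pattern  = ("pattern",  lambda q: 0.1 if q.endswith("?") else None)
deep_llm = ("deep_llm", lambda q: 0.4)  # stand-in for an expensive check

for q in ["how do I exploit this bug", "what is a linked list?", "tell me"]:
    print(q, "->", cascade_decision(q, [keyword, pattern, deep_llm]))
```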
[AI-29] A Unified Evaluation of Learning-Based Similarity Techniques for Malware Detection
【Quick Read】: This paper addresses the failure of traditional cryptographic hashes (e.g., MD5, SHA-256) in security-analysis settings: because any small input change breaks the match, they cannot effectively support threat hunting, malware analysis, or digital forensics. The key to the solution is a systematic comparison of learning-based similarity techniques, covering classical similarity digests (ssdeep, sdhash, TLSH) and machine-learning-generated embeddings, evaluated under a unified experimental framework with industry-accepted metrics on large public datasets. The results show that no single method excels on all dimensions; each exhibits distinct trade-offs, implying that effective security platforms should combine complementary classification and similarity techniques rather than rely on a single approach.
Link: https://arxiv.org/abs/2602.15376
Authors: Udbhav Prasad,Aniesh Chawla
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Cryptographic digests (e.g., MD5, SHA-256) are designed to provide exact identity. Any single-bit change in the input produces a completely different hash, which is ideal for integrity verification but limits their usefulness in many real-world tasks like threat hunting, malware analysis and digital forensics, where adversaries routinely introduce minor transformations. Similarity-based techniques address this limitation by enabling approximate matching, allowing related byte sequences to produce measurably similar fingerprints. Modern enterprises manage tens of thousands of endpoints with billions of files, making the effectiveness and scalability of the proposed techniques more important than ever in security applications. Security researchers have proposed a range of approaches, including similarity digests and locality-sensitive hashes (e.g., ssdeep, sdhash, TLSH), as well as more recent machine-learning-based methods that generate embeddings from file features. However, these techniques have largely been evaluated in isolation, using disparate datasets and evaluation criteria. This paper presents a systematic comparison of learning-based classification and similarity methods using large, publicly available datasets. We evaluate each method under a unified experimental framework with industry-accepted metrics. To our knowledge, this is the first reproducible study to benchmark these diverse learning-based similarity techniques side by side for real-world security workloads. Our results show that no single approach performs well across all dimensions; instead, each exhibits distinct trade-offs, indicating that effective malware analysis and threat-hunting platforms must combine complementary classification and similarity techniques rather than rely on a single method.
[AI-30] CDRL: A Reinforcement Learning Framework Inspired by Cerebellar Circuits and Dendritic Computational Strategies
【Quick Read】: This paper targets the low sample efficiency, noise sensitivity, and weak generalization under partial observability that limit reinforcement learning (RL) in high-dimensional sequential decision-making. Whereas existing work focuses mainly on optimization strategies, it explores how architectural priors shape representation learning and decision dynamics. The key to the solution is a biologically grounded architecture inspired by the cerebellum, incorporating large expansion, sparse connectivity, sparse activation, and dendritic-level modulation; these mechanisms consistently improve sample efficiency, robustness, and generalization on noisy, high-dimensional RL benchmarks, indicating that cerebellar structural priors serve as effective inductive biases for RL.
Link: https://arxiv.org/abs/2602.15367
Authors: Sibo Zhang,Rui Jing,Liangfu Lv,Jian Zhang,Yunliang Zang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 14 pages, 8 figures, 6 tables
Abstract:Reinforcement learning (RL) has achieved notable performance in high-dimensional sequential decision-making tasks, yet remains limited by low sample efficiency, sensitivity to noise, and weak generalization under partial observability. Most existing approaches address these issues primarily through optimization strategies, while the role of architectural priors in shaping representation learning and decision dynamics is less explored. Inspired by structural principles of the cerebellum, we propose a biologically grounded RL architecture that incorporates large expansion, sparse connectivity, sparse activation, and dendritic-level modulation. Experiments on noisy, high-dimensional RL benchmarks show that both the cerebellar architecture and dendritic modulation consistently improve sample efficiency, robustness, and generalization compared to conventional designs. Sensitivity analysis of architectural parameters suggests that cerebellum-inspired structures can offer optimized performance for RL with constrained model parameters. Overall, our work underscores the value of cerebellar structural priors as effective inductive biases for RL.
[AI-31] Automated Multi-Source Debugging and Natural Language Error Explanation for Dashboard Applications
【Quick Read】: This paper addresses the difficulty of root-causing failures in modern microservices-based web dashboards and enterprise applications, where users see only opaque messages such as "Something went wrong": existing monitoring tools capture browser, API, and server-log events in isolation, without effective correlation or explanations intelligible to non-technical staff. The key to the solution is an automated multi-source debugging and natural-language error-explanation system that collects and correlates error data across sources in real time (browser exceptions, API contract violations, and server-side logic failures), validates API contracts, and uses large language models (LLMs) to generate clear natural-language explanations, substantially reducing support engineers' Mean Time to Resolution and turning cryptic error codes into actionable insights.
Link: https://arxiv.org/abs/2602.15362
Authors: Devendra Tata,Mona Rajhans
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the 12th (Springer CCIS) International Conference on Information Management, March 27-29, 2026, Oxford, UK
Abstract:Modern web dashboards and enterprise applications increasingly rely on complex, distributed microservices architectures. While these architectures offer scalability, they introduce significant challenges in debugging and observability. When failures occur, they often manifest as opaque error messages to the end-user, such as "Something went wrong". This masks the underlying root cause, which may reside in browser-side exceptions, API contract violations, or server-side logic failures. Existing monitoring tools capture these events in isolation but fail to correlate them effectively or provide intelligible explanations to non-technical users. This paper proposes a novel system for Automated Multi-Source Debugging and Natural Language Error Explanation. The proposed framework automatically collects and correlates error data from disparate sources such as browser, API, and server logs, validates API contracts in real time, and utilizes Large Language Models to generate natural language explanations. This approach significantly reduces Mean Time to Resolution for support engineers and improves the user experience by transforming cryptic error codes into actionable insights.
[AI-32] Fine-Tuning LLMs to Generate Economical and Reliable Actions for the Power Grid
【Quick Read】: This paper addresses the infeasibility of standard operating points under the rapid topology changes forced by Public Safety Power Shutoffs (PSPS): the core challenge is quickly generating feasible, voltage-stable transmission-switching plans that minimize load shedding under a limited switching budget. The key to the solution is a verifiable multi-stage adaptation pipeline: supervised fine-tuning first distills a DC-OPF MILP oracle into a constrained action grammar that keeps generated actions parseable and checkable for feasibility; direct preference optimization (DPO) then injects voltage awareness using AC-evaluated preference pairs ranked by a voltage-penalty metric; and best-of-N selection picks the best feasible candidate at inference time. On IEEE 118-bus PSPS scenarios this reduces the AC power-flow failure rate from 50% to single digits and improves voltage-penalty outcomes on the common-success set.
Link: https://arxiv.org/abs/2602.15350
Authors: Mohamad Chehade,Hao Zhu
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:
Abstract:Public Safety Power Shutoffs (PSPS) force rapid topology changes that can render standard operating points infeasible, requiring operators to quickly identify corrective transmission switching actions that reduce load shedding while maintaining acceptable voltage behavior. We present a verifiable, multi-stage adaptation pipeline that fine-tunes an instruction-tuned large language model (LLM) to generate *open-only* corrective switching plans from compact PSPS scenario summaries under an explicit switching budget. First, supervised fine-tuning distills a DC-OPF MILP oracle into a constrained action grammar that enables reliable parsing and feasibility checks. Second, direct preference optimization refines the policy using AC-evaluated preference pairs ranked by a voltage-penalty metric, injecting voltage-awareness beyond DC imitation. Finally, best-of-N selection provides an inference-time addition by choosing the best feasible candidate under the target metric. On IEEE 118-bus PSPS scenarios, fine-tuning substantially improves DC objective values versus zero-shot generation, reduces AC power-flow failure from 50% to single digits, and improves voltage-penalty outcomes on the common-success set. Code and data-generation scripts are released to support reproducibility.
[AI-33] FedPSA: Modeling Behavioral Staleness in Asynchronous Federated Learning
【Quick Read】: This paper addresses performance degradation in Asynchronous Federated Learning (AFL) caused by staleness. Existing methods use only the round gap between a client's model and the global model as the measure of staleness, a coarse-grained criterion that ignores the actual sensitivity of the model's parameters and caps AFL performance. The key to the solution is FedPSA (Parameter Sensitivity-based Asynchronous Federated Learning), which introduces a fine-grained staleness measure based on parameter sensitivity and maintains a dynamic momentum queue to identify the current training phase in real time, dynamically adjusting the tolerance for outdated information and thereby enabling more precise update control, greater training stability, and better final performance (up to 6.37% over baselines and 1.93% over the prior state of the art).
Link: https://arxiv.org/abs/2602.15337
Authors: Chaoyi Lu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Asynchronous Federated Learning (AFL) has emerged as a significant research area in recent years. By not waiting for slower clients and executing the training process concurrently, it achieves faster training speed compared to traditional federated learning. However, due to the staleness introduced by the asynchronous process, its performance may degrade in some scenarios. Existing methods often use the round difference between the current model and the global model as the sole measure of staleness, which is coarse-grained and lacks observation of the model itself, thereby limiting the performance ceiling of asynchronous methods. In this paper, we propose FedPSA (Parameter Sensitivity-based Asynchronous Federated Learning), a more fine-grained AFL framework that leverages parameter sensitivity to measure model obsolescence and establishes a dynamic momentum queue to assess the current training phase in real time, thereby adjusting the tolerance for outdated information dynamically. Extensive experiments on multiple datasets and comparisons with various methods demonstrate the superior performance of FedPSA, achieving up to 6.37% improvement over baseline methods and 1.93% over the current state-of-the-art method.
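A minimal sketch of the idea, under our own assumptions: discount an asynchronously arriving update not by round gap alone, but by a sensitivity- and phase-aware effective staleness. The functional form below is invented purely for illustration and is not FedPSA's published rule.

```python
def staleness_weight(round_gap, sensitivity, phase_momentum, alpha=0.5):
    """Fine-grained staleness weighting for an asynchronous client update.

    round_gap:      rounds between the client's base model and the global model
    sensitivity:    in [0, 1], how sensitive the touched parameters are
    phase_momentum: >= 0, how fast training is currently moving
    The discount grows with all three, so a stale update to insensitive
    parameters during a quiet phase is tolerated more than a fresher one
    to sensitive parameters mid-descent.
    """
    effective_staleness = round_gap * sensitivity * (1.0 + phase_momentum)
    return 1.0 / (1.0 + alpha * effective_staleness)

# A 5-round-old update to insensitive parameters in a quiet phase (top)
# can outweigh a 2-round-old update to highly sensitive parameters (bottom).
print(staleness_weight(round_gap=5, sensitivity=0.1, phase_momentum=0.2))
print(staleness_weight(round_gap=2, sensitivity=0.9, phase_momentum=1.5))
```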
[AI-34] A Scalable Curiosity-Driven Game-Theoretic Framework for Long-Tail Multi-Label Learning in Data Mining
【Quick Read】: This paper addresses the performance imbalance in large-scale multi-label classification (MLC) caused by long-tail label distributions, where a few head labels dominate while many tail labels have scarce samples. Traditional resampling and reweighting strategies often disrupt inter-label dependencies or require brittle hyperparameter tuning, especially when the label space grows to tens of thousands of labels. The key to the solution is Curiosity-Driven Game-Theoretic Multi-Label Learning (CD-GTMLL), which recasts long-tail MLC as a multi-player game: each sub-predictor ("player") specializes in a partition of the label space and cooperates to maximize global accuracy, while intrinsic curiosity rewards based on tail-label rarity and inter-player disagreement adaptively inject learning signals into under-represented labels without manual balancing or tuning. Theoretical analysis shows convergence to a tail-aware equilibrium directly linked to Rare-F1 improvements, and experiments on seven benchmarks, including extreme multi-label datasets with 30,000+ labels, show consistent gains over state-of-the-art methods.
Link: https://arxiv.org/abs/2602.15330
Authors: Jing Yang,Keze Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The long-tail distribution, where a few head labels dominate while rare tail labels abound, poses a persistent challenge for large-scale Multi-Label Classification (MLC) in real-world data mining applications. Existing resampling and reweighting strategies often disrupt inter-label dependencies or require brittle hyperparameter tuning, especially as the label space expands to tens of thousands of labels. To address this issue, we propose Curiosity-Driven Game-Theoretic Multi-Label Learning (CD-GTMLL), a scalable cooperative framework that recasts long-tail MLC as a multi-player game - each sub-predictor (“player”) specializes in a partition of the label space, collaborating to maximize global accuracy while pursuing intrinsic curiosity rewards based on tail label rarity and inter-player disagreement. This mechanism adaptively injects learning signals into under-represented tail labels without manual balancing or tuning. We further provide a theoretical analysis showing that our CD-GTMLL converges to a tail-aware equilibrium and formally links the optimization dynamics to improvements in the Rare-F1 metric. Extensive experiments across 7 benchmarks, including extreme multi-label classification datasets with 30,000+ labels, demonstrate that CD-GTMLL consistently surpasses state-of-the-art methods, with gains up to +1.6% P@3 on Wiki10-31K. Ablation studies further confirm the contributions of both game-theoretic cooperation and curiosity-driven exploration to robust tail performance. By integrating game theory with curiosity mechanisms, CD-GTMLL not only enhances model efficiency in resource-constrained environments but also paves the way for more adaptive learning in imbalanced data scenarios across industries like e-commerce and healthcare.
[AI-35] AgriWorld: A World Tools Protocol Framework for Verifiable Agricultural Reasoning with Code-Executing LLM Agents
【Quick Read】: This paper addresses the absence of language-based reasoning and interaction in agricultural foundation models, together with the difficulty large language models (LLMs) have in directly processing high-dimensional, heterogeneous agricultural data. The key to the solution is a Python execution environment, AgriWorld, that integrates unified tools for geospatial queries, remote-sensing time-series analysis, crop growth simulation, and task-specific predictors, on top of which a multi-turn LLM agent, Agro-Reflective, writes code, observes results, and refines its analysis through an execute-observe-refine loop, advancing reliable and interpretable reasoning for agricultural science.
Link: https://arxiv.org/abs/2602.15325
Authors: Zhixing Zhang, Jesen Zhang, Hao Liu, Qinhan Lv, Jing Yang, Kaitong Cai, Keze Wang
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Foundation models for agriculture are increasingly trained on massive spatiotemporal data (e.g., multi-spectral remote sensing, soil grids, and field-level management logs) and achieve strong performance on forecasting and monitoring. However, these models lack language-based reasoning and interactive capabilities, limiting their usefulness in real-world agronomic workflows. Meanwhile, large language models (LLMs) excel at interpreting and generating text, but cannot directly reason over high-dimensional, heterogeneous agricultural datasets. We bridge this gap with an agentic framework for agricultural science. It provides a Python execution environment, AgriWorld, exposing unified tools for geospatial queries over field parcels, remote-sensing time-series analytics, crop growth simulation, and task-specific predictors (e.g., yield, stress, and disease risk). On top of this environment, we design a multi-turn LLM agent, Agro-Reflective, that iteratively writes code, observes execution results, and refines its analysis via an execute-observe-refine loop. We introduce AgroBench, with scalable data generation for diverse agricultural QA spanning lookups, forecasting, anomaly detection, and counterfactual “what-if” analysis. In experiments, our agent outperforms text-only and direct tool-use baselines, validating execution-driven reflection for reliable agricultural reasoning.
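A minimal sketch of the execute-observe-refine loop described above; `llm` and `run_python` are hypothetical stand-ins for the model call and the sandboxed AgriWorld executor, and the "FINAL ANSWER" sentinel is an assumption, not the paper's actual protocol.

```python
def execute_observe_refine(llm, run_python, task, max_turns=5):
    """Minimal agent loop: the model writes code, observes execution
    output, and refines until it signals completion.
    `llm(prompt) -> str` and `run_python(code) -> str` are hypothetical
    stand-ins for the model call and the sandboxed tool executor."""
    transcript = f"Task: {task}\nWrite Python using the AgriWorld tools."
    observation = ""
    for _ in range(max_turns):
        code = llm(transcript)
        observation = run_python(code)             # tool execution feedback
        transcript += f"\n# Code:\n{code}\n# Output:\n{observation}\n"
        if "FINAL ANSWER" in observation:          # assumed completion signal
            return observation
        transcript += "Refine your analysis given the output above.\n"
    return observation
```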
[AI-36] Unforgeable Watermarks for Language Models via Robust Signatures
【Quick Read】: This paper addresses security in provenance verification for AI-generated text, in particular the inability of existing watermarking schemes to prevent false attribution. Current methods preserve output quality and support robust detection, but cannot stop a forger from crafting false positives: texts far from any output of the watermarked model that are nonetheless flagged as watermarked. The authors introduce two new guarantees: unforgeability, which rules out such forgeries, and recoverability, which ensures that whenever a watermark is detected the original generated text can be identified. The key to the solution is a new cryptographic primitive, robust digital signatures, which verify messages close (in the Hamming metric) to previously signed ones while preventing forgery of messages far from all signed messages. Using property-preserving hash functions, any standard digital signature scheme can be boosted into a robust one, yielding the first undetectable watermarking scheme that is simultaneously robust, unforgeable, and recoverable under substitution perturbations.
Link: https://arxiv.org/abs/2602.15323
Authors: Huijia Lin, Kameron Shahabi, Min Jae Song
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 60 pages, 7 figures
Abstract:Language models now routinely produce text that is difficult to distinguish from human writing, raising the need for robust tools to verify content provenance. Watermarking has emerged as a promising countermeasure, with existing work largely focused on model quality preservation and robust detection. However, current schemes provide limited protection against false attribution. We strengthen the notion of soundness by introducing two novel guarantees: unforgeability and recoverability. Unforgeability prevents adversaries from crafting false positives, texts that are far from any output from the watermarked model but are nonetheless flagged as watermarked. Recoverability provides an additional layer of protection: whenever a watermark is detected, the detector identifies the source text from which the flagged content was derived. Together, these properties strengthen content ownership by linking content exclusively to its generating model, enabling secure attribution and fine-grained traceability. We construct the first undetectable watermarking scheme that is robust, unforgeable, and recoverable with respect to substitutions (i.e., perturbations in Hamming metric). The key technical ingredient is a new cryptographic primitive called robust (or recoverable) digital signatures, which allow verification of messages that are close to signed ones, while preventing forgery of messages that are far from all previously signed messages. We show that any standard digital signature scheme can be boosted to a robust one using property-preserving hash functions (Boyle, LaVigne, and Vaikuntanathan, ITCS 2019).
[AI-37] On Surprising Effectiveness of Masking Updates in Adaptive Optimizers
【Quick Read】: This paper questions the near-exclusive reliance of large language model (LLM) training on dense adaptive optimizers with increasingly sophisticated preconditioners. The key finding is that randomly masking parameter updates induces a curvature-dependent geometric regularization that smooths the optimization trajectory and improves training stability. Building on this, the authors propose Momentum-aligned gradient masking (Magma), which modulates the magnitude of masked updates via momentum-gradient alignment; it serves as a simple drop-in replacement for existing adaptive optimizers with negligible overhead, and in large-scale LLM pre-training delivers consistent gains (at the 1B scale, reducing perplexity by over 19% and 9% relative to Adam and Muon, respectively).
Link: https://arxiv.org/abs/2602.15322
Authors: Taejong Joo, Wenhan Xia, Cheolmin Kim, Ming Zhang, Eugene Ie
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
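Magma's precise update rule is not published in the abstract; the sketch below shows one plausible reading of its stated ingredients: an RMSProp-preconditioned step, a random binary mask over the update, and a scale derived from momentum-gradient cosine alignment. Hyperparameters and the exact modulation are assumptions.

```python
import numpy as np

def magma_step(p, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
               keep_prob=0.5, eps=1e-8, rng=np.random.default_rng(0)):
    """One sketched update: RMSProp preconditioning, a random binary mask
    over the update, and a scale given by momentum-gradient alignment."""
    m = beta1 * m + (1 - beta1) * g                # momentum buffer
    v = beta2 * v + (1 - beta2) * g * g            # second-moment buffer
    update = g / (np.sqrt(v) + eps)                # RMSProp-style direction
    mask = rng.random(p.shape) < keep_prob         # random update masking
    align = np.dot(m.ravel(), g.ravel()) / (
        np.linalg.norm(m) * np.linalg.norm(g) + eps)  # cosine alignment
    p = p - lr * max(align, 0.0) * mask * update   # damp misaligned steps
    return p, m, v

p, g = np.zeros(10), np.ones(10)
p, m, v = magma_step(p, g, m=np.zeros(10), v=np.zeros(10))
print(p[:4])                                       # roughly half the entries moved
```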
[AI-38] Hybrid Federated and Split Learning for Privacy Preserving Clinical Prediction and Treatment Optimization
【Quick Read】: This paper addresses the problem that governance and privacy rules prevent pooling patient-level records across institutions, blocking collaborative clinical decision-support modeling. The key to the solution is a hybrid privacy-preserving framework combining Federated Learning (FL) and Split Learning (SL): feature-extraction trunks stay on clients while prediction heads are hosted on a coordinating server, enabling shared representation learning and exposing an explicit collaboration boundary where privacy controls can be applied. Lightweight defenses such as activation clipping and additive Gaussian noise reduce leakage under membership-inference audits, and the framework is evaluated jointly along four deployment-relevant axes (predictive utility, decision prioritization, audited privacy leakage, and communication overhead), providing a tunable privacy-utility trade-off.
Link: https://arxiv.org/abs/2602.15304
Authors: Farzana Akter, Rakib Hossain, Deb Kanna Roy Toushi, Mahmood Menon Khan, Sultana Amin, Lisan Al Amin
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Collaborative clinical decision support is often constrained by governance and privacy rules that prevent pooling patient-level records across institutions. We present a hybrid privacy-preserving framework that combines Federated Learning (FL) and Split Learning (SL) to support decision-oriented healthcare modeling without raw-data sharing. The approach keeps feature-extraction trunks on clients while hosting prediction heads on a coordinating server, enabling shared representation learning and exposing an explicit collaboration boundary where privacy controls can be applied. Rather than assuming distributed training is inherently private, we audit leakage empirically using membership inference on cut-layer representations and study lightweight defenses based on activation clipping and additive Gaussian noise. We evaluate across three public clinical datasets under non-IID client partitions using a unified pipeline and assess performance jointly along four deployment-relevant axes: factual predictive utility, uplift-based ranking under capacity constraints, audited privacy leakage, and communication overhead. Results show that hybrid FL-SL variants achieve competitive predictive performance and decision-facing prioritization behavior relative to standalone FL or SL, while providing a tunable privacy-utility trade-off that can reduce audited leakage without requiring raw-data sharing. Overall, the work positions hybrid FL-SL as a practical design space for privacy-preserving healthcare decision support where utility, leakage risk, and deployment cost must be balanced explicitly.
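The activation-clipping and Gaussian-noise defense at the FL-SL cut layer is standard enough to sketch directly; the constants below are illustrative, not the paper's settings.

```python
import numpy as np

def defend_activations(h, clip_norm=1.0, noise_std=0.1,
                       rng=np.random.default_rng(0)):
    """Clip each sample's cut-layer activation to a norm ball, then add
    Gaussian noise before sending it across the FL-SL boundary."""
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    h = h * np.minimum(1.0, clip_norm / (norms + 1e-12))   # activation clipping
    return h + rng.normal(scale=noise_std, size=h.shape)   # additive noise

batch = np.random.default_rng(1).normal(size=(4, 16))      # client activations
print(defend_activations(batch).shape)                     # (4, 16), privatized
```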
[AI-39] X-MAP: eXplainable Misclassification Analysis and Profiling for Spam and Phishing Detection
【Quick Read】: This paper addresses misclassification in spam and phishing detection, where false negatives expose users to attacks and false positives erode trust. Existing uncertainty-based detectors can flag potential errors but can be deceived and offer limited interpretability. The key to the solution is the X-MAP framework, which combines SHAP feature attributions with non-negative matrix factorization (NMF) to build interpretable topic profiles for reliably classified spam/phishing and legitimate messages, and uses Jensen-Shannon divergence to quantify each message's deviation from the matching profile, enabling accurate detection and repair of misclassifications.
Link: https://arxiv.org/abs/2602.15298
Authors: Qi Zhang, Dian Chen, Lance M. Kaplan, Audun Jøsang, Dong Hyun Jeong, Feng Chen, Jin-Hee Cho
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Misclassifications in spam and phishing detection are very harmful, as false negatives expose users to attacks while false positives degrade trust. Existing uncertainty-based detectors can flag potential errors, but can be deceived and offer limited interpretability. This paper presents X-MAP, an eXplainable Misclassification Analysis and Profiling framework that reveals topic-level semantic patterns behind model failures. X-MAP combines SHAP-based feature attributions with non-negative matrix factorization to build interpretable topic profiles for reliably classified spam/phishing and legitimate messages, and measures each message’s deviation from these profiles using Jensen-Shannon divergence. Experiments on SMS and phishing datasets show that misclassified messages exhibit at least two times larger divergence than correctly classified ones. As a detector, X-MAP achieves up to 0.98 AUROC and lowers the false-rejection rate at 95% TRR to 0.089 on positive predictions. When used as a repair layer on base detectors, it recovers up to 97% of falsely rejected correct predictions with moderate leakage. These results demonstrate X-MAP’s effectiveness and interpretability for improving spam and phishing detection.
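The Jensen-Shannon deviation score at the heart of X-MAP can be written in a few lines; the topic vectors below are made up for illustration (in the paper they come from SHAP attributions factorized with NMF).

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two topic distributions."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

profile = np.array([0.6, 0.3, 0.1])        # topic profile of reliable spam
msg_ok  = np.array([0.5, 0.4, 0.1])        # close to the profile
msg_sus = np.array([0.05, 0.15, 0.8])      # large deviation -> likely error
print(js_divergence(msg_ok, profile), js_divergence(msg_sus, profile))
```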
[AI-40] EAA: Automating materials characterization with vision language model agents
【Quick Read】: This paper addresses the limited automation of complex microscopy workflows, especially at synchrotron beamlines, where operation traditionally relies on expert experience, is inefficient, and is hard to scale. The core of the solution is Experiment Automation Agents (EAA), a vision-language-model (VLM)-driven agentic system that combines multimodal reasoning, tool-augmented action, and optional long-term memory to support workflows ranging from fully autonomous operation to interactive user-guided measurements. EAA is built on a modular task-manager architecture with two-way Model Context Protocol (MCP) compatibility, so instrument-control tools can be consumed or served across applications, improving beamline efficiency, reducing operational burden, and lowering the expertise barrier for users.
Link: https://arxiv.org/abs/2602.15294
Authors: Ming Du, Yanqi Luo, Srutarshi Banerjee, Michael Wojcik, Jelena Popovic, Mathew J. Cherukara
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We present Experiment Automation Agents (EAA), a vision-language-model-driven agentic system designed to automate complex experimental microscopy workflows. EAA integrates multimodal reasoning, tool-augmented action, and optional long-term memory to support both autonomous procedures and interactive user-guided measurements. Built on a flexible task-manager architecture, the system enables workflows ranging from fully agent-driven automation to logic-defined routines that embed localized LLM queries. EAA further provides a modern tool ecosystem with two-way compatibility for Model Context Protocol (MCP), allowing instrument-control tools to be consumed or served across applications. We demonstrate EAA at an imaging beamline at the Advanced Photon Source, including automated zone plate focusing, natural language-described feature search, and interactive data acquisition. These results illustrate how vision-capable agents can enhance beamline efficiency, reduce operational burden, and lower the expertise barrier for users.
[AI-41] AI-Paging: Lease-Based Execution Anchoring for Network-Exposed AI-as-a-Service
【Quick Read】: This paper addresses the difficulty users face in selecting an appropriate model instance at run time in AI-as-a-Service (AIaaS) deployments spanning multiple providers and model tiers. As 6G evolves, the service provider takes on intent-to-model resolution and execution placement, subject to policy, trust, and Quality of Service (QoS) constraints. The key solution is AI-paging, a control-plane transaction that resolves a user intent into an AI service identity (AISI), a scoped session token (AIST), and an expiring admission lease (COMMIT) that authorizes user-plane steering to a selected AI execution anchor (AEXF). Reliability rests on two invariants: lease-gated steering (no steering state is installed without a COMMIT) and make-before-break anchoring, supporting continuity and robustness of AIaaS services under dynamic network conditions.
Link: https://arxiv.org/abs/2602.15286
Authors: Merve Saimler, Mohaned Chraiti
Affiliations: unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:
Abstract:With AI-as-a-Service (AIaaS) now deployed across multiple providers and model tiers, selecting the appropriate model instance at run time is increasingly outside the end user’s knowledge and operational control. Accordingly, the 6G service providers are envisioned to play a crucial role in exposing AIaaS in a setting where users submit only an intent while the network helps in the intent-to-model matching (resolution) and execution placement under policy, trust, and Quality of Service (QoS) constraints. The network’s role becomes to discover candidate execution endpoints and select a suitable model/anchor under policy and QoS constraints, in a process referred to here as AI-paging (by analogy to cellular call paging). In the proposed architecture, AI-paging is a control-plane transaction that resolves an intent into an AI service identity (AISI), a scoped session token (AIST), and an expiring admission lease (COMMIT) that authorizes user-plane steering to a selected AI execution anchor (AEXF) under a QoS binding. AI-Paging enforces two invariants: (i) lease-gated steering (without COMMIT, no steering state is installed) and (ii) make-before-break anchoring to support continuity and reliability of AIaaS services under dynamic network conditions. We prototype AI-Paging using existing control- and user-plane mechanisms (service-based control, QoS flows, and policy-based steering) with no new packet headers, ensuring compatibility with existing 3GPP-based exposure and management architectures, and evaluate transaction latency, relocation interruption, enforcement correctness under lease expiry, and audit-evidence overhead under mobility and failures.
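A toy rendering of the two stated invariants, lease-gated steering and make-before-break anchoring; the class and field names are hypothetical illustrations, not 3GPP or paper terminology.

```python
import time

class AIPagingSession:
    """Toy control-plane state illustrating the two invariants."""
    def __init__(self):
        self.commit_expiry = None      # COMMIT lease expiry (epoch seconds)
        self.anchor = None             # currently serving AEXF

    def grant_commit(self, ttl_s):
        self.commit_expiry = time.time() + ttl_s

    def steer(self, new_anchor):
        # Invariant 1: without a live COMMIT, install no steering state.
        if self.commit_expiry is None or time.time() > self.commit_expiry:
            raise PermissionError("no valid COMMIT lease; steering refused")
        # Invariant 2: make-before-break -- the new anchor is brought up
        # before the old one is released, so service never gaps.
        old = self.anchor
        self.anchor = new_anchor       # new path established first...
        if old is not None:
            del old                    # ...then the old anchor is released

s = AIPagingSession()
s.grant_commit(ttl_s=30)
s.steer("AEXF-edge-2")
print(s.anchor)
```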
[AI-42] Complex-Valued Unitary Representations as Classification Heads for Improved Uncertainty Quantification in Deep Neural Networks
【Quick Read】: This paper addresses the poor calibration of deep neural networks despite their high predictive accuracy: confidence scores do not reliably reflect the true probability of correctness. The key to the solution is a quantum-mechanics-inspired classification head: backbone features are projected into a complex-valued Hilbert space and evolved under a unitary transformation parameterized via the Cayley map. With a controlled design (a single shared backbone plus lightweight interchangeable heads), the study finds that a unitary head with magnitude readout markedly improves calibration, reducing Expected Calibration Error (ECE) to 0.0146 versus 0.0355 for a standard softmax head and 0.0510 for temperature scaling. The approach exploits a theoretical link between norm-preserving unitary dynamics and feature-space geometry, and performs especially well at modeling human perceptual uncertainty.
Link: https://arxiv.org/abs/2602.15283
Authors: Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 12 figures
Abstract:Modern deep neural networks achieve high predictive accuracy but remain poorly calibrated: their confidence scores do not reliably reflect the true probability of correctness. We propose a quantum-inspired classification head architecture that projects backbone features into a complex-valued Hilbert space and evolves them under a learned unitary transformation parameterised via the Cayley map. Through a controlled hybrid experimental design - training a single shared backbone and comparing lightweight interchangeable heads - we isolate the effect of complex-valued unitary representations on calibration. Our ablation study on CIFAR-10 reveals that the unitary magnitude head (complex features evolved under a Cayley unitary, read out via magnitude and softmax) achieves an Expected Calibration Error (ECE) of 0.0146, representing a 2.4x improvement over a standard softmax head (0.0355) and a 3.5x improvement over temperature scaling (0.0510). Surprisingly, replacing the softmax readout with a Born rule measurement layer - the quantum-mechanically motivated approach - degrades calibration to an ECE of 0.0819. On the CIFAR-10H human-uncertainty benchmark, the wave function head achieves the lowest KL-divergence (0.336) to human soft labels among all compared methods, indicating that complex-valued representations better capture the structure of human perceptual ambiguity. We provide theoretical analysis connecting norm-preserving unitary dynamics to calibration through feature-space geometry, report negative results on out-of-distribution detection and sentiment analysis to delineate the method’s scope, and discuss practical implications for safety-critical applications. Code is publicly available.
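The Cayley parameterization and magnitude readout are concrete enough to reproduce in miniature. The sketch below checks that the Cayley map of a skew-Hermitian matrix is unitary and applies a magnitude-then-softmax readout; the learned projection into the complex space and all dimensions are illustrative, and the actual head may differ in detail.

```python
import numpy as np

def cayley_unitary(A):
    """Cayley map: a skew-Hermitian A (A^H = -A) yields a unitary
    U = (I - A)(I + A)^{-1}."""
    I = np.eye(A.shape[0], dtype=complex)
    return (I - A) @ np.linalg.inv(I + A)

rng = np.random.default_rng(0)
d, k = 8, 3                                        # feature dim, classes
B = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
A = B - B.conj().T                                 # skew-Hermitian parameter
U = cayley_unitary(A)
print(np.allclose(U.conj().T @ U, np.eye(d)))      # True: norm-preserving

z = rng.normal(size=d) + 1j * rng.normal(size=d)   # complex-projected feature
logits = np.abs(U @ z)[:k]                         # magnitude readout
probs = np.exp(logits) / np.exp(logits).sum()      # softmax over magnitudes
print(probs)
```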
[AI-43] High-Fidelity Network Management for Federated AI-as-a-Service: Cross-Domain Orchestration
【Quick Read】: This paper addresses quality assurance for AI-as-a-Service (AIaaS) under cross-domain federation, where the core challenge is high-fidelity end-to-end control and orchestration over multi-domain networks subject to both communication impairments (delay, loss) and inference impairments (latency, error). The key solution is an assurance-oriented management plane based on Tail-Risk Envelopes (TREs): signed, composable per-domain descriptors that combine deterministic guardrails with stochastic rate-latency-impairment models. Using stochastic network calculus, the authors derive bounds on end-to-end delay violation probabilities across tandem domains and obtain an optimization-ready decomposition of risk budgets; an auditing layer uses runtime telemetry to estimate extreme-percentile performance, quantify uncertainty, and attribute tail risk to each domain. Tenant-level reservations prevent bursty traffic from inflating tail latency, improving p99.9 compliance under overload and the robustness of tenant isolation.
Link: https://arxiv.org/abs/2602.15281
Authors: Merve Saimler, Mohaned Chraiti, Ozgur Ercetin
Affiliations: unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:
Abstract:To support the emergence of AI-as-a-Service (AIaaS), communication service providers (CSPs) are on the verge of a radical transformation: from pure connectivity providers to providers of AIaaS as a managed network service (a control-and-orchestration plane that exposes AI models). In this model, the CSP is responsible not only for transport/communications, but also for intent-to-model resolution and joint network-compute orchestration, i.e., reliable and timely end-to-end delivery. The resulting end-to-end AIaaS service thus becomes governed by communications impairments (delay, loss) and inference impairments (latency, error). A central open problem is an operational AIaaS control-and-orchestration framework that enforces high fidelity, particularly under multi-domain federation. This paper introduces an assurance-oriented AIaaS management plane based on Tail-Risk Envelopes (TREs): signed, composable per-domain descriptors that combine deterministic guardrails with stochastic rate-latency-impairment models. Using stochastic network calculus, we derive bounds on end-to-end delay violation probabilities across tandem domains and obtain an optimization-ready risk-budget decomposition. We show that tenant-level reservations prevent bursty traffic from inflating tail latency under TRE contracts. An auditing layer then uses runtime telemetry to estimate extreme-percentile performance, quantify uncertainty, and attribute tail-risk to each domain for accountability. Packet-level Monte-Carlo simulations demonstrate improved p99.9 compliance under overload via admission control and robust tenant isolation under correlated burstiness.
[AI-44] When Remembering and Planning are Worth it: Navigating under Change
【Quick Read】: This paper studies spatial navigation in changing, uncertain environments: each day the agent must find its way from home, through barriers, to food, while barrier and food locations change from day to day and the agent's sensing (e.g., location information) is limited and noisy. The key to the solution is an architecture that can combine multiple strategies: the agent keeps updating its episodic memories with non-stationary probability learning techniques and uses them to build imperfect maps and plan on the fly. As task difficulty (e.g., distance to the goal) increases, such an agent becomes substantially more efficient than simpler minimal-memory agents, provided the uncertainty from localization and change is not too large.
Link: https://arxiv.org/abs/2602.15274
Authors: Omid Madani, J. Brian Burns, Reza Eghbali, Thomas L. Dean
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We explore how different types and uses of memory can aid spatial navigation in changing uncertain environments. In the simple foraging task we study, every day, our agent has to find its way from its home, through barriers, to food. Moreover, the world is non-stationary: from day to day, the location of the barriers and food may change, and the agent’s sensing such as its location information is uncertain and very limited. Any model construction, such as a map, and use, such as planning, needs to be robust against these challenges, and if any learning is to be useful, it needs to be adequately fast. We look at a range of strategies, from simple to sophisticated, with various uses of memory and learning. We find that an architecture that can incorporate multiple strategies is required to handle (sub)tasks of a different nature, in particular for exploration and search, when food location is not known, and for planning a good path to a remembered (likely) food location. An agent that utilizes non-stationary probability learning techniques to keep updating its (episodic) memories and that uses those memories to build maps and plan on the fly (imperfect maps, i.e. noisy and limited to the agent’s experience) can be increasingly and substantially more efficient than the simpler (minimal-memory) agents, as the task difficulties such as distance to goal are raised, as long as the uncertainty, from localization and change, is not too large.
[AI-45] Enhancing Diversity and Feasibility: Joint Population Synthesis from Multi-source Data Using Generative Models
【Quick Read】: This paper addresses two limitations of current synthetic-population generation: first, most methods rely on a single dataset or a sequential data-fusion-then-generation pipeline and fail to capture complex interactions among features; second, they handle sampling zeros (valid but unobserved attribute combinations) and structural zeros (infeasible combinations under logical constraints) poorly, reducing the diversity and feasibility of the generated data. The key to the solution is a joint learning framework based on a Wasserstein Generative Adversarial Network (WGAN) with gradient penalty that integrates multi-source data simultaneously and adds an inverse gradient penalty as a regularization term in the generator loss, markedly improving diversity and feasibility. Experiments show the joint approach beats the sequential baseline on both recall and precision, and reaches an overall five-metric similarity score of 88.1, 3.5 points above the baseline's 84.6.
Link: https://arxiv.org/abs/2602.15270
Authors: Farbod Abbasi, Zachary Patterson, Bilal Farooq
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures, 5 tables
Abstract:Generating realistic synthetic populations is essential for agent-based models (ABM) in transportation and urban planning. Current methods face two major limitations. First, many rely on a single dataset or follow a sequential data fusion and generation process, which means they fail to capture the complex interplay between features. Second, these approaches struggle with sampling zeros (valid but unobserved attribute combinations) and structural zeros (infeasible combinations due to logical constraints), which reduce the diversity and feasibility of the generated data. This study proposes a novel method to simultaneously integrate and synthesize multi-source datasets using a Wasserstein Generative Adversarial Network (WGAN) with gradient penalty. This joint learning method improves both the diversity and feasibility of synthetic data by defining a regularization term (inverse gradient penalty) for the generator loss function. For the evaluation, we implement a unified evaluation metric for similarity, and place special emphasis on measuring diversity and feasibility through recall, precision, and the F1 score. Results show that the proposed joint approach outperforms the sequential baseline, with recall increasing by 7% and precision by 15%. Additionally, the regularization term further improves diversity and feasibility, reflected in a 10% increase in recall and 1% in precision. We assess similarity distributions using a five-metric score. The joint approach performs better overall, and reaches a score of 88.1 compared to 84.6 for the sequential method. Since synthetic populations serve as a key input for ABM, this multi-source generative approach has the potential to significantly enhance the accuracy and reliability of ABM.
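The paper's "inverse gradient penalty" is only named in the abstract, so the generator regularizer below is a guess: it penalizes vanishing generator input gradients (a common proxy for mode collapse), shown next to the standard WGAN-GP critic penalty for contrast. Architectures and weights are placeholders.

```python
import torch

def critic_gradient_penalty(critic, real, fake, lam=10.0):
    """Standard WGAN-GP term: unit-gradient-norm penalty on interpolates."""
    alpha = torch.rand(real.size(0), 1)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)
    return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()

def generator_loss(critic, gen, z, mu=1.0, eps=1e-6):
    """Sketched generator objective: Wasserstein term plus an *inverse*
    gradient penalty that discourages vanishing input gradients; the
    paper's exact formulation is not specified here."""
    z = z.requires_grad_(True)
    x = gen(z)
    grad, = torch.autograd.grad(x.sum(), z, create_graph=True)
    inv_gp = mu / (grad.norm(2, dim=1).mean() + eps)
    return -critic(x).mean() + inv_gp

gen = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Tanh())
critic = torch.nn.Sequential(torch.nn.Linear(8, 1))
z, real = torch.randn(16, 4), torch.randn(16, 8)
print(generator_loss(critic, gen, z).item(),
      critic_gradient_penalty(critic, real, gen(z).detach()).item())
```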
[AI-46] Fast and Effective On-policy Distillation from Reasoning Prefixes
【Quick Read】: This paper addresses the high training cost of on-policy distillation (OPD), which requires expensive on-the-fly sampling of student trajectories, especially for long responses. The key to the solution is the observation that training signals concentrate in the prefix of each output: the authors therefore apply the distillation objective only to prefixes of student-generated outputs and terminate sampling early during distillation, drastically reducing training FLOP while matching the performance of full OPD.
Link: https://arxiv.org/abs/2602.15260
Authors: Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, avoids relying solely on verifiable terminal rewards and can yield better generalization than off-policy distillation. However, OPD requires expensive on-the-fly sampling of the student policy during training, which substantially increases training cost, especially for long responses. Our initial analysis shows that, during OPD, training signals are often concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student produce the correct answer. Motivated by these observations, we propose a simple yet effective modification of OPD: we apply the distillation objective only to prefixes of student-generated outputs and terminate each sampling early during distillation. Experiments on a suite of AI-for-Math and out-of-domain benchmarks show that on-policy prefix distillation matches the performance of full OPD while reducing training FLOP by 2x-47x.
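Restricting the distillation objective to a prefix is easy to express; here is a sketch with per-token KL(teacher || student) over the first K positions. In practice the student's sampling would also stop at K, which is where the FLOP savings come from; shapes and the choice of K are illustrative.

```python
import torch
import torch.nn.functional as F

def prefix_distillation_loss(student_logits, teacher_logits, prefix_len):
    """Token-level KL(teacher || student) applied only to the first
    `prefix_len` positions of a student-sampled trajectory; positions
    beyond the prefix are never sampled or supervised.
    logits: (T, V) for one sequence."""
    s = F.log_softmax(student_logits[:prefix_len], dim=-1)
    t = F.log_softmax(teacher_logits[:prefix_len], dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

T, V, K = 512, 1000, 64                    # full length, vocab, prefix
student, teacher = torch.randn(T, V), torch.randn(T, V)
print(prefix_distillation_loss(student, teacher, K))
```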
[AI-47] Knowing Isnt Understanding: Re-grounding Generative Proactivity with Epistemic and Behavioral Insight
【Quick Read】: This paper addresses a limitation of current generative AI agents in the face of users' epistemic blind spots: existing systems assume users can articulate their needs, overlooking how unknown unknowns block effective human-AI collaboration. When users are unaware of missing information or risks, purely reactive behavior cannot provide genuinely intelligent support. The key to the solution is the dual constraint of epistemic grounding and behavioral grounding: the former requires the agent to proactively surface possibilities beyond the user's current awareness, while the latter imposes principled constraints on when, how, and to what extent the agent intervenes, preventing misdirected attention or cognitive overload, so that generative agents can responsibly foster meaningful partnership within the user's epistemic limits.
Link: https://arxiv.org/abs/2602.15259
Authors: Kirandeep Kaur, Xingda Lyu, Chirag Shah
Affiliations: unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Generative AI agents equate understanding with resolving explicit queries, an assumption that confines interaction to what users can articulate. This assumption breaks down when users themselves lack awareness of what is missing, risky, or worth considering. In such conditions, proactivity is not merely an efficiency enhancement, but an epistemic necessity. We refer to this condition as epistemic incompleteness: where progress depends on engaging with unknown unknowns for effective partnership. Existing approaches to proactivity remain narrowly anticipatory, extrapolating from past behavior and presuming that goals are already well defined, thereby failing to support users meaningfully. However, surfacing possibilities beyond a user’s current awareness is not inherently beneficial. Unconstrained proactive interventions can misdirect attention, overwhelm users, or introduce harm. Proactive agents, therefore, require behavioral grounding: principled constraints on when, how, and to what extent an agent should intervene. We advance the position that generative proactivity must be grounded both epistemically and behaviorally. Drawing on the philosophy of ignorance and research on proactive behavior, we argue that these theories offer critical guidance for designing agents that can engage responsibly and foster meaningful partnerships.
[AI-48] Decision Making under Imperfect Recall: Algorithms and Benchmarks
【Quick Read】: This paper tackles strategy optimization in imperfect-recall decision problems, which arise widely in game theory (e.g., the "absentminded driver" and team games with limited communication) and in practical settings such as privacy in AI systems and safety testing of agents in simulation. The central challenge is finding first-order optimal strategies under information forgetting. The key contributions are the first benchmark suite for such problems and a family of parameter-free regret matching (RM) algorithms for nonlinear constrained optimization. Across 61 instances, RM algorithms consistently outperform standard first-order optimizers such as projected gradient descent, often by orders of magnitude, establishing the RM family for the first time as a formidable tool for large-scale constrained optimization.
Link: https://arxiv.org/abs/2602.15252
Authors: Emanuel Tewolde, Brian Hu Zhang, Ioannis Anagnostides, Tuomas Sandholm, Vincent Conitzer
Affiliations: unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 39 pages, 71 figures, 4 tables
Abstract:In game theory, imperfect-recall decision problems model situations in which an agent forgets information it held before. They encompass games such as the "absentminded driver" and team games with limited communication. In this paper, we introduce the first benchmark suite for imperfect-recall decision problems. Our benchmarks capture a variety of problem types, including ones concerning privacy in AI systems that elicit sensitive information, and AI safety via testing of agents in simulation. Across 61 problem instances generated using this suite, we evaluate the performance of different algorithms for finding first-order optimal strategies in such problems. In particular, we introduce the family of regret matching (RM) algorithms for nonlinear constrained optimization. This class of parameter-free algorithms has enjoyed tremendous success in solving large two-player zero-sum games, but, surprisingly, they were hitherto relatively unexplored beyond that setting. Our key finding is that RM algorithms consistently outperform commonly employed first-order optimizers such as projected gradient descent, often by orders of magnitude. This establishes, for the first time, the RM family as a formidable approach to large-scale constrained optimization problems.
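A minimal instance of regret matching on a simplex-constrained convex problem; the paper's benchmark problems and RM variants are richer, but the parameter-free update is the same idea.

```python
import numpy as np

def regret_matching(grad_fn, n, iters=10000):
    """Parameter-free regret matching over the probability simplex:
    accumulate per-coordinate instantaneous regrets and play proportional
    to their positive parts; the running average iterate converges for
    convex objectives."""
    R = np.zeros(n)                              # cumulative regrets
    x = np.full(n, 1.0 / n)
    avg = np.zeros(n)
    for t in range(1, iters + 1):
        u = -grad_fn(x)                          # per-coordinate utility
        R += u - u @ x                           # instantaneous regret
        pos = np.maximum(R, 0.0)
        x = pos / pos.sum() if pos.sum() > 0 else np.full(n, 1.0 / n)
        avg += (x - avg) / t                     # running average iterate
    return avg

# toy objective: min ||x - c||^2 over the simplex; the optimum is the
# Euclidean projection of c, here (0.7, 0.2, 0.1, 0.0)
c = np.array([0.7, 0.2, 0.1, -0.3])
print(regret_matching(lambda x: 2 * (x - c), n=4))
```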
[AI-49] Artificial Intelligence Specialization in the European Union: Underexplored Role of the Periphery at NUTS-3 Level
【Quick Read】: This paper examines the uneven spatial distribution of AI research output across European regions, in particular the relationship between regional specialization and citation impact. The key to the solution is computing the Relative Specialization Index (RSI) and Relative Citation Impact (RCI) for 781 NUTS-3 regions using Clarivate InCites bibliometric data and the Citation Topics classification, revealing four regional profiles: high-impact specialized regions, high-volume but low-impact regions, high-impact non-specialized regions, and diversified portfolios with selective excellence. Peripheral regions (e.g., in Eastern Europe and Spain) show notable relative specialization despite modest output, and specialization is virtually uncorrelated with citation impact, suggesting that building competitive AI research capacity depends less on volume than on niche-building and quality-oriented strategies.
Link: https://arxiv.org/abs/2602.15249
Authors: Victor Herrero-Solana
Affiliations: unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, submitted to IEEE Computational Intelligence Magazine
Abstract:This study examines the geographical distribution of Artificial Intelligence (AI) research production across European regions at the NUTS-3 level for the period 2015-2024. Using bibliometric data from Clarivate InCites and the Citation Topics classification system, we analyze two hierarchical levels of thematic aggregation: Electrical Engineering, Electronics & Computer Science (Macro Citation Topic 4) and Artificial Intelligence & Machine Learning (Meso Citation Topic 4.61). We calculate the Relative Specialization Index (RSI) and Relative Citation Impact (RCI) for 781 NUTS-3 regions. While major metropolitan hubs such as Paris (Ile-de-France), Warszawa, and Madrid lead in absolute production volume, our findings reveal that peripheral regions, particularly from Eastern Europe and Spain, exhibit the highest levels of relative AI specialization. Notably, we find virtually no correlation between regional specialization and citation impact, identifying four distinct regional profiles: high-impact specialized regions (e.g., Granada, Jaen, Vilniaus), high-volume but low-impact regions (e.g., Bugas, several Polish regions), high-impact non-specialized regions, with Fyn (Denmark) standing out as a remarkable outlier achieving exceptional citation impact (RCI > 4) despite low specialization, and diversified portfolios with selective excellence (e.g., German regions). These results suggest that AI research represents a strategic opportunity for peripheral regions to develop competitive scientific niches, though achieving international visibility requires more than research volume alone.
[AI-50] Predicting Invoice Dilution in Supply Chain Finance with Leakage Free Two Stage XGBoost KAN (Kolmogorov Arnold Networks) and Ensemble Models
【Quick Read】: This paper addresses the non-credit risk and margin loss in supply chain finance caused by the gap between approved invoice amounts and actual collections (invoice dilution). Traditional mitigations rely on the buyer's irrevocable payment undertaking (IPU), whose rigidity hinders financing adoption, especially for sub-investment-grade buyers. The key to the proposed solution is an AI and machine learning framework that projects dilution in real time for each buyer-supplier pair via dynamic credit limits, supplementing or replacing the traditional deterministic algorithm, validated on an extensive production dataset spanning nine key transaction fields.
Link: https://arxiv.org/abs/2602.15248
Authors: Pavel Koptev, Vishnu Kumar, Konstantin Malkov, George Shapiro, Yury Vikhanov
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Mathematical Finance (q-fin.MF)
Comments:
Abstract:Invoice or payment dilution, the gap between the approved invoice amount and the actual collection, is a significant source of non-credit risk and margin loss in supply chain finance. Traditionally, this risk is managed through the buyer's irrevocable payment undertaking (IPU), which commits to full payment without deductions. However, IPUs can hinder supply chain finance adoption, particularly among sub-investment-grade buyers. Newer, data-driven methods use real-time dynamic credit limits, projecting dilution for each buyer-supplier pair in real time. This paper introduces an AI and machine learning framework and evaluates how it can supplement a deterministic algorithm to predict invoice dilution, using an extensive production dataset spanning nine key transaction fields.
[AI-51] GenAI for Systems: Recurring Challenges and Design Principles from Software to Silicon
【Quick Read】: This paper addresses the fragmentation of research on using generative AI to design, optimize, and build computing systems across the software, architecture, and chip-design communities. Taking a cross-stack perspective, from code generation and distributed runtimes through hardware design space exploration to RTL synthesis, physical layout, and verification, it identifies five recurring challenges (the feedback loop crisis, the tacit knowledge problem, trust and validation, co-design across boundaries, and the shift from determinism to dynamism) and distills five design principles that independently emerge as effective responses (embracing hybrid approaches, designing for continuous feedback, separating concerns by role, matching methods to problem structure, and building on decades of systems knowledge). The key contribution is a challenge-principle map that serves as a diagnostic and design aid, showing which principles work for which challenges at each layer, together with a call for shared engineering methodology (common vocabularies, cross-layer benchmarks, and systematic design practices) so that progress compounds across communities rather than being rediscovered in each one.
Link: https://arxiv.org/abs/2602.15241
Authors: Arya Tschand, Chenyu Wang, Zishen Wan, Andrew Cheng, Ioana Cristescu, Kevin He, Howard Huang, Alexander Ingare, Akseli Kangaslahti, Sara Kangaslahti, Theo Lebryk, Hongjin Lin, Jeffrey Jian Ma, Alexandru Meterez, Clara Mohri, Depen Morwani, Sunny Qin, Roy Rinberg, Paula Rodriguez-Diaz, Alyssa Mia Taliotis, Pernille Undrum Fathi, Rosie Zhao, Todd Zhou, Vijay Janapa Reddi
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative AI is reshaping how computing systems are designed, optimized, and built, yet research remains fragmented across software, architecture, and chip design communities. This paper takes a cross-stack perspective, examining how generative models are being applied from code generation and distributed runtimes through hardware design space exploration to RTL synthesis, physical layout, and verification. Rather than reviewing each layer in isolation, we analyze how the same structural difficulties and effective responses recur across the stack. Our central finding is one of convergence. Despite the diversity of domains and tools, the field keeps encountering five recurring challenges (the feedback loop crisis, the tacit knowledge problem, trust and validation, co-design across boundaries, and the shift from determinism to dynamism) and keeps arriving at five design principles that independently emerge as effective responses (embracing hybrid approaches, designing for continuous feedback, separating concerns by role, matching methods to problem structure, and building on decades of systems knowledge). We organize these into a challenge–principle map that serves as a diagnostic and design aid, showing which principles have proven effective for which challenges across layers. Through concrete cross-stack examples, we show how systems navigate this map as they mature, and argue that the field needs shared engineering methodology, including common vocabularies, cross-layer benchmarks, and systematic design practices, so that progress compounds across communities rather than being rediscovered in each one. Our analysis covers more than 275 papers spanning eleven application areas across three layers of the computing stack, and distills open research questions that become visible only from a cross-layer vantage point.
[AI-52] Closing the Distribution Gap in Adversarial Training for LLM s
【Quick Read】: This paper addresses the persistent vulnerability of adversarially trained large language models to simple but in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. The root cause is that existing adversarial training algorithms minimize adversarial loss on their training set but inadequately cover the true data distribution, producing generalization failures. The key to the solution, Distributional Adversarial Training (DAT), is to use diffusion LLMs to approximate the joint distribution of prompts and responses and generate diverse, high-likelihood samples that improve coverage; combining this distribution-level optimization with continuous adversarial training yields substantially higher adversarial robustness than previous methods.
Link: https://arxiv.org/abs/2602.15238
Authors: Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
[AI-53] Automatically Finding Reward Model Biases
【Quick Read】: This paper addresses biases that reward models (RMs) can introduce during post-training: an RM may spuriously favor irrelevant or undesirable attributes such as length, format, hallucinations, and sycophancy. The key to the solution is an LLM-based iterative discovery mechanism that repeatedly proposes and refines candidate bias hypotheses, systematically surfacing both known and novel reward model biases. Experiments show the method recovers known biases and finds new ones (e.g., Skywork-V2-8B spuriously favoring responses with redundant spacing or hallucinated content), that evolutionary iteration outperforms flat best-of-N search, and that the pipeline's recall can be validated via synthetically injected biases.
Link: https://arxiv.org/abs/2602.15222
Authors: Atticus Wang, Iván Arcuschin, Arthur Conmy
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious or undesirable attributes such as length, format, hallucinations, and sycophancy. In this work, we introduce and study the research problem of automatically finding reward model biases in natural language. We offer a simple approach of using an LLM to iteratively propose and refine candidate biases. Our method can recover known biases and surface novel ones: for example, we found that Skywork-V2-8B, a leading open-weight reward model, often mistakenly favors responses with redundant spacing and responses with hallucinated content. In addition, we show evidence that evolutionary iteration outperforms flat best-of-N search, and we validate the recall of our pipeline using synthetically injected biases. We hope our work contributes to further research on improving RMs through automated interpretability methods.
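A sketch of the propose-score-refine loop; `proposer_llm` and `paired_rewrite` are hypothetical stand-ins, and the score (how often the reward model prefers a rewrite that adds only the hypothesized attribute) is one natural choice, not necessarily the paper's.

```python
def find_reward_biases(proposer_llm, reward_model, paired_rewrite, prompts,
                       generations=5, pool=8):
    """Evolutionary search for natural-language bias hypotheses.
    `proposer_llm(parents) -> list[str]` proposes/refines hypotheses and
    `paired_rewrite(text, hypothesis) -> str` applies a hypothesized
    attribute (e.g. redundant spacing) to a response; both are assumed
    stand-ins. A hypothesis scores highly when the reward model prefers
    rewritten responses purely for the spurious attribute."""
    def score(h):
        flips = 0
        for prompt, response in prompts:
            biased = paired_rewrite(response, h)
            flips += reward_model(prompt, biased) > reward_model(prompt, response)
        return flips / len(prompts)

    population = proposer_llm(parents=[])
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)[:pool]
        population = ranked + proposer_llm(parents=ranked)   # refine the best
    return sorted(population, key=score, reverse=True)
```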
[AI-54] Secure and Energy-Efficient Wireless Agentic AI Networks
【Quick Read】: This paper addresses resource optimization and security for cooperative reasoning among multiple AI agents over wireless networks: provisioning quality of service (QoS) for users' reasoning tasks while keeping private knowledge and reasoning outcomes confidential. The core challenge is to dynamically assign AI agents to cooperative reasoning, use unselected agents as friendly jammers against eavesdroppers, and minimize network energy consumption under latency and accuracy constraints. The key to the solution is two resource allocation schemes, ASC (iterative optimization via ADMM, semi-definite relaxation, and successive convex approximation) and LAW (an agentic workflow built around an LLM optimizer), both of which decompose the problem into agent selection, base-station beamforming, and agent transmit-power sub-problems; experiments show up to 59.1% lower network energy consumption with satisfactory reasoning accuracy.
Link: https://arxiv.org/abs/2602.15212
Authors: Yuanyan Song, Kezhi Wang, Xinmian Xu
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted to journal
Abstract:In this paper, we introduce a secure wireless agentic AI network comprising one supervisor AI agent and multiple other AI agents to provision quality of service (QoS) for users’ reasoning tasks while ensuring confidentiality of private knowledge and reasoning outcomes. Specifically, the supervisor AI agent can dynamically assign other AI agents to participate in cooperative reasoning, while the unselected AI agents act as friendly jammers to degrade the eavesdropper’s interception performance. To extend the service duration of AI agents, an energy minimization problem is formulated that jointly optimizes AI agent selection, base station (BS) beamforming, and AI agent transmission power, subject to latency and reasoning accuracy constraints. To address the formulated problem, we propose two resource allocation schemes, ASC and LAW, which first decompose it into three sub-problems. Specifically, ASC optimizes each sub-problem iteratively using the proposed alternating direction method of multipliers (ADMM)-based algorithm, semi-definite relaxation (SDR), and successive convex approximation (SCA), while LAW tackles each sub-problem using the proposed large language model (LLM) optimizer within an agentic workflow. The experimental results show that the proposed solutions can reduce network energy consumption by up to 59.1% compared to other benchmark schemes. Furthermore, the proposed schemes are validated using a practical agentic AI system based on Qwen, demonstrating satisfactory reasoning accuracy across various public benchmarks.
[AI-55] MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference
【Quick Read】: This paper addresses the difficulty of jointly modeling heterogeneous feedback (demonstrations, comparisons, ratings, and stops) in reward learning; prior methods rely on a single feedback type or combine feedback via manually weighted loss terms. The key is to formulate multi-feedback reward learning as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood, and to propose a scalable amortized variational inference approach that jointly trains a shared reward encoder and feedback-specific likelihood decoders by optimizing a single evidence lower bound (ELBO), avoiding both a common intermediate representation and manual loss balancing.
Link: https://arxiv.org/abs/2602.15206
Authors: Raphaël Baur, Yannick Metz, Maria Gkoulta, Mennatallah El-Assady, Giorgia Ramponi, Thomas Kleine Buening
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 7 figures
Abstract:Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. Currently, it remains unclear how to jointly learn reward functions from heterogeneous feedback types such as demonstrations, comparisons, ratings, and stops that provide qualitatively different signals. We address this challenge by formulating reward learning from multiple feedback types as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders and is trained by optimizing a single evidence lower bound. Our approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing. Across discrete and continuous-control benchmarks, we show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations. The inferred reward uncertainty further provides interpretable signals for analyzing model confidence and consistency across feedback types.
[AI-56] Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs
【Quick Read】: This paper addresses the limited understanding of how large language models make decisions under uncertainty, in particular how risky choices are affected by prospect representation (explicit vs. experience-based) and decision rationale (explanation). Comparing 20 frontier and open LLMs along these two dimensions, with a matched human-subjects experiment and a rational expected-payoff-maximizing agent as reference points, the study finds that LLMs cluster into reasoning models (RMs) and conversational models (CMs): RMs behave near-rationally and are insensitive to prospect ordering, gain/loss framing, and explanations, with no significant difference between explicit and experience-based representations, whereas CMs are more susceptible to such factors, exhibit a large description-history gap, and are less rational. Paired comparisons of open LLMs suggest that training for mathematical reasoning is a key factor separating RMs from CMs.
Link: https://arxiv.org/abs/2602.15173
Authors: Luise Ge, Yongyan Zhang, Yevgeniy Vorobeychik
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The use of large language models either as decision support systems, or in agentic workflows, is rapidly transforming the digital ecosystem. However, the understanding of LLM decision-making under uncertainty remains limited. We initiate a comparative study of LLM risky choices along two dimensions: (1) prospect representation (explicit vs. experience based) and (2) decision rationale (explanation). Our study, which involves 20 frontier and open LLMs, is complemented by a matched human subjects experiment, which provides one reference point, while an expected payoff maximizing rational agent model provides another. We find that LLMs cluster into two categories: reasoning models (RMs) and conversational models (CMs). RMs tend towards rational behavior, are insensitive to the order of prospects, gain/loss framing, and explanations, and behave similarly whether prospects are explicit or presented via experience history. CMs are significantly less rational, slightly more human-like, sensitive to prospect ordering, framing, and explanation, and exhibit a large description-history gap. Paired comparisons of open LLMs suggest that a key factor differentiating RMs and CMs is training for mathematical reasoning.
[AI-57] Exploiting Layer-Specific Vulnerabilities to Backdoor Attack in Federated Learning
【Quick Read】: This paper addresses the new security vulnerabilities exposed by the decentralized nature of federated learning (FL), in particular backdoor attacks on model integrity that existing defenses fail to detect or resist. The key to the solution is the Layer Smoothing Attack (LSA): first, a Layer Substitution Analysis methodology systematically identifies the backdoor-critical (BC) layers that contribute most to backdoor success; LSA then precisely manipulates these BC layers to inject persistent backdoors while preserving main-task accuracy and bypassing state-of-the-art FL defenses. Experiments across diverse architectures and datasets show backdoor success rates of up to 97%, revealing fundamental gaps in current FL security frameworks and underscoring the need for layer-aware detection and mitigation in future defenses.
Link: https://arxiv.org/abs/2602.15161
Authors: Mohammad Hadi Foroughi, Seyed Hamed Rastegar, Mohammad Sabokrou, Ahmad Khonsari
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper has been accepted for publication in IEEE ICC 2026
Abstract:Federated learning (FL) enables distributed model training across edge devices while preserving data locality. This decentralized approach has emerged as a promising solution for collaborative learning on sensitive user data, effectively addressing the longstanding privacy concerns inherent in centralized systems. However, the decentralized nature of FL exposes new security vulnerabilities, especially backdoor attacks that threaten model integrity. To investigate this critical concern, this paper presents the Layer Smoothing Attack (LSA), a novel backdoor attack that exploits layer-specific vulnerabilities in neural networks. First, a Layer Substitution Analysis methodology systematically identifies backdoor-critical (BC) layers that contribute most significantly to backdoor success. Subsequently, LSA strategically manipulates these BC layers to inject persistent backdoors while remaining undetected by state-of-the-art defense mechanisms. Extensive experiments across diverse model architectures and datasets demonstrate that LSA achieves a remarkable backdoor success rate of up to 97% while maintaining high model accuracy on the primary task, consistently bypassing modern FL defenses. These findings uncover fundamental vulnerabilities in current FL security frameworks, demonstrating that future defenses must incorporate layer-aware detection and mitigation strategies.
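Layer Substitution Analysis has a natural reading as a swap-and-measure probe; here is a PyTorch-style sketch under that assumption, where `backdoor_success_rate(model) -> float` is an assumed evaluation on trigger-stamped inputs.

```python
import copy

def layer_substitution_analysis(benign_model, backdoored_model,
                                layer_names, backdoor_success_rate):
    """Score each layer by how much of the backdoor it carries: copy a
    single backdoored layer into the benign model and measure the attack
    success rate (PyTorch-style models with prefixed state_dict keys)."""
    scores = {}
    for name in layer_names:
        probe = copy.deepcopy(benign_model)
        probe.load_state_dict({**probe.state_dict(),
                               **{k: v for k, v in
                                  backdoored_model.state_dict().items()
                                  if k.startswith(name)}})
        scores[name] = backdoor_success_rate(probe)
    # layers with the highest scores are backdoor-critical (BC) layers
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```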
[AI-58] Panini: Continual Learning in Token Space via Structured Memory
【Quick Read】: This paper addresses two core problems of retrieval-augmented generation (RAG) at inference time: inefficient use of test-time compute, since the LLM repeatedly reasons over the same document chunks, and unsupported generation caused by irrelevant retrieved context. The key to the solution is Panini, a human-like non-parametric continual learning framework that represents documents as a Generative Semantic Workspace (GSW): an entity- and event-aware network of question-answer (QA) pairs that efficiently structures and continually consolidates experience. Documents are semantically modeled at write time, and at read time the system traverses only the continually updated GSW rather than verbatim documents or chunks, cutting answer-context token usage by 2-30x while improving accuracy and reliability.
Link: https://arxiv.org/abs/2602.15156
Authors: Shreyas Rajesh, Pavan Holur, Mehmet Yigit Turali, Chenda Duan, Vwani Roychowdhury
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 35 pages, code available at: this https URL
Abstract:Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW) – an entity- and event-aware network of question-answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, Panini only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance, 5%-7% higher than other competitive baselines, while using 2-30x fewer answer-context tokens, supports fully open-source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time – as achieved by the GSW framework – yields both efficiency and reliability gains at read time. Code is available at this https URL.
[AI-59] PolyNODE: Variable-dimension Neural ODEs on M-polyfolds
【Quick Read】: This paper addresses the inherent fixed-dimensionality of neural ordinary differential equations (NODEs) in geometric deep learning: existing NODE models are constrained by the invariant dimension of the underlying manifold and cannot handle variable-dimensional dynamics. The key is to extend NODEs to M-polyfolds, spaces that simultaneously accommodate varying dimensions and a notion of differentiability, yielding PolyNODEs, the first variable-dimensional flow-based model in geometric deep learning. The authors construct explicit M-polyfolds featuring dimensional bottlenecks and design parameterized vector fields that traverse them, enabling PolyNODE autoencoders to be trained and validated on reconstruction tasks and downstream classification.
Link: https://arxiv.org/abs/2602.15128
Authors: Per Åhag, Alexander Friedrich, Fredrik Ohlsson, Viktor Vigren Näslund
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural ordinary differential equations (NODEs) are geometric deep learning models based on dynamical systems and flows generated by vector fields on manifolds. Despite numerous successful applications, particularly within the flow matching paradigm, all existing NODE models are fundamentally constrained to fixed-dimensional dynamics by the intrinsic nature of the manifold’s dimension. In this paper, we extend NODEs to M-polyfolds (spaces that can simultaneously accommodate varying dimensions and a notion of differentiability) and introduce PolyNODEs, the first variable-dimensional flow-based model in geometric deep learning. As an example application, we construct explicit M-polyfolds featuring dimensional bottlenecks and PolyNODE autoencoders based on parametrised vector fields that traverse these bottlenecks. We demonstrate experimentally that our PolyNODE models can be trained to solve reconstruction tasks in these spaces, and that latent representations of the input can be extracted and used to solve downstream classification tasks. The code used in our experiments is publicly available at this https URL .
[AI-60] ResearchGym: Evaluating Language Model Agents on Real-World AI Research
【Quick Read】: This paper addresses the lack of a systematic way to evaluate AI agents on end-to-end research. ResearchGym repurposes five oral and spotlight papers from ICML, ICLR, and ACL into a benchmark of 39 sub-tasks across containerized, reproducible task environments, each preserving the original datasets, evaluation harness, and baseline implementations while withholding the paper's proposed method. The key contribution is this execution environment plus controlled evaluations of frontier agents (GPT-5, Claude Code, and Codex), which reveal a sharp capability-reliability gap and recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard context-length limits, motivating systematic analysis and improvement of autonomous research agents.
Link: https://arxiv.org/abs/2602.15112
Authors: Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper’s repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper’s proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper’s metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability–reliability gap. The agent improves over the provided baselines from the repository in just 1 of 15 evaluations (6.7%) by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds including Claude Code (Opus-4.5) and Codex (GPT-5.2) which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
[AI-61] S-PRESSO: Ultra Low Bitrate Sound Effect Compression With Diffusion Autoencoders And Offline Quantization
【Quick Read】: This paper addresses the substantial quality degradation of existing audio compression methods at ultra-low bitrates (e.g., below 0.1 kbps), where audible artifacts become prominent. The key to the solution is S-PRESSO, a 48kHz sound-effect compression model that uses a pretrained latent diffusion model as the decoder and offline quantization to produce both continuous and discrete embeddings, achieving high-quality reconstruction at compression rates up to 750x (frame rates down to 1 Hz). By leveraging the generative prior of the diffusion decoder, S-PRESSO trades exact fidelity for perceptual quality and acoustic similarity, outperforming both continuous and discrete baselines.
Link: https://arxiv.org/abs/2602.15082
Authors: Zineb Lahrichi (IP Paris), Gaëtan Hadjeres, Gaël Richard (IP Paris), Geoffroy Peeters (IP Paris)
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1Hz (750x compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics.
[AI-62] Structure-Aware Piano Accompaniment via Style Planning and Dataset-Aligned Pattern Retrieval
【Quick Read】: This paper addresses the difficulty of jointly achieving stylistic consistency and structural coherence in symbolic piano accompaniment generation: existing methods rarely decouple high-level structural planning from note-level realization, so long-form accompaniments drift in style or lose structural balance. The key is a structure-aware hierarchical framework: a lightweight transformer predicts an interpretable per-measure style plan conditioned on section/phrase structure and functional harmony, and a retrieval module then selects and reharmonizes human-performed piano patterns from a corpus. Retrieval is formulated as minimizing an explicit energy with terms for harmonic feasibility, structural-role compatibility, voice-leading continuity, style preferences, and repetition control, cleanly separating high-level intent from low-level detail while improving style fidelity and diversity.
Link: https://arxiv.org/abs/2602.15074
Authors: Wanyu Zang, Yang Yu, Meng Yu
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 12 pages
Abstract:We introduce a structure-aware approach for symbolic piano accompaniment that decouples high-level planning from note-level realization. A lightweight transformer predicts an interpretable, per-measure style plan conditioned on section/phrase structure and functional harmony, and a retriever then selects and reharmonizes human-performed piano patterns from a corpus. We formulate retrieval as pattern matching under an explicit energy with terms for harmonic feasibility, structural-role compatibility, voice-leading continuity, style preferences, and repetition control. Given a structured lead sheet and optional keyword prompts, the system generates piano-accompaniment MIDI. In our experiments, transformer style-planner-guided retrieval produces diverse long-form accompaniments with strong style realization. We further analyze planner ablations and quantify inter-style isolation. Experimental results demonstrate the effectiveness of our inference-time approach for piano accompaniment generation.
[AI-63] An effective Genetic Programming Hyper-Heuristic for Uncertain Agile Satellite Scheduling
【Quick Read】: This paper tackles the Uncertain Agile Earth Observation Satellite Scheduling Problem (UAEOSSP), which, unlike the static formulation, models uncertainty in task profit, resource consumption, and task visibility to reflect that actual information is unknown beforehand. The key to the solution is a Genetic Programming Hyper-Heuristic (GPHH) that automates the generation of scheduling policies for real-time plan adjustment; experiments show the evolved policies significantly outperform Manually Designed Heuristics (MDHs) and Look-Ahead Heuristics (LAHs), with average improvements of 8.14% and 5.03%, respectively.
Link: https://arxiv.org/abs/2602.15070
Authors: Yuning Chen, Junhua Xue, Wangqi Gu, Mingyan Shao
Affiliations: unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 8 pages; 4 figures; 9 tables
Abstract:This paper investigates a novel problem, namely the Uncertain Agile Earth Observation Satellite Scheduling Problem (UAEOSSP). Unlike the static AEOSSP, it takes into account a range of uncertain factors (e.g., task profit, resource consumption, and task visibility) in order to reflect the reality that the actual information is inherently unknown beforehand. An effective Genetic Programming Hyper-Heuristic (GPHH) is designed to automate the generation of scheduling policies. The evolved scheduling policies can be utilized to adjust plans in real time and perform exceptionally well. Experimental results demonstrate that evolved scheduling policies significantly outperform both well-designed Look-Ahead Heuristics (LAHs) and Manually Designed Heuristics (MDHs). Specifically, the policies generated by GPHH achieve an average improvement of 5.03% compared to LAHs and 8.14% compared to MDHs.
[AI-64] Attention-gated U-Net model for semantic segmentation of brain tumors and feature extraction for survival prognosis
【速读】:该论文旨在解决胶质瘤(Gliomas)在临床治疗中因异质性强、预后差异大及影像学特征复杂而导致的脑肿瘤分割精度不足与生存预测困难的问题。其解决方案的关键在于提出一种基于注意力门控的循环残差U-Net(Attention-Gated Recurrent Residual U-Net, R2U-Net)的三平面(Triplanar,2.5D)模型,通过融合残差连接(residual connections)、循环结构(recurrent architecture)和多平面输入策略,有效增强特征表示能力并提升分割准确性;同时,利用三平面网络提取的64维特征经人工神经网络(ANN)降维至28维,用于生存天数预测,实现了分割与预后分析的联合建模,在保证计算效率的同时显著提升了任务性能。
链接: https://arxiv.org/abs/2602.15067
作者: Rut Pate,Snehal Rajput,Mehul S. Raval,Rupal A. Kapdi,Mohendra Roy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Gliomas, among the most common primary brain tumors, vary widely in aggressiveness, prognosis, and histology, making treatment challenging due to complex and time-intensive surgical interventions. This study presents an Attention-Gated Recurrent Residual U-Net (R2U-Net) based Triplanar (2.5D) model for improved brain tumor segmentation. The proposed model enhances feature representation and segmentation accuracy by integrating residual, recurrent, and triplanar architectures while maintaining computational efficiency, potentially aiding in better treatment planning. The proposed method achieves a Dice Similarity Score (DSC) of 0.900 for Whole Tumor (WT) segmentation on the BraTS2021 validation set, demonstrating performance comparable to leading models. Additionally, the triplanar network extracts 64 features per planar model for survival days prediction, which are reduced to 28 using an Artificial Neural Network (ANN). This approach achieves an accuracy of 45.71%, a Mean Squared Error (MSE) of 108,318.128, and a Spearman Rank Correlation Coefficient (SRC) of 0.338 on the test dataset.
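作为补充,下面给出摘要中 Dice 相似系数(DSC)这一分割评测指标的最小实现草图(数据为随机玩具掩码,仅用于说明指标定义,与论文模型无关):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """二值分割的 Dice 相似系数:2|A∩B| / (|A| + |B|)。"""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# 玩具示例:随机 3D 掩码(真实评测中 pred 来自分割网络,target 为标注)
rng = np.random.default_rng(0)
pred = rng.random((64, 64, 64)) > 0.5
target = rng.random((64, 64, 64)) > 0.5
print(f"DSC = {dice_score(pred, target):.3f}")
```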
[AI-65] Safe-SDL: Establishing Safety Boundaries and Control Mechanisms for AI-Driven Self-Driving Laboratories
【速读】:该论文旨在解决自驱动实验室(Self-Driving Laboratories, SDLs)在部署过程中面临的前所未有的安全挑战,尤其是AI生成的语法正确指令与物理执行安全性之间的“语法到安全缺口”(Syntax-to-Safety Gap)问题。解决方案的关键在于提出一个名为Safe-SDL的综合框架,其核心由三个协同机制构成:(1) 形式化定义的操作设计域(Operational Design Domains, ODDs),通过数学验证边界约束系统行为;(2) 控制屏障函数(Control Barrier Functions, CBFs),实现基于连续状态空间监控的实时安全保证;(3) 一种新颖的事务性安全协议(CRUTD),确保数字规划与物理执行之间的原子一致性。该框架通过理论建模与现有系统(如UniLabOS和Osprey架构)的实例分析,证明了架构级安全机制对保障AI驱动实验系统的可靠运行至关重要。
链接: https://arxiv.org/abs/2602.15061
作者: Zihan Zhang,Haohui Que,Junhan Chang,Xin Zhang,Hao Wei,Tong Zhu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The emergence of Self-Driving Laboratories (SDLs) transforms scientific discovery methodology by integrating AI with robotic automation to create closed-loop experimental systems capable of autonomous hypothesis generation, experimentation, and analysis. While promising to compress research timelines from years to weeks, their deployment introduces unprecedented safety challenges differing from traditional laboratories or purely digital AI. This paper presents Safe-SDL, a comprehensive framework for establishing robust safety boundaries and control mechanisms in AI-driven autonomous laboratories. We identify and analyze the critical "Syntax-to-Safety Gap" – the disconnect between AI-generated syntactically correct commands and their physical safety implications – as the central challenge in SDL deployment. Our framework addresses this gap through three synergistic components: (1) formally defined Operational Design Domains (ODDs) that constrain system behavior within mathematically verified boundaries, (2) Control Barrier Functions (CBFs) that provide real-time safety guarantees through continuous state-space monitoring, and (3) a novel Transactional Safety Protocol (CRUTD) that ensures atomic consistency between digital planning and physical execution. We ground our theoretical contributions through analysis of existing implementations including UniLabOS and the Osprey architecture, demonstrating how these systems instantiate key safety principles. Evaluation against the LabSafety Bench reveals that current foundation models exhibit significant safety failures, demonstrating that architectural safety mechanisms are essential rather than optional. Our framework provides both theoretical foundations and practical implementation guidance for safe deployment of autonomous scientific systems, establishing the groundwork for responsible acceleration of AI-driven discovery.
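摘要中提到的控制屏障函数(CBF)可以用一个一维玩具系统直观说明:只要控制输入满足 ḣ + αh ≥ 0,状态就不会离开安全集。以下为示意代码,系统与参数均为假设,与论文实现无关:

```python
import numpy as np

# 极简 CBF 安全滤波示意(一维积分器,闭式解)
# 系统: x_dot = u;安全集 h(x) = x_max - x >= 0
# CBF 条件: h_dot + alpha*h >= 0  =>  -u + alpha*(x_max - x) >= 0
def cbf_filter(x, u_nominal, x_max=1.0, alpha=2.0):
    u_bound = alpha * (x_max - x)   # 允许的最大"冲向边界"速度
    return min(u_nominal, u_bound)  # 取最接近名义指令的安全输入

x, dt = 0.0, 0.01
for step in range(300):
    u = cbf_filter(x, u_nominal=1.0)  # 规划器总想全速前进
    x += u * dt
print(f"最终状态 x = {x:.4f} (始终不越过 x_max = 1.0)")
```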
[AI-66] CLOT: Closed-Loop Global Motion Tracking for Whole-Body Humanoid Teleoperation
【速读】:该论文旨在解决全尺寸人形机器人在长时间操作中因全局位姿漂移(global pose drift)导致的运动失稳问题,尤其在基于学习的跟踪方法通常仅在机器人局部坐标系下运行、忽略全局位姿反馈的情况下更为显著。其解决方案的关键在于提出一种实时闭环全身人形机器人遥操作系统CLOT,通过高频定位反馈实现全局运动的闭环跟踪,从而实现长时间尺度下的无漂移人类动作模仿。为避免直接施加全局跟踪奖励导致策略过于激进和脆弱,作者设计了一种数据驱动的随机化策略,将观测轨迹与奖励评估解耦,以实现平滑稳定的全局校正;同时引入对抗性运动先验对策略进行正则化,抑制不自然行为。该方案在仿真与真实世界实验中均验证了高动态性、高精度跟踪及强鲁棒性的性能表现。
链接: https://arxiv.org/abs/2602.15060
作者: Tengjie Zhu,Guanyu Cai,Yang Zhaohui,Guanzhu Ren,Haohui Xie,ZiRui Wang,Junsong Wu,Jingbo Wang,Xiaokang Yang,Yao Mu,Yichao Yan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon whole-body humanoid teleoperation remains challenging due to accumulated global pose drift, particularly on full-sized humanoids. Although recent learning-based tracking methods enable agile and coordinated motions, they typically operate in the robot’s local frame and neglect global pose feedback, leading to drift and instability during extended execution. In this work, we present CLOT, a real-time whole-body humanoid teleoperation system that achieves closed-loop global motion tracking via high-frequency localization feedback. CLOT synchronizes operator and robot poses in a closed loop, enabling drift-free human-to-humanoid mimicry over long time horizons. However, directly imposing global tracking rewards in reinforcement learning often results in aggressive and brittle corrections. To address this, we propose a data-driven randomization strategy that decouples observation trajectories from reward evaluation, enabling smooth and stable global corrections. We further regularize the policy with an adversarial motion prior to suppress unnatural behaviors. To support CLOT, we collect 20 hours of carefully curated human motion data for training the humanoid teleoperation policy. We design a transformer-based policy and train it for over 1300 GPU hours. The policy is deployed on a full-sized humanoid with 31 DoF (excluding hands). Both simulation and real-world experiments verify high-dynamic motion, high-precision tracking, and strong robustness in sim-to-real humanoid teleoperation. Motion data, demos and code can be found in our website.
[AI-67] CircuChain: Disentangling Competence and Compliance in LLM Circuit Analysis
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在工程领域任务中,如电路分析(circuit analysis)中,尽管具备较强的物理推理能力,却可能因遵循训练数据中的隐含先验而违背用户明确指定的约束条件(如符号约定、电流方向或极性定义)的问题。这种“合规性-能力分离”现象可能导致生成结果在数值上正确但违反方法论规范,从而在安全关键系统中引发风险。解决方案的关键在于提出CircuChain这一诊断基准,通过设计成对的控制组与陷阱问题(Control/Trap problem pairs),系统性地扰动符号规则和方向设定,并结合符号求解器、SPICE仿真及基于LLM的错误分类体系,实现对失败原因的细粒度归因(区分惯例错误、物理错误、算术错误或幻觉)。实验表明,强模型虽具高物理推理能力,却易违反显式指令;弱模型则更倾向遵守指令,揭示了模型能力提升并不自动带来约束对齐,强调需构建针对数学刚性领域中指令遵循能力的新评估框架。
链接: https://arxiv.org/abs/2602.15037
作者: Mayank Ravishankara
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) advance toward expert-level performance in engineering domains, reliable reasoning under user-specified constraints becomes critical. In circuit analysis, for example, a numerically correct solution is insufficient if it violates established methodological conventions such as mesh directionality or polarity assignments, errors that can propagate in safety-critical systems. Yet it remains unclear whether frontier models truly apply first-principles reasoning or rely on entrenched training priors that conflict with explicit instructions. We introduce CircuChain, a diagnostic benchmark designed to disentangle instruction compliance from physical reasoning competence in electrical circuit analysis. CircuChain consists of counterbalanced Control/Trap problem pairs across five canonical circuit topologies, augmented with systematic variations in sign conventions, current orientations, and polarity definitions. A multi-stage verification pipeline, combining symbolic solvers, SPICE simulation, and an LLM-based error taxonomy, enables fine-grained attribution of failures to convention errors, physics errors, arithmetic mistakes, or hallucinations. Across 100 tasks per model, we observe a consistent Compliance-Competence Divergence. The strongest model evaluated exhibits near-perfect physical reasoning but a high rate of convention violations when Trap conditions deliberately invert natural sign patterns. Conversely, weaker models display lower physical fidelity yet superior adherence to explicit instructions. These results suggest that increased model capability does not guarantee improved constraint alignment and highlight the need for new evaluation frameworks that stress instruction-following under mathematically rigid domains. CircuChain provides one such framework and offers actionable insights for both engineering education and AI alignment research.
[AI-68] Decision Quality Evaluation Framework at Pinterest
【速读】:该论文旨在解决在线平台在内容安全政策执行中面临的挑战,即如何高效、可信地评估由人工审核员和大语言模型(Large Language Models, LLMs)做出的 moderation 决策质量。由于成本、规模与可信度之间存在权衡,且政策持续演进,传统主观评估方法难以满足实际需求。解决方案的关键在于提出一个全面的决策质量评估框架(Decision Quality Evaluation Framework),其核心是通过领域专家(Subject Matter Experts, SMEs)构建的高可信黄金数据集(Golden Dataset, GDS)作为基准,并结合基于倾向得分(propensity scores)的自动化智能采样管道,实现数据覆盖的高效扩展。该框架支持LLM代理的成本-性能对比、提示优化的数据驱动方法、政策演化管理及内容流行度指标的持续验证,从而推动内容安全系统从主观判断向量化、可衡量的实践转变。
链接: https://arxiv.org/abs/2602.15809
作者: Yuqi Tian,Robert Paine,Attila Dobi,Kevin O’Sullivan,Aravindh Manickavasagam,Faisal Farooq
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注:
Abstract:Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inherent trade-offs between cost, scale, and trustworthiness, along with the complexity of evolving policies. To address this, we present a comprehensive Decision Quality Evaluation Framework developed and deployed at Pinterest. The framework is centered on a high-trust Golden Set (GDS) curated by subject matter experts (SMEs), which serves as a ground truth benchmark. We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. We demonstrate the framework’s practical application in several key areas: benchmarking the cost-performance trade-offs of various LLM agents, establishing a rigorous methodology for data-driven prompt optimization, managing complex policy evolution, and ensuring the integrity of policy content prevalence metrics via continuous validation. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.
[AI-69] Molecular Design beyond Training Data with Novel Extended Objective Functionals of Generative AI Models Driven by Quantum Annealing Computer
【速读】:该论文旨在解决分子生成模型中药物样化合物(drug-like compounds)频率较低的问题,从而提升生成分子的质量与实用性。其解决方案的关键在于提出了一种融合D-Wave量子退火计算机的新型优化框架,其中核心创新是引入神经哈希函数(Neural Hash Function, NHF),该函数同时作为正则化和二值化机制:一方面用于约束生成过程,另一方面实现经典神经网络与量子神经网络之间连续信号与离散信号的转换,以在目标函数中准确评估误差。实验表明,该量子退火驱动的生成模型不仅在有效性和药物相似性方面优于纯经典模型,甚至在未施加任何额外约束条件下超越了训练数据的药物相似性特征,证明了量子计算在扩展特征空间采样和提取关键药物设计特征方面的优势。
链接: https://arxiv.org/abs/2602.15451
作者: Hayato Kunugi,Mohsen Rahmani,Yosuke Iyama,Yutaro Hirono,Akira Suma,Matthew Woolway,Vladimir Vargas-Calderón,William Kim,Kevin Chern,Mohammad Amin,Masaru Tateno
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
备注: 42 pages, 7 figures
Abstract:Deep generative modeling to stochastically design small molecules is an emerging technology for accelerating drug discovery and development. However, one major issue in molecular generative models is their lower frequency of drug-like compounds. To resolve this problem, we developed a novel framework for optimization of deep generative models integrated with a D-Wave quantum annealing computer, where our Neural Hash Function (NHF) presented herein is used both as the regularization and binarization schemes simultaneously, of which the latter is for transformation between continuous and discrete signals of the classical and quantum neural networks, respectively, in the error evaluation (i.e., objective) function. The compounds generated via the quantum-annealing generative models exhibited higher quality in both validity and drug-likeness than those generated via the fully-classical models, and was further indicated to exceed even the training data in terms of drug-likeness features, without any restraints and conditions to deliberately induce such an optimization. These results indicated an advantage of quantum annealing to aim at a stochastic generator integrated with our novel neural network architectures, for the extended performance of feature space sampling and extraction of characteristic features in drug design.
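摘要未公开 NHF 的具体构造;下面仅以二值化神经网络中常见的"符号函数 + 直通估计器(STE)"做一个通用示意,说明连续潜变量如何转为 ±1 离散信号且梯度仍可回传(纯属假设性草图,并非论文的 NHF 实现):

```python
import torch

class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # 直通估计器:|x| <= 1 时梯度原样通过,否则截断为 0
        return grad_out * (x.abs() <= 1).float()

z = torch.randn(4, 16, requires_grad=True)  # 经典网络输出的连续潜变量
b = SignSTE.apply(z)                        # 送往量子退火器的 ±1 离散变量
loss = (b * torch.randn(4, 16)).sum()       # 任意下游目标,仅作示意
loss.backward()
print(z.grad.shape)                         # torch.Size([4, 16]),梯度可回传到经典网络
```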
[AI-70] SCENE OTA-FD: Self-Centering Noncoherent Estimator for Over-the-Air Federated Distillation
【速读】:该论文旨在解决过空气联邦蒸馏(OTA-FD)中因信道状态信息(CSI)获取开销大、硬件约束严而导致的聚合精度下降与系统效率低下的问题。解决方案的关键在于提出一种无导频、相位无关的聚合原语SCENE(Self-Centering Noncoherent Estimator),其通过将软标签(soft-label)向量映射为恒定功率和恒定包络信号(PAPR接近1)下的非负发射能量,在服务器端利用自中心化能量估计器消除噪声偏移,实现对加权软标签平均值的无偏估计,且方差随接收天线数M和重复因子S以1/(SM)的速率衰减;同时设计了无导频比值归一化变体以抵消未知的大尺度衰落增益,并在收敛性上与相干OTA-FD分析一致,从而在短相干时延和硬件受限场景下实现零上行导频开销、无偏聚合与硬件友好传输的协同优化。
链接: https://arxiv.org/abs/2602.15326
作者: Hao Chen,Zavareh Bozorgasl
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Work in progress. Codes will be available on: this https URL
Abstract:We propose SCENE (Self-Centering Noncoherent Estimator), a pilot-free and phase-invariant aggregation primitive for over-the-air federated distillation (OTA-FD). Each device maps its soft-label (class-probability) vector to nonnegative transmit energies under constant per-round power and constant-envelope signaling (PAPR near 1). At the server, a self-centering energy estimator removes the noise-energy offset and yields an unbiased estimate of the weighted soft-label average, with variance decaying on the order of 1/(SM) in the number of receive antennas M and repetition factor S. We also develop a pilot-free ratio-normalized variant that cancels unknown large-scale gains, provide a convergence bound consistent with coherent OTA-FD analyses, and present an overhead-based crossover comparison. SCENE targets short-coherence and hardware-constrained regimes, where avoiding per-round CSI is essential: it trades a modest noncoherent variance constant for zero uplink pilots, unbiased aggregation, and hardware-friendly transmission, and can outperform coherent designs when pilot overhead is non-negligible.
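下面用 NumPy 模拟该能量检测式聚合的核心思想:设备把 w_k·p_k 映射为发射能量,服务器在 M 根天线、S 次重复上平均接收能量并减去噪声偏移,即得无偏估计。此处假设噪声方差已知,是论文"自中心化"步骤的简化;信道与参数均为玩具设定:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, M, S, sigma2 = 8, 10, 64, 4, 0.5  # 设备数/类别数/接收天线/重复因子/噪声方差

p = rng.dirichlet(np.ones(C), size=K)   # 每台设备的软标签(类别概率)向量
w = np.ones(K) / K                      # 聚合权重
target = w @ p                          # 希望恢复的加权软标签平均

est = np.zeros(C)
for c in range(C):
    amp = np.sqrt(w * p[:, c])          # 发射能量 = w_k * p_k[c],取幅度
    y = 0.0
    for _ in range(S):                  # S 次重复
        h = (rng.normal(size=(M, K)) + 1j * rng.normal(size=(M, K))) / np.sqrt(2)
        n = np.sqrt(sigma2 / 2) * (rng.normal(size=M) + 1j * rng.normal(size=M))
        r = h @ amp + n                 # 各天线上的非相干叠加,无需 CSI
        y += np.mean(np.abs(r) ** 2)    # 能量检测
    est[c] = y / S - sigma2             # "自中心化":减去噪声能量偏移

print("真实:", np.round(target, 3))
print("估计:", np.round(est, 3))        # 估计方差随 S*M 增大而衰减
```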
[AI-71] Tomography by Design: An Algebraic Approach to Low-Rank Quantum States
【速读】:该论文旨在解决量子态层析(quantum state tomography)中测量资源消耗高、计算复杂度大以及恢复不确定性的问题。其解决方案的关键在于提出一种代数矩阵补全(algebraic matrix completion)框架,通过测量特定可观测量(observables)来估计密度矩阵的结构化元素,在低秩假设下,剩余元素仅需借助标准数值线性代数运算即可恢复,从而在保证确定性恢复的前提下显著提升计算效率。
链接: https://arxiv.org/abs/2602.15202
作者: Shakir Showkat Sofi,Charlotte Vermeylen,Lieven De Lathauwer
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Numerical Analysis (math.NA); Computation (stat.CO)
备注: 5 pages, Submitted to EUSIPCO2026
Abstract:We present an algebraic algorithm for quantum state tomography that leverages measurements of certain observables to estimate structured entries of the underlying density matrix. Under low-rank assumptions, the remaining entries can be obtained solely using standard numerical linear algebra operations. The proposed algebraic matrix completion framework applies to a broad class of generic, low-rank mixed quantum states and, compared with state-of-the-art methods, is computationally efficient while providing deterministic recovery guarantees.
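代数式低秩补全可以用一个通用的 Schur 补恒等式来示意:对秩为 r 的矩阵,若已"测量"到前 r 行与前 r 列,剩余块可由标准线性代数直接补出。这只是低秩代数补全的一般性演示,并非论文的可观测量设计:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 2
X = rng.normal(size=(n, r))
A = X @ X.T                             # 秩为 r 的对称"密度矩阵"式矩阵(未归一化)

# 假设只"测量"了前 r 行与前 r 列的元素
A11, A12, A21 = A[:r, :r], A[:r, r:], A[r:, :r]

# 代数补全:秩为 r 且 A11 可逆(一般性条件)时,剩余块被唯一确定
A22 = A21 @ np.linalg.pinv(A11) @ A12

err = np.linalg.norm(A22 - A[r:, r:]) / np.linalg.norm(A[r:, r:])
print(f"相对重构误差 = {err:.2e}")      # 机器精度级别
```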
[AI-72] okaMind: A Multi-Modal Transformer Foundation Model for Tokamak Plasma Dynamics
【速读】:该论文旨在解决托卡马克(Tokamak)等离子体建模中多模态数据融合与高效任务适应的挑战,尤其针对来自不同诊断设备的时间序列、二维剖面和视频等异构数据的处理难题。其解决方案的关键在于提出一个基于多模态Transformer(Multi-Modal Transformer, MMT)的开源基础模型框架TokaMind,通过训练-free的离散余弦变换嵌入(DCT3D)实现对多模态信号的有效表征,并支持选择性加载与冻结四个模型组件以实现轻量级微调(lightweight fine-tuning)。该设计不仅提升了跨任务泛化能力,还在MST基准测试(TokaMark)上验证了其优于传统基线方法的性能表现,为未来聚变建模提供了可扩展且实用的基础架构。
链接: https://arxiv.org/abs/2602.15084
作者: Tobia Boschi,Andrea Loreti,Nicola C. Amorisco,Rodrigo H. Ordonez-Hurtado,Cécile Rousseau,George K. Holt,Eszter Székely,Alexander Whittle,Samuel Jackson,Adriano Agnello,Stanislas Pamela,Alessandra Pascale,Robert Akers,Juan Bernabe Moreno,Vassil Alexandrov,Mykhaylo Zayats
机构: 未知
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We present TokaMind, an open-source foundation model framework for fusion plasma modeling, based on a Multi-Modal Transformer (MMT) and trained on heterogeneous tokamak diagnostics from the publicly available MAST dataset. TokaMind supports multiple data modalities (time-series, 2D profiles, and videos) with different sampling rates, robust missing-signal handling, and efficient task adaptation via selectively loading and freezing four model components. To represent multi-modal signals, we use a training-free Discrete Cosine Transform embedding (DCT3D) and provide a clean interface for alternative embeddings (e.g., Variational Autoencoders - VAEs). We evaluate TokaMind on the recently introduced MAST benchmark TokaMark, comparing training and embedding strategies. Our results show that fine-tuned TokaMind outperforms the benchmark baseline on all but one task, and that, for several tasks, lightweight fine-tuning yields better performance than training the same architecture from scratch under a matched epoch budget. These findings highlight the benefits of multi-modal pretraining for tokamak plasma dynamics and provide a practical, extensible foundation for future fusion modeling tasks. Training code and model weights will be made publicly available.
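训练免费的 DCT 嵌入思想可用几行代码说明:对多模态信号块做正交 DCT 并保留低频系数作为 token。维度划分与截断方式均为假设,并非 DCT3D 的原始配置:

```python
import numpy as np
from scipy.fft import dctn

def dct_embed(x, keep=(4, 4, 4)):
    """对 (时间, 高, 宽) 的信号块做三维正交 DCT,保留低频系数作为嵌入。"""
    coeffs = dctn(x, norm="ortho")
    return coeffs[: keep[0], : keep[1], : keep[2]].ravel()

patch = np.random.default_rng(0).normal(size=(16, 32, 32))  # 一个诊断视频块(占位数据)
emb = dct_embed(patch)
print(emb.shape)  # (64,) 低维、确定性的嵌入,无需训练 VAE
```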
[AI-73] Structural Divergence Between AI-Agent and Human Social Networks in Moltbook
【速读】:该论文试图解决的问题是:在AI代理(AI agents)与人类共存的在线环境中,AI代理群体的集体交互模式是否与人类社会系统具有相似性,以及其结构特征是否存在本质差异。解决方案的关键在于对Moltbook平台中AI代理与人类共存的完整交互网络进行系统性分析,并将其结构特性与已知的人类通信网络进行对比,发现尽管两者遵循相同的节点-边缩放关系(表明全球增长约束一致),但AI代理网络表现出极端注意力不平等、重尾且不对称的度分布、 reciprocity(互惠性)抑制以及三角闭包结构的全局欠代表等内部组织机制的根本差异,从而揭示出人类社会结构特征并非普遍适用,而是依赖于交互主体的本质属性。
链接: https://arxiv.org/abs/2602.15064
作者: Wenpin Hou,Zhicheng Ji
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Large populations of AI agents are increasingly embedded in online environments, yet little is known about how their collective interaction patterns compare to human social systems. Here, we analyze the full interaction network of Moltbook, a platform where AI agents and humans coexist, and systematically compare its structure to well-characterized human communication networks. Although Moltbook follows the same node-edge scaling relationship observed in human systems, indicating comparable global growth constraints, its internal organization diverges markedly. The network exhibits extreme attention inequality, heavy-tailed and asymmetric degree distributions, suppressed reciprocity, and a global under-representation of connected triadic structures. Community analysis reveals a structured modular architecture with elevated modularity and comparatively lower community size inequality relative to degree-preserving null models. Together, these findings show that AI-agent societies can reproduce global structural regularities of human networks while exhibiting fundamentally different internal organizing principles, highlighting that key features of human social organization are not universal but depend on the nature of the interacting agents.
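文中对比的互惠性、三角闭包与注意力不平等等结构量,可用 networkx 在玩具图上演示计算方法(随机图仅作占位,并非 Moltbook 数据):

```python
import networkx as nx
import numpy as np

# 玩具有向交互网络
G = nx.gnp_random_graph(500, 0.01, directed=True, seed=0)

reciprocity = nx.overall_reciprocity(G)       # 互惠性:双向边的占比
transitivity = nx.transitivity(nx.Graph(G))   # 三角闭包(无向化后的全局聚类系数)
in_deg = np.array([d for _, d in G.in_degree()])

# 注意力不平等:入度分布的基尼系数
x = np.sort(in_deg)
gini = (2 * np.arange(1, len(x) + 1) - len(x) - 1) @ x / (len(x) * x.sum())

print(f"reciprocity={reciprocity:.3f}, transitivity={transitivity:.3f}, gini={gini:.3f}")
```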
[AI-74] Reconstructing Carbon Monoxide Reanalysis with Machine Learning
【速读】:该论文旨在解决因卫星观测数据中断(如MOPITT仪器在2025年初停止提供一氧化碳CO)导致的再分析产品(reanalysis products)质量下降问题,其核心挑战在于如何在缺乏关键观测数据的情况下维持大气成分再分析产品的准确性。解决方案的关键在于利用机器学习方法,通过学习控制模型模拟与实际观测之间的系统性偏差,从而预测月平均总柱一氧化碳(total column of Carbon Monoxide)的再分析值,以补偿观测数据缺失带来的影响。
链接: https://arxiv.org/abs/2602.15056
作者: Paula Harder,Johannes Flemming
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The Copernicus Atmospheric Monitoring Service provides reanalysis products for atmospheric composition by combining model simulations with satellite observations. The quality of these products depends strongly on the availability of the observational data, which can vary over time as new satellite instruments become available or are discontinued, such as Carbon Monoxide (CO) observations of the Measurements Of Pollution In The Troposphere (MOPITT) satellite in early 2025. Machine learning offers a promising approach to compensate for such data losses by learning systematic discrepancies between model configurations. In this study, we investigate machine learning methods to predict monthly-mean total column of Carbon Monoxide re-analysis from a control model simulation.
[AI-75] Combining scEEG and PPG for reliable sleep staging using lightweight wearables
【速读】:该论文旨在解决轻量化可穿戴设备(如单导联脑电图scEEG或光电容积脉搏波描记法PPG)在短窗口(30秒至30分钟)条件下进行可靠四分类睡眠分期的难题。其核心挑战在于:scEEG虽能直接反映皮层活动但对轻度睡眠阶段识别性能有限,而传统PPG方法依赖整夜长时段(8–10小时)输入以实现有效检测,难以支持即时睡眠干预反馈。解决方案的关键在于提出三种融合策略——评分级融合、交叉注意力融合(实现特征层面交互)以及基于Mamba架构增强的时序建模融合,并通过多数据集验证表明,Mamba-enhanced融合在MESA数据集上达到最佳性能(Cohen’s Kappa κ = 0.798,准确率Acc = 86.9%),尤其显著提升轻度睡眠分类效果(F1-score达85.63%,较仅用scEEG提高7.87个百分点),且具备跨人群泛化能力,为可穿戴设备实现高效、精准的睡眠监测提供了可行路径。
链接: https://arxiv.org/abs/2602.15042
作者: Jiawei Wang,Liang Xu,Shuntian Zheng,Yu Guan,Kaichen Wang,Ziqing Zhang,Chen Chen,Laurence T. Yang,Sai Gu
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable sleep staging remains challenging for lightweight wearable devices such as single-channel electroencephalography (scEEG) or photoplethysmography (PPG). scEEG offers direct measurement of cortical activity and serves as the foundation for sleep staging, yet exhibits limited performance on light sleep stages. PPG provides a low-cost complement that captures autonomic signatures effective for detecting light sleep. However, prior PPG-based methods rely on full night recordings (8 - 10 hours) as input context, which is less practical to provide timely feedback for sleep intervention. In this work, we investigate scEEG-PPG fusion for 4-class sleep staging under short-window (30 s - 30 min) constraints. First, we evaluate the temporal context required for each modality, to better understand the relationship of sleep staging performance with respect to monitoring window. Second, we investigate three fusion strategies: score-level fusion, cross-attention fusion enabling feature-level interactions, and Mamba-enhanced fusion incorporating temporal context modeling. Third, we train and evaluate on the Multi-Ethnic Study of Atherosclerosis (MESA) dataset and perform cross-dataset validation on the Cleveland Family Study (CFS) and the Apnea, Bariatric surgery, and CPAP (ABC) datasets. The Mamba-enhanced fusion achieves the best performance on MESA (Cohen’s Kappa κ = 0.798, Acc = 86.9%), with particularly notable improvement in light sleep classification (F1-score: 85.63% vs. 77.76%, recall: 82.85% vs. 69.95% for scEEG alone), and generalizes well to CFS and ABC datasets with different populations. These findings suggest that scEEG-PPG fusion is a promising approach for lightweight wearable based sleep monitoring, offering a pathway toward more accessible sleep health assessment. Source code of this project can be found at: this https URL
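三种融合策略中最简单的评分级融合(score-level fusion)可示意如下:对两个模态分支的 softmax 概率做加权平均后取 argmax。logits 为随机占位,权重为假设值,并非论文模型的输出:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 四分类:Wake / Light / Deep / REM
rng = np.random.default_rng(0)
logits_eeg = rng.normal(size=(10, 4))  # scEEG 分支对 10 个 epoch 的输出(占位)
logits_ppg = rng.normal(size=(10, 4))  # PPG 分支的输出(占位)

alpha = 0.6  # scEEG 权重,假设值
p_fused = alpha * softmax(logits_eeg) + (1 - alpha) * softmax(logits_ppg)
stages = p_fused.argmax(axis=1)
print(stages)
```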
[AI-76] GRACE: an Agentic AI for Particle Physics Experiment Design and Simulation
【速读】:该论文旨在解决高能物理与核物理实验中**实验设计(experimental design)**的自动化问题,即如何在满足物理规律和实际约束条件下,自主提出非显而易见的探测器几何、材料及配置优化方案,以提升物理性能。传统代理系统多聚焦于操作控制或预设流程执行,而本文提出的GRACE(Simulation-Native Agent for Autonomous Experimental Design)则面向上游设计阶段,其关键在于:通过自然语言提示或已发表论文提取结构化实验表示,构建可运行的简化模拟,并基于第一性原理蒙特卡洛方法进行自主探索;同时利用物理驱动的效用函数与预算感知的仿真层级递进策略(从快速参数模型到完整Geant4模拟),实现高效且可复现的设计优化。此框架将实验设计建模为受物理定律约束的搜索问题,首次引入了面向复杂仪器的自主、仿真驱动科学推理的新基准。
链接: https://arxiv.org/abs/2602.15039
作者: Justin Hill,Hong Joo Ryoo
机构: 未知
类目: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI)
备注: Both authors contributed equally. 43 pages, 12 figures, 6 tables, data can be found in this https URL
Abstract:We present GRACE, a simulation-native agent for autonomous experimental design in high-energy and nuclear physics. Given multimodal input in the form of a natural-language prompt or a published experimental paper, the agent extracts a structured representation of the experiment, constructs a runnable toy simulation, and autonomously explores design modifications using first-principles Monte Carlo methods. Unlike agentic systems focused on operational control or execution of predefined procedures, GRACE addresses the upstream problem of experimental design: proposing non-obvious modifications to detector geometry, materials, and configurations that improve physics performance under physical and practical constraints. The agent evaluates candidate designs through repeated simulation, physics-motivated utility functions, and budget-aware escalation from fast parametric models to full Geant4 simulations, while maintaining strict reproducibility and provenance tracking. We demonstrate the framework on historical experimental setups, showing that the agent can identify optimization directions that align with known upgrade priorities, using only baseline simulation inputs. We also conducted a benchmark in which the agent identified the setup and proposed improvements from a suite of natural language prompts, with some supplied with a relevant physics research paper, of varying high energy physics (HEP) problem settings. This work establishes experimental design as a constrained search problem under physical law and introduces a new benchmark for autonomous, simulation-driven scientific reasoning in complex instruments.
[AI-77] Transforming Computational Lithography with AC and AI – Faster, More Accurate and Energy-efficient
【速读】:该论文旨在解决科学计算(如气候模拟、药物发现)及半导体制造中因数据规模扩大、模型复杂度提升和仿真精度提高而导致的计算需求激增问题,这一增长远超晶体管缩放速度,造成成本、能耗与碳排放不可持续上升。针对计算光刻(computational lithography)这一半导体制造中最繁重的工作负载,其复杂性随原子级精度要求显著增加,传统方法难以满足高效、高精度建模需求。解决方案的关键在于引入加速计算(Accelerated Computing, AC)与生成式 AI(Generative AI)协同重构软件栈:通过 NVIDIA cuLitho 重新设计核心原语(如衍射光学、计算几何、多变量优化等),实现端到端 57 倍加速;同时利用 AI 作为高保真替代模型,释放大量计算资源用于更严谨的工艺优化(如曲边掩膜、高数值孔径极紫外光刻、亚原子级建模),并仅用少量释放的算力实现焦点穿透校正,从而在 IMEC 硅片实验中验证了 35% 更优的工艺窗口和 19% 更低的边缘定位误差,首次量化证明了 AC 和 AI 在芯片级光刻中的实际效益。
链接: https://arxiv.org/abs/2602.15036
作者: Saumyadip Mukhopadhyay,Kiho Yang,Kasyap Thottasserymana Vasudevan,Mounica Jyothi Divvela,Selim Dogru,Dilip Krishnamurthy,Fergo Treska,Werner Gillijns,Ryan Ryoung han Kim,Kumara Sastry,Vivek Singh
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注:
Abstract:From climate science to drug discovery, scientific computing demands have surged dramatically in recent years – driven by larger datasets, more sophisticated models, and higher simulation fidelity. This growth rate far outpaces transistor scaling, leading to unsustainably rising costs, energy consumption, and emissions. Semiconductor manufacturing is no exception. Computational lithography – involving transferring circuitry to silicon in diffraction-limited conditions – is the largest workload in semiconductor manufacturing. It has also grown exceptionally complex as miniaturization has advanced in the angstrom-era, requiring more accurate modeling, intricate corrections, and broader solution-space exploration. Accelerated computing (AC) offers a solution by dramatically freeing up the compute and power envelope. AI augments these gains by serving as high-fidelity surrogates for compute-intensive steps. Together, they present a sustainable, next-generation computing platform for scientific workloads. This new paradigm needs a fundamental redesign of the software stack. For computational lithography, NVIDIA cuLitho reinvents the core primitives – diffractive optics, computational geometry, multi-variant optimization, data processing – to achieve a transformative 57X end-to-end acceleration. Beyond dramatically faster cycles, this expanded compute envelope enables more rigorous solutions, including curvilinear masks, high-numerical aperture extreme ultraviolet (high-NA EUV) lithography, and subatomic modeling. We reinvest a small fraction of the freed-up compute to include through-focus correction for better process resilience. Silicon experiments at IMEC show significant benefits compared to conventional methods – 35% better process window and 19% better edge placement error. This is the first quantified chip-scale demonstration of the lithography benefits of AC and AI in silicon.
[AI-78] LemonadeBench: Evaluating the Economic Intuition of Large Language Models in Simple Markets
【速读】:该论文旨在解决如何有效评估大型语言模型(Large Language Models, LLMs)在经济直觉、长期规划及不确定性下的决策能力问题。现有基准多聚焦于静态任务或单一技能,难以反映真实商业场景中动态资源管理与策略权衡的复杂性。解决方案的关键在于构建一个名为LemonadeBench v0.5的最小化模拟基准,通过一个为期30天的柠檬水摊位经营任务,要求模型处理具有保质期的商品库存管理、价格设定、营业时间选择等多维决策,并以利润最大化为目标。实验表明,不同成熟度的模型均能实现盈利,且性能随模型复杂度显著提升——前沿模型达到理论最优利润的70%,较基础模型提升超过10倍;但进一步分解六维度业务效率发现,模型普遍仅实现局部优化,在某些关键领域存在明显盲区,揭示了当前LLMs在全局决策一致性上的局限性。
链接: https://arxiv.org/abs/2602.13209
作者: Aidan Vyas
机构: 未知
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce LemonadeBench v0.5, a minimal benchmark for evaluating economic intuition, long-term planning, and decision-making under uncertainty in large language models (LLMs) through a simulated lemonade stand business. Models must manage inventory with expiring goods, set prices, choose operating hours, and maximize profit over a 30-day period-tasks that any small business owner faces daily. All models demonstrate meaningful economic agency by achieving profitability, with performance scaling dramatically by sophistication-from basic models earning minimal profits to frontier models capturing 70% of theoretical optimal, a greater than 10x improvement. Yet our decomposition of business efficiency across six dimensions reveals a consistent pattern: models achieve local rather than global optimization, excelling in select areas while exhibiting surprising blind spots elsewhere.
机器学习
[LG-0] Operationalising the Superficial Alignment Hypothesis via Task Complexity
链接: https://arxiv.org/abs/2602.15829
作者: Tomás Vergara-Browne,Darshan Patil,Ivan Titov,Siva Reddy,Tiago Pimentel,Marius Mosbach
类目: Machine Learning (cs.LG)
备注:
Abstract:The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge. The SAH, however, lacks a precise definition, which has led to (i) different and seemingly orthogonal arguments supporting it, and (ii) important critiques of it. We propose a new metric called task complexity: the length of the shortest program that achieves a target performance on a task. In this framework, the SAH simply claims that pre-trained models drastically reduce the complexity of achieving high performance on many tasks. Our definition unifies prior arguments supporting the SAH, interpreting them as different strategies to find such short programs. Experimentally, we estimate the task complexity of mathematical reasoning, machine translation, and instruction following; we then show that these complexities can be remarkably low when conditioned on a pre-trained model. Further, we find that pre-training enables access to strong performances on our tasks, but it can require programs of gigabytes of length to access them. Post-training, on the other hand, collapses the complexity of reaching this same performance by several orders of magnitude. Overall, our results highlight that task adaptation often requires surprisingly little information – often just a few kilobytes.
[LG-1] Stabilizing Test-Time Adaptation of High-Dimensional Simulation Surrogates via D-Optimal Statistics
链接: https://arxiv.org/abs/2602.15820
作者: Anna Zimmel,Paul Setinek,Gianluca Galletti,Johannes Brandstetter,Werner Zellinger
类目: Machine Learning (cs.LG)
备注:
Abstract:Machine learning surrogates are increasingly used in engineering to accelerate costly simulations, yet distribution shifts between training and deployment often cause severe performance degradation (e.g., unseen geometries or configurations). Test-Time Adaptation (TTA) can mitigate such shifts, but existing methods are largely developed for lower-dimensional classification with structured outputs and visually aligned input-output relationships, making them unstable for the high-dimensional, unstructured and regression problems common in simulation. We address this challenge by proposing a TTA framework based on storing maximally informative (D-optimal) statistics, which jointly enables stable adaptation and principled parameter selection at test time. When applied to pretrained simulation surrogates, our method yields up to 7% out-of-distribution improvements at negligible computational cost. To the best of our knowledge, this is the first systematic demonstration of effective TTA for high-dimensional simulation regression and generative design optimization, validated on the SIMSHIFT and EngiBench benchmarks.
[LG-2] Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning ICLR2026
链接: https://arxiv.org/abs/2602.15817
作者: Oswin So,Eric Yang Yu,Songyuan Zhang,Matthew Cleaveland,Mitchell Black,Chuchu Fan
类目: Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
备注: ICLR 2026. The project page can be found at this https URL
Abstract:Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can result in policies that perform poorly on low-probability states that are still within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions that specify the initial state, dynamics and safe set, but whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists, and learns a policy to solve the reachability problem over this set of initial conditions. Empirical results demonstrate that FGE learns policies with over 50% more coverage than the best existing method for challenging initial conditions across tasks in the MuJoCo simulator and the Kinetix simulator with pixel observations.
[LG-3] A Note on Non-Composability of Layerwise Approximate Verification for Neural Inference
链接: https://arxiv.org/abs/2602.15756
作者: Or Zamir
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:A natural and informal approach to verifiable (or zero-knowledge) ML inference over floating-point data is: "prove that each layer was computed correctly up to tolerance δ; therefore the final output is a reasonable inference result". This short note gives a simple counterexample showing that this inference is false in general: for any neural network, we can construct a functionally equivalent network for which adversarially chosen approximation-magnitude errors in individual layer computations suffice to steer the final output arbitrarily (within a prescribed bounded range).
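该反例的直观版本可以用一个"每层乘 2"的玩具线性网络复现:对手在每层注入容差内的误差 δ,L 层后输出偏移按 δ·(2^L − 1) 放大,而逐层验证全部通过。这只是说明思想的玩具构造,并非原文的具体反例:

```python
import numpy as np

# 每层都是"乘以 2"的线性映射,每层计算允许 |误差| <= delta
L_layers, delta = 20, 1e-3
x = 1.0
honest, adversarial = x, x
for _ in range(L_layers):
    honest = 2.0 * honest
    adversarial = 2.0 * adversarial + delta  # 每层偏差 delta,仍满足逐层容差
print(f"诚实输出 = {honest:.1f}, 对抗输出偏移 = {adversarial - honest:.3f}")
# 偏移 = delta * (2^20 - 1) ≈ 1048.6,即约 10^6 倍的单层容差
```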
[LG-4] Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching ICLR2026
链接: https://arxiv.org/abs/2602.15752
作者: Ren Kishimoto,Rikiya Takehi,Koichi Tanaka,Masahiro Nomura,Riku Togashi,Yoji Tomita,Yuta Saito
类目: Machine Learning (cs.LG)
备注: Published as a conference paper at ICLR 2026
Abstract:On two-sided matching platforms such as online dating and recruiting, recommendation algorithms often aim to maximize the total number of matches. However, this objective creates an imbalance, where some users receive far too many matches while many others receive very few and eventually abandon the platform. Retaining users is crucial for many platforms, such as those that depend heavily on subscriptions. Some may use fairness objectives to solve the problem of match maximization. However, fairness in itself is not the ultimate objective for many platforms, as users do not suddenly reward the platform simply because exposure is equalized. In practice, where user retention is often the ultimate goal, casually relying on fairness will leave the optimization of retention up to luck. In this work, instead of maximizing matches or axiomatically defining fairness, we formally define the new problem setting of maximizing user retention in two-sided matching platforms. To this end, we introduce a dynamic learning-to-rank (LTR) algorithm called Matching for Retention (MRet). Unlike conventional algorithms for two-sided matching, our approach models user retention by learning personalized retention curves from each user’s profile and interaction history. Based on these curves, MRet dynamically adapts recommendations by jointly considering the retention gains of both the user receiving recommendations and those who are being recommended, so that limited matching opportunities can be allocated where they most improve overall retention. Naturally but importantly, empirical evaluations on synthetic and real-world datasets from a major online dating platform show that MRet achieves higher user retention, since conventional methods optimize matches or fairness rather than retention.
[LG-5] Controlled oscillation modeling using port-Hamiltonian neural networks
链接: https://arxiv.org/abs/2602.15704
作者: Maximino Linares,Guillaume Doras,Thomas Hélie
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
备注:
Abstract:Learning dynamical systems through purely data-driven methods is challenging as they do not learn the underlying conservation laws that enable them to correctly generalize. Existing port-Hamiltonian neural network methods have recently been successfully applied for modeling mechanical systems. However, even though these methods are designed on power-balance principles, they usually do not consider power-preserving discretizations and often rely on Runge-Kutta numerical methods. In this work, we propose to use a second-order discrete gradient method embedded in the learning of dynamical systems with port-Hamiltonian neural networks. Numerical results are provided for three systems deliberately selected to span different ranges of dynamical behavior under control: a baseline harmonic oscillator with quadratic energy storage; a Duffing oscillator, with a non-quadratic Hamiltonian offering amplitude-dependent effects; and a self-sustained oscillator, which can stabilize in a controlled limit cycle through the incorporation of a nonlinear dissipation. We show how the use of this discrete gradient method outperforms the performance of a Runge-Kutta method of the same order. Experiments are also carried out to compare two theoretically equivalent port-Hamiltonian systems formulations and to analyze the impact of regularizing the Jacobian of port-Hamiltonian neural networks during training.
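对谐振子 H = (q² + p²)/2,二阶(中点型)离散梯度法的隐式更新可闭式求解,且在这一二次能量的玩具情形下精确保持能量。以下为最小示意,仅演示离散梯度积分器本身,并非论文的学习框架:

```python
import numpy as np

def dg_step(q, p, dt):
    """隐式中点: q' = q + dt*(p+p')/2, p' = p - dt*(q+q')/2,对线性系统闭式求解。"""
    a = dt / 2
    denom = 1 + a * a
    q_new = ((1 - a * a) * q + dt * p) / denom
    p_new = ((1 - a * a) * p - dt * q) / denom
    return q_new, p_new

q, p, dt = 1.0, 0.0, 0.1
H0 = 0.5 * (q * q + p * p)
for _ in range(10000):
    q, p = dg_step(q, p, dt)
H1 = 0.5 * (q * q + p * p)
print(f"能量漂移 = {abs(H1 - H0):.2e}")  # 机器精度级别;同阶显式 Runge-Kutta 则会缓慢漂移
```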
[LG-6] CAMEL: An ECG Language Model for Forecasting Cardiac Events
链接: https://arxiv.org/abs/2602.15677
作者: Neelay Velingker,Alaia Solko-Breslin,Mayank Keoliya,Seewon Choi,Jiayi Xin,Anika Marathe,Alireza Oraii,Rajat Deo,Sameed Khatana,Rajeev Alur,Mayur Naik,Eric Wong
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 24 pages, 6 figures
Abstract:Electrocardiograms (ECG) are electrical recordings of the heart that are critical for diagnosing cardiovascular conditions. ECG language models (ELMs) have recently emerged as a promising framework for ECG classification accompanied by report generation. However, current models cannot forecast future cardiac events despite the immense clinical value for planning earlier intervention. To address this gap, we propose CAMEL, the first ELM that is capable of inference over longer signal durations which enables its forecasting capability. Our key insight is a specialized ECG encoder which enables cross-understanding of ECG signals with text. We train CAMEL using established LLM training procedures, combining LoRA adaptation with a curriculum learning pipeline. Our curriculum includes ECG classification, metrics calculations, and multi-turn conversations to elicit reasoning. CAMEL demonstrates strong zero-shot performance across 6 tasks and 9 datasets, including ECGForecastBench, a new benchmark that we introduce for forecasting arrhythmias. CAMEL is on par with or surpasses ELMs and fully supervised baselines both in- and out-of-distribution, achieving SOTA results on ECGBench (+7.0% absolute average gain) as well as ECGForecastBench (+12.4% over fully supervised models and +21.1% over zero-shot ELMs).
[LG-7] Continuous-Time Piecewise-Linear Recurrent Neural Networks
链接: https://arxiv.org/abs/2602.15649
作者: Alena Brändle,Lukas Eisenmann,Florian Götz,Daniel Durstewitz
类目: Machine Learning (cs.LG)
备注:
Abstract:In dynamical systems reconstruction (DSR) we aim to recover the dynamical system (DS) underlying observed time series. Specifically, we aim to learn a generative surrogate model which approximates the underlying, data-generating DS, and recreates its long-term properties ("climate statistics"). In scientific and medical areas, in particular, these models need to be mechanistically tractable – through their mathematical analysis we would like to obtain insight into the recovered system’s workings. Piecewise-linear (PL), ReLU-based RNNs (PLRNNs) have a strong track-record in this regard, representing SOTA DSR models while allowing mathematical insight by virtue of their PL design. However, all current PLRNN variants are discrete-time maps. This is in disaccord with the assumed continuous-time nature of most physical and biological processes, and makes it hard to accommodate data arriving at irregular temporal intervals. Neural ODEs are one solution, but they do not reach the DSR performance of PLRNNs and often lack their tractability. Here we develop theory for continuous-time PLRNNs (cPLRNNs): We present a novel algorithm for training and simulating such models, bypassing numerical integration by efficiently exploiting their PL structure. We further demonstrate how important topological objects like equilibria or limit cycles can be determined semi-analytically in trained models. We compare cPLRNNs to both their discrete-time cousins as well as Neural ODEs on DSR benchmarks, including systems with discontinuities which come with hard thresholds.
[LG-8] The Stationarity Bias: Stratified Stress-Testing for Time-Series Imputation in Regulated Dynamical Systems
链接: https://arxiv.org/abs/2602.15637
作者: Amirreza Dolatpour Fathkouhi,Alireza Namazi,Heman Shakeri
类目: Machine Learning (cs.LG)
备注:
Abstract:Time-series imputation benchmarks employ uniform random masking and shape-agnostic metrics (MSE, RMSE), implicitly weighting evaluation by regime prevalence. In systems with a dominant attractor – homeostatic physiology, nominal industrial operation, stable network traffic – this creates a systematic Stationarity Bias: simple methods appear superior because the benchmark predominantly samples the easy, low-entropy regime where they trivially succeed. We formalize this bias and propose a Stratified Stress-Test that partitions evaluation into Stationary and Transient regimes. Using Continuous Glucose Monitoring (CGM) as a testbed – chosen for its rigorous ground-truth forcing functions (meals, insulin) that enable precise regime identification – we establish three findings with broad implications: (i) Stationary Efficiency: Linear interpolation achieves state-of-the-art reconstruction during stable intervals, confirming that complex architectures are computationally wasteful in low-entropy regimes. (ii) Transient Fidelity: During critical transients (post-prandial peaks, hypoglycemic events), linear methods exhibit drastically degraded morphological fidelity (DTW), disproportionate to their RMSE – a phenomenon we term the RMSE Mirage, where low pointwise error masks the destruction of signal shape. (iii) Regime-Conditional Model Selection: Deep learning models preserve both pointwise accuracy and morphological integrity during transients, making them essential for safety-critical downstream tasks. We further derive empirical missingness distributions from clinical trials and impose them on complete training data, preventing models from exploiting unrealistically clean observations and encouraging robustness under real-world missingness. This framework generalizes to any regulated system where routine stationarity dominates critical transients.
[LG-9] Beyond ReLU: Bifurcation Oversmoothing and Topological Priors
链接: https://arxiv.org/abs/2602.15634
作者: Erkan Turan,Gaspard Abel,Maysam Behmanesh,Emery Pierson,Maks Ovsjanikov
类目: Machine Learning (cs.LG)
备注:
Abstract:Graph Neural Networks (GNNs) learn node representations through iterative network-based message-passing. While powerful, deep GNNs suffer from oversmoothing, where node features converge to a homogeneous, non-informative state. We re-frame this problem of representational collapse from a bifurcation theory perspective, characterizing oversmoothing as convergence to a stable "homogeneous fixed point." Our central contribution is the theoretical discovery that this undesired stability can be broken by replacing standard monotone activations (e.g., ReLU) with a class of functions. Using Lyapunov-Schmidt reduction, we analytically prove that this substitution induces a bifurcation that destabilizes the homogeneous state and creates a new pair of stable, non-homogeneous patterns that provably resist oversmoothing. Our theory predicts a precise, nontrivial scaling law for the amplitude of these emergent patterns, which we quantitatively validate in experiments. Finally, we demonstrate the practical utility of our theory by deriving a closed-form, bifurcation-aware initialization and showing its utility in real benchmark experiments.
[LG-10] DNN-Enabled Multi-User Beamforming for Throughput Maximization under Adjustable Fairness
链接: https://arxiv.org/abs/2602.15617
作者: Kaifeng Lu,Markus Rupp,Stefan Schwarz
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Ensuring user fairness in wireless communications is a fundamental challenge, as balancing the trade-off between fairness and sum rate leads to a non-convex, multi-objective optimization whose complexity grows with network scale. To alleviate this conflict, we propose an optimization-based unsupervised learning approach based on the wireless transformer (WiT) architecture that learns from channel state information (CSI) features. We reformulate the trade-off by combining the sum rate and fairness objectives through a Lagrangian multiplier, which is updated automatically via a dual-ascent algorithm. This mechanism allows for a controllable fairness constraint while simultaneously maximizing the sum rate, effectively realizing a trace on the Pareto front between two conflicting objectives. Our findings show that the proposed approach offers a flexible solution for managing the trade-off optimization under prescribed fairness.
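乘子经对偶上升自动更新这一机制可用玩具数值示意:约束被违反时 λ 增大、满足时减小,从而在和速率与公平性之间自适应折中。"速率"的更新为占位逻辑,并非 WiT 网络的训练过程:

```python
import numpy as np

rng = np.random.default_rng(0)
rates = rng.uniform(0.1, 2.0, size=8)  # 8 个用户的玩具"速率"
lam, eta, fair_target = 0.0, 0.5, 0.9  # 乘子、对偶步长、公平性目标(假设值)

def jain(r):
    """Jain 公平指数,取值 (0, 1],全相等时为 1。"""
    return r.sum() ** 2 / (len(r) * (r ** 2).sum())

for it in range(200):
    fairness = jain(rates)
    # 对偶上升:约束 fairness >= fair_target 被违反则增大 lam,反之减小
    lam = max(0.0, lam + eta * (fair_target - fairness))
    # 占位的"原问题"步:lam 越大,速率越向均值靠拢(模拟网络参数的梯度更新)
    rates += 0.05 * lam * (rates.mean() - rates)

print(f"lam = {lam:.3f}, Jain = {jain(rates):.3f}")
```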
[LG-11] Symbolic recovery of PDEs from measurement data
链接: https://arxiv.org/abs/2602.15603
作者: Erion Morina,Philipp Scholl,Martin Holler
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC); Optimization and Control (math.OC)
备注:
Abstract:Models based on partial differential equations (PDEs) are powerful for describing a wide range of complex relationships in the natural sciences. Accurately identifying the PDE model, which represents the underlying physical law, is essential for a proper understanding of the problem. This reconstruction typically relies on indirect and noisy measurements of the system’s state and, without specifically tailored methods, rarely yields symbolic expressions, thereby hindering interpretability. In this work, we address this issue by considering existing neural network architectures based on rational functions for the symbolic representation of physical laws. These networks leverage the approximation power of rational functions while also benefiting from their flexibility in representing arithmetic operations. Our main contribution is an identifiability result, showing that, in the limit of noiseless, complete measurements, such symbolic networks can uniquely reconstruct the simplest physical law within the PDE model. Specifically, reconstructed laws remain expressible within the symbolic network architecture, with regularization-minimizing parameterizations promoting interpretability and sparsity in the case of L^1-regularization. In addition, we provide regularity results for symbolic networks. Empirical validation using the ParFam architecture supports these theoretical findings, providing evidence for the practical reconstructibility of physical laws.
[LG-12] Certified Per-Instance Unlearning Using Individual Sensitivity Bounds
链接: https://arxiv.org/abs/2602.15602
作者: Hanna Benarroch(DI-ENS),Jamal Atif(CMAP),Olivier Cappé(DI-ENS)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Certified machine unlearning can be achieved via noise injection leading to differential privacy guarantees, where noise is calibrated to worst-case sensitivity. Such conservative calibration often results in performance degradation, limiting practical applicability. In this work, we investigate an alternative approach based on adaptive per-instance noise calibration tailored to the individual contribution of each data point to the learned solution. This raises the following challenge: how can one establish formal unlearning guarantees when the mechanism depends on the specific point to be removed? To define individual data point sensitivities in noisy gradient dynamics, we consider the use of per-instance differential privacy. For ridge regression trained via Langevin dynamics, we derive high-probability per-instance sensitivity bounds, yielding certified unlearning with substantially less noise injection. We corroborate our theoretical findings through experiments in linear settings and provide further empirical evidence on the relevance of the approach in deep learning settings.
[LG-13] Multi-Objective Coverage via Constraint Active Search
链接: https://arxiv.org/abs/2602.15595
作者: Zakaria Shams Siam,Xuefeng Liu,Chong Liu
类目: Machine Learning (cs.LG)
备注:
Abstract:In this paper, we formulate the new multi-objective coverage (MOC) problem where our goal is to identify a small set of representative samples whose predicted outcomes broadly cover the feasible multi-objective space. This problem is of great importance in many critical real-world applications, e.g., drug discovery and materials design, as this representative set can be evaluated much faster than the whole feasible set, thus significantly accelerating the scientific discovery process. Existing works cannot be directly applied as they either focus on sample space coverage or multi-objective optimization that targets the Pareto front. However, chemically diverse samples often yield identical objective profiles, and safety constraints are usually defined on the objectives. To solve this MOC problem, we propose a novel search algorithm, MOC-CAS, which employs an upper confidence bound-based acquisition function to select optimistic samples guided by Gaussian process posterior predictions. For enabling efficient optimization, we develop a smoothed relaxation of the hard feasibility test and derive an approximate optimizer. Compared to the competitive baselines, we show that our MOC-CAS empirically achieves superior performances across large-scale protein-target datasets for SARS-CoV-2 and cancer, each assessed on five objectives derived from SMILES-based features.
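基于高斯过程后验的 UCB 采集函数可用 scikit-learn 在一维单目标问题上示意。MOC-CAS 针对多目标覆盖并带可行性松弛,此处只演示"均值 + β·标准差"这一乐观选点思想:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(8, 1))           # 已评估的样本(占位)
y_train = np.sin(X_train).ravel() + 0.1 * rng.normal(size=8)

gp = GaussianProcessRegressor().fit(X_train, y_train)
X_cand = np.linspace(0, 10, 200).reshape(-1, 1)      # 候选池
mu, sigma = gp.predict(X_cand, return_std=True)

beta = 2.0                   # 乐观程度,假设值
ucb = mu + beta * sigma      # 后验均值 + 不确定性奖励
x_next = X_cand[np.argmax(ucb)]
print(f"下一个评估点: x = {x_next[0]:.2f}")
```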
[LG-14] A unified theory of feature learning in RNNs and DNNs
链接: https://arxiv.org/abs/2602.15593
作者: Jan P. Bauer,Kirsten Fischer,Moritz Helias,Agostina Palmigiano
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
备注:
Abstract:Recurrent and deep neural networks (RNNs/DNNs) are cornerstone architectures in machine learning. Remarkably, RNNs differ from DNNs only by weight sharing, as can be shown through unrolling in time. How does this structural similarity fit with the distinct functional properties these networks exhibit? To address this question, we here develop a unified mean-field theory for RNNs and DNNs in terms of representational kernels, describing fully trained networks in the feature learning (μP) regime. This theory casts training as Bayesian inference over sequences and patterns, directly revealing the functional implications induced by the RNNs’ weight sharing. In DNN-typical tasks, we identify a phase transition when the learning signal overcomes the noise due to randomness in the weights: below this threshold, RNNs and DNNs behave identically; above it, only RNNs develop correlated representations across timesteps. For sequential tasks, the RNNs’ weight sharing furthermore induces an inductive bias that aids generalization by interpolating unsupervised time steps. Overall, our theory offers a way to connect architectural structure to functional biases.
[LG-15] Uniform error bounds for quantized dynamical models
链接: https://arxiv.org/abs/2602.15586
作者: Abdelkader Metakalard(CRAN, SYNALP),Fabien Lauer(SYNALP, LORIA),Kevin Colin(CRAN),Marion Gilson(CRAN)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:This paper provides statistical guarantees on the accuracy of dynamical models learned from dependent data sequences. Specifically, we develop uniform error bounds that apply to quantized models and imperfect optimization algorithms commonly used in practical contexts for system identification, and in particular hybrid system identification. Two families of bounds are obtained: slow-rate bounds via a block decomposition and fast-rate, variance-adaptive bounds via a novel spaced-point strategy. The bounds scale with the number of bits required to encode the model and thus translate hardware constraints into interpretable statistical complexities.
[LG-16] Accelerated Predictive Coding Networks via Direct Kolen-Pollack Feedback Alignment
链接: https://arxiv.org/abs/2602.15571
作者: Davide Casnici,Martin Lefebvre,Justin Dauwels,Charlotte Frenkel
类目: Machine Learning (cs.LG)
备注:
Abstract:Predictive coding (PC) is a biologically inspired algorithm for training neural networks that relies only on local updates, allowing parallel learning across layers. However, practical implementations face two key limitations: error signals must still propagate from the output to early layers through multiple inference-phase steps, and feedback decays exponentially during this process, leading to vanishing updates in early layers. We propose direct Kolen-Pollack predictive coding (DKP-PC), which simultaneously addresses both feedback delay and exponential decay, yielding a more efficient and scalable variant of PC while preserving update locality. Leveraging direct feedback alignment and direct Kolen-Pollack algorithms, DKP-PC introduces learnable feedback connections from the output layer to all hidden layers, establishing a direct pathway for error transmission. This yields an algorithm that reduces the theoretical error propagation time complexity from O(L), with L being the network depth, to O(1), removing depth-dependent delay in error signals. Moreover, empirical results demonstrate that DKP-PC achieves performance at least comparable to, and often exceeding, that of standard PC, while offering improved latency and computational performance, supporting its potential for custom hardware-efficient implementations.
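"输出误差经直连反馈矩阵直达各隐藏层"这一通路结构可用 NumPy 示意如下。此处为 DFA 风格的固定随机反馈;DKP-PC 中反馈权重可学习并与预测编码推断结合,以下仅为结构草图:

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [32, 64, 64, 10]  # 一个 3 层玩具网络
W = [rng.normal(0, 0.1, (dims[i + 1], dims[i])) for i in range(3)]
# 输出层 -> 各隐藏层的直连反馈矩阵(DKP-PC 中这些矩阵是可学习的)
B = [rng.normal(0, 0.1, (dims[i + 1], dims[-1])) for i in range(2)]

x, target = rng.normal(size=32), rng.normal(size=10)
h1 = np.tanh(W[0] @ x)
h2 = np.tanh(W[1] @ h1)
y = W[2] @ h2
e = y - target  # 输出误差

lr = 0.01
W[2] -= lr * np.outer(e, h2)                          # 输出层:常规局部更新
W[1] -= lr * np.outer(B[1] @ e * (1 - h2 ** 2), h1)   # 误差经 B 直达,与深度无关
W[0] -= lr * np.outer(B[0] @ e * (1 - h1 ** 2), x)
print([w.shape for w in W])
```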
[LG-17] 1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization
链接: https://arxiv.org/abs/2602.15563
作者: Sohir Maskey,Constantin Eichenberg,Johannes Messner,Douglas Orr
类目: Machine Learning (cs.LG)
*备注: Preprint. Under Review. 23 pages, 9 figures
Abstract:Quantization-aware training (QAT) is an effective method to drastically reduce the memory footprint of LLMs while keeping performance degradation at an acceptable level. However, the optimal choice of quantization format and bit-width presents a challenge in practice. The design space of quantization has not been fully explored in the context of QAT, and the precise trade-off between quantization and downstream performance is poorly understood, as comparisons often rely solely on perplexity-based evaluations. In this work, we address these shortcomings with an empirical study of QAT in the low-bit regime. We show that k-means-based weight quantization outperforms integer formats and can be implemented efficiently on standard hardware. Furthermore, we find that, under a fixed inference memory budget, the best performance on generative downstream tasks is achieved with 1-bit quantized weights.
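As a concrete illustration of k-means weight quantization, the sketch below (scikit-learn; the weight matrix and bit-width are hypothetical stand-ins) clusters a layer's weights into 2^b centroids, so a 1-bit layer stores one index bit per weight plus two floating-point codebook values:

```python
# K-means weight quantization sketch: replace each weight with the nearest
# of 2^b learned centroids; with b = 1 there are exactly two values.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(W: np.ndarray, bits: int = 1):
    km = KMeans(n_clusters=2**bits, n_init=10, random_state=0)
    idx = km.fit_predict(W.reshape(-1, 1))    # one b-bit index per weight
    codebook = km.cluster_centers_.ravel()    # 2^b float values per tensor
    return codebook[idx].reshape(W.shape), idx, codebook

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))                 # hypothetical layer weights
W_q, idx, codebook = kmeans_quantize(W, bits=1)
print(codebook)                               # the two 1-bit weight values
print(np.mean((W - W_q) ** 2))                # quantization MSE
```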
[LG-18] Latent Regularization in Generative Test Input Generation ICSE2026
链接: https://arxiv.org/abs/2602.15552
作者: Giorgi Merabishvili,Oliver Weißl,Andrea Stocco
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted for publication at the 7th International Workshop on Deep Learning for Testing and Testing for Deep Learning (DeepTest 2026), co-located with ICSE 2026
Abstract:This study investigates the impact of regularization of latent spaces through truncation on the quality of generated test inputs for deep learning classifiers. We evaluate this effect using style-based GANs, a state-of-the-art generative approach, and assess quality along three dimensions: validity, diversity, and fault detection. We evaluate our approach on the boundary testing of deep learning image classifiers across three datasets, MNIST, Fashion MNIST, and CIFAR-10. We compare two truncation strategies: latent code mixing with binary search optimization and random latent truncation for generative exploration. Our experiments show that the latent code-mixing approach yields a higher fault detection rate than random truncation, while also improving both diversity and validity.
[LG-19] CEPAE: Conditional Entropy-Penalized Autoencoders for Time Series Counterfactuals
链接: https://arxiv.org/abs/2602.15546
作者: Tomàs Garriga,Gerard Sanz,Eduard Serrahima de Cambra,Axel Brando
类目: Machine Learning (cs.LG)
*备注:
Abstract:The ability to accurately perform counterfactual inference on time series is crucial for decision-making in fields like finance, healthcare, and marketing, as it allows us to understand the impact of events or treatments on outcomes over time. In this paper, we introduce a new counterfactual inference approach tailored to time series data impacted by market events, which is motivated by an industrial application. Utilizing the abduction-action-prediction procedure and the Structural Causal Model framework, we first adapt methods based on variational autoencoders and adversarial autoencoders, both previously used in counterfactual literature although not in time series settings. Then, we present the Conditional Entropy-Penalized Autoencoder (CEPAE), a novel autoencoder-based approach for counterfactual inference, which employs an entropy penalization loss over the latent space to encourage disentangled data representations. We validate our approach both theoretically and experimentally on synthetic, semi-synthetic, and real-world datasets, showing that CEPAE generally outperforms the other approaches in the evaluated metrics.
[LG-20] On the Geometric Coherence of Global Aggregation in Federated GNN
链接: https://arxiv.org/abs/2602.15510
作者: Chethana Prasad Kabgere,Shylaja SS
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注: This is a developing preprint of an 18-page journal manuscript (6 figures), currently being prepared for formal peer-review submission
Abstract:Federated Learning (FL) enables distributed training across multiple clients without centralized data sharing, while Graph Neural Networks (GNNs) model relational data through message passing. In federated GNN settings, client graphs often exhibit heterogeneous structural and propagation characteristics. When standard aggregation mechanisms are applied to such heterogeneous updates, the global model may converge numerically while exhibiting degraded relational coherence. This work identifies a geometric failure mode of global aggregation in Cross-Domain Federated GNNs. Although GNN parameters are numerically represented as vectors, they encode relational transformations that govern the direction, strength, and sensitivity of information flow across graph neighborhoods. Aggregating updates originating from incompatible propagation regimes can therefore introduce destructive interference in this transformation space, leading to loss of coherence in global message passing. Importantly, this degradation is not necessarily reflected in conventional metrics such as loss or accuracy. To address this issue, we propose GGRS (Global Geometric Reference Structure), a server-side framework that regulates client updates prior to aggregation based on geometric admissibility criteria. GGRS preserves directional consistency of relational transformations, maintains diversity of admissible propagation subspaces, and stabilizes sensitivity to neighborhood interactions, without accessing client data or graph topology. Experiments on heterogeneous GNN-native Amazon Co-purchase datasets demonstrate that GGRS preserves global message-passing coherence across training rounds, highlighting the necessity of geometry-aware regulation in federated graph learning.
[LG-21] Approximation Theory for Lipschitz Continuous Transformers
链接: https://arxiv.org/abs/2602.15503
作者: Takashi Furuya,Davide Murari,Carola-Bibiane Schönlieb
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Stability and robustness are critical for deploying Transformers in safety-sensitive settings. A principled way to enforce such behavior is to constrain the model’s Lipschitz constant. However, approximation-theoretic guarantees for architectures that explicitly preserve Lipschitz continuity have yet to be established. In this work, we bridge this gap by introducing a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction. We realize both MLP and attention blocks as explicit Euler steps of negative gradient flows, ensuring inherent stability without sacrificing expressivity. We prove a universal approximation theorem for this class within a Lipschitz-constrained function space. Crucially, our analysis adopts a measure-theoretic formalism, interpreting Transformers as operators on probability measures, to yield approximation guarantees independent of token count. These results provide a rigorous theoretical foundation for the design of robust, Lipschitz continuous Transformer architectures.
[LG-22] ExLipBaB: Exact Lipschitz Constant Computation for Piecewise Linear Neural Networks
链接: https://arxiv.org/abs/2602.15499
作者: Tom A. Splittgerber
类目: Machine Learning (cs.LG)
*备注: 14 pages, 1 figure
Abstract:It has been shown that a neural network’s Lipschitz constant can be leveraged to derive robustness guarantees, to improve generalizability via regularization or even to construct invertible networks. Therefore, a number of methods varying in the tightness of their bounds and their computational cost have been developed to approximate the Lipschitz constant for different classes of networks. However, comparatively little research exists on methods for exact computation, which has been shown to be NP-hard. Nonetheless, there are applications where one might readily accept the computational cost of an exact method. These applications could include the benchmarking of new methods or the computation of robustness guarantees for small models on sensitive data. Unfortunately, existing exact algorithms restrict themselves to only ReLU-activated networks, which are known to come with severe downsides in the context of Lipschitz-constrained networks. We therefore propose a generalization of the LipBaB algorithm to compute exact Lipschitz constants for arbitrary piecewise linear neural networks and p-norms. With our method, networks may contain traditional activations like ReLU or LeakyReLU, activations like GroupSort or the related MinMax and FullSort, which have been of increasing interest in the context of Lipschitz constrained networks, or even other piecewise linear functions like MaxPool.
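For context on what exact methods tighten, the sketch below (NumPy, hypothetical shapes) computes the cheap product-of-norms upper bound that holds for networks with 1-Lipschitz piecewise linear activations such as ReLU, LeakyReLU, or GroupSort. Branch-and-bound methods like the one proposed refine this over activation regions to reach the exact constant; the sketch is only the loose baseline:

```python
# Baseline Lipschitz upper bound: product of layer operator norms.
# Valid (but usually loose) when every activation is itself 1-Lipschitz.
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.normal(size=s) / np.sqrt(s[1])
           for s in [(64, 32), (64, 64), (1, 64)]]   # a hypothetical 3-layer net

def lipschitz_upper_bound(weights, ord=2):
    """Product of spectral norms bounds the network's 2-norm Lipschitz constant."""
    return float(np.prod([np.linalg.norm(W, ord=ord) for W in weights]))

print(lipschitz_upper_bound(weights))
```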
[LG-23] LLM -as-Judge on a Budget
链接: https://arxiv.org/abs/2602.15481
作者: Aadirupa Saha,Aniket Wagde,Branislav Kveton
类目: Machine Learning (cs.LG)
*备注:
Abstract:LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget B, how should queries be allocated across K prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of \tilde{O}\left(\sqrt{\frac{\sum_{i=1}^{K} \sigma_i^2}{B}}\right), where \sigma_i^2 is the unknown score variance for pair i \in [K], with near-optimal budget allocation. Experiments on Summarize-From-Feedback and HelpSteer2 demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.
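The allocation principle is straightforward to simulate. The sketch below (NumPy, synthetic judge scores) illustrates the variance-adaptive idea rather than the paper's exact algorithm: after a short uniform warm-up, each remaining query goes to the pair whose mean estimate currently has the largest sampling variance \sigma_i^2 / n_i:

```python
# Variance-adaptive budget allocation sketch for stochastic LLM-judge scores.
import numpy as np

rng = np.random.default_rng(0)
K, B = 10, 500
true_mu = rng.uniform(1, 5, K)
true_sigma = rng.uniform(0.1, 1.5, K)              # unknown to the allocator

def judge(i):                                      # one stochastic judge query
    return rng.normal(true_mu[i], true_sigma[i])

scores = [[judge(i), judge(i)] for i in range(K)]  # warm-up: 2 queries per pair
for _ in range(B - 2 * K):
    var = np.array([np.var(s, ddof=1) for s in scores])
    n = np.array([len(s) for s in scores])
    i = int(np.argmax(var / n))                    # largest estimation variance
    scores[i].append(judge(i))

mu_hat = np.array([np.mean(s) for s in scores])
print(np.max(np.abs(mu_hat - true_mu)))            # worst-case estimation error
```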
[LG-24] Evaluating Federated Learning for Cross-Country Mood Inference from Smartphone Sensing Data
链接: https://arxiv.org/abs/2602.15478
作者: Sharmad Kalpande,Saurabh Shirke,Haroon R. Lone
类目: Machine Learning (cs.LG)
*备注: 21 pages, 6 figures
Abstract:Mood instability is a key behavioral indicator of mental health, yet traditional assessments rely on infrequent and retrospective reports that fail to capture its continuous nature. Smartphone-based mobile sensing enables passive, in-the-wild mood inference from everyday behaviors; however, deploying such systems at scale remains challenging due to privacy constraints, uneven sensing availability, and substantial variability in behavioral patterns. In this work, we study mood inference using smartphone sensing data in a cross-country federated learning setting, where each country participates as an independent client while retaining local data. We introduce FedFAP, a feature-aware personalized federated framework designed to accommodate heterogeneous sensing modalities across regions. Evaluations across geographically and culturally diverse populations show that FedFAP achieves an AUROC of 0.744, outperforming both centralized approaches and existing personalized federated baselines. Beyond inference, our results offer design insights for mood-aware systems, demonstrating how population-aware personalization and privacy-preserving learning can enable scalable and mood-aware mobile sensing technologies.
[LG-25] POP: Prior-fitted Optimizer Policies
链接: https://arxiv.org/abs/2602.15473
作者: Jan Kobiolka,Christian Frey,Gresa Shala,Arlind Kadra,Erind Bedalli,Josif Grabocka
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Optimization refers to the task of finding extrema of an objective function. Classical gradient-based optimizers are highly sensitive to hyperparameter choices. In highly non-convex settings their performance relies on carefully tuned learning rates, momentum, and gradient accumulation. To address these limitations, we introduce POP (Prior-fitted Optimizer Policies), a meta-learned optimizer that predicts coordinate-wise step sizes conditioned on the contextual information provided in the optimization trajectory. Our model is learned on millions of synthetic optimization problems sampled from a novel prior spanning both convex and non-convex objectives. We evaluate POP on an established benchmark including 47 optimization functions of various complexity, where it consistently outperforms first-order gradient-based methods, non-convex optimization approaches (e.g., evolutionary strategies), Bayesian optimization, and a recent meta-learned competitor under matched budget constraints. Our evaluation demonstrates strong generalization capabilities without task-specific tuning.
[LG-26] Benchmarking IoT Time-Series AD with Event-Level Augmentations
链接: https://arxiv.org/abs/2602.15457
作者: Dmitry Zhevnenko,Ilya Makarov,Aleksandr Kovalenko,Fedor Meshchaninov,Anton Kozhukhov,Vladislav Travnikov,Makar Ippolitov,Kirill Yashunin,Iurii Katser
类目: Machine Learning (cs.LG)
*备注: this https URL
Abstract:Anomaly detection (AD) for safety-critical IoT time series should be judged at the event level: reliability and earliness under realistic perturbations. Yet many studies still emphasize point-level results on curated base datasets, limiting value for model selection in practice. We introduce an evaluation protocol with unified event-level augmentations that simulate real-world issues: calibrated sensor dropout, linear and log drift, additive noise, and window shifts. We also perform sensor-level probing via mask-as-missing zeroing with per-channel influence estimation to support root-cause analysis. We evaluate 14 representative models on five public anomaly datasets (SWaT, WADI, SMD, SKAB, TEP) and two industrial datasets (steam turbine, nuclear turbogenerator) using unified splits and event aggregation. There is no universal winner: graph-structured models transfer best under dropout and long events (e.g., on SWaT under additive noise, F1 drops from 0.804 to 0.677 for a graph autoencoder, from 0.759 to 0.680 for a graph-attention variant, and from 0.762 to 0.756 for a hybrid graph attention model); density/flow models work well on clean stationary plants but can be fragile to monotone drift; spectral CNNs lead when periodicity is strong; reconstruction autoencoders become competitive after basic sensor vetting; predictive/hybrid dynamics help when faults break temporal dependencies but remain window-sensitive. The protocol also informs design choices: on SWaT under log drift, replacing normalizing flows with Gaussian density reduces high-stress F1 from ~0.75 to ~0.57, and fixing a learned DAG gives a small clean-set gain (~0.5-1.0 points) but increases drift sensitivity by ~8x.
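The perturbation families are simple to express. A minimal NumPy sketch follows, with illustrative parameter values rather than the benchmark's calibrated settings:

```python
# Sketch of the four event-level perturbation families applied to
# multivariate IoT series of shape (T, C). Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sensor_dropout(x, p=0.05):
    """Zero out whole channels over random contiguous stretches."""
    x = x.copy()
    for c in range(x.shape[1]):
        if rng.random() < p:
            t0 = rng.integers(0, x.shape[0] // 2)
            x[t0:t0 + x.shape[0] // 4, c] = 0.0
    return x

def linear_drift(x, slope=1e-3):
    return x + slope * np.arange(x.shape[0])[:, None]

def log_drift(x, scale=0.1):
    return x + scale * np.log1p(np.arange(x.shape[0]))[:, None]

def additive_noise(x, std=0.05):
    return x + rng.normal(0, std, x.shape)

def window_shift(x, max_shift=10):
    return np.roll(x, rng.integers(-max_shift, max_shift + 1), axis=0)

x = rng.normal(size=(1000, 8))                 # a hypothetical sensor window
for aug in (sensor_dropout, linear_drift, log_drift, additive_noise, window_shift):
    print(aug.__name__, aug(x).shape)
```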
[LG-27] Fairness over Equality: Correcting Social Incentives in Asymmetric Sequential Social Dilemmas
链接: https://arxiv.org/abs/2602.15407
作者: Alper Demir,Hüseyin Aydın,Kale-ab Abebe Tessera,David Abel,Stefano V. Albrecht
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sequential Social Dilemmas (SSDs) provide a key framework for studying how cooperation emerges when individual incentives conflict with collective welfare. In Multi-Agent Reinforcement Learning, these problems are often addressed by incorporating intrinsic drives that encourage prosocial or fair behavior. However, most existing methods assume that agents face identical incentives in the dilemma and require continuous access to global information about other agents to assess fairness. In this work, we introduce asymmetric variants of well-known SSD environments and examine how natural differences between agents influence cooperation dynamics. Our findings reveal that existing fairness-based methods struggle to adapt under asymmetric conditions, enforcing raw equality that wrongly incentivizes defection. To address this, we propose three modifications: (i) redefining fairness by accounting for agents’ reward ranges, (ii) introducing an agent-based weighting mechanism to better handle inherent asymmetries, and (iii) localizing social feedback to make the methods effective under partial observability without requiring global information sharing. Experimental results show that in asymmetric scenarios, our method fosters faster emergence of cooperative policies compared to existing approaches, without sacrificing scalability or practicality.
[LG-28] Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits
链接: https://arxiv.org/abs/2602.15405
作者: Gilad Nurko,Roi Benita,Yehoshua Dissen,Tomohiro Nakatani,Marc Delcroix,Shoko Araki,Joseph Keshet
类目: Machine Learning (cs.LG)
*备注:
Abstract:Robust classification in noisy environments remains a fundamental challenge in machine learning. Standard approaches typically treat signal enhancement and classification as separate, sequential stages: first enhancing the signal and then applying a classifier. This approach fails to leverage the semantic information in the classifier’s output during denoising. In this work, we propose a general, domain-agnostic framework that integrates two interacting diffusion models: one operating on the input signal and the other on the classifier’s output logits, without requiring any retraining or fine-tuning of the classifier. This coupled formulation enables mutual guidance, where the enhancing signal refines the class estimation and, conversely, the evolving class logits guide the signal reconstruction towards discriminative regions of the manifold. We introduce three strategies to effectively model the joint distribution of the input and the logit. We evaluated our joint enhancement method for image classification and automatic speech recognition. The proposed framework surpasses traditional sequential enhancement baselines, delivering robust and flexible improvements in classification accuracy under diverse noise conditions.
[LG-29] Fractional-Order Federated Learning
链接: https://arxiv.org/abs/2602.15380
作者: Mohammad Partohaghighi,Roummel Marcia,YangQuan Chen
类目: Machine Learning (cs.LG)
*备注: This paper is submitted to IEEE-TAI
Abstract:Federated learning (FL) allows remote clients to train a global model collaboratively while protecting client privacy. Despite its privacy-preserving benefits, FL has significant drawbacks, including slow convergence, high communication cost, and non-independent-and-identically-distributed (non-IID) data. In this work, we present a novel FedAvg variation called Fractional-Order Federated Averaging (FOFedAvg), which incorporates Fractional-Order Stochastic Gradient Descent (FOSGD) to capture long-range relationships and deeper historical information. By introducing memory-aware fractional-order updates, FOFedAvg improves communication efficiency and accelerates convergence while mitigating instability caused by heterogeneous, non-IID client data. We compare FOFedAvg against a broad set of established federated optimization algorithms on benchmark datasets including MNIST, FEMNIST, CIFAR-10, CIFAR-100, EMNIST, the Cleveland heart disease dataset, Sent140, PneumoniaMNIST, and Edge-IIoTset. Across a range of non-IID partitioning schemes, FOFedAvg is competitive with, and often outperforms, these baselines in terms of test performance and convergence speed. On the theoretical side, we prove that FOFedAvg converges to a stationary point under standard smoothness and bounded-variance assumptions for fractional order 0 < \alpha \le 1. Together, these results show that fractional-order, memory-aware updates can substantially improve the robustness and effectiveness of federated learning, offering a practical path toward distributed training on heterogeneous data.
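One common way to realize memory-aware fractional-order updates is through Grünwald–Letnikov binomial weights over past gradients. The sketch below (NumPy) illustrates that generic construction only; the paper's FOSGD and FOFedAvg details may differ:

```python
# Fractional-order gradient-memory sketch using Grunwald-Letnikov weights:
# past gradients enter the update with weights c_k that decay with lag,
# controlled by the fractional order alpha in (0, 1].
import numpy as np

def gl_weights(alpha: float, memory: int) -> np.ndarray:
    """Recursion for (-1)^k * binom(alpha, k): c_0 = 1, c_k = c_{k-1}(1-(alpha+1)/k)."""
    c = np.empty(memory)
    c[0] = 1.0
    for k in range(1, memory):
        c[k] = c[k - 1] * (1.0 - (alpha + 1.0) / k)
    return c

def fosgd_step(w, grad_history, lr=0.1, alpha=0.7):
    """grad_history[0] is the newest gradient; older gradients fade in via c_k."""
    c = gl_weights(alpha, len(grad_history))
    update = sum(ck * g for ck, g in zip(c, grad_history))
    return w - lr * update

# Toy quadratic f(w) = 0.5 * ||w||^2, so grad f(w) = w.
w, history = np.ones(3), []
for _ in range(50):
    history.insert(0, w.copy())          # gradient of the toy objective is w
    history = history[:10]               # truncated memory window
    w = fosgd_step(w, history, alpha=0.7)
print(w)   # decays toward the minimizer 0 (slowly: truncated GL weights nearly cancel)
```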
[LG-30] FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations
链接: https://arxiv.org/abs/2602.15379
作者: Zhihao Shu,Md Musfiqur Rahman Sanim,Hangyu Zheng,Kunxiong Zhu,Miao Yin,Gagan Agrawal,Wei Niu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:The increasing size and complexity of modern deep neural networks (DNNs) pose significant challenges for on-device inference on mobile GPUs, with limited memory and computational resources. Existing DNN acceleration frameworks primarily deploy a weight preloading strategy, where all model parameters are loaded into memory before execution on mobile GPUs. We posit that this approach is not adequate for modern DNN workloads that comprise very large model(s) and possibly execution of several distinct models in succession. In this work, we introduce FlashMem, a memory streaming framework designed to efficiently execute large-scale modern DNNs and multi-DNN workloads while minimizing memory consumption and reducing inference latency. Instead of fully preloading weights, FlashMem statically determines model loading schedules and dynamically streams them on demand, leveraging 2.5D texture memory to minimize data transformations and improve execution efficiency. Experimental results on 11 models demonstrate that FlashMem achieves 2.0x to 8.4x memory reduction and 1.7x to 75.0x speedup compared to existing frameworks, enabling efficient execution of large-scale models and multi-DNN support on resource-constrained mobile GPUs.
[LG-31] ER-MIA: Black-Box Adversarial Memory Injection Attacks on Long-Term Memory-Augmented Large Language Models
链接: https://arxiv.org/abs/2602.15344
作者: Mitchell Piehl,Zhaohan Xi,Zuobin Xiong,Pan He,Muchao Ye
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are increasingly augmented with long-term memory systems to overcome finite context windows and enable persistent reasoning across interactions. However, recent research finds that LLMs become more vulnerable because memory provides extra attack surfaces. In this paper, we present the first systematic study of black-box adversarial memory injection attacks that target the similarity-based retrieval mechanism in long-term memory-augmented LLMs. We introduce ER-MIA, a unified framework that exposes this vulnerability and formalizes two realistic attack settings: content-based attacks and question-targeted attacks. In these settings, ER-MIA includes an arsenal of composable attack primitives and ensemble attacks that achieve high success rates under minimal attacker assumptions. Extensive experiments across multiple LLMs and long-term memory systems demonstrate that similarity-based retrieval constitutes a fundamental and system-level vulnerability, revealing security risks that persist across memory designs and application scenarios.
[LG-32] Directional Reasoning Trajectory Change (DRTC): Identifying Critical Trace Segments in Reasoning Models
链接: https://arxiv.org/abs/2602.15332
作者: Waldemar Chang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Understanding how language models carry out long-horizon reasoning remains an open challenge. Existing interpretability methods often highlight tokens or spans correlated with an answer, but they rarely reveal where the model makes consequential reasoning turns, which earlier context causally triggers those turns, or whether the highlighted text actually steers the reasoning process. We introduce Directional Reasoning Trajectory Change (DRTC), a process-causal framework for interpreting long-form reasoning from a single on-policy rollout. DRTC detects pivot decision points using uncertainty and distribution-shift signals, then applies receiver-side interventions that preserve the realized rollout without resampling the continuation while blocking information flow from selected earlier chunks only at a pivot. It measures whether each intervention redirects the direction of the model’s log-probability trajectory relative to the realized rollout direction, producing a signed per-chunk attribution score. We also compute turning-angle curvature changes on raw logits as a complementary diagnostic and introduce curvature signatures to summarize shared intervention-response geometry. Empirically, directional influence is sharply concentrated across four reasoning models (per-example |DRTC| shares yield Gini 0.50 to 0.58 and top-5 percent mass 0.23 to 0.28), and learned pivots induce stronger intervention magnitudes than matched random spans. In a scaling study on 500 MATH problems with R1-Distill-Qwen-1.5B, learned spans outperform matched random spans (median delta = 0.409, 355 of 500 positive; sign test p = 2.3e-21). Overall, DRTC provides a causally grounded, trajectory-level view of how specific context elements steer reasoning under on-policy dynamics.
[LG-33] Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics
链接: https://arxiv.org/abs/2602.15253
作者: Ihor Kendiukhov
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
Abstract:Neural scaling laws – power-law relationships between loss, model size, and data – have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit the parametric scaling law to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ~ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.
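The abstract does not restate the fitted law; a standard three-parameter form consistent with an irreducible loss floor is L(N) = a N^{-b} + c. The SciPy sketch below fits that assumed form to synthetic stand-in points (the data values are illustrative, not the paper's measurements):

```python
# Fit a saturating power law L(N) = a * N^(-b) + c, where c is the loss floor.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(N, a, b, c):
    return a * N ** (-b) + c

# Seven model sizes spanning ~three orders of magnitude, as in the study.
N = np.array([5.3e2, 1e4, 1e5, 1e6, 1e7, 1e8, 3.4e8])
rng = np.random.default_rng(0)
L = scaling_law(N, a=5.0, b=0.2, c=1.44) * (1 + 0.02 * rng.normal(size=N.size))

(a, b, c), _ = curve_fit(scaling_law, N, L, p0=(1.0, 0.1, 1.0), maxfev=10000)
print(f"fitted floor c = {c:.3f}")   # compare with the reported c ~ 1.44
```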
[LG-34] Size Transferability of Graph Transformers with Convolutional Positional Encodings
链接: https://arxiv.org/abs/2602.15239
作者: Javier Porras-Valenzuela,Zhiyang Wang,Alejandro Ribeiro
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformers have achieved remarkable success across domains, motivating the rise of Graph Transformers (GTs) as attention-based architectures for graph-structured data. A key design choice in GTs is the use of Graph Neural Network (GNN)-based positional encodings to incorporate structural information. In this work, we study GTs through the lens of manifold limit models for graph sequences and establish a theoretical connection between GTs with GNN positional encodings and Manifold Neural Networks (MNNs). Building on transferability results for GNNs under manifold convergence, we show that GTs inherit transferability guarantees from their positional encodings. In particular, GTs trained on small graphs provably generalize to larger graphs under mild assumptions. We complement our theory with extensive experiments on standard graph benchmarks, demonstrating that GTs exhibit scalable behavior on par with GNNs. To further demonstrate the efficiency of transferable GTs in a real-world scenario, we implement GTs for shortest-path distance estimation over terrains. Our results provide new insights into the understanding of GTs and suggest practical directions for efficient training of GTs in large-scale settings.
[LG-35] BindCLIP: A Unified Contrastive-Generative Representation Learning Framework for Virtual Screening
链接: https://arxiv.org/abs/2602.15236
作者: Anjie Qiao,Zhen Wang,Yaliang Li,Jiahua Rao,Yuedong Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Virtual screening aims to efficiently identify active ligands from massive chemical libraries for a given target pocket. Recent CLIP-style models such as DrugCLIP enable scalable virtual screening by embedding pockets and ligands into a shared space. However, our analyses indicate that such representations can be insensitive to fine-grained binding interactions and may rely on shortcut correlations in training data, limiting their ability to rank ligands by true binding compatibility. To address these issues, we propose BindCLIP, a unified contrastive-generative representation learning framework for virtual screening. BindCLIP jointly trains pocket and ligand encoders using CLIP-style contrastive learning together with a pocket-conditioned diffusion objective for binding pose generation, so that pose-level supervision directly shapes the retrieval embedding space toward interaction-relevant features. To further mitigate shortcut reliance, we introduce hard-negative augmentation and a ligand-ligand anchoring regularizer that prevents representation collapse. Experiments on two public benchmarks demonstrate consistent improvements over strong baselines. BindCLIP achieves substantial gains on challenging out-of-distribution virtual screening and improves ligand-analogue ranking on the FEP+ benchmark. Together, these results indicate that integrating generative, pose-level supervision with contrastive learning yields more interaction-aware embeddings and improves generalization in realistic screening settings, bringing virtual screening closer to real-world applicability.
[LG-36] ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
链接: https://arxiv.org/abs/2602.15210
作者: DatologyAI:Aldo Gael Carranza,Kaleigh Mentzer,Ricardo Pio Monti,Alex Fang,Alvin Deng,Amro Abbas,Anshuman Suri,Brett Larsen,Cody Blakeney,Darren Teh,David Schwab,Diego Kiner,Fan Pan,Haakon Mongstad,Jack Urbanek,Jason Lee,Jason Telanoff,Josh Wills,Luke Merrick,Parth Doshi,Paul Burstein,Pratyush Maini,Spandan Das,Tony Jiang,Vineeth Dorna,Zhengping Wang,Bogdan Gaza,Ari Morcos,Matthew Leavitt
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the “curse of multilinguality”. We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling but instead stem from correctable deficiencies in data quality and composition rather than fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance in 12 of 13 languages, while curating non-English yields reciprocal improvements in English. Bespoke per-language curation produces substantially larger within-language improvements. Extending these findings to large-scale general-purpose training mixtures, we show that curated multilingual allocations comprising under 8% of total tokens remain remarkably effective. We operationalize this approach within an effort that produced a 20T-token pretraining corpus derived entirely from public sources. Models with 3B and 8B parameters trained on a 1T-token random subset achieve competitive multilingual accuracy with 4-10x fewer training FLOPs than strong public baselines, establishing a new Pareto frontier in multilingual performance versus compute. Moreover, these benefits extend to frontier model scale: the 20T-token corpus served as part of the pretraining dataset for Trinity Large (400B/A13B), which exhibits strong multilingual performance relative to its training FLOPs. These results show that targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling.
[LG-37] COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression
链接: https://arxiv.org/abs/2602.15200
作者: Denis Makhov,Dmitriy Shopkhoev,Magauiya Zhussip,Ammar Ali,Baher Mohammad,Stamatios Lefkimmiatis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Post-training compression of Transformer models commonly relies on truncated singular value decomposition (SVD). However, enforcing a single shared subspace can degrade accuracy even at moderate compression. Sparse dictionary learning provides a more flexible union-of-subspaces representation, but existing approaches often suffer from iterative dictionary and coefficient updates. We propose COMPOT (Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers), a training-free compression framework that uses a small calibration dataset to estimate a sparse weight factorization. COMPOT employs orthogonal dictionaries that enable closed-form Procrustes updates for the dictionary and analytical single-step sparse coding for the coefficients, eliminating iterative optimization. To handle heterogeneous layer sensitivity under a global compression budget, COMPOT further introduces a one-shot dynamic allocation strategy that adaptively redistributes layer-wise compression rates. Extensive experiments across diverse architectures and tasks show that COMPOT consistently delivers a superior quality-compression trade-off over strong low-rank and sparse baselines, while remaining fully compatible with post-training quantization for extreme compression. Code is available at this https URL.
[LG-38] Learning Data-Efficient and Generalizable Neural Operators via Fundamental Physics Knowledge
链接: https://arxiv.org/abs/2602.15184
作者: Siying Ma,Mehrdad M. Zadeh,Mauricio Soroco,Wuyang Chen,Jiguo Cao,Vijay Ganesh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Recent advances in scientific machine learning (SciML) have enabled neural operators (NOs) to serve as powerful surrogates for modeling the dynamic evolution of physical systems governed by partial differential equations (PDEs). While existing approaches focus primarily on learning simulations from the target PDE, they often overlook more fundamental physical principles underlying these equations. Inspired by how numerical solvers are compatible with simulations of different settings of PDEs, we propose a multiphysics training framework that jointly learns from both the original PDEs and their simplified basic forms. Our framework enhances data efficiency, reduces predictive errors, and improves out-of-distribution (OOD) generalization, particularly in scenarios involving shifts of physical parameters and synthetic-to-real transfer. Our method is architecture-agnostic and demonstrates consistent improvements in normalized root mean square error (nRMSE) across a wide range of 1D/2D/3D PDE problems. Through extensive experiments, we show that explicit incorporation of fundamental physics knowledge significantly strengthens the generalization ability of neural operators. We will release models and codes at this https URL.
[LG-39] Learning Representations from Incomplete EHR Data with Dual-Masked Autoencoding
链接: https://arxiv.org/abs/2602.15159
作者: Xiao Xiang,David Restrepo,Hyewon Jeong,Yugang Jia,Leo Anthony Celi
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures
Abstract:Learning from electronic health record (EHR) time series is challenging due to irregular sampling, heterogeneous missingness, and the resulting sparsity of observations. Prior self-supervised methods either impute before learning, represent missingness through a dedicated input signal, or optimize solely for imputation, reducing their capacity to efficiently learn representations that support clinical downstream tasks. We propose the Augmented-Intrinsic Dual-Masked Autoencoder (AID-MAE), which learns directly from incomplete time series by applying an intrinsic missing mask to represent naturally missing values and an augmented mask that hides a subset of observed values for reconstruction during training. AID-MAE processes only the unmasked subset of tokens and consistently outperforms strong baselines, including XGBoost and DuETT, across multiple clinical tasks on two datasets. In addition, the learned embeddings naturally stratify patient cohorts in the representation space.
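The dual-mask logic can be sketched in a few lines of PyTorch. Note one simplification: the sketch zero-fills masked positions and runs a toy dense network, whereas AID-MAE processes only the unmasked token subset; all tensor shapes and the model are hypothetical:

```python
# Dual-mask sketch: the intrinsic mask marks values actually observed in the
# raw EHR series; the augmented mask hides a random subset of observed values;
# the reconstruction loss uses only augmented-but-observed positions.
import torch

x = torch.randn(32, 48, 16)                       # (batch, time, features)
x[torch.rand_like(x) < 0.3] = float("nan")        # simulate intrinsic missingness

intrinsic_mask = ~torch.isnan(x)                  # True where actually observed
augmented_mask = (torch.rand_like(x) < 0.25) & intrinsic_mask  # hide observed vals

x_in = torch.nan_to_num(x, nan=0.0)
x_in = x_in.masked_fill(augmented_mask, 0.0)      # model never sees these targets

model = torch.nn.Sequential(                      # toy stand-in encoder/decoder
    torch.nn.Linear(16, 64), torch.nn.GELU(), torch.nn.Linear(64, 16)
)
recon = model(x_in)

# Loss only where a ground-truth value exists and was deliberately hidden.
loss = ((recon - torch.nan_to_num(x)) ** 2)[augmented_mask].mean()
loss.backward()
print(float(loss))
```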
[LG-40] Hybrid Feature Learning with Time Series Embeddings for Equipment Anomaly Prediction
链接: https://arxiv.org/abs/2602.15089
作者: Takato Yasuno
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 17 pages, 7 figures, 1 table
Abstract:In predictive maintenance of equipment, deep learning-based time series anomaly detection has garnered significant attention; however, pure deep learning approaches often fail to achieve sufficient accuracy on real-world data. This study proposes a hybrid approach that integrates 64-dimensional time series embeddings from Granite TinyTimeMixer with 28-dimensional statistical features based on domain knowledge for HVAC equipment anomaly prediction tasks. Specifically, we combine time series embeddings extracted from a Granite TinyTimeMixer encoder fine-tuned with LoRA (Low-Rank Adaptation) and 28 types of statistical features including trend, volatility, and drawdown indicators, which are then learned using a LightGBM gradient boosting classifier. In experiments using 64 equipment units and 51,564 samples, we achieved Precision of 91–95% and ROC-AUC of 0.995 for anomaly prediction at 30-day, 60-day, and 90-day horizons. Furthermore, we achieved production-ready performance with a false positive rate of 1.1% or less and a detection rate of 88–94%, demonstrating the effectiveness of the system for predictive maintenance applications. This work demonstrates that practical anomaly detection systems can be realized by leveraging the complementary strengths between deep learning’s representation learning capabilities and statistical feature engineering.
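A minimal sketch of the hybrid pipeline follows (LightGBM; the embedding matrix is a random stand-in for the fine-tuned TinyTimeMixer encoder output, the labels are synthetic, and all feature names are hypothetical):

```python
# Hybrid-feature sketch: concatenate a 64-d time series embedding with 28
# statistical features and train a gradient-boosted classifier on the result.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 2000
emb = rng.normal(size=(n, 64))       # encoder embeddings (random stand-in)
stats = rng.normal(size=(n, 28))     # trend / volatility / drawdown features
X = np.hstack([emb, stats])          # 92-d hybrid feature vector
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)

clf = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
clf.fit(X[:1500], y[:1500])
print(clf.score(X[1500:], y[1500:]))  # holdout accuracy on the synthetic task
```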
[LG-41] Near-Optimal Sample Complexity for Online Constrained MDPs
链接: https://arxiv.org/abs/2602.15076
作者: Chang Liu,Yunfan Li,Lin F. Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025
Abstract:Safety is a fundamental challenge in reinforcement learning (RL), particularly in real-world applications such as autonomous driving, robotics, and healthcare. To address this, Constrained Markov Decision Processes (CMDPs) are commonly used to enforce safety constraints while optimizing performance. However, existing methods often suffer from significant safety violations or require a high sample complexity to generate near-optimal policies. We address two settings: relaxed feasibility, where small violations are allowed, and strict feasibility, where no violation is allowed. We propose a model-based primal-dual algorithm that balances regret and bounded constraint violations, drawing on techniques from online RL and constrained optimization. For relaxed feasibility, we prove that our algorithm returns an \varepsilon-optimal policy with \varepsilon-bounded violation with arbitrarily high probability, requiring \tilde{O}\left(\frac{SAH^3}{\varepsilon^2}\right) learning episodes, matching the lower bound for unconstrained MDPs. For strict feasibility, we prove that our algorithm returns an \varepsilon-optimal policy with zero violation with arbitrarily high probability, requiring \tilde{O}\left(\frac{SAH^5}{\varepsilon^2\zeta^2}\right) learning episodes, where \zeta is the problem-dependent Slater constant characterizing the size of the feasible region. This result matches the lower bound for learning CMDPs with access to a generative model. Our results demonstrate that learning CMDPs in an online setting is as easy as learning with a generative model and is no more challenging than learning unconstrained MDPs when small violations are allowed.
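The primal-dual mechanism behind such algorithms can be illustrated generically: a dual variable lambda prices constraint violation and is updated by projected ascent on the estimated cost gap. The NumPy sketch below uses a synthetic episode model and is not the paper's model-based procedure:

```python
# Generic primal-dual CMDP sketch: the policy is (conceptually) optimized
# against reward - lambda * cost, while lambda tracks constraint violation.
import numpy as np

rng = np.random.default_rng(0)
lam, eta, budget = 0.0, 0.05, 1.0                 # dual var, dual step, cost cap b

def run_episode(lam):
    """Stand-in for an episode under the lambda-penalized policy: returns
    (estimated reward, estimated cumulative cost). Higher lambda -> safer."""
    cost = max(0.0, 1.5 - 0.4 * lam) + 0.05 * rng.normal()
    reward = 2.0 - 0.3 * max(0.0, lam - 1.0) + 0.05 * rng.normal()
    return reward, cost

for episode in range(200):
    reward, cost = run_episode(lam)
    lam = max(0.0, lam + eta * (cost - budget))   # projected dual ascent

print(f"final lambda = {lam:.2f}")                # settles where cost ~ budget
```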
[LG-42] VQ-DSC-R: Robust Vector Quantized-Enabled Digital Semantic Communication With OFDM Transmission
链接: https://arxiv.org/abs/2602.15045
作者: Jianqiao Chen,Nan Ma,Xiaodong Xu,Tingting Zhu,Huishi Song,Chen Dong,Wenkai Liu,Rui Meng,Ping Zhang
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Digital mapping of semantic features is essential for achieving interoperability between semantic communication and practical digital infrastructure. However, current research efforts predominantly concentrate on analog semantic communication with simplified channel models. To bridge these gaps, we develop a robust vector quantized-enabled digital semantic communication (VQ-DSC-R) system built upon orthogonal frequency division multiplexing (OFDM) transmission. Our work encompasses the framework design of VQ-DSC-R, followed by a comprehensive optimization study. Firstly, we design a Swin Transformer-based backbone for hierarchical semantic feature extraction, integrated with VQ modules that map the features into a shared semantic quantized codebook (SQC) for efficient index transmission. Secondly, we propose a differentiable vector quantization with adaptive noise-variance (ANDVQ) scheme to mitigate quantization errors in SQC, which dynamically adjusts the quantization process using K-nearest neighbor statistics, while exponential moving average mechanism stabilizes SQC training. Thirdly, for robust index transmission over multipath fading channel and noise, we develop a conditional diffusion model (CDM) to refine channel state information, and design an attention-based module to dynamically adapt to channel noise. The entire VQ-DSC-R system is optimized via a three-stage training strategy. Extensive experiments demonstrate superiority of VQ-DSC-R over benchmark schemes, achieving high compression ratios and robust performance in practical scenarios.
[LG-43] High Convergence Rates of CMOS Invertible Logic Circuits Based on Many-Body Hamiltonians
链接: https://arxiv.org/abs/2602.15033
作者: Naoya Onizawa,Takahiro Hanyu
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 5 pages
Abstract:This paper introduces CMOS invertible-logic (CIL) circuits based on many-body Hamiltonians. CIL can realize probabilistic forward and backward operations of a function by annealing a corresponding Hamiltonian using stochastic computing. We have created a Hamiltonian that includes three-body interactions of spins (probabilistic nodes). It provides additional degrees of freedom to design a simpler landscape of the Hamiltonian (energy) than that of the conventional two-body Hamiltonian. The simpler landscape makes it easier to reach the global minimum energy. The proposed three-body CIL circuits are designed and evaluated against conventional two-body CIL circuits, resulting in several-fold higher convergence rates with negligible area overhead on FPGA.
[LG-44] Ensemble-size-dependence of deep-learning post-processing methods that minimize an (un)fair score: motivating examples and a proof-of-concept solution
链接: https://arxiv.org/abs/2602.15830
作者: Christopher David Roberts
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:Fair scores reward ensemble forecast members that behave like samples from the same distribution as the verifying observations. They are therefore an attractive choice as loss functions to train data-driven ensemble forecasts or post-processing methods when large training ensembles are either unavailable or computationally prohibitive. The adjusted continuous ranked probability score (aCRPS) is fair and unbiased with respect to ensemble size, provided forecast members are exchangeable and interpretable as conditionally independent draws from an underlying predictive distribution. However, distribution-aware post-processing methods that introduce structural dependency between members can violate this assumption, rendering aCRPS unfair. We demonstrate this effect using two approaches designed to minimize the expected aCRPS of a finite ensemble: (1) a linear member-by-member calibration, which couples members through a common dependency on the sample ensemble mean, and (2) a deep-learning method, which couples members via transformer self-attention across the ensemble dimension. In both cases, the results are sensitive to ensemble size and apparent gains in aCRPS can correspond to systematic unreliability characterized by over-dispersion. We introduce trajectory transformers as a proof-of-concept that ensemble-size independence can be achieved. This approach is an adaptation of the Post-processing Ensembles with Transformers (PoET) framework and applies self-attention over lead time while preserving the conditional independence required by aCRPS. When applied to weekly mean T_{2m} forecasts from the ECMWF subseasonal forecasting system, this approach successfully reduces systematic model biases whilst also improving or maintaining forecast reliability regardless of the ensemble size used in training (3 vs 9 members) or real-time forecasts (9 vs 100 members).
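Assuming the aCRPS here coincides with the fair ensemble CRPS commonly used in this literature, the adjustment amounts to dividing the ensemble-spread term by M(M-1) rather than M^2, which removes the bias that rewards under-dispersion in small ensembles. A minimal NumPy sketch:

```python
# Fair vs. standard ensemble CRPS: the spread term is divided by M(M-1)
# in the fair version and by M^2 in the standard one.
import numpy as np

def crps_ensemble(members: np.ndarray, obs: float, fair: bool = True) -> float:
    m = len(members)
    term_err = np.mean(np.abs(members - obs))
    pairwise = np.abs(members[:, None] - members[None, :]).sum()
    denom = m * (m - 1) if fair else m * m
    return term_err - pairwise / (2 * denom)

rng = np.random.default_rng(0)
obs = rng.normal()
for m in (3, 9, 100):                  # ensemble sizes from the paper's setup
    ens = rng.normal(size=m)           # members drawn from the obs distribution
    print(m, round(crps_ensemble(ens, obs, fair=True), 3),
             round(crps_ensemble(ens, obs, fair=False), 3))
```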
[LG-45] Neural Scaling Laws for Boosted Jet Tagging
链接: https://arxiv.org/abs/2602.15781
作者: Matthias Vigl,Nicole Hartman,Michael Kagan,Lukas Heinrich
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 9 pages, 6 figures
Abstract:The success of Large Language Models (LLMs) has established that scaling compute, through joint increases in model capacity and dataset size, is the primary driver of performance in modern machine learning. While machine learning has long been an integral component of High Energy Physics (HEP) data analysis workflows, the compute used to train state-of-the-art HEP models remains orders of magnitude below that of industry foundation models. With scaling laws only beginning to be studied in the field, we investigate neural scaling laws for boosted jet classification using the public JetClass dataset. We derive compute optimal scaling laws and identify an effective performance limit that can be consistently approached through increased compute. We study how data repetition, common in HEP where simulation is expensive, modifies the scaling yielding a quantifiable effective dataset size gain. We then study how the scaling coefficients and asymptotic performance limits vary with the choice of input features and particle multiplicity, demonstrating that increased compute reliably drives performance toward an asymptotic limit, and that more expressive, lower-level features can raise the performance limit and improve results at fixed dataset size.
[LG-46] Enabling Low-Latency Machine learning on Radiation-Hard FPGAs with hls4ml
链接: https://arxiv.org/abs/2602.15751
作者: Katya Govorkova,Julian Garcia Pardinas,Vladimir Loncar,Victoria Nguyen,Sebastian Schmitt,Marco Pizzichemi,Loris Martinazzoli,Eluned Anne Smith
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
*备注:
Abstract:This paper presents the first demonstration of a viable, ultra-fast, radiation-hard machine learning (ML) application on FPGAs, which could be used in future high-energy physics experiments. We present a three-fold contribution, with the PicoCal calorimeter, planned for the LHCb Upgrade II experiment, used as a test case. First, we develop a lightweight autoencoder to compress a 32-sample timing readout, representative of that of the PicoCal, into a two-dimensional latent space. Second, we introduce a systematic, hardware-aware quantization strategy and show that the model can be reduced to 10-bit weights with minimal performance loss. Third, since a barrier to the adoption of on-detector ML is the lack of support for radiation-hard FPGAs in the High-Energy Physics community’s standard ML synthesis tool, hls4ml, we develop a new backend for this library. This new backend enables the automatic translation of ML models into High-Level Synthesis (HLS) projects for the Microchip PolarFire family of FPGAs, one of the few commercially available and radiation-hard FPGAs. We present the synthesis of the autoencoder on a target PolarFire FPGA, which indicates that a latency of 25 ns can be achieved. We show that the resources utilized are low enough that the model can be placed within the inherently protected logic of the FPGA. Our extension to hls4ml is a significant contribution, paving the way for broader adoption of ML on FPGAs in high-radiation environments.
[LG-47] Latency-aware Human-in-the-Loop Reinforcement Learning for Semantic Communications
链接: https://arxiv.org/abs/2602.15640
作者: Peizheng Li,Xinyi Lin,Adnan Aijaz
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 8 figures. This paper has been accepted for publication in IEEE ICC 2026
Abstract:Semantic communication promises task-aligned transmission but must reconcile semantic fidelity with stringent latency guarantees in immersive and safety-critical services. This paper introduces a time-constrained human-in-the-loop reinforcement learning (TC-HITL-RL) framework that embeds human feedback, semantic utility, and latency control within a semantic-aware Open radio access network (RAN) architecture. We formulate semantic adaptation driven by human feedback as a constrained Markov decision process (CMDP) whose state captures semantic quality, human preferences, queue slack, and channel dynamics, and solve it via a primal–dual proximal policy optimization algorithm with action shielding and latency-aware reward shaping. The resulting policy preserves PPO-level semantic rewards while tightening the variability of both air-interface and near-real-time RAN intelligent controller processing budgets. Simulations over point-to-multipoint links with heterogeneous deadlines show that TC-HITL-RL consistently meets per-user timing constraints, outperforms baseline schedulers in reward, and stabilizes resource consumption, providing a practical blueprint for latency-aware semantic adaptation.
[LG-48] Neural-POD: A Plug-and-Play Neural Operator Framework for Infinite-Dimensional Functional Nonlinear Proper Orthogonal Decomposition
链接: https://arxiv.org/abs/2602.15632
作者: Changhong Mou,Binghang Lu,Guang Lin
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:The rapid development of AI for Science is often hindered by discretization, where learned representations remain restricted to the specific grids or resolutions used during training. We propose the Neural Proper Orthogonal Decomposition (Neural-POD), a plug-and-play neural operator framework that constructs nonlinear, orthogonal basis functions in infinite-dimensional space using neural networks. Unlike the classical Proper Orthogonal Decomposition (POD), which is limited to linear subspace approximations obtained through singular value decomposition (SVD), Neural-POD formulates basis construction as a sequence of residual minimization problems solved through neural network training. Each basis function is obtained by learning to represent the remaining structure in the data, following a process analogous to Gram–Schmidt orthogonalization. This neural formulation introduces several key advantages over classical POD: it enables optimization in arbitrary norms (e.g., L^2, L^1), learns resolution-invariant mappings between infinite-dimensional function spaces, generalizes effectively to unseen parameter regimes, and inherently captures nonlinear structures in complex spatiotemporal systems. The resulting basis functions are interpretable and reusable, and enable integration into both reduced order modeling (ROM) and operator learning frameworks such as deep operator networks (DeepONet). We demonstrate the robustness of Neural-POD on several complex spatiotemporal systems, including the Burgers’ and Navier-Stokes equations. We further show that Neural-POD serves as a high-performance, plug-and-play bridge between classical Galerkin projection and operator learning, enabling consistent integration with both projection-based reduced order models and DeepONet frameworks.
[LG-49] Uni-Flow: a unified autoregressive-diffusion model for complex multiscale flows
链接: https://arxiv.org/abs/2602.15592
作者: Xiao Xue,Tianyue Yang,Mingyang Gao,Leyu Pan,Maida Wang,Kewei Zhu,Shuo Wang,Jiuling Li,Marco F.P. ten Eikelder,Peter V. Coveney
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Spatiotemporal flows govern diverse phenomena across physics, biology, and engineering, yet modelling their multiscale dynamics remains a central challenge. Despite major advances in physics-informed machine learning, existing approaches struggle to simultaneously maintain long-term temporal evolution and resolve fine-scale structure across chaotic, turbulent, and physiological regimes. Here, we introduce Uni-Flow, a unified autoregressive-diffusion framework that explicitly separates temporal evolution from spatial refinement for modelling complex dynamical systems. The autoregressive component learns low-resolution latent dynamics that preserve large-scale structure and ensure stable long-horizon rollouts, while the diffusion component reconstructs high-resolution physical fields, recovering fine-scale features in a small number of denoising steps. We validate Uni-Flow across canonical benchmarks, including two-dimensional Kolmogorov flow, three-dimensional turbulent channel inflow generation with a quantum-informed autoregressive prior, and patient-specific simulations of aortic coarctation derived from high-fidelity lattice Boltzmann hemodynamic solvers. In the cardiovascular setting, Uni-Flow enables faster-than-real-time inference of pulsatile hemodynamics at the task level, reconstructing high-resolution pressure fields over physiologically relevant time horizons in seconds rather than hours. By transforming high-fidelity hemodynamic simulation from an offline, HPC-bound process into a deployable surrogate, Uni-Flow establishes a pathway to faster-than-real-time modelling of complex multiscale flows, with broad implications for scientific machine learning in flow physics.
[LG-50] Scenario Approach with Post-Design Certification of User-Specified Properties
链接: https://arxiv.org/abs/2602.15568
作者: Algo Carè,Marco C. Campi,Simone Garatti
类目: Methodology (stat.ME); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:
Abstract:The scenario approach is an established data-driven design framework that comes equipped with a powerful theory linking design complexity to generalization properties. In this approach, data are simultaneously used both for design and for certifying the design’s reliability, without resorting to a separate test dataset. This paper takes a step further by guaranteeing additional properties, useful in post-design usage but not considered during the design phase. To this end, we introduce a two-level framework of appropriateness: baseline appropriateness, which guides the design process, and post-design appropriateness, which serves as a criterion for a posteriori evaluation. We provide distribution-free upper bounds on the risk of failing to meet the post-design appropriateness; these bounds are computable without using any additional test data. Under additional assumptions, lower bounds are also derived. As part of an effort to demonstrate the usefulness of the proposed methodology, the paper presents two practical examples involving H_2 and pole-placement problems. Moreover, a method is provided to infer comprehensive distributional knowledge of relevant performance indexes from the available dataset.
[LG-51] Functional Central Limit Theorem for Stochastic Gradient Descent
Link: https://arxiv.org/abs/2602.15538
Authors: Kessang Flamand, Victor-Emmanuel Brunel
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:
Abstract:We study the asymptotic shape of the trajectory of the stochastic gradient descent algorithm applied to a convex objective function. Under mild regularity assumptions, we prove a functional central limit theorem for the properly rescaled trajectory. Our result characterizes the long-term fluctuations of the algorithm around the minimizer by providing a diffusion limit for the trajectory. In contrast with classical central limit theorems for the last iterate or Polyak-Ruppert averages, this functional result captures the temporal structure of the fluctuations and applies to non-smooth settings such as robust location estimation, including the geometric median.
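As background, here is the schematic shape such a functional limit theorem typically takes; the notation, step-size regime, and scaling below are ours for illustration and need not match the paper's exact assumptions.

```latex
% SGD with step sizes $\gamma_k = \gamma k^{-\alpha}$, $\alpha \in (1/2, 1)$,
% on a convex objective $f$ with minimizer $\theta^\ast$:
\[
  \theta_{k+1} \;=\; \theta_k - \gamma_k\bigl(\nabla f(\theta_k) + \varepsilon_{k+1}\bigr).
\]
% Schematically, the rescaled trajectory converges weakly to an
% Ornstein--Uhlenbeck-type diffusion around $\theta^\ast$:
\[
  \gamma_n^{-1/2}\bigl(\theta_{n+\lfloor t/\gamma_n\rfloor} - \theta^\ast\bigr)
  \;\Longrightarrow\; X_t,
  \qquad
  \mathrm{d}X_t \;=\; -H X_t\,\mathrm{d}t + \Sigma^{1/2}\,\mathrm{d}B_t,
\]
% where $H$ is the curvature at $\theta^\ast$ and $\Sigma$ the
% gradient-noise covariance. Unlike a last-iterate CLT, the functional
% limit keeps the temporal correlation structure of the fluctuations.
```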
[LG-52] Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction
Link: https://arxiv.org/abs/2602.15484
Authors: Amartyaveer, Murali Kadambi, Chandra Mohan Sharma, Anupam Mondal, Prasanta Kumar Ghosh
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: 7 pages, 7 tables, 2 figures, ASRU 2025
Abstract:In this study, we present a novel approach to predict the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for calculating STOI typically require clean reference speech, which limits their applicability in the real world. To address this, numerous deep learning-based non-intrusive speech assessment models have garnered significant interest. Many studies have achieved commendable performance, but there is room for further improvement. We propose the use of a bottleneck transformer, incorporating convolution blocks for learning frame-level features and a multi-head self-attention (MHSA) layer to aggregate the information. These components enable the transformer to focus on the key aspects of the input data. Our model shows higher correlation and lower mean squared error for both seen and unseen scenarios compared to the state-of-the-art model using self-supervised learning (SSL) and spectral features as inputs.
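A minimal sketch of this family of models is below; the layer sizes, log-mel input features, and mean pooling are our assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch of a bottleneck-transformer STOI regressor: convolutional
# blocks extract frame-level features, one MHSA layer aggregates them
# globally, and a small head regresses a single score in [0, 1].

class BottleneckSTOI(nn.Module):
    def __init__(self, n_mels=80, d_model=128, n_heads=4):
        super().__init__()
        self.conv = nn.Sequential(                     # frame-level features
            nn.Conv1d(n_mels, d_model, 3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, 3, padding=1), nn.ReLU(),
        )
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, spec):            # spec: (batch, n_mels, frames)
        h = self.conv(spec).transpose(1, 2)          # (batch, frames, d_model)
        h, _ = self.mhsa(h, h, h)                    # global aggregation
        return self.head(h.mean(dim=1)).squeeze(-1)  # (batch,) scores in [0,1]

model = BottleneckSTOI()
scores = model(torch.randn(2, 80, 300))
print(scores.shape)   # torch.Size([2])
```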
[LG-53] Fluids You Can Trust: Property-Preserving Operator Learning for Incompressible Flows
Link: https://arxiv.org/abs/2602.15472
Authors: Ramansh Sharma, Matthew Lowery, Houman Owhadi, Varun Shankar
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*Comments:
Abstract:We present a novel property-preserving kernel-based operator learning method for incompressible flows governed by the incompressible Navier-Stokes equations. Traditional numerical solvers incur significant computational costs to respect incompressibility. Operator learning offers efficient surrogate models, but current neural operators fail to exactly enforce physical properties such as incompressibility, periodicity, and turbulence. Our method maps input functions to expansion coefficients of output functions in a property-preserving kernel basis, ensuring that predicted velocity fields analytically and simultaneously preserve the aforementioned physical properties. We evaluate the method on challenging 2D and 3D, laminar and turbulent, incompressible flow problems. Our method achieves up to six orders of magnitude lower relative $\ell_2$ errors upon generalization and trains up to five orders of magnitude faster compared to neural operators. Moreover, while our method enforces incompressibility analytically, neural operators exhibit very large deviations. Our results show that our method provides an accurate and efficient surrogate for incompressible flows.
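The idea of baking a property into the representation itself can be illustrated without the paper's kernel machinery: in 2D, any velocity field written as the curl of a stream function is divergence-free by construction, so a surrogate predicting the stream function (or coefficients of such basis fields) satisfies incompressibility exactly. The check below is ours, not the authors' construction.

```python
import numpy as np

# u = (d psi/dy, -d psi/dx) is analytically divergence-free for any psi.
n = 256
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")

psi = np.sin(3 * X) * np.sin(2 * Y)        # a sample stream function
u = 2 * np.sin(3 * X) * np.cos(2 * Y)      # d psi / dy  (analytic)
v = -3 * np.cos(3 * X) * np.sin(2 * Y)     # -d psi / dx (analytic)

# spectral divergence: should vanish to machine precision
k = np.fft.fftfreq(n, d=2 * np.pi / n) * 2 * np.pi
div = np.fft.ifft2(1j * k[:, None] * np.fft.fft2(u)
                   + 1j * k[None, :] * np.fft.fft2(v)).real
print(np.abs(div).max())    # ~1e-13: incompressibility holds analytically
```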
[LG-54] The Skeletal Trap: Mapping Spatial Inequality and Ghost Stops in Ankara's Transit Network
Link: https://arxiv.org/abs/2602.15470
Authors: Elifnaz Kancan
Subjects: Physics and Society (physics.soc-ph); Machine Learning (cs.LG)
*Comments: 13 pages, 12 figures. Spatial analysis of Ankara transit network using anomaly detection and grid-based modeling
Abstract:Ankara’s public transport crisis is commonly framed as a shortage of buses or operational inefficiency. This study argues that the problem is fundamentally morphological and structural. The city’s leapfrog urban expansion has produced fragmented peripheral clusters disconnected from a rigid, center-oriented bus network. As a result, demand remains intensely concentrated along the Kizilay-Ulus axis and western corridors, while peripheral districts experience either chronic under-service or enforced transfer dependency. The deficiency is therefore not merely quantitative but rooted in the misalignment between urban macroform and network architecture. The empirical analysis draws on a 173-day operational dataset derived from route-level passenger and trip reports published by EGO under the former “Transparent Ankara” initiative. To overcome the absence of stop-level geospatial data, a Connectivity-Based Weighted Distribution Model reallocates passenger volumes to 1 km x 1 km grid cells using network centrality. The findings reveal persistent center-periphery asymmetries, structural bottlenecks, and spatially embedded accessibility inequalities.
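The reallocation step can be sketched in a few lines; all inputs below (grid size, centrality scores, route-to-cell incidence) are made up for illustration, and this is our reading of the weighting scheme rather than the paper's exact model.

```python
import numpy as np

# Connectivity-based weighted distribution: each route's passenger
# volume is spread over the grid cells it touches, with weights
# proportional to each cell's network-centrality score, in lieu of
# missing stop-level coordinates.

rng = np.random.default_rng(0)
n_cells, n_routes = 100, 8                 # 10 x 10 toy grid, 8 toy routes
centrality = rng.random(n_cells)           # per-cell centrality (hypothetical)
route_volume = rng.integers(1_000, 50_000, n_routes)

# incidence[r, c] = 1 if route r is mapped onto cell c
incidence = (rng.random((n_routes, n_cells)) < 0.15).astype(float)

weights = incidence * centrality                    # centrality-weighted
weights /= weights.sum(axis=1, keepdims=True)       # normalise per route
cell_load = route_volume @ weights                  # (n_cells,) passengers

print(cell_load.reshape(10, 10).round(0))           # grid-cell demand map
```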
[LG-55] Sparse Additive Model Pruning for Order-Based Causal Structure Learning AAAI2026
Link: https://arxiv.org/abs/2602.15306
Authors: Kentaro Kanamori, Hirofumi Suzuki, Takuya Takagi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 15 pages, 12 figures, to appear in the 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract:Causal structure learning, also known as causal discovery, aims to estimate causal relationships between variables as a form of a causal directed acyclic graph (DAG) from observational data. One of the major frameworks is the order-based approach that first estimates a topological order of the underlying DAG and then prunes spurious edges from the fully-connected DAG induced by the estimated topological order. Previous studies often focus on the former ordering step because it can dramatically reduce the search space of DAGs. In practice, the latter pruning step is equally crucial for ensuring both computational efficiency and estimation accuracy. Most existing methods employ a pruning technique based on generalized additive models and hypothesis testing, commonly known as CAM-pruning. However, this approach can be a computational bottleneck as it requires repeatedly fitting additive models for all variables. Furthermore, it may harm estimation quality due to multiple testing. To address these issues, we introduce a new pruning method based on sparse additive models, which enables direct pruning of redundant edges without relying on hypothesis testing. We propose an efficient algorithm for learning sparse additive models by combining the randomized tree embedding technique with group-wise sparse regression. Experimental results on both synthetic and real datasets demonstrated that our method is significantly faster than existing pruning methods while maintaining comparable or superior accuracy.
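The pruning idea can be sketched with off-the-shelf pieces. Below we approximate the paper's group-wise sparse regression with a plain Lasso over randomized tree embeddings plus group-level norms; this is simpler than the authors' solver and intended only to show the mechanism.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.linear_model import Lasso

# For a candidate child y with ordered parents X: embed each parent via
# randomized trees, fit a sparse linear model, and prune any parent
# whose whole coefficient block is (near) zero -- no hypothesis testing.

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # spurious parent
y = np.sin(x1) + 0.1 * rng.normal(size=n)  # depends on x1 only

blocks, cols = [], []
for j, xj in enumerate([x1, x2]):
    emb = RandomTreesEmbedding(n_estimators=20, max_depth=3, random_state=j)
    Z = emb.fit_transform(xj.reshape(-1, 1)).toarray()
    cols.append(Z)
    blocks.extend([j] * Z.shape[1])
Z = np.hstack(cols)
blocks = np.array(blocks)

coef = Lasso(alpha=0.01).fit(Z, y).coef_
for j in range(2):
    norm = np.linalg.norm(coef[blocks == j])
    print(f"parent x{j+1}: block norm {norm:.3f}",
          "-> keep edge" if norm > 1e-3 else "-> prune edge")
```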
[LG-56] Learning the S-matrix from data: Rediscovering gravity from gauge theory via symbolic regression
Link: https://arxiv.org/abs/2602.15169
Authors: Nathan Moynihan
Subjects: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*Comments:
Abstract:We demonstrate that modern machine-learning methods can autonomously reconstruct several flagship analytic structures in scattering amplitudes directly from numerical on-shell data. In particular, we show that the Kawai–Lewellen–Tye (KLT) relations can be rediscovered using symbolic regression applied to colour-ordered Yang–Mills amplitudes with Mandelstam invariants as input features. Using standard feature-selection techniques, specifically column-pivoted QR factorisation, we simultaneously recover the Kleiss–Kuijf and Bern–Carrasco–Johansson (BCJ) relations, identifying a minimal basis of partial amplitudes without any group-theoretic input. We obtain the tree-level KLT relations with high numerical accuracy up to five external legs, using only minimal theoretical priors, and we comment on the obstacles to generalising the method to higher multiplicity. Our results establish symbolic regression as a practical tool for exploring the analytic structure of the scattering-amplitude landscape, and suggest a general data-driven strategy for uncovering hidden relations in general theories. For comparison, we benchmark this general approach with a recently introduced neural-network-based method.
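The basis-finding step generalizes beyond amplitudes: given a matrix whose columns are candidate functions evaluated at many sample points, column-pivoted QR reveals the numerical rank and selects which columns form an independent basis. The toy below uses random data as a stand-in for amplitude evaluations; the sizes mimic five-point Yang–Mills, where the (n−2)! = 6 Kleiss–Kuijf orderings contain only (n−3)! = 2 BCJ-independent amplitudes.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(1)
n_points, n_amps, true_rank = 200, 6, 2      # toy stand-in for n = 5 gluons
basis = rng.normal(size=(n_points, true_rank))
mix = rng.normal(size=(true_rank, n_amps))   # linear relations among columns
A = basis @ mix                              # 6 "amplitudes", 2 independent

# column-pivoted QR: pivot order ranks columns by explanatory power
_, R, piv = qr(A, mode="economic", pivoting=True)
diag = np.abs(np.diag(R))
rank = int((diag > 1e-10 * diag[0]).sum())
print("numerical rank:", rank)               # 2
print("basis amplitudes (column indices):", piv[:rank])
```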
[LG-57] Beyond Reinforcement Learning: Fast and Scalable Quantum Circuit Synthesis
Link: https://arxiv.org/abs/2602.15146
Authors: Lukas Theissinger, Thore Gerlach, David Berghaus, Christian Bauckhage
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments:
Abstract:Quantum unitary synthesis addresses the problem of translating abstract quantum algorithms into sequences of hardware-executable quantum gates. Solving this task exactly is infeasible in general due to the exponential growth of the underlying combinatorial search space. Existing approaches suffer from misaligned optimization objectives, substantial training costs and limited generalization across different qubit counts. We mitigate these limitations by using supervised learning to approximate the minimum description length of residual unitaries and combining this estimate with stochastic beam search to identify near optimal gate sequences. Our method relies on a lightweight model with zero-shot generalization, substantially reducing training overhead compared to prior baselines. Across multiple benchmarks, we achieve faster wall-clock synthesis times while exceeding state-of-the-art methods in terms of success rate for complex circuits.
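The search loop itself is simple to sketch. Below, `learned_cost` is a stub for the trained description-length estimator (the paper trains a network to score residual unitaries); the gate set and all hyperparameters are illustrative.

```python
import numpy as np

# Stochastic beam search: keep a beam of gate sequences, sample
# expansions in proportion to the model's score, keep the best.
rng = np.random.default_rng(0)
GATES = ["H0", "T0", "CX01", "H1", "T1"]     # toy hardware gate set

def learned_cost(seq):                        # stub for the trained model:
    return len(seq) + rng.normal(scale=0.1)   # ~remaining description length

def stochastic_beam_search(width=4, expand=3, max_depth=8):
    beam = [[]]
    for _ in range(max_depth):
        candidates = [seq + [g] for seq in beam
                      for g in rng.choice(GATES, size=expand, replace=False)]
        scores = np.array([learned_cost(s) for s in candidates])
        probs = np.exp(-scores); probs /= probs.sum()
        k = min(width, len(candidates))       # sample k sequences to keep
        keep = rng.choice(len(candidates), size=k, replace=False, p=probs)
        beam = [candidates[i] for i in keep]
    return min(beam, key=learned_cost)

print(stochastic_beam_search())
```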
[LG-58] Universal priors: solving empirical Bayes via Bayesian inference and pretraining
Link: https://arxiv.org/abs/2602.15136
Authors: Nick Cannella, Anzo Teh, Yanjun Han, Yury Polyanskiy
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 40 pages, 5 figures
Abstract:We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify the existence of universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(1/n)$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This perspective also explains the phenomenon of length generalization, in which the test sequence length exceeds the training length, as the model performs Bayesian inference using a generalized posterior.
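For context, the classical Poisson EB oracle that such a pretrained model is implicitly approximating has a closed form (textbook background, not the paper's new result):

```latex
% Poisson empirical Bayes: $X_i \mid \theta_i \sim \mathrm{Poisson}(\theta_i)$,
% $\theta_i \sim G$ with $G$ unknown. Under squared loss, the Bayes
% (oracle) estimator is
\[
  \hat{\theta}_G(x) \;=\; \mathbb{E}[\theta \mid X = x]
  \;=\; (x+1)\,\frac{f_G(x+1)}{f_G(x)},
  \qquad
  f_G(x) \;=\; \int \frac{\theta^x e^{-\theta}}{x!}\,\mathrm{d}G(\theta),
\]
% so the oracle depends on the unknown prior $G$ only through the
% marginal $f_G$. A pretrained Bayes estimator whose posterior over $G$
% contracts around the truth therefore approaches this oracle, which is
% the mechanism the paper's regret analysis formalizes.
```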
[LG-59] Mixture-of-Experts under Finite-Rate Gating: Communication–Generalization Trade-offs
Link: https://arxiv.org/abs/2602.15091
Authors: Ali Khalesi, Mohammad Reza Deylam Salehi
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments:
Abstract:Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_g)$ of finite-rate gating, where $R_g := I(X;T)$, yielding (under a standard empirical rate-distortion optimality condition) $\mathbb{E}[R(W)] \le D(R_g) + \delta_m + \sqrt{(2/m)\,I(S;W)}$. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
[LG-60] IT-DPC-SRI: A Cloud-Optimized Archive of Italian Radar Precipitation (2010-2025)
Link: https://arxiv.org/abs/2602.15088
Authors: Gabriele Franch, Elena Tomasi, Uladzislau Azhel, Giacomo Tomezzoli, Alessandro Camilletti, Virginia Poli, Renata Pelosini, Gianfranco Vulpiani, Gabriella Scipione, Giuseppe Trotta, Matteo Angelinelli, Leif Denby, Irene Livia Kruse, Marco Cristoforetti
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*Comments: 15 pages, 7 figures
Abstract:We present IT-DPC-SRI, the first publicly available long-term archive of Italian weather radar precipitation estimates, spanning 16 years (2010–2025). The dataset contains Surface Rainfall Intensity (SRI) observations from the Italian Civil Protection Department’s national radar mosaic, harmonized into a coherent Analysis-Ready Cloud-Optimized (ARCO) Zarr datacube. The archive comprises over one million timesteps at temporal resolutions from 15 down to 5 minutes, covering a 1200 × 1400 kilometer domain at 1 kilometer spatial resolution, compressed from 7 TB to 51 GB on disk. We address the historical fragmentation of Italian radar data - previously scattered across heterogeneous formats (OPERA BUFR, HDF5, GeoTIFF) with varying spatial domains and projections - by reprocessing the entire record into a unified store. The dataset is accessible as a static versioned snapshot on Zenodo, via cloud-native access on the ECMWF European Weather Cloud, and as a continuously updated live version on the ArcoDataHub platform. This release fills a significant gap in European radar data availability, as Italy does not participate in the EUMETNET OPERA pan-European radar composite. The dataset is released under a CC BY-SA 4.0 license.
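The practical payoff of the ARCO Zarr layout is lazy access: opening the store reads only metadata, and slicing fetches just the chunks covering the requested window. The sketch below uses a placeholder path and guesses the variable name (`SRI`); consult the dataset documentation for the actual store location and schema.

```python
import xarray as xr

# Lazy, metadata-only open of the datacube (path is a placeholder for
# the Zenodo / European Weather Cloud / ArcoDataHub location).
ds = xr.open_zarr("path/to/it-dpc-sri.zarr")
print(ds)                                        # variables, dims, chunking

# e.g. one hour of surface rainfall intensity; variable name assumed
sri = ds["SRI"].sel(time=slice("2020-10-02T00:00", "2020-10-02T01:00"))
print(float(sri.max()))                          # triggers a small chunk read
```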
[LG-61] SOON: Symmetric Orthogonal Operator Network for Global Subseasonal-to-Seasonal Climate Forecasting
Link: https://arxiv.org/abs/2602.15040
Authors: Ziyu Zhou, Tian Zhou, Shiyu Wang, James Kwok, Yuxuan Liang
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*Comments:
Abstract:Accurate global Subseasonal-to-Seasonal (S2S) climate forecasting is critical for disaster preparedness and resource management, yet it remains challenging due to chaotic atmospheric dynamics. Existing models predominantly treat atmospheric fields as isotropic images, conflating the distinct physical processes of zonal wave propagation and meridional transport, and leading to suboptimal modeling of anisotropic dynamics. In this paper, we propose the Symmetric Orthogonal Operator Network (SOON) for global S2S climate forecasting. It couples: (1) an Anisotropic Embedding strategy that tokenizes the global grid into latitudinal rings, preserving the integrity of zonal periodic structures; and (2) a stack of SOON Blocks that models the alternating interaction of Zonal and Meridional Operators via a symmetric decomposition, structurally mitigating discretization errors inherent in long-term integration. Extensive experiments on the Earth Reanalysis 5 dataset demonstrate that SOON establishes a new state-of-the-art, significantly outperforming existing methods in both forecasting accuracy and computational efficiency.
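The anisotropic embedding can be pictured as one token per latitudinal ring, so the zonal (periodic) direction stays intact inside each token while the meridional direction becomes the sequence axis. The sketch below is our illustration with made-up shapes, not the authors' implementation.

```python
import numpy as np

n_lat, n_lon, d_model = 121, 240, 256
field = np.random.randn(n_lat, n_lon)            # e.g. one reanalysis channel

# stand-in linear embedding: each full latitude ring -> one token
proj = np.random.randn(n_lon, d_model) / np.sqrt(n_lon)
tokens = field @ proj                            # (n_lat, d_model) ring tokens
print(tokens.shape)                              # 121 tokens, one per ring

# a zonal operator would mix within rings (the lon axis, pre-projection);
# a meridional operator mixes across rings (the token axis) -- SOON
# blocks alternate the two.
```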
[LG-62] Accurate 2D Reconstruction for PET Scanners based on the Analytical White Image Model
Link: https://arxiv.org/abs/2306.17652
Authors: Tomislav Matulić, Damir Seršić
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: 37 pages, 16 figures
Abstract:In this paper, we provide a precise mathematical model of the crystal-to-crystal response, which is used to generate the white image - a compensation model needed to overcome the physical limitations of the PET scanner. We present a closed-form solution, as well as several accurate approximations, owing to the complexity of the exact mathematical expressions. We prove, experimentally and analytically, that the difference between the best approximations and the real crystal-to-crystal response is insignificant. The obtained responses are used to generate the white image compensation model. It can be written as a single closed-form expression, making it easy to incorporate into known reconstruction methods. The maximum likelihood expectation maximization (MLEM) algorithm is modified and our white image model is integrated into it. The modified MLEM algorithm is not based on the system matrix; rather, it is based on ray-driven projections and back-projections, with the compensation model providing all necessary information about the system. Finally, we validate our approach on synthetic and real data. For the real-world acquisition, we use the Raytest ClearPET camera for small animals and the NEMA NU 4-2008 phantom. The proposed approach outperforms competitive, non-compensated reconstruction methods.
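The MLEM iteration the paper modifies has a compact multiplicative form. The sketch below uses a dense random matrix as a stand-in for the ray-driven projector, and `white` here is simply the sensitivity image, a stand-in for the paper's analytic white-image model.

```python
import numpy as np

# MLEM update:  x_{k+1} = x_k / sens * A^T( y / (A x_k) ),
# where sens = A^T 1 plays the role of the compensation model.
rng = np.random.default_rng(0)
n_pix, n_lors = 64, 256
A = rng.random((n_lors, n_pix)) * (rng.random((n_lors, n_pix)) < 0.1)
x_true = rng.random(n_pix)
y = rng.poisson(A @ x_true * 50) / 50.0          # noisy toy sinogram

white = A.T @ np.ones(n_lors)                    # sensitivity / white image
x = np.ones(n_pix)
for _ in range(50):
    ratio = y / np.maximum(A @ x, 1e-12)         # compare data to projection
    x *= (A.T @ ratio) / np.maximum(white, 1e-12)

print(np.abs(x - x_true).mean())                 # reconstruction error
```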



